Data Preprocessing
This section handles data cleaning, transformation, and feature engineering to prepare the dataset for analysis.

Objectives
Remove invalid transactions (non-positive quantities, non-positive prices, canceled orders)
Drop rows with a missing CustomerID
Create temporal features (year, month, day of week, hour) from the InvoiceDate
Calculate derived metrics like basket size and transaction value

In [None]:
# Remove rows with missing CustomerID
df.dropna(subset=['CustomerID'], inplace=True)

# Convert CustomerID to integer type
df['CustomerID'] = df['CustomerID'].astype(int)

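# Keep only rows with a positive Quantity (drops negative and zero quantities)
df = df[df['Quantity'] > 0]

# Keep only rows with a positive UnitPrice (drops zero and negative prices)
df = df[df['UnitPrice'] > 0]

# Remove canceled transactions (InvoiceNo starting with 'C');
# .copy() avoids SettingWithCopyWarning when new columns are added later
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')].copy()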

# Display shape after cleaning
print("Shape after cleaning:", df.shape)
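
To check the cleaning rules in isolation, they can be applied to a small hand-made table. The sketch below is illustrative only: the toy rows and the `clean_transactions` helper are hypothetical and not part of the dataset or the notebook. They assume the same column names as above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataset.
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536366', 'C536367', '536368', '536369'],
    'StockCode': ['A', 'B', 'C', 'D', 'E'],
    'Quantity': [6, -1, -2, 3, 4],
    'UnitPrice': [2.55, 1.25, 4.95, 0.0, 3.39],
    'CustomerID': [17850.0, 17850.0, 13047.0, 13047.0, None],
})

def clean_transactions(frame):
    """Apply the same filters as the cell above and return a new DataFrame."""
    mask = (
        frame['CustomerID'].notna()                              # CustomerID present
        & (frame['Quantity'] > 0)                                # positive quantity
        & (frame['UnitPrice'] > 0)                               # positive price
        & ~frame['InvoiceNo'].astype(str).str.startswith('C')    # not canceled
    )
    cleaned = frame.loc[mask].copy()
    cleaned['CustomerID'] = cleaned['CustomerID'].astype(int)
    return cleaned

toy_clean = clean_transactions(toy)
print(toy_clean)
```

Only the first toy row passes every filter. Each of the other four rows violates at least one rule: negative quantity, canceled invoice, zero price, or missing CustomerID.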

In [None]:
# Extract year, month, day of week, and hour from InvoiceDate
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceYear'] = df['InvoiceDate'].dt.year
df['InvoiceMonth'] = df['InvoiceDate'].dt.month
df['InvoiceDay'] = df['InvoiceDate'].dt.dayofweek  # Monday=0, Sunday=6
df['InvoiceHour'] = df['InvoiceDate'].dt.hour
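
As a quick sanity check of the `.dt` accessors, the same extraction can be run on two hand-picked timestamps. This is a minimal sketch: the `stamps` frame and the extra `InvoiceDayName` column are hypothetical, and the explicit `format` string assumes ISO-style timestamps, which may not match the raw file.

```python
import pandas as pd

# Hypothetical timestamps used only to illustrate the accessors.
stamps = pd.DataFrame({
    'InvoiceDate': ['2010-12-01 08:26:00', '2011-12-09 12:50:00'],
})

# An explicit format avoids ambiguous day/month parsing (assumed format).
stamps['InvoiceDate'] = pd.to_datetime(
    stamps['InvoiceDate'], format='%Y-%m-%d %H:%M:%S'
)

stamps['InvoiceYear'] = stamps['InvoiceDate'].dt.year
stamps['InvoiceMonth'] = stamps['InvoiceDate'].dt.month
stamps['InvoiceDay'] = stamps['InvoiceDate'].dt.dayofweek    # Monday=0, Sunday=6
stamps['InvoiceDayName'] = stamps['InvoiceDate'].dt.day_name()
stamps['InvoiceHour'] = stamps['InvoiceDate'].dt.hour

print(stamps)
```

A readable day name next to the numeric `InvoiceDay` makes it easier to confirm that the column holds the day of the week and not the day of the month.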


In [None]:
# Calculate TotalPrice for each line item (Quantity * UnitPrice)
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate 'Basket Size' (number of unique products in each invoice)
basket_size = df.groupby('InvoiceNo')['StockCode'].nunique().reset_index()
basket_size.columns = ['InvoiceNo', 'BasketSize']
df = pd.merge(df, basket_size, on='InvoiceNo', how='left')

# Calculate 'Transaction Value' (sum of TotalPrice for each invoice)
transaction_value = df.groupby('InvoiceNo')['TotalPrice'].sum().reset_index()
transaction_value.columns = ['InvoiceNo', 'TransactionValue']
df = pd.merge(df, transaction_value, on='InvoiceNo', how='left')

# Display the updated DataFrame with new metrics and temporal features
display(df.head())
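
The per-invoice metrics can also be computed without the intermediate tables and merges by using `groupby(...).transform`, which broadcasts each group's result back onto its rows. The sketch below is one possible alternative, not the notebook's method. The `lines` frame is hypothetical toy data with the same column names.

```python
import pandas as pd

# Hypothetical line items: invoice 536365 has three lines, two distinct products.
lines = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536365', '536366'],
    'StockCode': ['85123A', '71053', '85123A', '22633'],
    'Quantity': [6, 6, 2, 1],
    'UnitPrice': [2.5, 3.0, 2.5, 1.5],
})

# Value of each line item.
lines['TotalPrice'] = lines['Quantity'] * lines['UnitPrice']

# Number of unique products per invoice, repeated on every row of that invoice.
lines['BasketSize'] = lines.groupby('InvoiceNo')['StockCode'].transform('nunique')

# Sum of line-item values per invoice, repeated on every row of that invoice.
lines['TransactionValue'] = lines.groupby('InvoiceNo')['TotalPrice'].transform('sum')

print(lines)
```

Unlike the merge-based version, `transform` keeps the original row index, which matters if later steps rely on index alignment after the filtering above.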