# 02 - Data Cleaning

**Customer Lifetime Value Prediction**

**Team:** The Starks
- Othmane Zizi (261255341)
- Fares Joni (261254593)
- Tanmay Giri (261272443)

This notebook cleans and preprocesses the Online Retail II dataset.

In [None]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path('../src').resolve()))
from data_loader import load_raw_data, clean_data

pd.set_option('display.max_columns', None)

## 1. Load Raw Data

In [None]:
# Load the raw data
data_path = Path('../data/raw/online_retail_II.xlsx')
df = load_raw_data(data_path)

print(f"Raw data shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

## 2. Data Quality Check

In [None]:
print("Data Quality Summary Before Cleaning:")
print(f"\nTotal rows: {len(df):,}")

# Missing values
print("\nMissing Values:")
for col in df.columns:
    missing = df[col].isnull().sum()
    if missing > 0:
        print(f"  {col}: {missing:,} ({missing/len(df)*100:.2f}%)")

# Cancelled transactions
cancelled = df['Invoice'].astype(str).str.startswith('C').sum()
print(f"\nCancelled transactions (Invoice starts with 'C'): {cancelled:,} ({cancelled/len(df)*100:.2f}%)")

# Negative/zero values
neg_qty = (df['Quantity'] <= 0).sum()
neg_price = (df['Price'] <= 0).sum()
print(f"\nNon-positive Quantity: {neg_qty:,} ({neg_qty/len(df)*100:.2f}%)")
print(f"Non-positive Price: {neg_price:,} ({neg_price/len(df)*100:.2f}%)")

## 3. Clean the Data

In [None]:
# Apply cleaning function
df_clean = clean_data(df)

print(f"Original rows: {len(df):,}")
print(f"Cleaned rows: {len(df_clean):,}")
print(f"Rows removed: {len(df) - len(df_clean):,} ({(len(df) - len(df_clean))/len(df)*100:.1f}%)")

In [None]:
# Verify cleaning
print("\nPost-Cleaning Verification:")
print(f"  Missing Customer ID: {df_clean['Customer ID'].isnull().sum()}")
print(f"  Cancelled transactions: {df_clean['Invoice'].astype(str).str.startswith('C').sum()}")
print(f"  Non-positive Quantity: {(df_clean['Quantity'] <= 0).sum()}")
print(f"  Non-positive Price: {(df_clean['Price'] <= 0).sum()}")

In [None]:
# Data info after cleaning
print("\nData Types:")
print(df_clean.dtypes)
print(f"\nMemory usage: {df_clean.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

## 4. Handle Outliers

In [None]:
# Check for extreme outliers
print("Outlier Analysis:")
print("\nQuantity Statistics:")
print(df_clean['Quantity'].describe())
print(f"\n99th percentile: {df_clean['Quantity'].quantile(0.99)}")
print(f"Max: {df_clean['Quantity'].max()}")

print("\nPrice Statistics:")
print(df_clean['Price'].describe())
print(f"\n99th percentile: {df_clean['Price'].quantile(0.99)}")
print(f"Max: {df_clean['Price'].max()}")

print("\nTotalAmount Statistics:")
print(df_clean['TotalAmount'].describe())
print(f"\n99th percentile: {df_clean['TotalAmount'].quantile(0.99)}")
print(f"Max: {df_clean['TotalAmount'].max()}")

In [None]:
# Extreme outliers are flagged here, not removed.
# Very large quantities or prices may be legitimate bulk/wholesale orders,
# so we keep the rows and stay aware of the outliers during modeling.

# Flag potential outliers for reference
qty_threshold = df_clean['Quantity'].quantile(0.999)
price_threshold = df_clean['Price'].quantile(0.999)

extreme_qty = (df_clean['Quantity'] > qty_threshold).sum()
extreme_price = (df_clean['Price'] > price_threshold).sum()

print(f"Extreme quantity outliers (>99.9th percentile): {extreme_qty}")
print(f"Extreme price outliers (>99.9th percentile): {extreme_price}")
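
The cell above only counts rows beyond the 99.9th percentile. If the flags are needed during modeling, one option is to store them as boolean columns. The sketch below is an illustration, not part of `data_loader`: the helper `flag_extremes` and the column names `IsExtremeQuantity` and `IsExtremePrice` are assumptions, and it runs on a small synthetic frame.

```python
import pandas as pd

def flag_extremes(df, cols=('Quantity', 'Price'), q=0.999):
    """Hypothetical helper: add one boolean flag column per input column."""
    out = df.copy()
    for col in cols:
        threshold = out[col].quantile(q)
        # True where the value lies strictly above the chosen quantile
        out[f'IsExtreme{col}'] = out[col] > threshold
    return out

# Synthetic data: one obvious extreme value in each column
toy = pd.DataFrame({
    'Quantity': list(range(1, 1001)),
    'Price': [1.0] * 999 + [500.0],
})
flagged = flag_extremes(toy)
print(flagged[['IsExtremeQuantity', 'IsExtremePrice']].sum())
```

Keeping the flags as columns leaves every row in place, which matches the decision above to retain possible wholesale orders.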

## 5. Summary Statistics of Cleaned Data

In [None]:
print("Cleaned Dataset Summary:")
print(f"\nRows: {len(df_clean):,}")
print(f"Unique Customers: {df_clean['Customer ID'].nunique():,}")
print(f"Unique Invoices: {df_clean['Invoice'].nunique():,}")
print(f"Unique Products: {df_clean['StockCode'].nunique():,}")
print(f"Countries: {df_clean['Country'].nunique()}")
print(f"\nDate Range: {df_clean['InvoiceDate'].min().date()} to {df_clean['InvoiceDate'].max().date()}")
print(f"Duration: {(df_clean['InvoiceDate'].max() - df_clean['InvoiceDate'].min()).days} days")
print(f"\nTotal Revenue: £{df_clean['TotalAmount'].sum():,.2f}")

In [None]:
# Customer-level summary
# Named aggregation yields flat, descriptive column names directly
customer_summary = df_clean.groupby('Customer ID').agg(
    NumOrders=('Invoice', 'nunique'),
    TotalSpend=('TotalAmount', 'sum'),
    FirstPurchase=('InvoiceDate', 'min'),
    LastPurchase=('InvoiceDate', 'max'),
).reset_index()

print("\nCustomer-Level Statistics:")
print(customer_summary[['NumOrders', 'TotalSpend']].describe())

## 6. Save Cleaned Data

In [None]:
# Save cleaned data to processed folder (create the folder if it does not exist)
output_path = Path('../data/processed/cleaned_retail.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)
df_clean.to_csv(output_path, index=False)

print(f"Cleaned data saved to: {output_path}")
print(f"File size: {output_path.stat().st_size / 1024**2:.2f} MB")

In [None]:
# Verify saved data
df_verify = pd.read_csv(output_path, parse_dates=['InvoiceDate'])
print(f"\nVerification - Loaded shape: {df_verify.shape}")
df_verify.head()

## 7. Cleaning Summary

### Steps Performed:
1. Removed rows with missing Customer ID (~20% of data)
2. Removed cancelled transactions (Invoice starting with 'C')
3. Removed rows with non-positive Quantity
4. Removed rows with non-positive Price
5. Created TotalAmount = Quantity * Price
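
The steps above can be sketched as a single function. This is a minimal illustration and not the actual `clean_data` in `src/data_loader.py`, whose implementation is not shown in this notebook. The function name `clean_retail_sketch` and the toy rows are assumptions made for the example.

```python
import pandas as pd

def clean_retail_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the five listed cleaning steps."""
    # 1. Drop rows without a customer ID
    out = df.dropna(subset=['Customer ID']).copy()
    # 2. Drop cancellations; Invoice may mix ints and strings, so cast first
    is_cancelled = out['Invoice'].astype(str).str.startswith('C')
    out = out[~is_cancelled]
    # 3 and 4. Keep only strictly positive Quantity and Price
    out = out[(out['Quantity'] > 0) & (out['Price'] > 0)].copy()
    # 5. Line-level revenue
    out['TotalAmount'] = out['Quantity'] * out['Price']
    return out.reset_index(drop=True)

# Hand-made rows, one per removal rule plus one valid row
toy = pd.DataFrame({
    'Invoice': [489434, 'C489449', 489435, 489436, 489437],
    'Quantity': [12, -1, 0, 5, 3],
    'Price': [6.95, 2.10, 1.25, 0.0, 2.00],
    'Customer ID': [13085.0, 13085.0, 13078.0, 13078.0, None],
})
toy_clean = clean_retail_sketch(toy)
print(toy_clean)
```

Only the first toy row survives: the others are a cancellation, a zero quantity, a zero price, and a missing customer ID.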

### Data Retained:
- ~75% of original data retained after cleaning
- All retained rows have valid customer IDs for CLV calculation

### Next Steps:
- Feature engineering (RFM + behavioral features)
- Create train/test temporal split
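
As a preview of the temporal split, here is a minimal sketch that assumes the split is made on `InvoiceDate` at a single cutoff. The helper `temporal_split`, the cutoff date, and the toy rows are hypothetical; the actual split is left to the next notebook.

```python
import pandas as pd

def temporal_split(df, cutoff, date_col='InvoiceDate'):
    """Hypothetical helper: rows before the cutoff form the observation
    period, rows on or after it form the holdout period."""
    cutoff = pd.Timestamp(cutoff)
    observation = df[df[date_col] < cutoff]
    holdout = df[df[date_col] >= cutoff]
    return observation, holdout

# Toy transactions for two customers
toy = pd.DataFrame({
    'Customer ID': [1, 1, 2, 2],
    'InvoiceDate': pd.to_datetime(
        ['2010-01-05', '2011-03-01', '2010-06-10', '2011-09-20']
    ),
    'TotalAmount': [10.0, 20.0, 5.0, 7.5],
})
obs, hold = temporal_split(toy, '2011-01-01')
print(len(obs), len(hold))
```

Splitting by date rather than at random keeps future purchases out of the features, which matters when the target is spend in a later period.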