# 03 - Feature Engineering

**Customer Lifetime Value Prediction**

**Team:** The Starks
- Othmane Zizi (261255341)
- Fares Joni (261254593)
- Tanmay Giri (261272443)

This notebook creates RFM and behavioral features for CLV prediction.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from pathlib import Path
from datetime import datetime, timedelta

# Add src to path
sys.path.append(str(Path('../src').resolve()))
from data_loader import load_processed_data
from features import (
    calculate_rfm_features,
    calculate_behavioral_features,
    create_customer_features,
    split_data_temporal,
    calculate_clv_target
)

plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', None)

## 1. Load Cleaned Data

In [None]:
# Load cleaned transaction data
df = load_processed_data('cleaned_retail.csv')
print(f"Loaded {len(df):,} transactions")
print(f"Date range: {df['InvoiceDate'].min().date()} to {df['InvoiceDate'].max().date()}")
df.head()

## 2. Define Observation and Prediction Periods

For CLV prediction, we split the data temporally:
- **Observation Period**: First 12 months - used to calculate features
- **Prediction Period**: Next 6 months - used to calculate CLV target
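
The internals of `split_data_temporal` live in `src/features.py` and are not shown in this notebook. As a rough sketch of the idea only, a date-based split could look like the following. The helper name `split_by_date`, the half-open boundaries, and the toy data are assumptions for illustration, not the project's implementation.

```python
import pandas as pd

def split_by_date(df, date_col="InvoiceDate", observation_months=12, prediction_months=6):
    """Hypothetical stand-in for split_data_temporal (boundary handling is assumed)."""
    start = df[date_col].min()
    obs_end = start + pd.DateOffset(months=observation_months)
    pred_end = obs_end + pd.DateOffset(months=prediction_months)
    # Half-open intervals, so no transaction can land in both periods
    observation = df[df[date_col] < obs_end]
    prediction = df[(df[date_col] >= obs_end) & (df[date_col] < pred_end)]
    return observation, prediction, obs_end, pred_end

# Toy transactions, made up for this example
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-05", "2011-02-10", "2010-06-01", "2011-08-01", "2010-12-31"]
    ),
    "TotalAmount": [10.0, 20.0, 5.0, 7.5, 12.0],
})
obs, pred, obs_end, pred_end = split_by_date(toy)
print(len(obs), len(pred), obs_end.date(), pred_end.date())
```

Keeping the two windows disjoint is what prevents target leakage: features only see transactions before `obs_end`, and the target only sees transactions from `obs_end` onward.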

In [None]:
# Split data into observation and prediction periods
observation_df, prediction_df, obs_end, pred_end = split_data_temporal(
    df,
    date_col='InvoiceDate',
    observation_months=12,
    prediction_months=6
)

print(f"Observation Period: {observation_df['InvoiceDate'].min().date()} to {obs_end.date()}")
print(f"  Transactions: {len(observation_df):,}")
print(f"  Customers: {observation_df['Customer ID'].nunique():,}")

print(f"\nPrediction Period: {obs_end.date()} to {pred_end.date()}")
print(f"  Transactions: {len(prediction_df):,}")
print(f"  Customers: {prediction_df['Customer ID'].nunique():,}")

## 3. Calculate RFM Features

**RFM (Recency, Frequency, Monetary)** is a classic customer segmentation framework:
- **Recency**: Days since last purchase (lower = better)
- **Frequency**: Number of purchases (higher = better)
- **Monetary**: Total spend (higher = better)
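
To make the three definitions concrete, here is a minimal pandas sketch of an RFM calculation. It is not the code behind `calculate_rfm_features`; the helper name `rfm_sketch` and the toy data are hypothetical, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

def rfm_sketch(df, reference_date):
    """Illustrative RFM calculation; not the project's calculate_rfm_features."""
    grouped = df.groupby("Customer ID").agg(
        last_purchase=("InvoiceDate", "max"),
        Frequency=("Invoice", "nunique"),   # distinct invoices, not line items
        Monetary=("TotalAmount", "sum"),
    )
    # Recency: whole days between the last purchase and the reference date
    grouped["Recency"] = (reference_date - grouped["last_purchase"]).dt.days
    return grouped[["Recency", "Frequency", "Monetary"]].reset_index()

# Toy line items, made up for this example (invoice A1 has two lines)
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2010-03-01", "2010-03-01", "2010-11-20", "2010-06-15"]),
    "TotalAmount": [10.0, 5.0, 20.0, 8.0],
})
rfm_toy = rfm_sketch(toy, reference_date=pd.Timestamp("2010-12-01"))
print(rfm_toy)
```

Counting distinct invoices rather than rows matters here because each invoice spans several line items, and counting rows would inflate Frequency.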

In [None]:
# Calculate RFM features based on observation period
rfm = calculate_rfm_features(
    observation_df,
    customer_id_col='Customer ID',
    date_col='InvoiceDate',
    amount_col='TotalAmount',
    invoice_col='Invoice',
    reference_date=obs_end
)

print(f"RFM features calculated for {len(rfm):,} customers")
rfm.head(10)

In [None]:
# RFM statistics
print("RFM Feature Statistics:")
rfm[['Recency', 'Frequency', 'Monetary']].describe()

In [None]:
# Visualize RFM distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Recency
axes[0].hist(rfm['Recency'], bins=50, edgecolor='black', alpha=0.7, color='coral')
axes[0].set_title('Recency Distribution')
axes[0].set_xlabel('Days Since Last Purchase')
axes[0].set_ylabel('Number of Customers')

# Frequency
freq_clipped = rfm['Frequency'].clip(upper=rfm['Frequency'].quantile(0.95))
axes[1].hist(freq_clipped, bins=50, edgecolor='black', alpha=0.7, color='steelblue')
axes[1].set_title('Frequency Distribution (clipped at 95th percentile)')
axes[1].set_xlabel('Number of Purchases')
axes[1].set_ylabel('Number of Customers')

# Monetary
monetary_clipped = rfm['Monetary'].clip(upper=rfm['Monetary'].quantile(0.95))
axes[2].hist(monetary_clipped, bins=50, edgecolor='black', alpha=0.7, color='green')
axes[2].set_title('Monetary Distribution (clipped at 95th percentile)')
axes[2].set_xlabel('Total Spend (GBP)')
axes[2].set_ylabel('Number of Customers')

plt.tight_layout()
Path('../reports/figures').mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
plt.savefig('../reports/figures/rfm_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

## 4. Calculate Behavioral Features

Additional features beyond RFM:
- **Tenure**: Days since first purchase
- **AvgTimeBetweenPurchases**: Purchase cadence
- **NumUniqueProducts**: Product diversity
- **AvgBasketSize**: Items per transaction
- **AvgOrderValue**: Spend per transaction
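
One way these behavioral features could be derived is sketched below. This is an assumption-laden illustration, not the body of `calculate_behavioral_features`: the helper name `behavioral_sketch`, the choice to aggregate line items to invoice level first, and the toy data are all hypothetical.

```python
import pandas as pd

def behavioral_sketch(df, reference_date):
    """Illustrative behavioral features; not the project's implementation."""
    # Collapse line items to one row per invoice
    invoices = (
        df.groupby(["Customer ID", "Invoice"])
        .agg(date=("InvoiceDate", "min"),
             items=("Quantity", "sum"),
             value=("TotalAmount", "sum"))
        .reset_index()
        .sort_values(["Customer ID", "date"])
    )
    # Days between consecutive invoices of the same customer (NaN for the first one)
    invoices["gap_days"] = invoices.groupby("Customer ID")["date"].diff().dt.days
    out = invoices.groupby("Customer ID").agg(
        first_purchase=("date", "min"),
        AvgTimeBetweenPurchases=("gap_days", "mean"),
        AvgBasketSize=("items", "mean"),
        AvgOrderValue=("value", "mean"),
    )
    out["Tenure"] = (reference_date - out["first_purchase"]).dt.days
    out["NumUniqueProducts"] = df.groupby("Customer ID")["StockCode"].nunique()
    return out.drop(columns="first_purchase").reset_index()

# Toy line items, made up for this example
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2010-03-01", "2010-03-01", "2010-03-11", "2010-06-15"]),
    "StockCode": ["X", "Y", "X", "Z"],
    "Quantity": [2, 4, 6, 1],
    "TotalAmount": [10.0, 5.0, 20.0, 8.0],
})
beh_toy = behavioral_sketch(toy, reference_date=pd.Timestamp("2010-12-01"))
print(beh_toy)
```

In this sketch a customer with a single invoice has no gap to average, so `AvgTimeBetweenPurchases` comes out as NaN and needs an explicit decision later on.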

In [None]:
# Calculate behavioral features
behavioral = calculate_behavioral_features(
    observation_df,
    customer_id_col='Customer ID',
    date_col='InvoiceDate',
    amount_col='TotalAmount',
    invoice_col='Invoice',
    quantity_col='Quantity',
    product_col='StockCode',
    reference_date=obs_end
)

print(f"Behavioral features calculated for {len(behavioral):,} customers")
behavioral.head(10)

In [None]:
# Behavioral feature statistics
print("Behavioral Feature Statistics:")
behavioral.describe()

## 5. Combine All Features

In [None]:
# Merge RFM and behavioral features
customer_features = rfm.merge(behavioral, on='Customer ID', how='left')

print(f"Total customer features: {len(customer_features):,}")
print(f"\nFeature columns: {customer_features.columns.tolist()}")
customer_features.head(10)

## 6. Calculate CLV Target Variable
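
The target is the total spend of each observation-period customer during the prediction period. A minimal sketch of that idea follows; it is a guess at the behavior of `calculate_clv_target`, not its actual code. The helper name `clv_target_sketch`, the choice to assign 0 to customers who do not return, and the toy data are assumptions.

```python
import pandas as pd

def clv_target_sketch(observation_df, prediction_df):
    """Illustrative CLV target; not the project's calculate_clv_target."""
    # Only customers seen in the observation period get a target
    customers = pd.Index(observation_df["Customer ID"].unique(), name="Customer ID")
    spend = prediction_df.groupby("Customer ID")["TotalAmount"].sum()
    # Customers with no prediction-period purchases get CLV = 0 (assumed convention)
    clv = spend.reindex(customers, fill_value=0.0).rename("CLV")
    return clv.rename_axis("Customer ID").reset_index()

# Toy data, made up for this example; customer 4 first appears in the prediction period
obs_toy = pd.DataFrame({"Customer ID": [1, 2, 3], "TotalAmount": [10.0, 5.0, 7.0]})
pred_toy = pd.DataFrame({"Customer ID": [1, 1, 4], "TotalAmount": [3.0, 4.0, 99.0]})
clv_toy = clv_target_sketch(obs_toy, pred_toy)
print(clv_toy)
```

Customers who first appear in the prediction period are dropped in this sketch because they have no observation-period features to predict from.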

In [None]:
# Calculate CLV (total spend in prediction period)
clv_target = calculate_clv_target(
    observation_df,
    prediction_df,
    customer_id_col='Customer ID',
    amount_col='TotalAmount'
)

print(f"CLV calculated for {len(clv_target):,} customers")
print(f"\nCLV Statistics:")
print(clv_target['CLV'].describe())

In [None]:
# Visualize CLV distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Full distribution (log-scaled y-axis)
axes[0].hist(clv_target['CLV'], bins=100, edgecolor='black', alpha=0.7)
axes[0].set_title('CLV Distribution')
axes[0].set_xlabel('CLV (GBP)')
axes[0].set_ylabel('Number of Customers')
axes[0].set_yscale('log')

# Zoomed in (95th percentile)
clv_clipped = clv_target['CLV'].clip(upper=clv_target['CLV'].quantile(0.95))
axes[1].hist(clv_clipped, bins=50, edgecolor='black', alpha=0.7, color='green')
axes[1].set_title('CLV Distribution (clipped at 95th percentile)')
axes[1].set_xlabel('CLV (GBP)')
axes[1].set_ylabel('Number of Customers')

plt.tight_layout()
plt.savefig('../reports/figures/clv_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Customers with no spend in the prediction period (used here as a proxy for churn)
churned = (clv_target['CLV'] == 0).sum()
print(f"\nCustomers with CLV = 0 (churned): {churned:,} ({churned/len(clv_target)*100:.1f}%)")

## 7. Create Final Dataset
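
When joining features to the target, it can help to check that the join is one row per customer and to decide on a fill value per column rather than globally. The sketch below shows one possible approach on made-up data. The `SinglePurchase` flag and the toy frames are hypothetical and not part of this project's feature set.

```python
import numpy as np
import pandas as pd

# Toy frames, made up for this example
features_toy = pd.DataFrame({
    "Customer ID": [1, 2, 3],
    "Frequency": [3, 1, 2],
    "AvgTimeBetweenPurchases": [12.5, np.nan, 40.0],
})
target_toy = pd.DataFrame({"Customer ID": [1, 3], "CLV": [50.0, 0.0]})

# validate raises a MergeError if either side has duplicate customer IDs
merged = features_toy.merge(target_toy, on="Customer ID", how="left", validate="one_to_one")

# Count unmatched customers before filling, so the gap is visible
n_missing_target = int(merged["CLV"].isna().sum())
merged["CLV"] = merged["CLV"].fillna(0.0)

# Keep the "no gap available" information as a flag before imputing the gap itself
merged["SinglePurchase"] = merged["AvgTimeBetweenPurchases"].isna().astype(int)
merged["AvgTimeBetweenPurchases"] = merged["AvgTimeBetweenPurchases"].fillna(0.0)
print(n_missing_target)
print(merged)
```

The point of the flag is that a gap of 0 days otherwise reads as "buys constantly", which is the opposite of a one-time buyer.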

In [None]:
# Merge features with CLV target
final_dataset = customer_features.merge(clv_target, on='Customer ID', how='left')

print(f"Final dataset shape: {final_dataset.shape}")
print(f"\nColumns: {final_dataset.columns.tolist()}")
final_dataset.head(10)

In [None]:
# Check for missing values
print("Missing Values:")
missing = final_dataset.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values!")

In [None]:
# Fill all remaining missing values with 0. This applies to every column,
# so confirm from the check above that 0 is a sensible default for each affected feature.
final_dataset = final_dataset.fillna(0)

# Final summary statistics
print("\nFinal Dataset Summary:")
final_dataset.describe()

## 8. Feature Correlation Analysis

In [None]:
# Correlation matrix
feature_cols = ['Recency', 'Frequency', 'Monetary', 'Tenure', 
                'AvgTimeBetweenPurchases', 'NumUniqueProducts', 
                'AvgBasketSize', 'AvgOrderValue', 'CLV']

corr_matrix = final_dataset[feature_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('../reports/figures/correlation_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Correlation with CLV
print("\nCorrelation with CLV (Target):")
clv_corr = final_dataset[feature_cols].corr()['CLV'].drop('CLV').sort_values(ascending=False)
print(clv_corr)

## 9. Save Feature Dataset

In [None]:
# Save the final feature dataset
output_path = Path('../data/processed/customer_features.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist
final_dataset.to_csv(output_path, index=False)

print(f"Feature dataset saved to: {output_path}")
print(f"File size: {output_path.stat().st_size / 1024:.2f} KB")

In [None]:
# Verify saved data
df_verify = pd.read_csv(output_path)
print(f"\nVerification - Loaded shape: {df_verify.shape}")
df_verify.head()

## 10. Feature Engineering Summary

### Features Created:

**RFM Features:**
- `Recency`: Days between the last purchase and the end of the observation period
- `Frequency`: Number of unique invoices/transactions
- `Monetary`: Total spend in observation period

**Behavioral Features:**
- `Tenure`: Days since first purchase
- `AvgTimeBetweenPurchases`: Average days between purchases
- `NumUniqueProducts`: Number of different products purchased
- `AvgBasketSize`: Average items per transaction
- `AvgOrderValue`: Average spend per transaction

**Target Variable:**
- `CLV`: Total spend in prediction period (6 months)

### Key Observations:
- Monetary value has the strongest correlation with CLV
- Frequency and NumUniqueProducts are also strongly correlated with CLV
- Recency is negatively correlated with CLV: customers who purchased more recently tend to have higher CLV

### Next Steps:
- Train ML models to predict CLV
- Perform customer segmentation