# Demo 5: Multi-Strategy Customer Segmentation

This notebook compares three customer segmentation approaches:
1. **Cohort Analysis**
2. **Churn Prediction Model**
3. **Latent Trait Personas via PCA**

Each method is applied to the same e-commerce dataset to illustrate different segmentation strategies and their business applications.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta


## Load and Clean Dataset

In [None]:
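# Load the dataset (use pd.read_csv instead if the file is a CSV)
df_raw = pd.read_excel("data/online_retail.xlsx")

# Clean: drop rows without a customer ID, rows with negative quantities, and exact duplicates
df = df_raw.dropna(subset=['CustomerID'])
df = df[df['Quantity'] >= 0]
df = df.drop_duplicates().copy()

# Report how many rows the cleaning steps removed
excluded_rows = df_raw.shape[0] - df.shape[0]
excluded_pct = round((excluded_rows / df_raw.shape[0]) * 100, 2)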
print(f"{excluded_rows} rows removed ({excluded_pct}% of original data).")

df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')


## Part 1: Cohort Analysis

In [None]:
# Assign cohort month based on user's first purchase
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')

# Create cohort index (whole calendar months since first purchase)
df['CohortIndex'] = (
    (df['InvoiceMonth'].dt.year - df['CohortMonth'].dt.year) * 12
    + (df['InvoiceMonth'].dt.month - df['CohortMonth'].dt.month)
)

# Count unique customers per cohort
cohort_data = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()
cohort_pivot = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')

# Retention matrix
cohort_size = cohort_pivot.iloc[:,0]
retention = cohort_pivot.divide(cohort_size, axis=0).round(3)

# Plot heatmap
plt.figure(figsize=(10,6))
sns.heatmap(retention, annot=True, fmt='.0%', cmap='Blues')
plt.title("Customer Retention by Cohort")
plt.ylabel("Cohort Month")
plt.xlabel("Months Since First Purchase")
plt.show()
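
Month offsets are easiest to verify on a tiny example. The sketch below uses hypothetical toy data (not the retail dataset) and computes the offset from calendar year and month, which avoids the drift a day-count approximation can introduce: 1 January to 1 March is 59 days, and `59 // 30` gives 1 rather than 2. The names `toy`, `counts`, and `toy_retention` are illustrative only.

```python
import pandas as pd

# Toy transactions (hypothetical data, not from the retail dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-03-01",  # customer 1
        "2011-01-20", "2011-03-15",                # customer 2
        "2011-02-02",                              # customer 3
    ]),
})

toy["InvoiceMonth"] = toy["InvoiceDate"].dt.to_period("M")
toy["CohortMonth"] = (
    toy.groupby("CustomerID")["InvoiceDate"].transform("min").dt.to_period("M")
)

# Whole calendar months between the purchase month and the cohort month
toy["CohortIndex"] = (
    (toy["InvoiceMonth"].dt.year - toy["CohortMonth"].dt.year) * 12
    + (toy["InvoiceMonth"].dt.month - toy["CohortMonth"].dt.month)
)

# Unique customers per cohort and month offset, then divide by cohort size
counts = (
    toy.groupby(["CohortMonth", "CohortIndex"])["CustomerID"]
    .nunique()
    .unstack("CohortIndex")
)
toy_retention = counts.divide(counts[0], axis=0)
print(toy_retention)
```

In this toy set, the January cohort has two customers; one returns in month 1 and both return in month 2, so the January row should show retention of 50% and 100% for those offsets.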


## Part 2: Churn Prediction

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report

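# Build per-customer features relative to a snapshot date (one day after the last invoice)
snapshot_date = df['InvoiceDate'].max() + timedelta(days=1)
rfm = df.groupby('CustomerID').agg({
    # Recency: days since last purchase; Tenure: days between first and last purchase
    'InvoiceDate': [lambda x: (snapshot_date - x.max()).days,
                    lambda x: (x.max() - x.min()).days],
    # Frequency: distinct invoices ('count' would count line items instead)
    'InvoiceNo': 'nunique',
    # Monetary: total spend
    'TotalPrice': 'sum'
})
rfm.columns = ['Recency', 'Tenure', 'Frequency', 'Monetary']
rfm = rfm[rfm['Monetary'] > 0]

# Churn label: no purchase in the last 90 days
rfm['Churn'] = (rfm['Recency'] > 90).astype(int)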

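# Full feature matrix (reused in Part 3)
X = rfm[['Recency', 'Tenure', 'Frequency', 'Monetary']]
y = rfm['Churn']

# Recency defines the churn label, so exclude it from the model inputs to avoid label leakage
X_model = X.drop(columns='Recency')

X_train, X_test, y_train, y_test = train_test_split(X_model, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)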

print(classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, model.predict_proba(X_test)[:,1]))
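
A common way to set up churn prediction so that the features cannot encode the label is to split time at a cutoff: features come only from purchases before the cutoff, and the label records whether the customer buys again afterwards. The sketch below illustrates this on hypothetical toy transactions. The 90-day horizon mirrors the threshold used above; the names `tx`, `cutoff`, `history`, `future`, and `features` are assumptions made for the example, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the notebook's dataset
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3, 4],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "C1", "C2", "D1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-10", "2011-11-20",  # customer 1 buys again after the cutoff
        "2011-02-01", "2011-05-15",  # customer 2 does not
        "2011-06-01", "2011-10-01",  # customer 3 buys again after the cutoff
        "2011-08-20",                # customer 4 does not
    ]),
    "TotalPrice": [20.0, 35.0, 15.0, 10.0, 50.0, 5.0, 80.0],
})

horizon = pd.Timedelta(days=90)
cutoff = tx["InvoiceDate"].max() - horizon  # features before, label after

history = tx[tx["InvoiceDate"] < cutoff]
future = tx[tx["InvoiceDate"] >= cutoff]

# Features use only pre-cutoff purchases, measured relative to the cutoff
features = history.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (cutoff - d.max()).days),
    Tenure=("InvoiceDate", lambda d: (d.max() - d.min()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)

# Churned = active before the cutoff but no purchase in the horizon after it
features["Churn"] = (~features.index.isin(future["CustomerID"])).astype(int)
print(features)
```

With this layout, Recency is measured at the cutoff rather than at the end of the data, so it can be used as a model input without determining the label.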


## Part 3: Latent Trait Personas with PCA

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot PCA components
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=rfm['Churn'], cmap='coolwarm', alpha=0.5)
plt.title("Latent Customer Traits via PCA")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()
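
The scatter plot alone does not say what each component means. One way to interpret the components is to inspect how much variance each captures and how strongly each original feature loads on it. The sketch below does this on a hypothetical, randomly generated RFM-style table (`toy_X`); in the notebook the same calls would apply to the fitted `pca` object and the columns of `X`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM-style table; values are random and for illustration only
rng = np.random.default_rng(0)
n = 200
frequency = rng.poisson(5, n) + 1
toy_X = pd.DataFrame({
    "Recency": rng.integers(1, 365, n),
    "Tenure": rng.integers(0, 365, n),
    "Frequency": frequency,
    "Monetary": frequency * rng.uniform(10, 50, n),
})

# Standardize first so no feature dominates because of its units
toy_pca = PCA(n_components=2)
toy_pca.fit(StandardScaler().fit_transform(toy_X))

# Share of total variance captured by each component
print(toy_pca.explained_variance_ratio_)

# Loadings: rows are components, columns are the original features
loadings = pd.DataFrame(
    toy_pca.components_, columns=toy_X.columns, index=["PC1", "PC2"]
)
print(loadings.round(2))
```

A component with large loadings of the same sign on Frequency and Monetary, for example, could be read as an overall "engagement" trait, although that reading depends on the data.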


## Summary: Comparing Segmentation Methods


| Method           | Pros                                 | Cons                                            | Best Use Case                      |
|------------------|--------------------------------------|-------------------------------------------------|------------------------------------|
| Cohort           | Easy to interpret, retention-focused | Misses individual variation                     | Lifecycle marketing, LTV modeling  |
| Churn Prediction | Predictive, actionable               | Requires a churn label and some modeling effort | Win-back campaigns, risk targeting |
| Latent Personas  | Unsupervised, exploratory            | Segments are harder to explain                  | UX, personalization, discovery     |

**Recommendation**: Choose the method based on the business question:
- Use **cohort analysis** when analyzing lifecycle and retention trends
- Use a **churn model** when deciding which customers to target with an intervention
- Use **PCA or clustering** to explore unknown customer behaviors
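
The last recommendation mentions clustering, which this notebook does not run. As a minimal sketch of what that step could look like, the example below applies k-means to hypothetical, synthetic customer features with two well-separated groups. The data, the choice of `n_clusters=2`, and the names `toy_features` and `labels` are assumptions for illustration; on real data the number of clusters would need to be chosen, for instance with a silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical two-group customer data with columns [Frequency, Monetary]
rng = np.random.default_rng(42)
occasional = rng.normal(loc=[2, 50], scale=[0.5, 10], size=(100, 2))
loyal = rng.normal(loc=[20, 500], scale=[2, 50], size=(100, 2))
toy_features = np.vstack([occasional, loyal])

# Standardize so both features contribute comparably to distances
scaled = StandardScaler().fit_transform(toy_features)

# k=2 is an assumption that matches how this toy data was generated
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Describe each cluster by its average raw feature values
for k in range(2):
    print(k, toy_features[labels == k].mean(axis=0).round(1))
```

Per-cluster averages like these are the starting point for naming personas; the cluster numbers themselves carry no meaning.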
