# Customer Lifetime Value (CLV) Prediction

## 📚 Overview

In this notebook, we'll build a **Machine Learning model** to predict customer lifetime value using the customer segments we created with EMR.

### What We'll Learn:
1. **Exploratory Data Analysis (EDA)** - Understanding our data
2. **Feature Engineering** - Creating predictive features
3. **Model Training** - Random Forest Regression
4. **Model Evaluation** - Metrics and validation
5. **Predictions** - Making business decisions

### Business Problem:
**Goal**: Predict how much revenue each customer will generate in the next 12 months

**Why it matters**:
- Budget allocation (how much to spend on retention vs acquisition)
- Identify high-value customers for VIP treatment
- Predict churn risk (low predicted CLV = potential churner)
- Personalize marketing spend per customer

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import boto3
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")
print(f"   Pandas version: {pd.__version__}")
print(f"   Numpy version: {np.__version__}")

## 1. Load Data from S3

We'll load the customer segments data that EMR created using K-Means clustering.
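A single `list_objects_v2` call returns at most 1,000 keys, so a partition holding many part files would be listed only partially. The following is a minimal sketch of a paginated listing helper, assuming a boto3 S3 client is passed in; the helper name `list_parquet_keys` is hypothetical and not part of this notebook.

```python
def list_parquet_keys(s3_client, bucket, prefix):
    """Return every .parquet key under a prefix, following pagination.

    Hypothetical helper: expects a boto3 S3 client (or any object that
    exposes the same get_paginator interface).
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                keys.append(obj["Key"])
    return keys
```

With the client created in the next cell, a call would look like `list_parquet_keys(s3, CURATED_BUCKET, 'customer-segments/segment=0/')`.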

In [None]:
# S3 configuration
CURATED_BUCKET = 'data-lake-curated-zone-616129051451'
PROCESSED_BUCKET = 'data-lake-processed-zone-616129051451'
SCRIPTS_BUCKET = 'data-lake-scripts-616129051451'

s3 = boto3.client('s3')

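print("📊 Loading customer segments from S3...")

# Load all segments (one partition folder per segment)
segments_dfs = []
for segment_id in range(4):
    response = s3.list_objects_v2(
        Bucket=CURATED_BUCKET,
        Prefix=f'customer-segments/segment={segment_id}/'
    )
    
    for obj in response.get('Contents', []):
        if obj['Key'].endswith('.parquet'):
            # Download from S3, then read the file into a DataFrame
            local_path = f'/tmp/segment_{segment_id}.parquet'
            s3.download_file(CURATED_BUCKET, obj['Key'], local_path)
            df = pd.read_parquet(local_path)
            # The partition value lives in the S3 path and may be missing from the file itself
            if 'segment' not in df.columns:
                df['segment'] = segment_id
            segments_dfs.append(df)

# Combine all segments (fail early with a clear message if nothing was found)
if not segments_dfs:
    raise RuntimeError(f"No parquet files found under s3://{CURATED_BUCKET}/customer-segments/")
customers_df = pd.concat(segments_dfs, ignore_index=True)

print(f"✅ Loaded {len(customers_df)} customers")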
print(f"   Columns: {list(customers_df.columns)}")
customers_df.head()

## 2. Exploratory Data Analysis (EDA)

Let's understand our data before building the model.

In [None]:
# Basic statistics
print("📊 Dataset Overview:")
print(f"   Shape: {customers_df.shape}")
print(f"   Memory: {customers_df.memory_usage(deep=True).sum() / 1024:.2f} KB")
print("\n📈 Data Types:")
print(customers_df.dtypes)
print("\n📉 Missing Values:")
print(customers_df.isnull().sum())
print("\n📊 Numerical Statistics:")
customers_df.describe()

In [None]:
# Segment distribution
print("\n🎯 Customer Segments Distribution:")
segment_dist = customers_df['segment_label'].value_counts()
print(segment_dist)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Segment counts
segment_dist.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c', '#f39c12', '#3498db'])
axes[0].set_title('Customer Count by Segment', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Segment')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Revenue by segment
revenue_by_segment = customers_df.groupby('segment_label')['total_spent'].sum().sort_values(ascending=False)
revenue_by_segment.plot(kind='bar', ax=axes[1], color=['#2ecc71', '#e74c3c', '#f39c12', '#3498db'])
axes[1].set_title('Total Revenue by Segment', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Segment')
axes[1].set_ylabel('Total Spent ($)')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n💰 Revenue by Segment:")
print(revenue_by_segment)

In [None]:
# Correlation analysis
print("\n🔗 Feature Correlations:")
numeric_cols = ['total_spent', 'purchase_count', 'days_since_purchase', 'segment']
correlation_matrix = customers_df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(correlation_matrix)

## 3. Feature Engineering

Create new features that help predict customer lifetime value.

### Features We'll Create:
1. **avg_order_value** = total_spent / purchase_count
2. **recency_score** = 1 - days_since_purchase / max(days_since_purchase), so recent buyers score higher
3. **frequency_score** = purchase_count / max(purchase_count)
4. **monetary_score** = total_spent / max(total_spent)
5. **engagement_score** = 0.3 × recency_score + 0.3 × frequency_score + 0.4 × monetary_score
6. **One-hot encoding** for categorical features (region, payment method), plus a label-encoded segment

In [None]:
# Create a copy for feature engineering
df_features = customers_df.copy()

print("🔧 Engineering features...")

# 1. Average order value (zero purchases would divide by zero, so treat them as 0)
df_features['avg_order_value'] = (
    df_features['total_spent'] / df_features['purchase_count'].replace(0, np.nan)
).fillna(0)

# 2. Recency score (1 minus normalized days since purchase; recent = higher)
max_days = df_features['days_since_purchase'].max()
df_features['recency_score'] = 1 - (df_features['days_since_purchase'] / max_days)

# 3. Frequency score (normalized purchase count)
max_purchases = df_features['purchase_count'].max()
df_features['frequency_score'] = df_features['purchase_count'] / max_purchases

# 4. Monetary score (normalized total spent)
max_spent = df_features['total_spent'].max()
df_features['monetary_score'] = df_features['total_spent'] / max_spent

# 5. Engagement score (weighted combination)
df_features['engagement_score'] = (
    0.3 * df_features['recency_score'] +
    0.3 * df_features['frequency_score'] +
    0.4 * df_features['monetary_score']
)

# 6. Encode categorical features
# Region
region_dummies = pd.get_dummies(df_features['primary_region'], prefix='region')
df_features = pd.concat([df_features, region_dummies], axis=1)

# Payment method
payment_dummies = pd.get_dummies(df_features['preferred_payment'], prefix='payment')
df_features = pd.concat([df_features, payment_dummies], axis=1)

# Segment label encoding
le = LabelEncoder()
df_features['segment_encoded'] = le.fit_transform(df_features['segment_label'])

print(f"✅ Created {len(df_features.columns) - len(customers_df.columns)} new features")
print(f"   Total features now: {len(df_features.columns)}")
print("\n📊 New features sample:")
df_features[['customer_id', 'avg_order_value', 'recency_score', 'frequency_score', 
             'monetary_score', 'engagement_score']].head()

## 4. Create Target Variable (CLV)

### What is Customer Lifetime Value (CLV)?

**CLV** = Predicted revenue a customer will generate over their lifetime (or a specific period).

### Our Approach:
Since we don't have future data, we'll create a **proxy CLV** based on historical behavior:

```
CLV_12_months = (total_spent / days_active) * 365
```

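This estimates annual revenue by extrapolating each customer's historical spending rate. Two caveats apply. First, `days_active` is approximated by `days_since_purchase`, which measures recency rather than tenure. Second, the proxy is a deterministic function of `total_spent` and `days_since_purchase`, which are also model inputs, so the evaluation scores in this notebook are optimistic: they show how well the model recovers the formula, not how well it forecasts future revenue.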

**In production**, you'd use:
- Actual 12-month revenue (if you have historical data)
- Cohort analysis
- Survival analysis
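The first option, an observed 12-month target, can be sketched with a cutoff date: features come only from orders before the cutoff, and the target is revenue in the 12 months after it. This is an illustration under stated assumptions, not this notebook's pipeline; the `transactions` table and its column names (`order_date`, `amount`) are hypothetical.

```python
import pandas as pd

# Hypothetical transaction log; table and column names are illustrative assumptions
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2022-03-01", "2022-11-15", "2023-04-10",
        "2022-06-20", "2023-09-05", "2022-12-30",
    ]),
    "amount": [50.0, 70.0, 30.0, 200.0, 120.0, 15.0],
})

cutoff = pd.Timestamp("2023-01-01")
horizon_end = cutoff + pd.DateOffset(months=12)

# Features may only use orders placed before the cutoff
history = transactions[transactions["order_date"] < cutoff]
features = history.groupby("customer_id").agg(
    total_spent=("amount", "sum"),
    purchase_count=("amount", "count"),
    last_order=("order_date", "max"),
)
features["days_since_purchase"] = (cutoff - features["last_order"]).dt.days

# Target: revenue actually observed in the 12 months after the cutoff
in_window = (transactions["order_date"] >= cutoff) & (transactions["order_date"] < horizon_end)
target = transactions[in_window].groupby("customer_id")["amount"].sum()

# Customers with no orders in the window get a 12-month CLV of 0
features["clv_12_months"] = target.reindex(features.index).fillna(0.0)
features = features.drop(columns="last_order")
print(features)
```

Because the target comes from a later time window than the features, it is no longer a direct function of the model inputs.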

In [None]:
# Create target variable: 12-month CLV estimate
# Assume customers have been active for 'days_since_purchase' days
df_features['days_active'] = df_features['days_since_purchase'].clip(lower=1)  # Avoid division by zero
df_features['clv_12_months'] = (df_features['total_spent'] / df_features['days_active']) * 365

# Cap extreme values (prevent unrealistic predictions)
df_features['clv_12_months'] = df_features['clv_12_months'].clip(upper=df_features['clv_12_months'].quantile(0.95))

print("🎯 Target Variable Created: 12-Month CLV")
print(f"   Mean CLV: ${df_features['clv_12_months'].mean():.2f}")
print(f"   Median CLV: ${df_features['clv_12_months'].median():.2f}")
print(f"   Min CLV: ${df_features['clv_12_months'].min():.2f}")
print(f"   Max CLV: ${df_features['clv_12_months'].max():.2f}")

# Visualize CLV distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df_features['clv_12_months'], bins=10, color='skyblue', edgecolor='black')
axes[0].set_title('CLV Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('12-Month CLV ($)')
axes[0].set_ylabel('Frequency')

# Box plot by segment
df_features.boxplot(column='clv_12_months', by='segment_label', ax=axes[1])
axes[1].set_title('CLV by Customer Segment', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Segment')
axes[1].set_ylabel('12-Month CLV ($)')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

## 5. Prepare Training Data

Split data into:
- **Training set** (80%) - Used to train the model
- **Test set** (20%) - Used to evaluate model performance
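With a small dataset, a single 80/20 split can give a noisy estimate because the test set holds only a few rows. K-fold cross-validation is a common complement. The sketch below runs on synthetic data; `X_demo` and `y_demo` are stand-ins, not this notebook's feature matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target (not the notebook's data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(60, 3))
y_demo = 100 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(0, 1, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# One R² score per fold; the spread shows how much the estimate depends on the split
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print(f"R² per fold: {np.round(scores, 3)}")
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

Passing the notebook's `X` and `y` in place of the synthetic arrays would give a per-fold view of the same model.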

In [None]:
# Select features for modeling
feature_cols = [
    'total_spent',
    'purchase_count',
    'days_since_purchase',
    'avg_order_value',
    'recency_score',
    'frequency_score',
    'monetary_score',
    'engagement_score',
    'segment_encoded'
]

# Add one-hot encoded region columns
region_cols = [col for col in df_features.columns if col.startswith('region_')]
feature_cols.extend(region_cols)

# Add one-hot encoded payment columns
payment_cols = [col for col in df_features.columns if col.startswith('payment_')]
feature_cols.extend(payment_cols)

X = df_features[feature_cols]
y = df_features['clv_12_months']

print(f"📊 Feature Matrix Shape: {X.shape}")
print(f"   Number of features: {X.shape[1]}")
print(f"   Number of samples: {X.shape[0]}")
print(f"\n🎯 Target Variable Shape: {y.shape}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\n✅ Data Split Complete:")
print(f"   Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"   Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

## 6. Train Random Forest Model

### What is Random Forest?

**Random Forest** is an ensemble learning algorithm that:
1. Creates many decision trees (a "forest")
2. Each tree makes a prediction
3. Final prediction = average of all trees
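The three steps above can be checked directly in scikit-learn: a fitted `RandomForestRegressor` exposes its individual trees through `estimators_`, and averaging their predictions reproduces the forest's output. This is a small sketch on synthetic data, not the CLV model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(40, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 0.5, size=40)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Ask each fitted tree for its own prediction, then average across trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The manual average should match the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_demo)))
```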

### Why Random Forest?
- ✅ Handles non-linear relationships
- ✅ Works reasonably well on small tabular datasets with little tuning
- ✅ Provides feature importance
- ✅ Averaging many trees reduces variance, so it overfits less than a single decision tree
- ✅ No feature scaling required

In [None]:
print("🌲 Training Random Forest Regressor...")

# Initialize model
rf_model = RandomForestRegressor(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Maximum tree depth
    min_samples_split=2,   # Minimum samples to split a node
    min_samples_leaf=1,    # Minimum samples in leaf node
    random_state=42,       # For reproducibility
    n_jobs=-1              # Use all CPU cores
)

# Train model
rf_model.fit(X_train, y_train)

print("✅ Model trained successfully!")
print(f"   Number of trees: {rf_model.n_estimators}")
print(f"   Number of features used: {rf_model.n_features_in_}")

## 7. Model Evaluation

### Metrics We'll Use:
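1. **R² Score**: Share of target variance the model explains. 1.0 is a perfect fit, 0 is no better than always predicting the mean, and it can be negative on test data (higher = better)
2. **RMSE** (Root Mean Squared Error): Square root of the average squared error, in dollars; penalizes large misses more heavily than MAE
3. **MAE** (Mean Absolute Error): Average absolute prediction error, in dollars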
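To make the three definitions concrete, here is a small worked example that computes them by hand with NumPy on made-up numbers. It also shows a deliberately bad set of predictions producing a negative R².

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # square root of the mean squared error

ss_res = np.sum(errors ** 2)                       # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot

# Predictions that are worse than always guessing the mean give a negative R²
y_bad = y_true[::-1]
r2_bad = 1 - np.sum((y_true - y_bad) ** 2) / ss_tot

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R² = {r2:.2f}, bad-model R² = {r2_bad:.2f}")
```

Here MAE is 20.00 and RMSE is about 22.36: the two 30-dollar misses pull RMSE above MAE because squaring weights large errors more.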

In [None]:
# Make predictions
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)

# Calculate metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

print("="*70)
print("📊 MODEL PERFORMANCE METRICS")
print("="*70)
print(f"\n🎯 R² Score (Variance Explained):")
print(f"   Training: {train_r2:.4f}")
print(f"   Test:     {test_r2:.4f}")
print(f"\n📉 RMSE (Root Mean Squared Error):")
print(f"   Training: ${train_rmse:.2f}")
print(f"   Test:     ${test_rmse:.2f}")
print(f"\n📏 MAE (Mean Absolute Error):")
print(f"   Training: ${train_mae:.2f}")
print(f"   Test:     ${test_mae:.2f}")
print("\n" + "="*70)

# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set
axes[0].scatter(y_train, y_train_pred, alpha=0.6, color='blue')
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[0].set_title(f'Training Set (R² = {train_r2:.3f})', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Actual CLV ($)')
axes[0].set_ylabel('Predicted CLV ($)')
axes[0].grid(True, alpha=0.3)

# Test set
axes[1].scatter(y_test, y_test_pred, alpha=0.6, color='green')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_title(f'Test Set (R² = {test_r2:.3f})', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Actual CLV ($)')
axes[1].set_ylabel('Predicted CLV ($)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Feature Importance Analysis

Which features are most important for predicting CLV?
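The `feature_importances_` attribute used in this section is impurity-based and is computed from the training data. A common cross-check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data in which only the first feature carries signal; the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 drives the target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=1)
for name, score in zip(["signal", "noise_a", "noise_b"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

If the two rankings disagree sharply on real data, that is usually worth investigating before acting on either one.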

In [None]:
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("🎯 TOP 10 MOST IMPORTANT FEATURES:")
print(feature_importance.head(10).to_string(index=False))

# Visualize
plt.figure(figsize=(12, 6))
top_features = feature_importance.head(10)
plt.barh(range(len(top_features)), top_features['importance'], color='steelblue')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Importance Score', fontsize=12)
plt.title('Top 10 Feature Importance for CLV Prediction', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

## 9. Make Predictions on New Customers

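This dataset has no unseen customers, so we score all existing customers (including the ones used for training) and add business recommendations. For genuinely new customers, build the same feature columns and pass them to `rf_model.predict`.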
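Threshold rules like the ones in this section can also be written in vectorized form with `np.select`, which checks the conditions in order and keeps the first match. The sketch below uses a toy table and shortened tier names; apart from "At-Risk", the segment labels are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; segment labels other than "At-Risk" are illustrative assumptions
demo = pd.DataFrame({
    "predicted_clv": [620.0, 340.0, 180.0, 90.0, 40.0],
    "segment_label": ["Champions", "Loyal", "Loyal", "At-Risk", "New"],
})

# Conditions are evaluated in order; the first one that matches wins
conditions = [
    demo["predicted_clv"] > 500,
    demo["predicted_clv"] > 300,
    demo["predicted_clv"] > 150,
    demo["segment_label"] == "At-Risk",
]
tiers = ["VIP", "Premium", "Growth", "Retention"]
demo["tier"] = np.select(conditions, tiers, default="Standard")
print(demo)
```

As with the row-wise function, an "At-Risk" customer lands in the retention tier only if their predicted CLV is 150 or less, because the CLV thresholds are checked first.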

In [None]:
# Predict CLV for all customers
df_features['predicted_clv'] = rf_model.predict(X)

# Compare each customer's predicted CLV to the average for their segment
# (a relative ranking within the segment, not a statistical confidence measure)
segment_avg_clv = df_features.groupby('segment_label')['predicted_clv'].mean()
df_features['segment_avg_clv'] = df_features['segment_label'].map(segment_avg_clv)
df_features['clv_vs_segment_avg'] = df_features['predicted_clv'] / df_features['segment_avg_clv']

# Business recommendations
def get_recommendation(row):
    clv = row['predicted_clv']
    segment = row['segment_label']
    
    if clv > 500:
        return "🌟 VIP Treatment - Personal account manager, exclusive offers"
    elif clv > 300:
        return "💎 Premium Care - Priority support, loyalty rewards"
    elif clv > 150:
        return "📈 Growth Potential - Upsell campaigns, engagement programs"
    elif segment == "At-Risk":
        return "⚠️ Retention Focus - Win-back campaign with 20% discount"
    else:
        return "📊 Standard Care - Regular newsletters, seasonal promotions"

df_features['recommendation'] = df_features.apply(get_recommendation, axis=1)

# Display results
print("\n" + "="*100)
print("🎯 CUSTOMER LIFETIME VALUE PREDICTIONS & RECOMMENDATIONS")
print("="*100)

results = df_features[[
    'customer_id', 'segment_label', 'total_spent', 'purchase_count',
    'predicted_clv', 'recommendation'
]].sort_values('predicted_clv', ascending=False)

print(results.to_string(index=False))

print("\n" + "="*100)
print("📊 CLV SUMMARY BY SEGMENT:")
print("="*100)
summary = df_features.groupby('segment_label').agg({
    'customer_id': 'count',
    'predicted_clv': ['mean', 'min', 'max', 'sum']
}).round(2)
print(summary)

## 10. Save Model to S3

Save the trained model so we can use it later for predictions.
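Using the model later means loading the file back with `joblib.load`. The sketch below shows a local save-and-load round trip on a small stand-in model; the file name and data are illustrative, and no S3 access is involved.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data
rng = np.random.default_rng(7)
X_demo = rng.uniform(size=(30, 2))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=7).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

For the model saved in this notebook, the file would first be fetched with `s3.download_file(SCRIPTS_BUCKET, model_s3_key, local_path)`, where `local_path` is any writable location. Loading is safest with the same scikit-learn version that produced the file.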

In [None]:
# Save model locally first
model_filename = '/tmp/clv_rf_model.joblib'
joblib.dump(rf_model, model_filename)
print(f"✅ Model saved locally: {model_filename}")

# Upload to S3
model_s3_key = 'ml-models/customer_lifetime_value/clv_rf_model.joblib'
s3.upload_file(model_filename, SCRIPTS_BUCKET, model_s3_key)
print(f"✅ Model uploaded to S3: s3://{SCRIPTS_BUCKET}/{model_s3_key}")

# Save predictions to S3
predictions_df = df_features[[
    'customer_id', 'segment_label', 'total_spent', 'purchase_count',
    'predicted_clv', 'recommendation'
]]

predictions_filename = '/tmp/clv_predictions.csv'
predictions_df.to_csv(predictions_filename, index=False)
predictions_s3_key = 'ml-predictions/clv_predictions.csv'
s3.upload_file(predictions_filename, SCRIPTS_BUCKET, predictions_s3_key)
print(f"✅ Predictions uploaded to S3: s3://{SCRIPTS_BUCKET}/{predictions_s3_key}")

print("\n🎉 Model training complete. Model and predictions saved to S3!")

## 🎓 Key Takeaways

### What We Learned:
1. **EDA is Critical** - Understanding data before modeling
2. **Feature Engineering** - Creating meaningful predictive features
3. **Random Forest** - Powerful ensemble method for regression
4. **Model Evaluation** - R², RMSE, MAE metrics
5. **Business Application** - Turning predictions into actions

### Business Impact:
- ✅ Identified high-value customers for VIP treatment
- ✅ Predicted revenue potential for budget allocation
- ✅ Created personalized recommendations per customer
- ✅ Model saved for future predictions

### Next Steps:
1. **Monitor Model Performance** - Retrain periodically with new data
2. **A/B Testing** - Test recommendations in production
3. **Feature Expansion** - Add more features (product categories, seasonality)
4. **Advanced Models** - Try XGBoost, Neural Networks
5. **Deploy as Endpoint** - Real-time predictions via API

---