# Causal Uplift & Revenue Optimization for VOD

This notebook demonstrates the complete workflow for building a causal uplift model
to optimize promotional campaigns for a Video-on-Demand platform.

## Contents

1. **Data Generation** - Create synthetic VOD dataset with hidden causal effects
2. **Feature Engineering** - Transform raw logs into model features
3. **X-Learner Training** - Train causal model with XGBoost base learners
4. **Model Evaluation** - Qini curves, AUUC, and oracle validation
5. **Policy Simulation** - Generate recommendations and simulate ROI

## Setup

In [None]:
import warnings
warnings.filterwarnings('ignore')  # Silence library warnings to keep the notebook output readable

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.4f}'.format)

print("Setup complete!")

In [None]:
# Import our modules
import sys
sys.path.insert(0, '../src')

from vod_causal.data import VODSyntheticData, CausalOracle
from vod_causal.preprocessing import FeatureTransformer, PropensityModel
from vod_causal.models import BaseLearners, XLearner, DoubleMachineLearning
from vod_causal.evaluation import (
    UpliftMetrics, PolicyRanker,
    plot_qini_curve, plot_cate_distribution, plot_cate_calibration,
    plot_propensity_distribution, create_evaluation_dashboard
)

print("Modules imported successfully!")

---

## 1. Data Generation

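Generate synthetic VOD data with 10,000 users, 500 titles, and 100,000 interactions, of which
roughly 15% are treated. The dataset includes a hidden causal structure (via `CausalOracle`)
that holds the true treatment effect for each interaction, so model estimates can later be
checked against ground truth.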
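
The internals of `VODSyntheticData` are not shown in this notebook. As a rough illustration of what "hidden causal effects" means, the sketch below simulates a toy dataset in which each row has a known true CATE that shifts its rental probability. The function name `simulate_promo_data`, the `price_sensitivity` covariate, and all coefficients are assumptions made for illustration; they are not taken from the `vod_causal` package.

```python
import numpy as np

def simulate_promo_data(n=10_000, treatment_probability=0.15, seed=42):
    """Toy generator: every row carries a hidden true CATE (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Hypothetical covariate in [0, 1): how price-sensitive the user is
    price_sensitivity = rng.uniform(0.0, 1.0, size=n)
    # Hidden ground truth: promotions help price-sensitive users more
    true_cate = 0.02 + 0.10 * price_sensitivity
    # Baseline rental probability without any promotion
    baseline = 0.05 + 0.05 * (1.0 - price_sensitivity)
    # Treatment is assigned at random here, independent of the covariate
    is_treated = rng.random(n) < treatment_probability
    # Observed outcome: a Bernoulli draw from baseline plus the effect if treated
    p_rent = baseline + true_cate * is_treated
    did_rent = rng.random(n) < p_rent
    return {
        "price_sensitivity": price_sensitivity,
        "true_cate": true_cate,
        "is_treated": is_treated,
        "did_rent": did_rent,
    }

toy = simulate_promo_data()
naive_effect = (toy["did_rent"][toy["is_treated"]].mean()
                - toy["did_rent"][~toy["is_treated"]].mean())
print(f"Naive difference in means: {naive_effect:.4f}")
print(f"Mean true CATE:            {toy['true_cate'].mean():.4f}")
```

Because treatment is randomized in this toy setup, the difference in means estimates the average effect without bias. The point of a causal model is to recover how that effect varies from row to row.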

In [None]:
# Initialize generator
generator = VODSyntheticData(
    n_users=10_000,
    n_titles=500,
    n_interactions=100_000,
    treatment_probability=0.15,  # ~15% of interactions are treated
    seed=42
)

# Generate all data
data = generator.generate_all()

print("Generated datasets:")
for name, df in data.items():
    if isinstance(df, pd.DataFrame):
        print(f"  {name}: {df.shape}")
    else:
        print(f"  {name}: {df}")

In [None]:
# Examine the summary statistics
print("\nDataset Summary:")
print("="*50)
for key, value in data['summary'].items():
    if isinstance(value, float):
        print(f"{key:30s}: {value:.4f}")
    else:
        print(f"{key:30s}: {value}")

In [None]:
# Explore the data
print("\n=== Users Metadata ===")
display(data['users_metadata'].head())

print("\n=== Titles Metadata ===")
display(data['titles_metadata'].head())

print("\n=== Treatment Log ===")
display(data['treatment_log'].head())

print("\n=== Interaction Outcomes ===")
display(data['interaction_outcomes'].head())

In [None]:
# Visualize data distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Treatment distribution
treatment_rate = data['treatment_log']['is_treated'].value_counts(normalize=True)
axes[0, 0].bar(['Control', 'Treated'], [treatment_rate[False], treatment_rate[True]])
axes[0, 0].set_title('Treatment Distribution')
axes[0, 0].set_ylabel('Proportion')

# Conversion by treatment
outcomes = data['interaction_outcomes'].merge(
    data['treatment_log'][['user_id', 'title_id', 'is_treated']], 
    on=['user_id', 'title_id']
)
conv_by_treatment = outcomes.groupby('is_treated')['did_rent'].mean()
axes[0, 1].bar(['Control', 'Treated'], [conv_by_treatment[False], conv_by_treatment[True]])
axes[0, 1].set_title('Conversion Rate by Treatment')
axes[0, 1].set_ylabel('Conversion Rate')

# True CATE distribution
axes[0, 2].hist(data['interaction_outcomes']['true_cate'], bins=50, alpha=0.7)
axes[0, 2].axvline(data['interaction_outcomes']['true_cate'].mean(), color='red', linestyle='--', 
                   label=f"Mean: {data['interaction_outcomes']['true_cate'].mean():.3f}")
axes[0, 2].set_title('True CATE Distribution (Oracle)')
axes[0, 2].set_xlabel('True CATE')
axes[0, 2].legend()

# Genre distribution
data['titles_metadata']['genre'].value_counts().plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Genre Distribution')
axes[1, 0].tick_params(axis='x', rotation=45)

# Region distribution
data['users_metadata']['geo_region'].value_counts().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('User Region Distribution')

# Revenue distribution
axes[1, 2].hist(data['interaction_outcomes']['revenue_generated'], bins=50, alpha=0.7)
axes[1, 2].set_title('Revenue Distribution')
axes[1, 2].set_xlabel('Revenue')

plt.tight_layout()
plt.show()

---

## 2. Feature Engineering

Transform raw data into a feature matrix suitable for modeling.
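
`FeatureTransformer` is used as a black box below. As a minimal sketch of two encodings commonly used in this kind of pipeline (cyclical time features and one-hot categories with a fallback for unseen values), the helpers `encode_hour_cyclical` and `one_hot` here are hypothetical and do not reflect the package's actual implementation.

```python
import numpy as np

def encode_hour_cyclical(hours):
    """Map hour-of-day onto the unit circle so that 23:00 and 00:00 end up close."""
    hours = np.asarray(hours, dtype=float)
    angle = 2.0 * np.pi * hours / 24.0
    return np.column_stack([np.sin(angle), np.cos(angle)])

def one_hot(values, categories):
    """One-hot encode against a fixed category list.

    Values not in `categories` become an all-zero row, which is one simple
    way to handle cold-start categories at prediction time.
    """
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        col = index.get(value)
        if col is not None:
            encoded[row, col] = 1.0
    return encoded

hour_features = encode_hour_cyclical([0, 12, 23])
genre_features = one_hot(["drama", "unknown"], categories=["comedy", "drama"])
```

With a raw integer hour, 23 and 0 look maximally far apart; on the circle they are neighbours, which is usually what a model should see.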

In [None]:
# Create modeling dataset
modeling_df = generator.create_modeling_dataset(data)
print(f"Modeling dataset shape: {modeling_df.shape}")
display(modeling_df.head())

In [None]:
# Initialize and fit feature transformer
transformer = FeatureTransformer(
    handle_cold_start=True,
    include_embeddings=False  # Skip embeddings for this demo
)

# Fit and transform
X = transformer.fit_transform(data, modeling_df)
print(f"Feature matrix shape: {X.shape}")
print(f"Number of features: {len(transformer.get_feature_names())}")
print(f"\nFeature names: {transformer.get_feature_names()[:20]}...")

In [None]:
# Prepare target and treatment variables
y = modeling_df['did_rent'].astype(int)
treatment = modeling_df['is_treated'].astype(int)
true_cate = modeling_df['true_cate']

print(f"Outcome distribution: {y.value_counts().to_dict()}")
print(f"Treatment distribution: {treatment.value_counts().to_dict()}")

### 2.1 Propensity Scoring

Train a propensity model and check the overlap assumption: every interaction should have a non-trivial probability of being either treated or untreated.
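
The keys returned by `check_overlap` depend on the package and are not documented here. As a sketch of what an overlap check typically reports, the hypothetical helper below summarizes the propensity range per group and the share of rows with extreme scores. The 0.05 and 0.95 bounds are an assumed rule of thumb, not values from `vod_causal`.

```python
import numpy as np

def overlap_summary(propensity, treatment, lower=0.05, upper=0.95):
    """Summarize propensity overlap between treated and control rows."""
    propensity = np.asarray(propensity, dtype=float)
    treated = np.asarray(treatment).astype(bool)
    # Rows with extreme propensities have almost no comparable counterpart
    outside = (propensity < lower) | (propensity > upper)
    return {
        "treated_min": float(propensity[treated].min()),
        "treated_max": float(propensity[treated].max()),
        "control_min": float(propensity[~treated].min()),
        "control_max": float(propensity[~treated].max()),
        "share_outside_bounds": float(outside.mean()),
    }

summary = overlap_summary(
    propensity=[0.01, 0.20, 0.50, 0.99],
    treatment=[0, 0, 1, 1],
)
```

A large `share_outside_bounds`, or treated and control ranges that barely intersect, suggests the CATE estimates in those regions rest on extrapolation.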

In [None]:
# Train propensity model
propensity_model = PropensityModel(model_type='xgboost')
propensity_model.fit(X, treatment)

# Get propensity scores
propensity = propensity_model.predict_propensity(X)

# Check overlap
overlap_diagnostics = propensity_model.check_overlap(X, treatment)
print("Propensity Score Diagnostics:")
print("="*50)
for key, value in overlap_diagnostics.items():
    if isinstance(value, float):
        print(f"{key:30s}: {value:.4f}")
    else:
        print(f"{key:30s}: {value}")

In [None]:
# Visualize propensity distribution
fig = plot_propensity_distribution(propensity, treatment)
plt.show()

---

## 3. X-Learner Training

Train the X-Learner with XGBoost base learners.
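
For readers unfamiliar with the X-Learner, the sketch below walks through its three stages: fit one outcome model per group, impute individual treatment effects, then blend two effect models using the propensity score. It uses ordinary least squares as the base learner instead of XGBoost to stay dependency-free, so it is an illustration of the meta-learner rather than a reproduction of the `XLearner` class used in this notebook. The names `fit_linear` and `x_learner_cate` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def x_learner_cate(X, y, treatment, propensity):
    """Estimate CATE with the X-Learner, using linear base learners."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.asarray(treatment).astype(bool)
    g = np.asarray(propensity, dtype=float)

    # Stage 1: separate outcome models for control and treated rows
    mu0 = fit_linear(X[~t], y[~t])
    mu1 = fit_linear(X[t], y[t])

    # Stage 2: impute individual effects with the other group's model,
    # then fit one effect model per group
    d1 = y[t] - mu0(X[t])        # treated: observed minus predicted control outcome
    d0 = mu1(X[~t]) - y[~t]      # control: predicted treated outcome minus observed
    tau1 = fit_linear(X[t], d1)
    tau0 = fit_linear(X[~t], d0)

    # Stage 3: propensity-weighted blend of the two effect models
    return g * tau0(X) + (1.0 - g) * tau1(X)

# Small demonstration on simulated data with a known linear effect
rng = np.random.default_rng(0)
n = 4_000
X_demo = rng.normal(size=(n, 1))
t_demo = rng.random(n) < 0.3
tau_demo = 1.0 + 2.0 * X_demo[:, 0]
y_demo = X_demo[:, 0] + tau_demo * t_demo + rng.normal(scale=0.1, size=n)
cate_hat = x_learner_cate(X_demo, y_demo, t_demo, propensity=np.full(n, 0.3))
```

The weighting in stage 3 leans on `tau1` when treatment is rare, since the control outcome model is then trained on far more data, which is the situation with a ~15% treatment rate.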

In [None]:
# Initialize X-Learner
xlearner = XLearner(
    base_learner_params={'n_estimators': 100, 'max_depth': 6},
    propensity_model_type='xgboost',
    random_state=42
)

# Fit the model
print("Training X-Learner...")
xlearner.fit(X, y, treatment)
print("Training complete!")

# Get diagnostics
diagnostics = xlearner.get_diagnostics()
print("\nModel Diagnostics:")
for key, value in diagnostics.items():
    if isinstance(value, dict):
        print(f"{key}:")
        for k, v in value.items():
            print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")
    else:
        print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")

In [None]:
# Predict CATE
predicted_cate = xlearner.predict(X)

print("Predicted CATE statistics:")
print(f"  Mean: {predicted_cate.mean():.4f}")
print(f"  Std:  {predicted_cate.std():.4f}")
print(f"  Min:  {predicted_cate.min():.4f}")
print(f"  Max:  {predicted_cate.max():.4f}")

In [None]:
# Feature importance for CATE
feature_names = transformer.get_feature_names()
importance_df = xlearner.get_feature_importance(
    model='cate',
    feature_names=feature_names[:X.shape[1]],  # Match feature count
    top_k=15
)

print("\nTop 15 CATE-Driving Features:")
display(importance_df)

---

## 4. Model Evaluation

Evaluate the model using Qini curves, AUUC, and comparison to oracle ground truth.
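
The exact definitions behind `compute_qini_curve` and `compute_auuc` are not shown in this notebook. One common formulation of the Qini curve, given here as a sketch, sorts rows by predicted uplift and tracks treated responders minus control responders rescaled to the treated group size. The helper names `qini_curve` and `area_above_random` are hypothetical, and the package may normalize differently.

```python
import numpy as np

def qini_curve(y_true, treatment, predictions):
    """Return (fraction targeted, cumulative incremental responders)."""
    y = np.asarray(y_true, dtype=float)
    t = np.asarray(treatment, dtype=float)
    # Target the highest predicted uplift first
    order = np.argsort(-np.asarray(predictions, dtype=float), kind="stable")
    y, t = y[order], t[order]

    n_treated = np.cumsum(t)
    n_control = np.cumsum(1.0 - t)
    resp_treated = np.cumsum(y * t)
    resp_control = np.cumsum(y * (1.0 - t))

    # Rescale control responders to the treated group size; 0 where no control rows yet
    ratio = np.divide(n_treated, n_control,
                      out=np.zeros_like(n_treated), where=n_control > 0)
    gain = resp_treated - resp_control * ratio

    fraction = np.arange(1, len(y) + 1) / len(y)
    return np.concatenate([[0.0], fraction]), np.concatenate([[0.0], gain])

def area_above_random(x, y):
    """Trapezoidal area under the curve minus the area under the random-targeting line."""
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    random_area = 0.5 * y[-1]  # straight line from (0, 0) to (1, y[-1])
    return area - random_area

qx, qy = qini_curve(
    y_true=[1, 0, 1, 0],
    treatment=[1, 1, 0, 0],
    predictions=[0.9, 0.1, 0.8, 0.2],
)
```

A curve that stays above the straight line to its endpoint means the ranking concentrates incremental conversions among the first rows targeted.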

In [None]:
# Compute Qini curve
qini_x, qini_y = UpliftMetrics.compute_qini_curve(
    y_true=y.values,
    treatment=treatment.values,
    predictions=predicted_cate
)

# Compute AUUC
auuc = UpliftMetrics.compute_auuc(qini_x, qini_y, normalize=True)
print(f"Area Under Uplift Curve (AUUC): {auuc:.4f}")

# Plot Qini curve
fig = plot_qini_curve(
    qini_x, qini_y,
    title=f"Qini Curve (AUUC: {auuc:.4f})"
)
plt.show()

In [None]:
# Compare to oracle ground truth
oracle_mse = UpliftMetrics.compute_oracle_mse(predicted_cate, true_cate.values)
oracle_corr = UpliftMetrics.compute_oracle_correlation(predicted_cate, true_cate.values)

print("Oracle Validation Metrics:")
print(f"  MSE (Predicted vs True CATE): {oracle_mse:.6f}")
print(f"  Correlation: {oracle_corr:.4f}")

In [None]:
# Plot CATE distribution comparison
fig = plot_cate_distribution(
    predicted_cate=predicted_cate,
    true_cate=true_cate.values,
    title="Predicted vs True CATE Distribution"
)
plt.show()

In [None]:
# Calibration plot
fig = plot_cate_calibration(
    predicted_cate=predicted_cate,
    true_cate=true_cate.values,
    n_bins=10,
    title="CATE Calibration Plot"
)
plt.show()

In [None]:
# Uplift by percentile
uplift_df = UpliftMetrics.compute_uplift_by_percentile(
    y_true=y.values,
    treatment=treatment.values,
    predictions=predicted_cate,
    n_bins=10
)

print("Observed Uplift by Predicted Percentile:")
display(uplift_df)

---

## 5. Policy Simulation

Generate promotional recommendations and simulate campaign performance.
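
How `PolicyRanker` turns uplift into `expected_roi` is not shown here. A plausible reading, used in the sketch below and labeled as an assumption, is expected net gain per offer: predicted uplift times the base price, minus the discount cost. Under that assumption, with a base price of 4.99 and a cost of 0.50, an offer only breaks even when predicted uplift exceeds roughly 0.10. The function `rank_promotions` is hypothetical.

```python
import numpy as np

def rank_promotions(user_ids, title_ids, predicted_uplift,
                    base_price=4.99, discount_cost=0.50,
                    min_expected_roi=0.0, top_k=5):
    """Keep offers above the ROI threshold and return the top_k per user."""
    uplift = np.asarray(predicted_uplift, dtype=float)
    # Assumed ROI definition: incremental revenue minus the cost of the discount
    expected_roi = uplift * base_price - discount_cost

    per_user = {}
    for user, title, roi in zip(user_ids, title_ids, expected_roi):
        if roi > min_expected_roi:
            per_user.setdefault(user, []).append((float(roi), title))

    # Highest expected ROI first, truncated to top_k offers per user
    return {
        user: sorted(offers, key=lambda offer: offer[0], reverse=True)[:top_k]
        for user, offers in per_user.items()
    }

offers = rank_promotions(
    user_ids=["u1", "u1", "u2"],
    title_ids=["a", "b", "c"],
    predicted_uplift=[0.30, 0.05, 0.20],
    top_k=1,
)
```

In this example the offer with uplift 0.05 is dropped because its expected gain is negative, even though the promotion would still raise that user's rental probability.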

In [None]:
# Initialize policy ranker
ranker = PolicyRanker(
    discount_cost=0.50,  # $0.50 cost per discount offered
    min_expected_roi=0.0,
    base_price=4.99
)

# Prepare user-title pairs for ranking
user_title_pairs = modeling_df[['user_id', 'title_id']].copy()
if 'genre' in modeling_df.columns:
    user_title_pairs['genre'] = modeling_df['genre']

# Generate recommendations
recommendations = ranker.rank(
    user_title_pairs=user_title_pairs,
    predicted_uplift=predicted_cate,
    top_k=5
)

print(f"Generated {len(recommendations)} recommendations")
print(f"For {recommendations['user_id'].nunique()} users")
display(recommendations.head(20))

In [None]:
# Simulate policy performance
outcomes_for_sim = modeling_df[['user_id', 'title_id', 'did_rent']].copy()
outcomes_for_sim['is_treated'] = treatment.values

simulation_results = ranker.simulate_policy(
    recommendations=recommendations,
    outcomes_df=outcomes_for_sim,
    treatment_col='is_treated',
    outcome_col='did_rent'
)

print("\nPolicy Simulation Results:")
print("="*50)
for key, value in simulation_results.items():
    if isinstance(value, float):
        print(f"{key:35s}: {value:.4f}")
    else:
        print(f"{key:35s}: {value}")

In [None]:
# Compare different targeting intensities
k_results = ranker.successive_k_item_ranking(
    user_title_pairs=user_title_pairs,
    predicted_uplift=predicted_cate,
    k_values=[1, 3, 5, 10, 20]
)

print("\nRecommendations at Different k Values:")
for k, recs in k_results.items():
    avg_uplift = recs['predicted_uplift'].mean() if len(recs) > 0 else 0
    avg_roi = recs['expected_roi'].mean() if len(recs) > 0 else 0
    print(f"  k={k:2d}: {len(recs):6d} recs, avg_uplift={avg_uplift:.4f}, avg_roi=${avg_roi:.2f}")

In [None]:
# Create campaign report
report = ranker.create_campaign_report(
    recommendations=recommendations,
    title_metadata=data['titles_metadata'],
    user_metadata=data['users_metadata']
)

print("Overall Summary:")
display(report['overall'])

print("\nBy Genre:")
display(report['by_genre'])

print("\nBy Region:")
display(report['by_region'])

---

## Summary

In this notebook, we demonstrated:

1. **Synthetic Data Generation**: Created a realistic VOD dataset with hidden causal effects using `VODSyntheticData` and `CausalOracle`

2. **Feature Engineering**: Transformed raw logs into model features with one-hot encoding, cyclical timestamps, and cold-start handling

3. **X-Learner Training**: Trained a causal uplift model using the X-Learner meta-learner with XGBoost base models

4. **Evaluation**: Validated the model using Qini curves, AUUC, and comparison to oracle ground truth

5. **Policy Simulation**: Generated ROI-optimized promotional recommendations and simulated campaign performance

### Key Findings

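- The X-Learner estimates heterogeneous treatment effects; the oracle MSE and correlation in Section 4 quantify how closely its predictions track the true CATE
- Because the data is synthetic, predicted CATE can be checked directly against the oracle, which is not possible with real campaign data
- Targeting by predicted uplift is expected to outperform random targeting; the Qini curve and the policy simulation measure the size of that gain
- All metrics in this notebook are computed on the same data the model was trained on, so they are likely optimistic compared with a held-out evaluation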