# Price Elasticity Estimation with Double Machine Learning (EconML)

This notebook demonstrates how to use **Double Machine Learning (DML)** to estimate price elasticity for Video-on-Demand (VOD) titles. Unlike standard churn models, which predict *risk*, this model estimates *sensitivity to price*, which lets us find the revenue-maximizing price for each user.

### Workflow
1. **Data Generation**: Create synthetic data with continuous pricing and hidden confounding (e.g., peak demand bias).
2. **DML Modeling**: Use `EconML`'s `LinearDML` or `CausalForestDML` to estimate elasticity ($\theta(X)$).
3. **Optimization**: Use the estimated elasticity to calculate the optimal price point for each user.
4. **Evaluation**: Compare estimated elasticity against the ground truth from our Oracle.

In [None]:
import sys
import os

# Add src to path
sys.path.append(os.path.abspath("../src"))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from vod_causal.data.generator import VODSyntheticData
from vod_causal.models.dml import DMLWithEconML
from revenue_optimizer import bulk_optimize

%matplotlib inline
sns.set_style("whitegrid")

## 1. Generate Synthetic Data with Pricing Bias

We generate 5,000 users, 100 titles, and 20,000 interactions.

**Crucial:** The data generator now simulates "confounding":
- Users with high watch time (loyal) are offered *higher* prices (or smaller discounts) on average.
- This simulates a naive algorithm attempting to maximize immediate revenue.
- A simple regression would incorrectly conclude that "higher prices -> higher purchase probability" (because loyal users buy anyway).
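
The sign flip described above can be reproduced in a few lines. The sketch below is a toy simulation, not the notebook's generator: the variable names, coefficients, and the single `loyalty` confounder are all assumptions chosen for illustration. It fits a naive regression of rentals on price, then the same regression with the confounder included.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (NOT the VODSyntheticData generator).
loyalty = rng.normal(size=n)                                   # hidden confounder
price = 5.0 + 1.0 * loyalty + rng.normal(scale=0.5, size=n)    # loyal users see higher prices
true_effect = -0.05                                            # each extra $1 lowers rental probability by 0.05
prob = np.clip(0.5 + 0.15 * loyalty + true_effect * (price - 5.0), 0.0, 1.0)
rented = (rng.random(n) < prob).astype(float)

# Naive regression: rented ~ price, ignoring loyalty.
naive_slope = np.polyfit(price, rented, deg=1)[0]

# Adjusted regression: rented ~ price + loyalty.
design = np.column_stack([np.ones(n), price, loyalty])
adjusted_slope = np.linalg.lstsq(design, rented, rcond=None)[0][1]

print(f"naive slope:    {naive_slope:+.3f}")
print(f"adjusted slope: {adjusted_slope:+.3f}  (true effect: {true_effect:+.3f})")
```

In this toy setup the naive slope comes out positive while the adjusted slope recovers a negative effect, which is the bias DML is meant to remove when the confounders are observed.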

In [None]:
generator = VODSyntheticData(n_users=5000, n_titles=100, n_interactions=20000, seed=42)
data_dict = generator.generate_all()
df = generator.create_modeling_dataset(data_dict)

print(f"Dataset shape: {df.shape}")
df.head()

### Inspect Confounding
Let's see if our generator successfully introduced bias. We expect higher prices for users with higher `avg_daily_watch_time`.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df.sample(2000), x="avg_daily_watch_time", y="offered_price", alpha=0.3)
plt.title("Confounding Check: Watch Time vs. Offered Price")
plt.xlabel("Avg Daily Watch Time (min)")
plt.ylabel("Offered Price ($)")
plt.show()

## 2. Train Double Machine Learning Model

We use `LinearDML` from `econml`, wrapped in our `DMLWithEconML` class.

**The Causal Question:**
- **T (Treatment):** `offered_price` (Continuous)
- **Y (Outcome):** `did_rent` (Binary 0/1)
- **X (Effect Modifiers):** `price_sensitivity`, `subscription_tenure_months`, `avg_daily_watch_time`, `geo_region`
- **W (Controls):** `is_cold_start`, `base_popularity`, `release_year`

In [None]:
# Prepare features
feature_cols = ["price_sensitivity", "subscription_tenure_months", "avg_daily_watch_time"]
control_cols = ["is_cold_start", "base_popularity", "release_year"]

# One-Hot Encoding for categorical regions
X = pd.get_dummies(df[feature_cols + ["geo_region"]], columns=["geo_region"], drop_first=True)
W = df[control_cols]
T = df["offered_price"]
Y = df["did_rent"]

print("Training DML Model (this may take a minute)...")
# model_type="forest" (CausalForestDML) detects heterogeneity better;
# model_type="linear" (LinearDML) is faster and easier to interpret.
# We start with LinearDML, wrapped in our DMLWithEconML class.
dml = DMLWithEconML(model_type="linear", n_folds=3, random_state=42)
dml.fit(X, T, Y, W=W)

print("Model Fitted!")
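
For intuition about what the fit step does, here is a minimal sketch of the partialling-out idea behind DML: predict the treatment and the outcome from the controls using cross-fitting, then regress the outcome residuals on the treatment residuals. This is not the `DMLWithEconML` implementation. It assumes a single constant effect, uses plain least squares as the nuisance model, and the helper names and toy data are hypothetical.

```python
import numpy as np


def cross_fit_residuals(features, target, n_folds=3, seed=0):
    """Out-of-fold residuals of `target` after a linear fit on `features`."""
    n = len(target)
    design = np.column_stack([np.ones(n), features])
    fold_ids = np.random.default_rng(seed).permutation(n) % n_folds
    residuals = np.empty(n)
    for fold in range(n_folds):
        held_out = fold_ids == fold
        # Fit on the other folds, predict on the held-out fold.
        coef, *_ = np.linalg.lstsq(design[~held_out], target[~held_out], rcond=None)
        residuals[held_out] = target[held_out] - design[held_out] @ coef
    return residuals


def partialling_out_effect(controls, treatment, outcome, n_folds=3, seed=0):
    """Constant treatment effect via residual-on-residual regression."""
    res_t = cross_fit_residuals(controls, treatment, n_folds, seed)
    res_y = cross_fit_residuals(controls, outcome, n_folds, seed)
    return float(res_t @ res_y / (res_t @ res_t))


# Toy data (assumed): loyalty drives both price and rentals; true effect is -0.05.
rng = np.random.default_rng(1)
n = 50_000
loyalty = rng.normal(size=n)
price = 5.0 + loyalty + rng.normal(scale=0.5, size=n)
prob = np.clip(0.5 + 0.15 * loyalty - 0.05 * (price - 5.0), 0.0, 1.0)
rented = (rng.random(n) < prob).astype(float)

theta_hat = partialling_out_effect(loyalty, price, rented)
print(f"estimated effect: {theta_hat:+.3f} (true: -0.050)")
```

`LinearDML` generalizes the final step by letting the effect vary with the features, $\theta(X)$, and in practice the nuisance models are usually more flexible learners than the least-squares fit used here.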

## 3. Analyze Estimated Elasticity

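Let's look at the estimated price effects. Because `did_rent` is binary and `offered_price` is in dollars, each estimate is the change in rental probability per $1 price increase, which we loosely call elasticity here. Remember, a **more negative** value means **higher sensitivity** (demand drops faster as price increases).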

In [None]:
# Predict elasticity for the training set
elasticities = dml.effect(X)

# Add to dataframe for analysis
df["predicted_elasticity"] = elasticities

plt.figure(figsize=(10, 5))
sns.histplot(df["predicted_elasticity"], kde=True, bins=30)
plt.title("Distribution of Predicted Price Elasticity")
plt.xlabel("Elasticity (Change in Prob per $1 Increase)")
plt.axvline(x=0, color='red', linestyle='--')
plt.show()
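
The histogram is in units of probability per dollar. If a unit-free elasticity is needed (percent change in demand per percent change in price), the standard point-elasticity formula $(dQ/dP) \cdot (P/Q)$ can be applied. The helper below is a hypothetical illustration: `base_prob` stands for a baseline rental probability at the given price, which this notebook does not compute.

```python
def point_elasticity(marginal_effect, price, base_prob):
    """Unit-free price elasticity: (dQ/dP) * (P / Q)."""
    if base_prob <= 0:
        raise ValueError("base_prob must be positive")
    return marginal_effect * price / base_prob


# Example: a $1 increase lowers rental probability by 0.05,
# at a price of $4 and a baseline rental probability of 0.25.
example = point_elasticity(marginal_effect=-0.05, price=4.0, base_prob=0.25)
print(f"{example:.2f}")  # prints -0.80
```

A value between 0 and -1 indicates inelastic demand at that price point, while values below -1 indicate elastic demand.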

### Validation: Does it match ground truth?
In synthetic data, we know the `true_elasticity`. Let's compare.

In [None]:
# The oracle column only exists with the updated generator, so check for it
# before plotting (this avoids leaving an empty figure behind).
if "true_elasticity" in df.columns:
    lo, hi = df["true_elasticity"].min(), df["true_elasticity"].max()
    plt.figure(figsize=(8, 8))
    sns.scatterplot(x=df["true_elasticity"], y=df["predicted_elasticity"], alpha=0.1)
    # 45-degree reference line: perfect predictions fall on it
    plt.plot([lo, hi], [lo, hi], color="red", linestyle="--")
    plt.title("Predicted vs. True Elasticity")
    plt.xlabel("True Elasticity (Oracle)")
    plt.ylabel("Predicted Elasticity (DML)")
    plt.show()
else:
    print("Column 'true_elasticity' not found in dataset. Ensure generator code is updated.")
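
A scatter plot is easier to judge alongside a few summary numbers. The helper below is a small sketch (the function name is hypothetical) that reports bias, RMSE, and Pearson correlation between true and predicted values. It is demonstrated on toy arrays rather than the notebook's data.

```python
import numpy as np


def elasticity_fit_metrics(true_values, predicted_values):
    """Bias, RMSE, and Pearson correlation between two equal-length arrays."""
    true_values = np.asarray(true_values, dtype=float)
    predicted_values = np.asarray(predicted_values, dtype=float)
    errors = predicted_values - true_values
    return {
        "bias": float(errors.mean()),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "pearson_r": float(np.corrcoef(true_values, predicted_values)[0, 1]),
    }


# Toy example: predictions shifted by a constant 0.01.
toy_true = np.array([-0.10, -0.20, -0.30, -0.40])
toy_pred = toy_true + 0.01
metrics = elasticity_fit_metrics(toy_true, toy_pred)
print(metrics)
```

Assuming both columns exist, the same call on the notebook's data would be `elasticity_fit_metrics(df["true_elasticity"], df["predicted_elasticity"])`.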

## 4. Revenue Optimization

Now we use the `revenue_optimizer` to find the best price for each user.

In [None]:
# Select a sample of users
sample_users = X.iloc[:100].copy()

# Run bulk optimization
optimization_results = bulk_optimize(sample_users, dml, base_probas=None)

optimization_results.head()
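
To see how an estimated price effect can translate into a price recommendation, consider a locally linear demand curve $q(p) = q_0 + \theta (p - p_0)$ with $\theta < 0$. Expected revenue $R(p) = p \cdot q(p)$ is maximized at $p^* = p_0/2 - q_0/(2\theta)$. The sketch below implements that closed form with price bounds. It is an illustration under these assumptions, not a description of what `bulk_optimize` does internally, and the function name and arguments are hypothetical.

```python
def optimal_price_linear(base_prob, marginal_effect, base_price, min_price, max_price):
    """Revenue-maximizing price under q(p) = base_prob + marginal_effect * (p - base_price)."""
    if marginal_effect >= 0:
        # Demand does not fall with price, so revenue grows up to the cap.
        return max_price
    unconstrained = base_price / 2.0 - base_prob / (2.0 * marginal_effect)
    # Keep the recommendation inside the allowed price range.
    return min(max(unconstrained, min_price), max_price)


# Example: 30% rental probability at $4, losing 0.05 probability per extra $1.
best = optimal_price_linear(
    base_prob=0.30, marginal_effect=-0.05, base_price=4.0, min_price=1.0, max_price=10.0
)
print(f"{best:.2f}")  # prints 5.00
```

The linear approximation is only reasonable near the observed price range, since it can push the implied probability outside $[0, 1]$ for extreme prices. That is one reason to clip recommendations to explicit bounds.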

### Recommended Prices Distribution

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(optimization_results["optimal_price"], bins=15)
plt.title("Distribution of Optimized Prices")
plt.xlabel("Recommended Price ($)")
plt.show()