
# AI-Driven Revenue Optimization: LTV, Churn & Uplift Modeling (End-to-End)

**Author:** James "Jay" Burgess  
**Role:** Data & AI Engineer · Applied Data Scientist · Revenue Architect

This notebook is designed as a *portfolio-grade* showcase.

It simulates a realistic B2C/B2B funnel and walks through:

1. Synthetic but structured **Lakehouse-style** data generation (customers, marketing, product usage, revenue).
2. Feature engineering for:
   - Customer Lifetime Value (LTV)
   - Churn risk
   - Marketing responsiveness
3. Training:
   - A **churn model** (classification)
   - An **LTV model** (regression)
   - A simple **uplift / treatment effect** model (who should get offers)
4. Evaluation with standard metrics (ROC-AUC, RMSE) and visual diagnostics.
5. **Explainability & decision intelligence**:
   - Feature importance
   - Policy simulation: which customers to target to maximize incremental revenue.

This is how I think about AI/ML: not as a toy model, but as a decision system wired to revenue.


In [None]:

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, RandomForestClassifier

import matplotlib.pyplot as plt

np.random.seed(42)



## 1. Generate Synthetic Customer & Funnel Data

We simulate:

- Acquisition channels & spend efficiency
- Engagement behavior (sessions, product usage, support touches)
- An experimental **offer** (treatment vs control)
- Resulting revenue over an observation window
- Churn (binary), proxied as *observed LTV below a cutoff*

The goal: create a dataset that *behaves like* what you'd see in a real growth / RevOps environment.
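
To make the experimental design concrete, here is a minimal, dependency-free sketch of the same idea: a multiplicative offer effect applied to a lognormal baseline, with randomized 50/50 assignment. The function name `simulate_offer_experiment` and every parameter value are illustrative assumptions, not the notebook's actual generator.

```python
import math
import random
import statistics

def simulate_offer_experiment(n=20000, effect=0.25, noise_sd=0.35, seed=0):
    """Simulate LTV with a multiplicative treatment effect and lognormal noise.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        baseline = math.exp(rng.gauss(0.0, 0.12))   # latent customer value
        is_treated = rng.random() < 0.5             # randomized 50/50 assignment
        ltv = baseline * (1 + effect * is_treated)  # multiplicative uplift
        ltv *= math.exp(rng.gauss(0.0, noise_sd))   # observation noise
        (treated if is_treated else control).append(ltv)
    return treated, control

treated, control = simulate_offer_experiment()
ratio = statistics.fmean(treated) / statistics.fmean(control)
print(f"treated/control mean LTV ratio: {ratio:.3f}")
```

Because assignment is random, the ratio of group means should land close to `1 + effect`. That property is what lets a later uplift model be checked against a known ground truth.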


In [None]:

n_customers = 20000

# Acquisition channels
channels = ["paid_search", "paid_social", "organic", "affiliate", "direct"]
channel_probs = [0.25, 0.25, 0.2, 0.15, 0.15]

acq_channel = np.random.choice(channels, size=n_customers, p=channel_probs)

# Base propensities by channel
channel_base_value = {
    "paid_search": 1.1,
    "paid_social": 0.9,
    "organic": 1.3,
    "affiliate": 1.0,
    "direct": 1.4,
}

# Customer attributes
tenure_months = np.random.gamma(shape=3, scale=4, size=n_customers)  # >0
tenure_months = np.clip(tenure_months, 1, 48)

monthly_visits = np.random.poisson(lam=4, size=n_customers) + np.random.binomial(4, 0.3, size=n_customers)
monthly_visits = np.clip(monthly_visits, 0, None)

product_events = np.random.poisson(lam=8, size=n_customers) + (monthly_visits * np.random.uniform(0.5, 1.5, n_customers))
product_events = np.clip(product_events, 0, None)

support_tickets = np.random.poisson(lam=0.3, size=n_customers)
region = np.random.choice(["NA", "EU", "LATAM", "APAC"], size=n_customers, p=[0.4, 0.25, 0.2, 0.15])

# Experimental treatment: some customers receive an offer
treatment = np.random.binomial(1, 0.5, size=n_customers)

# Latent "affinity" score
base_affinity = (
    0.05 * tenure_months
    + 0.02 * monthly_visits
    + 0.015 * product_events
    - 0.1 * support_tickets
    + np.array([channel_base_value[c] for c in acq_channel])
    + np.random.normal(0, 1, n_customers)
)

# Channel multipliers for revenue
channel_multiplier = {
    "paid_search": 1.0,
    "paid_social": 0.8,
    "organic": 1.2,
    "affiliate": 0.9,
    "direct": 1.3,
}

# Treatment effect: only some segments lift
treatment_effect = (
    0.25 * (np.isin(acq_channel, ["paid_search", "paid_social"]).astype(int))
    + 0.15 * (monthly_visits > 4)
)

# Baseline LTV (over 6-12 months)
baseline_ltv = np.exp(2.2 + 0.12 * base_affinity / np.std(base_affinity))
baseline_ltv = baseline_ltv / 50.0  # scale down

# Apply channel multiplier
baseline_ltv *= np.array([channel_multiplier[c] for c in acq_channel])

# Apply treatment uplift on a multiplicative scale
ltv_with_treatment = baseline_ltv * (1 + treatment * treatment_effect)

# Introduce noise
observed_ltv = ltv_with_treatment * np.exp(np.random.normal(0, 0.35, n_customers))

# Cap LTV
observed_ltv = np.clip(observed_ltv, 0, 5000)

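# Define churn as low realized revenue. With the scaling above, observed_ltv
# sits far below 150, so a fixed cutoff of 150 would flag every customer as
# churned and leave the classifier with a single class. A quantile-based
# cutoff (assumption: bottom 25% of customers) keeps both classes present.
churn_threshold = np.quantile(observed_ltv, 0.25)
churn = (observed_ltv < churn_threshold).astype(int)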

data = pd.DataFrame({
    "customer_id": np.arange(1, n_customers + 1),
    "acq_channel": acq_channel,
    "tenure_months": tenure_months.round(1),
    "monthly_visits": monthly_visits,
    "product_events": product_events,
    "support_tickets": support_tickets,
    "region": region,
    "treatment_offer": treatment,
    "observed_ltv": observed_ltv.round(2),
    "churned": churn,
})

data.head()



### Quick Sanity Check

Check distribution of LTV and churn to ensure this "feels" like a real-world dataset.


In [None]:

print(data.describe(include='all').T)

plt.figure(figsize=(6,4))
plt.hist(data["observed_ltv"], bins=50)
plt.title("Observed LTV Distribution")
plt.xlabel("LTV")
plt.ylabel("Count")
plt.show()

churn_rate = data["churned"].mean()
print(f"Churn rate: {churn_rate:.2%}")



## 2. Feature Engineering

We:
- One-hot encode categorical variables
- Create interaction-style features (e.g., intensity / tenure)
- Keep the pipeline explicit so it reads like production code.
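
As a small illustration of the two ideas above, the sketch below shows one-hot encoding with a dropped baseline level and a ratio feature guarded against division by zero. It uses plain Python so the mechanics are visible; `one_hot` and `safe_ratio` are hypothetical helpers, not part of pandas or this notebook.

```python
def one_hot(values, drop_first=True):
    """One-hot encode a list of category labels into a list of dicts.

    Hypothetical helper: levels are sorted alphabetically, and with
    drop_first=True the first level becomes the implicit baseline.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [{f"is_{lvl}": int(v == lvl) for lvl in kept} for v in values]

def safe_ratio(numerator, denominator, default=0.0):
    """Return numerator / denominator, or `default` when the denominator is zero."""
    return numerator / denominator if denominator else default

rows = one_hot(["organic", "direct", "affiliate", "direct"])
print(rows[0])                                # {'is_direct': 0, 'is_organic': 1}
print(safe_ratio(12, 4), safe_ratio(12, 0))   # 3.0 0.0
```

Dropping one level per categorical avoids perfectly collinear indicator columns. That matters more for linear models than for the tree ensembles used later, but it keeps the feature set compact either way.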


In [None]:

df = data.copy()

# Basic derived features
df["events_per_visit"] = np.where(df["monthly_visits"] > 0,
                                  df["product_events"] / df["monthly_visits"],
                                  0)
df["support_intensity"] = df["support_tickets"] / (df["tenure_months"] + 1)

# One-hot encode
cat_cols = ["acq_channel", "region"]
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

feature_cols = [c for c in df_encoded.columns
                if c not in ["customer_id", "observed_ltv", "churned"]]

X = df_encoded[feature_cols]
y_churn = df_encoded["churned"]
y_ltv = df_encoded["observed_ltv"]

X.shape, y_churn.shape, y_ltv.shape



## 3. Train/Test Split

We keep the split simple and reproducible.
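
Stratifying on the churn label keeps the churn rate the same in both partitions. A minimal sketch of that idea in plain Python follows; `stratified_split` is a hypothetical helper written for illustration and is not how scikit-learn implements it internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)                       # shuffle within the class only
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 20 + [0] * 60                   # 25% churners
train_idx, test_idx = stratified_split(labels)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_rate)   # 60 20 0.25
```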


In [None]:

X_train, X_test, y_churn_train, y_churn_test, y_ltv_train, y_ltv_test = train_test_split(
    X, y_churn, y_ltv, test_size=0.25, random_state=42, stratify=y_churn
)

X_train.shape, X_test.shape



## 4. Churn Model (Classification)

We start with a Gradient Boosting Classifier as a strong baseline.

We care about:
- ROC-AUC
- Rank ordering (who is likely to churn)
- Business interpretation: target high-risk customers with retention actions.
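
ROC-AUC has a direct rank-ordering reading: it is the probability that a randomly chosen churner gets a higher score than a randomly chosen non-churner. The brute-force sketch below makes that explicit on a tiny example; `pairwise_auc` is an illustrative helper, and in practice `roc_auc_score` does the same job efficiently.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. O(n_pos * n_neg), so only for small examples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))   # 0.75
```

Three of the four churner/non-churner pairs are ordered correctly here, hence 0.75. The `ValueError` branch is also a reminder that AUC is undefined when the test set contains only one class.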


In [None]:

churn_model = GradientBoostingClassifier(random_state=42)
churn_model.fit(X_train, y_churn_train)

churn_proba_test = churn_model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_churn_test, churn_proba_test)
print(f"Churn ROC-AUC: {auc:.3f}")


In [None]:

fpr, tpr, _ = roc_curve(y_churn_test, churn_proba_test)
plt.figure(figsize=(5,4))
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Churn Model ROC Curve")
plt.legend()
plt.show()



## 5. LTV Model (Regression)

We model *log(LTV + 1)* to tame the right-skewed target, then back-transform predictions with `expm1` so RMSE is reported in LTV units.

This supports:
- Segmentation by value
- Forecasting revenue impact of cohorts
- Feeding downstream bidding / targeting / prioritization systems.
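
The log transform and its inverse are easy to get subtly wrong, so here is a minimal sketch of the round trip with made-up numbers: predictions live on the `log1p` scale, and RMSE is computed only after mapping them back with `expm1`. The values are assumptions chosen so the arithmetic can be checked by hand.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error on whatever scale the inputs are in."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual_ltv = [10.0, 20.0, 30.0]
# Stand-in for model output on the log scale (illustrative values)
pred_log = [math.log1p(12.0), math.log1p(18.0), math.log1p(33.0)]

pred_ltv = [math.expm1(v) for v in pred_log]   # back to LTV units
print(round(rmse(actual_ltv, pred_ltv), 4))    # 2.3805

# Error on the log scale is a different quantity and is not in LTV units
print(round(rmse([math.log1p(a) for a in actual_ltv], pred_log), 4))
```

One caveat worth keeping in mind: a model fit on the log scale targets the mean of log-LTV, so back-transformed predictions tend to sit below the arithmetic mean of LTV when the residual spread is large.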


In [None]:

y_ltv_train_log = np.log1p(y_ltv_train)
y_ltv_test_log = np.log1p(y_ltv_test)

ltv_model = GradientBoostingRegressor(random_state=42)
ltv_model.fit(X_train, y_ltv_train_log)

y_ltv_pred_log = ltv_model.predict(X_test)
y_ltv_pred = np.expm1(y_ltv_pred_log)

rmse = np.sqrt(mean_squared_error(y_ltv_test, y_ltv_pred))
print(f"LTV RMSE: {rmse:.2f}")


In [None]:

plt.figure(figsize=(5,4))
plt.scatter(y_ltv_test, y_ltv_pred, alpha=0.3)
plt.xlabel("Actual LTV")
plt.ylabel("Predicted LTV")
plt.title("LTV: Actual vs Predicted")
plt.show()



## 6. Uplift Modeling (Who Should Get the Offer?)

We approximate uplift in a **two-model** fashion (the approach usually called a T-learner):

1. Train separate models:
   - on treated customers
   - on control customers

2. For each customer, estimate:
   - predicted outcome if treated
   - predicted outcome if not treated

3. Uplift = prediction_treated - prediction_control

To keep the demo simple, the outcome modeled below is the probability of above-median LTV rather than LTV itself.

In production, you'd likely use:
- Causal forests
- More robust meta-learners (e.g., X-learner)
- Or library support for uplift models

Here we show the concept in a compact, readable way.
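
The two-model idea can be reduced to its bare arithmetic. In the sketch below each "model" is just the mean outcome per segment, fit separately on treated and control records; uplift is the difference between the two predictions. The records and the helper `fit_segment_means` are made up for illustration.

```python
from collections import defaultdict
from statistics import fmean

def fit_segment_means(rows):
    """'Model' = mean outcome per segment. rows: iterable of (segment, outcome)."""
    grouped = defaultdict(list)
    for segment, outcome in rows:
        grouped[segment].append(outcome)
    return {segment: fmean(values) for segment, values in grouped.items()}

# (segment, treated, ltv): toy records, values are assumptions
records = [
    ("paid", 1, 120.0), ("paid", 1, 130.0), ("paid", 0, 100.0), ("paid", 0, 100.0),
    ("organic", 1, 151.0), ("organic", 1, 149.0), ("organic", 0, 150.0), ("organic", 0, 150.0),
]

mean_if_treated = fit_segment_means((s, y) for s, t, y in records if t == 1)
mean_if_control = fit_segment_means((s, y) for s, t, y in records if t == 0)

uplift = {s: mean_if_treated[s] - mean_if_control[s] for s in mean_if_treated}
print(uplift)   # {'paid': 25.0, 'organic': 0.0}
```

Note that the organic segment has the higher LTV but zero uplift. Targeting by value and targeting by incremental effect can point at different customers, which is the reason to model uplift at all.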


In [None]:

# Fit the two arm-specific models on the training split only, so the policy
# check that follows scores customers the models have not seen.
train_treated = (X_train["treatment_offer"] == 1)
train_control = (X_train["treatment_offer"] == 0)

X_treated = X_train[train_treated]
X_control = X_train[train_control]

# For simplicity in this demo, model uplift on high-vs-low value:
# the target is whether LTV exceeds the training-set median.
# (treatment_offer is constant within each arm, so the trees never split on it.)
ltv_median = y_ltv_train.median()
high_value_treated = (y_ltv_train[train_treated] > ltv_median).astype(int)
high_value_control = (y_ltv_train[train_control] > ltv_median).astype(int)

model_treated = RandomForestClassifier(n_estimators=80, random_state=42)
model_control = RandomForestClassifier(n_estimators=80, random_state=42)

model_treated.fit(X_treated, high_value_treated)
model_control.fit(X_control, high_value_control)

# Estimate uplift on the held-out test customers
proba_treated = model_treated.predict_proba(X_test)[:, 1]
proba_control = model_control.predict_proba(X_test)[:, 1]
uplift_score = proba_treated - proba_control

df_uplift = df_encoded.loc[X_test.index].copy()
df_uplift["uplift_score"] = uplift_score

df_uplift[["customer_id", "uplift_score"]].head()


In [None]:

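# How much incremental value if we target top 20% uplift?
# Average LTV alone mostly reflects baseline customer value, so compare
# treated vs control customers inside each segment instead.

top_k = int(0.2 * len(df_uplift))
ranked = df_uplift.sort_values("uplift_score", ascending=False)
top_segment = ranked.head(top_k)
bottom_segment = ranked.tail(top_k)

def incremental_ltv(segment):
    # Mean observed LTV of treated customers minus that of control customers
    treated_ltv = segment.loc[segment["treatment_offer"] == 1, "observed_ltv"]
    control_ltv = segment.loc[segment["treatment_offer"] == 0, "observed_ltv"]
    return treated_ltv.mean() - control_ltv.mean()

inc_top = incremental_ltv(top_segment)
inc_bottom = incremental_ltv(bottom_segment)

# This read is only trustworthy when the scored customers were not used to fit the uplift models.
print(f"Incremental LTV (top 20% predicted uplift):    {inc_top:8.2f}")
print(f"Incremental LTV (bottom 20% predicted uplift): {inc_bottom:8.2f}")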



## 7. Model Explainability (Feature Importance)

We inspect which signals the models use.

The point isn’t just accuracy — it’s **governance**:
stakeholders must see *why* the system is making decisions.
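
The `feature_importances_` attribute on tree ensembles is impurity-based, so a useful cross-check is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below is a plain-Python illustration on a toy model, under the assumption that the target depends on the first feature only. It is not the notebook's method; scikit-learn provides `sklearn.inspection.permutation_importance` for real models.

```python
import random

def mse(model, X, y):
    """Mean squared error of a callable model over rows X and targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)                    # break the link between feature j and y
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(model, X_perm, y) - baseline)
    return importances

# Toy setup (assumption): the target depends on feature 0 only.
X = [[float(i), float(i % 3)] for i in range(50)]
y = [3.0 * row[0] for row in X]
model = lambda row: 3.0 * row[0]               # stands in for a fitted model

importances = permutation_importance(model, X, y, n_features=2)
print(importances)
```

The first entry should be clearly positive and the second exactly zero, since the toy model never reads feature 1. On real data, computing this on the held-out split gives a view of which signals the model actually relies on for unseen customers.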


In [None]:

def plot_feature_importance(model, feature_names, top_n=15, title="Feature Importance"):
    importances = model.feature_importances_
    idx = np.argsort(importances)[::-1][:top_n]
    plt.figure(figsize=(6, max(4, top_n * 0.25)))
    plt.barh(range(len(idx)), importances[idx][::-1])
    plt.yticks(range(len(idx)), [feature_names[i] for i in idx][::-1])
    plt.title(title)
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()

plot_feature_importance(churn_model, feature_cols, top_n=15, title="Churn Model Feature Importance")
plot_feature_importance(ltv_model, feature_cols, top_n=15, title="LTV Model Feature Importance")



## 8. Framing This for Recruiters & Hiring Managers

This notebook is intentionally designed to read like **production thinking**:

- Starts from a **business problem**: who to acquire, retain, and target to maximize revenue.
- Simulates realistic **multi-source data**: channels, behavior, support, offers, outcomes.
- Builds:
  - A churn model for retention plays.
  - An LTV model for value-based prioritization.
  - An uplift-style model for targeted incentives.
- Includes:
  - Performance metrics (ROC-AUC, RMSE).
  - Sanity checks & visual diagnostics.
  - Interpretability via feature importance.
  - A simple **policy simulation** (top uplift vs bottom uplift).

I present this in my portfolio as:

> "AI-Driven Revenue Optimization in a Lakehouse Context: from synthetic data design to decision-ready models."

In a Databricks environment, this naturally maps to:
- Delta tables instead of local DataFrames
- Unity Catalog for governance
- MLflow for experiment tracking
- Jobs / DLT for orchestration

The structure here is the important part: it proves I know how to turn models into levers.
