# **ML-BASED CLV UPLIFT**

In this step, we build a machine-learning layer on top of the probabilistic CLV baseline. Instead of predicting CLV from scratch, the ML model learns to predict future realized revenue using rich customer-level features and the probabilistic CLV as a strong prior signal. This hybrid approach combines the interpretability and stability of probabilistic models with the flexibility of ML.

### Define Supervised Target

We define realized holdout revenue as the supervised target for the ML layer. To stabilize training and reduce the influence of the heavy-tailed revenue distribution, the target is log-transformed using `log1p`, which is also defined for customers with zero holdout revenue.

In [None]:
ml_df = validation_df.copy()

ml_df["target_future_revenue"] = ml_df["actual_revenue"]

# Log-transform target to reduce heavy-tail effect
ml_df["log_target_revenue"] = np.log1p(ml_df["target_future_revenue"])

ml_df[["Customer ID", "target_future_revenue", "log_target_revenue"]].head()

| | Customer ID | target_future_revenue | log_target_revenue |
|---|---|---|---|
| 0 | 18102 | 152586.31 | 11.935492 |
| 1 | 14646 | 144203.91 | 11.878991 |
| 2 | 14156 | 63560.06 | 11.059756 |
| 3 | 14911 | 95594.59 | 11.467882 |
| 4 | 13694 | 32728.72 | 10.396039 |
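
The transform and its inverse can be checked in isolation. The sketch below uses a few hand-picked revenue values (illustrative, not a sample from `validation_df`) to show that `np.log1p` maps zero revenue to zero and that `np.expm1` recovers the original scale:

```python
import numpy as np

# Illustrative revenue values, including a customer with no holdout purchases
revenue = np.array([0.0, 39.62, 5127.14, 152586.31])

log_revenue = np.log1p(revenue)    # log(1 + x), defined at x = 0
recovered = np.expm1(log_revenue)  # exp(y) - 1, the inverse of log1p

print(log_revenue.round(4))
print(np.allclose(recovered, revenue))
```

The same `expm1` call is what converts model predictions back to the revenue scale later in this section.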


### Select Features

Features are intentionally limited to probabilistic CLV signals and core RFM behavior. This allows the ML layer to act as a refinement on top of a stable CLV baseline, balancing predictive flexibility with interpretability and business consistency.

In [213]:
# Feature selection
feature_cols = [
    # Probabilistic signals
    "clv_h",
    "p_alive",
    "exp_purchases_h",
    "exp_avg_value",

    # Behavioral features (from feature engineering)
    "frequency",
    "recency",
    "T",
]

X = ml_df[feature_cols]
y = ml_df["log_target_revenue"]

### Train ML Model Using Gradient Boosting

We use Gradient Boosting because it is well suited to refining probabilistic CLV estimates: it captures non-linear interactions in skewed, tabular customer data while remaining stable and reasonably interpretable (for example, through feature importances). Its controlled bias-variance trade-off makes it effective for improving customer ranking quality rather than chasing point-level accuracy.

In [216]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)

In [217]:
# Train model
model_gb = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    random_state=19
)

In [219]:
# Fit model
model_gb.fit(X_train, y_train)
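
Once the model is fitted, its impurity-based feature importances give a quick view of how heavily it leans on the probabilistic signals compared with the raw RFM features. A minimal sketch on synthetic data follows; `X_demo`, `y_demo`, and `model_demo` are illustrative stand-ins, not the notebook's objects, and only the column names are borrowed from `feature_cols`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

feature_names = ["clv_h", "p_alive", "frequency"]
rng = np.random.default_rng(19)

# Synthetic features; only "clv_h" drives the target in this toy setup
X_demo = pd.DataFrame(rng.gamma(2.0, 1.0, size=(300, 3)), columns=feature_names)
y_demo = np.log1p(50 * X_demo["clv_h"] + rng.exponential(10, size=300))

model_demo = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.05, max_depth=3, random_state=19
).fit(X_demo, y_demo)

# Importances are normalized to sum to 1
importances = pd.Series(
    model_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

With the real model, the equivalent call would be `pd.Series(model_gb.feature_importances_, index=feature_cols)`.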

In [224]:
# Evaluate
y_pred = model_gb.predict(X_test)

mae_log = mean_absolute_error(y_test, y_pred)
print(f"ML MAE (log revenue): {mae_log:.4f}")

ML MAE (log revenue): 2.1507


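The MAE on log-transformed revenue is reported as a sanity check. An error of 2.15 in `log1p` space corresponds to a multiplicative error of roughly e^2.15 ≈ 8.6 on (1 + revenue), which is large in absolute terms. Given the heavy-tailed nature of revenue and the ranking-oriented objective of this model, point-wise error is not the primary optimization target.

ML MAE is monitored for stability, not used as a decision metric.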
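
Since the objective is ranking quality, a rank-based metric on the held-out split is a natural companion to MAE. The sketch below computes the Spearman rank correlation between actual and predicted log revenue on synthetic data; `X_demo`, `y_demo`, and `model_demo` are illustrative stand-ins, and the choice of Spearman correlation is an assumption, not a metric used in the notebook:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(19)
n = 500

# Synthetic stand-ins for the feature matrix and the log1p revenue target
X_demo = rng.gamma(shape=2.0, scale=1.0, size=(n, 3))
y_demo = np.log1p(50 * X_demo[:, 0] * X_demo[:, 1] + rng.exponential(10, size=n))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=19
)

model_demo = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.05, max_depth=3, random_state=19
)
model_demo.fit(X_tr, y_tr)

# Spearman compares the ordering of customers, not the size of the errors
rho, p_value = spearmanr(y_te, model_demo.predict(X_te))
print(f"Spearman rank correlation (test): {rho:.3f}")
```

In the notebook, the same call would be `spearmanr(y_test, y_pred)`.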

### Generate ML-Enhanced CLV Score

In [226]:
# ML uplift score on the log1p scale (X includes the training rows, so most scores are in-sample)
ml_df["ml_clv_score"] = model_gb.predict(X)

# Convert back to revenue scale
ml_df["ml_clv_estimated_revenue"] = np.expm1(ml_df["ml_clv_score"])

ml_df[
    ["Customer ID", "clv_h", "ml_clv_estimated_revenue"]
].sort_values("ml_clv_estimated_revenue", ascending=False).head(10)

| | Customer ID | clv_h | ml_clv_estimated_revenue |
|---|---|---|---|
| 5 | 17450 | 39213.909899 | 131836.60435 |
| 1 | 14646 | 105646.927861 | 96094.984174 |
| 0 | 18102 | 113836.158305 | 94810.728399 |
| 3 | 14911 | 54110.346153 | 56353.362605 |
| 6 | 12415 | 39147.222397 | 54978.7284 |
| 7 | 17511 | 33047.804393 | 47845.873145 |
| 10 | 15061 | 27282.667091 | 47845.873145 |
| 4 | 13694 | 44549.862091 | 45512.069829 |
| 15 | 14298 | 19533.477601 | 45053.600235 |
| 2 | 14156 | 69028.513215 | 44811.363211 |


The ML-enhanced CLV introduces meaningful re-ranking compared to the probabilistic baseline. The top customers largely overlap, but the ML layer adjusts relative positions and revenue magnitudes by learning non-linear corrections from realized holdout outcomes. Customer 17450, for example, moves from a baseline of about 39,200 to an ML estimate of about 131,800, while customer 14156 drops from about 69,000 to about 44,800. This suggests that the ML model is not replacing the CLV logic, but refining it where the probabilistic assumptions appear systematically biased. Among the ten customers shown, the ML estimates stay within roughly 3.4x of the baseline, suggesting uplift rather than distortion.
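
Because `model_gb.predict(X)` scores every customer, including the 80% in `X_train` that the model was fitted on, most of these scores are in-sample. One way to obtain a score for every customer without that overlap is out-of-fold prediction. The sketch below uses scikit-learn's `cross_val_predict` on synthetic data; `X_demo`, `y_demo`, `model_demo`, and the choice of five folds are illustrative assumptions rather than the notebook's setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(19)
n = 400

# Synthetic stand-ins for the feature matrix and the log1p revenue target
X_demo = rng.gamma(2.0, 1.0, size=(n, 3))
y_demo = np.log1p(50 * X_demo[:, 0] + rng.exponential(10, size=n))

model_demo = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.05, max_depth=3, random_state=19
)
cv = KFold(n_splits=5, shuffle=True, random_state=19)

# Each customer's score comes from a model fitted on the other four folds
oof_log_score = cross_val_predict(model_demo, X_demo, y_demo, cv=cv)

# Convert back to the revenue scale, as in the cell above
oof_revenue = np.expm1(oof_log_score)
print(oof_revenue[:5].round(2))
```

Applied to the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`, and the out-of-fold scores could feed the decile comparison in the next step.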

### Compare Probabilistic CLV vs ML-Uplifted CLV

In [227]:
# Decile comparison
ml_df["prob_clv_decile"] = pd.qcut(ml_df["clv_h"], 10, labels=False)
ml_df["ml_clv_decile"] = pd.qcut(ml_df["ml_clv_estimated_revenue"], 10, labels=False)

comparison = (
    ml_df
    .groupby("ml_clv_decile")
    .agg(
        customers=("Customer ID", "count"),
        avg_actual_revenue=("actual_revenue", "mean"),
        total_actual_revenue=("actual_revenue", "sum"),
    )
    .sort_index(ascending=False)
)

display(comparison)

| ml_clv_decile | customers | avg_actual_revenue | total_actual_revenue |
|---|---|---|---|
| 9 | 504 | 5127.141627 | 2584079.38 |
| 8 | 504 | 1088.768214 | 548739.18 |
| 7 | 504 | 628.962738 | 316997.22 |
| 6 | 503 | 467.502308 | 235153.661 |
| 5 | 504 | 279.571984 | 140904.28 |
| 4 | 504 | 192.718948 | 97130.35 |
| 3 | 365 | 85.354767 | 31154.49 |
| 2 | 642 | 136.65441 | 87732.131 |
| 1 | 494 | 57.469595 | 28389.98 |
| 0 | 514 | 39.624183 | 20366.83 |


**Analysis**

---

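The decile table shows that the ML-enhanced CLV concentrates actual future revenue in the top-ranked segments: the top decile alone accounts for about 63% of total holdout revenue (2.58M of 4.09M), and the top two deciles for about 77%. Average revenue declines almost monotonically across deciles, with one inversion between deciles 3 and 2, whose uneven sizes (365 and 642 customers) likely reflect tied scores at the bin boundary. This indicates that the model supports customer prioritization rather than merely producing different numeric estimates, and it supports the ML layer as a decision-quality refinement on top of the probabilistic CLV baseline. A like-for-like check would repeat the breakdown by `prob_clv_decile`, which is computed above but not tabulated.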
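
To compare the two rankings on equal terms, one option is to measure how much realized revenue the top decile of each score captures. The sketch below runs on synthetic data; `demo_df`, `top_decile_share`, and the noise levels are illustrative assumptions, and only the column names mirror the notebook. Ranking with `rank(method="first")` before `pd.qcut` breaks ties so that every decile has the same size:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(19)
n = 1000

# Synthetic heavy-tailed revenue and two noisy scores derived from it
actual = rng.lognormal(mean=5.0, sigma=1.5, size=n)
demo_df = pd.DataFrame({
    "actual_revenue": actual,
    "clv_h": actual * rng.lognormal(0.0, 1.0, size=n),
    "ml_clv_estimated_revenue": actual * rng.lognormal(0.0, 0.5, size=n),
})

def top_decile_share(df, score_col, revenue_col="actual_revenue"):
    """Share of total realized revenue captured by the top 10% of customers by score."""
    # Ranking first makes every value unique, so qcut yields equal-sized deciles
    decile = pd.qcut(df[score_col].rank(method="first"), 10, labels=False)
    return df.loc[decile == 9, revenue_col].sum() / df[revenue_col].sum()

score_cols = ["clv_h", "ml_clv_estimated_revenue"]
shares = {col: top_decile_share(demo_df, col) for col in score_cols}
print(pd.Series(shares).round(3))
```

On the notebook's data, calling `top_decile_share(ml_df, "clv_h")` and `top_decile_share(ml_df, "ml_clv_estimated_revenue")` would put the baseline and the ML layer side by side on the same metric.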