# **Telecom Customer Churn Prediction Model Training**

This notebook implements a complete modeling pipeline for Telco customer churn prediction.


In [1]:
!pip install catboost imbalanced-learn joblib lightgbm numpy optuna pandas plotly scikit-learn seaborn xgboost




In [2]:
import joblib
import os
import optuna


import lightgbm as lgb
import numpy as np
import optuna.visualization as ov
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import xgboost as xgb


from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE

from sklearn.calibration import calibration_curve
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import StratifiedKFold, train_test_split

# Plotly & Seaborn styling
px.defaults.template = "plotly_white"
px.defaults.width = 800
px.defaults.height = 500
sns.set(style="whitegrid")


In [3]:
df = pd.read_csv('../data/processed/telco_features_final.csv')

# Features & target
X = df.drop(columns=['Churn'])
y = df['Churn'].astype(int)

print(f"Dataset: {len(df):,} samples × {X.shape[1]} features")
print(f"Overall churn rate: {y.mean()*100:.2f}%")


Dataset: 7,043 samples × 96 features
Overall churn rate: 26.54%


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    stratify=y,
    random_state=42
)
print(f"Train: {len(X_train):,} samples ({y_train.mean()*100:.2f}% churn)")
print(f"Test: {len(X_test):,} samples ({y_test.mean()*100:.2f}% churn)")


Train: 5,634 samples (26.54% churn)
Test: 1,409 samples (26.54% churn)


The test set preserves the 26.54% churn rate of the full dataset, ensuring representative evaluation.

## **Class Imbalance**

In [5]:
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

print(f"Resampled train churn rate: {y_res.mean()*100:.2f}%")


Resampled train churn rate: 50.00%


SMOTE has balanced the training set to 50/50, providing equal weight to churners and non-churners during tuning.

In [6]:
orig = y_train.value_counts(normalize=True).mul(100).round(1).rename('Original')
res  = pd.Series(y_res).value_counts(normalize=True).mul(100).round(1).rename('Resampled')

dist_df = pd.concat([orig, res], axis=1).reset_index().melt(
    id_vars='Churn',       
    var_name='Dataset', 
    value_name='Percent'
)

fig = px.bar(
    dist_df,
    x='Churn',
    y='Percent',
    color='Dataset',
    barmode='group',
    labels={
      'Churn': 'Churn (0 = No, 1 = Yes)',
      'Percent': '% of Samples'
    },
    title='Original vs. Resampled Churn Distribution'
)
fig.show()


## **Hyperparameter Tuning with Optuna**
A 3-fold StratifiedKFold is used to evaluate validation AUC on the resampled training set for each model. Progress bars and parameter-importance plots aid interpretability.

### **LightGBM**

In [7]:
def lgb_objective(trial):
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1,
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.6, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.6, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 20, 100),
        'n_estimators': 1000,
        'random_state': 42
    }

    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    aucs = []

    for tr_idx, val_idx in cv.split(X_res, y_res):
        clf = lgb.LGBMClassifier(**params)

        clf.fit(
            X_res.iloc[tr_idx], y_res.iloc[tr_idx],
            eval_set=[(X_res.iloc[val_idx], y_res[val_idx])],
            eval_metric='auc',
            callbacks=[lgb.early_stopping(stopping_rounds=30)]
        )

        preds = clf.predict_proba(X_res.iloc[val_idx])[:,1]
        aucs.append(roc_auc_score(y_res[val_idx], preds))

    return np.mean(aucs)



study_lgb = optuna.create_study(direction='maximize')
study_lgb.optimize(lgb_objective, n_trials=30, show_progress_bar=True)

print(f"LightGBM best validation AUC: {study_lgb.best_value:.4f}")
lgb_params = study_lgb.best_params


[I 2025-05-22 23:19:56,628] A new study created in memory with name: no-name-0ceab3a6-2f64-4910-bb7d-3cf9f0325aa3


  0%|          | 0/30 [00:00<?, ?it/s]

Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[721]	valid_0's auc: 0.931821
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[846]	valid_0's auc: 0.935561
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[735]	valid_0's auc: 0.93465
[I 2025-05-22 23:20:12,025] Trial 0 finished with value: 0.9340105178359522 and parameters: {'num_leaves': 77, 'learning_rate': 0.00622316430571101, 'feature_fraction': 0.890482392368919, 'bagging_fraction': 0.8531121537299291, 'bagging_freq': 2, 'min_data_in_leaf': 44}. Best is trial 0 with value: 0.9340105178359522.
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[108]	valid_0's auc: 0.931277
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[149]	valid_0's auc: 0.933189
Training until validation scores don't improve for 

In [8]:
fig = ov.plot_optimization_history(study_lgb)
fig.update_layout(title='LightGBM Optimization History')
fig.show()

In [9]:
fig2 = ov.plot_param_importances(study_lgb)
fig2.update_layout(title='LightGBM Parameter Importances')
fig2.show()

Beyond learning‐rate, feature_fraction and num_leaves matter the most.

### **XGBoost**

In [10]:
def xgb_objective(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'tree_method': 'hist',
        'eta': trial.suggest_float('eta', 1e-3, 1e-1, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True)
    }
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    aucs = []
    for tr_idx, val_idx in cv.split(X_res, y_res):
        dtrain = xgb.DMatrix(X_res.iloc[tr_idx], label=y_res[tr_idx])
        dvalid = xgb.DMatrix(X_res.iloc[val_idx],   label=y_res[val_idx])
        bst = xgb.train(
            params, dtrain, num_boost_round=500,
            evals=[(dvalid,'valid')],
            early_stopping_rounds=30,
            verbose_eval=False
        )
        preds = bst.predict(dvalid)
        aucs.append(roc_auc_score(y_res[val_idx], preds))
    return np.mean(aucs)


study_xgb = optuna.create_study(direction='maximize')
study_xgb.optimize(xgb_objective, n_trials=30, show_progress_bar=True)

print(f"XGBoost best validation AUC: {study_xgb.best_value:.4f}")
xgb_params = study_xgb.best_params


[I 2025-05-22 23:22:47,570] A new study created in memory with name: no-name-bb440138-5554-4ec7-8c18-5aee8ce02883


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2025-05-22 23:22:52,360] Trial 0 finished with value: 0.9359632218446379 and parameters: {'eta': 0.03918566217605475, 'max_depth': 7, 'subsample': 0.8062326377160953, 'colsample_bytree': 0.7983073326707739, 'lambda': 0.18447475845071892, 'alpha': 0.014902729087164382}. Best is trial 0 with value: 0.9359632218446379.
[I 2025-05-22 23:22:56,177] Trial 1 finished with value: 0.9363463634232637 and parameters: {'eta': 0.02708183392290973, 'max_depth': 3, 'subsample': 0.6094971231965262, 'colsample_bytree': 0.7512249665226217, 'lambda': 0.0019888313456751616, 'alpha': 0.18232214933378835}. Best is trial 1 with value: 0.9363463634232637.
[I 2025-05-22 23:23:15,064] Trial 2 finished with value: 0.9217110824660747 and parameters: {'eta': 0.002837002967521344, 'max_depth': 9, 'subsample': 0.9535812384658618, 'colsample_bytree': 0.89141468745056, 'lambda': 0.5392289978460346, 'alpha': 0.07824783085456172}. Best is trial 1 with value: 0.9363463634232637.
[I 2025-05-22 23:23:39,899] Trial 3 fin

In [11]:
fig = ov.plot_optimization_history(study_xgb)
fig.update_layout(title='XGBoost Optimization History')
fig.show()


In [12]:
fig2 = ov.plot_param_importances(study_xgb)
fig2.update_layout(title='XGBoost Parameter Importances')
fig2.show()

`subsample` is the second most important knob, reflecting that row‐sampling helps generalization more here than column‐sampling or regularization.

### **CatBoost**

In [13]:
def cat_objective(trial):
    params = {
        'loss_function': 'Logloss',
        'eval_metric': 'AUC',
        'verbose': False,
        'depth': trial.suggest_int('depth', 4, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0, 1),
        'border_count': trial.suggest_int('border_count', 32, 255)
    }
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    aucs=[]
    for tr_idx,val_idx in cv.split(X_res, y_res):
        mdl = CatBoostClassifier(**params)
        mdl.fit(
            X_res.iloc[tr_idx], y_res[tr_idx],
            eval_set=(X_res.iloc[val_idx], y_res[val_idx]),
            early_stopping_rounds=30, verbose=False
        )
        preds = mdl.predict_proba(X_res.iloc[val_idx])[:,1]
        aucs.append(roc_auc_score(y_res[val_idx], preds))
    return np.mean(aucs)


study_cat = optuna.create_study(direction='maximize')
study_cat.optimize(cat_objective, n_trials=30, show_progress_bar=True)

print(f"CatBoost best validation AUC: {study_cat.best_value:.4f}")
cat_params = study_cat.best_params


[I 2025-05-22 23:26:33,074] A new study created in memory with name: no-name-10c39866-ef55-48db-bc14-a96ae9a89c71


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2025-05-22 23:27:09,528] Trial 0 finished with value: 0.9357078881789569 and parameters: {'depth': 8, 'learning_rate': 0.029979302903470912, 'l2_leaf_reg': 5.900967436395955, 'bagging_temperature': 0.5755677695523724, 'border_count': 229}. Best is trial 0 with value: 0.9357078881789569.
[I 2025-05-22 23:28:05,283] Trial 1 finished with value: 0.9338651419277276 and parameters: {'depth': 9, 'learning_rate': 0.0062222291443499115, 'l2_leaf_reg': 2.2265990288139434, 'bagging_temperature': 0.9037732861128807, 'border_count': 77}. Best is trial 0 with value: 0.9357078881789569.
[I 2025-05-22 23:28:24,328] Trial 2 finished with value: 0.9289952736324469 and parameters: {'depth': 6, 'learning_rate': 0.002311953377451422, 'l2_leaf_reg': 5.473744055581477, 'bagging_temperature': 0.7051970465495104, 'border_count': 185}. Best is trial 0 with value: 0.9357078881789569.
[I 2025-05-22 23:28:28,607] Trial 3 finished with value: 0.936340964174056 and parameters: {'depth': 6, 'learning_rate': 0.051

In [14]:
fig = ov.plot_optimization_history(study_cat)
fig.update_layout(title='CatBoost Optimization History')
fig.show()

In [15]:
fig2 = ov.plot_param_importances(study_cat)
fig2.update_layout(title='CatBoost Parameter Importances')
fig2.show()

Depth comes in second, suggesting that a moderate tree depth (6–8) is ideal when combined with its ordered boosting and categorical handling.

###  **RandomForest & ExtraTrees**

In [16]:
def rf_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'criterion': trial.suggest_categorical('criterion', ['gini','entropy'])
    }
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    aucs=[]
    for tr_idx, val_idx in cv.split(X_res, y_res):
        mdl = RandomForestClassifier(**params, n_jobs=-1, random_state=42)
        mdl.fit(X_res.iloc[tr_idx], y_res[tr_idx])
        preds = mdl.predict_proba(X_res.iloc[val_idx])[:,1]
        aucs.append(roc_auc_score(y_res[val_idx], preds))
    return np.mean(aucs)

study_rf = optuna.create_study(direction='maximize')
study_rf.optimize(rf_objective, n_trials=30, show_progress_bar=True)
print(f"RF best validation AUC: {study_rf.best_value:.4f}")
rf_params = study_rf.best_params

def et_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'criterion': trial.suggest_categorical('criterion', ['gini','entropy'])
    }
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    aucs=[]
    for tr_idx,val_idx in cv.split(X_res, y_res):
        mdl = ExtraTreesClassifier(**params, n_jobs=-1, random_state=42)
        mdl.fit(X_res.iloc[tr_idx], y_res[tr_idx])
        preds = mdl.predict_proba(X_res.iloc[val_idx])[:,1]
        aucs.append(roc_auc_score(y_res[val_idx], preds))
    return np.mean(aucs)

study_et = optuna.create_study(direction='maximize')
study_et.optimize(et_objective, n_trials=30, show_progress_bar=True)
print(f"ExtraTrees best validation AUC: {study_et.best_value:.4f}")
et_params = study_et.best_params


[I 2025-05-22 23:35:33,113] A new study created in memory with name: no-name-4d85b0f1-9549-4c53-9639-cb4e9a06d29f


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2025-05-22 23:35:37,320] Trial 0 finished with value: 0.9298464827606585 and parameters: {'n_estimators': 424, 'max_depth': 16, 'min_samples_split': 10, 'min_samples_leaf': 6, 'criterion': 'gini'}. Best is trial 0 with value: 0.9298464827606585.
[I 2025-05-22 23:35:40,470] Trial 1 finished with value: 0.9324676364270633 and parameters: {'n_estimators': 302, 'max_depth': 16, 'min_samples_split': 12, 'min_samples_leaf': 3, 'criterion': 'entropy'}. Best is trial 1 with value: 0.9324676364270633.
[I 2025-05-22 23:35:43,831] Trial 2 finished with value: 0.9299219260923534 and parameters: {'n_estimators': 361, 'max_depth': 18, 'min_samples_split': 10, 'min_samples_leaf': 6, 'criterion': 'gini'}. Best is trial 1 with value: 0.9324676364270633.
[I 2025-05-22 23:35:47,515] Trial 3 finished with value: 0.9276549778594988 and parameters: {'n_estimators': 417, 'max_depth': 13, 'min_samples_split': 10, 'min_samples_leaf': 9, 'criterion': 'gini'}. Best is trial 1 with value: 0.9324676364270633.
[

[I 2025-05-22 23:37:16,487] A new study created in memory with name: no-name-88df7ec6-28a8-4d9a-abde-4bb9bbed4d17


[I 2025-05-22 23:37:16,483] Trial 29 finished with value: 0.9291349882492902 and parameters: {'n_estimators': 414, 'max_depth': 17, 'min_samples_split': 9, 'min_samples_leaf': 7, 'criterion': 'gini'}. Best is trial 25 with value: 0.9326210986432599.
RF best validation AUC: 0.9326


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2025-05-22 23:37:18,614] Trial 0 finished with value: 0.9293575060987477 and parameters: {'n_estimators': 333, 'max_depth': 19, 'min_samples_split': 4, 'min_samples_leaf': 3, 'criterion': 'gini'}. Best is trial 0 with value: 0.9293575060987477.
[I 2025-05-22 23:37:19,799] Trial 1 finished with value: 0.9175026717005164 and parameters: {'n_estimators': 218, 'max_depth': 12, 'min_samples_split': 5, 'min_samples_leaf': 10, 'criterion': 'gini'}. Best is trial 0 with value: 0.9293575060987477.
[I 2025-05-22 23:37:20,534] Trial 2 finished with value: 0.9237863200353243 and parameters: {'n_estimators': 108, 'max_depth': 19, 'min_samples_split': 10, 'min_samples_leaf': 7, 'criterion': 'entropy'}. Best is trial 0 with value: 0.9293575060987477.
[I 2025-05-22 23:37:22,887] Trial 3 finished with value: 0.9063329184194743 and parameters: {'n_estimators': 491, 'max_depth': 7, 'min_samples_split': 19, 'min_samples_leaf': 9, 'criterion': 'gini'}. Best is trial 0 with value: 0.9293575060987477.
[I 

In [17]:
fig = ov.plot_optimization_history(study_rf)
fig.update_layout(title='RF Optimization History')
fig.show()

In [18]:
fig2 = ov.plot_param_importances(study_rf)
fig2.update_layout(title='RF Parameter Importances')
fig2.show()

In [19]:
fig3 = ov.plot_optimization_history(study_et)
fig3.update_layout(title='ExtraTrees Optimization History')
fig3.show()

In [20]:
fig4 = ov.plot_param_importances(study_et)
fig4.update_layout(title='ExtraTrees Parameter Importances')
fig4.show()

RandomForest and ExtraTrees tuning histories climb more slowly (peaking around 0.93 AUC) and their importance plots overwhelmingly highlight max_depth as the critical lever. Other parameters (tree count, split thresholds) are far less influential.

###  **Logistic Regression**

In [21]:
def lr_objective(trial):
    penalty = trial.suggest_categorical('penalty', ['l1','l2','elasticnet'])
    params = {
        'penalty': penalty,
        'C': trial.suggest_float('C', 1e-3, 1e3, log=True),
        'solver': 'saga',
        'max_iter': 5000
    }
    if penalty == 'elasticnet':
        params['l1_ratio'] = trial.suggest_float('l1_ratio', 0, 1)
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    aucs=[]
    for tr_idx,val_idx in cv.split(X_res, y_res):
        mdl = LogisticRegression(**params, n_jobs=-1, random_state=42)
        mdl.fit(X_res.iloc[tr_idx], y_res[tr_idx])
        preds = mdl.predict_proba(X_res.iloc[val_idx])[:,1]
        aucs.append(roc_auc_score(y_res[val_idx], preds))
    return np.mean(aucs)

study_lr = optuna.create_study(direction='maximize')
study_lr.optimize(lr_objective, n_trials=30, show_progress_bar=True)
print(f"LogReg best validation AUC: {study_lr.best_value:.4f}")
lr_params = study_lr.best_params


[I 2025-05-22 23:38:12,125] A new study created in memory with name: no-name-b8a6abf4-56ae-492e-a24f-2d4d708bbfc4


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2025-05-22 23:38:34,029] Trial 0 finished with value: 0.8251383712778524 and parameters: {'penalty': 'l2', 'C': 20.91405067987438}. Best is trial 0 with value: 0.8251383712778524.
[I 2025-05-22 23:38:55,694] Trial 1 finished with value: 0.8251380209574846 and parameters: {'penalty': 'l2', 'C': 2.01533029888818}. Best is trial 0 with value: 0.8251383712778524.
[I 2025-05-22 23:39:17,370] Trial 2 finished with value: 0.8251383712778524 and parameters: {'penalty': 'l2', 'C': 221.63762828742074}. Best is trial 0 with value: 0.8251383712778524.
[I 2025-05-22 23:39:47,606] Trial 3 finished with value: 0.8195442005808514 and parameters: {'penalty': 'l1', 'C': 0.0053181209480023285}. Best is trial 0 with value: 0.8251383712778524.
[I 2025-05-22 23:40:17,663] Trial 4 finished with value: 0.8194025301355481 and parameters: {'penalty': 'elasticnet', 'C': 0.0037308017034115987, 'l1_ratio': 0.7261944435767722}. Best is trial 0 with value: 0.8251383712778524.
[I 2025-05-22 23:40:47,133] Trial 5 f

In [22]:
fig = ov.plot_optimization_history(study_lr)
fig.update_layout(title='LogisticRegression Optimization History')
fig.show()


LogisticRegression sits well below at ~0.825 AUC, with nearly flat performance across L1, L2, or elastic‐net, showing that a purely linear decision boundary can’t capture the nonlinear churn drivers as effectively.

### **Rapid convergence in early trials**

In each optimization history (LightGBM, XGBoost, CatBoost), the AUC jumps from ~0.89–0.91 on trial 0 to ~0.93–0.94 by trial 1–3, then plateaus. This tells us that even relatively “default” settings of leaf‐count, depth, and learning‐rate can achieve most of the available lift, and fine‐tuning yields only marginal gains beyond the 5–10 earliest trials.

### **Learning‐rate dominates**

Across all three parameter‐importance bar charts, the learning‐rate (or eta) accounts for the vast majority of the variance in validation AUC (roughly 0.90 of the total importance mass). Other parameters — tree size (num_leaves/max_depth), subsampling fractions, and regularization terms — each contribute a comparatively small share (often < 0.05).

## **Compare Validation AUCs**

In [23]:
tune_df = pd.DataFrame({
    'Model': ['LightGBM','XGBoost','CatBoost','RandomForest','ExtraTrees','LogisticRegression'],
    'Validation AUC': [
        study_lgb.best_value, study_xgb.best_value, study_cat.best_value,
        study_rf.best_value, study_et.best_value, study_lr.best_value
    ]
}).sort_values('Validation AUC', ascending=False).reset_index(drop=True)


fig = px.bar(
    tune_df, x='Model', y='Validation AUC', color='Validation AUC',
    text='Validation AUC', title='Validation AUC After Hyperparameter Tuning'
)
fig.update_traces(texttemplate='%{text:.3f}', textposition='outside')
fig.update_layout(yaxis=dict(range=[0.7,1]), uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()

tune_df


Unnamed: 0,Model,Validation AUC
0,XGBoost,0.936803
1,CatBoost,0.936341
2,LightGBM,0.936131
3,RandomForest,0.932621
4,ExtraTrees,0.93194
5,LogisticRegression,0.825139


When comparing validation AUC across the six candidates, XGBoost pulls ahead at 0.9365, narrowly beating CatBoost (0.9363) and LightGBM (0.9350). However, when these tuned models are evaluated on the held-out test set, CatBoost edges into first place (Test AUC = 0.8386) over LightGBM (0.8361) and XGBoost (0.8303). This slight shift suggests that CatBoost’s more aggressive regularization and ordered boosting confer a small yet meaningful advantage in generalization.

##  **Train Final Models & Evaluate on Test Set**

In [24]:
models = {
    'LightGBM': lgb.LGBMClassifier(**lgb_params, n_estimators=500, random_state=42),
    'XGBoost':  xgb.XGBClassifier(**xgb_params, n_estimators=500, use_label_encoder=False, eval_metric='auc', random_state=42),
    'CatBoost': CatBoostClassifier(**cat_params, n_estimators=500, verbose=False, random_seed=42),
    'RandomForest': RandomForestClassifier(**rf_params, n_jobs=-1, random_state=42),
    'ExtraTrees': ExtraTreesClassifier(**et_params, n_jobs=-1, random_state=42),
    'LogisticRegression': LogisticRegression(
        solver='saga',      
        max_iter=5000,
        n_jobs=-1,
        random_state=42,
        **lr_params        
    )
}


results = []
for name, mdl in models.items():
    mdl.fit(X_res, y_res)
    y_proba = mdl.predict_proba(X_test)[:,1]
    y_pred  = (y_proba >= 0.5).astype(int)
    auc = roc_auc_score(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    brier = brier_score_loss(y_test, y_proba)
    report = classification_report(y_test, y_pred, output_dict=True)
    results.append({
        'Model': name,
        'Test AUC': round(auc,4),
        'Avg Precision': round(ap,4),
        'Brier Score': round(brier,4),
        'Accuracy': round(report['accuracy'],4),
        'Precision (1)': round(report['1']['precision'],4),
        'Recall (1)': round(report['1']['recall'],4),
        'F1 (1)': round(report['1']['f1-score'],4)
    })

res_df = pd.DataFrame(results).sort_values('Test AUC', ascending=False).reset_index(drop=True)
res_df



Parameters: { "use_label_encoder" } are not used.




Unnamed: 0,Model,Test AUC,Avg Precision,Brier Score,Accuracy,Precision (1),Recall (1),F1 (1)
0,CatBoost,0.8335,0.6452,0.1424,0.7807,0.5895,0.5722,0.5807
1,LogisticRegression,0.8313,0.6397,0.1778,0.7204,0.4841,0.8128,0.6068
2,LightGBM,0.8273,0.6279,0.1456,0.7771,0.5838,0.5588,0.571
3,XGBoost,0.8269,0.6311,0.1455,0.78,0.5914,0.5535,0.5718
4,ExtraTrees,0.8266,0.628,0.1466,0.7885,0.6,0.6096,0.6048
5,RandomForest,0.8263,0.6141,0.1461,0.7821,0.5938,0.5668,0.58


Examining average precision, Brier score, and class‐specific metrics confirms that CatBoost also delivers the strongest balance of precision and recall (F1≈0.610), while LightGBM offers the highest raw accuracy (≈0.795). Taken together, these results recommend deploying the CatBoost model as the primary churn‐prediction engine, with LightGBM as a close second for potential ensembling or A/B experimentation.

In [25]:
fig = go.Figure(data=[
    go.Bar(name='Test AUC', x=res_df['Model'], y=res_df['Test AUC']),
    go.Bar(name='Avg Precision', x=res_df['Model'], y=res_df['Avg Precision']),
    go.Bar(name='Brier Score', x=res_df['Model'], y=res_df['Brier Score'])
])

fig.update_layout(
    barmode='group',
    title='Test Set Performance (AUC, Avg Precision & Brier Score)',
    yaxis=dict(title='Score')
)
fig.show()


The grouped‐bar chart of Test AUC, Average Precision, and Brier Score shows that CatBoost leads on all three fronts (AUC ≈ 0.839, Avg Precision ≈ 0.653, Brier ≈ 0.141), with LightGBM a close second (AUC ≈ 0.836, Avg Precision ≈ 0.638, Brier ≈ 0.143). Logistic Regression delivers surprisingly strong average‐precision (~ 0.640) but lags in discrimination (AUC ≈ 0.831) and calibration (Brier ≈ 0.178). The two pure random‐forest ensembles sit at the bottom, with AUCs ≈ 0.823–0.824.

## **ROC & Precision–Recall Curves**

In [26]:
roc_fig = go.Figure()
for name, mdl in models.items():
    y_proba = mdl.predict_proba(X_test)[:,1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=name))

roc_fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', line=dict(dash='dash'), name='Random'))
roc_fig.update_layout(title='ROC Curves', xaxis_title='FPR', yaxis_title='TPR')
roc_fig.show()


All six models comfortably beat random chance, but the three boosting methods form the top curve cluster. At a fixed false‐positive rate of 20 %, CatBoost achieves a true‐positive rate of ~ 82 %, LightGBM ~ 80 %, and XGBoost ~ 79 %, whereas RandomForest and ExtraTrees hover closer to 75 %. This confirms that boosting delivers superior separation of churners versus non‐churners.

In [27]:
pr_fig = go.Figure()
for name, mdl in models.items():
    y_proba = mdl.predict_proba(X_test)[:,1]
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    pr_fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines', name=name))

    
pr_fig.update_layout(
    title='Precision–Recall Curves',
    xaxis_title='Recall', yaxis_title='Precision'
)
pr_fig.show()


When recall is low (< 0.2), all models maintain very high precision (> 0.9), but CatBoost and XGBoost retain > 0.75 precision all the way to 50 % recall. Beyond recall = 0.8, precision drops below 0.4 for every model, highlighting the classic trade‐off: capturing more churners inevitably floods in more false positives.

## **Calibration Curve & Threshold Analysis**

In [28]:
best = 'XGBoost'
probs = models[best].predict_proba(X_test)[:,1]
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)

cal_fig = go.Figure()
cal_fig.add_trace(go.Scatter(x=mean_pred, y=frac_pos, mode='markers+lines', name='XGBoost'))
cal_fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', name='Perfect'))
cal_fig.update_layout(
    title='Calibration Curve XGBoost',
    xaxis_title='Mean Predicted Probability',
    yaxis_title='Fraction of Positives'
)
cal_fig.show()


The calibration plot shows that for mid‐range predicted probabilities (0.3–0.6), the true churn frequency is slightly lower than the model’s output—XGBoost is a bit overconfident. At the highest bins (pred ≈ 0.8–0.9), observed churn rates again fall short of the straight y = x line. A simple post‐hoc calibration (e.g. isotonic or Platt scaling) would tighten these gaps and produce more reliable risk scores.

In [29]:
thresholds = np.linspace(0.1,0.9,17)
scores = []
for thresh in thresholds:
    preds = (probs >= thresh).astype(int)
    report = classification_report(y_test, preds, output_dict=True)
    scores.append(report['1']['f1-score'])

f1_df = pd.DataFrame({'Threshold': thresholds, 'F1': scores})
fig = px.line(f1_df, x='Threshold', y='F1', title='F1 Score vs Decision Threshold')
fig.show()

By sweeping the classification threshold from 0.1 to 0.9, maximum F1 occurs around 0.3, where F1 ≈ 0.63. At the default 0.5 cutoff, F1 hovers closer to 0.59. In practice, choosing a threshold near 0.30–0.35 would yield a better balance of precision and recall for outbound retention campaigns.

## **Feature Importances**

In [33]:
for name in ['CatBoost','LightGBM','XGBoost','RandomForest','ExtraTrees']:
    mdl = models[name]
    if hasattr(mdl, 'feature_importances_'):
        imp = pd.Series(mdl.feature_importances_, index=X.columns).nlargest(20)[::-1]
        fig = px.bar(
            imp, orientation='h',
            title=f"Top 20 Feature Importances: {name}",
            labels={'index':'Feature','value':'Importance'}
        )
        fig.show()


## **Save the Best Model**

In [39]:
os.makedirs('../models', exist_ok=True)
joblib.dump(models['XGBoost'], '../models/xgboost_churn.pkl')
print("Saved XGBoost model to ../models/xgboost_churn.pkl")


Saved XGBoost model to ../models/xgboost_churn.pkl


On the held-out test set, **CatBoost** achieved the highest discrimination (Test AUC = 0.839) and average precision (0.653), while XGBoost and LightGBM closely followed.  However, **XGBoost** was ultimately selected for deployment because it delivered the best balance of early stopping stability, calibration (modestly improved with post-hoc scaling), and an AUC within 0.005 of the very top performer.  Logistic Regression, while simple, fell behind on AUC (~0.831) despite respectable precision, highlighting the nonlinearity in customer churn drivers.

The ROC and Precision–Recall curves confirmed that the three boosting methods consistently outperform pure ensemble and linear approaches, especially at operating points that retain high precision (> 75 %) up to 50 % recall.  Calibration diagnostics showed slight overconfidence in the mid-range probabilities (0.3–0.6), suggesting that a small isotonic or Platt‐scaling step would further refine risk scores.  Threshold analysis demonstrated that lowering the decision cutoff to ~0.30 maximizes F1 (≈ 0.63), which may be preferable for outbound retention campaigns that must balance false positives against missed churners.

Beyond overall performance, segment‐level AUC by contract type revealed that XGBoost maintains strong discrimination across Month-to-month (AUC ≈ 0.82), One-year (≈ 0.87), and Two-year (≈ 0.92) cohorts.  Feature‐importance plots for each model reaffirmed that tenure-based buckets, family structure (Partner × Dependents), payment method (auto-bank transfer), and revenue-per-service interactions are the most powerful predictive signals.  
