# Telco Churn Prediction: Model Training

## 1. Introduction

### 1.1 Notebook Goal

The goal of this notebook is to establish and evaluate **baseline machine learning models** for Telco Customer Churn prediction, using the final feature set selected in the feature engineering notebook. We compare four models: logistic regression, random forest, XGBoost and LightGBM (Light Gradient Boosting Machine). Because the churn classes are imbalanced, accuracy alone is misleading, so we evaluate performance primarily with **F1-Score and ROC AUC**.
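To see why accuracy can mislead on an imbalanced target, consider a minimal sketch with made-up labels. The 27% churn rate below is an assumption for illustration only, not a figure taken from this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 27 churners (1) and 73 non-churners (0)
y_true = np.array([1] * 27 + [0] * 73)

# A trivial "model" that always predicts the majority class (no churn)
y_pred = np.zeros_like(y_true)

majority_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions are made
majority_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {majority_accuracy:.2f}")  # Accuracy: 0.73
print(f"F1: {majority_f1:.2f}")              # F1: 0.00
```

The trivial model looks respectable on accuracy while finding no churners at all, which is why F1 and ROC AUC are better guides here.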

## 2. Initial Setup

### 2.1 Library Imports and Configurations

In [1]:
# Standard imports
import warnings

import joblib
import numpy as np
import pandas as pd
from IPython.display import display

# Modelling imports
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

# Optional: silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Seed used by every model and CV splitter for reproducibility
RANDOM_STATE = 42

# Path to the engineered CSV (not used below; the training split is loaded from parquet in 2.2)
DATA_PATH = "../data/processed/churn_final_fe.csv"

pd.set_option('display.max_columns', None)  # Show all columns in DataFrame display

### 2.2 Load Engineered Data

In [2]:
X_train = pd.read_parquet('../data/processed/X_train.parquet')
y_train = pd.read_parquet('../data/processed/y_train.parquet').squeeze()

In [3]:
X_train.head()

Unnamed: 0,SeniorCitizen,tenure,TotalCharges,Dependents_binary,MultipleLines_binary,PhoneService_binary,OnlineSecurity_binary,OnlineBackup_binary,DeviceProtection_binary,TechSupport_binary,StreamingTV_binary,StreamingMovies_binary,PaperlessBilling_binary,gender_binary,is_month_to_month,is_two_year,internet_service_fiber,internet_service_none,is_electronic_check,is_mailed_check,is_bank_transfer,is_credit_card,services_count,is_high_value
3738,0,35,-0.255041,0,0,0,0,0,1,0,1,1,0,1,1,0,0,0,1,0,0,0,3,0
3151,0,15,-0.497736,1,0,1,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,2,0
4860,0,13,-0.745327,1,0,0,1,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,3,0
3867,0,26,-0.165018,0,0,1,0,1,1,0,1,1,1,0,0,1,0,0,0,0,0,1,5,0
3810,0,1,-0.986125,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0


## 3. Create Models and Evaluate

### 3.1 Logistic Regression Model

Logistic regression is a linear classifier: it computes a weighted sum of the input features and passes it through the sigmoid (logistic) function, which maps the result to a probability between 0 and 1. We use the 'liblinear' solver because it supports both L1 (LASSO, which penalises the absolute value of the coefficients) and L2 (Ridge, which penalises the square of the coefficients) regularisation. L1 can shrink coefficients to exactly zero, which is useful when there are many features or when feature selection is important. The baseline below keeps the default L2 penalty with C=1.0.
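The following sketch illustrates both points on synthetic data: the sigmoid mapping behind `predict_proba`, and the difference between L1 and L2 penalties under 'liblinear'. The dataset, the value `C=0.05` and all `*_demo` names are assumptions chosen for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which only 4 carry signal
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=42
)

# Same solver and regularisation strength, different penalty
l1_model = LogisticRegression(solver="liblinear", penalty="l1", C=0.05, random_state=42)
l2_model = LogisticRegression(solver="liblinear", penalty="l2", C=0.05, random_state=42)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

# L1 tends to set uninformative coefficients to exactly zero; L2 only shrinks them
l1_zero_coefs = int(np.sum(l1_model.coef_ == 0))
l2_zero_coefs = int(np.sum(l2_model.coef_ == 0))
print(f"Zero coefficients with L1: {l1_zero_coefs}, with L2: {l2_zero_coefs}")


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# The predicted probability is the sigmoid of the linear score
linear_score = X_demo @ l2_model.coef_.ravel() + l2_model.intercept_[0]
manual_proba = sigmoid(linear_score)
sklearn_proba = l2_model.predict_proba(X_demo)[:, 1]
print("Manual sigmoid matches predict_proba:", np.allclose(manual_proba, sklearn_proba))
```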

In [4]:
log_reg = LogisticRegression(solver='liblinear', random_state=RANDOM_STATE)
# Train model
log_reg.fit(X_train, y_train)

Parameter,Value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,42
solver,'liblinear'
max_iter,100


#### 3.1.1 Logistic Regression Evaluation using CV

We use cross-validation (CV) to get a less biased estimate of the model's performance across several metrics (ROC AUC, accuracy, precision, recall and F1). This lets us compare models and tune hyperparameters without touching the test dataset, which is reserved for the final evaluation.

We can choose between KFold and StratifiedKFold. KFold splits the dataset into K folds without considering the target variable, so some folds may contain a higher proportion of churners than others. StratifiedKFold preserves the proportion of the target variable in every fold, which is why we use it here.
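The difference can be checked directly by comparing the churn rate in each validation fold. This is a small sketch on randomly generated labels; the 27% positive rate and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced target with roughly 27% positives
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.27).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split


def fold_positive_rates(splitter):
    """Return the share of positives in each validation fold."""
    return np.array(
        [y_demo[val_idx].mean() for _, val_idx in splitter.split(X_demo, y_demo)]
    )


kfold_rates = fold_positive_rates(KFold(n_splits=5, shuffle=True, random_state=42))
stratified_rates = fold_positive_rates(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

# Spread between the fold with the most and the fewest positives
kfold_spread = kfold_rates.max() - kfold_rates.min()
stratified_spread = stratified_rates.max() - stratified_rates.min()

print("KFold positive rate per fold:          ", np.round(kfold_rates, 3))
print("StratifiedKFold positive rate per fold:", np.round(stratified_rates, 3))
```

With stratification the per-fold positive counts differ by at most one sample, so metrics such as recall and F1 are computed on comparable folds.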

In [5]:
# 5-fold is a standard choice
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Define all metrics you want to assess
scoring = ['roc_auc', 'accuracy', 'precision', 'recall', 'f1']

In [6]:
cv_results = cross_validate(
    estimator=log_reg,
    X=X_train,
    y=y_train,
    cv=cv_strategy,
    scoring=scoring,
    return_train_score=False,
    n_jobs=-1
)

# Convert to DataFrame for easy viewing
results_df = pd.DataFrame(cv_results)

# Calculate the mean performance for reporting
print("Baseline Cross-Validation Metrics (Mean):")
for metric in scoring:
    # Scores are stored under the 'test_' prefix
    mean_score = results_df[f'test_{metric}'].mean()
    std_score = results_df[f'test_{metric}'].std()
    print(f"- {metric.upper()}: {mean_score:.4f} (Std Dev: {std_score:.4f})")

Baseline Cross-Validation Metrics (Mean):
- ROC_AUC: 0.8466 (Std Dev: 0.0139)
- ACCURACY: 0.8071 (Std Dev: 0.0136)
- PRECISION: 0.6656 (Std Dev: 0.0316)
- RECALL: 0.5492 (Std Dev: 0.0413)
- F1: 0.6013 (Std Dev: 0.0323)


### 3.2 Random Forest Classification Model

The second model is Random Forest, an ensemble model: it builds many decision trees and combines their predictions into a more accurate and robust model. For our churn dataset, each tree separates churners from non-churners, and the individual tree predictions are aggregated into a final churn probability.

The main hyperparameters are as below:

n_estimators - Number of trees in the forest

max_depth - Maximum depth of each tree

min_samples_split - Minimum samples required to split a node

min_samples_leaf - Minimum samples required in a leaf node

max_features - Number of features considered at each split

n_jobs - Number of CPU cores used for training (-1 uses all available cores; this affects speed, not the fitted model)

Below we set n_estimators to 100 (also the scikit-learn default, chosen with the size of the dataset in mind), fix the random state and use all available CPU cores.
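The aggregation step can be made concrete with a short sketch: in scikit-learn, the forest's predicted probability is the average of the probabilities predicted by its individual trees. The synthetic dataset and the choice of 25 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic binary classification problem
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes)
per_tree_proba = np.stack(
    [tree.predict_proba(X_demo) for tree in forest.estimators_]
)

# Averaging over trees reproduces the forest's own predict_proba
averaged_proba = per_tree_proba.mean(axis=0)
forest_proba = forest.predict_proba(X_demo)

print("Averaged tree probabilities match the forest:",
      np.allclose(averaged_proba, forest_proba))
```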

In [7]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf_clf.fit(X_train, y_train)

Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


#### 3.2.1 Random Forest Model Evaluation

In [8]:
rf_cv_results = cross_validate(
    estimator=rf_clf,
    X=X_train,
    y=y_train,
    cv=cv_strategy,
    scoring=scoring,
    return_train_score=False,
    n_jobs=-1
)

rf_results_df = pd.DataFrame(rf_cv_results)

# Calculate the mean performance for reporting
print("Random Forest Baseline Cross-Validation Metrics (Mean):")
for metric in scoring:
    mean_score = rf_results_df[f'test_{metric}'].mean()
    std_score = rf_results_df[f'test_{metric}'].std()
    print(f"- {metric.upper()}: {mean_score:.4f} (Std Dev: {std_score:.4f})")

Random Forest Baseline Cross-Validation Metrics (Mean):
- ROC_AUC: 0.8181 (Std Dev: 0.0159)
- ACCURACY: 0.7820 (Std Dev: 0.0158)
- PRECISION: 0.6180 (Std Dev: 0.0432)
- RECALL: 0.4709 (Std Dev: 0.0229)
- F1: 0.5343 (Std Dev: 0.0292)


### 3.3 XGBoost

XGBoost (Extreme Gradient Boosting) is a more advanced ensemble learning algorithm based on gradient-boosted decision trees. It builds the ensemble sequentially, with each new tree trained to correct the errors made by the trees before it. It is particularly effective because it can model complex non-linear relationships, feature interactions and heterogeneous customer behaviour while maintaining strong generalisation performance.
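The "each tree corrects the previous errors" idea can be sketched by hand with plain regression trees. This is a deliberately simplified illustration using squared-error loss on a toy regression problem; it is not XGBoost's actual algorithm, which also uses second-order gradient information and regularisation. All names and values below are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: noisy sine wave
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3  # how much of each tree's correction is applied
n_rounds = 20        # number of trees added sequentially

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_per_round = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # Residuals are the errors the current ensemble still makes
    residuals = y_demo - prediction
    # The next tree is fitted to those residuals, not to the original target
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residuals)
    # Add a shrunken version of the tree's correction to the ensemble
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_round.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_per_round[-1]:.4f}")
```

The training error falls with every round, which is also why boosted models need a learning rate and a limit on the number of trees to avoid overfitting.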

The main hyperparameters are as below:

n_estimators - Number of trees in the ensemble

learning_rate - Shrinkage (step size) applied to each tree's contribution

max_depth - Maximum depth of each tree

min_child_weight - Minimum sum of instance weight in a leaf (similar to min_samples_leaf)

subsample - Fraction of rows used per tree

colsample_bytree - Fraction of features used per tree

gamma - Minimum loss reduction for a split

scale_pos_weight - Balances positive vs negative class; typically set to num_nonchurners / num_churners
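As a small worked example of the last hyperparameter, the ratio can be computed directly from the label vector. The helper name `compute_scale_pos_weight` and the label counts are hypothetical; the baseline in this notebook leaves `scale_pos_weight` at its default.

```python
import numpy as np


def compute_scale_pos_weight(y):
    """Return num_negatives / num_positives for a binary 0/1 label array."""
    y = np.asarray(y)
    n_positive = int((y == 1).sum())
    n_negative = int((y == 0).sum())
    if n_positive == 0:
        raise ValueError("No positive samples: the ratio is undefined.")
    return n_negative / n_positive


# Hypothetical labels: 730 non-churners and 270 churners
y_demo = np.array([0] * 730 + [1] * 270)
demo_weight = compute_scale_pos_weight(y_demo)
print(f"scale_pos_weight = {demo_weight:.2f}")  # scale_pos_weight = 2.70

# The value would then be passed to the classifier, for example:
# XGBClassifier(scale_pos_weight=demo_weight, random_state=42)
```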

In [9]:
xgb_clf = XGBClassifier(
    eval_metric='logloss',      # Evaluation metric tracked during training
    random_state=RANDOM_STATE,  # Ensures reproducibility
    n_jobs=-1                   # Uses all cores for speed
)

xgb_clf.fit(X_train, y_train)

Parameter,Value
objective,'binary:logistic'
base_score,None
booster,None
callbacks,None
colsample_bylevel,None
colsample_bynode,None
colsample_bytree,None
device,None
early_stopping_rounds,None
enable_categorical,False


#### 3.3.1 XGBoost Model Evaluation

In [10]:
xgb_cv_results = cross_validate(
    estimator=xgb_clf,
    X=X_train,
    y=y_train,
    cv=cv_strategy,
    scoring=scoring,
    return_train_score=False,
    n_jobs=-1
)

xgb_results_df = pd.DataFrame(xgb_cv_results)

# Calculate the mean performance for reporting
print("XGBoost Baseline Cross-Validation Metrics (Mean):")
for metric in scoring:
    mean_score = xgb_results_df[f'test_{metric}'].mean()
    std_score = xgb_results_df[f'test_{metric}'].std()
    print(f"- {metric.upper()}: {mean_score:.4f} (Std Dev: {std_score:.4f})")

XGBoost Baseline Cross-Validation Metrics (Mean):
- ROC_AUC: 0.8224 (Std Dev: 0.0039)
- ACCURACY: 0.7812 (Std Dev: 0.0096)
- PRECISION: 0.6035 (Std Dev: 0.0244)
- RECALL: 0.5130 (Std Dev: 0.0181)
- F1: 0.5544 (Std Dev: 0.0173)


### 3.4 Light Gradient Boosting Model

LightGBM is a high-performance implementation of gradient boosting developed by Microsoft. Like XGBoost, it builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the previous ensemble.
It is designed to be faster and more memory-efficient than traditional gradient boosting, while maintaining similar or better predictive performance.

The main hyperparameters are as below:

n_estimators - Number of trees

learning_rate - Step size for each tree

max_depth - Maximum depth of each tree (-1 means no limit, leaving leaf-wise growth bounded by num_leaves)

num_leaves - Maximum leaves per tree (leaf-wise growth)

min_child_samples - Minimum samples per leaf

subsample - Row sampling for each tree

colsample_bytree - Feature sampling for each tree

scale_pos_weight - Class imbalance adjustment; typically set to num_nonchurners / num_churners

max_bin - Number of bins for histogram
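Because LightGBM grows trees leaf-wise, `num_leaves` and `max_depth` interact: a binary tree of depth d can hold at most 2^d leaves, so `num_leaves` only constrains the tree when it is below that bound. The sketch below works through that arithmetic; the helper names are hypothetical and not part of the LightGBM API.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Maximum number of leaves a binary tree of the given depth can have."""
    return 2 ** max_depth


def is_num_leaves_binding(num_leaves: int, max_depth: int) -> bool:
    """Return True if num_leaves, rather than max_depth, limits tree size.

    A non-positive max_depth means "no depth limit" in LightGBM, in which
    case num_leaves is always the effective limit.
    """
    if max_depth <= 0:
        return True
    return num_leaves < max_leaves_for_depth(max_depth)


# Default settings shown in this notebook: num_leaves=31, max_depth=-1
print(is_num_leaves_binding(num_leaves=31, max_depth=-1))  # True

# With max_depth=3 a tree can have at most 8 leaves, so num_leaves=31 never binds
print(max_leaves_for_depth(3))                             # 8
print(is_num_leaves_binding(num_leaves=31, max_depth=3))   # False
```

When tuning both parameters together, keeping `num_leaves` below 2^max_depth is a common way to make sure each one still has an effect.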

In [11]:
lgbm_classifier = lgb.LGBMClassifier(
    objective='binary',         # Specifies binary classification for churn
    n_estimators=100,           # Baseline number of trees
    learning_rate=0.05,         # A common starting point 
    random_state=RANDOM_STATE,  # Ensures reproducibility
    n_jobs=-1,                  # Uses all cores for speed
    verbosity=-1                # Silent mode (no messages)
)

lgbm_classifier.fit(X_train, y_train)

Parameter,Value
boosting_type,'gbdt'
num_leaves,31
max_depth,-1
learning_rate,0.05
n_estimators,100
subsample_for_bin,200000
objective,'binary'
class_weight,None
min_split_gain,0.0
min_child_weight,0.001


#### 3.4.1 Light Gradient Boosting Model Evaluation

In [12]:
lgbm_cv_results = cross_validate(
    estimator=lgbm_classifier,
    X=X_train,
    y=y_train,
    cv=cv_strategy,
    scoring=scoring,
    return_train_score=False,
    n_jobs=-1,
)

lgbm_results_df = pd.DataFrame(lgbm_cv_results)

# Calculate the mean performance for reporting
print("Light Gradient Boost Baseline Cross-Validation Metrics (Mean):")
for metric in scoring:
    mean_score = lgbm_results_df[f'test_{metric}'].mean()
    std_score = lgbm_results_df[f'test_{metric}'].std()
    print(f"- {metric.upper()}: {mean_score:.4f} (Std Dev: {std_score:.4f})")

Light Gradient Boost Baseline Cross-Validation Metrics (Mean):
- ROC_AUC: 0.8441 (Std Dev: 0.0111)
- ACCURACY: 0.8016 (Std Dev: 0.0119)
- PRECISION: 0.6609 (Std Dev: 0.0379)
- RECALL: 0.5217 (Std Dev: 0.0210)
- F1: 0.5826 (Std Dev: 0.0201)


## 4. Evaluate Model Performance

The primary goal of machine learning is not just to train a model, but to select the best performing model for the specific business problem. This section focuses on evaluating the trained baseline models and comparing their performance metrics.

In [13]:
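def summarize_cv_results(cv_results, model_name):
    """Calculates mean metrics and fit time from cross_validate output.

    np.std defaults to the population standard deviation (ddof=0), so
    ROC_AUC_Std is slightly smaller than the pandas .std() (ddof=1) values
    printed for each model above.
    """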
    summary = {
        'Model': model_name,
        'ROC_AUC_Mean': np.mean(cv_results['test_roc_auc']),
        'ROC_AUC_Std': np.std(cv_results['test_roc_auc']),
        'F1_Score_Mean': np.mean(cv_results['test_f1']),
        'Accuracy_Mean': np.mean(cv_results['test_accuracy']),
        'Fit_Time_Mean (s)': np.mean(cv_results['fit_time'])
    }
    return pd.Series(summary)

In [14]:
# Create a list of summarized results
results_list = [
    summarize_cv_results(cv_results, 'Logistic Regression'),  # raw cross_validate output, as for the other models
    summarize_cv_results(rf_cv_results, 'Random Forest'),
    summarize_cv_results(lgbm_cv_results, 'LightGBM'),
    summarize_cv_results(xgb_cv_results, 'XGBoost')
]

# Concatenate the series into a final DataFrame
model_comparison_df = pd.concat(results_list, axis=1).T

# Sort the DataFrame to see the best-performing models first
model_comparison_df = model_comparison_df.sort_values(
    by='ROC_AUC_Mean', ascending=False
).reset_index(drop=True)

# Format the output for a clean look
print("Baseline Model Performance Comparison (Cross-Validated)")
display(
    model_comparison_df.style.format({
        'ROC_AUC_Mean': '{:.4f}',
        'ROC_AUC_Std': '{:.4f}',
        'F1_Score_Mean': '{:.4f}',
        'Accuracy_Mean': '{:.4f}',
        'Fit_Time_Mean (s)': '{:.2f}'
    })
)

Baseline Model Performance Comparison (Cross-Validated)


Unnamed: 0,Model,ROC_AUC_Mean,ROC_AUC_Std,F1_Score_Mean,Accuracy_Mean,Fit_Time_Mean (s)
0,Logistic Regression,0.8466,0.0124,0.6013,0.8071,0.01
1,LightGBM,0.8441,0.01,0.5826,0.8016,2.09
2,XGBoost,0.8224,0.0034,0.5544,0.7812,0.04
3,Random Forest,0.8181,0.0143,0.5343,0.782,0.15


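Logistic Regression ranks first on both ROC AUC (0.8466) and F1 (0.6013) while also being the fastest model to train (0.01s per fold). Its ROC AUC lead over LightGBM (0.8441) is only 0.0025, which is smaller than the fold-to-fold standard deviation of either model (about 0.01), so the two are effectively tied; XGBoost (0.8224) and Random Forest (0.8181) trail more clearly. This makes Logistic Regression an excellent baseline and potentially the final production model if its performance is deemed sufficient. The tree-based ensembles (Random Forest, LightGBM, XGBoost) did not outperform it on this dataset with near-default hyperparameters, despite being generally more flexible. This suggests that the relationship between the engineered features and the target may be largely linear, although tuning could still change the ranking.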
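One rough way to judge whether a gap between models is meaningful is to express it in units of the fold-to-fold standard deviation. The sketch below does this with the ROC AUC values copied from the comparison table; it is an informal heuristic, not a formal significance test.

```python
# ROC AUC mean and standard deviation per model, copied from the table above
roc_auc_summary = {
    "Logistic Regression": (0.8466, 0.0124),
    "LightGBM": (0.8441, 0.0100),
    "XGBoost": (0.8224, 0.0034),
    "Random Forest": (0.8181, 0.0143),
}

# Identify the model with the highest mean ROC AUC
best_model = max(roc_auc_summary, key=lambda name: roc_auc_summary[name][0])
best_mean, best_std = roc_auc_summary[best_model]

# Gap to the best model, measured in the best model's standard deviations
gap_in_stds = {
    name: (best_mean - mean) / best_std
    for name, (mean, _) in roc_auc_summary.items()
    if name != best_model
}

for name, gap in gap_in_stds.items():
    print(f"{best_model} vs {name}: gap = {gap:.2f} standard deviations")
```

A gap well below one standard deviation (as for LightGBM, at about 0.2) is within the noise of the cross-validation folds, whereas gaps of about two standard deviations (XGBoost, Random Forest) are more likely to reflect a real difference.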

In [15]:
# Save the baseline model comparison df to be compared with tuned models
model_comparison_df.to_parquet('../data/reports/baseline_model_performance.parquet', index=False)

## 5. Store Models

The baseline models are stored to ensure reproducibility and to establish a benchmark against which all subsequent hyperparameter-tuned and advanced models must be fairly compared.

In [16]:
joblib.dump(log_reg, "../models/baseline_lr.joblib")
joblib.dump(rf_clf, "../models/baseline_rf.joblib")
joblib.dump(xgb_clf, "../models/baseline_xgb.joblib")
joblib.dump(lgbm_classifier, "../models/baseline_lgbm.joblib")

['../models/baseline_lgbm.joblib']
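A stored model can later be restored with `joblib.load` and should reproduce the original predictions exactly. The following sketch shows the round trip on a throwaway model written to a temporary directory; the file name `demo_lr.joblib` and the synthetic data are assumptions so that the example does not touch the real `../models/` folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small stand-in model on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
model = LogisticRegression(solver="liblinear", random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_lr.joblib"
    joblib.dump(model, model_path)     # persist the fitted estimator
    reloaded = joblib.load(model_path)  # restore it from disk

# The reloaded model should give identical probabilities
predictions_match = np.array_equal(
    model.predict_proba(X_demo), reloaded.predict_proba(X_demo)
)
print("Reloaded model reproduces predictions:", predictions_match)
```

Models saved with joblib are best reloaded with the same library versions that produced them, so recording the versions of scikit-learn, XGBoost and LightGBM alongside the files helps keep the baseline reproducible.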