# **Customer Churn Prediction**
#### **What is customer churn?**  
Customer churn refers to the percentage of customers who stop using a company's product or service within a given time frame. This metric helps businesses gauge customer satisfaction and loyalty while also providing insights into potential revenue fluctuations.

Churn is especially critical for subscription-based businesses, such as SaaS companies, which rely on recurring revenue. Understanding churn patterns allows them to anticipate financial impact and take proactive measures.

Also known as customer attrition, churn is the opposite of customer retention, which focuses on maintaining long-term customer relationships. Reducing churn should be a key part of any customer engagement strategy, ensuring consistent interactions between businesses and their customers, whether online or in person.

A strong customer retention plan plays a crucial role in minimizing churn. Companies should track churn rates regularly to assess their risk of revenue loss and identify areas for improvement.

<br>

**Source:** IBM. Customer Churn. Retrieved from https://www.ibm.com/think/topics/customer-churn

---

**Dataset:** [Telco Customer Churn (IBM)](https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset)

In [42]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

In [43]:
df = pd.read_excel('../../data/TelcoCustomerChurn/Telco_customer_churn.xlsx')
df

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.307420,Female,No,No,Yes,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,No,No,Yes,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,No,Yes,Yes,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.80,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,No,No,Yes,49,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),103.70,5036.3,Yes,1,89,5340,Competitor had better devices
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,2569-WGERO,1,United States,California,Landers,92285,"34.341737, -116.539416",34.341737,-116.539416,Female,No,No,No,72,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),21.15,1419.4,No,0,45,5306,
7039,6840-RESVB,1,United States,California,Adelanto,92301,"34.667815, -117.536183",34.667815,-117.536183,Male,No,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No,0,59,2140,
7040,2234-XADUH,1,United States,California,Amboy,92304,"34.559882, -115.637164",34.559882,-115.637164,Female,No,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No,0,71,5560,
7041,4801-JZAZL,1,United States,California,Angelus Oaks,92305,"34.1678, -116.86433",34.167800,-116.864330,Female,No,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No,0,59,2793,


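More than 70% of the `Churn Reason` column is missing, so drop it. The column is only filled in for customers who churned, which means it would also leak the target into the features.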

In [44]:
missing_percent = df['Churn Reason'].isnull().sum() / len(df) * 100
missing_percent

np.float64(73.4630129206304)

In [45]:
df = df.drop(columns=['Churn Reason'])
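
The same check can be run on every column at once instead of one column at a time. The sketch below is a minimal illustration on a made-up toy frame; the 70% cutoff and the names `toy`, `threshold`, and `to_drop` are assumptions for the example, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Telco data (values are made up).
toy = pd.DataFrame({
    "Tenure Months": [2, 8, 28, 49],
    "Churn Reason": ["Moved", np.nan, np.nan, np.nan],
    "Total Charges": [108.15, np.nan, 3046.05, 5036.3],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# Hypothetical cutoff: drop columns with more than 70% missing values.
threshold = 70
to_drop = missing_pct[missing_pct > threshold].index.tolist()
toy_clean = toy.drop(columns=to_drop)

print(missing_pct)
print("Dropped:", to_drop)
```

On the real data this would replace `toy` with `df`; whether 70% is the right cutoff is a judgment call.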

Drop columns with only one unique value, since they won't provide any useful information to our model.

In [46]:
print(df['Count'].unique())
print(df['Count'].value_counts(), '\n')

print(df['Country'].unique())
print(df['Country'].value_counts(), '\n')

print(df['State'].unique())
print(df['State'].value_counts())

[1]
Count
1    7043
Name: count, dtype: int64 

['United States']
Country
United States    7043
Name: count, dtype: int64 

['California']
State
California    7043
Name: count, dtype: int64


In [47]:
df = df.drop(columns=['Count', 'Country', 'State'])
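
Rather than inspecting each column by hand, constant columns can also be detected programmatically. This is a small sketch on a toy frame (the data and the name `constant_cols` are illustrative assumptions); it uses `nunique` to find columns with a single distinct value.

```python
import pandas as pd

# Toy frame mimicking the constant columns found above.
toy = pd.DataFrame({
    "Count": [1, 1, 1],
    "Country": ["United States"] * 3,
    "State": ["California"] * 3,
    "City": ["Los Angeles", "Landers", "Amboy"],
})

# A column is constant if it has exactly one distinct value
# (dropna=False counts NaN as a value of its own).
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", toy.columns.tolist())
```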

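Drop redundant columns: `Lat Long` duplicates `Latitude` and `Longitude`, `Churn Label` duplicates the target `Churn Value`, and `CustomerID` is a unique identifier with no predictive value.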

In [48]:
df = df.drop(columns=['Lat Long', 'Churn Label', 'CustomerID'])

Convert `Total Charges` into a numerical column; with `errors='coerce'`, any value that cannot be parsed becomes `NaN`.
Then collect the names of the categorical and numerical features.

In [49]:
df['Total Charges'] = pd.to_numeric(df['Total Charges'], errors='coerce')

categorical_features = df.select_dtypes(include=['object']).columns.tolist()    
numerical_features = df.select_dtypes(include=['int','float']).columns.tolist()
numerical_features.remove('Churn Value') # remove target variable
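
To see what `errors='coerce'` does, the sketch below converts a toy column that mixes numbers with a blank string. The values are made up, and whether the real column contains such blanks is an assumption worth checking with `df['Total Charges'].isna().sum()` after conversion.

```python
import pandas as pd

# Toy column of mixed types: numbers plus a blank string (made-up values).
raw = pd.Series([108.15, " ", 820.5], name="Total Charges", dtype="object")

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())

# One possible follow-up (an assumption, not the notebook's choice):
# fill the gaps with the column median.
filled = converted.fillna(converted.median())

print(converted)
print("Coerced to NaN:", n_missing)
```

How the resulting `NaN` values should be handled depends on the model; some estimators accept them directly, others need imputation first.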

Split the dataset and initialize the encoder and scaler.

In [50]:
x = df.drop(columns='Churn Value')
y = df['Churn Value']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

label_encoder = LabelEncoder()
scaler = StandardScaler()

Encode categorical features.

In [51]:
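for feature in categorical_features:
    # Fit a fresh encoder per column on the training split only,
    # then reuse the learned mapping on the test split.
    label_encoder = LabelEncoder()
    x_train[feature] = label_encoder.fit_transform(x_train[feature])
    # Note: transform() raises ValueError for categories not seen during fit.
    x_test[feature] = label_encoder.transform(x_test[feature])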
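
`LabelEncoder` is intended for target labels, and its `transform` fails on values that were not present during `fit`, which can happen with a high-cardinality column such as `City`. A possible alternative is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and can map unseen categories to a sentinel. The sketch below uses toy data; the sentinel `-1` is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test splits; "Adelanto" appears only in the test split.
train = pd.DataFrame({
    "City": ["Los Angeles", "Landers", "Amboy"],
    "Contract": ["Month-to-month", "Two year", "One year"],
})
test = pd.DataFrame({
    "City": ["Adelanto", "Landers"],
    "Contract": ["One year", "Two year"],
})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train)
test_enc = encoder.transform(test)

print(test_enc)
```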


Use standard scaling on numerical features.

In [52]:
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_test[numerical_features] = scaler.transform(x_test[numerical_features])
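
The scaler is fitted on the training split only, so the test split is standardized with the training mean and standard deviation. The toy example below illustrates this with made-up numbers. Note that tree-based models such as XGBoost and random forests are generally insensitive to this kind of monotonic rescaling, so the step mainly matters if other model families are tried later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature splits.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

# Learn mean and standard deviation from the training data only.
toy_scaler = StandardScaler().fit(train)
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)

# The test value is expressed in units of the *training* distribution.
expected = (40.0 - train.mean()) / train.std()
print(test_scaled[0, 0], expected)
```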

In [53]:
x_train.head()

Unnamed: 0,City,Zip Code,Latitude,Longitude,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Score,CLTV
2142,869,-0.749824,-1.416407,1.291083,1,0,1,0,1.570886,1,2,1,0,0,2,0,2,2,1,1,1,1.133166,2.068906,-1.405805,0.585335
1623,462,0.965416,0.632339,-0.314794,0,0,0,0,-0.669405,1,0,1,0,2,2,0,0,2,0,1,1,0.952708,-0.312066,0.76708,0.918886
6074,541,0.011555,-0.725073,0.86033,0,0,0,0,-0.017684,1,2,1,0,2,2,0,2,2,1,1,1,1.311969,0.502253,0.073606,-1.744467
1362,292,-0.957278,-0.897809,0.817791,0,1,1,0,-0.343545,0,1,0,0,0,2,0,0,2,0,1,1,-0.787308,-0.595,1.229396,-0.474278
6754,562,-1.859811,-0.934924,0.684388,0,0,1,1,-0.506475,1,0,2,1,1,1,1,1,1,0,0,3,-1.469407,-0.827944,-0.712331,-1.438712


In [54]:
x_test.head()

Unnamed: 0,City,Zip Code,Latitude,Longitude,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Score,CLTV
185,473,0.005139,-0.43685,0.565364,0,0,0,0,-0.710138,1,0,1,0,2,0,0,0,2,0,1,0,0.710993,-0.446442,1.737943,-0.696645
2715,972,0.901255,0.69505,-0.703969,1,0,1,0,-0.58794,1,2,0,0,2,0,0,0,0,0,0,1,-0.252555,-0.53255,0.027375,0.970266
3825,881,0.565479,0.56718,-1.090318,0,0,0,0,-1.239661,1,0,2,1,1,1,1,1,1,0,1,3,-1.471062,-0.98927,-0.897257,-1.827012
1807,931,-1.121958,-0.87059,0.63887,1,1,1,0,-1.076731,1,0,1,0,2,0,0,0,2,0,1,2,0.654703,-0.777975,0.628385,1.202741
132,662,-0.512428,-1.100415,1.235335,1,0,0,0,0.308176,1,2,1,0,0,2,2,2,2,0,0,0,1.336802,0.818191,1.368091,-0.769926


Train an XGBoost model and evaluate it.

In [55]:
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)

xgb_model.fit(x_train, y_train)

y_pred = xgb_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9240596167494677
              precision    recall  f1-score   support

           0       0.94      0.95      0.95      1009
           1       0.87      0.86      0.87       400

    accuracy                           0.92      1409
   macro avg       0.91      0.90      0.91      1409
weighted avg       0.92      0.92      0.92      1409



Train a RandomForest model and evaluate it.

In [56]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    class_weight='balanced'
)

rf_model.fit(x_train, y_train)

y_pred_rf = rf_model.predict(x_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.9212207239176721
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.92      0.94      1009
           1       0.82      0.92      0.87       400

    accuracy                           0.92      1409
   macro avg       0.90      0.92      0.91      1409
weighted avg       0.93      0.92      0.92      1409
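
The two models reach similar accuracy but trade precision against recall on the churn class (class 1) differently, so it helps to recall how those numbers come out of the confusion matrix. The sketch below uses made-up labels, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions (1 = churned).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of real churners, how many were caught

print("Precision:", precision, "Recall:", recall)
```

Applied to the notebook, `confusion_matrix(y_test, y_pred)` and `confusion_matrix(y_test, y_pred_rf)` would give the corresponding counts for each model.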



First trial of hyperparameter tuning with Optuna, maximizing mean 5-fold cross-validated accuracy on the training set.

In [None]:
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0)
    }

    model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42, **params)
    score = cross_val_score(model, x_train, y_train, cv=5, scoring='accuracy').mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

print("Best Hyperparameters:", study.best_params)
print("Best Accuracy:", study.best_value)
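
The objective scores each trial by accuracy. Since churners are the minority class here, a possible variation is to score on F1 for the positive class instead, which amounts to changing the `scoring` argument of `cross_val_score`. The sketch below compares both metrics on synthetic data; the data set, the `RandomForestClassifier` stand-in, and all parameter values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data (about 27% positives).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=42
)

model = RandomForestClassifier(n_estimators=25, random_state=42)
# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
f1 = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="f1")

print("Mean accuracy:", acc.mean())
print("Mean F1 (class 1):", f1.mean())
```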


Test the best hyperparameters on XGBoost.

In [61]:
final_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42, **study.best_params)
final_model.fit(x_train, y_train)

y_pred = final_model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9283179559971612
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1009
           1       0.89      0.85      0.87       400

    accuracy                           0.93      1409
   macro avg       0.92      0.91      0.91      1409
weighted avg       0.93      0.93      0.93      1409

