# **Customer Churn Prediction**
#### **What is customer churn?**  
Customer churn refers to the percentage of customers who stop using a company's product or service within a given time frame. This metric helps businesses gauge customer satisfaction and loyalty while also providing insights into potential revenue fluctuations.

Churn is especially critical for subscription-based businesses, such as SaaS companies, which rely on recurring revenue. Understanding churn patterns allows them to anticipate financial impact and take proactive measures.

Also known as customer attrition, churn is the opposite of customer retention, which focuses on maintaining long-term customer relationships. Reducing churn should be a key part of any customer engagement strategy, ensuring consistent interactions between businesses and their customers, whether online or in person.

A strong customer retention plan plays a crucial role in minimizing churn. Companies should track churn rates regularly to assess their risk of revenue loss and identify areas for improvement.

<br>

**Source:** IBM. Customer Churn. Retrieved from https://www.ibm.com/think/topics/customer-churn
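As a worked example of the definition above, churn for a period can be computed as the share of starting customers who left. This is a minimal sketch; the function name `churn_rate` and the customer counts are hypothetical and not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return churn for a period as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100 * customers_lost / customers_at_start


# Hypothetical numbers: 1,000 customers at the start of the month, 50 cancelled.
monthly_churn = churn_rate(1000, 50)
print(f"Monthly churn: {monthly_churn:.1f}%")
```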

---

**Dataset:** [Telco Customer Churn (IBM)](https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset)

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

pd.set_option('display.max_columns', None)

In [5]:
df = pd.read_excel('../../data/TelcoCustomerChurn/Telco_customer_churn.xlsx')
df

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,Senior Citizen,Partner,Dependents,Tenure Months,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.307420,Female,No,No,Yes,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,No,No,Yes,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,No,Yes,Yes,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.80,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,No,No,Yes,49,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),103.70,5036.3,Yes,1,89,5340,Competitor had better devices
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,2569-WGERO,1,United States,California,Landers,92285,"34.341737, -116.539416",34.341737,-116.539416,Female,No,No,No,72,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),21.15,1419.4,No,0,45,5306,
7039,6840-RESVB,1,United States,California,Adelanto,92301,"34.667815, -117.536183",34.667815,-117.536183,Male,No,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No,0,59,2140,
7040,2234-XADUH,1,United States,California,Amboy,92304,"34.559882, -115.637164",34.559882,-115.637164,Female,No,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No,0,71,5560,
7041,4801-JZAZL,1,United States,California,Angelus Oaks,92305,"34.1678, -116.86433",34.167800,-116.864330,Female,No,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No,0,59,2793,


More than 70% of the values in `Churn Reason` are missing, so drop the column.

In [6]:
missing_percent = df['Churn Reason'].isnull().sum() / len(df) * 100
missing_percent

np.float64(73.4630129206304)

In [7]:
df = df.drop(columns=['Churn Reason'])
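The same check can be run across every column at once rather than one column at a time. The sketch below uses a small made-up frame in place of the Telco data, and the 70% cut-off is an assumption chosen to mirror the sentence above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Telco data (values are made up).
toy = pd.DataFrame({
    "Tenure Months": [2, 8, 28, 49],
    "Churn Reason": ["Moved", np.nan, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Columns whose missing share exceeds the chosen threshold (70% is an assumption).
to_drop = missing_pct[missing_pct > 70].index.tolist()
print(missing_pct)
print(to_drop)
```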

Drop columns with only one unique value, since they won't provide any useful information to our model.

In [8]:
print(df['Count'].unique())
print(df['Count'].value_counts(), '\n')

print(df['Country'].unique())
print(df['Country'].value_counts(), '\n')

print(df['State'].unique())
print(df['State'].value_counts())

[1]
Count
1    7043
Name: count, dtype: int64 

['United States']
Country
United States    7043
Name: count, dtype: int64 

['California']
State
California    7043
Name: count, dtype: int64


In [9]:
df = df.drop(columns=['Count', 'Country', 'State'])
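Constant columns can also be found programmatically instead of inspecting each one by hand. A minimal sketch on a made-up frame, assuming a column counts as constant when `nunique` is at most 1:

```python
import pandas as pd

# Toy frame with two constant columns and one informative column (made-up values).
toy = pd.DataFrame({
    "Count": [1, 1, 1],
    "Country": ["United States"] * 3,
    "City": ["Los Angeles", "Landers", "Amboy"],
})

# dropna=False counts NaN as its own value, so a column mixing one value
# with NaN is not mistaken for a constant column.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)
print(constant_cols)
```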

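Drop redundant columns: `Lat Long` duplicates `Latitude` and `Longitude`, `Churn Label` duplicates the target `Churn Value`, and `CustomerID` is a row identifier.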

In [10]:
df = df.drop(columns=['Lat Long', 'Churn Label', 'CustomerID'])

Convert `Total Charges` into a numerical column.

In [11]:
df['Total Charges'] = pd.to_numeric(df['Total Charges'], errors='coerce')
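With `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN` instead of raising an error. The sketch below illustrates this on a made-up series; it assumes the unparseable entries are blank strings, which is not verified against the Excel file here.

```python
import pandas as pd

# Made-up values; the single-space entry stands in for a blank cell (assumption).
raw = pd.Series(["108.15", " ", "820.5"])

converted = pd.to_numeric(raw, errors="coerce")
n_coerced = int(converted.isna().sum())

print(converted)
print("Entries coerced to NaN:", n_coerced)
```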

Create an `Average Monthly Charge` column (total charges divided by tenure in months). Let's also drop the `Total Charges` column.

In [12]:
# Average charge per month of tenure
df['Average Monthly Charge'] = df['Total Charges'] / df['Tenure Months']

# Report how many rows have an undefined ratio before filling them with 0
avg_monthly_missing = df['Average Monthly Charge'].isna().sum()
print(avg_monthly_missing)

df['Average Monthly Charge'] = df['Average Monthly Charge'].fillna(0)
df = df.drop(columns=['Total Charges'])
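One edge case worth sketching: if a row had a numeric `Total Charges` but a tenure of 0, the division would produce `inf` rather than `NaN`, and `fillna(0)` would not catch it. The guard below (replacing zero tenure with `NaN` before dividing) is an addition for illustration, shown on made-up values.

```python
import numpy as np
import pandas as pd

# Made-up rows, including one customer with zero months of tenure.
toy = pd.DataFrame({
    "Tenure Months": [2, 0, 8],
    "Total Charges": [108.15, 25.0, 820.5],
})

# Treat zero tenure as undefined so the ratio becomes NaN instead of inf.
safe_tenure = toy["Tenure Months"].replace(0, np.nan)
toy["Average Monthly Charge"] = (toy["Total Charges"] / safe_tenure).fillna(0)

print(toy)
```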

Get the categorical and numerical features.

In [13]:
categorical_features = df.select_dtypes(include=['object']).columns.tolist()    
numerical_features = df.select_dtypes(include=['int','float']).columns.tolist()
numerical_features.remove('Churn Value') # remove target variable

Split the dataset and initialize the encoder and scaler.

In [14]:
x = df.drop(columns='Churn Value')
y = df['Churn Value']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

label_encoder = LabelEncoder()
scaler = StandardScaler()
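Because churners are the minority class, one option is to pass `stratify=y` so that both splits keep the same churn rate. This is a sketch on synthetic data and a variation on the split above, not what the notebook ran; the 28% positive rate is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (28% positives, made up).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 280 + [0] * 720)

# stratify=y_demo keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train churn rate:", y_tr.mean())
print("Test churn rate:", y_te.mean())
```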

Encode categorical features.

In [15]:
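for feature in categorical_features:
    # Fit a fresh encoder per column on the training split only,
    # then apply the same mapping to the test split.
    # Note: transform() raises ValueError for categories absent from the training split.
    label_encoder = LabelEncoder()
    x_train[feature] = label_encoder.fit_transform(x_train[feature])
    x_test[feature] = label_encoder.transform(x_test[feature])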
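`LabelEncoder` is documented by scikit-learn as intended for target labels, and its `transform` fails on values it did not see during `fit`. A possible alternative for feature columns is `OrdinalEncoder` with an explicit code for unseen categories. The sketch below uses made-up city values; the choice of `-1` as the unknown code is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up splits: "Adelanto" appears only in the test split.
train = pd.DataFrame({"City": ["Los Angeles", "Landers", "Amboy"]})
test = pd.DataFrame({"City": ["Landers", "Adelanto"]})

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train[["City"]])
test_enc = encoder.transform(test[["City"]])

print(encoder.categories_[0])
print(test_enc.ravel())
```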


Use standard scaling on numerical features.

In [16]:
x_train[numerical_features] = scaler.fit_transform(x_train[numerical_features])
x_test[numerical_features] = scaler.transform(x_test[numerical_features])

Train XGBoost model and evaluate.

In [17]:
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)

xgb_model.fit(x_train, y_train)

y_pred = xgb_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9176721078779276
              precision    recall  f1-score   support

           0       0.94      0.95      0.94      1009
           1       0.87      0.84      0.85       400

    accuracy                           0.92      1409
   macro avg       0.90      0.89      0.90      1409
weighted avg       0.92      0.92      0.92      1409
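The precision and recall columns in a classification report are derived from confusion-matrix counts. The sketch below shows the relationship on made-up labels, not on the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = churned, 0 = stayed.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted churners who actually churned
recall = tp / (tp + fn)     # share of actual churners the model caught

print("Precision:", precision, "Recall:", recall)
```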



Train RandomForest model and evaluate.

In [18]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    class_weight='balanced'
)

rf_model.fit(x_train, y_train)

y_pred_rf = rf_model.predict(x_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.9233498935415189
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.92      0.95      1009
           1       0.83      0.92      0.87       400

    accuracy                           0.92      1409
   macro avg       0.90      0.92      0.91      1409
weighted avg       0.93      0.92      0.92      1409
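The `class_weight='balanced'` setting used for the Random Forest weights each class inversely to its frequency, which may explain its higher recall on class 1 compared with XGBoost. The sketch below reproduces the weighting formula on made-up label counts using scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced target: 720 non-churners, 280 churners.
y_demo = np.array([0] * 720 + [1] * 280)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same result by hand: n_samples / (n_classes * count_per_class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```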



First trial of hyperparameter tuning.

In [19]:
# redo hyperparameter tuning since results got worse

### **Feature Engineering Experiments**
Created `Average Monthly Charge` column

---

### **Initial Results**

#### **XGBoost**  
**Accuracy:** `0.9240596167494677`

```
              precision    recall  f1-score   support  

           0       0.94      0.95      0.95      1009  
           1       0.87      0.86      0.87       400  

    accuracy                           0.92      1409    
   macro avg       0.91      0.90      0.91      1409    
weighted avg       0.92      0.92      0.92      1409  
```

#### **Random Forest**  
**Accuracy:** `0.9212207239176721`

```
              precision    recall  f1-score   support

           0       0.97      0.92      0.94      1009
           1       0.82      0.92      0.87       400

    accuracy                           0.92      1409  
   macro avg       0.90      0.92      0.91      1409  
weighted avg       0.93      0.92      0.92      1409  
```

---

### **With Average Monthly Charge column**

#### **XGBoost**  
**Accuracy:** `0.9233498935415189`

```
              precision    recall  f1-score   support

           0       0.94      0.95      0.95      1009
           1       0.87      0.85      0.86       400

    accuracy                           0.92      1409
   macro avg       0.91      0.90      0.91      1409
weighted avg       0.92      0.92      0.92      1409
```

#### **Random Forest**  
**Accuracy:** `0.9226401703335699`

```
              precision    recall  f1-score   support

           0       0.97      0.92      0.94      1009
           1       0.82      0.93      0.87       400

    accuracy                           0.92      1409
   macro avg       0.90      0.92      0.91      1409
weighted avg       0.93      0.92      0.92      1409
```

---

### **Dropped Total Charges column**

#### **XGBoost**  
**Accuracy:** `0.9176721078779276`

```
              precision    recall  f1-score   support

           0       0.94      0.95      0.94      1009
           1       0.87      0.84      0.85       400

    accuracy                           0.92      1409
   macro avg       0.90      0.89      0.90      1409
weighted avg       0.92      0.92      0.92      1409
```

#### **Random Forest**  
**Accuracy:** `0.9233498935415189`

```
              precision    recall  f1-score   support

           0       0.97      0.92      0.95      1009
           1       0.83      0.92      0.87       400

    accuracy                           0.92      1409
   macro avg       0.90      0.92      0.91      1409
weighted avg       0.93      0.92      0.92      1409
```


Set up new hyperparameter tuning with cross-validation.

In [None]:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import make_scorer, f1_score
from scipy.stats import randint, uniform
import numpy as np

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1 = make_scorer(f1_score)
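`StratifiedKFold` keeps the class ratio roughly constant across folds, which matters when F1 on the minority class is the scoring metric. A small sketch on made-up labels (the name `demo_cv` is used to avoid overwriting `cv` above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up target with 30% positives; the features are placeholders.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive rate inside each validation fold.
fold_rates = [y_demo[val_idx].mean() for _, val_idx in demo_cv.split(X_demo, y_demo)]
print(fold_rates)
```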

RandomForest tuning

In [21]:
rf = RandomForestClassifier(random_state=42)

rf_params = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['sqrt', 'log2', None]
}

rf_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=rf_params,
    n_iter=50,
    cv=cv,
    scoring=f1,
    n_jobs=-1,
    verbose=2,
    random_state=42
)

rf_search.fit(x_train, y_train)
print("Best RF Params:", rf_search.best_params_)
print("Best RF CV F1:", rf_search.best_score_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best RF Params: {'max_depth': 33, 'max_features': 'sqrt', 'min_samples_leaf': 12, 'min_samples_split': 9, 'n_estimators': 571}
Best RF CV F1: 0.8700275538331093
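The search spaces in this notebook are built from `scipy.stats` distributions, whose arguments are easy to misread: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers `[loc, loc + scale]` rather than `[loc, scale]`. So `uniform(0.01, 0.3)` samples learning rates up to 0.31, and `uniform(0.6, 0.4)` samples values between 0.6 and 1.0. A short sketch to confirm the ranges:

```python
from scipy.stats import randint, uniform

int_dist = randint(100, 1000)    # integers from 100 to 999
float_dist = uniform(0.01, 0.3)  # floats from 0.01 to 0.31

int_samples = int_dist.rvs(size=1000, random_state=0)
float_samples = float_dist.rvs(size=1000, random_state=0)

print("randint support:", int_dist.support())
print("uniform support:", float_dist.support())
print("sampled float range:", float_samples.min(), float_samples.max())
```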


XGBoost tuning

In [None]:
# Named xgb_clf so the `xgb` module imported earlier is not shadowed.
# `use_label_encoder` is omitted: this XGBoost version ignores it and only emits a warning.
xgb_clf = XGBClassifier(eval_metric='logloss', random_state=42)

xgb_params = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'gamma': [0, 1, 5]
}

xgb_search = RandomizedSearchCV(
    estimator=xgb_clf,
    param_distributions=xgb_params,
    n_iter=50,
    cv=cv,
    scoring=f1,
    n_jobs=-1,
    verbose=2,
    random_state=42
)

xgb_search.fit(x_train, y_train)
print("Best XGB Params:", xgb_search.best_params_)
print("Best XGB CV F1:", xgb_search.best_score_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best XGB Params: {'colsample_bytree': np.float64(0.941203782186944), 'gamma': 5, 'learning_rate': np.float64(0.10170910578615455), 'max_depth': 9, 'n_estimators': 260, 'subsample': np.float64(0.726768802062511)}
Best XGB CV F1: 0.870493721141764



Final evaluation of the tuned models on the test set.

In [None]:
from sklearn.metrics import classification_report

best_rf = rf_search.best_estimator_
best_xgb = xgb_search.best_estimator_

rf_preds = best_rf.predict(x_test)
xgb_preds = best_xgb.predict(x_test)

print("Random Forest:\n", classification_report(y_test, rf_preds))
print("XGBoost:\n", classification_report(y_test, xgb_preds))

Random Forest:
               precision    recall  f1-score   support

           0       0.94      0.96      0.95      1009
           1       0.89      0.85      0.87       400

    accuracy                           0.93      1409
   macro avg       0.92      0.90      0.91      1409
weighted avg       0.93      0.93      0.93      1409

XGBoost:
               precision    recall  f1-score   support

           0       0.95      0.95      0.95      1009
           1       0.87      0.86      0.87       400

    accuracy                           0.93      1409
   macro avg       0.91      0.91      0.91      1409
weighted avg       0.93      0.93      0.93      1409
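The class 1 F1 scores in the two reports can be sanity-checked from the printed precision and recall, since F1 is their harmonic mean. Because the report rounds to two decimals, recomputing from the printed values may differ from the reported F1 in the last digit. The helper name `f1_from` is introduced here for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 1 precision and recall as printed in the reports above (rounded values).
rf_f1 = f1_from(0.89, 0.85)
xgb_f1 = f1_from(0.87, 0.86)

print(f"Random Forest F1 (class 1): {rf_f1:.3f}")
print(f"XGBoost F1 (class 1): {xgb_f1:.3f}")
```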

