Classes are imbalanced (~26% churn)

False negatives are expensive (losing customers)

We will use a stratified train/test split so both sets keep the same churn rate

We will evaluate using recall + ROC-AUC

And start with Logistic Regression.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("../data/cleaned.csv")
X = df.drop("Churn", axis=1)
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [4]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


Parameter,Value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,1000


In [5]:
from sklearn.metrics import classification_report, roc_auc_score

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))


              precision    recall  f1-score   support

           0       0.84      0.89      0.86      1035
           1       0.63      0.52      0.57       374

    accuracy                           0.79      1409
   macro avg       0.74      0.71      0.72      1409
weighted avg       0.78      0.79      0.79      1409

ROC-AUC: 0.8342555994729907
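
To make the ROC-AUC number easier to read: it can be interpreted as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way on a tiny made-up example. `pairwise_auc` and the toy data are illustrative names and values, not part of the notebook, and this brute-force version is only practical for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration, not taken from the churn data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Read this way, the 0.83 above says the model ranks a churner above a non-churner in roughly 83% of such pairs, independent of any decision threshold.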


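At the default 0.5 threshold, churn-class precision (0.63) is higher than recall (0.52), so almost half of the churners are missed -> I can try lowering the decision threshold to trade precision for recall.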

In [6]:
from sklearn.metrics import classification_report


def apply_threshold(probs, threshold):
    """Label a customer as churn (1) when the predicted probability is >= threshold."""
    return (probs >= threshold).astype(int)


# Compare churn-class precision and recall as the threshold is lowered.
for t in [0.5, 0.4, 0.3]:
    y_pred_t = apply_threshold(y_prob, t)
    print(f"\nThreshold = {t}")
    print(classification_report(y_test, y_pred_t))



Threshold = 0.5
              precision    recall  f1-score   support

           0       0.84      0.89      0.86      1035
           1       0.63      0.52      0.57       374

    accuracy                           0.79      1409
   macro avg       0.74      0.71      0.72      1409
weighted avg       0.78      0.79      0.79      1409


Threshold = 0.4
              precision    recall  f1-score   support

           0       0.87      0.83      0.85      1035
           1       0.58      0.65      0.61       374

    accuracy                           0.78      1409
   macro avg       0.72      0.74      0.73      1409
weighted avg       0.79      0.78      0.78      1409


Threshold = 0.3
              precision    recall  f1-score   support

           0       0.90      0.74      0.81      1035
           1       0.52      0.76      0.62       374

    accuracy                           0.75      1409
   macro avg       0.71      0.75      0.72      1409
weighted avg       0.80

Of the three thresholds, 0.4 gives the best balance. Recall jumps from 52% to 65%, while precision only drops from 63% to 58%. Since a false negative (a missed churner) is worse than a false positive (an unnecessary contact) in our case, the gain in recall is worth the precision drop. Going down to 0.3 would push recall to 76%, but precision falls to 52% and accuracy to 75%.
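
One way to make this trade-off explicit is to attach a cost to each kind of error and see which threshold comes out cheapest. The sketch below does this using the rounded churn-class precision and recall from the reports above, so the derived error counts are approximate. The cost ratios are hypothetical (the notebook does not state any), and `approx_errors` and `cheapest_threshold` are illustrative helper names.

```python
# Churn-class metrics as printed in the classification reports above
# (rounded to two decimals, so the derived counts are approximate).
N_CHURNERS = 374
reported = {
    0.5: {"precision": 0.63, "recall": 0.52},
    0.4: {"precision": 0.58, "recall": 0.65},
    0.3: {"precision": 0.52, "recall": 0.76},
}


def approx_errors(precision, recall, n_positive):
    """Estimate (false negatives, false positives) from precision and recall."""
    true_pos = recall * n_positive
    false_neg = n_positive - true_pos
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1.0 - precision) / precision
    return false_neg, false_pos


def cheapest_threshold(cost_ratio):
    """Return the threshold with the lowest estimated cost.

    cost_ratio is a HYPOTHETICAL value: how many times more a missed churner
    costs than an unnecessary retention contact.
    """
    costs = {}
    for threshold, m in reported.items():
        false_neg, false_pos = approx_errors(m["precision"], m["recall"], N_CHURNERS)
        costs[threshold] = cost_ratio * false_neg + false_pos
    return min(costs, key=costs.get)


for ratio in (1, 2, 5):
    print(f"cost ratio {ratio}:1 -> cheapest threshold {cheapest_threshold(ratio)}")
```

With these rounded figures the sketch picks 0.5, 0.4 and 0.3 for ratios of 1, 2 and 5. So choosing 0.4 corresponds to a missed churner costing roughly twice as much as an unnecessary contact, and a higher ratio would argue for an even lower threshold.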

In [7]:
feature_importance = pd.DataFrame({
    "Feature": X_train.columns,
    "Coefficient": model.coef_[0]
}).sort_values(by="Coefficient", ascending=False)

feature_importance

Unnamed: 0,Feature,Coefficient
4,InternetService_Fiber optic,0.920874
7,PaymentMethod_Electronic check,0.498621
3,SeniorCitizen,0.284976
8,PaymentMethod_Mailed check,0.035078
1,MonthlyCharges,0.004713
0,tenure,-0.030792
6,PaymentMethod_Credit card (automatic),-0.034537
5,InternetService_No,-0.766229
2,Contract,-0.795433


Holding everything else constant, fiber customers are far more likely to leave.
This pattern is common in real telecom data. Plausible reasons:
- Fiber users pay more
- They have higher expectations
- They churn faster when unhappy

People paying via electronic check churn more. Auto-pay customers tend to stay longer.

Senior citizens churn slightly more.

Customers on long-term contracts are dramatically less likely to churn.

Customers without internet service churn much less.
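
To put rough numbers on these statements, logistic regression coefficients can be exponentiated into odds ratios: exp(coefficient) is the factor by which the odds of churn change for a one-unit increase in that feature, with the others held fixed. The sketch below uses the coefficients from the table above. It assumes `tenure` is measured in months, which the notebook does not state explicitly.

```python
import math

# Coefficients copied from the feature_importance table above.
coefficients = {
    "InternetService_Fiber optic": 0.920874,
    "PaymentMethod_Electronic check": 0.498621,
    "SeniorCitizen": 0.284976,
    "PaymentMethod_Mailed check": 0.035078,
    "MonthlyCharges": 0.004713,
    "tenure": -0.030792,
    "PaymentMethod_Credit card (automatic)": -0.034537,
    "InternetService_No": -0.766229,
    "Contract": -0.795433,
}

# exp(coefficient) = multiplicative change in the odds of churn per one-unit
# increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {ratio:.2f}")

# tenure is not a 0/1 flag, so its per-unit coefficient looks small.
# Assumption: one unit is one month. Over 12 months the effect compounds.
tenure_12_months = math.exp(12 * coefficients["tenure"])
print(f"{'tenure (+12 months)':40s} {tenure_12_months:.2f}")
```

The features are not on a common scale here, so the raw coefficients of `tenure` and `MonthlyCharges` cannot be compared directly with those of the 0/1 indicator columns. A one-unit change in a numeric feature is much smaller than flipping an indicator.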



In [8]:
threshold = 0.4
risk_pred = (y_prob >= threshold).astype(int)

In [9]:
results = X_test.copy()
results["ActualChurn"] = y_test.values
results["ChurnProbability"] = y_prob
results["PredictedHighRisk"] = risk_pred


In [14]:
contact_list = results[results["PredictedHighRisk"] == 1]
contact_list


Unnamed: 0,tenure,MonthlyCharges,Contract,SeniorCitizen,InternetService_Fiber optic,InternetService_No,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,ActualChurn,ChurnProbability,PredictedHighRisk
2280,8,100.15,0,1,True,False,True,False,False,0,0.640800,1
4460,18,78.20,0,0,True,False,False,True,False,0,0.602441,1
5748,21,99.85,0,0,True,False,True,False,False,0,0.473063,1
3568,21,99.15,0,0,True,False,False,False,False,0,0.480856,1
1639,17,45.05,0,1,False,False,False,True,False,1,0.414426,1
...,...,...,...,...,...,...,...,...,...,...,...,...
3702,20,79.15,0,1,True,False,False,True,False,1,0.655547,1
53,8,80.65,0,1,True,False,True,False,False,1,0.619384,1
2900,1,69.25,0,1,True,False,False,True,False,1,0.765291,1
4223,3,70.30,0,0,True,False,False,False,True,0,0.593090,1


In [12]:
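# Share of actual churners that end up on the contact list (recall at threshold 0.4).
# Being contacted does not mean the customer is retained.
churners_contacted = contact_list["ActualChurn"].sum()
total_churners = y_test.sum()

print(churners_contacted / total_churners)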


0.6470588235294118


In [13]:
len(contact_list) / len(X_test)


0.298083747338538

At a threshold of 0.4 we would have contacted almost 65% of the churners by calling about 30% of the customers in the test set.
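
The two ratios above can be combined into a lift figure, that is, how much better the contact list is than calling the same number of customers at random. The sketch below is a rough calculation using only numbers printed earlier in the notebook (test-set size and churner count from the classification report, plus the two ratios). The variable names are illustrative.

```python
# Figures taken from the notebook outputs above.
n_test = 1409                      # rows in the test set
n_churners = 374                   # actual churners in the test set
recall_at_04 = 0.6470588235294118  # share of churners on the contact list
contact_rate = 0.298083747338538   # share of customers on the contact list

# Recover whole-number counts from the printed ratios.
churners_contacted = round(recall_at_04 * n_churners)
customers_contacted = round(contact_rate * n_test)

# Precision of the contact list versus the churn base rate in the test set.
precision = churners_contacted / customers_contacted
base_rate = n_churners / n_test

# Lift: how many times more churners the list contains than a random
# sample of the same size would be expected to contain.
lift = precision / base_rate
print(churners_contacted, customers_contacted, round(precision, 3), round(lift, 2))
```

On these numbers the list contains a little over twice as many churners as a random sample of the same size would be expected to contain. That matches the 0.58 churn-class precision reported at threshold 0.4.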