# **End-to-End Customer Churn Analysis**
## **Cost-Sensitive Modeling, Risk Segmentation & Cohort-Based Retention Strategy**

Customer churn refers to customers discontinuing a subscription-based service.
In real businesses, predicting churn alone is not sufficient unless predictions are translated into actionable retention strategies.

This project goes beyond basic churn prediction by:

(i)   Optimizing models for business cost rather than accuracy

(ii)  Segmenting customers into actionable risk tiers

(iii) Analyzing churn behavior across customer lifecycle cohorts

The objective is to simulate how churn is handled in practice at large subscription businesses, such as FAANG companies.

### **🔵 PHASE 1 – Data Understanding & Baseline Modeling**

In [1]:
import pandas as pd
import numpy as np

# Load dataset using relative path
df = pd.read_csv("../data/telco_churn.csv")

# Convert TotalCharges to numeric (it contains blank spaces)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Drop rows with missing values
df.dropna(inplace=True)

# Drop customerID (not predictive)
df.drop('customerID', axis=1, inplace=True)

df.head()

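|   | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| - | ------ | ------------- | ------- | ---------- | ------ | ------------ | ------------- | --------------- | -------------- | ------------ | ---------------- | ----------- | ----------- | --------------- | -------- | ---------------- | ------------- | -------------- | ------------ | ----- |
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |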


### **🔵 PHASE 1 – DATA ENCODING**

In [2]:
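from sklearn.preprocessing import LabelEncoder

# Fit one encoder per categorical column and keep it,
# so the integer codes can be mapped back to labels later
encoders = {}

for col in df.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])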
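
To make the encoding step concrete, the sketch below mimics in plain Python what `LabelEncoder` does: it assigns integer codes in sorted order of the distinct values, which is why `Churn` ends up as `No` = 0 and `Yes` = 1 in the reports that follow. The helper `label_encode` and the toy columns are illustrative and not part of the notebook.

```python
def label_encode(values):
    """Mimic LabelEncoder: map each distinct value to its rank in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Toy columns (illustrative values in the style of the dataset)
churn_codes, churn_map = label_encode(["No", "No", "Yes", "No", "Yes"])
contract_codes, contract_map = label_encode(
    ["Month-to-month", "One year", "Two year", "Month-to-month"]
)

print(churn_map)     # {'No': 0, 'Yes': 1}
print(contract_map)  # {'Month-to-month': 0, 'One year': 1, 'Two year': 2}
```

Because the codes follow alphabetical order, they carry no business meaning for multi-valued columns such as `Contract`; tree models tolerate this, while linear models may read it as an ordering.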

### **🔵 PHASE 1 – TRAIN-TEST SPLIT**

In [3]:
from sklearn.model_selection import train_test_split

X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

### **🔵 PHASE 1 – BASELINE MODEL (LOGISTIC REGRESSION)**

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
lr = LogisticRegression(
    max_iter=5000,
    solver='lbfgs'
)

lr.fit(X_train_scaled, y_train)

# Predict
y_pred_lr = lr.predict(X_test_scaled)

# Evaluation
print("Baseline Logistic Regression")
print(classification_report(y_test, y_pred_lr))


Baseline Logistic Regression
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.62      0.49      0.55       374

    accuracy                           0.79      1407
   macro avg       0.73      0.69      0.70      1407
weighted avg       0.77      0.79      0.78      1407



In [5]:
from sklearn.metrics import precision_score, recall_score

precision_lr = precision_score(y_test, y_pred_lr)
recall_lr = recall_score(y_test, y_pred_lr)


### **🔵 PHASE 2 – COST-SENSITIVE MODELING**

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Weight errors on churners (class 1) five times more than errors on non-churners
rf_cost = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    class_weight={0: 1, 1: 5}
)

rf_cost.fit(X_train, y_train)
y_pred_rf = rf_cost.predict(X_test)

print("Cost-Sensitive Random Forest")
print(classification_report(y_test, y_pred_rf))

Cost-Sensitive Random Forest
              precision    recall  f1-score   support

           0       0.82      0.91      0.86      1033
           1       0.63      0.44      0.52       374

    accuracy                           0.78      1407
   macro avg       0.72      0.67      0.69      1407
weighted avg       0.77      0.78      0.77      1407



In [7]:
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)

### **🔵 PHASE 2 – CONFUSION MATRIX COMPARISON**

In [8]:
from sklearn.metrics import confusion_matrix

print("Logistic Regression Confusion Matrix")
print(confusion_matrix(y_test, y_pred_lr))

print("\nCost-Sensitive Random Forest Confusion Matrix")
print(confusion_matrix(y_test, y_pred_rf))

comparison = pd.DataFrame({
    "Model": ["Logistic Regression", "Cost-Sensitive Random Forest"],
    "Recall (Churn)": [recall_lr, recall_rf],
    "Precision (Churn)": [precision_lr, precision_rf]
})

comparison


Logistic Regression Confusion Matrix
[[920 113]
 [189 185]]

Cost-Sensitive Random Forest Confusion Matrix
[[937  96]
 [209 165]]


|   | Model                        | Recall (Churn) | Precision (Churn) |
| - | ---------------------------- | -------------- | ----------------- |
| 0 | Logistic Regression          | 0.494652       | 0.620805          |
| 1 | Cost-Sensitive Random Forest | 0.441176       | 0.632184          |
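
The recall and precision figures can be recomputed by hand from the two confusion matrices printed above. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that `confusion_matrix` uses for binary labels 0 and 1; the helper name `precision_recall` is illustrative.

```python
def precision_recall(cm):
    """cm is [[TN, FP], [FN, TP]], with churn as the positive class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)  # share of flagged customers who really churn
    recall = tp / (tp + fn)     # share of actual churners who get flagged
    return precision, recall


cm_lr = [[920, 113], [189, 185]]
cm_rf = [[937, 96], [209, 165]]

for name, cm in [("Logistic Regression", cm_lr),
                 ("Cost-Sensitive Random Forest", cm_rf)]:
    p, r = precision_recall(cm)
    print(f"{name}: precision={p:.6f}, recall={r:.6f}")
```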


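Logistic Regression reaches 0.62 precision on churners, but its recall of 0.49 means about half of them are missed, which is insufficient for churn use cases. In this run the cost-sensitive Random Forest does not close that gap at the default 0.5 threshold: its recall is lower (0.44, with 209 missed churners versus 189) while its precision is marginally higher (0.63). Class weighting alone is therefore not enough here, which motivates working with predicted probabilities and adjustable thresholds in the next phase.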
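
To compare the two models on business cost rather than on precision and recall separately, each confusion matrix can be weighted by a cost per error type. This is a minimal sketch: the cost values are assumptions (a missed churner is taken to cost five times an unnecessary retention offer, echoing the 1:5 class weights), not figures from the project.

```python
COST_FN = 5.0  # assumed cost of a missed churner (false negative)
COST_FP = 1.0  # assumed cost of an unnecessary retention offer (false positive)


def total_cost(cm, cost_fn=COST_FN, cost_fp=COST_FP):
    """Total error cost for a [[TN, FP], [FN, TP]] confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return fn * cost_fn + fp * cost_fp


cost_lr = total_cost([[920, 113], [189, 185]])
cost_rf = total_cost([[937, 96], [209, 165]])

print(cost_lr, cost_rf)  # 1058.0 1141.0
```

Under these assumed costs the baseline comes out cheaper on the test set; a different cost ratio can change the ranking, so the ratio should come from actual retention economics.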

### **🔵 PHASE 3 – RISK-BASED CUSTOMER SEGMENTATION**

Instead of using a fixed 0.5 classification threshold, this project focuses on churn probabilities. Thresholds can be adjusted based on business tolerance for false positives versus false negatives, enabling flexible retention strategies.

Risk segmentation enables prioritization under limited retention budgets, which is critical in large-scale systems.
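
One way to adjust the threshold is to sweep a grid of candidates and keep the one with the lowest total error cost. The sketch below is dependency-free and runs on made-up labels and probabilities; the 5:1 cost ratio and the helper names are assumptions for illustration.

```python
def cost_at_threshold(y_true, proba, threshold, cost_fn=5.0, cost_fp=1.0):
    """Total cost when customers with proba >= threshold are flagged as churners."""
    fn = sum(1 for y, p in zip(y_true, proba) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= threshold)
    return fn * cost_fn + fp * cost_fp


def best_threshold(y_true, proba, candidates, **costs):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: cost_at_threshold(y_true, proba, t, **costs))


# Hypothetical labels (1 = churn) and predicted churn probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
proba = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.55, 0.70, 0.80, 0.90]

candidates = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
t_star = best_threshold(y_true, proba, candidates)

print(t_star, cost_at_threshold(y_true, proba, t_star))  # 0.4 2.0
```

In practice the sweep would run on a held-out validation set, so that the chosen threshold is not tuned to the data used for the final evaluation.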

In [10]:
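# Predict churn probabilities for every customer
# (note: X includes the training rows, so their scores are optimistic)
df['churn_probability'] = rf_cost.predict_proba(X)[:, 1]

# Create risk segments; bins are right-inclusive: (0, 0.3], (0.3, 0.6], (0.6, 1.0]
# (a probability of exactly 0 falls outside the first bin and becomes NaN)
df['risk_segment'] = pd.cut(
    df['churn_probability'],
    bins=[0, 0.3, 0.6, 1.0],
    labels=['Low Risk', 'Medium Risk', 'High Risk']
)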

df[['churn_probability', 'risk_segment']].head()

|   | churn_probability | risk_segment |
| - | ----------------- | ------------ |
| 0 | 0.226667          | Low Risk     |
| 1 | 0.01              | Low Risk     |
| 2 | 0.803333          | High Risk    |
| 3 | 0.02              | Low Risk     |
| 4 | 0.873333          | High Risk    |


### **🔵 PHASE 3 – SEGMENT-LEVEL ANALYSIS**

In [11]:
df.groupby('risk_segment', observed=True)[['MonthlyCharges', 'tenure']].mean()

| risk_segment | MonthlyCharges | tenure    |
| ------------ | -------------- | --------- |
| Low Risk     | 64.55302       | 36.386002 |
| Medium Risk  | 73.611243      | 19.457672 |
| High Risk    | 74.868036      | 16.708333 |


Risk thresholds (0.3 and 0.6) were selected to balance retention cost and churn likelihood. Customers above 0.6 represent a smaller but high-impact group, while medium-risk customers allow proactive intervention at manageable cost.

### **🔵 PHASE 3 – RETENTION STRATEGY**

| Risk Segment | Suggested Business Action            |
| ------------ | ------------------------------------ |
| High Risk    | Immediate discounts, retention calls |
| Medium Risk  | Contract upgrades, incentives        |
| Low Risk     | Loyalty rewards                      |
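
The table above can be turned into a small lookup that maps a churn probability to its segment and suggested action. This is a sketch using the 0.3 and 0.6 thresholds from the segmentation step; the function names are illustrative, and a probability of exactly 0 is treated as Low Risk here.

```python
RISK_ACTIONS = {
    "Low Risk": "Loyalty rewards",
    "Medium Risk": "Contract upgrades, incentives",
    "High Risk": "Immediate discounts, retention calls",
}


def risk_segment(probability, low=0.3, high=0.6):
    """Assign a segment using right-inclusive thresholds (0.3 is still Low Risk)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability <= low:
        return "Low Risk"
    if probability <= high:
        return "Medium Risk"
    return "High Risk"


def retention_action(probability):
    """Look up the suggested business action for a churn probability."""
    return RISK_ACTIONS[risk_segment(probability)]


for p in (0.226667, 0.01, 0.803333):
    print(p, risk_segment(p), "->", retention_action(p))
```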

### **🔵 PHASE 4 – COHORT-BASED CHURN ANALYSIS**

In [12]:
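# Create tenure cohorts (bins are right-inclusive: (0, 6], (6, 12], (12, 24], (24, 60])
# Note: pd.cut assigns NaN to any tenure above 60 months, so such customers are
# left out of '2+ years' and of the rates below; raise the last edge to include them
df['tenure_cohort'] = pd.cut(
    df['tenure'],
    bins=[0, 6, 12, 24, 60],
    labels=['0–6 months', '6–12 months', '1–2 years', '2+ years']
)

# Churn rate per cohort
df.groupby('tenure_cohort', observed=True)['Churn'].mean()

tenure_cohort
0–6 months     0.533333
6–12 months    0.358865
1–2 years      0.287109
2+ years       0.183430
Name: Churn, dtype: float64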
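
For reference, the cohort assignment can also be written without pandas. The sketch below is a variation, not the notebook's method: it keeps the same right-inclusive edges but leaves the last cohort open-ended, so long tenures still land in `2+ years`. The records are made up.

```python
from bisect import bisect_left
from collections import defaultdict

EDGES = [6, 12, 24]  # upper bounds in months, right-inclusive
LABELS = ["0–6 months", "6–12 months", "1–2 years", "2+ years"]


def tenure_cohort(tenure_months):
    """Map a tenure to its cohort; anything above 24 months is '2+ years'."""
    return LABELS[bisect_left(EDGES, tenure_months)]


def churn_rate_by_cohort(records):
    """records: iterable of (tenure_months, churned) with churned in {0, 1}."""
    totals, churned = defaultdict(int), defaultdict(int)
    for tenure, flag in records:
        cohort = tenure_cohort(tenure)
        totals[cohort] += 1
        churned[cohort] += flag
    return {c: churned[c] / totals[c] for c in LABELS if totals[c]}


# Hypothetical (tenure, churned) pairs
toy = [(1, 1), (2, 1), (6, 0), (9, 1), (12, 0),
       (20, 0), (45, 0), (72, 1), (70, 0), (65, 0)]
rates = churn_rate_by_cohort(toy)

for cohort, rate in rates.items():
    print(f"{cohort}: {rate:.2f}")
```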

### **🔵 PHASE 5 – FINAL INSIGHTS & REFLECTION**

*Key Findings*

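(i)   Class weighting alone did not reduce missed churn cases at the default 0.5 threshold (209 missed by the cost-sensitive Random Forest versus 189 by the baseline); recall gains also depend on the decision threshold.

(ii)  High-risk customers show low tenure (about 17 months on average) and high monthly charges (about 75).

(iii) The churn rate is highest in the first 6 months (about 53%) and falls steadily with tenure.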

*Limitations*

(i)  No behavioral or support interaction data

(ii) Static analysis without causal inference

*Next Steps*

(i)   A/B test retention strategies

(ii)  Incorporate customer support logs

(iii) Introduce revenue-weighted cost functions

### **🔵 PHASE 6 – PRODUCTION & MONITORING**

In a production environment, this model would require monitoring for data drift and performance degradation. Customer behavior may change due to pricing updates or market competition, necessitating periodic retraining and threshold recalibration.
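
As one possible drift check, the sketch below computes the Population Stability Index (PSI) between a training-time sample and a recent sample of a single feature. PSI is not used in the original notebook; it is named here only as a common choice, and the bin edges and toy values are assumptions.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index of two samples over shared bin edges."""
    def shares(sample):
        counts = [0] * (len(edges) - 1)
        for x in sample:
            for i in range(len(edges) - 1):
                last = i == len(edges) - 2
                if edges[i] <= x < edges[i + 1] or (last and x == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / total, eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical MonthlyCharges samples: training period vs. recent traffic
train_sample = [20, 30, 45, 55, 70, 80, 90, 100]
live_sample = [60, 70, 75, 85, 90, 95, 100, 110]
edges = [0, 40, 80, 120]

print(psi(train_sample, train_sample, edges))  # 0.0
print(psi(train_sample, live_sample, edges))
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alert level should be calibrated per feature before it triggers retraining.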

## Production Model Training & Artifact Saving

In this section, we train the final selected churn model
and save the trained artifacts for deployment.

This enables a clear separation between:
- Model development (notebook)
- Model inference (production app)


In [15]:
import joblib
from pathlib import Path

# Define project root (one level above notebooks/)
PROJECT_ROOT = Path.cwd().parent

# Create models directory
MODELS_PATH = PROJECT_ROOT / "models"
MODELS_PATH.mkdir(exist_ok=True)

# Final selected model: assumed to be the cost-sensitive Random Forest trained above,
# since no variable named `model` is defined earlier in this notebook
model = rf_cost

# Save trained model
joblib.dump(model, MODELS_PATH / "churn_model.pkl")

# Save feature columns for inference consistency
joblib.dump(X.columns.tolist(), MODELS_PATH / "feature_columns.pkl")

print("Model artifacts saved successfully:")
print(list(MODELS_PATH.iterdir()))
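
On the inference side, the saved feature list is what keeps incoming records in the column order the model was trained on. The sketch below shows that alignment step in isolation; the function name, the three-column feature list and the request payload are hypothetical, and loading the real artifacts with `joblib.load` is left out.

```python
def align_features(record, feature_columns):
    """Return the record's values in training column order, failing on gaps."""
    missing = [c for c in feature_columns if c not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Extra keys in the record (for example an ID) are ignored
    return [record[c] for c in feature_columns]


# Hypothetical subset of the saved feature list and an incoming request
feature_columns = ["tenure", "MonthlyCharges", "TotalCharges"]
request = {
    "MonthlyCharges": 70.7,
    "TotalCharges": 151.65,
    "tenure": 2,
    "customerID": "ignored",
}

row = align_features(request, feature_columns)
print(row)  # [2, 70.7, 151.65]
```

Categorical fields would also need the same integer encoding that was applied during training before the row is passed to the model.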