# Customer Churn Modeling

## Objective
Build and evaluate classification models to predict customer churn using the cleaned dataset.

This notebook covers all modeling tasks:

- ✅ Train/test split and feature scaling (as needed)
- ✅ Model training: Logistic Regression, Random Forest, Gradient Boosting
- ✅ Cross-validation and hyperparameter tuning
- ✅ Compare models using AUC-ROC and accuracy
- ✅ Interpret results and select the best model for deployment

**Goal**: Build a predictive model that accurately forecasts customer churn and provides actionable insights for retention.

# Setup

## Imports 

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Data Preparation

## Data Loading

In [3]:
df = pd.read_csv('datasets/churn_cleaned.csv')
df.head()

| Unnamed: 0 | monthlycharges | totalcharges | seniorcitizen | churn | tenure_months | type_One year | type_Two year | paperlessbilling_Yes | paymentmethod_Credit card (automatic) | paymentmethod_Electronic check | ... | streamingtv_No internet | streamingtv_Yes | streamingmovies_No internet | streamingmovies_Yes | multiplelines_No phone service | multiplelines_Yes | contract_type | payment_method | paperless | internet_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 29.85 | 29.85 | 0 | 1 | 0.0 | False | False | True | False | True | ... | False | False | False | False | True | False | Month-to-month | Electronic Check | Yes | DSL |
| 1 | 56.95 | 1889.5 | 0 | 1 | 34.0 | True | False | False | False | False | ... | False | False | False | False | False | False | One year | Mailed Check | No | DSL |
| 2 | 53.85 | 108.15 | 0 | 0 | 2.0 | False | False | True | False | False | ... | False | False | False | False | False | False | Month-to-month | Mailed Check | Yes | DSL |
| 3 | 42.3 | 1840.75 | 0 | 1 | 45.0 | True | False | False | False | False | ... | False | False | False | False | True | False | One year | Other | No | DSL |
| 4 | 70.7 | 151.65 | 0 | 0 | 2.0 | False | False | True | False | True | ... | False | False | False | False | False | False | Month-to-month | Electronic Check | Yes | Fiber Optic |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 34 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   monthlycharges                         7032 non-null   float64
 1   totalcharges                           7032 non-null   float64
 2   seniorcitizen                          7032 non-null   int64  
 3   churn                                  7032 non-null   int64  
 4   tenure_months                          7032 non-null   float64
 5   type_One year                          7032 non-null   bool   
 6   type_Two year                          7032 non-null   bool   
 7   paperlessbilling_Yes                   7032 non-null   bool   
 8   paymentmethod_Credit card (automatic)  7032 non-null   bool   
 9   paymentmethod_Electronic check         7032 non-null   bool   
 10  paymentmethod_Mailed check             7032 non-null   bool   
 11  gend

## Feature and Target Separation

In [4]:
X = df.drop('churn', axis=1)
y = df['churn']

## Train/Test Split

- Stratification ensures that the distribution of the target variable (churn) is approximately the same in both the training and test sets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

In [8]:
# Verify stratification: class proportions should match across the three sets
# Full dataset
print("Full dataset:")
print(y.value_counts(normalize=True))

# Check train set
print("\nTrain set:")
print(y_train.value_counts(normalize=True))

# Check test set
print("\nTest set:")
print(y_test.value_counts(normalize=True))

Full dataset:
churn
1    0.734215
0    0.265785
Name: proportion, dtype: float64

Train set:
churn
1    0.734168
0    0.265832
Name: proportion, dtype: float64

Test set:
churn
1    0.734357
0    0.265643
Name: proportion, dtype: float64

- Class `1` is the majority here (about 73% of rows). If `1` is meant to flag customers who churned, it is worth confirming the label encoding in the cleaning step, because the choice of positive class determines how precision, recall, and the predicted probabilities are read later.


## Scale Numeric Features

In [9]:
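numeric_features = ['monthlycharges', 'totalcharges', 'tenure_months']

# Work on explicit copies so the assignments below do not touch the original frame
X_train = X_train.copy()
X_test = X_test.copy()

# Fit the scaler on the training set only, then reuse it on the test set to avoid leakage
scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])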
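
Once cross-validation is added, the scaler has to be refit inside each fold. Otherwise the validation rows influence the scaling statistics. One way to do this is to wrap the scaling step and the classifier in a single `Pipeline`. The sketch below is a minimal illustration on synthetic data: the `demo_X` and `demo_y` frames are made-up stand-ins rather than the churn dataset, and only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: values are random, not real customers).
rng = np.random.default_rng(42)
n_rows = 200
demo_X = pd.DataFrame({
    "monthlycharges": rng.uniform(20, 120, n_rows),
    "totalcharges": rng.uniform(20, 8000, n_rows),
    "tenure_months": rng.integers(1, 72, n_rows).astype(float),
    "paperlessbilling_Yes": rng.integers(0, 2, n_rows).astype(bool),
})
demo_y = pd.Series(rng.integers(0, 2, n_rows), name="churn")

numeric_features = ["monthlycharges", "totalcharges", "tenure_months"]

# Scale only the numeric columns; pass the one-hot columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), numeric_features)],
    remainder="passthrough",
)

pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler is refit on the training part of every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, demo_X, demo_y, cv=cv, scoring="roc_auc")
print("Mean ROC AUC across folds:", round(scores.mean(), 4))
```

Because the labels here are random, the score itself carries no meaning. The point is only the structure of the pipeline.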

# Modeling
## Baseline Model: Logistic Regression

In [10]:
# Initialize and train model
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)

# Predict labels and probabilities
y_pred = lr.predict(X_test)
y_prob = lr.predict_proba(X_test)[:, 1]  # probability of class 1 (churn)

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC Score:", round(roc_auc_score(y_test, y_prob), 4))

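ValueError: could not convert string to float: 'Month-to-month'

- The fit fails because `X` still contains the string-valued columns `contract_type`, `payment_method`, `paperless`, and `internet_type` (visible in the `df.head()` output). Their information appears to be captured already by one-hot columns such as `type_One year`, so they should be dropped, or encoded, before the model is trained.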
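
A `ValueError` of this kind usually means a text column reached the estimator. The sketch below shows one possible remedy: find the remaining non-numeric columns with `select_dtypes` and drop them before fitting. The `toy` frame is a made-up miniature of the real table, and dropping (rather than encoding) the text columns is an assumption that only holds if they duplicate existing one-hot features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame mimicking the layout of the cleaned data (values are made up).
toy = pd.DataFrame({
    "monthlycharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "tenure_months": [1.0, 34.0, 2.0, 45.0, 2.0, 8.0],
    "type_One year": [False, True, False, True, False, False],
    "contract_type": ["Month-to-month", "One year", "Month-to-month",
                      "One year", "Month-to-month", "Month-to-month"],
    "churn": [1, 1, 0, 1, 0, 0],
})

# Columns that are neither numeric nor boolean cannot be fed to the model as-is.
text_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print("Dropping text columns:", text_cols)

# Separate features and target, leaving the text columns out.
X_toy = toy.drop(columns=["churn"] + text_cols)
y_toy = toy["churn"]

# With only numeric and boolean features left, the fit no longer raises.
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, y_toy)
toy_pred = toy_model.predict(X_toy)
```

In the notebook itself, the equivalent change would go in the feature and target separation cell, so that the split and scaling steps operate on the reduced feature set.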

## Compare with Other Models

- Decision Tree
- Random Forest
- Gradient Boosting (optional)
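
The candidate models listed above could be compared with stratified cross-validation, followed by a small grid search on one of them. The sketch below is illustrative only: it uses synthetic data from `make_classification` instead of the churn features, and the hyperparameter values and grid are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.27, 0.73], random_state=42,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Use the same folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (std {scores.std():.3f})")

# Small illustrative grid for the random forest.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=cv, scoring="roc_auc",
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", round(search.best_score_, 3))
```

Tree-based models do not need scaled inputs, whereas logistic regression on the real data would benefit from the scaling step applied earlier.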



# Evaluation
- Confusion Matrix
- Classification Report
- ROC AUC
- Feature Importance (if applicable)
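
The four evaluation items above might be produced along the following lines for a tree-based model. This is a sketch under stated assumptions: the data is synthetic, the `feature_i` column names are placeholders, and impurity-based `feature_importances_` is only one of several ways to rank features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names (assumption).
X_arr, y_arr = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.25, random_state=42, stratify=y_arr,
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

demo_pred = rf.predict(X_te)               # hard class labels
demo_prob = rf.predict_proba(X_te)[:, 1]   # probability of class 1

# Confusion matrix, per-class report, and ROC AUC on the held-out set.
cm = confusion_matrix(y_te, demo_pred)
auc = roc_auc_score(y_te, demo_prob)
print("Confusion Matrix:\n", cm)
print("\nClassification Report:\n", classification_report(y_te, demo_pred))
print("ROC AUC Score:", round(auc, 4))

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print("\nFeature importances:\n", importances)
```

For logistic regression, the fitted `coef_` values on scaled features play a comparable role to the importances shown here.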