# Customer Churn Prediction
This notebook demonstrates how to predict customer churn with two classification algorithms, Logistic Regression and Random Forest. We will walk through data loading, preprocessing, feature engineering, model training, evaluation, and extracting actionable business insights.

In [None]:
# Install missing packages if needed
%pip install pandas numpy matplotlib seaborn scikit-learn

# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

Collecting pandas
  Using cached pandas-2.3.2-cp312-cp312-macosx_10_13_x86_64.whl.metadata (91 kB)
Collecting numpy
  Using cached numpy-2.3.3-cp312-cp312-macosx_14_0_x86_64.whl.metadata (62 kB)
Collecting matplotlib
  Using cached matplotlib-3.10.6-cp312-cp312-macosx_10_13_x86_64.whl.metadata (11 kB)
Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.7.2-cp312-cp312-macosx_10_13_x86_64.whl.metadata (11 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)


## Load and Explore the Dataset
We will load the customer churn dataset and perform initial exploration to understand its structure and contents.

In [None]:
# Load the dataset (replace 'customer_churn.csv' with your actual file path)
df = pd.read_csv('customer_churn.csv')
df.head()

In [None]:
# Basic data exploration
df.info()
# A cell only renders its last expression, so show intermediate results explicitly
display(df.describe())
print(df['Churn'].value_counts())

## Data Preprocessing
We will handle missing values, encode categorical variables, and scale features as needed.

In [None]:
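# Data preprocessing example
df = df.dropna()  # Drop rows with missing values (customize as needed)

# Record the numeric columns before encoding so the 0/1 dummy columns are not rescaled
numeric_cols = [col for col in df.select_dtypes(include=[np.number]).columns if col != 'Churn']
categorical_cols = [col for col in df.select_dtypes(include=['object']).columns if col != 'Churn']
# Drop identifier columns (for example a customer ID) first, or each ID becomes its own dummy column
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Caution: fitting the scaler on the full dataset leaks test-set statistics into training;
# for a strict evaluation, fit it on the training split only
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])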
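One way to keep the scaler and encoder from seeing test data is to wrap them in a scikit-learn `Pipeline`, so they are fitted on the training split only. The following is a minimal sketch under stated assumptions: the toy DataFrame and its column names (`tenure`, `MonthlyCharges`, `Contract`) are hypothetical stand-ins for the real churn dataset, not taken from it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the churn dataset
toy = pd.DataFrame({
    'tenure': [1, 34, 2, 45, 8, 22, 10, 62, 13, 58, 3, 49],
    'MonthlyCharges': [29.9, 56.9, 53.9, 42.3, 99.7, 89.1, 29.8, 56.2, 49.9, 100.4, 70.7, 20.0],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month', 'One year',
                 'Month-to-month', 'Month-to-month', 'Month-to-month', 'Two year',
                 'Month-to-month', 'Two year', 'Month-to-month', 'Two year'],
    'Churn': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No'],
})

X_toy = toy.drop(columns='Churn')
y_toy = (toy['Churn'] == 'Yes').astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['tenure', 'MonthlyCharges']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Contract']),
])

# fit() learns the scaling statistics and categories from the training rows only
model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)

# The same fitted transformations are reused, unchanged, on the test rows
proba = model.predict_proba(X_te)[:, 1]
print(proba)
```

Because the whole pipeline is a single estimator, it can also be passed directly to `GridSearchCV` or `cross_val_score`, which then refit the preprocessing inside every fold.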

## Feature Selection/Engineering
Select relevant features for modeling. You can also create new features if needed.

In [None]:
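# Feature selection and train/test split
X = df.drop('Churn', axis=1)
# Encode Yes/No labels as 1/0; keep the column as-is if it is already numeric
y = df['Churn'].astype(int) if pd.api.types.is_numeric_dtype(df['Churn']) else (df['Churn'] == 'Yes').astype(int)
# stratify=y keeps the churn rate the same in the training and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)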
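Churn labels are usually imbalanced, so it can help to check that both splits keep the same churn rate. The sketch below uses hypothetical synthetic labels (20% churners) to show what the `stratify` argument of `train_test_split` does; it is an illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 churners out of 100 customers
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in both splits
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print('Train churn rate:', y_tr_s.mean())
print('Test churn rate:', y_te_s.mean())
```

Without `stratify`, the test churn rate depends on the random draw, which makes metrics on small test sets noisier.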

## Model Training
We will train both Logistic Regression and Random Forest classifiers on the training data.

In [None]:
# Train Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

## Model Evaluation
Evaluate both models using classification metrics and compare their performance.

In [None]:
# Evaluate Logistic Regression
y_pred_logreg = logreg.predict(X_test)
print('Logistic Regression Classification Report:')
print(classification_report(y_test, y_pred_logreg))

# Evaluate Random Forest
y_pred_rf = rf.predict(X_test)
print('Random Forest Classification Report:')
print(classification_report(y_test, y_pred_rf))
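The classification report summarizes precision, recall, and F1 per class. To make those numbers concrete, here is a small sketch that derives them from confusion-matrix counts. The counts are hypothetical and chosen to show how accuracy can look strong while recall on churners stays low.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive (churn) class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Hypothetical counts: 150 actual churners among 1,000 customers
metrics = classification_metrics(tp=60, fp=40, fn=90, tn=810)
for name, value in metrics.items():
    print(f'{name}: {value:.2f}')
```

With these counts, accuracy is 0.87 even though only 40% of churners are caught, which is why recall and F1 on the churn class deserve attention alongside accuracy.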

In [None]:
# Identify key drivers of churn
importances = rf.feature_importances_
features = X.columns
feat_imp = pd.Series(importances, index=features).sort_values(ascending=False)
top10 = feat_imp.head(10)  # the ten most important features
plt.figure(figsize=(10,6))
sns.barplot(x=top10.values, y=top10.index)
plt.title('Top 10 Feature Importances (Random Forest)')
plt.show()

# Logistic Regression coefficients
coefs = pd.Series(logreg.coef_[0], index=features).sort_values(key=abs, ascending=False)
print('Top Logistic Regression Coefficients:')
print(coefs.head(10))
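Logistic regression coefficients are changes in log-odds, which are hard to read directly. Exponentiating a coefficient gives an odds ratio. For a standardized feature, that is the multiplicative change in the odds of churn per one standard deviation increase. The sketch below uses hypothetical coefficient values, not results from this notebook.

```python
import math

# Hypothetical coefficients for two standardized features
example_coefs = {'tenure': -0.85, 'MonthlyCharges': 0.40}

# exp(coefficient) is the odds ratio for a one-standard-deviation increase
odds_ratios = {name: math.exp(coef) for name, coef in example_coefs.items()}

for name, ratio in odds_ratios.items():
    direction = 'raises' if ratio > 1 else 'lowers'
    print(f'{name}: odds ratio {ratio:.2f} ({direction} the odds of churn)')
```

An odds ratio below 1 means the feature is associated with lower churn odds, and a ratio above 1 with higher odds, holding the other features fixed.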

## Actionable Business Insights
Based on the model results and feature importances, summarize key drivers of churn and suggest business actions to reduce churn.

## Conclusion
This notebook demonstrated how to use Logistic Regression and Random Forest to predict customer churn, identify key drivers, and generate actionable business intelligence.

## Exploratory Data Analysis (EDA)
Let's explore the data visually to better understand churn distribution, feature relationships, and potential drivers.

In [None]:
# Churn distribution
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()

# Plot distribution of numerical features
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[num_cols].hist(figsize=(15,10), bins=20)
plt.suptitle('Numerical Feature Distributions')
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(12,8))
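# numeric_only=True skips text columns such as a Yes/No Churn label, which would otherwise raise an error
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')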
plt.title('Correlation Heatmap')
plt.show()

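# Churn rate by categorical features (example: gender, contract type)
# Churn must be numeric to average it; handle both Yes/No and 0/1 encodings
churn_flag = df['Churn'].astype(int) if pd.api.types.is_numeric_dtype(df['Churn']) else (df['Churn'] == 'Yes').astype(int)
# Detect dummy columns by their two distinct values, since their dtype differs across pandas versions
categorical_cols = [col for col in df.columns if 'Churn' not in col and df[col].nunique() == 2]
for col in categorical_cols[:3]:  # Show the first 3 dummy variables as an example
    churn_rate = churn_flag.groupby(df[col]).mean()
    churn_rate.plot(kind='bar')
    plt.title(f'Churn Rate by {col}')
    plt.ylabel('Churn Rate')
    plt.show()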
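Churn rates are often easier to read when grouped by the original category labels rather than by dummy columns. The following sketch assumes a raw, unencoded table; the `Contract` column and its values are hypothetical examples.

```python
import pandas as pd

# Hypothetical raw data, before one-hot encoding
raw = pd.DataFrame({
    'Contract': ['Month-to-month'] * 4 + ['One year'] * 4 + ['Two year'] * 2,
    'Churn': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No'],
})

# The mean of a boolean flag per group is the churn rate of that group
churn_rate_by_contract = (
    raw['Churn'].eq('Yes')
    .groupby(raw['Contract'])
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate_by_contract)
```

The resulting Series can be passed to `.plot(kind='bar')` in the same way as the per-dummy plots.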

## Advanced Feature Engineering and Selection
We will create new features, use domain knowledge, and apply feature selection techniques to improve model performance.

In [None]:
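# Example: create a tenure-group feature if 'tenure' exists.
# The bin edges are in raw months, so run this on unscaled tenure (before StandardScaler)
# and before the train/test split; otherwise the new columns never reach X_train.
if 'tenure' in df.columns:
    # include_lowest=True keeps customers with tenure == 0 in the first bin instead of NaN
    df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 60, np.inf], labels=['0-12','12-24','24-48','48-60','60+'], include_lowest=True)
    df = pd.get_dummies(df, columns=['tenure_group'], drop_first=True)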

# Feature selection using RFE
from sklearn.feature_selection import RFE
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
selector = selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.support_]
print('Selected Features:', selected_features.tolist())
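The tenure grouping above relies on how `pd.cut` treats bin edges: intervals are closed on the right, and the lowest edge is excluded unless `include_lowest=True` is set. The sketch below applies the same bins to a few hypothetical raw tenure values (in months) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical raw tenure values in months, including edge cases
tenure = pd.Series([0, 5, 12, 13, 30, 60, 72])
bins = [0, 12, 24, 48, 60, np.inf]
labels = ['0-12', '12-24', '24-48', '48-60', '60+']

# Default behaviour: the first interval is (0, 12], so tenure == 0 falls outside every bin
default_groups = pd.cut(tenure, bins=bins, labels=labels)
print('Unassigned rows by default:', default_groups.isna().sum())

# include_lowest=True closes the first interval on the left: [0, 12]
groups = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)
print(groups.tolist())
```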

## Hyperparameter Tuning
We will use GridSearchCV to find the best hyperparameters for both Logistic Regression and Random Forest models.

In [None]:
from sklearn.model_selection import GridSearchCV

# Logistic Regression hyperparameter tuning
logreg_params = {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
logreg_grid = GridSearchCV(LogisticRegression(max_iter=1000), logreg_params, cv=5, scoring='accuracy')
logreg_grid.fit(X_train, y_train)
print('Best Logistic Regression Params:', logreg_grid.best_params_)

# Random Forest hyperparameter tuning
rf_params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10], 'min_samples_split': [2, 5]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_train, y_train)
print('Best Random Forest Params:', rf_grid.best_params_)
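Accuracy can be a weak tuning target when churners are a minority, because a model that predicts "no churn" for everyone already scores well. As an alternative, the sketch below tunes on ROC AUC with stratified folds. It uses synthetic data from `make_classification` as a stand-in for the churn dataset, so the printed values say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (roughly 20% positives)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class ratio stable across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=cv,
    scoring='roc_auc',  # ranks candidates by ROC AUC instead of accuracy
)
grid.fit(X_syn, y_syn)

print('Best params:', grid.best_params_)
print('Best mean CV ROC AUC:', round(grid.best_score_, 3))
```

Other built-in scoring strings such as `'f1'` or `'recall'` can be swapped in, depending on which error type matters more for the retention use case.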

## Model Validation with K-Fold Cross-Validation
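We will use k-fold cross-validation to assess the robustness of our models. Because these folds reuse the training data that GridSearchCV already tuned on, the scores are somewhat optimistic; the held-out test set remains the final check.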

In [None]:
from sklearn.model_selection import cross_val_score

# K-Fold CV for best Logistic Regression
logreg_cv_scores = cross_val_score(logreg_grid.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
print('Logistic Regression CV Accuracy:', logreg_cv_scores.mean())

# K-Fold CV for best Random Forest
rf_cv_scores = cross_val_score(rf_grid.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
print('Random Forest CV Accuracy:', rf_cv_scores.mean())
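One way to get a less optimistic estimate is nested cross-validation, where the hyperparameter search is rerun inside each outer fold. The sketch below wraps a `GridSearchCV` in `cross_val_score`. It runs on synthetic data and uses a reduced parameter grid for speed; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tunes C
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1, 10]},
    cv=inner_cv,
    scoring='roc_auc',
)

# Each outer fold refits the whole search on its own training portion
nested_scores = cross_val_score(search, X_syn, y_syn, cv=outer_cv, scoring='roc_auc')
print(f'Nested CV ROC AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}')
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold.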

## Model Explainability with SHAP
We will use SHAP to interpret the predictions and understand feature contributions.

In [None]:
# Install SHAP if not already installed
import sys
!{sys.executable} -m pip install shap

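import shap
explainer = shap.TreeExplainer(rf_grid.best_estimator_)
shap_values = explainer.shap_values(X_test)
# Older SHAP releases return a list with one array per class; newer ones return a single
# array shaped (samples, features, classes). Select the churn class (index 1) in either case.
shap_churn = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(shap_churn, X_test, plot_type='bar')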

## Model Performance Visualization
Visualize ROC curves and confusion matrices for both models to compare their performance.

In [None]:
from sklearn.metrics import roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay

# ROC Curve for both models
y_score_logreg = logreg_grid.best_estimator_.predict_proba(X_test)[:,1]
y_score_rf = rf_grid.best_estimator_.predict_proba(X_test)[:,1]
fpr_logreg, tpr_logreg, _ = roc_curve(y_test, y_score_logreg)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_score_rf)
plt.figure(figsize=(8,6))
plt.plot(fpr_logreg, tpr_logreg, label='Logistic Regression')
plt.plot(fpr_rf, tpr_rf, label='Random Forest')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

# Confusion Matrices
fig, axes = plt.subplots(1, 2, figsize=(12,5))
ConfusionMatrixDisplay(confusion_matrix(y_test, logreg_grid.best_estimator_.predict(X_test))).plot(ax=axes[0], colorbar=False)
axes[0].set_title('Logistic Regression')
ConfusionMatrixDisplay(confusion_matrix(y_test, rf_grid.best_estimator_.predict(X_test))).plot(ax=axes[1], colorbar=False)
axes[1].set_title('Random Forest')
plt.show()
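The confusion matrices above use the default 0.5 decision threshold, but the ROC curve shows that other thresholds trade precision against recall. The sketch below makes that trade-off explicit. The labels and predicted probabilities are hypothetical values, not output from the models in this notebook.

```python
import numpy as np

# Hypothetical true labels and predicted churn probabilities
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.4, 0.2, 0.35, 0.55, 0.05, 0.8, 0.3])

def precision_recall_at(threshold):
    """Return (precision, recall) for the churn class at a given threshold."""
    predicted = y_prob >= threshold
    tp = np.sum(predicted & (y_true == 1))
    fp = np.sum(predicted & (y_true == 0))
    fn = np.sum(~predicted & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(threshold)
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')
```

Lowering the threshold catches more churners at the cost of more false alarms, which may be acceptable when a retention offer is cheap compared with a lost customer.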

## Requirements and Environment Setup
To reproduce this notebook, install the following packages: pandas, numpy, matplotlib, seaborn, scikit-learn, shap.

In [None]:
# Install requirements (uncomment if running in a new environment)
# !pip install pandas numpy matplotlib seaborn scikit-learn shap

## Business Recommendations and Impact
Based on the analysis, target the top drivers of churn with specific retention strategies. Quantify the potential impact by estimating how reducing churn in key segments could improve revenue or customer lifetime value.
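A back-of-the-envelope calculation can make the potential impact concrete. The sketch below is a deliberately simple estimate: every input is a hypothetical placeholder, and it assumes constant revenue per customer while ignoring the cost of the retention campaign itself.

```python
def retained_revenue(customers, churn_rate, relative_reduction, monthly_revenue, months=12):
    """Estimate revenue kept by reducing churn in a segment.

    relative_reduction is the fraction of would-be churners who are retained,
    for example 0.10 for a 10% relative reduction in churn.
    """
    customers_saved = customers * churn_rate * relative_reduction
    return customers_saved * monthly_revenue * months

# Hypothetical segment: 1,000 customers, 40% churn, 70 per month in revenue
estimate = retained_revenue(
    customers=1000, churn_rate=0.40, relative_reduction=0.10, monthly_revenue=70
)
print(f'Estimated revenue retained over 12 months: {estimate:,.0f}')
```

Comparing this estimate against the expected campaign cost for each segment gives a first-pass ranking of which retention strategies are worth testing.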