# Customer Churn Prediction using Machine Learning

**Full notebook**: data loading → EDA → preprocessing → feature engineering → model training (Logistic Regression, Decision Tree, Random Forest, SVM) → evaluation & visualizations.

> **Note:** Update `data_path` in the data-loading cell if your CSV is located elsewhere (default: `./../data/Telco-Customer-Churn.csv`, resolved relative to the notebook's working directory).

In [1]:
# Imports
import os
import warnings
warnings.filterwarnings('ignore')

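# Render figures inline. Switching to the non-interactive "Agg" backend here
# would override the inline backend, so plt.show() would display nothing.
%matplotlib inline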

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_fscore_support

print('Libraries loaded')

Libraries loaded


In [2]:
# Load dataset (update path if needed)
data_path = './../data/Telco-Customer-Churn.csv'  # <-- change this if your file is elsewhere
if not os.path.exists(data_path):
    print(f"Warning: {data_path} not found. Please place the Telco CSV at this path or update the variable.")
else:
    df = pd.read_csv(data_path)
    print('Data loaded. Shape:', df.shape)
    display(df.head())

Data loaded. Shape: (7043, 21)


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


## 1) Quick Data Inspection

Run the cell below to inspect types, nulls, and a basic summary.

In [3]:
# Basic info
try:
    # info() prints its report and returns None, so it is called directly;
    # wrapping it in display() would also render a stray "None".
    df.info()
    display(df.isnull().sum())
    display(df.describe(include='all').T.head(30))
except NameError:
    print('Dataframe `df` not defined. Please run the data loading cell.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

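| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| customerID | 7043.0 | 7043.0 | 7590-VHVEG | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| gender | 7043.0 | 2.0 | Male | 3555.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| SeniorCitizen | 7043.0 | NaN | NaN | NaN | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Partner | 7043.0 | 2.0 | No | 3641.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependents | 7043.0 | 2.0 | No | 4933.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| tenure | 7043.0 | NaN | NaN | NaN | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| PhoneService | 7043.0 | 2.0 | Yes | 6361.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| MultipleLines | 7043.0 | 3.0 | No | 3390.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| InternetService | 7043.0 | 3.0 | Fiber optic | 3096.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| OnlineSecurity | 7043.0 | 3.0 | No | 3498.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |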


## 2) Data Cleaning & Preprocessing
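- Drop the `customerID` identifier column
- Convert `TotalCharges` to numeric; entries that cannot be parsed become NaN
- Drop rows with missing values
- Encode the target variable `Churn` as 1 (Yes) / 0 (No)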
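
To illustrate the conversion step, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not read from the CSV). It assumes the problematic entries are blank strings, which `pd.to_numeric(..., errors='coerce')` turns into NaN so they can be dropped afterwards.

```python
import pandas as pd

# Toy frame: the blank string stands in for an unparseable TotalCharges entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())  # rows affected by the conversion
cleaned = toy.dropna()                             # simple approach: drop those rows

print("Missing after conversion:", n_missing)
print("Shape after dropping NA:", cleaned.shape)
```

Dropping rows is reasonable when only a handful are affected; imputing a value would be an alternative if many rows were lost.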

In [4]:
# Data cleaning
if 'df' in globals():
    data = df.copy()
    # Drop customerID if present
    if 'customerID' in data.columns:
        data.drop('customerID', axis=1, inplace=True)
    
    # Convert TotalCharges to numeric if present (some rows may be blank)
    if 'TotalCharges' in data.columns:
        data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
    
    # Show nulls introduced by conversion
    print('Nulls per column after conversion:')
    display(data.isnull().sum())
    
    # Drop rows with missing values (simple approach)
    data.dropna(inplace=True)
    print('Shape after dropping NA:', data.shape)
    
    # Map target
    if 'Churn' in data.columns:
        data['Churn'] = data['Churn'].map({'Yes':1, 'No':0})
    
    display(data.head())

Nulls per column after conversion:


gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

Shape after dropping NA: (7032, 20)


| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 0 |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | 0 |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 1 |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | 0 |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | 1 |


## 3) Exploratory Data Analysis (EDA)
- Distribution of target
- Numeric feature distributions
- Correlation heatmap
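
Alongside the count plot, it can help to look at the target distribution as proportions, since class imbalance affects how accuracy should be read. A minimal sketch on a hypothetical eight-row `Churn` series (not the real data):

```python
import pandas as pd

# Hypothetical encoded target: 0 = stayed, 1 = churned.
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Churn")

# normalize=True returns class proportions instead of raw counts.
rates = churn.value_counts(normalize=True).sort_index()
print(rates)
```

On the notebook's data the same call would be `data['Churn'].value_counts(normalize=True)`.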

In [5]:
# EDA plots
if 'data' in globals():
    plt.figure(figsize=(6,4))
    sns.countplot(x='Churn', data=data)
    plt.title('Churn Distribution (0 = No, 1 = Yes)')
    plt.show()
    
    # Numeric feature distribution examples
    num_cols = data.select_dtypes(include=['int64','float64']).columns.tolist()
    num_cols = [c for c in num_cols if c != 'Churn'][:6]
    for col in num_cols:
        plt.figure(figsize=(6,3))
        sns.histplot(data[col], kde=True)
        plt.title(f'Distribution: {col}')
        plt.tight_layout()
        plt.show()
    
    # Correlation heatmap (restricted to the numeric columns selected above, plus the target)
    plt.figure(figsize=(10,8))
    sns.heatmap(data[num_cols + ['Churn']].corr(), annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation (selected numeric features)')
    plt.show()
else:
    print('Run previous cells to load and clean data')

## 4) Feature Engineering
- One-hot encode categorical variables with `pd.get_dummies`
- Split into stratified train and test sets (80/20)
- Scale features for algorithms that are sensitive to feature scale, such as SVM
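
As a small illustration of what one-hot encoding with `drop_first=True` produces, the sketch below encodes a made-up `Contract` column (the rows are illustrative only). The first category in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (values are made up).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 60, 12],
})

# drop_first=True drops the first category ("Month-to-month"),
# so a row with all dummy columns equal to 0 represents that baseline.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

A three-level column therefore contributes two dummy columns, which is why the encoded feature matrix is wider than the original frame.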


In [6]:
# Feature engineering & encoding
if 'data' in globals():
    df_fe = data.copy()
    
    # Separate categorical and numerical
    cat_cols = df_fe.select_dtypes(include=['object']).columns.tolist()
    num_cols = df_fe.select_dtypes(include=['number']).columns.tolist()
    num_cols = [c for c in num_cols if c != 'Churn']
    
    print('Categorical columns:', cat_cols)
    print('Numeric columns:', num_cols)
    
    # Encoding: one-hot encode every categorical column with get_dummies.
    # drop_first=True drops one level per column to avoid perfectly collinear dummies.
    df_fe = pd.get_dummies(df_fe, columns=cat_cols, drop_first=True)
    print('Shape after get_dummies:', df_fe.shape)
    
    # Feature matrix and target
    X = df_fe.drop('Churn', axis=1)
    y = df_fe['Churn']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
    print('Train/Test shapes:', X_train.shape, X_test.shape)
else:
    print('Run earlier cells to prepare data')

Categorical columns: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
Numeric columns: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
Shape after get_dummies: (7032, 31)
Train/Test shapes: (5625, 30) (1407, 30)


### Scaling
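We standardize features with `StandardScaler` because SVMs are sensitive to feature scale. After one-hot encoding every column is numeric, so the cell below scales all of them, fitting the scaler on the training split only. Tree-based models do not require scaling, and in this notebook only the SVM is trained on the scaled arrays.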
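
One alternative to scaling the arrays by hand is to wrap the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is always fitted only on the data passed to `fit`. The sketch below is not the notebook's approach. It runs on synthetic data with hypothetical column names and a made-up target rule, and it scales only the continuous columns via `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded feature matrix (values are random).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
    "Partner_Yes": rng.integers(0, 2, size=n),
})
# Hypothetical target for illustration: short-tenure customers "churn".
target = (toy["tenure"] < 12).astype(int)

# Scale only the continuous columns; pass the 0/1 dummy column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["tenure", "MonthlyCharges"])],
    remainder="passthrough",
)
svm_pipeline = Pipeline([("prep", preprocess), ("svm", SVC(kernel="rbf"))])

svm_pipeline.fit(toy, target)
preds = svm_pipeline.predict(toy)
print(preds[:10])
```

Because the scaler lives inside the pipeline, the same object can be passed to cross-validation utilities without leaking statistics from held-out folds.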

In [7]:
from sklearn.preprocessing import StandardScaler

if 'X_train' in globals():
    # After get_dummies every column is numeric, so all features are scaled.
    # The scaler is fitted on the training split only, then applied to the test split.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print('Data scaled.')
else:
    print('Prepare features first')

Data scaled.


## 5) Model Training & Evaluation
We train four models and compare accuracy, classification report, confusion matrix, and ROC AUC where applicable.
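
For reference, the headline metrics can all be derived from a binary confusion matrix. With labels 0 and 1, scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, i.e. `[[TN, FP], [FN, TP]]`. The sketch below uses hypothetical counts, not results from this notebook.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
m = metrics_from_confusion(tn=90, fp=10, fn=20, tp=30)
print(m)
```

With these counts accuracy is 0.80 while recall on the positive class is only 0.60, which shows why accuracy alone can be misleading when the classes are imbalanced.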

In [8]:
# Train models
if 'X_train' in globals():
    # The SVM is trained on the scaled arrays; the other models use the unscaled features
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(kernel='rbf', probability=True, random_state=42)
    }
    
    trained = {}
    results = {}
    for name, model in models.items():
        if name == 'SVM':
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
            y_prob = model.predict_proba(X_test_scaled)[:,1]
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            # some models have predict_proba
            try:
                y_prob = model.predict_proba(X_test)[:,1]
            except Exception:
                y_prob = None
        
        acc = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred, output_dict=True)
        cm = confusion_matrix(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_prob) if y_prob is not None else None
        
        results[name] = {'accuracy': acc, 'report': report, 'confusion_matrix': cm, 'roc_auc': roc_auc}
        trained[name] = model
        
        print(f"{name} -> Accuracy: {acc:.4f}", end='')
        if roc_auc is not None:
            print(f", ROC AUC: {roc_auc:.4f}")
        else:
            print('')
else:
    print('Prepare features first')

Logistic Regression -> Accuracy: 0.8031, ROC AUC: 0.8363
Decision Tree -> Accuracy: 0.7186, ROC AUC: 0.6366
Random Forest -> Accuracy: 0.7875, ROC AUC: 0.8171
SVM -> Accuracy: 0.7868, ROC AUC: 0.7909


In [9]:
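# Save outputs to an 'outputs' folder in the parent of the current working directory
# (the project root when the notebook is run from its own subfolder).
# `os` is already imported in the first cell.
output_dir = os.path.join(os.path.dirname(os.getcwd()), "outputs")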
os.makedirs(output_dir, exist_ok=True)

### Confusion Matrices & ROC Curves

In [10]:
# Plot confusion matrices and ROC curves, save outputs
if 'results' in globals():
    for name, res in results.items():
        cm = res['confusion_matrix']
        plt.figure(figsize=(5,4))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title(f'Confusion Matrix - {name}')
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.tight_layout()
        plt.savefig(f'{output_dir}/confusion_matrix_{name.replace(" ","_")}.png')
        plt.show()
        plt.close()
        
    # ROC curves
    plt.figure(figsize=(7,6))
    for name, model in trained.items():
        try:
            if name == 'SVM':
                y_prob = model.predict_proba(X_test_scaled)[:,1]
            else:
                y_prob = model.predict_proba(X_test)[:,1]
            fpr, tpr, _ = roc_curve(y_test, y_prob)
            auc = roc_auc_score(y_test, y_prob)
            plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
        except Exception as e:
            print(f"Skipping ROC for {name}: {e}")
    plt.plot([0,1],[0,1],'k--', label='Random')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves')
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'{output_dir}/roc_curves.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print('Run model training cell first')

## 6) Results Summary & Next Steps
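- On this train/test split, Logistic Regression scored highest on both accuracy (0.8031) and ROC AUC (0.8363); the Decision Tree scored lowest on both (0.7186 and 0.6366)
- Consider hyperparameter tuning (`GridSearchCV`) and class imbalance handling (SMOTE, `class_weight`)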
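
As one possible starting point for the tuning step, the sketch below runs `GridSearchCV` over the regularization strength `C` and the `class_weight` setting of a logistic regression. It uses a synthetic, imbalanced dataset from `make_classification` rather than the Telco data, and the grid values are arbitrary assumptions. SMOTE is not shown because it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (roughly 75% / 25%), standing in for the churn features.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Arbitrary illustrative grid: regularization strength and class weighting.
param_grid = {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # threshold-independent, less sensitive to imbalance than accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best mean CV ROC AUC:", round(search.best_score_, 3))
```

On the notebook's data the call would be `search.fit(X_train, y_train)`, keeping the test split untouched until the final evaluation.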

### Save a compact summary to CSV

In [11]:
if 'results' in globals():
    summary = []
    for name, r in results.items():
        summary.append({
            'model': name,
            'accuracy': r['accuracy'],
            'roc_auc': r['roc_auc']
        })
    
    summary_df = pd.DataFrame(summary).sort_values('accuracy', ascending=False)
    display(summary_df)
    
    # Save CSV in the root-level outputs directory
    output_path = os.path.join(output_dir, "model_summary.csv")
    summary_df.to_csv(output_path, index=False)
    
    print(f"Saved {output_path}")
else:
    print("Run model training first to generate results.")


| | model | accuracy | roc_auc |
|---|---|---|---|
| 0 | Logistic Regression | 0.803127 | 0.836291 |
| 2 | Random Forest | 0.787491 | 0.817073 |
| 3 | SVM | 0.78678 | 0.790852 |
| 1 | Decision Tree | 0.71855 | 0.636638 |


Saved /Users/satyamvats/Desktop/sem3/project/customer-churn-prediction/outputs/model_summary.csv


----

### Notes
- This notebook aims to be reproducible; ensure the dataset CSV path is correct.
- For a production-ready pipeline, separate scripts, more robust preprocessing, cross-validation, and logging are recommended.

Good luck!