# Customer Churn Prediction using Machine Learning

**Full notebook**: data loading → EDA → preprocessing → feature engineering → model training (Logistic Regression, Decision Tree, Random Forest, SVM) → evaluation & visualizations.

> **Note:** Update the dataset path in the first code cell if your CSV is located elsewhere (default: `data/Telco-Customer-Churn.csv`).

In [None]:
# Imports
import os
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_fscore_support

print('Libraries loaded')

In [None]:
# Load dataset (update path if needed)
data_path = 'data/Telco-Customer-Churn.csv'  # <-- change this if your file is elsewhere
if not os.path.exists(data_path):
    print(f"Warning: {data_path} not found. Please place the Telco CSV at this path or update the variable.")
else:
    df = pd.read_csv(data_path)
    print('Data loaded. Shape:', df.shape)
    display(df.head())

## 1) Quick Data Inspection

Run the cell below to inspect types, nulls, and a basic summary.

In [None]:
# Basic info
try:
    df.info()  # info() prints its summary and returns None, so it needs no display()
    display(df.isnull().sum())
    display(df.describe(include='all').T.head(30))
except NameError:
    print('Dataframe `df` not defined. Please run the data loading cell.')

## 2) Data Cleaning & Preprocessing
- Drop the `customerID` identifier column
- Convert `TotalCharges` to numeric (blank entries become NaN)
- Drop rows with missing values
- Encode the target variable `Churn` as 1 (Yes) / 0 (No)

In [None]:
# Data cleaning
if 'df' in globals():
    data = df.copy()
    # Drop customerID if present
    if 'customerID' in data.columns:
        data.drop('customerID', axis=1, inplace=True)
    
    # Convert TotalCharges to numeric if present (some rows may be blank)
    if 'TotalCharges' in data.columns:
        data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
    
    # Show nulls introduced by conversion
    print('Nulls per column after conversion:')
    display(data.isnull().sum())
    
    # Drop rows with missing values (simple approach)
    data.dropna(inplace=True)
    print('Shape after dropping NA:', data.shape)
    
    # Map target
    if 'Churn' in data.columns:
        data['Churn'] = data['Churn'].map({'Yes':1, 'No':0})
    
    display(data.head())
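
The effect of `errors='coerce'` is easiest to see on a tiny column. The sketch below uses a made-up `toy` frame (not the Telco data) and assumes blank entries are stored as a single space, which is why they only surface as missing values after the conversion.

```python
import pandas as pd

# Toy column mimicking TotalCharges stored as text; the values are invented.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", " "]})

# Before conversion the blanks are ordinary strings, so isnull() reports nothing.
missing_before = int(toy["TotalCharges"].isnull().sum())

# errors="coerce" turns every unparseable entry into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(toy["TotalCharges"].isnull().sum())

print(missing_before, missing_after, toy["TotalCharges"].dtype)
```

Counting nulls before and after the conversion is a cheap way to check how many rows `dropna` is about to remove.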

## 3) Exploratory Data Analysis (EDA)
- Distribution of target
- Numeric feature distributions
- Correlation heatmap

In [None]:
# EDA plots
if 'data' in globals():
    plt.figure(figsize=(6,4))
    sns.countplot(x='Churn', data=data)
    plt.title('Churn Distribution (0 = No, 1 = Yes)')
    plt.show()
    
    # Numeric feature distribution examples
    num_cols = data.select_dtypes(include=['int64','float64']).columns.tolist()
    num_cols = [c for c in num_cols if c != 'Churn'][:6]
    for col in num_cols:
        plt.figure(figsize=(6,3))
        sns.histplot(data[col], kde=True)
        plt.title(f'Distribution: {col}')
        plt.tight_layout()
        plt.show()
    
    # Correlation heatmap (limited to the numeric columns selected above, to keep it readable)
    plt.figure(figsize=(10,8))
    sns.heatmap(data[num_cols + ['Churn']].corr(), annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation (selected numeric features)')
    plt.show()
else:
    print('Run previous cells to load and clean data')
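
Alongside the count plot, the class balance can be read off as proportions. This is a minimal sketch on an invented `toy` target column; in the notebook the same call would be applied to `data['Churn']`.

```python
import pandas as pd

# Invented 0/1 target with two churners out of eight customers.
toy = pd.DataFrame({"Churn": [0, 0, 0, 1, 0, 1, 0, 0]})

# normalize=True returns class proportions instead of raw counts.
rates = toy["Churn"].value_counts(normalize=True)
churn_rate = float(rates.loc[1])

print(rates)
print(f"Churn rate: {churn_rate:.2%}")
```

A churn rate far from 50% is a sign that accuracy alone will be a weak basis for comparing models.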

## 4) Feature Engineering
- One-hot encode categorical variables with `pd.get_dummies`
- Split the data into stratified train and test sets
- Scale features for algorithms like SVM


In [None]:
# Feature engineering & encoding
if 'data' in globals():
    df_fe = data.copy()
    
    # Separate categorical and numerical
    cat_cols = df_fe.select_dtypes(include=['object']).columns.tolist()
    num_cols = df_fe.select_dtypes(include=['number']).columns.tolist()
    num_cols = [c for c in num_cols if c != 'Churn']
    
    print('Categorical columns:', cat_cols)
    print('Numeric columns:', num_cols)
    
    # One-hot encode every categorical column with get_dummies.
    # drop_first=True drops one level per column so the dummies are not perfectly collinear.
    df_fe = pd.get_dummies(df_fe, columns=cat_cols, drop_first=True)
    print('Shape after get_dummies:', df_fe.shape)
    
    # Feature matrix and target
    X = df_fe.drop('Churn', axis=1)
    y = df_fe['Churn']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
    print('Train/Test shapes:', X_train.shape, X_test.shape)
else:
    print('Run earlier cells to prepare data')
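
To make the `drop_first=True` behaviour concrete, here is a small sketch on an invented frame. The column names echo the Telco schema, but the rows are made up for illustration.

```python
import pandas as pd

# Invented rows: one three-level categorical column and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 24, 6],
})

# drop_first=True removes the first level (in sorted order) of each encoded column,
# so "Month-to-month" becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(encoded.columns))
```

A row with both remaining dummies set to false is a month-to-month contract, so no information is lost by dropping that level.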

### Scaling
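We standardize the features with `StandardScaler` because the SVM is sensitive to feature scale. Tree-based models do not need scaling, although it does no harm; in this notebook only the SVM is trained on the scaled features. The scaler is fit on the training split only and then applied to the test split, so no test-set statistics leak into training.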

In [None]:
# StandardScaler was imported in the first cell; repeated here so this cell also runs on its own
from sklearn.preprocessing import StandardScaler

if 'X_train' in globals():
    # After get_dummies every column is numeric, so all features are scaled.
    # Fit on the training split only, then reuse the same transform on the test split.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print('Data scaled.')
else:
    print('Prepare features first')
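
An alternative to scaling every column is to scale only the continuous ones and pass the dummy columns through unchanged. The sketch below is one way to do that with `ColumnTransformer` inside a `Pipeline`; the toy data, the `continuous` list, and the names `preprocess` and `svm_pipe` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train: two continuous columns and one 0/1 dummy.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=40),
    "MonthlyCharges": rng.uniform(20, 120, size=40),
    "Partner_Yes": rng.integers(0, 2, size=40),
})
toy_y = np.array([0, 1] * 20)

# Scale only the continuous columns; remainder="passthrough" keeps the rest as-is.
continuous = ["tenure", "MonthlyCharges"]
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous)],
    remainder="passthrough",
)

# The pipeline fits the scaler and the classifier together on the training data.
svm_pipe = Pipeline([("prep", preprocess), ("clf", SVC(kernel="rbf"))])
svm_pipe.fit(toy_X, toy_y)

# Inspect the transformed matrix: scaled columns first, passthrough columns last.
scaled = svm_pipe.named_steps["prep"].transform(toy_X)
print(scaled[:3])
```

Bundling the scaler with the model also means the same object can later be handed to cross-validation without refitting the scaler by hand on each fold.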

## 5) Model Training & Evaluation
We train four models and compare accuracy, classification report, confusion matrix, and ROC AUC where applicable.

In [None]:
# Train models
if 'X_train' in globals():
    # The SVM is fit on the scaled features; the other models are fit on the unscaled features
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(kernel='rbf', probability=True, random_state=42)
    }
    
    trained = {}
    results = {}
    for name, model in models.items():
        if name == 'SVM':
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
            y_prob = model.predict_proba(X_test_scaled)[:,1]
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            # some models have predict_proba
            try:
                y_prob = model.predict_proba(X_test)[:,1]
            except Exception:
                y_prob = None
        
        acc = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred, output_dict=True)
        cm = confusion_matrix(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_prob) if y_prob is not None else None
        
        results[name] = {'accuracy': acc, 'report': report, 'confusion_matrix': cm, 'roc_auc': roc_auc}
        trained[name] = model
        
        print(f"{name} -> Accuracy: {acc:.4f}", end='')
        if roc_auc is not None:
            print(f", ROC AUC: {roc_auc:.4f}")
        else:
            print('')
else:
    print('Prepare features first')
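
Accuracy can look acceptable while the churn class is poorly detected, which is why the classification report matters. The sketch below uses invented labels to show precision and recall for the positive class next to accuracy; `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented labels: 6 non-churners and 4 churners.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# Invented predictions: one false positive, and only one churner caught.
y_hat = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_hat)

# average="binary" reports the metrics for pos_label only (the churn class).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_hat, average="binary", pos_label=1
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here most predictions are correct overall, yet three of the four churners are missed, which only the recall figure reveals.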

### Confusion Matrices & ROC Curves

In [None]:
# Plot confusion matrices and ROC curves, save outputs
if 'results' in globals():
    os.makedirs('outputs', exist_ok=True)
    for name, res in results.items():
        cm = res['confusion_matrix']
        plt.figure(figsize=(5,4))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title(f'Confusion Matrix - {name}')
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.tight_layout()
        # Save before show(): once show() has rendered the figure, savefig may write a blank image
        plt.savefig(f'outputs/confusion_matrix_{name.replace(" ","_")}.png')
        plt.show()
        plt.close()
        
    # ROC curves
    plt.figure(figsize=(7,6))
    for name, model in trained.items():
        try:
            if name == 'SVM':
                y_prob = model.predict_proba(X_test_scaled)[:,1]
            else:
                y_prob = model.predict_proba(X_test)[:,1]
            fpr, tpr, _ = roc_curve(y_test, y_prob)
            auc = roc_auc_score(y_test, y_prob)
            plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
        except Exception as e:
            print(f"Skipping ROC for {name}: {e}")
    plt.plot([0,1],[0,1],'k--', label='Random')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves')
    plt.legend()
    plt.tight_layout()
    # Save before show(): once show() has rendered the figure, savefig may write a blank image
    plt.savefig('outputs/roc_curves.png')
    plt.show()
else:
    print('Run model training cell first')

## 6) Results Summary & Next Steps
- Review metrics above for model selection
- Consider hyperparameter tuning (GridSearchCV) and class imbalance handling (SMOTE, class_weight)
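
As a starting point for both suggestions, the sketch below tunes a logistic regression with `GridSearchCV` and treats `class_weight` as one of the searched parameters. It runs on synthetic imbalanced data from `make_classification`; the grid values and the `roc_auc` scoring choice are assumptions, not tuned recommendations. SMOTE is omitted because it lives in the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with roughly a 75/25 class split, standing in for the churn features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Search over regularization strength and whether to reweight classes.
param_grid = {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="roc_auc", cv=5
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print("Best CV ROC AUC:", round(search.best_score_, 3))
print("Held-out score:", round(search.score(X_te, y_te), 3))
```

The held-out split is only touched once, after the search, so it still gives an unbiased estimate of the selected model.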

### Save a compact summary to CSV

In [None]:
if 'results' in globals():
    summary = []
    for name, r in results.items():
        summary.append({
            'model': name,
            'accuracy': r['accuracy'],
            'roc_auc': r['roc_auc']
        })
    summary_df = pd.DataFrame(summary).sort_values('accuracy', ascending=False)
    display(summary_df)
    os.makedirs('outputs', exist_ok=True)  # the folder may not exist if the plotting cell was skipped
    summary_df.to_csv('outputs/model_summary.csv', index=False)
    print('Saved outputs/model_summary.csv')

----

### Notes
- This notebook aims to be reproducible; ensure the dataset CSV path is correct.
- For a production-ready pipeline, separate scripts, more robust preprocessing, cross-validation, and logging are recommended.

Good luck!