# Churn Prediction Exercise

In this exercise, you will build and tune six classification models on a customer churn dataset, applying feature engineering, model evaluation, overfitting checks, and hyperparameter tuning.

## Dataset
We will use the Telco Customer Churn dataset from Kaggle, which contains 7,043 customer records and 21 columns, including the target variable Churn (Yes/No).

The 21 columns (explored further in the EDA below) are:
• customerID, gender, SeniorCitizen, Partner, Dependents, tenure
• PhoneService, MultipleLines, InternetService
• OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
• Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn

**Link:** https://www.kaggle.com/datasets/blastchar/telco-customer-churn

## Step 1 – Load & Explore Data

### Load dataset with Pandas

In [119]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

In [131]:
file_path = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "blastchar/telco-customer-churn",
    file_path
)
df.head()

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


### Explore distributions, missing values, and class imbalance

In [122]:
import sweetviz as sv

In [123]:
report = sv.analyze(df)
report.show_html("sweetviz_eda.html")

Report sweetviz_eda.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
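
The Sweetviz report covers class imbalance visually, but a direct count of the target is often quicker to read. The following is a minimal sketch, not part of the original notebook: the helper name `class_balance` is hypothetical, and the toy frame only stands in for `df` so the snippet runs on its own.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "Churn") -> pd.DataFrame:
    """Return the count and percentage share of each class in `target`."""
    counts = frame[target].value_counts(dropna=False)
    share = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"count": counts, "pct": share})


# Toy stand-in for the real dataframe; these values are illustrative only.
toy = pd.DataFrame({"Churn": ["No", "No", "No", "Yes"]})
balance = class_balance(toy)
print(balance)
```

In the notebook itself, `class_balance(df)` would give the same summary for the real target column.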


### Identify categorical vs. numerical features

In [124]:
import numpy as np

In [132]:
# numeric and categorical detection
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f'Numeric columns ({len(num_cols)}):')
print(num_cols)

cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(f'\nCategorical columns ({len(cat_cols)}) :')
print(cat_cols)

# binary-like columns (two distinct non-null values)
binary_cols = [c for c in df.columns if df[c].nunique(dropna=True) == 2]
print('\nBinary-like columns (nunique==2):')
print(binary_cols)

Numeric columns (3):
['SeniorCitizen', 'tenure', 'MonthlyCharges']

Categorical columns (18) :
['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges', 'Churn']

Binary-like columns (nunique==2):
['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
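
Note that `TotalCharges` is listed among the categorical columns even though it holds monetary amounts. This happens when a column is read as strings, for example because some entries are blank. A hedged sketch of one common remedy, `pd.to_numeric` with `errors="coerce"`, is shown below on a toy frame (the values are illustrative, not taken from the full dataset); unparseable entries become `NaN` rather than raising an error.

```python
import pandas as pd

# Toy stand-in: the third TotalCharges entry is a whitespace-only string.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# Coerce to float; anything that cannot be parsed becomes NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(toy.dtypes)
print("Missing after coercion:", int(toy["TotalCharges"].isna().sum()))
```

After the same conversion on `df`, `TotalCharges` would be picked up by `select_dtypes(include=[np.number])` instead of the object selector.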


## Step 2 – Data Preparation

### 1. Handle missing values

In [None]:
# Get rows with missing (including empty/whitespace) values and totals per column
import pandas as pd

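# Convert empty/whitespace-only strings in object columns to NA for accurate counting.
# Use the .str accessor rather than astype(str): astype(str) would turn genuine NaN
# values into the literal string 'nan' and hide them from isnull().
obj_cols = df.select_dtypes(include=['object']).columns.tolist()
if obj_cols:
    df[obj_cols] = df[obj_cols].apply(lambda s: s.str.strip()).replace({'': pd.NA})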

# missing mask and totals
missing_mask = df.isnull()
missing_per_col = missing_mask.sum()
missing_per_col_pct = (missing_per_col / len(df) * 100).round(2)
missing_per_row = missing_mask.sum(axis=1)
n_rows_with_missing = int((missing_per_row > 0).sum())

print(f'Rows with any missing (including converted empty/whitespace): {n_rows_with_missing} / {len(df)}')

# Tidy summary per column
missing_summary = (
    pd.DataFrame({'missing_count': missing_per_col, 'missing_pct': missing_per_col_pct})
    .reset_index()
    .rename(columns={'index': 'column'})
    .sort_values('missing_count', ascending=False)
)

# Display only columns with non-zero missing values
nonzero_missing = missing_summary[missing_summary['missing_count'] > 0].copy()
if nonzero_missing.empty:
    print('No missing values detected after converting empty/whitespace to NA.')
else:
    display(nonzero_missing)

if n_rows_with_missing > 0:
    # Dataframe of rows that contain any missing values
    missing_rows_df = df[missing_per_row > 0].copy()
    print('\nSample rows that have missing values:')
    display(missing_rows_df.head())

    # Show the rows with the most missing values (index + count)
    top_missing = (
        missing_rows_df.assign(_n_missing=missing_rows_df.isnull().sum(axis=1))
        .sort_values('_n_missing', ascending=False)[['_n_missing']]
    )
    print('\nTop rows by number of missing values:')
    display(top_missing.head(10))

    # Repeat the per-column summary for reference
    print('\nColumns with missing values (non-zero only):')
    display(nonzero_missing)

Rows with any missing (including converted empty/whitespace): 11 / 7043


|   | column | missing_count | missing_pct |
|---|---|---|---|
| 19 | TotalCharges | 11 | 0.16 |



Sample rows that have missing values:


|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 |  | No |
| 753 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 |  | No |
| 936 | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | ... | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 |  | No |
| 1082 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 |  | No |
| 1340 | 1371-DWPAZ | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 |  | No |



Top rows by number of missing values:


|   | _n_missing |
|---|---|
| 488 | 1 |
| 753 | 1 |
| 936 | 1 |
| 1082 | 1 |
| 1340 | 1 |
| 3331 | 1 |
| 3826 | 1 |
| 4380 | 1 |
| 5218 | 1 |
| 6670 | 1 |



Columns with missing values (non-zero only):


|   | column | missing_count | missing_pct |
|---|---|---|---|
| 19 | TotalCharges | 11 | 0.16 |
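
The cell above only detects the gaps; it does not yet fill or drop them. One possible way to act on the result is sketched below. It assumes, based on the sample rows shown (all with `tenure` 0), that a blank `TotalCharges` means the customer has not been billed yet. That assumption should be verified against all 11 rows before relying on it. The helper name `handle_total_charges` and its `strategy` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd


def handle_total_charges(frame: pd.DataFrame, strategy: str = "zero") -> pd.DataFrame:
    """Return a copy with missing TotalCharges either zero-filled or dropped.

    strategy="zero": fill with 0.0, but only where tenure == 0 (assumption:
                     such customers have not been billed yet).
    strategy="drop": remove rows whose TotalCharges is missing.
    """
    out = frame.copy()
    # Blank or unparseable entries become NaN.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    missing = out["TotalCharges"].isna()

    if strategy == "zero":
        out.loc[missing & (out["tenure"] == 0), "TotalCharges"] = 0.0
    elif strategy == "drop":
        out = out.loc[~missing].reset_index(drop=True)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return out


# Toy stand-in for df; values are illustrative only.
toy = pd.DataFrame({"tenure": [1, 0], "TotalCharges": ["29.85", " "]})
filled = handle_total_charges(toy, strategy="zero")
dropped = handle_total_charges(toy, strategy="drop")
print(filled)
print(dropped)
```

With only 11 of 7,043 rows affected (0.16%), either strategy is unlikely to change model results much; zero-filling keeps every customer in the dataset, while dropping avoids encoding an assumption about billing.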
