# Churn Prediction Exercise

In this exercise, you will build and tune six different classification models on a business dataset (Customer Churn), applying feature engineering, model evaluation, overfitting checks, and hyperparameter tuning.

## Dataset
We will use the Telco Customer Churn dataset from Kaggle, which contains 7,043 customer records and 21 columns, including the target variable Churn (Yes/No).

The 21 columns, grouped by theme (see the EDA and feature descriptions for details):
• customerID, gender, SeniorCitizen, Partner, Dependents, tenure
• PhoneService, MultipleLines, InternetService
• OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
• Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn

**Link:** https://www.kaggle.com/datasets/blastchar/telco-customer-churn
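
Since the target `Churn` is stored as Yes/No text, most models and metrics need it as a numeric flag, and it helps to know the class balance before modeling. The sketch below shows one way to do both. It runs on a tiny made-up frame, and the column name `churn_flag` is an assumption introduced here for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: values are illustrative only.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "Churn": ["No", "Yes", "No", "No"],
})

# Map the Yes/No target to 1/0 so classifiers and metrics can use it directly.
# "churn_flag" is a hypothetical column name chosen for this sketch.
toy["churn_flag"] = toy["Churn"].map({"No": 0, "Yes": 1})

# Share of each class; a skewed split is a hint to use stratified splits
# and to look beyond plain accuracy when evaluating models.
class_share = toy["Churn"].value_counts(normalize=True)
print(class_share)
```

The same two lines (`map` and `value_counts(normalize=True)`) should apply unchanged to the full DataFrame once it is loaded.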

## Step 1 – Load & Explore Data

### Load dataset with Pandas

In [9]:
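import kagglehub
import pandas as pd
from kagglehub import KaggleDatasetAdapter

# Name of the CSV file inside the Kaggle dataset
file_path = 'WA_Fn-UseC_-Telco-Customer-Churn.csv'

# Load that file from the Kaggle dataset into a pandas DataFrame
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "blastchar/telco-customer-churn",
    file_path
)

# Preview the first five rows
df.head()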

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
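
A point worth illustrating before the missing-data checks: a blank cell that was read as text is not counted by `isnull()`, so a column can look complete while still holding unusable values. The sketch below shows the effect on a three-value toy series. The values are made up for the demonstration, and the assumption is that blanks arrive as whitespace strings.

```python
import pandas as pd

# Hypothetical miniature of a numeric column read as text; " " marks a blank cell.
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# isnull() treats a one-space string as a valid value, so nothing is flagged here.
hidden_before = raw.isnull().sum()
print("Missing before coercion:", hidden_before)

# Coercion turns anything that cannot be parsed as a number into NaN,
# which makes the blank cell visible to isnull().
numeric = pd.to_numeric(raw, errors="coerce")
hidden_after = numeric.isnull().sum()
print("Missing after coercion:", hidden_after)
```

Comparing the two counts on a real column is a quick way to tell how many values were hidden.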


In [20]:
# Identify missing data: basic checks

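# Replace empty or whitespace-only strings with proper NA (helps detect hidden missing values;
# matching only '' would miss cells that contain a single space)
df = df.replace(r'^\s*$', pd.NA, regex=True)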

# Basic overview
print('Shape:', df.shape)
print('\nDtype summary:')
df.info()

# Counts of missing values per column
missing_counts = df.isnull().sum()
print('\nMissing value counts (non-zero only):')
print(missing_counts[missing_counts > 0])

# Percent missing per column
missing_pct = (df.isnull().mean() * 100).round(2)
print('\nPercent missing (non-zero only):')
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

# Common gotcha: TotalCharges may contain spaces/empty strings; coerce to numeric
if 'TotalCharges' in df.columns:
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    print('\nTotalCharges missing after coercion:', df['TotalCharges'].isnull().sum())

# Show some rows that have any missing values for inspection
rows_with_na = df[df.isnull().any(axis=1)]
print('\nNumber of rows with any missing values:', len(rows_with_na))
display(rows_with_na.sample(min(5, len(rows_with_na))))

Shape: (7043, 21)

Dtype summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  Paperles

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4380 | 2520-SGTTA | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.0 | NaN | No |
| 488 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | NaN | No |
| 3331 | 7644-OMVMY | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | NaN | No |
| 753 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | NaN | No |
| 1082 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | NaN | No |
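
The five sampled rows above all have `tenure` equal to 0, which suggests the missing `TotalCharges` values belong to customers who have not been billed yet. One possible treatment, sketched below on a made-up frame, is to confirm that pattern and then fill the gaps with 0. This is an assumption about how to handle the gaps, not a step taken in the notebook so far. Dropping the rows is an equally reasonable alternative.

```python
import pandas as pd

# Toy frame mimicking the pattern above (values are illustrative, not from the dataset).
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 5],
    "MonthlyCharges": [20.0, 50.0, 52.55, 70.0],
    "TotalCharges": [float("nan"), 600.0, float("nan"), 350.0],
})

# Check the hypothesis first: is every missing TotalCharges a tenure-0 row?
missing = toy["TotalCharges"].isnull()
all_new_customers = bool((toy.loc[missing, "tenure"] == 0).all())
print("All missing rows have tenure 0:", all_new_customers)

# If so, zero is a defensible fill value, since nothing has been billed yet.
if all_new_customers:
    toy.loc[missing, "TotalCharges"] = 0.0

print(toy)
```

Running the check before filling keeps the imputation honest: if some missing rows had a non-zero tenure, a fill of 0 would understate what those customers paid.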
