# Vodafone Customer Churn

## Problem Statement

- Customer churn is a major concern for telecom providers, as acquiring new customers is significantly more expensive than retaining existing ones.  
- This project aims to build and evaluate machine learning models that can accurately predict whether a customer will churn based on demographic, service usage, and billing-related features.


## Dataset Overview

- Each row represents an individual telecom customer.
- The target variable indicates whether a customer has churned.
- The goal is to build a classification model to predict customer churn.

### Features

- `customerID`: Unique identifier for each customer.
- `gender`: Gender of the customer.
- `SeniorCitizen`: Indicates whether the customer is a senior citizen.
- `Dependents`: Whether the customer has dependents.
- `tenure`: Number of months the customer has been with the company.
- `PhoneService`: Indicates whether the customer has phone service.
- `MultipleLines`: Whether the customer has multiple phone lines.
- `InternetService`: Type of internet service subscribed.
- `OnlineSecurity`: Whether online security service is enabled.
- `OnlineBackup`: Indicates if online backup service is subscribed.
- `DeviceProtection`: Whether device protection is included.
- `TechSupport`: Availability of technical support.
- `StreamingTV`: Whether TV streaming services are subscribed.
- `StreamingMovies`: Whether movie streaming services are subscribed.
- `Contract`: Type of customer contract.
- `PaperlessBilling`: Indicates use of paperless billing.
- `PaymentMethod`: Method used for bill payment.
- `MonthlyCharges`: Monthly amount charged to the customer.
- `TotalCharges`: Total charges accumulated over the customer’s tenure.
- `numAdminTickets`: Number of administrative support tickets raised.
- `numTechTickets`: Number of technical support tickets raised.
- `Location`: Geographic location of the customer.

**Target**
- `Churn`: Indicates whether the customer has churned (1 = Yes, 0 = No).


### 1. Environment Setup and Data Loading

In [None]:
# Import necessary libraries for data manipulation, visualization, preprocessing, and modeling.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree  import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
# --- Data Loading ---
# Load the raw customer churn data from the specified path.

data = pd.read_csv('/Users/hrishinandanmacbook/Developer/ML/001/05/customer churn data.csv')

df = data.copy()

In [None]:
df.head().T

In [None]:
# Initial checks: shape, duplicates, and data types/non-null counts.
df.shape

In [None]:
df.duplicated().sum()

In [None]:
df.info()

- Convert `TotalCharges` to a numeric type; it is loaded as `object`.
- Remove `customerID`: it is unique for every row (a candidate key), so it carries no predictive signal.
- Remove `Location`: with 24 unique values, one-hot encoding it would add many sparse columns.

### 2. Data Cleaning and Preprocessing

#### 2.1. Feature Type Correction and Removal

In [None]:
# Convert 'TotalCharges' from object to numeric, coercing non-numeric values

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
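
A minimal sketch of what `errors='coerce'` does, using a toy Series rather than the real column. The sample values are made up for illustration; the blank entry stands in for whatever non-numeric text the raw file may contain.

```python
import pandas as pd

# Toy stand-in for the raw 'TotalCharges' column: numeric strings plus a blank entry.
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' turns anything unparseable into NaN instead of raising an error.
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)          # numeric dtype after conversion
print(converted.isna().sum())   # how many entries failed to parse
```

The coerced NaN values are what the median imputation step later fills in.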

In [None]:
print(f"Number of unique values in customerID: {df['customerID'].nunique()}")
print(f"Number of unique values in Location: {df['Location'].nunique()}")

In [None]:
# Drop 'customerID' (unique identifier) and 'Location' (high-cardinality categorical).

df.drop(columns=['customerID', 'Location'], inplace=True)

In [None]:
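# Separate features into numerical and categorical dataframes for tailored imputation.
# .copy() makes each subset independent, so later column assignments do not
# trigger SettingWithCopyWarning.

df_num = df.select_dtypes(include='number').copy()
df_cat = df.select_dtypes(exclude='number').copy()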

In [None]:
df_num.head()

In [None]:
df_cat.head()

#### 2.2. Handling Missing Values (Imputation)

##### Numerical Imputation (Median)

In [None]:
# Detecting null values of numerical df

df_num.isna().sum() 

In [None]:
# Impute missing numerical values with each column's median.


for col in df_num.columns:
    df_num[col] = pd.to_numeric(df_num[col], errors='coerce')   # Ensure the column is numeric
    df_num[col] = df_num[col].fillna(df_num[col].median())      # Fill gaps with the column median

df_num.isna().sum().sum()
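
The choice of median over mean can be illustrated with a toy column that contains one extreme value. The numbers below are invented for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy charges column with one missing value and one extreme value.
charges = pd.Series([20.0, 25.0, 30.0, np.nan, 500.0])

mean_fill = charges.mean()      # pulled upward by the extreme value
median_fill = charges.median()  # stays close to the typical customer

median_filled = charges.fillna(median_fill)

print(f'mean fill: {mean_fill}, median fill: {median_fill}')
print(median_filled.tolist())
```

Because the median ignores the magnitude of extreme values, the imputed entry stays close to the bulk of the data.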

##### Categorical Imputation (Mode)

In [None]:
# Detecting null values of categorical df

df_cat.isna().sum()

In [None]:
# Impute missing categorical values with each column's mode.

# String placeholders that should count as missing once values are cast to str.
missing_values = ['nan', 'NaN', 'NA', '<NA>', 'None', '']

for col in df_cat.columns:
    df_cat[col] = df_cat[col].astype(str).str.strip()        # Cast to str and trim surrounding whitespace
    df_cat[col] = df_cat[col].replace(missing_values, pd.NA) # Turn placeholder strings into real missing values
    df_cat[col] = df_cat[col].fillna(df_cat[col].mode()[0])  # Fill missing values with the mode
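
The placeholder list matters because casting to `str` can, depending on the pandas version, turn a real NaN into the literal text `'nan'`, which `fillna` would then ignore. A small sketch on a made-up column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy categorical column with a real NaN, a padded value, and an empty string.
s = pd.Series(['Yes', ' No ', np.nan, '', 'Yes'], dtype=object)

as_text = s.astype(str).str.strip()
print(as_text.tolist())   # inspect how the NaN was rendered after the cast

# Map string placeholders back to real missing values, then fill with the mode.
cleaned = as_text.replace(['nan', ''], pd.NA)
filled = cleaned.fillna(cleaned.mode()[0])

print(filled.tolist())
```

Here the mode is `'Yes'`, so both the NaN and the empty string end up as `'Yes'`.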


In [None]:
# Identifying Unique values of each column

for col in df_cat.columns:
    print(col, df_cat[col].unique())

#### 2.3. Category Consolidation

In [None]:
# Consolidate 'No phone service' to 'No' for binary clarity.
# Missing values were already imputed above, so no extra fillna is needed.

df_cat['MultipleLines'] = df_cat['MultipleLines'].replace('No phone service', 'No')


# Consolidate 'No internet service' to 'No' in all related service columns.

internet_cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for col in internet_cols:
    df_cat[col] = df_cat[col].replace('No internet service', 'No')


# Detecting number of unique values in categorical features
df_cat.nunique() # Verification before encoding

#### 2.4. Outlier Handling (Numerical Features)

In [None]:
# Visualize numerical feature distributions using box plots to guide outlier handling.


for col in df_num.columns:
    plt.figure(figsize=(8, 4))
    sns.boxplot(df_num[col])
    plt.title(col)
    plt.show()

In [None]:
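# Outlier clipping using the Interquartile Range (IQR) method (1.5 * IQR rule).
# Only continuous columns are clipped: a binary flag or a sparse ticket count can
# have an IQR of zero, in which case the rule would flatten the whole column.

continuous_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

for col in continuous_cols:
    q1 = df_num[col].quantile(0.25)
    q3 = df_num[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5*iqr
    upper = q3 + 1.5*iqr

    # clip() returns a new Series, so assign the result back (capping and flooring).
    df_num[col] = df_num[col].clip(lower=lower, upper=upper)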
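
The 1.5 × IQR rule can be sketched on toy data. Two points are worth seeing in isolation: `Series.clip` returns a new Series and leaves the original untouched unless the result is assigned, and a mostly-zero column has an IQR of zero. All values below are invented for illustration.

```python
import pandas as pd

def iqr_bounds(series):
    """Return (lower, upper) bounds under the 1.5 * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Continuous toy column with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 95])
lower, upper = iqr_bounds(values)
clipped = values.clip(lower=lower, upper=upper)   # 'values' itself is unchanged

print(values.max(), clipped.max())

# Binary toy flag: Q1 and Q3 are both 0, so clipping erases every 1.
flag = pd.Series([0, 0, 0, 0, 1])
flag_clipped = flag.clip(*iqr_bounds(flag))

print(flag_clipped.tolist())
```

The second case shows why applying the rule blindly to indicator columns would remove their signal.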

### 3. Feature Encoding

In [None]:
# Columns with more than two categories get One-Hot Encoding; the rest are binary.

df_multi = ['Contract', 'PaymentMethod', 'InternetService']
df_bin = [col for col in df_cat.columns if col not in df_multi]

# Apply One-Hot Encoding to nominal features.
df_cat = pd.get_dummies(df_cat, columns=df_multi, drop_first=True, dtype='int')


# Apply Label Encoding (0 or 1) to binary categorical features.
# df_bin is built before get_dummies so the new indicator columns are not re-encoded.
for col in df_bin:
    le = LabelEncoder()
    df_cat[col] = le.fit_transform(df_cat[col])
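
A small sketch of the two encodings on a toy frame. The category labels are illustrative and may not match the dataset's exact strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'PaperlessBilling': ['Yes', 'No', 'Yes', 'Yes'],
})

# Binary column: LabelEncoder assigns integers in sorted order ('No' -> 0, 'Yes' -> 1).
le = LabelEncoder()
toy['PaperlessBilling'] = le.fit_transform(toy['PaperlessBilling'])

# Multi-class column: k categories become k-1 indicator columns with drop_first=True.
encoded = pd.get_dummies(toy, columns=['Contract'], drop_first=True, dtype='int')

print(encoded.columns.tolist())
print(encoded)
```

With `drop_first=True` the first category becomes the implicit baseline (all indicator columns equal to 0), which avoids a redundant, perfectly collinear column.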

In [None]:
# Combine the processed numerical and categorical dataframes into the final dataset 'df'.

df = pd.concat([df_num, df_cat], axis=1)

### 4. Exploratory Data Analysis (EDA)

#### 4.1. Correlation and Distribution Analysis

In [None]:
# Visualize the correlation matrix of all processed features.

plt.figure(figsize=(30, 15))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
# Plot histograms of numerical features to re-assess distribution and skewness

df[df_num.columns].hist(figsize=(14, 10), bins=30)
plt.show()

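- NOTE: `TotalCharges`, `tenure`, and `MonthlyCharges` remain skewed. StandardScaler does not correct skew; it is applied later because these features sit on very different scales, which matters for distance-based and linear models.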
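
Standardization is a linear rescaling, so it changes a feature's mean and spread but not the shape of its distribution. A quick check on a synthetic right-skewed sample (an assumption standing in for a column such as `TotalCharges`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed sample; not drawn from the churn dataset.
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=1000.0, size=1000))

scaled = pd.Series(StandardScaler().fit_transform(sample.to_frame()).ravel())

print(f'skew before: {sample.skew():.3f}')
print(f'skew after : {scaled.skew():.3f}')
print(f'mean/std after: {scaled.mean():.3f} / {scaled.std(ddof=0):.3f}')
```

The skewness is the same before and after scaling, while the mean and standard deviation move to 0 and 1.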

#### 4.2. Churn Analysis

In [None]:
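# Box plots comparing numerical feature distributions against the 'Churn' target.
# The target itself is skipped if it happens to be among the numerical columns.

for col in df_num.columns.drop('Churn', errors='ignore'):
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='Churn', y=col, data=df)
    plt.title(f'{col} vs Churn')
    plt.show()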

##### INSIGHTS (Numerical Predictors)
- `numAdminTickets` is a weak predictor of churn.
- The strongest numerical predictors of churn are low `tenure`, low `TotalCharges`, high `MonthlyCharges`, and senior-citizen status.

In [None]:
# Count plots comparing categorical feature counts against the 'Churn' target.
# The target itself is skipped if it happens to be among the categorical columns.

for col in df_cat.columns.drop('Churn', errors='ignore'):
    plt.figure(figsize=(6, 4))
    sns.countplot(x=col, hue='Churn', data=df)
    plt.title(f'{col} vs Churn')
    plt.xticks(rotation=45)
    plt.show()

##### INSIGHTS (Categorical Predictors)
- Strong predictors of churn: short-term contracts, electronic check payment, fiber optic internet, no security or support services, no dependents, and paperless billing.
- Weak predictors (dropped before modeling): `gender`, `PhoneService`, `MultipleLines`, `StreamingTV`, `StreamingMovies`.

### 5. Model Preparation and Scaling

In [None]:
# --- Feature Selection ---
# Drop the target plus the features found to be weak predictors during EDA.

X = df.drop(['numAdminTickets', 'gender', 'PhoneService', 'MultipleLines', 'StreamingTV', 'StreamingMovies', 'Churn'], axis=1)
y = df['Churn'] # Target variable

In [None]:
X.head()

In [None]:
# --- Train-Test Split ---
# Split data into 80% training and 20% testing sets for model validation. random_state=42 ensures reproducibility.
# Splitting before scaling keeps test-set statistics out of the scaler.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

In [None]:
# --- Standard Scaling ---
# Standardize features (mean=0, std=1). Essential for distance-based and linear models.
# The scaler is fitted on the training set only and then applied to the test set.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
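
One way to guarantee that the scaler only ever sees training data is scikit-learn's `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features; the names `X_demo`, `y_demo`, and `pipe` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (an assumption; not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training fold only;
# predict()/score() reuse those statistics on the test fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print(f'Test accuracy: {pipe.score(X_te, y_te):.3f}')
```

Bundling the scaler with the model also makes it harder to forget the transform step when scoring new customers.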

### 6. Model Training and Evaluation

In [None]:
# Initialize the classification models with chosen hyperparameters.

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5, metric='minkowski'),
    # random_state fixes the tree's tie-breaking so results are repeatable.
    'Decision Tree': DecisionTreeClassifier(criterion='gini', max_depth=10, random_state=42),
    # gamma is omitted because the linear kernel does not use it.
    'SVM': SVC(kernel='linear', C=1)
}


results = {}

# Fit models and evaluate performance on the test set.
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluate models using key classification metrics.
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 score': f1_score(y_test, y_pred)
    }

    
# F1 score balances Precision and Recall, making it the primary metric.
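
The relationship between the four metrics can be worked through by hand from a confusion matrix. The labels below are hypothetical predictions for ten customers (1 = churned) and are not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for ten customers; not taken from the dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Recall answers "what fraction of churners were caught", while F1 penalizes a model that buys recall with many false alarms, or precision with many misses.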

In [None]:
# Display results

results_df = pd.DataFrame(results)
print(results_df.T)

In [None]:
# Select the best model based on the highest F1 Score.

best_model = results_df.T['F1 score'].idxmax()
print(f"\n✅ Best Model: {best_model}")

In [None]:
# Visualize model comparison

f1_scores = results_df.loc['F1 score'] 
plt.figure(figsize=(10, 5))
sns.barplot(x=f1_scores.values, y=f1_scores.index)
plt.xlabel('F1 Score')
plt.ylabel('Model')
plt.title('Model Comparison')
plt.show()

### 7. Summary

Four models were benchmarked, with the **F1 Score** as the primary metric:

- Best Model: **Support Vector Machine (SVM)**
    - F1 Score: $0.7422$
    - Accuracy: $0.8659$
- **Insight**:
    - The SVM offers the best trade-off between precision and recall on the test set. Its F1 score of $0.74$ is the harmonic mean of the two, so it should not be read as the share of churners identified; that quantity is recall, which is reported separately in the results table.
    - Accuracy is $86.6\%$, but it can be inflated when one class dominates the data, which is why F1 is treated as the more informative figure here.
    - These results suggest that the data cleaning, feature selection, and scaling steps were effective, although they come from a single 80/20 split.