# Customer Churn Analysis And Prediction

This project analyzes customer churn behavior in a telecommunications company and builds machine learning models to predict whether a customer is likely to churn. The objective is to identify at-risk customers and provide insights to improve customer retention strategies.

Churn here refers to the rate at which customers stop using a service (such as mobile or internet), either by switching to a competitor or by giving up telecom services altogether.

### Importing Libraries

In [43]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform


### 1. Load the Dataset

In [44]:
df = pd.read_csv("Telco_Customer_Churn_Dataset  (1).csv")

In [45]:
print(df.shape)
df.head()
df.info()

(7043, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null

### 2. Encode the data

In [46]:
# Convert target variable to binary format
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

In [47]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include="object").columns

# Apply Label Encoding
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
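
Two details of the encoding loop above deserve a second look. It reuses a single `LabelEncoder`, so only the last column's mapping survives, and it also encodes `customerID`, which is an identifier rather than a feature. The sketch below shows one possible alternative on a tiny made-up frame. It is not the notebook's method: the toy values, the `encoders` dict, and the `charges` column (a hypothetical stand-in for a numeric column that arrived as text) are all assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up values; "charges" is a hypothetical stand-in
# for a numeric column stored as text.
toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3", "d-4"],
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "charges": ["29.85", " ", "108.15", "56.95"],
})

# An identifier carries no predictive signal, so drop it before encoding.
toy = toy.drop(columns=["customerID"])

# Coerce text to numbers; unparseable entries (here a blank) become NaN.
toy["charges"] = pd.to_numeric(toy["charges"], errors="coerce")

# Keep one fitted encoder per column so each mapping can be inverted later.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
encoders = {}
for col in text_cols:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
print(encoders["Contract"].classes_)
```

Keeping the fitted encoders around makes it possible to map integer codes back to the original labels with `inverse_transform` when interpreting the model.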

### 3. Split Data for Training and Testing

In [48]:
# Separate features and target variable
X = df.drop("Churn", axis=1)
y = df["Churn"]

# Perform an 80-20 train-test split with stratification
# stratify=y keeps the churn / non-churn proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
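
To see what `stratify=y` does, here is a minimal sketch on made-up labels. The arrays `X_demo` and `y_demo` are illustrative only and are not the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1000 rows with a 26% positive rate (made-up data).
y_demo = np.array([1] * 260 + [0] * 740)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep (approximately) the overall positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without stratification, the positive rate in a small test set can drift away from the overall rate, which makes metrics such as recall noisier.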

### 4. Feature Scaling

In [49]:
# Standardization helps scale-sensitive models such as logistic regression;
# tree-based models do not need it
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
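
The fit-on-train, transform-both pattern above can also be expressed as a scikit-learn pipeline, which applies the same rule automatically. The following is a sketch on synthetic data from `make_classification`, not the churn dataset, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the churn dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses
# that fitted transformation whenever predict or score is called.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

scaler_step = pipe.named_steps["standardscaler"]
print(scaler_step.mean_.shape)   # one learned mean per feature
print(pipe.score(X_te, y_te))    # accuracy on the held-out split
```

Bundling the scaler with the model also keeps test data out of the scaling statistics when the pipeline is later passed to cross-validation or a hyperparameter search.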


### 5. Model Selection

In [50]:
# Define multiple classification models for comparison
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

### 6. Model Training and Evaluation

In [51]:
# Store evaluation results
results = []

for name, model in models.items():
    
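    # Logistic regression is sensitive to feature scale, so it gets scaled data.
    # Gradient boosting is tree-based and does not need scaling, but it is
    # also fit on the scaled data here.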
    if name in ["Logistic Regression", "Gradient Boosting"]:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    
    # Append performance metrics
    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_prob)
    })
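
The loop above records both hard predictions (`predict`) and probabilities (`predict_proba`). As a rough illustration of how the two relate, the sketch below thresholds the positive-class probability by hand on synthetic data. The 0.3 cutoff is an arbitrary example value, not a recommendation, and none of the names come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 25% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# A 0.5 cutoff on the positive-class probability mirrors predict() for a
# binary classifier; a lower cutoff flags more rows as positive.
pred_default = (proba >= 0.5).astype(int)
pred_lenient = (proba >= 0.3).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_lenient = recall_score(y_te, pred_lenient)
print("recall at 0.5:", recall_default)
print("recall at 0.3:", recall_lenient)
```

Lowering the cutoff can only keep or raise recall, usually at the cost of precision, which is why a threshold-free metric such as ROC-AUC is useful for comparing models.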

In [52]:
# Convert results into a DataFrame
results_df = pd.DataFrame(results).round(3)

# Sort models by ROC-AUC score
results_df.sort_values(by="ROC-AUC", ascending=False)

                 Model  Accuracy  Precision  Recall  F1 Score  ROC-AUC
3    Gradient Boosting     0.800      0.661   0.505     0.573    0.842
0  Logistic Regression     0.791      0.631   0.516     0.568    0.838
2        Random Forest     0.794      0.650   0.487     0.557    0.829
1        Decision Tree     0.704      0.446   0.476     0.461    0.631
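
As a quick sanity check on the results above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does that with the rounded values from the table. The helper name `f1_from` is made up for this example, and small differences are expected because the inputs are already rounded.

```python
# Precision, recall, and F1 as reported in the results table (rounded to 3 dp).
reported = {
    "Gradient Boosting": (0.661, 0.505, 0.573),
    "Logistic Regression": (0.631, 0.516, 0.568),
    "Random Forest": (0.650, 0.487, 0.557),
    "Decision Tree": (0.446, 0.476, 0.461),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {}
for name, (p, r, f1_reported) in reported.items():
    recomputed[name] = f1_from(p, r)
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {f1_reported:.3f}")
```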


In [53]:
# Retrain the best-performing model by ROC-AUC (Gradient Boosting)
best_model = GradientBoostingClassifier(random_state=42)
best_model.fit(X_train_scaled, y_train)

# Generate predictions
y_pred = best_model.predict(X_test_scaled)

# Display detailed classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87      1035
           1       0.66      0.51      0.57       374

    accuracy                           0.80      1409
   macro avg       0.75      0.71      0.72      1409
weighted avg       0.79      0.80      0.79      1409
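
To put the 0.80 accuracy in context, the support counts in the report give the churn rate of the test set and the accuracy of a trivial model that always predicts "no churn". The following is a small arithmetic sketch using only those two counts.

```python
# Support counts taken from the classification report.
n_stay, n_churn = 1035, 374
n_total = n_stay + n_churn

churn_rate = n_churn / n_total          # share of churners in the test set
baseline_accuracy = n_stay / n_total    # accuracy of always predicting class 0

print(f"test rows: {n_total}")
print(f"churn rate: {churn_rate:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Because roughly three quarters of the test customers do not churn, accuracy alone overstates how well the model finds churners. The recall of 0.51 for class 1 is the more telling figure for retention work.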

