# AI-Based Customer Churn Prediction System

#### Problem Statement
Customer churn is a major business problem where companies lose revenue when customers discontinue their services. Traditional churn analysis is reactive and manual. The goal of this project is to build an AI-based system that predicts whether a customer is likely to churn based on demographic, service usage, and billing data, enabling proactive retention strategies.


#### System Architecture
The system consists of four layers:

Data Layer – Customer churn dataset

ML Layer – Preprocessing, feature engineering, model training

Model Layer – Trained Logistic Regression model

Application Layer – Streamlit web interface for predictions

In [1]:
import pandas as pd
import numpy as np


In [2]:
import os
os.listdir('../data')


['telco_churn.csv']

## Data Exploration

#### Load dataset

In [3]:
import pandas as pd

df = pd.read_csv('../data/telco_churn.csv')
df.head()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
df.shape


(7043, 21)

In [5]:
df.columns


Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [7]:
df.isnull().sum()


customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

## Data Preprocessing & Feature Engineering

#### Handling missing values

In [8]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')


In [9]:
df['TotalCharges'].isnull().sum()


np.int64(11)

In [10]:
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())


In [11]:
df['TotalCharges'].isnull().sum()


np.int64(0)
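
The null count moves from 0 in the earlier check to 11 here because `TotalCharges` is read as text, and `isnull()` does not flag blank strings. The following is a minimal sketch of the same coerce-then-impute steps on a toy series. The three values are illustrative, and it assumes the blanks are whitespace-only strings, which the outputs above do not confirm.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (illustrative values, one blank entry).
charges = pd.Series(["29.85", " ", "108.15"])

# Blank strings are not null, so an initial null check reports zero.
nulls_before = int(charges.isnull().sum())

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(charges, errors="coerce")
nulls_after = int(numeric.isnull().sum())

# Fill the gap with the median of the values that did parse.
filled = numeric.fillna(numeric.median())

print(nulls_before, nulls_after, filled.tolist())
```

In a stricter setup the median would be computed on the training split only, so that no information from the test rows reaches the model.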

In [12]:
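# Separate the features from the target and drop the identifier column
X = df.drop(columns=["Churn", "customerID"])
# One-hot encode categorical columns; drop_first removes one redundant dummy per category
X = pd.get_dummies(X, drop_first=True)

y = df["Churn"]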

print("Number of features:", X.shape[1])
print(X.columns.tolist()[:5])

Number of features: 30
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'gender_Male']


In [13]:
X.shape, y.shape

((7043, 30), (7043,))

In [14]:
X.shape


(7043, 30)

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [16]:
X_train.shape, X_test.shape


((5634, 30), (1409, 30))

#### Label Encoding

In [17]:
from sklearn.preprocessing import LabelEncoder

# Encode target variable
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)  # 'No' -> 0, 'Yes' -> 1
y_test_encoded = le.transform(y_test)

# Check encoding
print("Sample encoded labels:", y_train_encoded[:10])


Sample encoded labels: [0 0 0 0 0 0 0 0 0 0]


#### Feature Scaling

In [18]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Verify shapes
print("X_train_scaled shape:", X_train_scaled.shape)
print("X_test_scaled shape:", X_test_scaled.shape)


X_train_scaled shape: (5634, 30)
X_test_scaled shape: (1409, 30)


#### Model Definition

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

# Define models
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42, n_estimators=200),
    # use_label_encoder is omitted: recent XGBoost releases ignore it and emit a warning
    "XGBoost": XGBClassifier(random_state=42, eval_metric='logloss'),
    "SVM": SVC(probability=True, random_state=42)
}

print("Models defined:", list(models.keys()))


Models defined: ['Logistic Regression', 'Random Forest', 'XGBoost', 'SVM']


## Train & Evaluate

#### Logistic Regression

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Train Logistic Regression first
lr = models["Logistic Regression"]
lr.fit(X_train_scaled, y_train_encoded)

y_pred = lr.predict(X_test_scaled)
y_prob = lr.predict_proba(X_test_scaled)[:, 1]

# Metrics
print("Logistic Regression Metrics")
print("Accuracy:", accuracy_score(y_test_encoded, y_pred))
print("Precision:", precision_score(y_test_encoded, y_pred))
print("Recall:", recall_score(y_test_encoded, y_pred))
print("F1-Score:", f1_score(y_test_encoded, y_pred))
print("ROC-AUC:", roc_auc_score(y_test_encoded, y_prob))


Logistic Regression Metrics
Accuracy: 0.8069552874378992
Precision: 0.6583850931677019
Recall: 0.5668449197860963
F1-Score: 0.6091954022988506
ROC-AUC: 0.8415846443979436


#### Random Forest

In [21]:
# Train Random Forest
rf = models["Random Forest"]
rf.fit(X_train_scaled, y_train_encoded)

# Predict
y_pred = rf.predict(X_test_scaled)
y_prob = rf.predict_proba(X_test_scaled)[:, 1]

# Metrics
print("=== Random Forest Metrics ===")
print("Accuracy:", accuracy_score(y_test_encoded, y_pred))
print("Precision:", precision_score(y_test_encoded, y_pred))
print("Recall:", recall_score(y_test_encoded, y_pred))
print("F1-Score:", f1_score(y_test_encoded, y_pred))
print("ROC-AUC:", roc_auc_score(y_test_encoded, y_prob))


=== Random Forest Metrics ===
Accuracy: 0.7920511000709723
Precision: 0.6391752577319587
Recall: 0.49732620320855614
F1-Score: 0.5593984962406015
ROC-AUC: 0.8258518690743755


#### XGBoost

In [22]:
# Train XGBoost
xgb = models["XGBoost"]
xgb.fit(X_train_scaled, y_train_encoded)

# Predict
y_pred = xgb.predict(X_test_scaled)
y_prob = xgb.predict_proba(X_test_scaled)[:, 1]

# Metrics
print("=== XGBoost Metrics ===")
print("Accuracy:", accuracy_score(y_test_encoded, y_pred))
print("Precision:", precision_score(y_test_encoded, y_pred))
print("Recall:", recall_score(y_test_encoded, y_pred))
print("F1-Score:", f1_score(y_test_encoded, y_pred))
print("ROC-AUC:", roc_auc_score(y_test_encoded, y_prob))


=== XGBoost Metrics ===
Accuracy: 0.7849538679914834
Precision: 0.60790273556231
Recall: 0.5347593582887701
F1-Score: 0.5689900426742532
ROC-AUC: 0.8214136247384329




#### SVM


In [23]:
# Train SVM
svm = models["SVM"]
svm.fit(X_train_scaled, y_train_encoded)

# Predict
y_pred = svm.predict(X_test_scaled)
y_prob = svm.predict_proba(X_test_scaled)[:, 1]

# Metrics
print("=== SVM Metrics ===")
print("Accuracy:", accuracy_score(y_test_encoded, y_pred))
print("Precision:", precision_score(y_test_encoded, y_pred))
print("Recall:", recall_score(y_test_encoded, y_pred))
print("F1-Score:", f1_score(y_test_encoded, y_pred))
print("ROC-AUC:", roc_auc_score(y_test_encoded, y_prob))


=== SVM Metrics ===
Accuracy: 0.7927608232789212
Precision: 0.6443661971830986
Recall: 0.4893048128342246
F1-Score: 0.5562310030395137
ROC-AUC: 0.7960500142085821


## Confusion Matrix & Classification Report for Each Model

#### Logistic Regression

In [24]:
from sklearn.metrics import confusion_matrix, classification_report

lr = models["Logistic Regression"]
y_pred_lr = lr.predict(X_test_scaled)

print("=== Logistic Regression ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test_encoded, y_pred_lr))
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred_lr))


=== Logistic Regression ===
Confusion Matrix:
[[925 110]
 [162 212]]

Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.89      0.87      1035
           1       0.66      0.57      0.61       374

    accuracy                           0.81      1409
   macro avg       0.75      0.73      0.74      1409
weighted avg       0.80      0.81      0.80      1409
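
The headline metrics can be checked by hand from the confusion matrix. As a short worked sketch, the counts below are copied from the Logistic Regression matrix above (rows are actual classes, columns are predicted classes); the variable names are only for illustration.

```python
# Counts from the Logistic Regression confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 925, 110, 162, 212

# Precision: of the customers flagged as churners, how many actually churned
precision = tp / (tp + fp)
# Recall: of the customers who actually churned, how many were flagged
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: share of all test customers classified correctly
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The 162 false negatives are churners the model missed, which is why recall is the metric to watch for retention work.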



#### Random Forest

In [25]:
rf = models["Random Forest"]
y_pred_rf = rf.predict(X_test_scaled)

print("=== Random Forest ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test_encoded, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred_rf))


=== Random Forest ===
Confusion Matrix:
[[930 105]
 [188 186]]

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1035
           1       0.64      0.50      0.56       374

    accuracy                           0.79      1409
   macro avg       0.74      0.70      0.71      1409
weighted avg       0.78      0.79      0.78      1409



#### XGBoost

In [26]:
xgb = models["XGBoost"]
y_pred_xgb = xgb.predict(X_test_scaled)

print("=== XGBoost ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test_encoded, y_pred_xgb))
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred_xgb))


=== XGBoost ===
Confusion Matrix:
[[906 129]
 [174 200]]

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.88      0.86      1035
           1       0.61      0.53      0.57       374

    accuracy                           0.78      1409
   macro avg       0.72      0.71      0.71      1409
weighted avg       0.78      0.78      0.78      1409



#### SVM

In [27]:
svm = models["SVM"]
y_pred_svm = svm.predict(X_test_scaled)

print("=== SVM ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test_encoded, y_pred_svm))
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred_svm))


=== SVM ===
Confusion Matrix:
[[934 101]
 [191 183]]

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1035
           1       0.64      0.49      0.56       374

    accuracy                           0.79      1409
   macro avg       0.74      0.70      0.71      1409
weighted avg       0.78      0.79      0.78      1409



## Compare All Models

In [28]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Create empty list to store results
results = []

# Logistic Regression
y_pred_lr = models["Logistic Regression"].predict(X_test_scaled)
y_prob_lr = models["Logistic Regression"].predict_proba(X_test_scaled)[:, 1]

results.append({
    "Model": "Logistic Regression",
    "Accuracy": accuracy_score(y_test_encoded, y_pred_lr),
    "Precision": precision_score(y_test_encoded, y_pred_lr),
    "Recall": recall_score(y_test_encoded, y_pred_lr),
    "F1-Score": f1_score(y_test_encoded, y_pred_lr),
    "ROC-AUC": roc_auc_score(y_test_encoded, y_prob_lr)
})


In [29]:
# Random Forest
y_pred_rf = models["Random Forest"].predict(X_test_scaled)
y_prob_rf = models["Random Forest"].predict_proba(X_test_scaled)[:, 1]

results.append({
    "Model": "Random Forest",
    "Accuracy": accuracy_score(y_test_encoded, y_pred_rf),
    "Precision": precision_score(y_test_encoded, y_pred_rf),
    "Recall": recall_score(y_test_encoded, y_pred_rf),
    "F1-Score": f1_score(y_test_encoded, y_pred_rf),
    "ROC-AUC": roc_auc_score(y_test_encoded, y_prob_rf)
})


In [30]:
# XGBoost
y_pred_xgb = models["XGBoost"].predict(X_test_scaled)
y_prob_xgb = models["XGBoost"].predict_proba(X_test_scaled)[:, 1]

results.append({
    "Model": "XGBoost",
    "Accuracy": accuracy_score(y_test_encoded, y_pred_xgb),
    "Precision": precision_score(y_test_encoded, y_pred_xgb),
    "Recall": recall_score(y_test_encoded, y_pred_xgb),
    "F1-Score": f1_score(y_test_encoded, y_pred_xgb),
    "ROC-AUC": roc_auc_score(y_test_encoded, y_prob_xgb)
})


In [None]:
# SVM
y_pred_svm = models["SVM"].predict(X_test_scaled)
y_prob_svm = models["SVM"].predict_proba(X_test_scaled)[:, 1]

results.append({
    "Model": "SVM",
    "Accuracy": accuracy_score(y_test_encoded, y_pred_svm),
    "Precision": precision_score(y_test_encoded, y_pred_svm),
    "Recall": recall_score(y_test_encoded, y_pred_svm),
    "F1-Score": f1_score(y_test_encoded, y_pred_svm),
    "ROC-AUC": roc_auc_score(y_test_encoded, y_prob_svm)
})


In [None]:
import pandas as pd

results_df = pd.DataFrame(results)
print("=== Model Comparison Table ===")
print(results_df)


In [None]:
import matplotlib.pyplot as plt

# Grouped bars keep both metrics readable; overlaid bars partly hide the shorter F1-Score bars
ax = results_df.set_index("Model")[["F1-Score", "ROC-AUC"]].plot(
    kind="bar", figsize=(10, 5), rot=0
)
ax.set_ylabel("Score")
ax.set_title("Model Comparison: F1-Score vs ROC-AUC")
ax.set_ylim(0, 1)
plt.show()


## Selecting the best model and saving it for deployment

In [None]:
# Identify best model based on F1-Score
best_model_name = results_df.sort_values(by="F1-Score", ascending=False).iloc[0]["Model"]
best_model = models[best_model_name]

print(f"Best Model: {best_model_name}")


In [None]:
import joblib
import os

# Ensure models directory exists
os.makedirs("../models", exist_ok=True)

# Save Logistic Regression model
joblib.dump(lr, "../models/logistic_regression_churn.pkl")

print("Logistic Regression model saved successfully")


#### Model Selection
Four models were evaluated: Logistic Regression, Random Forest, XGBoost, and SVM. Logistic Regression scored highest on every reported metric, including recall (0.57), F1-score (0.61), and ROC-AUC (0.84). Recall matters most for churn prediction, because a missed churner is a lost customer, while a false alarm only costs an unnecessary retention offer. Logistic Regression was therefore selected as the final model for deployment.
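
As a rough cross-check of this selection, the sketch below collects the test-set metrics printed in the training cells (rounded to four decimals) and looks up the top model per metric. The table is assembled by hand from those outputs rather than taken from `results_df`.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs above, rounded to 4 decimals
rows = [
    {"Model": "Logistic Regression", "Accuracy": 0.8070, "Precision": 0.6584,
     "Recall": 0.5668, "F1-Score": 0.6092, "ROC-AUC": 0.8416},
    {"Model": "Random Forest", "Accuracy": 0.7921, "Precision": 0.6392,
     "Recall": 0.4973, "F1-Score": 0.5594, "ROC-AUC": 0.8259},
    {"Model": "XGBoost", "Accuracy": 0.7850, "Precision": 0.6079,
     "Recall": 0.5348, "F1-Score": 0.5690, "ROC-AUC": 0.8214},
    {"Model": "SVM", "Accuracy": 0.7928, "Precision": 0.6444,
     "Recall": 0.4893, "F1-Score": 0.5562, "ROC-AUC": 0.7961},
]
summary = pd.DataFrame(rows).set_index("Model")

# idxmax returns, for each metric column, the model with the highest value
best_per_metric = summary.idxmax()
print(best_per_metric)
```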

In [None]:
joblib.dump(scaler, "../models/scaler.pkl")
print("Scaler saved successfully")


In [None]:
import joblib

# Save feature names for production inference
joblib.dump(X.columns.tolist(), "../models/feature_names.pkl")

print("Feature names saved successfully", len(X.columns))


In [None]:
features = joblib.load("../models/feature_names.pkl")
print(len(features))
print(features[:10])


In [None]:
loaded_model = joblib.load("../models/logistic_regression_churn.pkl")
loaded_scaler = joblib.load("../models/scaler.pkl")

print("Model and scaler loaded")


Feature names are saved so that the Streamlit app can build its input with the same columns, in the same order, as the training data. This prevents feature mismatch errors at prediction time.
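
One way the saved feature names could be used at inference time is sketched below. This is an assumption about how an app might align its input, not the project's actual Streamlit code. The `feature_names` list is a hypothetical subset of the 30 training columns, and the customer record is made up.

```python
import pandas as pd

# Hypothetical subset of the saved training columns (the real list has 30 entries)
feature_names = [
    "SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges",
    "gender_Male", "Contract_One year", "Contract_Two year",
]

# Made-up raw input for one customer, as a form might collect it
raw = pd.DataFrame([{
    "SeniorCitizen": 0, "tenure": 12, "MonthlyCharges": 70.7,
    "TotalCharges": 848.4, "gender": "Male", "Contract": "Two year",
}])

# Encode without drop_first: a single row shows only one category per column,
# and drop_first would remove exactly that category
encoded = pd.get_dummies(raw)

# Reorder to the training layout; dummy columns absent from this row become 0,
# and any column unknown to the model is dropped
aligned = encoded.reindex(columns=feature_names, fill_value=0).astype(float)

print(aligned.iloc[0].to_dict())
```

The aligned frame has the column order the scaler was fitted on, so it can be passed to `loaded_scaler.transform` and then to `loaded_model.predict_proba`.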

In [None]:
X_test_scaled_loaded = loaded_scaler.transform(X_test)

preds = loaded_model.predict(X_test_scaled_loaded)
probs = loaded_model.predict_proba(X_test_scaled_loaded)[:, 1]

print("Predictions:", preds[:10])
print("Probabilities:", probs[:10])


## Explainable AI / Feature Importance

#### Feature Importance for Logistic Regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Get coefficients
coefficients = best_model.coef_[0]  # assuming best_model = LogisticRegression

# Convert to absolute values (magnitude of influence)
importance = np.abs(coefficients)

# Create a Pandas Series for plotting
feature_importances = pd.Series(importance, index=X.columns).sort_values(ascending=False)

# Plot
plt.figure(figsize=(12,6))
feature_importances.plot(kind="bar", title="Feature Importance - Logistic Regression")
plt.ylabel("Coefficient Magnitude")
plt.show()
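
Taking absolute values ranks features by strength but hides the direction of the effect. As a hedged sketch, the signed coefficient of a standardized feature can be read as an odds ratio, `exp(coef)`: the multiplicative change in churn odds for a one standard deviation increase in that feature. The feature names below come from the dataset, but the coefficient values are hypothetical and are not taken from the trained model.

```python
import math

# Hypothetical coefficients for illustration only; not from the trained model
coefs = {"tenure": -0.8, "MonthlyCharges": 0.5}

# exp(coef) converts a log-odds coefficient into an odds ratio
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} churn odds)")
```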


#### Random Forest Feature Importance

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Get feature importances from the trained Random Forest
rf_importances = pd.Series(rf.feature_importances_, index=X.columns)

# Sort in descending order
rf_importances = rf_importances.sort_values(ascending=False)

# Plot
plt.figure(figsize=(12,6))
rf_importances.plot(kind='bar', title="Feature Importance - Random Forest")
plt.ylabel("Importance Score")
plt.show()


#### XGBoost Feature Importance

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Feature importance from trained XGBoost
xgb_importances = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)

# Plot
plt.figure(figsize=(12,6))
xgb_importances.plot(kind='bar', title="Feature Importance - XGBoost")
plt.ylabel("Importance Score")
plt.show()


## Decision Support System (Business Layer)

In [None]:
import pandas as pd

# Predict churn probabilities using trained Logistic Regression
churn_prob = loaded_model.predict_proba(X_test_scaled)[:, 1]

print("Total predictions:", len(churn_prob))
print("First 10 churn probabilities:")
print(churn_prob[:10])


### Risk Categorization

In [None]:
risk_level = pd.cut(
    churn_prob,
    bins=[0, 0.3, 0.6, 1],
    labels=["Low", "Medium", "High"],
    include_lowest=True  # keep a probability of exactly 0 in "Low" instead of NaN
)

decision_df = pd.DataFrame({
    "Churn Probability": churn_prob,
    "Risk Level": risk_level
})

decision_df.head(10)
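
`pd.cut` treats each bin as a right-inclusive interval and, by default, excludes the lowest edge. The sketch below uses made-up probabilities to show where boundary values land for the same bin edges, with and without `include_lowest=True`.

```python
import pandas as pd

bins = [0, 0.3, 0.6, 1]
labels = ["Low", "Medium", "High"]

# Made-up probabilities that sit on or near the bin edges
probs = [0.0, 0.3, 0.31, 0.6, 1.0]

# With include_lowest=True the first interval becomes [0, 0.3]
levels = pd.cut(probs, bins=bins, labels=labels, include_lowest=True)
print(list(levels))

# With the default settings, a probability of exactly 0 falls outside every bin
default_levels = pd.cut(probs, bins=bins, labels=labels)
print(pd.isna(default_levels[0]))
```

A probability of exactly 0.3 or 0.6 is assigned to the lower category, so "Medium" covers values above 0.3 up to and including 0.6.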


### Business Visualization

In [None]:
import matplotlib.pyplot as plt

# sort_index orders the bars Low, Medium, High instead of by count
risk_counts = decision_df["Risk Level"].value_counts().sort_index()

plt.figure(figsize=(8,5))
risk_counts.plot(kind="bar")
plt.title("Customer Churn Risk Distribution")
plt.ylabel("Number of Customers")
plt.xlabel("Risk Level")
plt.show()


In [None]:
import joblib
import os

os.makedirs("../models", exist_ok=True)

joblib.dump(best_model, "../models/logistic_regression_final.pkl")
joblib.dump(scaler, "../models/scaler.pkl")
joblib.dump(X.columns.tolist(), "../models/feature_names.pkl")

print("Final model artifacts saved")


#### Conclusion
This notebook trains and compares four classifiers for churn prediction and selects Logistic Regression, which reached 0.81 accuracy, 0.57 recall, and 0.84 ROC-AUC on the held-out test set. The saved model, scaler, and feature names can be loaded by a Streamlit application to score customers on demand, so that business users can direct retention actions at high-risk customers.