# Churn Prediction with Multiple ML Algorithms

In this notebook, I compare six machine learning algorithms
(Logistic Regression, KNN, SVC, Naive Bayes, Decision Trees, Random Forest)
for predicting customer churn. The goal is to evaluate them on the same dataset
and see which model generalizes best.


### Importing the libraries

In [215]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [216]:
df=pd.read_csv("/content/Telco Customer Churn Dataset.csv")

In [217]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [218]:
df.columns


Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [219]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [220]:
df.duplicated().sum()

np.int64(0)

In [221]:
def unique_values(df):
  for i in df.columns:
    print(i,df[i].unique(),"\n")

In [222]:
unique_values(df)

customerID ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK'] 

gender ['Female' 'Male'] 

SeniorCitizen [0 1] 

Partner ['Yes' 'No'] 

Dependents ['No' 'Yes'] 

tenure [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26  0
 39] 

PhoneService ['No' 'Yes'] 

MultipleLines ['No phone service' 'No' 'Yes'] 

InternetService ['DSL' 'Fiber optic' 'No'] 

OnlineSecurity ['No' 'Yes' 'No internet service'] 

OnlineBackup ['Yes' 'No' 'No internet service'] 

DeviceProtection ['No' 'Yes' 'No internet service'] 

TechSupport ['No' 'Yes' 'No internet service'] 

StreamingTV ['No' 'Yes' 'No internet service'] 

StreamingMovies ['No' 'Yes' 'No internet service'] 

Contract ['Month-to-month' 'One year' 'Two year'] 

PaperlessBilling ['Yes' 'No'] 

PaymentMethod ['Electronic check' 'Mailed check' 'Bank tran

In [223]:
x=df.drop("Churn",axis=1)
y=df["Churn"]

In [224]:
x.drop('customerID',axis=1,inplace=True)

In [225]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


In [226]:
x['TotalCharges'] = pd.to_numeric(x['TotalCharges'], errors='coerce')

In [227]:
x['TotalCharges'].isnull().sum()

np.int64(11)

In [228]:
# Assign the result back; calling fillna(..., inplace=True) on a selected column
# triggers a pandas FutureWarning and will stop working in pandas 3.0.
x['TotalCharges'] = x['TotalCharges'].fillna(x['TotalCharges'].mean())


In [229]:
x['TotalCharges'].isnull().sum()

np.int64(0)
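
The missing values arise because `pd.to_numeric(..., errors='coerce')` turns any entry it cannot parse into `NaN`. A minimal sketch on a made-up three-row frame follows; the blank string is an assumed example of an unparseable entry, not a value shown in the output above.

```python
import pandas as pd

# Toy frame with hypothetical values; the blank string cannot be parsed as a number.
toy = pd.DataFrame({"tenure": [0, 2, 34],
                    "TotalCharges": [" ", "108.15", "1889.5"]})

# errors='coerce' converts anything that cannot be parsed into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Fill with the column mean and assign the result back to the column.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())
print(n_missing, toy["TotalCharges"].tolist())
```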

## Data Preprocessing

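- Continuous features (`tenure`, `MonthlyCharges`, `TotalCharges`) are standardized so that distance-based models such as KNN and SVC are not dominated by features with larger scales.
- Categorical features are label encoded to convert them into numeric form. Label encoding assigns an arbitrary integer order to nominal categories such as `PaymentMethod`.
- After these steps every feature is numeric, which all six scikit-learn models require.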
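
One alternative to label encoding, sketched under assumptions: one-hot encode the nominal columns and fit the scaler inside a `Pipeline`, so that preprocessing statistics are learned only from the rows passed to `fit`. The toy frame, its values, and the column subset are illustrative and are not the notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Telco frame (hypothetical rows).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 5,
    "tenure": [1, 34, 45, 2] * 5,
    "MonthlyCharges": [29.85, 56.95, 42.30, 70.70] * 5,
})
target = pd.Series([1, 0, 0, 1] * 5)

# One-hot encode the nominal column, standardize the continuous ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
clf.fit(toy, target)

# Three one-hot columns plus two scaled numeric columns.
n_features = clf.named_steps["prep"].transform(toy).shape[1]
print(n_features)
```

Calling `clf.fit` on the training split and `clf.predict` on the test split would then keep test-set statistics out of the scaler.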


### Encoding categorical features

In [230]:
le = LabelEncoder()

# Categorical (object-dtype) columns to encode as integers.
target_col = ['gender', 'Partner', 'Dependents',
              'PhoneService', 'MultipleLines', 'InternetService',
              'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
              'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
              'PaymentMethod']

# The encoder is refit per column, so each column gets its own 0..n-1 mapping.
for col in target_col:
    x[col] = le.fit_transform(x[col])



### Standardizing numerical features

In [231]:
st = StandardScaler()

# StandardScaler standardizes each column independently, so one call covers all three.
num_cols = ['TotalCharges', 'MonthlyCharges', 'tenure']
x[num_cols] = st.fit_transform(x[num_cols])

In [232]:
x.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,0,0,1,0,-1.277445,0,1,0,0,2,0,0,0,0,0,1,2,-1.160323,-0.994971
1,1,0,0,0,0.066327,1,0,0,2,0,2,0,0,0,1,0,3,-0.259629,-0.173876
2,1,0,0,0,-1.236724,1,0,0,2,2,0,0,0,0,0,1,3,-0.36266,-0.960399
3,1,0,0,0,0.514251,0,1,0,2,0,2,2,0,0,1,0,0,-0.746535,-0.1954
4,0,0,0,0,-1.236724,1,0,1,0,0,0,0,0,0,0,1,2,0.197365,-0.941193


In [233]:
y.head()

Unnamed: 0,Churn
0,No
1,No
2,Yes
3,No
4,Yes


In [234]:
y = y.map({"Yes": 1, "No": 0})

In [235]:
y.head()

Unnamed: 0,Churn
0,0
1,0
2,1
3,0
4,1


### Splitting the dataset for training and testing

In [236]:
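# train_test_split returns X_train, X_test, y_train, y_test, in that order.
# Note: the outputs recorded in the rest of this notebook were produced with the
# train and test names swapped, i.e. models were fitted on 1,409 rows and evaluated on 5,634.
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)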
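
Because roughly a quarter of the customers churn, a stratified split keeps the class ratio the same in both partitions. The sketch below uses synthetic data (not the Telco dataset) and adds `stratify`, which the split above does not use. It also shows the order in which `train_test_split` returns its outputs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, about 26% positives (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.26).astype(int)

# Return order is X_train, X_test, y_train, y_test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```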

## Model Evaluation Function

For each model, I evaluate performance on both training and testing data.
This helps identify overfitting (high train accuracy, low test accuracy)
and underfitting (low accuracy on both). Metrics include:
- Accuracy
- Classification Report (Precision, Recall, F1-score)
- Confusion Matrix


In [237]:
# Function to evaluate on training data
def evaluate_on_train(model, X_train, y_train):
    y_train_pred = model.predict(X_train)
    print(f"--- Training Evaluation: {model.__class__.__name__} ---")
    print("Accuracy:", accuracy_score(y_train, y_train_pred))
    print("Classification Report:\n", classification_report(y_train, y_train_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_train, y_train_pred))
    return y_train_pred

# Function to evaluate on testing data
def evaluate_on_test(model, X_test, y_test):
    y_test_pred = model.predict(X_test)
    print(f"--- Testing Evaluation: {model.__class__.__name__} ---")
    print("Accuracy:", accuracy_score(y_test, y_test_pred))
    print("Classification Report:\n", classification_report(y_test, y_test_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))
    return y_test_pred

## Logistic Regression

Logistic Regression is a linear model commonly used for binary classification problems
such as churn prediction. It estimates the probability that a customer will churn
based on input features. The model applies a logistic (sigmoid) function to map
predictions between 0 and 1, making it easy to interpret as probabilities.
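
As a small illustration of the sigmoid mapping, the sketch below fits a model on synthetic data (not the churn dataset) and checks that applying the logistic function to the linear score reproduces scikit-learn's predicted probability for the positive class. The helper name `sigmoid` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# decision_function returns the linear score w.x + b for each row.
scores = model.decision_function(X_demo)
manual_proba = sigmoid(scores)
sklearn_proba = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```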


In [238]:
log_reg=LogisticRegression()

In [239]:
log_reg.fit(X_train,y_train)

In [240]:
log_reg_train_pred=evaluate_on_train(log_reg,X_train,y_train)

--- Training Evaluation: LogisticRegression ---
Accuracy: 0.8176011355571328
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.90      0.88      1036
           1       0.68      0.58      0.63       373

    accuracy                           0.82      1409
   macro avg       0.77      0.74      0.75      1409
weighted avg       0.81      0.82      0.81      1409

Confusion Matrix:
 [[935 101]
 [156 217]]


In [241]:
log_reg_test_pred=evaluate_on_test(log_reg,X_test,y_test)

--- Testing Evaluation: LogisticRegression ---
Accuracy: 0.7988995385161519
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.89      0.87      4138
           1       0.64      0.54      0.59      1496

    accuracy                           0.80      5634
   macro avg       0.74      0.72      0.73      5634
weighted avg       0.79      0.80      0.79      5634

Confusion Matrix:
 [[3692  446]
 [ 687  809]]
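
The figures in the classification report can be derived directly from the confusion matrix. The sketch below recomputes accuracy and the class-1 (churn) precision, recall, and F1 from the matrix printed above.

```python
import numpy as np

# Logistic Regression "testing" confusion matrix from the output above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = np.array([[3692, 446],
               [687, 809]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted churners, the share that actually churned
recall = tp / (tp + fn)      # of actual churners, the share the model caught
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values match the class-1 row of the report (0.64, 0.54, 0.59) and the reported accuracy.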


## K-Nearest Neighbors (KNN)

KNN classifies a customer based on the majority class of its nearest neighbors.
It is simple and intuitive but can be sensitive to feature scaling and the choice of `k`.
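
One common way to choose `k` is cross-validation over a few candidates. This is a minimal sketch on synthetic data; the candidate list is hypothetical and this notebook does not tune `k` this way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

candidate_ks = [3, 7, 15, 31]   # hypothetical candidates
cv_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```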


In [242]:
knn_model=KNeighborsClassifier(n_neighbors=15)

knn_model.fit(X_train,y_train)

In [243]:
knn_train_pred=evaluate_on_train(knn_model,X_train,y_train)

--- Training Evaluation: KNeighborsClassifier ---
Accuracy: 0.8048261178140526
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.88      0.87      1036
           1       0.64      0.60      0.62       373

    accuracy                           0.80      1409
   macro avg       0.75      0.74      0.74      1409
weighted avg       0.80      0.80      0.80      1409

Confusion Matrix:
 [[910 126]
 [149 224]]


In [244]:
knn_test_pred=evaluate_on_test(knn_model,X_test,y_test)

--- Testing Evaluation: KNeighborsClassifier ---
Accuracy: 0.7703230386936457
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.86      0.85      4138
           1       0.57      0.53      0.55      1496

    accuracy                           0.77      5634
   macro avg       0.70      0.69      0.70      5634
weighted avg       0.76      0.77      0.77      5634

Confusion Matrix:
 [[3551  587]
 [ 707  789]]


## Support Vector Classifier (SVC)

SVC tries to find the best boundary (linear or curved) that separates churners
from non-churners. Parameters like `C` and `gamma` control the flexibility of this boundary.
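
To see how `gamma` changes the flexibility of an RBF boundary, the sketch below compares training and held-out accuracy for three values on synthetic data. The values are hypothetical; a very large `gamma` typically fits the training rows closely while doing worse on held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the churn features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

results = {}
for gamma in [0.01, 1.0, 100.0]:   # hypothetical values
    svc = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) for each gamma.
    results[gamma] = (svc.score(X_tr, y_tr), svc.score(X_te, y_te))
print(results)
```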


In [245]:
svc_model=SVC(kernel='rbf',C=1.0,gamma='scale',random_state=42)

svc_model.fit(X_train,y_train)

In [246]:
svc_train_pred=evaluate_on_train(svc_model,X_train,y_train)

--- Training Evaluation: SVC ---
Accuracy: 0.8225691980127751
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.92      0.88      1036
           1       0.72      0.54      0.62       373

    accuracy                           0.82      1409
   macro avg       0.78      0.73      0.75      1409
weighted avg       0.81      0.82      0.81      1409

Confusion Matrix:
 [[957  79]
 [171 202]]


In [247]:
svc_test_pred=evaluate_on_test(svc_model,X_test,y_test)

--- Testing Evaluation: SVC ---
Accuracy: 0.7893148739794107
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.91      0.86      4138
           1       0.65      0.46      0.53      1496

    accuracy                           0.79      5634
   macro avg       0.73      0.68      0.70      5634
weighted avg       0.78      0.79      0.78      5634

Confusion Matrix:
 [[3765  373]
 [ 814  682]]


## Naive Bayes

Naive Bayes is a probabilistic model that assumes independence between features.
It is lightweight, fast, and often works surprisingly well as a baseline.


In [248]:
nb_model=GaussianNB()

nb_model.fit(X_train,y_train)

In [249]:
nb_train_pred=evaluate_on_train(nb_model,X_train,y_train)

--- Training Evaluation: GaussianNB ---
Accuracy: 0.7686302342086586
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.77      0.83      1036
           1       0.54      0.77      0.64       373

    accuracy                           0.77      1409
   macro avg       0.72      0.77      0.73      1409
weighted avg       0.81      0.77      0.78      1409

Confusion Matrix:
 [[795 241]
 [ 85 288]]


In [250]:
nb_test_pred=evaluate_on_test(nb_model,X_test,y_test)

--- Testing Evaluation: GaussianNB ---
Accuracy: 0.7548810791622294
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.77      0.82      4138
           1       0.53      0.72      0.61      1496

    accuracy                           0.75      5634
   macro avg       0.71      0.74      0.72      5634
weighted avg       0.79      0.75      0.77      5634

Confusion Matrix:
 [[3177  961]
 [ 420 1076]]


## Decision Tree

Decision Trees split the dataset into branches based on feature values.
They are easy to interpret but can overfit if not regularized.
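
The interpretability claim can be illustrated with `export_text`, which prints a fitted tree as nested if/else rules. The sketch uses synthetic data, hypothetical feature names, and a shallower `max_depth` than the model below so the printout stays short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the churn features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]   # hypothetical names

# max_depth caps how many splits lie between the root and any leaf.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
tree.fit(X_demo, y_demo)

rules = export_text(tree, feature_names=feature_names)
print(rules)
print("depth:", tree.get_depth())
```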


In [251]:
dt_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)


dt_model.fit(X_train,y_train)

In [252]:
dt_train_pred=evaluate_on_train(dt_model,X_train,y_train)

--- Training Evaluation: DecisionTreeClassifier ---
Accuracy: 0.8232789212207239
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.89      0.88      1036
           1       0.68      0.64      0.66       373

    accuracy                           0.82      1409
   macro avg       0.77      0.76      0.77      1409
weighted avg       0.82      0.82      0.82      1409

Confusion Matrix:
 [[923 113]
 [136 237]]


In [253]:
dt_test_pred=evaluate_on_test(dt_model,X_test,y_test)

--- Testing Evaluation: DecisionTreeClassifier ---
Accuracy: 0.7678381256656017
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.85      0.84      4138
           1       0.57      0.53      0.55      1496

    accuracy                           0.77      5634
   macro avg       0.70      0.69      0.70      5634
weighted avg       0.76      0.77      0.77      5634

Confusion Matrix:
 [[3528  610]
 [ 698  798]]


## Random Forest

Random Forest builds many decision trees, each on a bootstrap sample of the rows with a random subset of features considered at each split, and averages their predictions.
This reduces overfitting and usually improves generalization compared to a single tree.
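
The averaging step can be checked directly: in scikit-learn, the forest's `predict_proba` is the mean of the class probabilities predicted by its individual trees. A minimal sketch on synthetic data, with a smaller `n_estimators` than the model below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the churn features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
forest.fit(X_demo, y_demo)

# Average the per-tree class probabilities and compare with the forest's output.
tree_probas = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)
print(np.allclose(tree_probas, forest.predict_proba(X_demo)))
print(len(forest.estimators_))
```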


In [254]:
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42,n_jobs=-1)


rf_model.fit(X_train,y_train)

In [255]:
rf_train_pred=evaluate_on_train(rf_model,X_train,y_train)

--- Training Evaluation: RandomForestClassifier ---
Accuracy: 0.8388928317955997
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.93      0.90      1036
           1       0.76      0.57      0.65       373

    accuracy                           0.84      1409
   macro avg       0.81      0.75      0.77      1409
weighted avg       0.83      0.84      0.83      1409

Confusion Matrix:
 [[968  68]
 [159 214]]


In [256]:
rf_test_pred=evaluate_on_test(rf_model,X_test,y_test)

--- Testing Evaluation: RandomForestClassifier ---
Accuracy: 0.7978345757898474
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.92      0.87      4138
           1       0.67      0.47      0.55      1496

    accuracy                           0.80      5634
   macro avg       0.75      0.69      0.71      5634
weighted avg       0.79      0.80      0.79      5634

Confusion Matrix:
 [[3789  349]
 [ 790  706]]


# Model Comparison

Finally, I compare the performance of all models side by side.
This shows which algorithm is most suitable for churn prediction
on this dataset. Ensemble methods such as Random Forest or Gradient Boosting (the latter is not tested here)
often perform better on tabular data, but simpler models are easier to interpret.


In [257]:
models = ['Logistic Regression', 'KNN', 'SVC', 'Naive Bayes', 'Decision Tree', 'Random Forest']

# Accuracies are computed from the predictions returned by the evaluation helpers.
train_preds = [log_reg_train_pred, knn_train_pred, svc_train_pred,
               nb_train_pred, dt_train_pred, rf_train_pred]
test_preds = [log_reg_test_pred, knn_test_pred, svc_test_pred,
              nb_test_pred, dt_test_pred, rf_test_pred]

train_accuracies = [accuracy_score(y_train, pred) for pred in train_preds]
test_accuracies = [accuracy_score(y_test, pred) for pred in test_preds]

accuracy_df = pd.DataFrame({
    'Model': models,
    'Training Accuracy': train_accuracies,
    'Testing Accuracy': test_accuracies
})

print("Accuracy DataFrame created.")
accuracy_df

Accuracy DataFrame created.


Unnamed: 0,Model,Training Accuracy,Testing Accuracy
0,Logistic Regression,0.817601,0.7989
1,KNN,0.804826,0.770323
2,SVC,0.822569,0.789315
3,Naive Bayes,0.76863,0.754881
4,Decision Tree,0.823279,0.767838
5,Random Forest,0.838893,0.797835
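
To rank the models by how much accuracy they lose between training and testing, the table can be extended with a gap column. The sketch below copies the accuracies from the table above; the column name `Gap` is introduced here for illustration.

```python
import pandas as pd

# Accuracies copied from the comparison table above.
acc = pd.DataFrame({
    "Model": ["Logistic Regression", "KNN", "SVC", "Naive Bayes",
              "Decision Tree", "Random Forest"],
    "Training Accuracy": [0.817601, 0.804826, 0.822569, 0.768630, 0.823279, 0.838893],
    "Testing Accuracy": [0.798900, 0.770323, 0.789315, 0.754881, 0.767838, 0.797835],
})

# A larger gap means more accuracy lost when moving from training to testing data.
acc["Gap"] = acc["Training Accuracy"] - acc["Testing Accuracy"]
by_gap = acc.sort_values("Gap").reset_index(drop=True)
print(by_gap[["Model", "Gap"]].round(4))
```

Sorted this way, Naive Bayes has the smallest gap and Decision Tree the largest.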


## Summary:

### Data Analysis Key Findings

*   **Overall Performance**: Random Forest achieved the highest training accuracy (0.8389) and the second-highest testing accuracy (0.7978). Logistic Regression achieved the highest testing accuracy (0.7989) with a training accuracy of 0.8176.
*   **Overfitting Tendencies**: Decision Tree (training: 0.8233, testing: 0.7678) and Random Forest (training: 0.8389, testing: 0.7978) showed the largest drops in accuracy from training to testing data (about 0.055 and 0.041), suggesting some degree of overfitting. KNN also dropped from 0.8048 to 0.7703.
*   **Consistency**: Logistic Regression was consistent between training (0.8176) and testing (0.7989), a gap of about 0.019. SVC had a larger gap of about 0.033 (training: 0.8226, testing: 0.7893). Naive Bayes had the smallest gap (about 0.014), but at a lower accuracy level.
*   **Lowest Performance**: Naive Bayes had the lowest accuracy on both training (0.7686) and testing (0.7549) data, although it had the highest churn-class recall on the testing data (0.72, versus 0.46 to 0.54 for the other models).

### Insights or Next Steps

*   **Model Selection**: Logistic Regression and Random Forest appear to be the most promising models for this dataset. Logistic Regression has the highest testing accuracy and a small train/test gap; Random Forest has the highest training accuracy and nearly the same testing accuracy, but a larger gap.
*   **Improvement Strategies**: Further hyperparameter tuning and potential feature engineering should be considered, especially for Random Forest and Decision Tree, to mitigate overfitting and potentially improve testing accuracy across all models.
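
As one possible starting point for the tuning step, the sketch below runs a small cross-validated grid search for a Random Forest. The data is synthetic and imbalanced, the grid values are hypothetical, and scoring by F1 instead of accuracy is an assumption made here because churners are the minority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.74, 0.26], random_state=0
)

param_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 10]}   # hypothetical grid
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=42),
    param_grid,
    scoring="f1",   # F1 on the positive class rather than plain accuracy
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```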
