## **CUSTOMER CHURN PREDICTION MODEL** 

## **Step 1: Cleaning the data**

The first thing I did after looking at the dataset was clean it. Before anything else, I noticed that the target variable 'Customer Status' had three unique values: 'Churned', 'Stayed' and 'Joined'. A binary logistic regression needs a target variable with only two values. Since the main aim of the analysis was to find out whether a customer is still with the company or not, I changed the value 'Joined' to 'Stayed'. The target column now has only two values, which tell us whether the customer is currently with the company or has churned.
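Because the labels stay as strings, scikit-learn orders the classes alphabetically, so 'Stayed' ends up as the second class. The sketch below repeats the relabeling on a few made-up rows and then adds a 0/1 column that makes churn the positive class explicitly. The column name `churned` and the rows are assumptions for illustration; the notebook itself keeps the string labels.

```python
import pandas as pd

# Illustrative rows only; the real data comes from telecom.csv.
toy = pd.DataFrame(
    {"Customer Status": ["Stayed", "Joined", "Churned", "Joined", "Stayed"]}
)

# Fold 'Joined' into 'Stayed' so the target has exactly two classes.
toy["Customer Status"] = toy["Customer Status"].replace({"Joined": "Stayed"})

# Hypothetical extra column: 1 = churned, 0 = still a customer.
toy["churned"] = (toy["Customer Status"] == "Churned").astype(int)

print(toy)
```

With a target like this, column 1 of `predict_proba` would be the churn probability rather than the probability of staying.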
The second step was to check for null values. Many columns had null values that needed handling. I usually replace nulls in numerical columns with the median and nulls in categorical columns with the mode, but this time I simply deleted the rows that contained null values. This reduced the dataset from 7,043 rows to 4,835.
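The two null-handling approaches mentioned above can be compared side by side on a small example. This is a minimal sketch: the column names come from the dataset, but the four rows are made up.

```python
import numpy as np
import pandas as pd

# Illustrative rows; column names are from the dataset, values are made up.
toy = pd.DataFrame(
    {
        "Avg Monthly GB Download": [10.0, np.nan, 30.0, 50.0],
        "Internet Type": ["Fiber Optic", "Cable", np.nan, "Fiber Optic"],
    }
)

# Approach used in this analysis: drop every row that has any null.
dropped = toy.dropna(axis=0)

# Alternative: median for numeric columns, mode for categorical columns.
imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(len(dropped), len(imputed))
```

Dropping keeps only the fully populated rows, while imputing keeps every row at the cost of filling in values that were never observed.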
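The third step was to drop three columns: 'Churn Category', 'Churn Reason' and 'Offer'. I don't think they contribute to what we want to predict. In the notebook this drop runs before the null rows are deleted; 'Offer' alone is null in 3,877 of the 7,043 rows, so leaving it in would have removed more than half of the data.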
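A related point about columns that may not help: the notebook leaves 'Customer ID' in the feature matrix when it calls `pd.get_dummies`, so every distinct ID becomes a dummy column. The sketch below shows the effect on a few made-up rows and what dropping the identifier first would look like. It is an illustration under those assumptions, not a step the notebook performs.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Customer ID": ["A-001", "B-002", "C-003", "D-004"],  # made-up IDs
        "Gender": ["Female", "Male", "Male", "Female"],
        "Age": [37, 46, 50, 78],
    }
)

# Encoding with the identifier present adds one dummy per ID (minus the first).
with_id = pd.get_dummies(toy, drop_first=True)

# Dropping the identifier first leaves only the real features.
without_id = pd.get_dummies(toy.drop(columns=["Customer ID"]), drop_first=True)

print(with_id.shape[1], without_id.shape[1])
```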

## **Step 2: Start of the Logistic Regression Model**

The first step was to define the parameters of the model. The target variable was 'Customer Status'. I used StandardScaler to put all the variables on the same scale, so that no variable dominates the model just because of its units. I used 20% of the data as the test set. The model got 75% accuracy with a ROC-AUC score of 0.823.
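The same steps (stratified 80/20 split, scaling, balanced logistic regression) can be sketched as a single scikit-learn pipeline. The data here is synthetic, generated with `make_classification` as a stand-in for the churn table, so the printed scores say nothing about the real dataset. Using `make_pipeline` is a design choice of this sketch; the notebook scales the data in a separate step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data: roughly one third positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.67, 0.33], random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only, then the classifier.
lr_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)
lr_pipe.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, lr_pipe.predict(X_te))
demo_auc = roc_auc_score(y_te, lr_pipe.predict_proba(X_te)[:, 1])
print(f"accuracy={demo_accuracy:.3f}  ROC-AUC={demo_auc:.3f}")
```

Wrapping the scaler and the model together also makes it harder to accidentally fit the scaler on test data.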

## **Step 3: Start of the Random Forest Model**
The starting steps were the same as in the logistic regression model, except that the random forest was trained on the unscaled features. It got 76% accuracy and a ROC-AUC score of 0.845.
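In the notebook the random forest is fit on `X_train` rather than `X_train_scaled`. The sketch below illustrates why that is generally fine: trees split on thresholds, so standardizing the features should leave the result essentially unchanged. The data is synthetic and the helper `fit_and_score` is a name made up for this example; the forest hyperparameters are the ones used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.67, 0.33], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def fit_and_score(train_features, test_features):
    """Fit a forest with the notebook's hyperparameters and return test ROC-AUC."""
    forest = RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        min_samples_split=10,
        class_weight="balanced",
        random_state=42,
    )
    forest.fit(train_features, y_tr)
    return roc_auc_score(y_te, forest.predict_proba(test_features)[:, 1])


scaler = StandardScaler().fit(X_tr)
auc_raw = fit_and_score(X_tr, X_te)
auc_scaled = fit_and_score(scaler.transform(X_tr), scaler.transform(X_te))
print(f"raw={auc_raw:.3f}  scaled={auc_scaled:.3f}")
```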

## **Possible reasons for low accuracy across both models**
Given the nature of churn prediction, the modest accuracy of both the logistic regression and random forest models is somewhat expected. Churn is hard to predict because many of the factors behind a customer's decision are not captured in the data, e.g., customer satisfaction, competitors' promotions, problems with service or personal financial issues. The dataset is also imbalanced: the test set has 650 customers who stayed against 317 who churned, roughly two to one. Both models were trained with class_weight='balanced' to compensate, but the imbalance can still make it harder for the models to learn churn patterns accurately. Finally, logistic regression may underfit because it models a linear relationship between the features and the log-odds of the outcome, and random forest may still struggle with noisy or overlapping churn behaviors.
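To put the accuracy figures in context, it helps to compute what a model that always predicts 'Stayed' would score. The sketch below uses the support counts from the classification reports in this notebook. The class weights at the end are computed on those test-set counts purely for illustration, since scikit-learn derives the actual 'balanced' weights from the training labels.

```python
# Support counts copied from the classification reports in this notebook.
churned_support = 317
stayed_support = 650
total = churned_support + stayed_support

# Accuracy of a model that always predicts the majority class ('Stayed').
majority_baseline = stayed_support / total

# 'balanced' class weights follow n_samples / (n_classes * class_count).
# Test-set counts are used here only as an illustration of the formula.
weight_churned = total / (2 * churned_support)
weight_stayed = total / (2 * stayed_support)

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"illustrative weights: churned={weight_churned:.3f}, stayed={weight_stayed:.3f}")
```

The baseline comes out at about 0.672, so the 75% and 76% accuracies are an improvement of under ten percentage points over always guessing 'Stayed'. This is one reason ROC-AUC and the per-class recall are more informative here than accuracy alone.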

## **Step 4: Parameter search with GridSearchCV**

In [22]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    make_scorer,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
df=pd.read_csv('telecom.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Gender                             7043 non-null   object 
 2   Age                                7043 non-null   int64  
 3   Married                            7043 non-null   object 
 4   Number of Dependents               7043 non-null   int64  
 5   City                               7043 non-null   object 
 6   Zip Code                           7043 non-null   int64  
 7   Latitude                           7043 non-null   float64
 8   Longitude                          7043 non-null   float64
 9   Number of Referrals                7043 non-null   int64  
 10  Tenure in Months                   7043 non-null   int64  
 11  Offer                              3166 non-null   object 

In [5]:
df['Customer Status'] = df['Customer Status'].replace({
    'Joined': 'Stayed'
})

In [6]:
df.isnull().sum()

Customer ID                             0
Gender                                  0
Age                                     0
Married                                 0
Number of Dependents                    0
City                                    0
Zip Code                                0
Latitude                                0
Longitude                               0
Number of Referrals                     0
Tenure in Months                        0
Offer                                3877
Phone Service                           0
Avg Monthly Long Distance Charges     682
Multiple Lines                        682
Internet Service                        0
Internet Type                        1526
Avg Monthly GB Download              1526
Online Security                      1526
Online Backup                        1526
Device Protection Plan               1526
Premium Tech Support                 1526
Streaming TV                         1526
Streaming Movies                  

In [7]:
df=df.drop(columns=['Churn Category','Churn Reason','Offer'])

In [8]:
df = df.dropna(axis=0)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4835 entries, 0 to 7041
Data columns (total 35 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        4835 non-null   object 
 1   Gender                             4835 non-null   object 
 2   Age                                4835 non-null   int64  
 3   Married                            4835 non-null   object 
 4   Number of Dependents               4835 non-null   int64  
 5   City                               4835 non-null   object 
 6   Zip Code                           4835 non-null   int64  
 7   Latitude                           4835 non-null   float64
 8   Longitude                          4835 non-null   float64
 9   Number of Referrals                4835 non-null   int64  
 10  Tenure in Months                   4835 non-null   int64  
 11  Phone Service                      4835 non-null   object 
 1

In [10]:
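print('LOGISTIC REGRESSION MODEL')
# Separate the features from the target.
X = df.drop(columns=['Customer Status'])
y = df['Customer Status']

# One-hot encode every categorical column. 'Customer ID' and 'City' are still
# in X at this point, so they are expanded into one dummy column per distinct
# value (minus the first).
X = pd.get_dummies(X, drop_first=True)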

LOGISTIC REGRESSION MODEL


In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [12]:
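scaler = StandardScaler()

# Fit the scaler on the training split only, then apply it to the test split.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# class_weight='balanced' weights each class inversely to its frequency.
log_reg = LogisticRegression(
    max_iter=1000,
    class_weight='balanced',
    random_state=42
)

log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)
# Classes are sorted alphabetically, so column 1 is the probability of 'Stayed'.
y_prob_lr = log_reg.predict_proba(X_test_scaled)[:, 1]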

print("Logistic Regression Results")
print(classification_report(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_lr))

Logistic Regression Results
              precision    recall  f1-score   support

     Churned       0.64      0.51      0.57       317
      Stayed       0.78      0.86      0.82       650

    accuracy                           0.75       967
   macro avg       0.71      0.69      0.70       967
weighted avg       0.74      0.75      0.74       967

ROC-AUC: 0.8228051443824315


In [13]:
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=10,
    class_weight='balanced',
    random_state=42
)

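# Tree-based models do not need scaled features, so the raw X_train is used.
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
# Classes are sorted alphabetically, so column 1 is the probability of 'Stayed'.
y_prob_rf = rf.predict_proba(X_test)[:, 1]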

print("Random Forest Results")
print(classification_report(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_rf))


Random Forest Results
              precision    recall  f1-score   support

     Churned       0.61      0.76      0.68       317
      Stayed       0.87      0.77      0.81       650

    accuracy                           0.76       967
   macro avg       0.74      0.76      0.75       967
weighted avg       0.78      0.76      0.77       967

ROC-AUC: 0.8453433632613444


In [20]:
skf = StratifiedKFold(
    n_splits=10,
    shuffle=True,
    random_state=42
)

accuracy_scores = []
auc_scores = []

In [33]:
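# Tune only the number of trees; the other settings come from rf defined above.
param_grid = {
    'n_estimators': [100, 200, 300, 500]
}
# Each candidate is scored by mean ROC-AUC over the 10 stratified folds.
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=skf,
    n_jobs=-1,
    verbose=1
)
# The search only sees the training split; the test split stays held out.
grid_search.fit(X_train, y_train)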

print("Best n_estimators:", grid_search.best_params_)
print("Best CV ROC-AUC:", grid_search.best_score_)


Fitting 10 folds for each of 4 candidates, totalling 40 fits
Best n_estimators: {'n_estimators': 500}
Best CV ROC-AUC: 0.8760555927780338
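The cross-validated score above is computed on folds of the training split, so it is not directly comparable to the test-set ROC-AUC reported earlier. A possible follow-up, sketched here on synthetic data, is to score the refit best estimator on the held-out test split. The smaller grid and 5 folds are assumptions chosen to keep the example quick; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.67, 0.33], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    estimator=RandomForestClassifier(
        max_depth=10,
        min_samples_split=10,
        class_weight="balanced",
        random_state=42,
    ),
    param_grid={"n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_tr, y_tr)

# refit=True (the default) retrains the best configuration on all of X_tr,
# so best_estimator_ can be scored directly on the untouched test split.
test_auc = roc_auc_score(y_te, search.best_estimator_.predict_proba(X_te)[:, 1])
print(search.best_params_, round(search.best_score_, 3), round(test_auc, 3))
```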
