## Lab | Handling Data Imbalance in Classification Models

#### Import the required libraries and modules that you would need.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

from sklearn.metrics import accuracy_score,f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

#### Read the data into Python and call the dataframe churnData (this notebook uses the shorter name df)

In [2]:
df = pd.read_csv("Customer-Churn.csv")
df.head(3)

,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes


#### Check the datatypes of all the columns in the data. You would see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [4]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"],errors="coerce") #converting into float
df.info() #sanity check

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory
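As a small illustration of what `errors="coerce"` does, the sketch below converts a made-up object column. The values are hypothetical, and the blank entry is only an assumption about what an unparseable value might look like; the notebook does not show which raw entries became NaN.

```python
import pandas as pd

# Hypothetical toy column: numeric strings plus one blank entry,
# stored as object dtype like TotalCharges before the conversion.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

This would explain why the non-null count can drop after the conversion: values that are not valid numbers are kept as missing rather than stopping the cell with an error.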

#### Check for null values in the dataframe. Replace the null values.

In [5]:
df.isna().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [6]:
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mean()) #replace NaNs by the column mean (assigning back instead of inplace on a column selection)

In [7]:
df.isna().sum() #sanity check, no nans anymore

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
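A minimal sketch of mean imputation on made-up numbers, to show what the replacement value looks like. The series is hypothetical; the point is only that `mean()` skips NaN by default, so the fill value is the average of the observed entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing value.
s = pd.Series([10.0, np.nan, 30.0])

# mean() ignores NaN, so the fill value is (10 + 30) / 2 = 20.
filled = s.fillna(s.mean())

print(filled.tolist())  # [10.0, 20.0, 30.0]
```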

### Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:

In [8]:
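features = ['SeniorCitizen','MonthlyCharges', 'TotalCharges'] # note: 'tenure' is named in the task above but is not included here; all results below use these three features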

In [9]:
X = df.loc[:,features] #separating independent variables

In [10]:
y = df.loc[:,"Churn"] # target variable

- Split the data into a training set and a test set.

In [11]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.20, random_state = 0) #train test split the data
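The split above relies on random shuffling to keep the class ratio similar in both parts. As an optional variation (not what the cell above does), `train_test_split` also accepts a `stratify` argument that enforces the ratio. The sketch below uses hypothetical toy data with 25% positive labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25 "Yes" and 75 "No".
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["Yes"] * 25 + ["No"] * 75)

# stratify=y_toy asks the split to keep the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print((y_tr == "Yes").mean())  # 0.25
print((y_te == "Yes").mean())  # 0.25
```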

- Scale the features either by using normalizer or a standard scaler.

- Fit a logistic regression model on the training data.

In [12]:
def model_inplace(scaler, model, X_train, X_test, y_train, y_test):
    """Scale the features, fit the model and print train/test classification reports."""
    # fit the scaler on the training data only, then apply the same transformation to the test data
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("classification report on train set:")
    print(classification_report(y_train, pred_train))
    print("-------------------------------------------------------------------------------")
    print("classification report on test set:")
    print(classification_report(y_test, pred_test))
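The helper calls `fit_transform` on the training set and only `transform` on the test set. A minimal sketch with made-up numbers shows what that means: the mean and standard deviation are learned from the training rows and then reused unchanged for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature: two training rows and two test rows.
X_tr_toy = np.array([[0.0], [10.0]])
X_te_toy = np.array([[5.0], [20.0]])

# The scaler learns mean = 5 and standard deviation = 5 from the training rows.
scaler = StandardScaler().fit(X_tr_toy)

# The test rows are scaled with the training statistics: (x - 5) / 5.
X_te_scaled = scaler.transform(X_te_toy)

print(X_te_scaled.ravel())
```

Scaled test values can therefore fall outside the range seen in training (here the value 20 maps to 3), which is expected and does not indicate a mistake.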

- Check the accuracy on the test data.

In [13]:
model_inplace(StandardScaler(),LogisticRegression(),X_train, X_test, y_train, y_test)

classification report on train set:
              precision    recall  f1-score   support

          No       0.82      0.91      0.86      4133
         Yes       0.65      0.46      0.54      1501

    accuracy                           0.79      5634
   macro avg       0.74      0.69      0.70      5634
weighted avg       0.78      0.79      0.78      5634

-------------------------------------------------------------------------------
classification report on test set:
              precision    recall  f1-score   support

          No       0.82      0.90      0.86      1041
         Yes       0.61      0.45      0.52       368

    accuracy                           0.78      1409
   macro avg       0.72      0.68      0.69      1409
weighted avg       0.77      0.78      0.77      1409
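To make the columns of the classification report easier to read, here is a small worked example that computes precision, recall, F1 and accuracy by hand. The labels are hypothetical toy data, not the churn data, and "Yes" is treated as the positive class.

```python
# Hypothetical toy labels (not the churn data); "Yes" is the positive class.
y_true = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "Yes" and p == "Yes" for t, p in pairs)  # churners correctly flagged
fp = sum(t == "No" and p == "Yes" for t, p in pairs)   # non-churners wrongly flagged
fn = sum(t == "Yes" and p == "No" for t, p in pairs)   # churners missed

precision = tp / (tp + fp)  # share of predicted "Yes" that are correct
recall = tp / (tp + fn)     # share of true "Yes" that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In this toy case accuracy is 0.70 although only half of the "Yes" rows are found, which is the same pattern as in the report above: accuracy alone can hide weak recall on the minority class.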



### Check for the imbalance.


In [14]:
y.value_counts()/len(df)

No     0.73463
Yes    0.26537
Name: Churn, dtype: float64

In [15]:
y_test.value_counts()/len(y_test)

No     0.738822
Yes    0.261178
Name: Churn, dtype: float64

In [16]:
y_train.value_counts()/len(y_train)

No     0.733582
Yes    0.266418
Name: Churn, dtype: float64

The classes are imbalanced (about 73% No vs. 27% Yes), and the train/test split preserves roughly the same ratio.
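The class shares can also be obtained directly with `value_counts(normalize=True)`, which is equivalent to dividing the counts by the length. A short sketch on hypothetical toy labels, including a simple majority-to-minority ratio:

```python
import pandas as pd

# Hypothetical toy labels: three "No" for every "Yes".
labels = pd.Series(["No", "No", "No", "Yes"])

# normalize=True returns shares instead of raw counts.
shares = labels.value_counts(normalize=True)
print(shares["No"], shares["Yes"])  # 0.75 0.25

# Ratio of the largest to the smallest class share.
imbalance_ratio = shares.max() / shares.min()
print(imbalance_ratio)  # 3.0
```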

### Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.

#### 1. SMOTE

In [17]:
sm = SMOTE(k_neighbors=3)

X_train_SMOTE, y_train_SMOTE = sm.fit_resample(X_train, y_train)

In [18]:
y_train_SMOTE.value_counts()/len(y_train_SMOTE)

No     0.5
Yes    0.5
Name: Churn, dtype: float64
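To give an idea of how SMOTE creates new minority rows, here is a simplified sketch of the interpolation step. It is not the imblearn implementation: the function name `smote_like_sample` and the toy points are hypothetical, and it assumes all rows are distinct.

```python
import numpy as np

def smote_like_sample(X_min, k=3, rng=None):
    """Return one synthetic point between a random minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_min))                      # pick a random minority row
    dists = np.linalg.norm(X_min - X_min[i], axis=1)  # Euclidean distance to every row
    neighbours = np.argsort(dists)[1:k + 1]           # skip position 0, the row itself
    j = rng.choice(neighbours)                        # pick one of the k neighbours
    gap = rng.random()                                # random position in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])     # point on the segment from i towards j

# Hypothetical minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_like_sample(X_min, k=3, rng=0)
print(new_point)
```

Because the neighbours are found by distance, the scale of the features matters. In the cell above SMOTE is applied to the unscaled training data, so the feature with the largest range is likely to dominate the neighbour search.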

In [19]:
model_inplace(StandardScaler(),LogisticRegression(),X_train_SMOTE, X_test, y_train_SMOTE, y_test)

classification report on train set:
              precision    recall  f1-score   support

          No       0.72      0.73      0.73      4133
         Yes       0.73      0.72      0.72      4133

    accuracy                           0.73      8266
   macro avg       0.73      0.73      0.73      8266
weighted avg       0.73      0.73      0.73      8266

-------------------------------------------------------------------------------
classification report on test set:
              precision    recall  f1-score   support

          No       0.86      0.72      0.78      1041
         Yes       0.45      0.66      0.54       368

    accuracy                           0.70      1409
   macro avg       0.66      0.69      0.66      1409
weighted avg       0.75      0.70      0.72      1409



Test accuracy dropped from 0.78 to 0.70, but recall for the Yes class rose from 0.45 to 0.66 and its F1-score from 0.52 to 0.54. We continue with TomekLinks.

#### 2. TomekLinks

In [20]:
tm = TomekLinks()
X_train_tm,y_train_tm = tm.fit_resample(X_train,y_train)

In [21]:
y_train_tm.value_counts()/len(y_train_tm)

No     0.710678
Yes    0.289322
Name: Churn, dtype: float64
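A Tomek link is a pair of rows from different classes that are each other's nearest neighbour. The sketch below finds such pairs on hypothetical one-dimensional toy data; the function name `tomek_link_pairs` is made up for illustration and this is not the imblearn implementation.

```python
import numpy as np

def tomek_link_pairs(X, y):
    """Return index pairs (i, j) with i < j that are mutual nearest neighbours with different labels."""
    # Pairwise Euclidean distances; the diagonal is set to infinity
    # so that a row is never its own nearest neighbour.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # nearest neighbour of every row
    return [
        (i, int(j))
        for i, j in enumerate(nn)
        if nn[j] == i and y[i] != y[j] and i < j
    ]

# Hypothetical toy data on a line.
X_toy = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y_toy = np.array(["No", "No", "Yes", "Yes", "Yes"])

print(tomek_link_pairs(X_toy, y_toy))  # [(1, 2)]
```

Rows 3 and 4 are also mutual nearest neighbours, but they share a label and so do not form a link. In the output above only the No class shrinks (the Yes count stays at 1501), which is consistent with removing just the majority member of each link.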

In [22]:
model_inplace(StandardScaler(),LogisticRegression(),X_train_tm, X_test, y_train_tm, y_test)

classification report on train set:
              precision    recall  f1-score   support

          No       0.82      0.90      0.86      3687
         Yes       0.69      0.52      0.59      1501

    accuracy                           0.79      5188
   macro avg       0.75      0.71      0.73      5188
weighted avg       0.78      0.79      0.78      5188

-------------------------------------------------------------------------------
classification report on test set:
              precision    recall  f1-score   support

          No       0.83      0.85      0.84      1041
         Yes       0.54      0.50      0.52       368

    accuracy                           0.76      1409
   macro avg       0.69      0.68      0.68      1409
weighted avg       0.75      0.76      0.76      1409



#### 3. RandomUnderSampler

In [23]:
under = RandomUnderSampler()
X_train_under,y_train_under = under.fit_resample(X_train,y_train)

In [24]:
y_train_under.value_counts()/len(y_train_under)

No     0.5
Yes    0.5
Name: Churn, dtype: float64
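Random undersampling can be sketched in a few lines of NumPy: every class is cut down to the size of the smallest class by drawing rows without replacement. The function name `random_undersample` and the toy labels are hypothetical; this is an illustration of the idea, not the imblearn code.

```python
import numpy as np

def random_undersample(y, rng=None):
    """Return sorted row indices that keep every class at the size of the smallest class."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)  # no row is used twice
        for c in classes
    ]
    return np.sort(np.concatenate(keep))

# Hypothetical toy labels: 6 "No" and 2 "Yes".
y_toy = np.array(["No"] * 6 + ["Yes"] * 2)
idx = random_undersample(y_toy, rng=0)

print(len(idx))                    # 4
print((y_toy[idx] == "No").sum())  # 2
```

The price of the balance is that majority rows are thrown away, which matches the smaller training support (3002 rows) in the report for this method.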

In [25]:
model_inplace(StandardScaler(),LogisticRegression(),X_train_under, X_test, y_train_under, y_test)

classification report on train set:
              precision    recall  f1-score   support

          No       0.72      0.73      0.73      1501
         Yes       0.73      0.72      0.72      1501

    accuracy                           0.73      3002
   macro avg       0.73      0.73      0.73      3002
weighted avg       0.73      0.73      0.73      3002

-------------------------------------------------------------------------------
classification report on test set:
              precision    recall  f1-score   support

          No       0.86      0.73      0.79      1041
         Yes       0.47      0.68      0.55       368

    accuracy                           0.72      1409
   macro avg       0.67      0.70      0.67      1409
weighted avg       0.76      0.72      0.73      1409



#### 4. RandomOverSampler

In [26]:
over = RandomOverSampler()
X_train_over,y_train_over = over.fit_resample(X_train,y_train)

In [27]:
y_train_over.value_counts()/len(y_train_over)

No     0.5
Yes    0.5
Name: Churn, dtype: float64
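Random oversampling works the other way round: minority rows are drawn with replacement until both classes have the same size, so the added rows are exact duplicates. A minimal sketch on hypothetical toy labels (not the imblearn code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy labels: 6 "No" and 2 "Yes".
y_toy = np.array(["No"] * 6 + ["Yes"] * 2)

minority_idx = np.flatnonzero(y_toy == "Yes")          # positions of the minority rows
n_extra = (y_toy == "No").sum() - len(minority_idx)    # rows needed to reach balance

# Draw the extra rows with replacement from the minority class.
extra = rng.choice(minority_idx, size=n_extra, replace=True)
idx = np.concatenate([np.arange(len(y_toy)), extra])
y_over = y_toy[idx]

print((y_over == "No").sum(), (y_over == "Yes").sum())  # 6 6
```

No information is lost, but no new information is added either, since every extra row repeats an existing one.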

In [28]:
model_inplace(StandardScaler(),LogisticRegression(),X_train_over, X_test, y_train_over, y_test)

classification report on train set:
              precision    recall  f1-score   support

          No       0.73      0.73      0.73      4133
         Yes       0.73      0.72      0.73      4133

    accuracy                           0.73      8266
   macro avg       0.73      0.73      0.73      8266
weighted avg       0.73      0.73      0.73      8266

-------------------------------------------------------------------------------
classification report on test set:
              precision    recall  f1-score   support

          No       0.87      0.73      0.79      1041
         Yes       0.47      0.69      0.56       368

    accuracy                           0.72      1409
   macro avg       0.67      0.71      0.68      1409
weighted avg       0.76      0.72      0.73      1409



### Each time, fit the model and check its accuracy.

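All in all, we can conclude that the baseline model worked best in terms of test accuracy (0.78). Resampling did not improve accuracy; among the techniques used, TomekLinks came closest (0.76). SMOTE, RandomUnderSampler and RandomOverSampler lowered accuracy to 0.70-0.72 but raised recall for the Yes class from 0.45 to 0.66-0.69, while its F1-score changed only slightly (0.52 to 0.54-0.56).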
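For a side-by-side view, the test-set figures from the classification reports above can be collected in one place. The numbers are copied from the reports; the dictionary layout and the key names are just one possible way to organise them.

```python
# Test-set figures copied from the classification reports above.
results = {
    "baseline":           {"accuracy": 0.78, "recall_yes": 0.45, "f1_yes": 0.52},
    "SMOTE":              {"accuracy": 0.70, "recall_yes": 0.66, "f1_yes": 0.54},
    "TomekLinks":         {"accuracy": 0.76, "recall_yes": 0.50, "f1_yes": 0.52},
    "RandomUnderSampler": {"accuracy": 0.72, "recall_yes": 0.68, "f1_yes": 0.55},
    "RandomOverSampler":  {"accuracy": 0.72, "recall_yes": 0.69, "f1_yes": 0.56},
}

# For every metric, find the method with the highest value.
best_by_metric = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in ("accuracy", "recall_yes", "f1_yes")
}

for metric, name in best_by_metric.items():
    print(f"{metric}: {name} ({results[name][metric]:.2f})")
```

Which method counts as best therefore depends on the metric: the baseline wins on accuracy, while RandomOverSampler has the highest recall and F1-score for the Yes class, by a small margin.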