## Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset
You are working as a data scientist for a telecom company. You have encountered a dataset that is highly imbalanced, and you want to correct the class imbalance before fitting the classifier to analyze the churn. You know different methods for correcting the imbalance in datasets and you want to compare them to find the best method before fitting the model.

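In this activity, you will implement the three balancing methods covered so far (random undersampling, SMOTE, and MSMOTE), fit a logistic regression model on each balanced training set, and compare the results against a benchmark model trained on the imbalanced data.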

Solution set | https://github.com/PacktWorkshops/The-Data-Science-Workshop/blob/master/Activity_Solutions/B15019_Solution_Final.pdf

In [1]:
!pip install smote-variants



In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import smote_variants as sv
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [3]:
file_url = ('https://raw.githubusercontent.com/PacktWorkshops/'
            'The-Data-Science-Workshop/master/Chapter13/Dataset/churn.csv')

In [4]:
df = pd.read_csv(file_url)

In [5]:
print(df.shape)
df.head()

(5000, 18)


Unnamed: 0,churn,accountlength,internationalplan,voicemailplan,numbervmailmessages,totaldayminutes,totaldaycalls,totaldaycharge,totaleveminutes,totalevecalls,totalevecharge,totalnightminutes,totalnightcalls,totalnightcharge,totalintlminutes,totalintlcalls,totalintlcharge,numbercustomerservicecalls
0,No,128,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,No,107,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1
2,No,137,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0
3,No,84,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,No,75,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   churn                       5000 non-null   object 
 1   accountlength               5000 non-null   int64  
 2   internationalplan           5000 non-null   object 
 3   voicemailplan               5000 non-null   object 
 4   numbervmailmessages         5000 non-null   int64  
 5   totaldayminutes             5000 non-null   float64
 6   totaldaycalls               5000 non-null   int64  
 7   totaldaycharge              5000 non-null   float64
 8   totaleveminutes             5000 non-null   float64
 9   totalevecalls               5000 non-null   int64  
 10  totalevecharge              5000 non-null   float64
 11  totalnightminutes           5000 non-null   float64
 12  totalnightcalls             5000 non-null   int64  
 13  totalnightcharge            5000 

In [7]:
# normalize numerical data using MinMaxScaler
minmaxscaler = MinMaxScaler()
df['a_length_scaled'] = minmaxscaler.fit_transform(df['accountlength'].values.reshape(-1,1)) # reshape(-1, 1): a single column, with the number of rows inferred
df['num_voicemails_scaled'] = minmaxscaler.fit_transform(df['numbervmailmessages'].values.reshape(-1,1))
df['tot_day_mins_scaled'] = minmaxscaler.fit_transform(df['totaldayminutes'].values.reshape(-1,1))
df['tot_day_calls_scaled'] = minmaxscaler.fit_transform(df['totaldaycalls'].values.reshape(-1,1))
df['tot_day_charge_scaled'] = minmaxscaler.fit_transform(df['totaldaycharge'].values.reshape(-1,1))
df['tot_eve_mins_scaled'] = minmaxscaler.fit_transform(df['totaleveminutes'].values.reshape(-1,1))
df['tot_eve_calls_scaled'] = minmaxscaler.fit_transform(df['totalevecalls'].values.reshape(-1,1))
df['tot_eve_charge_scaled'] = minmaxscaler.fit_transform(df['totalevecharge'].values.reshape(-1,1))
df['tot_night_mins_scaled'] = minmaxscaler.fit_transform(df['totalnightminutes'].values.reshape(-1,1))
df['tot_night_calls_scaled'] = minmaxscaler.fit_transform(df['totalnightcalls'].values.reshape(-1,1))
df['tot_night_charge_scaled'] = minmaxscaler.fit_transform(df['totalnightcharge'].values.reshape(-1,1))
df['tot_intl_mins_scaled'] = minmaxscaler.fit_transform(df['totalintlminutes'].values.reshape(-1,1))
df['tot_intl_calls_scaled'] = minmaxscaler.fit_transform(df['totalintlcalls'].values.reshape(-1,1))
df['tot_intl_charge_scaled'] = minmaxscaler.fit_transform(df['totalintlcharge'].values.reshape(-1,1))
df['num_custser_calls_scaled'] = minmaxscaler.fit_transform(df['numbercustomerservicecalls'].values.reshape(-1,1))
print(df.shape)

(5000, 33)
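The fifteen near-identical lines above can be collapsed, because `MinMaxScaler` accepts a 2-D block of columns and scales each column independently. The sketch below uses a small made-up frame rather than the churn data, and the `_scaled` suffix is an illustrative naming choice, not the activity's exact column names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the churn data (values are made up for illustration)
toy = pd.DataFrame({
    'accountlength': [128, 107, 137, 84],
    'totaldaycalls': [110, 123, 114, 71],
    'churn': ['No', 'No', 'Yes', 'No'],
})

num_cols = ['accountlength', 'totaldaycalls']

# One call scales every numeric column; each column uses its own min and max
scaled = MinMaxScaler().fit_transform(toy[num_cols])

# Wrap the result in a frame with new column names (the suffix is illustrative)
scaled_cols = [c + '_scaled' for c in num_cols]
scaled_df = pd.DataFrame(scaled, columns=scaled_cols, index=toy.index)

# Swap the original numeric columns for their scaled versions
toy = pd.concat([toy.drop(columns=num_cols), scaled_df], axis=1)
print(toy)
```

Fitting the scaler on the training split only and then calling `transform` on the test split is a common variation that keeps test-set statistics out of preprocessing.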


In [8]:
# Drop original numerical features
df.drop(['accountlength', 'numbervmailmessages',
        'totaldayminutes', 'totaldaycalls', 
        'totaldaycharge', 'totaleveminutes', 
        'totalevecalls', 'totalevecharge', 
        'totalnightminutes', 'totalnightcalls', 
        'totalnightcharge', 'totalintlminutes', 
        'totalintlcalls', 'totalintlcharge', 
        'numbercustomerservicecalls'], axis=1, inplace=True)
print(df.shape)
df.head()

(5000, 18)


Unnamed: 0,churn,internationalplan,voicemailplan,a_length_scaled,num_voicemails_scaled,tot_day_mins_scaled,tot_day_calls_scaled,tot_day_charge_scaled,tot_eve_mins_scaled,tot_eve_calls_scaled,tot_eve_charge_scaled,tot_night_mins_scaled,tot_night_calls_scaled,tot_night_charge_scaled,tot_intl_mins_scaled,tot_intl_calls_scaled,tot_intl_charge_scaled,num_custser_calls_scaled
0,No,no,yes,0.524793,0.480769,0.754196,0.666667,0.754183,0.542755,0.582353,0.542866,0.619494,0.52,0.619584,0.5,0.15,0.5,0.111111
1,No,no,yes,0.438017,0.5,0.459744,0.745455,0.459672,0.537531,0.605882,0.53769,0.644051,0.588571,0.644344,0.685,0.15,0.685185,0.111111
2,No,no,no,0.561983,0.0,0.692461,0.690909,0.692436,0.333242,0.647059,0.333225,0.411646,0.594286,0.41193,0.61,0.25,0.609259,0.0
3,No,yes,no,0.342975,0.0,0.851778,0.430303,0.85174,0.170195,0.517647,0.170171,0.498481,0.508571,0.498593,0.33,0.35,0.32963,0.222222
4,No,yes,no,0.305785,0.0,0.474253,0.684848,0.47423,0.407754,0.717647,0.407959,0.473165,0.691429,0.47327,0.505,0.15,0.505556,0.333333


In [9]:
# convert categorical features into numerical values using dummy values
cat_var = pd.get_dummies(df[['internationalplan', 'voicemailplan']])
cat_var.shape

(5000, 4)
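The four columns arise because each of the two categorical features has two levels, and `pd.get_dummies` creates one indicator column per level. A minimal sketch on made-up rows shows this, along with the `drop_first=True` option, which this activity does not use but which keeps a single indicator per feature.

```python
import pandas as pd

# Toy rows with the same two categorical columns (values are illustrative)
toy = pd.DataFrame({
    'internationalplan': ['no', 'yes', 'no'],
    'voicemailplan': ['yes', 'no', 'no'],
})

# Each two-level column expands into two indicator columns: 2 x 2 = 4
full = pd.get_dummies(toy)
print(list(full.columns))

# drop_first=True drops the first level of each feature, leaving 2 columns
compact = pd.get_dummies(toy, drop_first=True)
print(list(compact.columns))
```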

In [10]:
df.columns

Index(['churn', 'internationalplan', 'voicemailplan', 'a_length_scaled',
       'num_voicemails_scaled', 'tot_day_mins_scaled', 'tot_day_calls_scaled',
       'tot_day_charge_scaled', 'tot_eve_mins_scaled', 'tot_eve_calls_scaled',
       'tot_eve_charge_scaled', 'tot_night_mins_scaled',
       'tot_night_calls_scaled', 'tot_night_charge_scaled',
       'tot_intl_mins_scaled', 'tot_intl_calls_scaled',
       'tot_intl_charge_scaled', 'num_custser_calls_scaled'],
      dtype='object')

In [11]:
# separate numerical data
num_var = df[['a_length_scaled', 'num_voicemails_scaled', 
             'tot_day_mins_scaled', 'tot_day_calls_scaled', 
             'tot_day_charge_scaled', 'tot_eve_mins_scaled', 
             'tot_eve_calls_scaled', 'tot_eve_charge_scaled', 
             'tot_night_mins_scaled', 'tot_night_calls_scaled', 
             'tot_night_charge_scaled', 'tot_intl_mins_scaled', 
             'tot_intl_calls_scaled', 'tot_intl_charge_scaled', 
             'num_custser_calls_scaled']]
num_var.shape

(5000, 15)


In [12]:
# preparing the X variable
X = pd.concat([cat_var, num_var], axis=1)
print(X.shape)

# preparing the Y variable
y = df['churn']
print(y.shape)

(5000, 19)
(5000,)


In [13]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=123)
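The split above is purely random, so the share of 'Yes' cases can differ slightly between the training and test sets. One option, not used in this activity, is the `stratify` parameter of `train_test_split`, which preserves the class proportions in both splits. The sketch below uses made-up labels with roughly the same imbalance; applying it to the churn data would change the counts reported in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with about 14% 'Yes', similar to the churn data (illustrative only)
y_toy = np.array(['Yes'] * 140 + ['No'] * 860)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=123, stratify=y_toy)

print((y_tr == 'Yes').mean(), (y_te == 'Yes').mean())
```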

In [14]:
# shape before oversampling
print('Before OverSampling count of yes: {}'.format(sum(y_train=='Yes')))
print('Before OverSampling count of no: {}'.format(sum(y_train=='No')))

Before OverSampling count of yes: 490
Before OverSampling count of no: 3010
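To put these counts in perspective, the class shares and the majority-to-minority ratio can be worked out directly. The sketch below rebuilds the training labels from the two counts printed above rather than loading the dataset.

```python
import pandas as pd

# Rebuild the training labels from the counts reported above
y_counts = pd.Series(['Yes'] * 490 + ['No'] * 3010)

# Share of each class in the training set
share = y_counts.value_counts(normalize=True)
print(share.round(2).to_dict())

# Number of 'No' rows for every 'Yes' row
ratio = (y_counts == 'No').sum() / (y_counts == 'Yes').sum()
print(round(ratio, 2))
```

Roughly 86% of the training rows are 'No', so accuracy alone can look good for a model that rarely predicts 'Yes'.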


#### Benchmark Logistic Regression Model

In [15]:
# Logistic Regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

LogisticRegression()

In [16]:
# predictions on test set
test_pred = lr_model.predict(X_test)

# accuracy values
print('Accuracy of Logistic regression model prediction on test set:'
       '{:.2f}'.format(lr_model.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set:0.86


In [17]:
# confusion matrix (stored under a new name so the imported function is not overwritten)
confusionMatrix = confusion_matrix(y_test, test_pred)
print(confusionMatrix)

# classification report
print(classification_report(y_test, test_pred))

[[1259   24]
 [ 179   38]]
              precision    recall  f1-score   support

          No       0.88      0.98      0.93      1283
         Yes       0.61      0.18      0.27       217

    accuracy                           0.86      1500
   macro avg       0.74      0.58      0.60      1500
weighted avg       0.84      0.86      0.83      1500



#### Benchmark model analysis
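The model is biased toward the majority class, 'No': recall for the 'Yes' class is only 0.18, meaning that 179 of the 217 churners in the test set were predicted as 'No'. The 0.86 accuracy mostly reflects the 1,283 'No' cases, which make up about 86% of the test set.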
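The figures in the classification report can be derived by hand from the confusion matrix, which makes the bias easier to see. The sketch below copies the benchmark matrix from the output above and assumes scikit-learn's layout: rows are true labels and columns are predicted labels, both in the order 'No', 'Yes'.

```python
import numpy as np

# Benchmark confusion matrix copied from the output above:
# rows are true labels (No, Yes), columns are predicted labels (No, Yes)
cm = np.array([[1259, 24],
               [179, 38]])

tn, fp = cm[0]
fn, tp = cm[1]

recall_yes = tp / (tp + fn)      # share of actual churners that were caught
precision_yes = tp / (tp + fp)   # share of predicted churners that really churned
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were correct

print(round(recall_yes, 2), round(precision_yes, 2), round(accuracy, 2))
```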

#### Random undersampling and Classification

In [18]:
# separate minority class and majority class to make a balanced dataset.
# concatenate training data sets into one dataset for ease of operation
trainData = pd.concat([X_train, y_train], axis=1)

In [19]:
# identify the index labels of the rows where churn is 'Yes' using the .index attribute
ind = trainData[trainData['churn']=='Yes'].index
ind

Int64Index([4162, 4640,  509,  354, 1405, 1493, 1263, 2493, 3822, 1334,
            ...
            4939, 2624, 4202, 1704,  588, 3702, 4309, 2428, 1593, 1346],
           dtype='int64', length=490)

In [20]:
minData = trainData.loc[ind]
print(minData.shape)

(490, 20)


In [21]:
# indexes of the sample dataset where churn is 'No'
ind1 = trainData[trainData['churn']=='No'].index
ind1

Int64Index([4036, 2883, 2430,  449, 4179, 4763,  749, 2525,  275, 1090,
            ...
            1363, 3481,  111,  942, 4169, 4143,   96, 4060, 3454, 3582],
           dtype='int64', length=3010)

In [22]:
majData = trainData.loc[ind1]
print(majData.shape)
majData.head()

(3010, 20)


Unnamed: 0,internationalplan_no,internationalplan_yes,voicemailplan_no,voicemailplan_yes,a_length_scaled,num_voicemails_scaled,tot_day_mins_scaled,tot_day_calls_scaled,tot_day_charge_scaled,tot_eve_mins_scaled,tot_eve_calls_scaled,tot_eve_charge_scaled,tot_night_mins_scaled,tot_night_calls_scaled,tot_night_charge_scaled,tot_intl_mins_scaled,tot_intl_calls_scaled,tot_intl_charge_scaled,num_custser_calls_scaled,churn
4036,1,0,0,1,0.256198,0.5,0.609388,0.484848,0.60927,0.695628,0.894118,0.695891,0.395949,0.622857,0.396173,0.515,0.1,0.514815,0.222222,No
2883,1,0,1,0,0.504132,0.0,0.595733,0.29697,0.595716,0.652736,0.688235,0.652863,0.60557,0.56,0.605515,0.49,0.55,0.490741,0.111111,No
2430,1,0,0,1,0.491736,0.769231,0.364438,0.6,0.364458,0.681056,0.458824,0.681009,0.50557,0.691429,0.505909,0.78,0.15,0.77963,0.0,No
449,1,0,0,1,0.322314,0.403846,0.75192,0.478788,0.751841,0.557602,0.694118,0.557748,0.438987,0.525714,0.438942,0.315,0.15,0.314815,0.444444,No
4179,1,0,1,0,0.578512,0.0,0.61394,0.478788,0.613956,0.309871,0.5,0.309932,0.561519,0.622857,0.561621,0.28,0.2,0.27963,0.222222,No


In [23]:
# Take a random sample equal to the length of the minority class from the majority class to make the dataset balanced.
majSample = majData.sample(n=len(ind), random_state=123)

In [24]:
print(majSample.shape)
majSample.head()

(490, 20)


Unnamed: 0,internationalplan_no,internationalplan_yes,voicemailplan_no,voicemailplan_yes,a_length_scaled,num_voicemails_scaled,tot_day_mins_scaled,tot_day_calls_scaled,tot_day_charge_scaled,tot_eve_mins_scaled,tot_eve_calls_scaled,tot_eve_charge_scaled,tot_night_mins_scaled,tot_night_calls_scaled,tot_night_charge_scaled,tot_intl_mins_scaled,tot_intl_calls_scaled,tot_intl_charge_scaled,num_custser_calls_scaled,churn
1807,1,0,1,0,0.450413,0.0,0.557895,0.624242,0.557898,0.549079,0.723529,0.549013,0.344051,0.405714,0.344401,0.645,0.05,0.644444,0.333333,No
4578,1,0,1,0,0.475207,0.0,0.244097,0.533333,0.244143,0.318394,0.658824,0.318344,0.495949,0.52,0.496342,0.55,0.1,0.55,0.111111,No
355,1,0,1,0,0.123967,0.0,0.472546,0.636364,0.472557,0.218037,0.547059,0.218052,0.541013,0.56,0.541362,0.635,0.1,0.635185,0.111111,No
23,1,0,1,0,0.454545,0.0,0.314083,0.624242,0.31409,0.377509,0.6,0.377548,0.48,0.6,0.480023,0.385,0.3,0.385185,0.222222,No
1541,1,0,0,1,0.194215,0.692308,0.656899,0.557576,0.656794,0.460819,0.711765,0.461016,0.683544,0.497143,0.683737,0.38,0.2,0.37963,0.333333,No


In [25]:
# concatenate the minData and majSample using pd.concat() function
# axis=0 stacks the two datasets vertically; each keeps its original index
balData = pd.concat([minData, majSample], axis=0)
print(balData.shape)
balData.head()

(980, 20)


Unnamed: 0,internationalplan_no,internationalplan_yes,voicemailplan_no,voicemailplan_yes,a_length_scaled,num_voicemails_scaled,tot_day_mins_scaled,tot_day_calls_scaled,tot_day_charge_scaled,tot_eve_mins_scaled,tot_eve_calls_scaled,tot_eve_charge_scaled,tot_night_mins_scaled,tot_night_calls_scaled,tot_night_charge_scaled,tot_intl_mins_scaled,tot_intl_calls_scaled,tot_intl_charge_scaled,num_custser_calls_scaled,churn
4162,0,1,1,0,0.012397,0.0,0.482788,0.581818,0.482764,0.362387,0.552941,0.362342,0.620506,0.491429,0.620709,0.71,0.2,0.709259,0.0,Yes
4640,1,0,1,0,0.450413,0.0,0.714936,0.551515,0.714859,0.5697,0.558824,0.569719,0.63038,0.4,0.630838,0.645,0.1,0.644444,0.111111,Yes
509,1,0,0,1,0.483471,0.5,0.485917,0.690909,0.485944,0.548529,0.735294,0.54869,0.42962,0.56,0.429938,0.48,0.25,0.47963,0.555556,Yes
354,0,1,1,0,0.260331,0.0,0.671977,0.466667,0.671854,0.601045,0.5,0.6011,0.491392,0.554286,0.491277,0.66,0.1,0.659259,0.222222,Yes
1405,1,0,1,0,0.512397,0.0,0.407397,0.484848,0.407296,0.242233,0.552941,0.242316,0.59038,0.771429,0.590321,0.44,0.35,0.440741,0.444444,Yes


In [26]:
# shuffle the balanced dataset so that both the minority and majority classes are evenly distributed using the shuffle() function
from sklearn.utils import shuffle
balData = shuffle(balData)
balData.head()

Unnamed: 0,internationalplan_no,internationalplan_yes,voicemailplan_no,voicemailplan_yes,a_length_scaled,num_voicemails_scaled,tot_day_mins_scaled,tot_day_calls_scaled,tot_day_charge_scaled,tot_eve_mins_scaled,tot_eve_calls_scaled,tot_eve_charge_scaled,tot_night_mins_scaled,tot_night_calls_scaled,tot_night_charge_scaled,tot_intl_mins_scaled,tot_intl_calls_scaled,tot_intl_charge_scaled,num_custser_calls_scaled,churn
2033,1,0,1,0,0.665289,0.0,0.619061,0.527273,0.618976,0.767116,0.417647,0.767389,0.634684,0.371429,0.634778,0.52,0.2,0.52037,0.222222,Yes
4087,0,1,1,0,0.582645,0.0,0.456899,0.612121,0.456827,0.555403,0.735294,0.555484,0.561266,0.571429,0.561621,0.44,0.1,0.440741,0.111111,Yes
2230,0,1,1,0,0.446281,0.0,0.594879,0.854545,0.59488,0.563651,0.547059,0.563895,0.302278,0.634286,0.302195,0.39,0.15,0.390741,0.222222,No
1350,1,0,1,0,0.22314,0.0,0.812802,0.751515,0.812751,0.634864,0.623529,0.63507,0.584051,0.8,0.584131,0.74,0.35,0.740741,0.0,Yes
3428,0,1,1,0,0.173554,0.0,0.456615,0.557576,0.45666,0.648337,0.711765,0.648334,0.283038,0.542857,0.283061,0.69,0.25,0.690741,0.111111,Yes


In [27]:
balData.shape

(980, 20)
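The separate, sample, concatenate, and shuffle steps above can also be written more compactly. The following is a sketch of an alternative, not the activity's method: it uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) on a made-up frame to draw the same number of rows from each class.

```python
import pandas as pd

# Toy training frame: 3 churners and 9 non-churners (values are illustrative)
toy = pd.DataFrame({
    'feature': range(12),
    'churn': ['Yes'] * 3 + ['No'] * 9,
})

# Size of the smallest class
n_min = toy['churn'].value_counts().min()

# Draw n_min rows from every class, then shuffle the combined rows
balanced = (toy.groupby('churn')
               .sample(n=n_min, random_state=123)
               .sample(frac=1, random_state=123))

print(balanced['churn'].value_counts().to_dict())
```

Passing `random_state` to both sampling calls makes the balanced set reproducible between runs.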

In [28]:
balData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 980 entries, 2033 to 509
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   internationalplan_no      980 non-null    uint8  
 1   internationalplan_yes     980 non-null    uint8  
 2   voicemailplan_no          980 non-null    uint8  
 3   voicemailplan_yes         980 non-null    uint8  
 4   a_length_scaled           980 non-null    float64
 5   num_voicemails_scaled     980 non-null    float64
 6   tot_day_mins_scaled       980 non-null    float64
 7   tot_day_calls_scaled      980 non-null    float64
 8   tot_day_charge_scaled     980 non-null    float64
 9   tot_eve_mins_scaled       980 non-null    float64
 10  tot_eve_calls_scaled      980 non-null    float64
 11  tot_eve_charge_scaled     980 non-null    float64
 12  tot_night_mins_scaled     980 non-null    float64
 13  tot_night_calls_scaled    980 non-null    float64
 14  tot_nig

In [29]:
# separate the shuffled dataset into independent variables, X_train_new, and the dependent variable, y_train_new.
# The independent variables are the columns at positions 0 to 18, selected with the .iloc indexer.
# The dependent variable is selected by the column name 'churn'
X_train_new = balData.iloc[:, 0:19] # columns 0-18; the stop index 19 ('churn') is excluded
print(X_train_new.head())
y_train_new = balData['churn']
print(y_train_new.head())

      internationalplan_no  internationalplan_yes  voicemailplan_no  \
2033                     1                      0                 1   
4087                     0                      1                 1   
2230                     0                      1                 1   
1350                     1                      0                 1   
3428                     0                      1                 1   

      voicemailplan_yes  a_length_scaled  num_voicemails_scaled  \
2033                  0         0.665289                    0.0   
4087                  0         0.582645                    0.0   
2230                  0         0.446281                    0.0   
1350                  0         0.223140                    0.0   
3428                  0         0.173554                    0.0   

      tot_day_mins_scaled  tot_day_calls_scaled  tot_day_charge_scaled  \
2033             0.619061              0.527273               0.618976   
4087             0.456
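Positional slicing with `.iloc` excludes the stop index, which is why `0:19` selects the 19 feature columns and leaves out `churn` at position 19. The sketch below shows the same behaviour on a made-up three-column frame, next to a name-based alternative that does not depend on column order.

```python
import pandas as pd

# Toy frame with two features and the target in the last position
toy = pd.DataFrame({
    'f0': [0.1, 0.2],
    'f1': [0.3, 0.4],
    'churn': ['Yes', 'No'],
})

# Positional slice: columns 0 and 1; the stop index 2 is excluded
X_by_position = toy.iloc[:, 0:2]

# Name-based alternative: every column except the target
X_by_name = toy.drop(columns='churn')

print(X_by_position.equals(X_by_name))
```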

In [30]:
# fit LogisticRegression model
lr_model1 = LogisticRegression()
lr_model1.fit(X_train_new, y_train_new)

LogisticRegression()

In [31]:
# prediction for test set
pred1 = lr_model1.predict(X_test)
print('Accuracy of Logistic Regression Model prediction on test set for balanced data set: {:.2f}'
     .format(lr_model1.score(X_test, y_test)))

Accuracy of Logistic Regression Model prediction on test set for balanced data set: 0.79


In [33]:
# confusion matrix
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred1)
print(confusionMatrix)

# classification report 
print(classification_report(y_test, pred1))

[[1032  251]
 [  57  160]]
              precision    recall  f1-score   support

          No       0.95      0.80      0.87      1283
         Yes       0.39      0.74      0.51       217

    accuracy                           0.79      1500
   macro avg       0.67      0.77      0.69      1500
weighted avg       0.87      0.79      0.82      1500



### Random undersampling model analysis
The recall of the minority class, 'Yes', has improved from 0.18 to 0.74 now that the training set has been balanced, so the classifier is much better at identifying customers who churn. False negatives ('Yes' cases wrongly predicted as 'No') fell from 179 to 57. The cost is a drop in overall accuracy from 0.86 to 0.79, driven by false positives ('No' cases wrongly predicted as 'Yes'), which rose from 24 to 251 and pulled the precision of the 'Yes' class down from 0.61 to 0.39. Ideally, both off-diagonal cells of the confusion matrix would shrink. For churn, however, missing a customer who is about to leave is often treated as the costlier error, so the trade-off can be acceptable when the goal is to focus on customers who are likely to churn.

### Implementing SMOTE on customer churn dataset to find the optimal result
In this section, we generate synthetic samples of the minority class, churn 'Yes', using SMOTE to balance the training set.
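The core idea of SMOTE is to create each synthetic point on the line segment between a minority sample and one of its nearest minority neighbours. The following NumPy sketch illustrates only that interpolation step. It is not the `smote_variants` implementation, and the function name `smote_sketch` and the toy data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances among the minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # random minority points
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# 20 toy minority rows with 3 features (illustrative data)
X_min = np.random.default_rng(1).random((20, 3))
synthetic = smote_sketch(X_min, n_new=30)
print(synthetic.shape)
```

Because every new point lies between two existing minority samples, the synthetic data stays within the region the minority class already occupies.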

In [34]:
# shape before oversampling
print('Before OverSampling count of yes: {}'.format(sum(y_train=='Yes')))
print('Before OverSampling count of no: {}'.format(sum(y_train=='No')))

Before OverSampling count of yes: 490
Before OverSampling count of no: 3010


In [35]:
# instantiate SMOTE
oversampler = sv.SMOTE()

In [36]:
# creating new training set
X_train_os, y_train_os = oversampler.sample(np.array(X_train), np.array(y_train))

2021-10-25 06:41:58,253:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")


In [37]:
# shape after oversampling
print('After OverSampling, the shape of X_train: {}'.format(
X_train_os.shape))
print('After OverSampling, the shape of y_train: {}'.format(
y_train_os.shape))
print("After OverSampling, counts of labels 'Yes': {}".format(
sum(y_train_os=='Yes')))
print("After OverSampling, counts of labels 'No': {}".format(
sum(y_train_os=='No')))

After OverSampling, the shape of X_train: (6020, 19)
After OverSampling, the shape of y_train: (6020,)
After OverSampling, counts of labels 'Yes': 3010
After OverSampling, counts of labels 'No': 3010


The training set has grown from 3,500 to 6,020 rows now that synthetic points have been generated using SMOTE. Next, fit a logistic regression model on the new sample and analyze the results using a confusion matrix and a classification report.

In [38]:
# Regression model2
lr_model2 = LogisticRegression()
lr_model2.fit(X_train_os, y_train_os)

LogisticRegression()

In [39]:
# predictions on test set
pred = lr_model2.predict(X_test)

# accuracy values
print('Accuracy of Logistic regression model prediction on test set for SMOTE balanced data set:'
       '{:.2f}'.format(lr_model2.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set for SMOTE balanced data set:0.78


In [41]:
# confusion matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)

# classification report
print(classification_report(y_test, pred))

[[1007  276]
 [  54  163]]
              precision    recall  f1-score   support

          No       0.95      0.78      0.86      1283
         Yes       0.37      0.75      0.50       217

    accuracy                           0.78      1500
   macro avg       0.66      0.77      0.68      1500
weighted avg       0.87      0.78      0.81      1500



### SMOTE Analysis
From the generated metrics, we can see that the results are very similar to the undersampling results: recall for the 'Yes' cases is 0.75 versus 0.74, while recall for the 'No' cases has dropped from 0.80 to 0.78 and accuracy from 0.79 to 0.78.

#### Implementing MSMOTE on churn dataset to find optimal result
Use MSMOTE to generate synthetic samples of the minority class and balance the training set. Then fit a logistic regression model and analyze the results using a confusion matrix and a classification report.

In [42]:
# shape before oversampling
print('Before OverSampling count of yes: {}'.format(sum(y_train=='Yes')))
print('Before OverSampling count of no: {}'.format(sum(y_train=='No')))

Before OverSampling count of yes: 490
Before OverSampling count of no: 3010


In [43]:
# instantiate MSMOTE
oversampler = sv.MSMOTE()

In [44]:
# creating new training set
X_train_os, y_train_os = oversampler.sample(np.array(X_train), np.array(y_train))

2021-10-25 07:13:03,008:INFO:MSMOTE: Running sampling via ('MSMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")


In [45]:
# shape after oversampling
print('After OverSampling, the shape of X_train: {}'.format(
X_train_os.shape))
print('After OverSampling, the shape of y_train: {}'.format(
y_train_os.shape))
print("After OverSampling, counts of labels 'Yes': {}".format(
sum(y_train_os=='Yes')))
print("After OverSampling, counts of labels 'No': {}".format(
sum(y_train_os=='No')))

After OverSampling, the shape of X_train: (6020, 19)
After OverSampling, the shape of y_train: (6020,)
After OverSampling, counts of labels 'Yes': 3010
After OverSampling, counts of labels 'No': 3010


In [46]:
# Regression model3
lr_model3 = LogisticRegression()
lr_model3.fit(X_train_os, y_train_os)

LogisticRegression()

In [47]:
# predictions on test set
pred = lr_model3.predict(X_test)

# accuracy values
print('Accuracy of Logistic regression model prediction on test set for MSMOTE balanced data set:'
       '{:.2f}'.format(lr_model3.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set for MSMOTE balanced data set:0.80


In [48]:
# confusion matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)

# classification report
print(classification_report(y_test, pred))

[[1034  249]
 [  53  164]]
              precision    recall  f1-score   support

          No       0.95      0.81      0.87      1283
         Yes       0.40      0.76      0.52       217

    accuracy                           0.80      1500
   macro avg       0.67      0.78      0.70      1500
weighted avg       0.87      0.80      0.82      1500



### MSMOTE Analysis
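Metrics have improved slightly compared to the SMOTE implementation: accuracy is 0.80 versus 0.78, and for the 'Yes' class recall is 0.76 versus 0.75 and F1 is 0.52 versus 0.50. MSMOTE also edges out random undersampling (accuracy 0.79, recall 0.74, F1 0.51). The margins are small and come from a single train/test split with no random_state set on the oversamplers, so we can conclude only that MSMOTE might be the best method for this use case.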
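To compare the four models side by side, the metrics can be recomputed from the confusion matrices printed in this notebook. The sketch below copies those matrices by hand and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, both in the order 'No', 'Yes'). The column names in the summary are illustrative.

```python
import numpy as np
import pandas as pd

# Test-set confusion matrices copied from the outputs above
# (rows: true No/Yes, columns: predicted No/Yes)
matrices = {
    'Benchmark':     [[1259, 24], [179, 38]],
    'Undersampling': [[1032, 251], [57, 160]],
    'SMOTE':         [[1007, 276], [54, 163]],
    'MSMOTE':        [[1034, 249], [53, 164]],
}

rows = []
for name, cm in matrices.items():
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows.append({
        'method': name,
        'accuracy': (tn + tp) / np.sum(cm),
        'precision_yes': precision,
        'recall_yes': recall,
        'f1_yes': 2 * precision * recall / (precision + recall),
    })

summary = pd.DataFrame(rows).set_index('method').round(2)
print(summary)
```

All three balancing methods land within a few hundredths of each other on every metric, which is why the conclusion above is stated cautiously.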