**We will discuss strategies for identifying imbalanced datasets and ways to mitigate their effects.**

Benchmarking the Logistic Regression Model on the Dataset

In [0]:
import pandas as pd
filename = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter13/Dataset/bank-full.csv'

In [83]:
#Loading the data using pandas
bankData = pd.read_csv(filename,sep=";")
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [84]:
bankData.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

Normalize the numerical features (age, balance, and duration) through scaling

In [0]:
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()

With the scaler instantiated, we convert each of the numerical columns to a scaled version, as in the following code snippet:

In [0]:
# Converting each of the columns to scaled version
bankData['ageScaled'] = rob_scaler.fit_transform(bankData['age'].values.reshape(-1,1))
bankData['balScaled'] = rob_scaler.fit_transform(bankData['balance'].values.reshape(-1,1))
bankData['durScaled'] = rob_scaler.fit_transform(bankData['duration'].values.reshape(-1,1))
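
To make the scaling step more concrete, here is a minimal sketch of what robust scaling computes, assuming the default behaviour of centring on the median and dividing by the interquartile range (IQR). The toy array is hypothetical and not taken from the bank dataset. The scaled ages shown in this section (58 becomes 1.266667, 44 becomes 0.333333) are consistent with a median age of 39 and an IQR of 15.

```python
import numpy as np

# Toy feature with one extreme value (hypothetical data)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Robust scaling: centre on the median, divide by the interquartile range
robust_scaled = (values - median) / iqr

# Z-score scaling for comparison: mean and standard deviation are pulled by the outlier
z_scaled = (values - values.mean()) / values.std()

print(robust_scaled)
print(z_scaled)
```

Because the median and IQR are barely affected by the extreme value, the bulk of the robust-scaled values stay spread out, whereas the z-scores of the first five values are squeezed close together.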

Now that the scaled features have been added, drop the original features using the .drop() function:

In [0]:
# Dropping the original columns
bankData.drop(['age','balance','duration'], axis=1, inplace=True)

In [88]:
bankData.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,management,married,tertiary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,1.266667,1.25,0.375
1,technician,single,secondary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,entrepreneur,married,secondary,no,yes,yes,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,blue-collar,married,unknown,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,unknown,single,unknown,no,no,no,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.329646,0.083333


The categorical features in the dataset have to be converted into numerical values by transforming them into dummy values, which was covered in Chapter 3, Binary Classification.

In [0]:
bankCat = pd.get_dummies(bankData[['job','marital','education','default','housing','loan','contact','month','poutcome']])

In [90]:
bankCat.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
3,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
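
As a small illustration of what the dummy transformation does, the following sketch applies pd.get_dummies() to a hypothetical three-row frame with a single categorical column. Each category becomes its own indicator column, named with the original column name as a prefix, and every row has exactly one active indicator.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column
toy = pd.DataFrame({"marital": ["married", "single", "divorced"]})

dummies = pd.get_dummies(toy[["marital"]])

# One indicator column per category
print(dummies.columns.tolist())
# Each row has exactly one active indicator
print(dummies.sum(axis=1).tolist())
```

Depending on the pandas version, the indicator values may be displayed as 0/1 integers or as True/False booleans; both carry the same information.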


Separate the numerical data as in the following code snippet

In [0]:
bankNum = bankData[['ageScaled','balScaled','day','durScaled','campaign','pdays','previous']]

After the categorical values are transformed, they must be combined with the scaled numerical values of the data frame to get the feature-engineered dataset.

Create the independent variables, X, and dependent variables, Y, from the combined dataset for modeling, as in the following code snippet:

In [92]:
# Merging with the original data frame
# Preparing the X variables
X = pd.concat([bankCat, bankNum], axis=1)
print(X.shape)
# Preparing the Y variable
Y = bankData['y']
print(Y.shape)
X.head()

(45211, 51)
(45211,)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,ageScaled,balScaled,day,durScaled,campaign,pdays,previous
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1.266667,1.25,5,0.375,1,-1,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0.333333,-0.308997,5,-0.134259,1,-1,0
2,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.4,-0.328909,5,-0.481481,1,-1,0
3,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0.533333,0.780236,5,-0.407407,1,-1,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.4,-0.329646,5,0.083333,1,-1,0


Now, import the train_test_split() function and the LogisticRegression class from sklearn:

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Split the data into train and test sets at a 70:30 ratio by passing test_size=0.3 to the splitting function. We also set random_state so that the split is reproducible:

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
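
As a quick sanity check on the 70:30 split, the sizes of the two sets can be worked out by hand. The sketch below assumes that the test-set size is rounded up and the remainder goes to the training set, which is consistent with the 13,564 test rows and 31,647 training rows (3,723 + 27,924) reported elsewhere in this section.

```python
import math

n_samples = 45211   # rows in X, from the X.shape output
test_size = 0.3

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```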

In [0]:
# Instantiating the LogisticRegression model
bankModel = LogisticRegression()

In [96]:
bankModel.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
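
The message above is a convergence warning: the lbfgs solver stopped at the default limit of 100 iterations. The warning text itself suggests raising max_iter or scaling the data, and in our feature set several columns (day, campaign, pdays, previous) are left unscaled, which may contribute. The following is a minimal sketch of the first remedy on hypothetical random data; it is not a rerun of the bank model, and the metrics reported in this section come from the model fitted with the default settings.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Hypothetical, well-conditioned toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 5))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = LogisticRegression(max_iter=1000)  # raise the iteration limit
    model.fit(X_toy, y_toy)

# Check whether the solver still complained about convergence
did_warn = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", model.n_iter_[0], "| convergence warning:", did_warn)
```

The n_iter_ attribute reports how many iterations the solver actually used, which is a simple way to confirm that the limit was not reached.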

In [97]:
pred = bankModel.predict(X_test)
print('Accuracy of Logistic regression model prediction on test set: {:.2f}'.format(bankModel.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set: 0.90


Now, use both the confusion_matrix() and classification_report() functions to generate the metrics for further analysis, which we will cover in the Analysis of the Result section:

In [98]:
# Confusion Matrix for the model
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

[[11696   302]
 [ 1073   493]]
              precision    recall  f1-score   support

          no       0.92      0.97      0.94     11998
         yes       0.62      0.31      0.42      1566

    accuracy                           0.90     13564
   macro avg       0.77      0.64      0.68     13564
weighted avg       0.88      0.90      0.88     13564



In this exercise, we have produced a report that may explain the issue with the number of customers expected to purchase the term deposit plan. From the support column, we can see that the number of No cases (11,998) is much higher than the number of Yes cases (1,566).

From the accuracy perspective, the model would seem like it is doing a reasonable job. However, the reality might be quite different. To find out what's really the case, let's look at the precision and recall values, which are available from the classification report we obtained.

Recall indicates the ability of the classifier to correctly identify the members of each class. From the metrics, we see that the model that we built does a good job of identifying the 'no' class (recall of 0.97), but does a very poor job of identifying the 'yes' class (recall of 0.31).
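
To see where these numbers come from, precision and recall can be recomputed directly from the confusion matrix printed earlier. This is a minimal sketch in plain Python; it assumes the usual scikit-learn layout, with rows as actual classes and columns as predicted classes, in the label order 'no', 'yes'. The helper names are illustrative.

```python
# Confusion matrix of the benchmark model: rows = actual, columns = predicted
cm = [[11696, 302],
      [1073, 493]]
labels = ["no", "yes"]

def per_class_recall(matrix):
    # Correct predictions for a class divided by the actual members of that class
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def per_class_precision(matrix):
    # Correct predictions for a class divided by everything predicted as that class
    col_totals = [sum(col) for col in zip(*matrix)]
    return [matrix[i][i] / col_totals[i] for i in range(len(matrix))]

recall = per_class_recall(cm)
precision = per_class_precision(cm)
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Only 493 of the 1,566 actual 'yes' cases are found, which is what the recall of 0.31 expresses.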

Why do you think that the classifier is biased toward one class? The answer to this can be unearthed by looking at the class balance in the training set.

The following code will generate the percentages of the classes in the training data:

In [99]:

print('Percentage of minority class (yes) :',(y_train[y_train=='yes'].value_counts()/len(y_train) ) * 100)
print('Percentage of majority class (no) :',(y_train[y_train=='no'].value_counts()/len(y_train) ) * 100)

Percentage of minority class (yes) : yes    11.764148
Name: y, dtype: float64
Percentage of majority class (no) : no    88.235852
Name: y, dtype: float64
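
The same percentages can be obtained more compactly with value_counts(normalize=True). The sketch below does not load the bank data; it rebuilds a stand-in label series from the class counts reported for the training set in this section (3,723 'yes' and 27,924 'no'), so the variable y_demo is hypothetical.

```python
import pandas as pd

# Stand-in for y_train, rebuilt from the reported class counts
y_demo = pd.Series(["yes"] * 3723 + ["no"] * 27924, name="y")

# Share of each class, expressed as a percentage
class_pct = y_demo.value_counts(normalize=True) * 100
print(class_pct.round(2))
```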


**From this, we can see that the majority of the training set (88%) is made up of the 'no' class. This imbalance is one of the major reasons behind the poor recall for the 'yes' class that we obtained with the logistic regression classifier we have selected.**

**Challenges of Imbalanced Datasets**

As seen from the classifier example, one of the biggest challenges with imbalanced datasets is the bias toward the majority class, which made up 88% of the training data in the previous example. This leads to suboptimal results. However, what makes such cases even more challenging is the deceptive nature of the results if the right metric is not used.

Therefore, it is important to identify cases with imbalanced datasets and equally important to pick the right metric for analyzing such datasets. The right metric in this example would have been the recall value for each of the classes.

From the recall values, we could have identified the bias of the classifier toward the majority class, prompting us to look at strategies for mitigating such biases, which is the next topic we will focus on.
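
A simple way to see how deceptive accuracy can be is to imagine a hypothetical "classifier" that ignores its inputs and always predicts the majority class. Using the test-set supports from the classification report, the sketch below shows that such a baseline already reaches an accuracy close to that of the benchmark model while finding none of the 'yes' cases.

```python
# Test-set class supports, taken from the classification report
support = {"no": 11998, "yes": 1566}
total = sum(support.values())

# Hypothetical baseline that always predicts the majority class
majority = max(support, key=support.get)
baseline_accuracy = support[majority] / total
baseline_recall = {label: (1.0 if label == majority else 0.0) for label in support}

print(f"always-'{majority}' accuracy: {baseline_accuracy:.2f}")
print("recall per class:", baseline_recall)
```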

# **Strategies for Dealing with Imbalanced Datasets**

**Collecting More Data**

Having encountered an imbalanced dataset, one of the first questions you need to ask is whether it is possible to get more data. This might appear naïve, but collecting more data, especially from the minority class, and then balancing the dataset should be the first strategy for addressing the class imbalance.

**Resampling Data**

One effective strategy is to resample the dataset to make it more balanced. Resampling means drawing samples from the available dataset to create a new, balanced dataset. In random undersampling, the idea is to randomly pick samples from the majority class so that the final dataset is more balanced.

***Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result***

In this exercise, you will undersample the majority class (propensity 'No') to make the dataset balanced. On the new balanced dataset, you will fit a logistic regression model and then analyze the results.

In [0]:
# Let us first join X_train and y_train for ease of operation
trainData = pd.concat([X_train,y_train],axis=1)

In [101]:
trainData.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,ageScaled,balScaled,day,durScaled,campaign,pdays,previous,y
19100,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.8,-0.162979,5,0.236111,1,-1,0,no
37958,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0.733333,-0.238938,14,0.865741,2,289,19,no
12451,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0.0,0.385693,1,1.347222,3,-1,0,no
18263,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1.333333,-0.330383,31,-0.592593,8,-1,0,no
5128,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.466667,-0.14233,21,-0.435185,2,-1,0,no


Now, let's move on to separating the minority and majority classes into separate datasets. This is required because we have to sample separately from the majority class to make a balanced dataset.

To separate the minority class, we first identify the indexes of the rows where y is 'yes', using the .index attribute of the filtered data frame.

Once those indexes are identified, the matching rows are selected from the main dataset using the .loc indexer and stored in a new variable for the minority class. The shape of the minority dataset is also printed. A similar process is followed for the majority class and, after these two steps, we have two datasets: one for the minority class and one for the majority class.

Next, find the indexes of the sample dataset where the propensity is yes:

In [102]:
ind = trainData[trainData['y']=='yes'].index
print(len(ind))

3723


Separate out the minority class as in the following code snippet:

In [103]:
minData = trainData.loc[ind]
print(minData.shape)

(3723, 52)


Now, find the indexes of the majority class

In [104]:
ind1 = trainData[trainData['y']=='no'].index
print(len(ind1))

27924


Separate out the majority class as in the following code snippet:

In [105]:
majData = trainData.loc[ind1]
print(majData.shape)
majData.head()

(27924, 52)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,ageScaled,balScaled,day,durScaled,campaign,pdays,previous,y
19100,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.8,-0.162979,5,0.236111,1,-1,0,no
37958,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0.733333,-0.238938,14,0.865741,2,289,19,no
12451,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0.0,0.385693,1,1.347222,3,-1,0,no
18263,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1.333333,-0.330383,31,-0.592593,8,-1,0,no
5128,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.466667,-0.14233,21,-0.435185,2,-1,0,no


Once the majority class is separated, we can proceed with sampling from it.

Take a random sample equal to the length of the minority class to make the dataset balanced.

Extract the samples using the .sample() function:

In [0]:
majSample = majData.sample(n=len(ind),random_state = 123)

The number of examples that are sampled is equal to the number of examples in the minority class. This is implemented with the parameter n=len(ind); random_state is set so that the same sample is drawn on every run.

Now that sampling is done, print the shape of the sampled majority class dataset and its head:

In [107]:
print(majSample.shape)
majSample.head()

(3723, 52)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,ageScaled,balScaled,day,durScaled,campaign,pdays,previous,y
17387,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0.666667,0.752212,28,-0.425926,3,-1,0,no
34679,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0.8,0.086283,5,-0.106481,7,250,3,no
26572,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0.466667,1.785398,20,-0.134259,2,-1,0,no
3280,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1.2,1.972714,15,-0.009259,1,-1,0,no
4434,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.133333,2.011062,20,-0.055556,1,-1,0,no


Now, we move on to preparing the new training data.

After preparing the individual datasets, we can now concatenate them using the pd.concat() function:

In [0]:
# Concatenating the minority data and the majority sample
balData = pd.concat([minData,majSample],axis = 0)

Now, shuffle the dataset using the shuffle() function so that the minority and majority examples are interleaved rather than stored in two consecutive blocks:

In [109]:
# Shuffling the data set
from sklearn.utils import shuffle
balData = shuffle(balData)
balData.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,education_primary,education_secondary,education_tertiary,education_unknown,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,ageScaled,balScaled,day,durScaled,campaign,pdays,previous,y
43512,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,-1.0,0.624631,21,1.37037,1,-1,0,yes
26172,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,-0.533333,-0.074484,20,-0.736111,1,-1,0,no
4648,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-0.466667,1.414454,20,-0.24537,2,-1,0,no
40642,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0.533333,-0.266962,5,0.337963,1,97,4,yes
35264,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0.2,0.727876,7,4.481481,1,-1,0,yes
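
The separate, sample, concatenate, and shuffle steps above can also be condensed into a single helper. The following is a minimal sketch on a hypothetical toy frame; the function name undersample_majority is illustrative and not part of pandas or scikit-learn. It downsamples every class to the size of the smallest class and then shuffles the rows.

```python
import pandas as pd

def undersample_majority(df, label_col, random_state=None):
    """Downsample every class to the size of the smallest class, then shuffle."""
    n_min = df[label_col].value_counts().min()
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(label_col)]
    balanced = pd.concat(parts, axis=0)
    # frac=1 returns all rows in random order
    return balanced.sample(frac=1, random_state=random_state)

# Hypothetical imbalanced frame: 2 'yes' rows and 8 'no' rows
toy = pd.DataFrame({"feature": range(10),
                    "y": ["yes"] * 2 + ["no"] * 8})

balanced = undersample_majority(toy, "y", random_state=123)
print(balanced["y"].value_counts())
```

Passing a fixed random_state to both sampling steps keeps the result reproducible, which the call to shuffle() in the exercise does not guarantee as written.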


Now, separate the shuffled dataset into the independent variables, X_trainNew, and the dependent variable, y_trainNew. The independent variables are selected by column position, taking the first 51 columns (positions 0 to 50) with the .iloc indexer in pandas. The dependent variable is selected by sub-setting with the column name 'y':

In [110]:
# Making the new X_train and y_train
X_trainNew = balData.iloc[:,0:51]
print(X_trainNew.head())

y_trainNew = balData['y']
print(y_trainNew.head())

       job_admin.  job_blue-collar  job_entrepreneur  ...  campaign  pdays  previous
43512           1                0                 0  ...         1     -1         0
26172           0                1                 0  ...         1     -1         0
4648            0                1                 0  ...         2     -1         0
40642           0                0                 0  ...         1     97         4
35264           0                0                 0  ...         1     -1         0

[5 rows x 51 columns]
43512    yes
26172     no
4648      no
40642    yes
35264    yes
Name: y, dtype: object
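
Selecting the features by position works here because 'y' is the last of the 52 columns. As an alternative sketch, the target column can be dropped by name, which does not depend on the column order. The toy frame below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"f1": [0.1, 0.2], "f2": [1, 0], "y": ["yes", "no"]})

X_by_position = toy.iloc[:, 0:2]      # relies on 'y' being the last column
X_by_name = toy.drop(columns=["y"])   # independent of the column order
y_toy = toy["y"]

print(X_by_name.columns.tolist())
```

Both selections give the same feature frame in this case; the name-based version simply keeps working if columns are added or reordered later.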


First, define and fit the LogisticRegression model with the following code snippet:

In [111]:
from sklearn.linear_model import LogisticRegression
bankModel1 = LogisticRegression()
bankModel1.fit(X_trainNew, y_trainNew)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Next, perform the prediction on the test set with the following code snippet:

In [112]:
pred = bankModel1.predict(X_test)
print('Accuracy of Logistic regression model prediction on test set for balanced data set: {:.2f}'.format(bankModel1.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set for balanced data set: 0.83


Now, generate the confusion matrix for the model and print the results:

In [113]:
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

[[9969 2029]
 [ 278 1288]]
              precision    recall  f1-score   support

          no       0.97      0.83      0.90     11998
         yes       0.39      0.82      0.53      1566

    accuracy                           0.83     13564
   macro avg       0.68      0.83      0.71     13564
weighted avg       0.91      0.83      0.85     13564



Let's analyze the results and compare them with those of the benchmark logistic regression model that we built at the beginning of this chapter. In the benchmark model, we had the problem of the model being biased toward the majority class with a very low recall value for the yes cases.

Now, by balancing the dataset, **we have seen that the recall for the minority class has improved tremendously, from a low of 0.31 to around 0.82. This means that by balancing the dataset**, the classifier has improved its ability to identify the minority ('yes') cases.

However, we can see that our overall accuracy has taken a hit. From a high of around 90%, it has come down to around 83%. One major area where accuracy has taken a hit is the number of false positives, which are those No cases that were wrongly predicted as Yes: these rose from 302 in the benchmark model to 2,029.

**Generating Synthetic Samples**

In the previous section, we looked at the undersampling method, where we downsized the majority class to make the dataset balanced. However, when undersampling, we reduced the size of the dataset. In many circumstances, downsizing the dataset can have adverse effects on the predictive power of the classifier. An effective way to counter the downsizing of the dataset is to oversample the minority class. Oversampling is done by generating new synthetic data points similar to those of the minority class, thereby balancing the dataset.

Two very popular methods for generating such synthetic points are:

***Synthetic Minority Oversampling Technique (SMOTE)***

***Modified SMOTE (MSMOTE)***

The SMOTE algorithm generates synthetic data by looking at the neighborhood of each minority-class sample and generating new data points within that neighborhood. In creating synthetic points, an imaginary line is drawn between a minority sample and one of its nearest minority neighbors, and a new data point is generated at a random position on this line.
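
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is an illustrative simplification, not the implementation used by the smote_variants library: the function name smote_like_samples, the toy points, and the choice of k are all assumptions, and the sketch expects at least two minority samples.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Interpolate between minority samples and their nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for every new point
    base_idx = rng.integers(0, len(minority), size=n_new)
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # position along the segment, in [0, 1)
    return minority[base_idx] + gap * (minority[nn_idx] - minority[base_idx])

# Hypothetical minority samples at the corners of the unit square
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority_points, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic point lies on a segment between two existing minority samples, so the new points stay inside the region already occupied by the minority class.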

MSMOTE is a modification of the SMOTE algorithm and takes a different approach to generating synthetic points. MSMOTE classifies the minority class samples into three distinct groups: security samples, border samples, and latent noise samples. Different strategies are adopted to generate neighborhood points based on the group each minority sample falls into.

Implementation of SMOTE and MSMOTE

In [114]:
!pip install smote-variants



In [115]:
# Shape before oversampling
print("Before OverSampling count of yes: {}".format(sum(y_train=='yes')))
print("Before OverSampling count of no: {} \n".format(sum(y_train=='no')))

Before OverSampling count of yes: 3723
Before OverSampling count of no: 27924 



In [0]:
import smote_variants as sv
import numpy as np

Now, instantiate the SMOTE class and assign it to a variable called oversampler using the sv.SMOTE() function:

In [0]:
# Instantiating the SMOTE class
oversampler= sv.SMOTE()

Now, generate the oversampled training set using the .sample() function of oversampler:

In [118]:
# Creating new training set
X_train_os, y_train_os = oversampler.sample(np.array(X_train), np.array(y_train))

2020-05-01 04:30:48,761:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")


The log line above is printed by the library when the .sample() function is applied; it lists the parameters used for oversampling.

Now, print the shapes of the new X and y variables and the counts of the classes

In [119]:
# Shape after oversampling
print('After OverSampling, the shape of train_X: {}'.format(X_train_os.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_os.shape))
print("After OverSampling, counts of label 'Yes': {}".format(sum(y_train_os=='yes')))
print("After OverSampling, counts of label 'no': {}".format(sum(y_train_os=='no')))

After OverSampling, the shape of train_X: (55848, 51)
After OverSampling, the shape of train_y: (55848,) 

After OverSampling, counts of label 'Yes': 27924
After OverSampling, counts of label 'no': 27924


Now that we have generated synthetic points using SMOTE and balanced the dataset, let's fit a logistic regression model on the new sample and analyze the results using a confusion matrix and a classification report.

Import the LogisticRegression class and fit the model:

In [0]:
# Training the model with Logistic regression model
from sklearn.linear_model import LogisticRegression

In [121]:
bankModel2 = LogisticRegression()
bankModel2.fit(X_train_os, y_train_os)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now, predict using .predict on the test set, as mentioned in the following code snippet:

In [0]:
pred = bankModel2.predict(X_test)

Next, print the accuracy values

In [123]:
print('Accuracy of Logistic regression model prediction on test set for Smote balanced data set: {:.2f}'.format(bankModel2.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set for Smote balanced data set: 0.84


Then, generate the confusion matrix for the model:

In [124]:
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)

[[10077  1921]
 [  311  1255]]


Generate the classification report for the model:

In [125]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

          no       0.97      0.84      0.90     11998
         yes       0.40      0.80      0.53      1566

    accuracy                           0.84     13564
   macro avg       0.68      0.82      0.71     13564
weighted avg       0.90      0.84      0.86     13564



From the generated metrics, we can see that the results are very similar to the undersampling results, with the exception that the recall value of the **'Yes' cases has reduced from 0.82 to around 0.80.** The results that are generated vary from one use case to the next. **SMOTE and its variants have been shown to give robust results in balancing data and are therefore among the most popular methods used when encountering use cases with highly imbalanced data.**

**Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result**

In this exercise, we will generate synthetic samples of the minority class using MSMOTE and then make the dataset balanced. Then, on the new balanced dataset, we will fit a logistic regression model and analyze the results. 

Now, print the count of both the classes before we oversample

In [126]:
# Shape before oversampling
print("Before OverSampling count of yes: {}".format(sum(y_train=='yes')))
print("Before OverSampling count of no: {} \n".format(sum(y_train=='no')))

Before OverSampling count of yes: 3723
Before OverSampling count of no: 27924 



In [0]:
import smote_variants as sv
import numpy as np

Now, instantiate the MSMOTE class and assign it to a variable called oversampler using the sv.MSMOTE() function:

In [0]:
# Instantiating the MSMOTE class
oversampler= sv.MSMOTE()

Now, generate the oversampled training set using the .sample() function of oversampler:

In [129]:
# Creating new training set
X_train_os, y_train_os = oversampler.sample(np.array(X_train), np.array(y_train))

2020-05-01 04:40:58,424:INFO:MSMOTE: Running sampling via ('MSMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")


In [130]:
# Shape after oversampling
print('After OverSampling, the shape of train_X: {}'.format(X_train_os.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_os.shape))
print("After OverSampling, counts of label 'Yes': {}".format(sum(y_train_os=='yes')))
print("After OverSampling, counts of label 'no': {}".format(sum(y_train_os=='no')))

After OverSampling, the shape of train_X: (55848, 51)
After OverSampling, the shape of train_y: (55848,) 

After OverSampling, counts of label 'Yes': 27924
After OverSampling, counts of label 'no': 27924


In [131]:
# Training the model with Logistic regression model
from sklearn.linear_model import LogisticRegression

# Instantiating and fitting the LogisticRegression model
bankModel3 = LogisticRegression()
bankModel3.fit(X_train_os, y_train_os)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now, predict using .predict on the test set as in the following code snippet

In [0]:
pred = bankModel3.predict(X_test)

Next, print the accuracy values

In [133]:
print ('Accuracy of Logistic regression model prediction on test set for MSMOTE balanced data set: {:.2f}'.format(bankModel3.score(X_test, y_test)))

Accuracy of Logistic regression model prediction on test set for MSMOTE balanced data set: 0.84


In [134]:
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)

[[10180  1818]
 [  329  1237]]


In [135]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

          no       0.97      0.85      0.90     11998
         yes       0.40      0.79      0.54      1566

    accuracy                           0.84     13564
   macro avg       0.69      0.82      0.72     13564
weighted avg       0.90      0.84      0.86     13564



From the implementation of MSMOTE, it is seen that the recall for the 'Yes' cases has dropped slightly compared to the SMOTE implementation (from 0.80 to 0.79), while accuracy and precision are essentially unchanged.

**We can then conclude that MSMOTE might not be the best method for this use case.**
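
To compare the four models side by side, the metrics for the 'yes' class can be recomputed from the confusion matrices printed in this section. The sketch below treats 'yes' as the class of interest and assumes the scikit-learn layout, with rows as actual classes and columns as predicted classes, in the label order 'no', 'yes'. The helper name yes_metrics is illustrative.

```python
# Confusion matrices printed in this section: rows = actual, columns = predicted
results = {
    "benchmark":     [[11696, 302], [1073, 493]],
    "undersampling": [[9969, 2029], [278, 1288]],
    "SMOTE":         [[10077, 1921], [311, 1255]],
    "MSMOTE":        [[10180, 1818], [329, 1237]],
}

def yes_metrics(cm):
    # Unpack with 'yes' as the class of interest
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

summary = {name: yes_metrics(cm) for name, cm in results.items()}
for name, (p, r, f1, acc) in summary.items():
    print(f"{name:<14} precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
```

All three balancing strategies trade some accuracy and precision for a much higher recall on the 'yes' class, and the differences among them are small on this dataset.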