## Implementing SMOTE on Our Banking Dataset to Find the Optimal Result
In this exercise, we will use SMOTE to generate synthetic samples of the minority class and balance the dataset. We will then fit a logistic regression model on the balanced dataset and analyze the results:

In [1]:
import pandas as pd

In [2]:
bankData = pd.read_csv('bank-data-set.csv', sep=';')

In [3]:
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [4]:
# normalize age, balance and duration
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler() # centers on the median and scales by the IQR, so it is less sensitive to outliers than MinMaxScaler

In [5]:
# convert columns to scaled versions
bankData['ageScaled'] = rob_scaler.fit_transform(bankData['age'].values.reshape(-1,1))
bankData['balScaled'] = rob_scaler.fit_transform(bankData['balance'].values.reshape(-1,1))
bankData['durScaled'] = rob_scaler.fit_transform(bankData['duration'].values.reshape(-1,1))

In [6]:
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,1.266667,1.25,0.375
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,-0.4,-0.329646,0.083333
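
To make the scaling step concrete, here is a minimal sketch of what `RobustScaler` computes with its default settings: each value has the column median subtracted and is then divided by the interquartile range. The toy values below are hypothetical and are not taken from the bank dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one extreme value (not from the bank dataset)
values = np.array([20.0, 30.0, 40.0, 50.0, 1000.0]).reshape(-1, 1)

# Default RobustScaler: subtract the median, divide by the IQR (75th - 25th percentile)
scaled = RobustScaler().fit_transform(values)

# The same computation done by hand
median = np.median(values)
q25, q75 = np.percentile(values, [25, 75])
manual = (values - median) / (q75 - q25)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the median and IQR barely move when a single extreme value is present, the bulk of the column stays in a narrow range while the outlier remains clearly separated.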


In [7]:
# drop original features
bankData.drop(['age', 'balance', 'duration'], axis=1, inplace=True)
bankData.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,management,married,tertiary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,1.266667,1.25,0.375
1,technician,single,secondary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,entrepreneur,married,secondary,no,yes,yes,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,blue-collar,married,unknown,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,unknown,single,unknown,no,no,no,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.329646,0.083333


In [8]:
bankData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   job        45211 non-null  object 
 1   marital    45211 non-null  object 
 2   education  45211 non-null  object 
 3   default    45211 non-null  object 
 4   housing    45211 non-null  object 
 5   loan       45211 non-null  object 
 6   contact    45211 non-null  object 
 7   day        45211 non-null  int64  
 8   month      45211 non-null  object 
 9   campaign   45211 non-null  int64  
 10  pdays      45211 non-null  int64  
 11  previous   45211 non-null  int64  
 12  poutcome   45211 non-null  object 
 13  y          45211 non-null  object 
 14  ageScaled  45211 non-null  float64
 15  balScaled  45211 non-null  float64
 16  durScaled  45211 non-null  float64
dtypes: float64(3), int64(4), object(10)
memory usage: 5.9+ MB


In [9]:
# convert categorical features into numerical values using dummy values
bankCat = pd.get_dummies(bankData[['job', 'marital', 'education', 'default',
                                  'housing', 'loan', 'contact', 'month', 'poutcome']])

In [10]:
# separate numerical data
bankNum = bankData[['day', 'campaign', 'pdays', 'previous', 'ageScaled', 'balScaled', 'durScaled']]
bankNum.shape

(45211, 7)

After the categorical values are transformed into dummy variables, they must be combined with the numerical columns (including the scaled ones) to get the feature-engineered dataset.

In [11]:
# combine the dummy-encoded and numerical features
# preparing the X variables
X = pd.concat([bankCat, bankNum], axis=1)
print(X.shape)

# preparing the Y variable
Y = bankData['y']
print(Y.shape)
X.head()

(45211, 51)
(45211,)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_other,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,5,1,-1,0,1.266667,1.25,0.375
1,0,0,0,0,0,0,0,0,0,1,...,0,0,1,5,1,-1,0,0.333333,-0.308997,-0.134259
2,0,0,1,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,-0.4,-0.328909,-0.481481
3,0,1,0,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,0.533333,0.780236,-0.407407
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,-0.4,-0.329646,0.083333


In [12]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# split data into train and test sets with test_size = 0.3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

In [13]:
# shape before oversampling
print('Before OverSampling count of yes: {}'.format(sum(Y_train=='yes')))
print('Before OverSampling count of no: {}'.format(sum(Y_train=='no')))

Before OverSampling count of yes: 3723
Before OverSampling count of no: 27924


In [14]:
# OverSampling with SMOTE
!pip install smote-variants

In [17]:
import smote_variants as sv
import numpy as np

Oversampling the training set requires two libraries. The first is smote_variants, which we installed earlier and import as sv. The second is numpy, because the training set must be passed to smote_variants as a NumPy array.

In [18]:
# instantiate SMOTE 
oversampler = sv.SMOTE()

In [20]:
# creating new training set
X_train_os, Y_train_os = oversampler.sample(np.array(X_train), np.array(Y_train))

2021-10-21 08:35:56,207:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
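
The log line reports `'proportion': 1.0`, which in smote_variants means the minority class is sampled up until it matches the majority class. As a rough check, the arithmetic below uses the class counts printed before oversampling to work out how many synthetic rows that implies. The variable names are illustrative only.

```python
# Class counts printed before oversampling
n_yes = 3723    # minority class
n_no = 27924    # majority class

# Share of the minority class in the original training set
minority_share = n_yes / (n_yes + n_no)

# With proportion = 1.0 the minority class is raised to the majority count
proportion = 1.0
n_synthetic = int(proportion * (n_no - n_yes))
n_total_after = n_yes + n_no + n_synthetic

print(round(minority_share, 3))  # 0.118
print(n_synthetic)               # 24201
print(n_total_after)             # 55848
```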


Now, print the shapes of the new X and Y variables and the counts of the classes. You will note that the size of the training set has increased from 31,647 (3,723 + 27,924) to 55,848. The increase can be attributed to the fact that the minority class has been oversampled from 3,723 to 27,924:

In [21]:
# shape after oversampling
print('After OverSampling, the shape of X_train: {}'.format(
X_train_os.shape))
print('After OverSampling, the shape of Y_train: {}'.format(
Y_train_os.shape))
print("After OverSampling, counts of labels 'Yes': {}".format(
sum(Y_train_os=='yes')))
print("After OverSampling, counts of labels 'no': {}".format(
sum(Y_train_os=='no')))

After OverSampling, the shape of X_train: (55848, 51)
After OverSampling, the shape of Y_train: (55848,)
After OverSampling, counts of labels 'Yes': 27924
After OverSampling, counts of labels 'no': 27924
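
To illustrate where the extra rows come from, the following is a simplified sketch of the core SMOTE idea: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the line segment between them. This is not the smote_variants implementation; the function name `smote_like_samples` and the toy data are assumptions made for illustration.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # random minority sample
    pick = rng.integers(0, k, size=n_new)            # one of its k neighbours
    nbr = neighbours[base, pick]
    gap = rng.random((n_new, 1))                     # interpolation weight in [0, 1)

    # New point lies on the segment between the sample and its neighbour
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Hypothetical minority class: the four corners of the unit square
X_min = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Since every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already occupied by the minority class.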


Now that we have generated synthetic points using SMOTE and balanced the dataset, let's fit a logistic regression model on the new sample and analyze the results using a confusion matrix and a classification report.

In [22]:
# Training model with Logistic regression
bankModel2 = LogisticRegression()
bankModel2.fit(X_train_os, Y_train_os)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
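
The warning above indicates that the solver stopped at its iteration limit before converging. One option the message itself suggests is raising `max_iter` (the scikit-learn default is 100). The sketch below shows the parameter on a small synthetic dataset; the value 1000 and the toy data are assumptions, not settings from this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset used only to demonstrate the parameter
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Allow more solver iterations than the default of 100
model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# Number of iterations the solver actually used
print(model.n_iter_)
```

The other option named in the warning is scaling the data; in this exercise the `day`, `campaign`, `pdays`, and `previous` columns were left unscaled, which may contribute to slow convergence.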

In [23]:
# predictions on the test set
pred = bankModel2.predict(X_test)

In [25]:
# accuracy values
print('Accuracy of Logistic regression model prediction on test set for SMOTE balanced data set:'
       '{:.2f}'.format(bankModel2.score(X_test, Y_test)))

Accuracy of Logistic regression model prediction on test set for SMOTE balanced data set:0.83


In [26]:
# confusion matrix
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(Y_test, pred)
print(confusionMatrix)

[[10030  1968]
 [  302  1264]]


In [27]:
# classification report
from sklearn.metrics import classification_report
print(classification_report(Y_test, pred))

              precision    recall  f1-score   support

          no       0.97      0.84      0.90     11998
         yes       0.39      0.81      0.53      1566

    accuracy                           0.83     13564
   macro avg       0.68      0.82      0.71     13564
weighted avg       0.90      0.83      0.86     13564
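
The 'yes' row of the report can be derived directly from the confusion matrix printed earlier, where rows are the true labels (no, yes) and columns are the predicted labels. The sketch below redoes that arithmetic by hand.

```python
# Confusion matrix from the output above: rows = true (no, yes), columns = predicted (no, yes)
tn, fp = 10030, 1968
fn, tp = 302, 1264

precision_yes = tp / (tp + fp)   # of all predicted 'yes', how many were correct
recall_yes = tp / (tp + fn)      # of all true 'yes', how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_yes, 2), round(recall_yes, 2),
      round(f1_yes, 2), round(accuracy, 2))
# 0.39 0.81 0.53 0.83
```

The low precision reflects the 1,968 'no' customers predicted as 'yes', which is the price paid for recovering most of the true 'yes' cases.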



## Analysis
From the generated metrics, we can see that the results are very similar to the undersampling results, with the exception that the recall value of the 'yes' cases has dropped slightly, from 0.82 to around 0.81. The results that are generated vary from one use case to the next. SMOTE and its variants are among the most widely used methods for use cases with highly imbalanced data.