## Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result
In this exercise, you will undersample the majority class (propensity 'No') to balance the dataset. You will then fit a logistic regression model on the balanced data and analyze the results:

In [1]:
import pandas as pd

In [2]:
bankData = pd.read_csv('bank-data-set.csv', sep=';')

In [3]:
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [4]:
# normalize age, balance and duration
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler() # centers on the median and scales by the IQR, so outliers have less influence

In [5]:
# convert columns to scaled versions
bankData['ageScaled'] = rob_scaler.fit_transform(bankData['age'].values.reshape(-1,1))
bankData['balScaled'] = rob_scaler.fit_transform(bankData['balance'].values.reshape(-1,1))
bankData['durScaled'] = rob_scaler.fit_transform(bankData['duration'].values.reshape(-1,1))
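
As a rough check on what `RobustScaler` does with its default settings, the sketch below recomputes the scaling by hand: the median is subtracted from each value and the result is divided by the interquartile range (75th minus 25th percentile). The small `ages` array is made up for illustration and is not taken from the banking dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical sample of ages, used only to illustrate the arithmetic
ages = np.array([18, 33, 39, 48, 58, 95], dtype=float).reshape(-1, 1)

# RobustScaler defaults: center on the median, scale by the IQR (25th to 75th percentile)
median = np.median(ages)
q25, q75 = np.percentile(ages, [25, 75])
manual = (ages - median) / (q75 - q25)

# Library result for comparison
scaled = RobustScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```

Because the median and IQR are used instead of the minimum and maximum, the scaled values are not confined to a fixed range such as 0 to 1, which is why values above 1 appear in the scaled columns.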

In [6]:
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,1.266667,1.25,0.375
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,-0.4,-0.329646,0.083333


In [7]:
# drop original features
bankData.drop(['age', 'balance', 'duration'], axis=1, inplace=True)
bankData.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,management,married,tertiary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,1.266667,1.25,0.375
1,technician,single,secondary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,entrepreneur,married,secondary,no,yes,yes,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,blue-collar,married,unknown,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,unknown,single,unknown,no,no,no,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.329646,0.083333


In [8]:
bankData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   job        45211 non-null  object 
 1   marital    45211 non-null  object 
 2   education  45211 non-null  object 
 3   default    45211 non-null  object 
 4   housing    45211 non-null  object 
 5   loan       45211 non-null  object 
 6   contact    45211 non-null  object 
 7   day        45211 non-null  int64  
 8   month      45211 non-null  object 
 9   campaign   45211 non-null  int64  
 10  pdays      45211 non-null  int64  
 11  previous   45211 non-null  int64  
 12  poutcome   45211 non-null  object 
 13  y          45211 non-null  object 
 14  ageScaled  45211 non-null  float64
 15  balScaled  45211 non-null  float64
 16  durScaled  45211 non-null  float64
dtypes: float64(3), int64(4), object(10)
memory usage: 5.9+ MB


In [9]:
# convert categorical features into numerical values using dummy values
bankCat = pd.get_dummies(bankData[['job', 'marital', 'education', 'default',
                                  'housing', 'loan', 'contact', 'month', 'poutcome']])
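
To make the effect of `pd.get_dummies()` concrete, here is a minimal sketch on a made-up three-row frame (`toy` and `toyCat` are illustrative names, not part of the exercise). Each category becomes its own indicator column named `<column>_<value>`. Passing `dtype=int` is an assumption added here to get 0/1 values like those shown in this exercise, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the categorical columns
toy = pd.DataFrame({
    'marital': ['married', 'single', 'married'],
    'loan': ['no', 'yes', 'no'],
})

# dtype=int forces 0/1 output; recent pandas versions return booleans by default
toyCat = pd.get_dummies(toy[['marital', 'loan']], dtype=int)

print(toyCat.columns.tolist())
print(toyCat)
```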

In [10]:
# separate numerical data
bankNum = bankData[['day', 'campaign', 'pdays', 'previous', 'ageScaled', 'balScaled', 'durScaled']]
bankNum.shape

(45211, 7)

After the categorical values are transformed, they must be combined with the scaled numerical values of the data frame to get the feature-engineered dataset.

In [11]:
# merging with original dataframe
# preparing X variables
X = pd.concat([bankCat, bankNum], axis=1)
print(X.shape)

# preparing the Y variable
Y = bankData['y']
print(Y.shape)
X.head()

(45211, 51)
(45211,)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_other,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,5,1,-1,0,1.266667,1.25,0.375
1,0,0,0,0,0,0,0,0,0,1,...,0,0,1,5,1,-1,0,0.333333,-0.308997,-0.134259
2,0,0,1,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,-0.4,-0.328909,-0.481481
3,0,1,0,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,0.533333,0.780236,-0.407407
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,-0.4,-0.329646,0.083333


In [12]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# split data into train and test sets with test_size = 0.3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

In [13]:
# join X_train and Y_train for ease of operation
trainData = pd.concat([X_train, Y_train], axis=1)

In this step, we concatenated the X_train and Y_train datasets into a single dataset. This makes the resampling process in the subsequent steps easier, because each row keeps its features and its label together. To concatenate the two datasets, we use the pd.concat() function from pandas. In the code, we use axis = 1 to indicate that the concatenation is done horizontally, that is, along the columns.
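
A small sketch of horizontal concatenation, using made-up data (`features`, `labels`, and `joined` are illustrative names). It shows that `pd.concat()` with `axis=1` lines rows up by index label, which is why the shuffled index produced by `train_test_split` is not a problem here.

```python
import pandas as pd

# Hypothetical features and labels that share the same (shuffled) index
features = pd.DataFrame({'campaign': [1, 2, 3]}, index=[19100, 37958, 12451])
labels = pd.Series(['no', 'no', 'yes'], index=[19100, 37958, 12451], name='y')

# axis=1 places the label next to the features, matching rows by index label
joined = pd.concat([features, labels], axis=1)

print(joined.shape)
print(joined.columns.tolist())
```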

In [14]:
trainData.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled,y
19100,1,0,0,0,0,0,0,0,0,0,...,0,1,5,1,-1,0,0.8,-0.162979,0.236111,no
37958,1,0,0,0,0,0,0,0,0,0,...,0,0,14,2,289,19,0.733333,-0.238938,0.865741,no
12451,0,1,0,0,0,0,0,0,0,0,...,0,1,1,3,-1,0,0.0,0.385693,1.347222,no
18263,0,0,0,0,1,0,0,0,0,0,...,0,1,31,8,-1,0,1.333333,-0.330383,-0.592593,no
5128,0,0,0,0,0,0,0,1,0,0,...,0,1,21,2,-1,0,-0.466667,-0.14233,-0.435185,no


What we will do next is separate the minority class and the majority class. This is required because we have to sample separately from the majority class to make a balanced dataset. To separate the minority class, we have to identify the indexes of the rows where 'y' is 'yes'. The indexes are obtained from the .index attribute of the filtered data frame.

Once those indexes are identified, the corresponding rows are selected from the main dataset using the .loc indexer and stored in a new variable for the minority class. The shape of the minority dataset is also printed. A similar process is followed for the majority class and, after these two steps, we have two datasets: one for the minority class and one for the majority class.

In [15]:
# indexes of the sample dataset where propensity is yes
ind = trainData[trainData['y']=='yes'].index
print(len(ind))

3723


In [16]:
minData = trainData.loc[ind]
print(minData.shape)

(3723, 52)


In [17]:
# indexes of the sample dataset where propensity is no
ind1 = trainData[trainData['y']=='no'].index
print(len(ind1))

27924


In [18]:
majData = trainData.loc[ind1]
print(majData.shape)
majData.head()

(27924, 52)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled,y
19100,1,0,0,0,0,0,0,0,0,0,...,0,1,5,1,-1,0,0.8,-0.162979,0.236111,no
37958,1,0,0,0,0,0,0,0,0,0,...,0,0,14,2,289,19,0.733333,-0.238938,0.865741,no
12451,0,1,0,0,0,0,0,0,0,0,...,0,1,1,3,-1,0,0.0,0.385693,1.347222,no
18263,0,0,0,0,1,0,0,0,0,0,...,0,1,31,8,-1,0,1.333333,-0.330383,-0.592593,no
5128,0,0,0,0,0,0,0,1,0,0,...,0,1,21,2,-1,0,-0.466667,-0.14233,-0.435185,no
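
The same split can also be written with a boolean mask directly, without first collecting the index labels. The sketch below compares both routes on a made-up frame (`toyTrain` and the other names are illustrative); it is an alternative formulation, not the approach this exercise prescribes.

```python
import pandas as pd

# Hypothetical training frame with a non-sequential index
toyTrain = pd.DataFrame({
    'durScaled': [0.2, 1.7, -0.4, 3.8, 0.1],
    'y': ['no', 'yes', 'no', 'yes', 'no'],
}, index=[19100, 39407, 12451, 40218, 5128])

# Route 1: the approach used above (collect index labels, then select with .loc)
ind = toyTrain[toyTrain['y'] == 'yes'].index
minToy = toyTrain.loc[ind]

# Route 2: filter with the boolean mask directly
isYes = toyTrain['y'] == 'yes'
minToyMask = toyTrain[isYes]
majToy = toyTrain[~isYes]

print(minToy.equals(minToyMask))
print(minToy.shape, majToy.shape)
```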


Once the majority class is separated, we can sample from it.

Take a random sample equal in size to the minority class to make the dataset balanced, and then print the shape and head of the sample.

In [19]:
majSample=majData.sample(n=len(ind), random_state=123)

In [20]:
print(majSample.shape)
majSample.head()

(3723, 52)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled,y
17387,0,0,0,0,1,0,0,0,0,0,...,0,1,28,3,-1,0,0.666667,0.752212,-0.425926,no
34679,0,1,0,0,0,0,0,0,0,0,...,0,0,5,7,250,3,0.8,0.086283,-0.106481,no
26572,1,0,0,0,0,0,0,0,0,0,...,0,1,20,2,-1,0,0.466667,1.785398,-0.134259,no
3280,0,0,0,0,0,1,0,0,0,0,...,0,1,15,1,-1,0,1.2,1.972714,-0.009259,no
4434,0,0,0,0,1,0,0,0,0,0,...,0,1,20,1,-1,0,-0.133333,2.011062,-0.055556,no


After preparing the individual datasets, we can now concatenate them using the pd.concat() function:

In [21]:
balData = pd.concat([minData, majSample], axis=0)
# Note: we are concatenating in the vertical direction (stacking rows), so axis=0 is used.

Now, shuffle the dataset using the shuffle() function so that the minority and majority rows are mixed together rather than grouped by class:

In [22]:
from sklearn.utils import shuffle
balData = shuffle(balData)
balData.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled,y
39407,0,0,0,0,0,0,0,0,0,1,...,0,1,22,2,-1,0,0.0,-0.244838,1.731481,yes
40218,0,0,1,0,0,0,0,0,0,0,...,1,0,9,4,85,4,-0.2,2.182153,3.763889,yes
28481,0,0,0,0,0,0,0,0,1,0,...,0,1,29,2,-1,0,-0.866667,0.662242,0.939815,no
23127,0,0,0,0,1,0,0,0,0,0,...,0,1,26,7,-1,0,0.466667,-0.089971,0.009259,no
1554,0,0,0,0,0,0,1,0,0,0,...,0,1,8,3,-1,0,0.2,-0.252212,1.550926,no
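
The separate, sample, concatenate, and shuffle steps can be condensed into one helper. The following is a minimal sketch under stated assumptions: `undersample` is a hypothetical function name, the toy frame is made up, and the shuffle is done with `DataFrame.sample(frac=1)` instead of `sklearn.utils.shuffle`. A fixed `random_state` is also passed to the shuffle here so that the row order is reproducible.

```python
import pandas as pd

def undersample(df, target='y', random_state=123):
    """Hypothetical helper: downsample every class to the size of the smallest one."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Stack the per-class samples, then shuffle the rows (frac=1 keeps every row)
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)

# Made-up imbalanced data: 8 'no' rows and 2 'yes' rows
toy = pd.DataFrame({
    'feature': range(10),
    'y': ['no'] * 8 + ['yes'] * 2,
})

balanced = undersample(toy)
print(balanced['y'].value_counts().to_dict())
```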


Now, separate the shuffled dataset into the independent variables, X_trainNew, and the dependent variable, Y_trainNew. The independent variables are selected by position using the .iloc indexer: the slice 0:51 picks the 51 feature columns at positions 0 to 50 and leaves out the last column. The dependent variable is separated by subsetting with the column name 'y':

In [24]:
# Making the new X_train and y_train
X_trainNew = balData.iloc[:,0:51]
print(X_trainNew.head())
Y_trainNew = balData['y']
print(Y_trainNew.head())

       job_admin.  job_blue-collar  job_entrepreneur  job_housemaid  \
39407           0                0                 0              0   
40218           0                0                 1              0   
28481           0                0                 0              0   
23127           0                0                 0              0   
1554            0                0                 0              0   

       job_management  job_retired  job_self-employed  job_services  \
39407               0            0                  0             0   
40218               0            0                  0             0   
28481               0            0                  0             0   
23127               1            0                  0             0   
1554                0            0                  1             0   

       job_student  job_technician  ...  poutcome_other  poutcome_success  \
39407            0               1  ...               0              
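
Slicing by position relies on the target being the last column and on the feature count staying at 51. As a hedged alternative, the sketch below selects the features by dropping the target column by name. The frame `toyBal` and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical balanced frame with two features and the target as last column
toyBal = pd.DataFrame({
    'day': [22, 9],
    'durScaled': [1.73, 3.76],
    'y': ['yes', 'no'],
})

# Positional slice: every column except the last one (the target)
X_byPosition = toyBal.iloc[:, 0:2]

# Name-based alternative: drop the target column, keep everything else
X_byName = toyBal.drop(columns='y')
y_toy = toyBal['y']

print(X_byPosition.equals(X_byName))
```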

In [25]:
# fit LogisticRegression Model
bankModel1 = LogisticRegression()
bankModel1.fit(X_trainNew, Y_trainNew)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
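
The warning above means the solver reached its iteration limit before converging. One remedy the message itself suggests is raising `max_iter`. The sketch below shows this on synthetic data from `make_classification`, not on the banking dataset, and the value 1000 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the banking dataset
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=123)

# max_iter defaults to 100; 1000 is an arbitrary, larger budget chosen for illustration
model = LogisticRegression(max_iter=1000)
model.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(model.n_iter_)
```

In this exercise, the unscaled `day`, `campaign`, `pdays`, and `previous` columns may also contribute to slow convergence, which is consistent with the warning's suggestion to scale the data.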

In [26]:
# prediction for test set
pred = bankModel1.predict(X_test)
print('Accuracy of Logistic Regression Model prediction on test set for balanced data set: {:.2f}'
     .format(bankModel1.score(X_test, Y_test)))

Accuracy of Logistic Regression Model prediction on test set for balanced data set: 0.83


Now, generate the confusion matrix for the model and print the results:

In [27]:
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(Y_test, pred)
print(confusionMatrix)
from sklearn.metrics import classification_report
print(classification_report(Y_test, pred))

[[9969 2029]
 [ 278 1288]]
              precision    recall  f1-score   support

          no       0.97      0.83      0.90     11998
         yes       0.39      0.82      0.53      1566

    accuracy                           0.83     13564
   macro avg       0.68      0.83      0.71     13564
weighted avg       0.91      0.83      0.85     13564
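
The figures in the classification report can be recomputed from the confusion matrix by hand. In the matrix printed by `confusion_matrix`, rows are the actual classes and columns are the predicted classes, in the order 'no', 'yes'. The sketch below uses the four counts from the output above; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted
tn, fp = 9969, 2029   # actual 'no':  predicted 'no', predicted 'yes'
fn, tp = 278, 1288    # actual 'yes': predicted 'no', predicted 'yes'

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_yes = tp / (tp + fp)   # of all predicted 'yes', how many were right
recall_yes = tp / (tp + fn)      # of all actual 'yes', how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(f'accuracy={accuracy:.2f} precision={precision_yes:.2f} '
      f'recall={recall_yes:.2f} f1={f1_yes:.2f}')
# prints: accuracy=0.83 precision=0.39 recall=0.82 f1=0.53
```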



## Analysis

Let's analyze the results and compare them with those of the benchmark logistic regression model that we built at the beginning of this chapter. In the benchmark model, we had the problem of the model being biased toward the majority class with a very low recall value for the yes cases.

Now, by balancing the dataset, we have seen that the recall for the minority class has improved tremendously, from a low of 0.32 to around 0.82. This means that by balancing the dataset, the classifier has improved its ability to identify positive ('yes') cases.

However, we can see that our overall accuracy has taken a hit. From a high of around 90%, it has come down to around 83%. One major area where accuracy has taken a hit is the number of false positives, which are those No cases that were wrongly predicted as Yes.

Analyzing the result from a business perspective, this is a much better scenario than the one we got in the benchmark model. In the benchmark model, out of the total 1,566 Yes cases, only 506 were correctly identified. After balancing, we were able to identify 1,288 out of 1,566 customers who were likely to buy term deposits, which can potentially result in a better conversion rate. The flip side is that the sales team will also have to spend a lot of time on customers who are unlikely to buy term deposits. From the confusion matrix, we can see that false positives have gone up to 2,029 from the 291 we got in the benchmark model. Ideally, we would want the two off-diagonal quadrants (false positives and false negatives) to come down in favor of the other two quadrants.