## Benchmarking the Logistic Regression Model on the Dataset
In this exercise, we will analyze the problem of predicting whether a customer will buy a term deposit. To do this, you will fit a logistic regression model, as you did in Chapter 3, Binary Classification, and look closely at the resulting metrics:

In [1]:
import pandas as pd

In [2]:
bankData = pd.read_csv('bank-data-set.csv', sep=';')

In [3]:
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [4]:
# normalize age, balance and duration
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler() # centers on the median and scales by the IQR, so it is robust to outliers

In [5]:
# convert columns to scaled versions
bankData['ageScaled'] = rob_scaler.fit_transform(bankData['age'].values.reshape(-1,1))
bankData['balScaled'] = rob_scaler.fit_transform(bankData['balance'].values.reshape(-1,1))
bankData['durScaled'] = rob_scaler.fit_transform(bankData['duration'].values.reshape(-1,1))
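
As a rough illustration of what `RobustScaler` does with its default settings, the sketch below scales a small made-up array (the values are hypothetical and not taken from the bank dataset) and compares the result with a manual computation that subtracts the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# hypothetical toy values with one large outlier
ages = np.array([25, 33, 39, 48, 95], dtype=float).reshape(-1, 1)

# default RobustScaler: center on the median, scale by the 25th-75th percentile range
scaled = RobustScaler().fit_transform(ages)

# manual equivalent: (x - median) / (75th percentile - 25th percentile)
median = np.median(ages)
q25, q75 = np.percentile(ages, [25, 75])
manual = (ages - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

Because the median and the interquartile range barely move when a single extreme value is added, the outlier has little influence on how the remaining values are scaled.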

In [6]:
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,1.266667,1.25,0.375
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,-0.4,-0.329646,0.083333


In [7]:
# drop original features
bankData.drop(['age', 'balance', 'duration'], axis=1, inplace=True)
bankData.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y,ageScaled,balScaled,durScaled
0,management,married,tertiary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,1.266667,1.25,0.375
1,technician,single,secondary,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.333333,-0.308997,-0.134259
2,entrepreneur,married,secondary,no,yes,yes,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.328909,-0.481481
3,blue-collar,married,unknown,no,yes,no,unknown,5,may,1,-1,0,unknown,no,0.533333,0.780236,-0.407407
4,unknown,single,unknown,no,no,no,unknown,5,may,1,-1,0,unknown,no,-0.4,-0.329646,0.083333


In [8]:
bankData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   job        45211 non-null  object 
 1   marital    45211 non-null  object 
 2   education  45211 non-null  object 
 3   default    45211 non-null  object 
 4   housing    45211 non-null  object 
 5   loan       45211 non-null  object 
 6   contact    45211 non-null  object 
 7   day        45211 non-null  int64  
 8   month      45211 non-null  object 
 9   campaign   45211 non-null  int64  
 10  pdays      45211 non-null  int64  
 11  previous   45211 non-null  int64  
 12  poutcome   45211 non-null  object 
 13  y          45211 non-null  object 
 14  ageScaled  45211 non-null  float64
 15  balScaled  45211 non-null  float64
 16  durScaled  45211 non-null  float64
dtypes: float64(3), int64(4), object(10)
memory usage: 5.9+ MB


In [9]:
# convert categorical features into numerical values using dummy values
bankCat = pd.get_dummies(bankData[['job', 'marital', 'education', 'default',
                                  'housing', 'loan', 'contact', 'month', 'poutcome']])

In [10]:
# separate numerical data
bankNum = bankData[['day', 'campaign', 'pdays', 'previous', 'ageScaled', 'balScaled', 'durScaled']]
bankNum.shape

(45211, 7)

After the categorical values are transformed, they must be combined with the scaled numerical values of the data frame to get the feature-engineered dataset.

In [11]:
# combine the dummy-encoded categorical features with the numerical features
# preparing X variables
X = pd.concat([bankCat, bankNum], axis=1)
print(X.shape)

# preparing the Y variable
Y = bankData['y']
print(Y.shape)
X.head()

(45211, 51)
(45211,)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_other,poutcome_success,poutcome_unknown,day,campaign,pdays,previous,ageScaled,balScaled,durScaled
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,5,1,-1,0,1.266667,1.25,0.375
1,0,0,0,0,0,0,0,0,0,1,...,0,0,1,5,1,-1,0,0.333333,-0.308997,-0.134259
2,0,0,1,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,-0.4,-0.328909,-0.481481
3,0,1,0,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,0.533333,0.780236,-0.407407
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,5,1,-1,0,-0.4,-0.329646,0.083333


In [12]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# split data into train and test sets with test_size = 0.3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

# fit model
LogRegModel = LogisticRegression()
LogRegModel.fit(X_train, Y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
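
The message above is a convergence warning: the solver reached its iteration limit before it converged. One common response, suggested by the message itself, is to raise `max_iter`. The following is a minimal sketch on synthetic data from `make_classification` (an assumption made so that the example is self-contained; it is not the bank dataset), and the value 1000 is an arbitrary illustrative choice rather than a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data (hypothetical, not the bank dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=123)

# allow the solver more iterations than the default limit
model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(model.n_iter_)
```

The metrics reported in this exercise come from the model fitted with default settings, so they may shift slightly if the model is refit this way.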

In [13]:
# predictions
train_preds = LogRegModel.predict(X_train)
test_preds = LogRegModel.predict(X_test)

# accuracy scores
print('Accuracy of Logistic regression model prediction on test set: {:.2f}'.format(LogRegModel.score(X_test, Y_test)))

Accuracy of Logistic regression model prediction on test set: 0.90


In [14]:
# confusion matrix for the model
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(Y_test, test_preds)
print(confusionMatrix)

from sklearn.metrics import classification_report
print(classification_report(Y_test, test_preds))

[[11705   293]
 [ 1081   485]]
              precision    recall  f1-score   support

          no       0.92      0.98      0.94     11998
         yes       0.62      0.31      0.41      1566

    accuracy                           0.90     13564
   macro avg       0.77      0.64      0.68     13564
weighted avg       0.88      0.90      0.88     13564



In this exercise, we generated a classification report that points to a likely cause of the problem with predicting which customers will purchase the term deposit plan. From the support column, we can see that the number of **No** examples (11,998) is much higher than the number of **Yes** examples (1,566).

### Analysis of the Result

From the accuracy perspective, the model seems to be doing a reasonable job. However, the reality might be quite different. To find out what's really the case, let's look at the precision and recall values, which are available from the classification report we obtained.

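For any class, the precision and recall values can be represented as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Here, $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives for that class.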

Recall indicates the ability of the classifier to correctly identify the examples of each class. From the metrics, we see that the model does a good job of identifying the `no` class (recall of 0.98) but a very poor job of identifying the `yes` class (recall of 0.31), which is the class we actually care about: customers who buy the term deposit.
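
The per-class values in the classification report can be reproduced by hand from the confusion matrix printed earlier. The sketch below hard-codes that matrix and assumes scikit-learn's layout, in which rows are actual classes and columns are predicted classes, both in sorted label order (`no`, then `yes`). The `metrics` dictionary is a helper introduced only for this illustration.

```python
import numpy as np

# confusion matrix from the exercise: rows = actual (no, yes), columns = predicted (no, yes)
cm = np.array([[11705, 293],
               [1081, 485]])

labels = ['no', 'yes']
metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i, i]
    precision = true_pos / cm[:, i].sum()  # correct predictions / all predictions of this class
    recall = true_pos / cm[i, :].sum()     # correct predictions / all actual members of this class
    metrics[label] = (precision, recall)
    print('{}: precision={:.2f}, recall={:.2f}'.format(label, precision, recall))
```

Only 485 of the 1,566 customers who actually bought the term deposit are found by the model, which is where the low recall for `yes` comes from.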

Why do you think that the classifier is biased toward one class? The answer to this can be unearthed by looking at the class balance in the training set. 

In [21]:
# The following code will generate the percentages of the classes in the training data
print("Percentage of 'yes' class:",(Y_train[Y_train=='yes'].value_counts()/len(Y_train))*100)
print("Percentage of 'no' class:",(Y_train[Y_train=='no'].value_counts()/len(Y_train))*100)

Percentage of 'yes' class: yes    11.764148
Name: y, dtype: float64
Percentage of 'no' class: no    88.235852
Name: y, dtype: float64


From this, we can see that the majority of the training set (88%) is made up of the `no` class. This imbalance is one of the major reasons behind the poor recall for the `yes` class that we obtained with the logistic regression classifier.
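
To see why an accuracy of 0.90 is less impressive than it looks under this imbalance, one can compare it with a trivial baseline that always predicts the majority class. The sketch below uses the test-set class counts from the support column of the classification report. The variable names are introduced only for this illustration.

```python
# test-set class counts, taken from the support column of the classification report
support = {'no': 11998, 'yes': 1566}

total = sum(support.values())
majority_label = max(support, key=support.get)

# accuracy of a classifier that ignores the features and always predicts the majority class
baseline_accuracy = support[majority_label] / total
print('Always predicting "{}" gives accuracy {:.2f}'.format(majority_label, baseline_accuracy))
```

The fitted model improves on this baseline by only about two percentage points, which is why accuracy alone says little about how well the `yes` class is being identified.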