## Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model
You work as a data scientist at a bank. The bank would like to implement a model that predicts the likelihood of a customer purchasing a term deposit. The bank provides you with a dataset, which is the same as the one in Chapter 3, Binary Classification. You have previously learned how to train a logistic regression model for binary classification. You have also heard about non-parametric modeling techniques and would like to try out a decision tree and a random forest to see how well they perform against the logistic regression models you have been training.

In this activity, you will train a logistic regression model and compute a classification report. You will then proceed to train a decision tree classifier and compute a classification report. You will compare the models using the classification reports. Finally, you will train a random forest classifier and generate the classification report. You will then compare the logistic regression model with the random forest using the classification reports to determine which model you should put into production.

### Logistic Regression Model

In [30]:
# import required packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
#3 read the data
bankData = pd.read_csv('bank-data-set.csv', sep=";")
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [3]:
#4 Explore the data
# Normalizing data
x = bankData[['balance']].values.astype(float)

In [4]:
# Convert all numerical data to a scaled range of 0 to 1
# (MinMaxScaler is imported directly above, so no preprocessing prefix is needed)
minmaxScaler = MinMaxScaler()

# Transform the balance data by normalizing it with minmaxScaler:
bankData['balanceTran'] = minmaxScaler.fit_transform(x)
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,balanceTran
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,0.092259
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0.073067
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,0.072822
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0.086476
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,0.072812
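
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The following is a minimal sketch of that formula on made-up balances. The helper name `min_max_scale` and the sample values are illustrative assumptions, not part of the bank dataset.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; this sketch maps them to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical balances, including a negative one
sample_balances = [-100, 0, 150, 400]
scaled = min_max_scale(sample_balances)
print(scaled)
```

Because the result depends on the minimum and maximum of the column, a single extreme balance compresses every other value into a narrow band, which is consistent with the small balanceTran values shown above.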


In [5]:
# Adding a small numerical constant to eliminate 0 values
bankData['balanceTran'] = bankData['balanceTran'] + 0.00001

In [6]:
# Transform values for loan data
bankData['loanTran'] = 1

# Giving a weight of 5 if there is no loan
bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,balanceTran,loanTran
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,0.092269,5
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0.073077,5
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,0.072832,1
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0.086486,5
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,0.072822,5


In [7]:
# Transform values for housing data
bankData['houseTran'] = 5

# Give a weight of 1 if the customer does not have a house (housing == 'no')
bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,balanceTran,loanTran,houseTran
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,0.092269,5,5
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,0.073077,5,5
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,0.072832,1,5
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,0.086486,5,5
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,0.072822,5,1


In [8]:
# Create a new variable which is the product of all three transformed variables
bankData['assetIndex'] = bankData['balanceTran'] * bankData['houseTran'] * bankData['loanTran']
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,duration,campaign,pdays,previous,poutcome,y,balanceTran,loanTran,houseTran,assetIndex
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,261,1,-1,0,unknown,no,0.092269,5,5,2.306734
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,151,1,-1,0,unknown,no,0.073077,5,5,1.826916
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,76,1,-1,0,unknown,no,0.072832,1,5,0.364158
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,92,1,-1,0,unknown,no,0.086486,5,5,2.162153
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,198,1,-1,0,unknown,no,0.072822,5,1,0.364112


In [10]:
# Finding the quartiles (25th, 50th, and 75th percentiles)
np.quantile(bankData['assetIndex'],[0.25,0.5,0.75])

array([0.37668646, 0.56920367, 1.9027249 ])

In [11]:
# Create quantiles from the assetIndex data
bankData['assetClass'] = 'Quant1'
bankData.loc[(bankData['assetIndex'] > 0.38) & (bankData['assetIndex'] < 0.57), 'assetClass'] = 'Quant2'
bankData.loc[(bankData['assetIndex'] > 0.57) & (bankData['assetIndex'] < 1.9), 'assetClass'] = 'Quant3'
bankData.loc[(bankData['assetIndex'] > 1.9), 'assetClass'] = 'Quant4'
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,campaign,pdays,previous,poutcome,y,balanceTran,loanTran,houseTran,assetIndex,assetClass
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,1,-1,0,unknown,no,0.092269,5,5,2.306734,Quant4
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,1,-1,0,unknown,no,0.073077,5,5,1.826916,Quant3
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,1,-1,0,unknown,no,0.072832,1,5,0.364158,Quant1
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,1,-1,0,unknown,no,0.086486,5,5,2.162153,Quant4
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,1,-1,0,unknown,no,0.072822,5,1,0.364112,Quant1
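
The thresholds in the cell above are rounded and use strict inequalities, so a value exactly equal to a threshold, or one lying between a quartile and its rounded threshold (for example between 0.3767 and 0.38), keeps the default class `Quant1`. One alternative is to let pandas compute the bin edges with `pd.qcut`. This is only a sketch on made-up values: the `toy` frame is an assumption, and applying the same call to the real `assetIndex` column would give class counts that differ from the ones reported in this notebook.

```python
import pandas as pd

# Hypothetical asset index values standing in for bankData['assetIndex']
toy = pd.DataFrame({'assetIndex': [0.10, 0.25, 0.38, 0.45, 0.57, 0.90, 1.90, 2.30]})

labels = ['Quant1', 'Quant2', 'Quant3', 'Quant4']
# q=4 splits the values into four bins with (roughly) equal counts
toy['assetClass'] = pd.qcut(toy['assetIndex'], q=4, labels=labels)

class_counts = toy['assetClass'].value_counts().sort_index()
print(class_counts)
```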


In [12]:
# Calculate total of each asset class
assetTot = bankData.groupby('assetClass')['y'].agg(assetTot='count').reset_index()

# Calculating the category wise counts
assetProp = bankData.groupby(['assetClass', 'y'])['y'].agg(assetCat='count').reset_index()

In [13]:
# Merge both data frames (assetTot & assetProp)
assetComb = pd.merge(assetProp, assetTot, on=['assetClass'])
assetComb['catProp'] = (assetComb.assetCat / assetComb.assetTot) * 100
assetComb

Unnamed: 0,assetClass,y,assetCat,assetTot,catProp
0,Quant1,no,10921,12212,89.428431
1,Quant1,yes,1291,12212,10.571569
2,Quant2,no,8436,10400,81.115385
3,Quant2,yes,1964,10400,18.884615
4,Quant3,no,10144,11121,91.214819
5,Quant3,yes,977,11121,8.785181
6,Quant4,no,10421,11478,90.791079
7,Quant4,yes,1057,11478,9.208921
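
The two group-bys and the merge above compute, for each asset class, the percentage of customers in each outcome. As a hedged alternative, `pd.crosstab` with `normalize='index'` produces the same kind of row-wise percentages in one call. The `toy` frame below is made up for illustration and is not the bank data.

```python
import pandas as pd

# Hypothetical rows standing in for bankData[['assetClass', 'y']]
toy = pd.DataFrame({
    'assetClass': ['Quant1', 'Quant1', 'Quant1', 'Quant1', 'Quant2', 'Quant2'],
    'y': ['no', 'no', 'no', 'yes', 'no', 'yes'],
})

# normalize='index' divides each row by its row total, giving per-class proportions
prop_table = pd.crosstab(toy['assetClass'], toy['y'], normalize='index') * 100
print(prop_table)
```

The result is in wide form (one column per outcome) rather than the long form of `assetComb`, which can make the per-class 'yes' rates easier to compare at a glance.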


In [14]:
#5
# Create dummy variables for the categorical variables; Exclude original raw variables which were used to create new variable, assetIndex (loan, housing)
bankCat = pd.get_dummies(bankData[['job', 'marital', 'education', 'default', 'contact', 'month', 'poutcome']])
bankCat.shape

(45211, 40)

In [15]:
# Take an explicit copy so that adding columns below does not raise SettingWithCopyWarning
bankNum = bankData[['age', 'day', 'duration', 'campaign', 'pdays', 'previous', 'assetIndex']].copy()
bankNum.head()

Unnamed: 0,age,day,duration,campaign,pdays,previous,assetIndex
0,58,5,261,1,-1,0,2.306734
1,44,5,151,1,-1,0,1.826916
2,33,5,76,1,-1,0,0.364158
3,47,5,92,1,-1,0,2.162153
4,33,5,198,1,-1,0,0.364112


In [16]:
# Create the transformed variables
ageT1 = bankNum[['age']].values.astype(float)
dayT1 = bankNum[['day']].values.astype(float)
durT1 = bankNum[['duration']].values.astype(float)

In [17]:
# Transform the age, day, and duration data by normalizing them with minmaxScaler:
bankNum['ageTran'] = minmaxScaler.fit_transform(ageT1)
bankNum['dayTran'] = minmaxScaler.fit_transform(dayT1)
bankNum['durTran'] = minmaxScaler.fit_transform(durT1)


In [18]:
bankNum.head()

Unnamed: 0,age,day,duration,campaign,pdays,previous,assetIndex,ageTran,dayTran,durTran
0,58,5,261,1,-1,0,2.306734,0.519481,0.133333,0.05307
1,44,5,151,1,-1,0,1.826916,0.337662,0.133333,0.030704
2,33,5,76,1,-1,0,0.364158,0.194805,0.133333,0.015453
3,47,5,92,1,-1,0,2.162153,0.376623,0.133333,0.018707
4,33,5,198,1,-1,0,0.364112,0.194805,0.133333,0.04026


In [19]:
# Create a new numerical variable by selecting the transformed variables:
bankNum2 = bankNum[['ageTran', 'dayTran', 'durTran', 'campaign', 'pdays', 'previous', 'assetIndex']]
bankNum2.head()

Unnamed: 0,ageTran,dayTran,durTran,campaign,pdays,previous,assetIndex
0,0.519481,0.133333,0.05307,1,-1,0,2.306734
1,0.337662,0.133333,0.030704,1,-1,0,1.826916
2,0.194805,0.133333,0.015453,1,-1,0,0.364158
3,0.376623,0.133333,0.018707,1,-1,0,2.162153
4,0.194805,0.133333,0.04026,1,-1,0,0.364112


In [20]:
#6
# Preparing the X variables
X = pd.concat([bankCat,bankNum2],axis=1)
print(X.shape)
# Preparing Y variable
Y = bankData['y']
print(Y.shape)
X.head()

(45211, 47)
(45211,)


Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_other,poutcome_success,poutcome_unknown,ageTran,dayTran,durTran,campaign,pdays,previous,assetIndex
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0.519481,0.133333,0.05307,1,-1,0,2.306734
1,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0.337662,0.133333,0.030704,1,-1,0,1.826916
2,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0.194805,0.133333,0.015453,1,-1,0,0.364158
3,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0.376623,0.133333,0.018707,1,-1,0,2.162153
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0.194805,0.133333,0.04026,1,-1,0,0.364112


In [21]:
# Split the dataset into training (80%), validation (10%), and test (10%) sets, then fit a LogisticRegression() model on the training set
X_train, X_eval, y_train, y_eval = train_test_split(X, Y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval, test_size=0.5, random_state=0)

bankModel = LogisticRegression()
bankModel.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
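
The solver warning above says the iteration limit was reached, and it suggests raising `max_iter` or scaling the data. Here the `campaign`, `pdays`, `previous`, and `assetIndex` columns were left unscaled. The following is a minimal sketch of both remedies on synthetic data. `X_demo`, `y_demo`, and `scaled_lr` are hypothetical names, and applying either change to the bank data could alter the metrics reported in this section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X_train / y_train (assumption, not the bank data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
# Inflate one column to mimic an unscaled feature
X_demo[:, 0] *= 1000

# Scale every feature to [0, 1], then fit with a higher iteration limit
scaled_lr = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

n_iterations = scaled_lr.named_steps['logisticregression'].n_iter_[0]
print('Iterations used:', n_iterations)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting on validation or test data.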

In [23]:
LRE_preds = bankModel.predict(X_val)
print('Accuracy of Logistic regression model prediction on validation set: {:.2f}'.format(bankModel.score(X_val, y_val)))

Accuracy of Logistic regression model prediction on validation set: 0.89


In [24]:
# Confusion Matrix for the model (confusion_matrix was imported in the first cell)
LRE_confusion_matrix = confusion_matrix(y_val, LRE_preds)
print(LRE_confusion_matrix)

[[3930   72]
 [ 405  114]]


In [25]:
# Generate a classification_report (classification_report was imported in the first cell):
print(classification_report(y_val, LRE_preds))

              precision    recall  f1-score   support

          no       0.91      0.98      0.94      4002
         yes       0.61      0.22      0.32       519

    accuracy                           0.89      4521
   macro avg       0.76      0.60      0.63      4521
weighted avg       0.87      0.89      0.87      4521
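
The 'yes' row of the report can be reproduced by hand from the confusion matrix printed earlier, where rows are actual classes and columns are predicted classes, both in the order no, yes. The sketch below uses those four counts. The variable names are illustrative.

```python
# Counts copied from the logistic regression confusion matrix above
tn, fp = 3930, 72    # actual 'no':  predicted 'no', predicted 'yes'
fn, tp = 405, 114    # actual 'yes': predicted 'no', predicted 'yes'

precision_yes = tp / (tp + fp)          # share of predicted 'yes' that were correct
recall_yes = tp / (tp + fn)             # share of actual 'yes' that were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision_yes:.2f} recall={recall_yes:.2f} '
      f'f1={f1_yes:.2f} accuracy={accuracy:.2f}')
# prints: precision=0.61 recall=0.22 f1=0.32 accuracy=0.89
```

The accuracy of 0.89 is driven mostly by the majority 'no' class. The model finds only about one in five of the customers who actually purchased, which is why the per-class rows matter more than accuracy here.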



### Decision Tree Classifier Model

In [27]:
#12 Create an instance of DecisionTreeClassifier:
dt_model = DecisionTreeClassifier(max_depth=6)
dt_model.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=6)

In [28]:
#14 Using the DecisionTreeClassifier model, make a prediction on the validation dataset:
dt_preds = dt_model.predict(X_val)

In [29]:
#15 Use the prediction from the DecisionTreeClassifier model to compute the classification report:
dt_report = classification_report(y_val, dt_preds)
print(dt_report)

              precision    recall  f1-score   support

          no       0.93      0.96      0.94      4002
         yes       0.58      0.46      0.51       519

    accuracy                           0.90      4521
   macro avg       0.76      0.71      0.73      4521
weighted avg       0.89      0.90      0.89      4521



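Between the Decision Tree Classifier and the Logistic Regression model, the Decision Tree Classifier performs better on the validation set: its f1-score for the 'yes' class is 0.51 versus 0.32, and its macro average f1-score is 0.73 versus 0.63.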
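
When several models are compared on the f1-score of the positive class, the comparison can be scripted rather than read off each report. This is a minimal sketch with made-up labels. `candidate_preds` and the model names are hypothetical, and in the notebook the predictions would be `LRE_preds` and `dt_preds` scored against `y_val`.

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions from two models
y_true = ['no', 'no', 'yes', 'yes', 'yes', 'no']
candidate_preds = {
    'model_a': ['no', 'no', 'no', 'no', 'yes', 'no'],
    'model_b': ['no', 'yes', 'yes', 'yes', 'no', 'no'],
}

# pos_label='yes' scores the class of interest (customers who purchase)
f1_by_model = {
    name: f1_score(y_true, preds, pos_label='yes')
    for name, preds in candidate_preds.items()
}
best_model = max(f1_by_model, key=f1_by_model.get)
print(f1_by_model, best_model)
```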

### Random Forest Classifier Model

In [None]:
#17 Create an instance of RandomForestClassifier.
rf_model = RandomForestClassifier()
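
The remaining steps of the activity (fit the random forest, predict on the validation set, and print the classification report) follow the same pattern as the decision tree. The sketch below shows that pattern on synthetic, imbalanced data. `X_demo`, `rf_demo`, and the other names are assumptions, the hyperparameters are arbitrary, and no result for the bank dataset is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 10% positives (assumption, not the bank data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit the forest on the training split only
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_tr, y_tr)

# Predict on the held-out validation split and summarize per-class metrics
rf_demo_preds = rf_demo.predict(X_va)
rf_demo_report = classification_report(y_va, rf_demo_preds, zero_division=0)
print(rf_demo_report)
```

In the notebook, the equivalent calls would use `rf_model`, `X_train`, `y_train`, `X_val`, and `y_val`, and the resulting report would be compared with the logistic regression report to choose the model for production.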