<a id='toc'></a>
## Table of Contents

* [1. About the project](#about_project)
* [2. Import libraries and load data](#import_libraries_load_data)
* [3 Exploratory data analysis (EDA)](#eda)
  * [3.1 EDA - basic](#eda-basic)
  * [3.2 EDA - additional](#eda-additional)
* [4 Baseline model](#baseline-model)
* [5 Improvement over baseline model](#baseline-improvement)
  * [5.1 Logistic Regression](#logistic-regression)
  * [5.2 Decision Tree](#decision-tree)
  * [5.3 Random Forest](#random-forest)
  * [5.4 XGBoost](#xgb)
* [6. Final model](#final-model)
  * [6.1 Compare results from hyper-parameter tuning for the different models and choose final model](#choose-final-model)
  * [6.2 Train final model](#train-final-model)

<a class='anchor' id='about_project'></a>
[back to TOC](#toc)
### 1. About the project:

Banks are important institutions that provide funds, in terms of loans, to businesses and individuals to function and prosper. However banks need money to provide as loan and to also make their own investments (e.g. in stocks). One such good source that banks have is in the from of Term Deposits that bank's customers make.

Banks regularly make calls to their customers to secure such Term deposits. However, from a big list of all its customers, it would be wise to make calls to the customers who are more likely to invest. This way, banks can reduce the cost of acquisition of Term deposit (in the form of payment to staff making call, call charges and so on.).

This project aims at building a machine learning model that can be trained from previous marketing campaigns and data collected, to predict customers that potentialy will subscribe to Term deposit with the bank. Further, the prediction model will be hosted as a web service, which can accept customer data (in JSON format) and return the prediction (whether customer is likely to subscribe to Term deposit).

#### The dataset
**Citation:** [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.

*Data source:* https://archive.ics.uci.edu/ml/datasets/bank+marketing
(Also available at https://www.openml.org/d/1461 but has some differences due to data being processed a bit)

*Datafile to be used:* https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

* 1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
* 2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

The bank-additional-full.csv (which has the complete dataset) is used for this project.

#### Notes from data source

##### Input variables:

*bank client data:*
* 1 - age (numeric)
* 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
* 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
* 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
* 5 - default: has credit in default? (categorical: 'no','yes','unknown')
* 6 - housing: has housing loan? (categorical: 'no','yes','unknown')
* 7 - loan: has personal loan? (categorical: 'no','yes','unknown')

*related with the last contact of the current campaign:*
* 8 - contact: contact communication type (categorical: 'cellular','telephone')
* 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
* 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
* 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

*other attributes:*
* 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
* 14 - previous: number of contacts performed before this campaign and for this client (numeric)
* 15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

*social and economic context attributes*
* 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
* 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
* 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
* 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
* 20 - nr.employed: number of employees - quarterly indicator (numeric)

##### Output variable (desired target)
* 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')



<a id='import_libraries_load_data'></a>
### 2. Import libraries and load data
[back to TOC](#toc)

In [99]:
#Import libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as px
from IPython.display import display

from sklearn.metrics import mutual_info_score
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
import xgboost as xgb

from sklearn.feature_extraction import DictVectorizer

from sklearn import preprocessing

import pickle

In [2]:
datafile = 'bank-additional-full.csv'
df = pd.read_csv(datafile,delimiter=';')

In [1]:
df

<a id='eda'></a>
### 3. Exploratory Data Analysis

[back to TOC](#toc)

This section performs various analysis of the dataset, split it into training, validation, test.

* [3.1 EDA - basic](#eda-basic)
* [3.2 EDA - additional](#eda-additional)

<a id='eda-basic'></a>
#### 3.1  EDA - Basic

* Check if columns are correctly classified as numerical and categorical *(sometimes numerical columns are marked categorical or vice versa)*
* Check for missing data and impute if data is missing
* Check if any numerical features have extremely high values *(sometimes NaNs are coded as high number like 99999999)*
* Check cardinality of categorical features *(if very high cardinality then using one-hot encoding may create a lot of features)*
* Target variable analysis 
  - Convert categorical to binary *(since this dataset has target values of yes and no)*
  - Check whether there is class imbalance *(if class imbalance then accuracy should not be used as evaluation metric, rather roc_auc can be used)*
* Several features have 'unknown' as values *(representative of NaNs)*. Check whether these samples can be deleted or it will reduce the dataset significantly

Check if all the columns have correct data type (sometimes numerical columns are marked categorical or vice versa) 

In [2]:
df_head = df.head(2).T
dtypes = list(df.dtypes.values)
df_head.insert(loc=0,column='dtype',value=dtypes)
df_head

**Observations:** The columns seem to be correctly typed

Check for missing data

In [3]:
df.isnull().sum()

**Observations:** No missing data

In [4]:
#Numerical and Categorical features
t_num_cols = list(df.columns[df.dtypes != 'object'])
t_cat_cols = list(df.columns[df.dtypes == 'object'])
t_cat_cols.remove('y')
display(t_num_cols)
display(t_cat_cols)

Check if any of the numerical columns have significantly high values (sometimes NaNs are filled as something like 99999999)

In [5]:
df[t_num_cols].describe().T

**Observations:** Looking at the max for all the numerical features, we can see that there are no significantly high values for any of the features.

For categorical features, check cardinality (distinct values and their count)

In [6]:
df[t_cat_cols].describe().T

In [7]:
#Number of distinct values for each categorical feature
df[t_cat_cols].nunique()

The categorical features do not have high cardinality.

In [8]:
#Distinct values and their count for each categorical feature
for c in t_cat_cols:
    print(c)
    display(df[c].value_counts())
    print()

The cardinality for the categorical columns look to be ok, not too much

Converting target variable from having 'yes'/'no' to 1/0

In [9]:
df['y'] = (df['y'] == 'yes').astype(int)
df['y'].value_counts(normalize=True)

**Observations:** There is a high class imbalance in the target variable - 89% (no) - 11% (yes)

Many of the features have the value 'unknown' possibly since that piece of information was not available. Checking how many such records are there with unknowns and if removing these make sense.

In [10]:
df_copy1 = df.copy()
initial_records = df_copy1.shape[0]
for c in t_cat_cols:
    df_copy1.drop(df_copy1[df_copy1[c] == 'unknown'].index, inplace = True)
final_records = df_copy1.shape[0]
print(initial_records - final_records,(initial_records - final_records)/final_records)
del df_copy1

**Observations:** Thus totally there are 10700 records with atleast one feature having value as 'unknown'. This is 35% of the total data. So it does not make sense to remove this data. Will have to go ahead with the 'unknown' data

Look at the distribution of numerical features

In [11]:
# Distribution of numerical features

rows, columns = 2, 5
n_row, n_col = 0, 0
fig, axes = plt.subplots(rows, columns, figsize=(20,10))
for num_ft in t_num_cols:
    if n_col < columns:
        axes[n_row][n_col].hist(df[num_ft])
        axes[n_row][n_col].set_title(num_ft)
        n_col = n_col + 1
    else:
        n_row = 1
        axes[n_row][n_col-columns].hist(df[num_ft])
        axes[n_row][n_col-columns].set_title(num_ft)
        n_col = n_col + 1
    
plt.show()

**Observations:** We can see that the numerical features 'duration', 'campaign', 'pdays', 'previous' are not normally distributed. Will use this information for additional EDA and when performing experiments with models and scores

In [15]:
num_cols_transform = ['duration', 'campaign', 'pdays', 'previous']

<a id='eda-additional'></a>
#### 3.2. EDA - additional
[back to TOC](#toc)

* Split the data into Train (70%), Validation (20%) and Test (10%)
* For the features 'duration', 'campaign', 'pdays', 'previous' perform various transformations using transformers like Quantile, Power, Log transformation to see if the distribution can be brought close to normal
* Feature importance - using mutual information score for categorical features and using correlation for numerical features

**Splitting data as Train (70%), Val (20%), Test (10%)**

In [12]:
df_full_train, df_test = train_test_split(df,test_size=0.1,shuffle=True,random_state=1)
df_train, df_val = train_test_split(df_full_train,test_size=0.22,shuffle=True,random_state=1)
print(f'train : {round(df_train.shape[0]/df.shape[0],2)}, val: {round(df_val.shape[0]/df.shape[0],2)}, test: {round(df_test.shape[0]/df.shape[0],2)}')
df_test_project = df_test.copy()   #Dataframe with test features and label if required to use final model for preditions.

In [13]:
display(df_full_train['y'].value_counts(normalize=True))
display(df_train['y'].value_counts(normalize=True))
display(df_val['y'].value_counts(normalize=True))
display(df_test['y'].value_counts(normalize=True))

In [18]:
y_train = df_train['y'].values
y_val = df_val['y'].values
y_test = df_test['y'].values

del df_train['y']
del df_val['y']
del df_test['y']

df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [14]:
quantile_transformer = preprocessing.QuantileTransformer(output_distribution='normal', random_state=42)
df_train_trans = pd.DataFrame(quantile_transformer.fit_transform(df_train[num_cols_transform]),columns=num_cols_transform)

rows, columns = 2, 2
n_row, n_col = 0, 0
fig, axes = plt.subplots(rows, columns, figsize=(20,10))
for num_ft in num_cols_transform:
    if n_col < columns:
        axes[n_row][n_col].hist(df_train_trans[num_ft])
        axes[n_row][n_col].set_title(num_ft)
        n_col = n_col + 1
    else:
        n_row = 1
        axes[n_row][n_col-columns].hist(df_train_trans[num_ft])
        axes[n_row][n_col-columns].set_title(num_ft)
        n_col = n_col + 1
    
plt.show()

In [15]:
quantile_transformer = preprocessing.QuantileTransformer(output_distribution='uniform', random_state=42)
df_train_trans = pd.DataFrame(quantile_transformer.fit_transform(df_train[num_cols_transform]),columns=num_cols_transform)

rows, columns = 2, 2
n_row, n_col = 0, 0
fig, axes = plt.subplots(rows, columns, figsize=(20,10))
for num_ft in num_cols_transform:
    if n_col < columns:
        axes[n_row][n_col].hist(df_train_trans[num_ft])
        axes[n_row][n_col].set_title(num_ft)
        n_col = n_col + 1
    else:
        n_row = 1
        axes[n_row][n_col-columns].hist(df_train_trans[num_ft])
        axes[n_row][n_col-columns].set_title(num_ft)
        n_col = n_col + 1
    
plt.show()

In [16]:
power_transformer = preprocessing.PowerTransformer(method='yeo-johnson')
df_train_trans = pd.DataFrame(power_transformer.fit_transform(df_train[num_cols_transform]),columns=num_cols_transform)

rows, columns = 2, 2
n_row, n_col = 0, 0
fig, axes = plt.subplots(rows, columns, figsize=(20,10))
for num_ft in num_cols_transform:
    if n_col < columns:
        axes[n_row][n_col].hist(df_train_trans[num_ft])
        axes[n_row][n_col].set_title(num_ft)
        n_col = n_col + 1
    else:
        n_row = 1
        axes[n_row][n_col-columns].hist(df_train_trans[num_ft])
        axes[n_row][n_col-columns].set_title(num_ft)
        n_col = n_col + 1
    
plt.show()

In [17]:
plt.hist(np.log1p(df_train['previous']))

In [18]:
plt.hist(np.log1p(df_train['pdays']))

Noting down which transformations should be tried for which features:
* **No transformation:** 'age', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'
* **Power transformation with 'yeo-johnson':** 'duration', 'campaign'
* **log1p transformation:** 'previous', 'pdays'

**Feature Importance**

Mutual information with categorical features

In [24]:
def mutual_info_subscribed_score(series):
    return mutual_info_score(series,df_full_train['y'])

In [19]:
mi = df_full_train[t_cat_cols].apply(mutual_info_subscribed_score)
mi.sort_values(ascending=False)

**Observations:** We can see that the features **marital, day_of_week, housing and loan** seem to not have any bearing on the outcome. Will check the score with and without these features.

Check global subscription rate relation to subscription rate feature wise

In [20]:
global_subscription_rate = round(df['y'].mean(),3)
print(global_subscription_rate)

In [21]:
cons_df_group = pd.DataFrame()

for c in t_cat_cols:
    df_group = df_full_train.groupby(c)['y'].agg(['mean','count'])
    df_group['diff'] = df_group['mean'] - global_subscription_rate
    df_group['abs_diff'] = np.abs(df_group['mean'] - global_subscription_rate)
    df_group['ratio'] = df_group['mean'] / global_subscription_rate
    new_idx = [c+'_'+val for val in list(df_group.index.values)]
    df_group.index = new_idx
    cons_df_group = pd.concat([cons_df_group,df_group])
display(cons_df_group.sort_values(by='ratio',ascending=False))

Co-relation of numerical features with target variable

In [22]:
df_full_train[t_num_cols].corrwith(df_full_train['y']).sort_values()

**Observations:** We can see that most of the numerical features have correlation with target unlike categorical variables, where almost all categorical variables had lower mutual information score with target. The features age, duration, previous and cons.conf.idx have a postivie correlation, while campaign, pdays, emp.var.rate, cons.price.idx, euribor3m and nr.employed have negative correlation

To check absolute correlations (without considering if it is postive or negative, but simply how highly they are correlated)

In [23]:
np.abs(df_full_train[t_num_cols].corrwith(df_full_train['y'])).sort_values(ascending=False)

**Observations:** All the numerical features seem to be important for predicting the outcome. 

**Important Note:**
The duration will not be considered in the model performance and the final model, based on the guidelines that were given with the dataset and mentioned below.
- *Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.*

Lets look at correlations of numerical feature with other numerical features

In [24]:
df_full_train[t_num_cols].corr()

In [25]:
plt.figure(figsize=(16, 8))
sns.heatmap(df_full_train[t_num_cols].corr(), annot=True, fmt='.3f')

**Observations:** We can see that some of the features have good correlation (>0.75) - like - emp.var.rate, nr.employed, euribor3m, cons.price.idx are co-related to each other. This makes sense as all these are social and economic context attributes.

<a id='baseline-model'></a>
### 4. Baseline model
[back to TOC](#toc)

Will check performance of baseline model using LogisticRegression with default parameters and evaluating using roc_auc_score (since the target variable has class imbalance). Will check the scores with and without the feature 'duration'

Score with feature 'duration'

In [32]:
dv = DictVectorizer(sparse=False)
model = LogisticRegression(solver='liblinear',random_state=42)

In [26]:
dict_train = df_train.to_dict(orient='records')
X_train = dv.fit_transform(dict_train)
model.fit(X_train,y_train)

In [27]:
dict_val = df_val.to_dict(orient='records')
X_val = dv.transform(dict_val)
y_pred = model.predict_proba(X_val)[:,1]
this_auc = roc_auc_score(y_val,y_pred)
print(this_auc)

In [35]:
#Save all scores into a pandas dataframe so that we can have a comparative study across experiments

#Construct of this dataframe will be like below example
#"algo", "desc", "score", "diff"
#"logisticregression", "description of the experiment", 0.6896, 0.0243

exp_columns = ["algo", "desc", "score", "diff"]
exp_scores = pd.DataFrame(columns = exp_columns)

In [36]:
# Function to do one-hot encoding, train the model and evaluate model on validation data

def evaluate(new_features,df_train_copy,df_val_copy,model):
    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
    model.fit(X_train_new,y_train)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    y_pred_new = model.predict_proba(X_val_new)[:,1]    
    roc_auc_val = roc_auc_score(y_val,y_pred_new)

    return roc_auc_val

Removing the feature 'duration' and checking the score

In [28]:
#Experiment 0 - baseline

df_train_copy = df_train.copy()
del df_train_copy['duration']
df_val_copy = df_val.copy()
del df_val_copy['duration']

features_list = df_train_copy.columns
baseline_auc = evaluate(features_list,df_train_copy,df_val_copy,model)
print(baseline_auc)
score_entry = {"algo": "logisticregression", "desc": "baseline score. all features excluding duration", "score": baseline_auc, "diff": 0}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

**Observations:** We will consider this score (without the 'duration' feature) as the baseline score. We will now see how we can improve the score. Later we will try using different models and then again see for those, how we can further improve the score by hyper-parameter tuning.

Deleting 'duration' feature from all the dataframes

In [38]:
del df_train['duration']
del df_full_train['duration']
del df_val['duration']
del df_test['duration']
del df_test_project['duration']
t_num_cols.remove('duration')

<a id='baseline-improvement'></a>
### 5. Improvement over baseline
[back to TOC](#toc)

Idea here is to do several experiments using the algorithm used in baseline and find methods to improve the score. Then we will try to tune the paramaters of this model to further improve the score. After having worked with this model, we will look at other algorithms and compare the scores, then we will also tune the hyperparameters for these other models. Finally we will compare all the results to select the best model and parameters, which we will then use to train the full_train dataset and do final evaluation on the test dataset

#### [5.1 Logistic Regression](#logistic-regression)
##### [5.1.1 Experiments to improve the score using LogisticRegression](#logistic-regression-1)
   * Perform various scaling (Standard, MinMax, use Polynomial features) of the numerical features and compare scores
   * Check scores by dropping less important features based on EDA
   * Check effect on score by dropping one feature at a time
   * Check effect on score by dropping multiple features that led to increased score when dropped
   * Look at coefficients of the model to determine less important features

##### [5.1.2 Model tuning for LogisticRegression](#logistic-regression-2)
   * Choose the best experiment and find the best hyper-parameters for the model

#### [5.2 DecisionTreeClassifier](#decision-tree)
Compare score of using DecisionTree with baseline using LogisticRegression
##### [5.2.1 Experiments to improve the score using DecisionTreeClassifier](#decision-tree-1)
   * Perform various scaling (Standard, MinMax, use Polynomial features) of the numerical features and compare scores
   * Check scores by dropping less important features based on EDA
   * Check effect on score by dropping one feature at a time
   * Check effect on score by dropping multiple features that led to increased score when dropped

##### [5.2.2 Model tuning for DecisionTreeClassifier](#ldecision-tree-2)
   * Choose the best experiment and find the best hyper-parameters for the model

#### [5.3 RandomForestClassifier](#random-forest)
Compare score of using RandomForestClassifier with baseline using LogisticRegression
##### [5.3.1 Experiments to improve the score using RandomForestClassifier](#random-forest-1)
   * Perform various scaling (Standard, MinMax, use Polynomial features) of the numerical features and compare scores
   * Check scores by dropping less important features based on EDA
   * Check effect on score by dropping one feature at a time
   * Check effect on score by dropping multiple features that led to increased score when dropped

##### [5.3.2 Model tuning for RandomForestClassifier](#random-forest-2)
   * Choose the best experiment and find the best hyper-parameters for the model

#### [5.4 XGBoost](#xgb)
Compare score of using XGBoost with baseline using LogisticRegression
##### [5.4.1 Experiments to improve the score using XGBoost](#xgb-1)
   * Perform various scaling (Standard, MinMax, use Polynomial features) of the numerical features and compare scores
   * Check scores by dropping less important features based on EDA
   * Check effect on score by dropping one feature at a time
   * Check effect on score by dropping multiple features that led to increased score when dropped

##### [5.4.2 Model tuning for XGBoost](#xgb-2)
   * Choose the best experiment and find the best hyper-parameters for the model

<a id='logistic-regression'></a>
#### 5.1 Logistic Regression (continued...)
[back to TOC](#toc)

<a id='logistic-regression-1'></a>
#### 5.1.1 Experiments to improve the score using LogisticRegression
[back to TOC](#toc)

Linear models work best when all the features have similar scale. Let us check whether scaling of numerical features helps increase the score. Using StandardScaler

In [38]:
#Experiment 1

scaler_std = preprocessing.StandardScaler()

df_train_copy = df_train.copy()
# df_train_copy.reset_index(drop=True,inplace=True)
X_train_num = scaler_std.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
# df_val_copy.reset_index(drop=True,inplace=True)
X_val_num = scaler_std.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "standardscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [29]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** There is considerable improvement in score after scaling using StandardScaler. Let us check results with other scaling (MinMaxScaler) and preprocessing (Polynomial)

In [40]:
#Experiment 2

scaler_minmax = preprocessing.MinMaxScaler()
# df_train_copy.reset_index(drop=True,inplace=True)
df_train_copy = df_train.copy()

X_train_num = scaler_minmax.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
# df_val_copy.reset_index(drop=True,inplace=True)
X_val_num = scaler_minmax.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "minmaxscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [30]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** There is further slight improvement in score with MinMaxScaler compared to StandarScaler.

Let us create polynomial features and check score

In [42]:
#Experiment 3

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy = pd.concat([df_train_copy, df_train_poly], axis=1)

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy = pd.concat([df_val_copy, df_val_poly], axis=1)

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "polynomialfeatures", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [31]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** Score after using polynomial features has decreased and is even less than the baseline (here we kept original features as well as the polynomial features). Let us check replacing original numerical features by the corresponding polynomil features.

In [32]:
#Experiment 4

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy.drop(t_num_cols,axis=1,inplace=True)
poly_cols = list(df_train_poly.columns.values)
df_train_copy[poly_cols] = df_train_poly[poly_cols]

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy.drop(t_num_cols,axis=1,inplace=True)
df_val_copy[poly_cols] = df_val_poly[poly_cols]

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "polynomialfeatures replacing org. features", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [33]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** Using polynomial features replacing original is getting better result, however is still less than baseline.

Check scores with various transformations on numerical features that we noted down during additional EDA
* Power transformation with 'yeo-johnson': 'duration', 'campaign' -- since we have deleted 'duration' feature, will consider only 'campaign'
* log1p transformation: 'previous', 'pdays'

In [46]:
#Experiment 5 - power transform 'campaign'
cols_transform = ['campaign']
power_transformer = preprocessing.PowerTransformer(method='yeo-johnson')


df_train_copy = df_train.copy()
df_train_trans = pd.DataFrame(power_transformer.fit_transform(df_train_copy[cols_transform]),columns=cols_transform)
df_train_copy[cols_transform] = df_train_trans[cols_transform]


df_val_copy = df_val.copy()
df_val_trans = pd.DataFrame(power_transformer.transform(df_val_copy[cols_transform]),columns=cols_transform)
df_val_copy[cols_transform] = df_val_trans[cols_transform]

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "powertransform 'campaign'", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [34]:
exp_scores.sort_values(by='score',ascending=False)

In [48]:
#Experiment 6 - log1p transform 'previous'
cols_transform = ['previous']

df_train_copy = df_train.copy()
df_train_copy[cols_transform] = np.log1p(df_train_copy[cols_transform])


df_val_copy = df_val.copy()
df_val_copy[cols_transform] = np.log1p(df_val_copy[cols_transform])

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "log1p 'previous'", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [35]:
exp_scores.sort_values(by='score',ascending=False)

In [50]:
#Experiment 7 - log1p transform 'pdays'
cols_transform = ['pdays']

df_train_copy = df_train.copy()
df_train_copy[cols_transform] = np.log1p(df_train_copy[cols_transform])


df_val_copy = df_val.copy()
df_val_copy[cols_transform] = np.log1p(df_val_copy[cols_transform])

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "log1p 'pdays'", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [36]:
exp_scores.sort_values(by='score',ascending=False)

In [52]:
#Experiment 8 - Use log1p transform for 'pdays' and for other numerical features use StandardScaler

scaler = preprocessing.StandardScaler()
cols_transform = t_num_cols.copy()
cols_transform.remove('pdays')

df_train_copy = df_train.copy()
X_train_num = scaler.fit_transform(df_train_copy[cols_transform])
df_train_copy[cols_transform] = pd.DataFrame(X_train_num,columns=cols_transform)

df_val_copy = df_val.copy()
X_val_num = scaler.transform(df_val_copy[cols_transform])
df_val_copy[cols_transform] = pd.DataFrame(X_val_num,columns=cols_transform)

df_train_copy['pdays'] = np.log1p(df_train_copy['pdays'])
df_val_copy['pdays'] = np.log1p(df_val_copy['pdays'])

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "logisticregression", "desc": "log1p 'pdays' rest standardscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [37]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** From experiments so far, performing various transformation on numerical features, we can see that we got **best score with MixMaxScaler**, then log1p of 'pdays', then using StandardScaler. Since we decided not to use MinMaxScaler, and since score of simply using StandardScaler without any other transformations is equally good (very small difference), we will be choosing this method of transformation for further experiments.

Let us look at the coefficients of the trained Logistic regression model to see which are the features that do not help much (coefficient values close to 0)

In [56]:
# Function to do one-hot encoding of categorical features, StandardScaler processing of numerical features, train the model and evaluate model on validation data

def evaluate(new_features,df_train_copy,df_val_copy,model):
    scaler = preprocessing.StandardScaler()
    dv = DictVectorizer(sparse=False)
    
    num_cols = list(df_train_copy.columns[df_train_copy.dtypes != 'object'])
    
    cols_transform = num_cols.copy()
    if 'pdays' in cols_transform:
        cols_transform.remove('pdays')
        df_train_copy['pdays'] = np.log1p(df_train_copy['pdays'])
        df_val_copy['pdays'] = np.log1p(df_val_copy['pdays'])
    
    X_train_num = scaler_minmax.fit_transform(df_train_copy[cols_transform])
    df_train_copy[cols_transform] = pd.DataFrame(X_train_num,columns=cols_transform)

    X_val_num = scaler_minmax.transform(df_val_copy[cols_transform])
    df_val_copy[cols_transform] = pd.DataFrame(X_val_num,columns=cols_transform)

    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
    model.fit(X_train_new,y_train)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    y_pred_new = model.predict_proba(X_val_new)[:,1]    
    roc_auc_val = roc_auc_score(y_val,y_pred_new)

    return roc_auc_val

In [57]:
#Will again process the data with StandardScaler and train the model and then will check the coefficients

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

features_list = df_train_copy.columns
this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)

In [38]:
features_names = dv.get_feature_names()
df_ = pd.DataFrame()
df_['features'] = features_names
df_['coef'] = model.coef_[0].round(3)
df_['abs_coef'] = np.abs(model.coef_[0].round(3))
df_ = df_.sort_values('abs_coef').reset_index(drop=True)
df_

In [39]:
exp_scores.sort_values(by='score',ascending=False)

Based on out EDA, we had seen that the features 'loan','housing','day_of_week','marital' had very less mutual information with the target. Let us check the score of model trained without these features.

In [40]:
#Experiment 9

drop_features = ['loan','housing','day_of_week','marital']   # Features observed to be least significant in EDA

#Defining description based on features being dropped
entry_desc = "EDA obs deleted features 'loan','housing','day_of_week','marital'"

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)
features_list = list(df_train_copy.columns.values)

this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)
print(this_auc)
print('%.6f' % (round(this_auc - baseline_auc,6)))
score_entry = {"algo": "logisticregression", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

**Observations:** Thus we can see that there is extremely tiny difference in the scores with and without the features 'loan','housing','day_of_week','marital'. Thus the model is confirming our observations in EDA. Also the score seems to have slightly improved after removing the features.

Evaluate effect on score by dropping one feature at a time

In [41]:
exp_scores.sort_values(by='score',ascending=False)

In [42]:
#Score for each feature being dropped and it's difference with baseline score

#Experiment 10 onwards

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

features_list = df_train_copy.columns
feature_roc_auc_scores = []

for drop_feature in list(features_list):
    new_features = list(features_list.drop(drop_feature))
    this_auc = evaluate(new_features,df_train_copy,df_val_copy,model)
    feature_roc_auc_scores.append((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc)))
    print(str((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc))))
    entry_desc = f"delete one feature {drop_feature}"
    score_entry = {"algo": "logisticregression", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [43]:
exp_scores.sort_values(by='score',ascending=False).head(10)

Check difference in scores - positive difference indicating score improved on removing the feature (feature is unnecessary and impacting model), negative difference indicating score reduced on removing the feature (feature is useful) and the magnitude of the value indicating how important (more or less).

In [44]:
df_feature_roc_auc = pd.DataFrame(feature_roc_auc_scores,columns=['feature','new_roc_auc_score','diff_roc_auc_score','abs_diff_roc_auc_score'])
df_sorted = df_feature_roc_auc.sort_values(by='diff_roc_auc_score',ascending=False)
df_sorted

In [45]:
plt.figure(figsize=(10, 8))
plt.barh(df_sorted['feature'],df_sorted['diff_roc_auc_score'])

**Observations:** We can see the following:
* Deleting the feature 'month' results into reduced score
* Deleting any other features is improving the score although not very much

Check which features result into least change in score than baseline, irrespective of whether the difference is positive or negative

In [46]:
df_feature_roc_auc.sort_values(by='abs_diff_roc_auc_score',ascending=False)

**Observations:** Deleting any of the features has very minimal effect on score [less than 0.008 - 0.0015]

Will now experiment with dropping a groups of features (features that on dropping result into increased score) and see the effect on score. We will select the top 8 features that have positive effect when removed.

In [67]:
top_features = list(df_feature_roc_auc.sort_values(by='abs_diff_roc_auc_score',ascending=False).head(8)['feature'].values)
top_features.remove('month')

In [68]:
import itertools

#Function to get all combinations of all features
def drop_features(features):
    drop_features_list = []
    for L in range(1, len(features)+1):
        for subset in itertools.combinations(features, L):
            drop_features_list.append(list(subset))
    return drop_features_list

In [47]:
#Experiment 29 onwards

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

drop_features_list = drop_features(top_features)

for drop_features in drop_features_list:
    #Defining description based on features being dropped
    entry_desc = f'deleted feature {str(drop_features).replace("[","").replace("]","")}'

    df_train_copy = df_train.copy()
    df_train_copy.drop(drop_features,axis=1,inplace=True)
    df_val_copy = df_val.copy()
    df_val_copy.drop(drop_features,axis=1,inplace=True)
    features_list = list(df_train_copy.columns.values)

    this_auc = evaluate(features_list,df_train_copy,df_val_copy,model)
    print(this_auc, this_auc - baseline_auc)
    score_entry = {"algo": "logisticregression", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [48]:
exp_scores.sort_values(by='score',ascending=False).head(20)

**Observations:** Removing combination of features is giving better scores that removing a single feature and almost all of these experiments are scoring very well in comparison with baseline.

In [49]:
exp_scores.sort_values(by='score',ascending=False)[:10]['desc'].values

Top 10 ranking experiments are:
* "deleted feature 'day_of_week', 'education', 'cons.conf.idx', 'age', 'job'"
* "deleted feature 'day_of_week', 'education', 'cons.conf.idx', 'age', 'job', 'previous'"
* "deleted feature 'day_of_week', 'education', 'cons.conf.idx', 'job'"
* "deleted feature 'day_of_week', 'education', 'cons.conf.idx', 'job', 'previous'"
* "deleted feature 'day_of_week', 'education', 'age', 'job'"
* "deleted feature 'day_of_week', 'education', 'job'"
* "deleted feature 'day_of_week', 'education', 'job', 'previous'"
* "deleted feature 'day_of_week', 'education', 'age', 'job', 'previous'"
* "deleted feature 'day_of_week', 'education', 'cons.conf.idx', 'marital', 'age'"
* "deleted feature 'day_of_week', 'education', 'cons.conf.idx', 'age'"

<a id='logistic-regression-2'></a>
##### 5.1.2 Model tuning for LogisticRegression
[back to TOC](#toc)

**Tuning the parameters using GridsearchCV**

Tuning parameters using the top experiment: Use MinMaxScaler to transform numerical features, Delete features 'day_of_week', 'education', 'cons.conf.idx', 'age', 'job'

In [82]:
def pre_process(df_full_train_copy):
    scaler = preprocessing.MinMaxScaler()
    dv = DictVectorizer(sparse=False)
    
    num_cols = list(df_full_train_copy.columns[df_full_train_copy.dtypes != 'object'])
    
    cols_transform = num_cols.copy()
    if 'pdays' in cols_transform:
        cols_transform.remove('pdays')
        df_full_train_copy['pdays'] = np.log1p(df_full_train_copy['pdays'])
    
    X_full_train_num = scaler.fit_transform(df_full_train_copy[cols_transform])
    df_full_train_copy[cols_transform] = pd.DataFrame(X_full_train_num,columns=cols_transform)

    dict_full_train_new = df_full_train_copy.to_dict(orient='records')
    X_full_train_new = dv.fit_transform(dict_full_train_new)

    return X_full_train_new

In [50]:
#Find best hyperparameters. 

#Will do this in 2 steps, since 'newton-cg' and 'lbfgs' supports 'l2' and 'none', while 'liblinear' supports 'l1' and 'l2'

#Step1: Run for 'newton-cg', 'lbfgs' with 'l2' and 'none'

# define models and parameters
model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
max_iters = np.linspace(10, 200,20)

drop_features = ['day_of_week', 'education', 'cons.conf.idx', 'age', 'job']
entry_desc = f'deleted feature {str(drop_features).replace("[","").replace("]","")}'
  
df_full_train_copy = df_full_train.copy()
df_full_train_copy.drop(drop_features,axis=1,inplace=True)

y_full_train = df_full_train_copy['y'].values
del df_full_train_copy['y']

# dict_full_train_dict = df_full_train_copy.to_dict(orient='records')
# X_full_train = dv.fit_transform(dict_full_train_dict)
X_full_train = pre_process(df_full_train_copy)

# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values,max_iter=max_iters)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search_1 = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='roc_auc',error_score=0)
grid_result_1 = grid_search_1.fit(X_full_train, y_full_train)

In [85]:
# summarize results
# print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result_1.cv_results_['mean_test_score']
stds = grid_result_1.cv_results_['std_test_score']
params = grid_result_1.cv_results_['params']
columns = ['algo','mean_test_score','std_test_score','params']
df_gridcv_results_1 = pd.DataFrame(columns=columns)
for mean, stdev, param in zip(means, stds, params):
    score_entry = {"algo": "logisticregression", "mean_test_score": mean, "std_test_score": stdev, "params": param}
    df_gridcv_results_1 = df_gridcv_results_1.append(score_entry,ignore_index=True)

In [51]:
df_gridcv_results_1.sort_values(by='mean_test_score',ascending=False).head(10)

In [52]:
#Find best hyperparameters. 

#Will do this in 2 steps, since 'newton-cg' and 'lbfgs' supports 'l2' and 'none', while 'liblinear' supports 'l1' and 'l2'

#Step2: Run for 'liblinear' with 'l1' and l2'

# define models and parameters
model = LogisticRegression()
solvers = ['liblinear']
penalty = ['l1', 'l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
max_iters = np.linspace(10, 200,20)

drop_features = ['day_of_week', 'education', 'cons.conf.idx', 'age', 'job']
entry_desc = f'deleted feature {str(drop_features).replace("[","").replace("]","")}'
  
df_full_train_copy = df_full_train.copy()
df_full_train_copy.drop(drop_features,axis=1,inplace=True)

y_full_train = df_full_train_copy['y'].values
del df_full_train_copy['y']

# dict_full_train_dict = df_full_train_copy.to_dict(orient='records')
# X_full_train = dv.fit_transform(dict_full_train_dict)
X_full_train = pre_process(df_full_train_copy)

# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values,max_iter=max_iters)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search_2 = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='roc_auc',error_score=0)
grid_result_2 = grid_search_1.fit(X_full_train, y_full_train)

In [88]:
# summarize results
# print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result_2.cv_results_['mean_test_score']
stds = grid_result_2.cv_results_['std_test_score']
params = grid_result_2.cv_results_['params']
columns = ['algo','mean_test_score','std_test_score','params']
df_gridcv_results_2 = pd.DataFrame(columns=columns)
for mean, stdev, param in zip(means, stds, params):
    score_entry = {"algo": "logisticregression", "mean_test_score": mean, "std_test_score": stdev, "params": param}
    df_gridcv_results_2 = df_gridcv_results_2.append(score_entry,ignore_index=True)

In [53]:
df_gridcv_results_2.sort_values(by='mean_test_score',ascending=False).head(10)

In [54]:
df_gridcv_results_2.sort_values(by='mean_test_score',ascending=False).head(30)['params'].values

In [90]:
df_gridcv_results = pd.concat([df_gridcv_results_1, df_gridcv_results_2],axis=0)

In [55]:
df_gridcv_results_2.sort_values(by='mean_test_score',ascending=False).head(30)['params'].values

In [108]:
df_gridcv_results.reset_index(drop=True,inplace=True)

In [56]:
df_gridcv_results.sort_values(by='mean_test_score',ascending=False).head(30)

In [106]:
# Results from hyper-parameter tuning concatenated from the 2 steps
# df_gridcv_results.to_csv('project-logistic-regression-gridcv-scores-29oct.csv')

<a id='decision-tree'></a>
#### 5.2 DecisionTreeClassifier
[back to TOC](#toc)

In [38]:
from sklearn.tree import DecisionTreeClassifier

In [39]:
# Function to do one-hot encoding, train the model and evaluate model on validation data

def dt_evaluate(new_features,df_train_copy,df_val_copy,model):
    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
    model.fit(X_train_new,y_train)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    y_pred_new = model.predict_proba(X_val_new)[:,1]    
    roc_auc_val = roc_auc_score(y_val,y_pred_new)

    #Returning dv also, so that we can use this to check decision made by decisiontree
    return roc_auc_val, dv

In [57]:
model = DecisionTreeClassifier(random_state=42)

#Baseline score with DecisionTree

#Defining description based on features being dropped
entry_desc = f'baseline score. all features except duration'

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()
features_list = list(df_train_copy.columns.values)

this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)
print(this_auc, this_auc - baseline_auc)
score_entry = {"algo": "decisiontree", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [58]:
exp_scores

**Observations:** With DecisionTree the score is worse than LogisticRegression - this might be due to overfitting. To check overfitting, we will check the score on y_train itself

In [59]:
df_train_copy = df_train.copy()
features_list = list(df_train_copy.columns.values)

dict_train_new = df_train_copy[features_list].to_dict(orient='records')
X_train_new = dv.fit_transform(dict_train_new)
model.fit(X_train_new,y_train)

y_pred_train = model.predict_proba(X_train_new)[:,1]    
this_auc = roc_auc_score(y_train,y_pred_train)

print(this_auc)
print('%.6f' % (round(baseline_auc - this_auc,6)))

**Observations:** Indeed there is overfitting and hence we did not get good score in previous experiment. This must be due to max_depth default being infinite which causes DecisionTree to almost memorize training data and hence overfitting. This is the same reason why the baseline score with DecisionTree was also bad.

Let us set max_depth to 4 and take a new baseline. For further experiments we can then keep this same value for max_depth.

In [43]:
#Model with DecisionTree with max_depth = 4.

model = DecisionTreeClassifier(random_state=42,max_depth=4)

In [60]:
#New baseline score with DecisionTree with max_depth = 4. Experiment 2
entry_desc = f'new baseline max_depth=4. all features except duration'

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()
features_list = list(df_train_copy.columns.values)

this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)
print(this_auc, this_auc - baseline_auc)
score_entry = {"algo": "decisiontree", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [61]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** Now we have comparatively better score but it is slightly less than the logisticregression baseline.

Let us look at how decisiontree made its decisions

In [62]:
from sklearn.tree import export_text

print(export_text(model,feature_names=dv.get_feature_names()))

<a id='decision-tree-1'></a>
##### 5.2.1 Experiments to improve the score using DecisionTreeClassifier
[back to TOC](#toc)

Based on out EDA, we had seen that the features 'loan','housing','day_of_week','marital' had very less mutual information with the target. Let us check the score of model trained without these features.

In [63]:
#Experiment 3

drop_features = ['loan','housing','day_of_week','marital']   # Features observed to be least significant in EDA

#Defining description based on features being dropped
entry_desc = "EDA obs deleted features 'loan','housing','day_of_week','marital'"

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)
features_list = list(df_train_copy.columns.values)

this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)
print(this_auc)
print('%.6f' % (round(baseline_auc - this_auc,6)))
score_entry = {"algo": "decisiontree", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [64]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** With DecisionTree the difference in the scores with and without the features 'loan', 'housing', 'day_of_week', 'marital' is very less, thus confirming our observations in EDA.

Let us evaluate scores by transforming the numerical features

In [55]:
#Experiment 4

scaler_std = preprocessing.StandardScaler()

df_train_copy = df_train.copy()
# df_train_copy.reset_index(drop=True,inplace=True)
X_train_num = scaler_std.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
# df_val_copy.reset_index(drop=True,inplace=True)
X_val_num = scaler_std.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "decisiontree", "desc": "standardscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)


In [65]:
exp_scores.sort_values(by='score',ascending=False)

In [57]:
#Experiment 5

scaler_minmax = preprocessing.MinMaxScaler()
# df_train_copy.reset_index(drop=True,inplace=True)
df_train_copy = df_train.copy()

X_train_num = scaler_minmax.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
# df_val_copy.reset_index(drop=True,inplace=True)
X_val_num = scaler_minmax.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "decisiontree", "desc": "minmaxscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [66]:
exp_scores.sort_values(by='score',ascending=False)

In [59]:
#Experiment 6

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy = pd.concat([df_train_copy, df_train_poly], axis=1)

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy = pd.concat([df_val_copy, df_val_poly], axis=1)

features_list = df_train_copy.columns
this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "decisiontree", "desc": "polynomialfeatures", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [67]:
exp_scores.sort_values(by='score',ascending=False)

In [61]:
#Experiment 7

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy.drop(t_num_cols,axis=1,inplace=True)
poly_cols = list(df_train_poly.columns.values)
df_train_copy[poly_cols] = df_train_poly[poly_cols]

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy.drop(t_num_cols,axis=1,inplace=True)
df_val_copy[poly_cols] = df_val_poly[poly_cols]

features_list = df_train_copy.columns
this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "decisiontree", "desc": "polynomialfeatures replacing org. features", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [68]:
exp_scores.sort_values(by='score',ascending=False)

In [63]:
#Experiment 8 - power transform 'campaign'
cols_transform = ['campaign']
power_transformer = preprocessing.PowerTransformer(method='yeo-johnson')


df_train_copy = df_train.copy()
df_train_trans = pd.DataFrame(power_transformer.fit_transform(df_train_copy[cols_transform]),columns=cols_transform)
df_train_copy[cols_transform] = df_train_trans[cols_transform]


df_val_copy = df_val.copy()
df_val_trans = pd.DataFrame(power_transformer.transform(df_val_copy[cols_transform]),columns=cols_transform)
df_val_copy[cols_transform] = df_val_trans[cols_transform]

features_list = df_train_copy.columns
this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "decisiontree", "desc": "powertransform 'campaign'", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)


In [69]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** From the experiments so far, we can see that only the score with polynomial features is better than the baseline (new baseline of decisiontree). Also, since the score with added polynomial features and replacing numerical features with polynomial features is the same, we will choose replacing option, as it will reduce number of features to be trained on. We will use this method for further experiments.

In [71]:
# Modified function to do one-hot encoding, replace numerical features with polynomial features, train the model and evaluate model on validation data

def dt_evaluate(new_features,df_train_copy,df_val_copy,model):
    poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

    df_train_copy = df_train_copy[new_features]
    df_val_copy = df_val_copy[new_features]
    
    num_cols = list(df_train_copy.columns[df_train_copy.dtypes != 'object'])
    X_train_num = poly.fit_transform(df_train_copy[num_cols])
    df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
    df_train_copy.drop(num_cols,axis=1,inplace=True)
    poly_cols = list(df_train_poly.columns.values)
    df_train_copy[poly_cols] = df_train_poly[poly_cols]

    X_val_num = poly.transform(df_val_copy[num_cols])
    df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
    df_val_copy.drop(num_cols,axis=1,inplace=True)
    df_val_copy[poly_cols] = df_val_poly[poly_cols]

    new_features = list(df_train_copy.columns.values)
    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
    model.fit(X_train_new,y_train)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    y_pred_new = model.predict_proba(X_val_new)[:,1]    
    roc_auc_val = roc_auc_score(y_val,y_pred_new)

    #Returning dv also, so that we can use this to check decision made by decisiontree
    return roc_auc_val, dv

Evaluate effect on score by dropping one feature at a time

In [70]:
#Score for each feature being dropped and it's difference with baseline score

#Experiment 9 onwards

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

features_list = df_train_copy.columns
feature_roc_auc_scores = []

for drop_feature in list(features_list):
    new_features = list(features_list.drop(drop_feature))
    this_auc, dv = dt_evaluate(new_features,df_train_copy,df_val_copy,model)
    feature_roc_auc_scores.append((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc)))
    print(str((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc))))
    entry_desc = f"delete one feature {drop_feature}"
    score_entry = {"algo": "decisiontree", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [71]:
df_feature_roc_auc = pd.DataFrame(feature_roc_auc_scores,columns=['feature','new_roc_auc_score','diff_roc_auc_score','abs_diff_roc_auc_score'])
df_sorted = df_feature_roc_auc.sort_values(by='diff_roc_auc_score',ascending=False)
df_sorted

In [72]:
plt.figure(figsize=(10, 8))
plt.barh(df_sorted['feature'],df_sorted['diff_roc_auc_score'])

**Observations:** Score is higher when dropping any of the feature (one feature at a time), except for 'contact', 'pdays', where score is decreasing when feature is dropped. Lets us check these scores in comparison with score when using polynomial features

In [73]:
exp_scores.sort_values(by='score',ascending=False)

We can see that when compared to base score with polynomial features, score is higher only when dropping any of 'month', 'campaign', 'cons.conf.idx', 'age', 'previous'. Let us check if dropping any combination of these features further increases the score.

In [92]:
import itertools

#Function to get all combinations of all features
def drop_features(features):
    drop_features_list = []
    for L in range(2, len(features)+1):
        for subset in itertools.combinations(features, L):
            drop_features_list.append(list(subset))
    return drop_features_list

In [93]:
top_features = ['month', 'campaign', 'cons.conf.idx', 'age', 'previous']

In [74]:
drop_features(top_features)

In [75]:
#Experiment 28 onwards

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

drop_features_list = drop_features(top_features)

for drop_features in drop_features_list:
    #Defining description based on features being dropped
    entry_desc = f'deleted feature {str(drop_features).replace("[","").replace("]","")}'

    df_train_copy = df_train.copy()
    df_train_copy.drop(drop_features,axis=1,inplace=True)
    df_val_copy = df_val.copy()
    df_val_copy.drop(drop_features,axis=1,inplace=True)
    features_list = list(df_train_copy.columns.values)

    this_auc, dv = dt_evaluate(features_list,df_train_copy,df_val_copy,model)
    print(this_auc, this_auc - baseline_auc)
    score_entry = {"algo": "decisiontree", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [76]:
exp_scores.sort_values(by='score',ascending=False).head(20)

**Observations:** Deleting multiple features among ['month', 'campaign', 'cons.conf.idx', 'age', 'previous'] has increased the score with good positive difference. Will choose the combination that led to highest score, and then perform parameter tuning.

In [77]:
exp_scores.sort_values(by='score',ascending=False).head(1)['desc'].values

**Observations:** Deleting all the features 'month', 'campaign', 'cons.conf.idx', 'age', 'previous' has led to the best score. We will now do parameter tuning and see the results.

<a id='decision-tree-2'></a>
##### 5.2.2 Model tuning for DecisionTreeClassifier
[back to TOC](#toc)

Tuning the parameters using GridsearchCV

In [38]:
# define models and parameters
model = DecisionTreeClassifier()
max_depths = [1, 2, 3, 4, 5, 6, 10, 15, 20, None]
min_samples_leafs = [1, 2, 5, 10, 15, 20, 100, 200, 500]
max_features = ["auto", "sqrt", "log2", None]
criteria = ["gini", "entropy"]

drop_features = ['month', 'campaign', 'cons.conf.idx', 'age', 'previous']

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)

dict_train_new = df_train_copy.to_dict(orient='records')
X_train_new = dv.fit_transform(dict_train_new)


# define grid search
grid = dict(max_depth=max_depths,min_samples_leaf=min_samples_leafs,max_features=max_features,criterion=criteria)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search = GridSearchCV(model, param_grid=grid, n_jobs=-1, cv=cv, scoring='roc_auc',error_score=0)
grid_result = grid_search.fit(X_train_new, y_train)

In [39]:
# summarize results
# print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
columns = ['algo','mean_test_score','std_test_score','params']
df_gridcv_results = pd.DataFrame(columns=columns)
for mean, stdev, param in zip(means, stds, params):
    score_entry = {"algo": "decisiontree", "mean_test_score": mean, "std_test_score": stdev, "params": param}
    df_gridcv_results = df_gridcv_results.append(score_entry,ignore_index=True)

In [78]:
df_gridcv_results.sort_values(by='mean_test_score',ascending=False).head(20)

In [41]:
# df_gridcv_results.to_csv('project-decisiontree-gridcv-scores-29oct.csv')

<a id='random-forest'></a>
#### 5.3 RandomForestClassifier
[back to TOC](#toc)

In [38]:
from sklearn.ensemble import RandomForestClassifier

In [39]:
# Function to do one-hot encoding, train the model and evaluate model on validation data

def rf_evaluate(new_features,df_train_copy,df_val_copy,model):
    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
    model.fit(X_train_new,y_train)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    y_pred_new = model.predict_proba(X_val_new)[:,1]    
    roc_auc_val = roc_auc_score(y_val,y_pred_new)

    #Returning dv als, so that we can use this to check decision made by decisiontree
    return roc_auc_val, dv

In [40]:
#Model with RandomForestClassifier.

model = RandomForestClassifier(random_state=42, n_jobs=-1)

In [79]:
#Updated baseline score with DecisionTree with max_depth = 4 and no one-hot encoding. Experiment 2
entry_desc = f'baseline score. all features except duration'

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()
features_list = list(df_train_copy.columns.values)

this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)
print(this_auc, this_auc - baseline_auc)
score_entry = {"algo": "randomforest", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [80]:
exp_scores

**Observations:** The baseline score using RandomForest is slightly lesser than baseline using LogisticRegression

<a id='random-forest-1'></a>
##### 5.3.1 Experiments to improve the score using RandomForestClassifier
[back to TOC](#toc)

In [81]:
#Experiment 2

drop_features = ['loan','housing','day_of_week','marital']   # Features observed to be least significant in EDA

#Defining description based on features being dropped
entry_desc = "EDA obs deleted features 'loan','housing','day_of_week','marital'"

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)
features_list = list(df_train_copy.columns.values)

this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)
print(this_auc)
print('%.6f' % (round(baseline_auc - this_auc,6)))
score_entry = {"algo": "randomforest", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [82]:
exp_scores

**Observations:** After deleting least important features as per EDA, with Randomforest, there is a slight impact on score compared to baseline (as compared to minimal impact when using LogisticRegression or DecisionTree)

Let us evaluate scores by transforming the numerical features

In [45]:
#Experiment 3

scaler_std = preprocessing.StandardScaler()

df_train_copy = df_train.copy()
X_train_num = scaler_std.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
X_val_num = scaler_std.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "randomforest", "desc": "standardscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [83]:
exp_scores.sort_values(by='score',ascending=False)

In [47]:
#Experiment 4

scaler_minmax = preprocessing.MinMaxScaler()
# df_train_copy.reset_index(drop=True,inplace=True)
df_train_copy = df_train.copy()

X_train_num = scaler_minmax.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
# df_val_copy.reset_index(drop=True,inplace=True)
X_val_num = scaler_minmax.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "randomforest", "desc": "minmaxscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)


In [84]:
exp_scores.sort_values(by='score',ascending=False)

In [49]:
#Experiment 5

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy = pd.concat([df_train_copy, df_train_poly], axis=1)

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy = pd.concat([df_val_copy, df_val_poly], axis=1)

features_list = df_train_copy.columns
this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "randomforest", "desc": "polynomialfeatures", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [85]:
exp_scores.sort_values(by='score',ascending=False)

In [51]:
#Experiment 6

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy.drop(t_num_cols,axis=1,inplace=True)
poly_cols = list(df_train_poly.columns.values)
df_train_copy[poly_cols] = df_train_poly[poly_cols]

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy.drop(t_num_cols,axis=1,inplace=True)
df_val_copy[poly_cols] = df_val_poly[poly_cols]

features_list = df_train_copy.columns
this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "randomforest", "desc": "polynomialfeatures replacing org. features", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [86]:
exp_scores.sort_values(by='score',ascending=False)

In [53]:
#Experiment 7 - power transform 'campaign'
cols_transform = ['campaign']
power_transformer = preprocessing.PowerTransformer(method='yeo-johnson')


df_train_copy = df_train.copy()
df_train_trans = pd.DataFrame(power_transformer.fit_transform(df_train_copy[cols_transform]),columns=cols_transform)
df_train_copy[cols_transform] = df_train_trans[cols_transform]


df_val_copy = df_val.copy()
df_val_trans = pd.DataFrame(power_transformer.transform(df_val_copy[cols_transform]),columns=cols_transform)
df_val_copy[cols_transform] = df_val_trans[cols_transform]

features_list = df_train_copy.columns
this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)

score_entry = {"algo": "randomforest", "desc": "powertransform 'campaign'", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [87]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** From the experiments so far, we can see that using polynomial features increased the score in comparison with baseline of randomforest, although is still less than the main baseline score (with logistic regression). Also other transformations led to decreased score. Thus we will use polynomial features for further experiments.

In [55]:
# Modified function to do one-hot encoding, replace numerical features with polynomial features, train the model and evaluate model on validation data

def rf_evaluate(new_features,df_train_copy,df_val_copy,model):
    poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

    df_train_copy = df_train_copy[new_features]
    df_val_copy = df_val_copy[new_features]
    
    num_cols = list(df_train_copy.columns[df_train_copy.dtypes != 'object'])
    X_train_num = poly.fit_transform(df_train_copy[num_cols])
    df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
    df_train_copy.drop(num_cols,axis=1,inplace=True)
    poly_cols = list(df_train_poly.columns.values)
    df_train_copy[poly_cols] = df_train_poly[poly_cols]

    X_val_num = poly.transform(df_val_copy[num_cols])
    df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
    df_val_copy.drop(num_cols,axis=1,inplace=True)
    df_val_copy[poly_cols] = df_val_poly[poly_cols]

    new_features = list(df_train_copy.columns.values)
    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
    model.fit(X_train_new,y_train)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    y_pred_new = model.predict_proba(X_val_new)[:,1]    
    roc_auc_val = roc_auc_score(y_val,y_pred_new)

    #Returning dv also, so that we can use this to check decision made by randomforest
    return roc_auc_val, dv

Evaluate effect on score by dropping one feature at a time

In [88]:
#Score for each feature being dropped and it's difference with baseline score

#Experiment 8 onwards

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

features_list = df_train_copy.columns
feature_roc_auc_scores = []

for drop_feature in list(features_list):
    new_features = list(features_list.drop(drop_feature))
    this_auc, dv = rf_evaluate(new_features,df_train_copy,df_val_copy,model)
    feature_roc_auc_scores.append((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc)))
    print(str((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc))))
    entry_desc = f"delete one feature {drop_feature}"
    score_entry = {"algo": "randomforest", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [89]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** When compared with polynomial features using randomforest:

deleting one feature has positive impact (increased score) for features - 'cons.price.idx', 'emp.var.rate', 'cons.conf.idx', 'housing', 'previous', 'nr.employed', 'month', 'pdays', 'education', 'poutcome'

while deleting one feature has negative impact (reduced score) for features - 'default', 'day_of_week', 'loan', 'marital', 'contact', 'age', 'campaign', 'job', 'euribor3m'

Let us experiment with removing a group of features that affected positively.

In [58]:
import itertools

#Function to get all combinations of all features
def drop_features(features):
    drop_features_list = []
    for L in range(2, len(features)+1):
        for subset in itertools.combinations(features, L):
            drop_features_list.append(list(subset))
    return drop_features_list

In [59]:
top_features = ['cons.price.idx', 'emp.var.rate', 'cons.conf.idx', 'housing', 'previous', 'nr.employed', 'month', 'pdays', 'education', 'poutcome']

In [1]:
#Experiment 27 onwards

#Testing combination of multiple feature drops

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

drop_features_list = drop_features(top_features)

for drop_features in drop_features_list:
    #Defining description based on features being dropped
    entry_desc = f'deleted feature {str(drop_features).replace("[","").replace("]","")}'

    df_train_copy = df_train.copy()
    df_train_copy.drop(drop_features,axis=1,inplace=True)
    df_val_copy = df_val.copy()
    df_val_copy.drop(drop_features,axis=1,inplace=True)
    features_list = list(df_train_copy.columns.values)

    this_auc, dv = rf_evaluate(features_list,df_train_copy,df_val_copy,model)
    print(this_auc, this_auc - baseline_auc)
    score_entry = {"algo": "randomforest", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [90]:
exp_scores.sort_values(by='score',ascending=False).head(20)

In [None]:
# exp_scores.to_csv('rf_exp_scores-29oct.csv')

**Observations:** Dropping multiple features that had positive impact is improving the score for several combinations. Will choose the top scoring combination.

In [91]:
exp_scores.sort_values(by='score',ascending=False).head(1)['desc'].values

Will tune parameters and check the results

<a id='random-forest-2'></a>
##### 5.3.2 Model tuning for RandomForestClassifier
[back to TOC](#toc)

In [92]:
# define models and parameters
model = RandomForestClassifier(random_state=42)
n_estimators = range(10, 120, 10)
min_samples_leafs = [5, 10, 15, 20]
max_features = ["auto", "sqrt", None]
max_depths = [2, 3, 4, 5, 7, 10, 15, 30]
warm_start = [True]
random_state = [42]
n_jobs = [-1]


drop_features = ['cons.price.idx', 'emp.var.rate', 'cons.conf.idx', 'housing', 'nr.employed', 'pdays', 'education']

# define grid search
random_grid = {
                'n_estimators': n_estimators,
                'min_samples_leaf': min_samples_leafs,
                'max_features': max_features,
                'max_depth': max_depths,
                'warm_start': warm_start,
                'random_state': random_state,
                'n_jobs': n_jobs,
              }



df_full_train_copy = df_full_train.copy()
df_full_train_copy.drop(drop_features,axis=1,inplace=True)
y_full_train = df_full_train_copy['y']
del df_full_train_copy['y']

dict_full_train_new = df_full_train_copy.to_dict(orient='records')
X_full_train_new = dv.fit_transform(dict_full_train_new)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search = GridSearchCV(model, param_grid=random_grid, n_jobs=-1, cv=cv, scoring='roc_auc',error_score=0,verbose=1)
grid_result = grid_search.fit(X_full_train_new, y_full_train)

In [51]:
# summarize results
# print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
columns = ['algo','mean_test_score','std_test_score','params']
df_gridcv_results = pd.DataFrame(columns=columns)
for mean, stdev, param in zip(means, stds, params):
    score_entry = {"algo": "randomforest", "mean_test_score": mean, "std_test_score": stdev, "params": param}
    df_gridcv_results = df_gridcv_results.append(score_entry,ignore_index=True)

In [52]:
df_gridcv_results.to_csv('project-randomforest-gridcv-scores-30oct.csv')

In [93]:
df_gridcv_results.sort_values(by='mean_test_score',ascending=False).head(20)

<a id='xgb'></a>
#### 5.4 XGBoost
[back to TOC](#toc)

In [38]:
import xgboost as xgb
# from xgboost import XGBClassifier

In [94]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

feature_names = dv.get_feature_names()
dtrain = xgb.DMatrix(X_train,label=y_train,feature_names=feature_names)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)
dval = xgb.DMatrix(X_val,label=y_val,feature_names=feature_names)

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

model = xgb.train(xgb_params,dtrain)

y_pred = model.predict(dval)
this_auc = roc_auc_score(y_val,y_pred)
this_auc

In [45]:
entry_desc = 'baseline xgb'
score_entry = {"algo": "xgb", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [95]:
exp_scores

**Observations:** Score using XGBoost without any feature engineering or parameter tuning is better than the baseline. Let us check how the score improves / degrades with higher number of rounds of training.

In [47]:
def parse_xgb_output(output):
    results = []

    for line in output.stdout.strip().split('\n'):
        it_line, train_line, val_line = line.split('\t')

        it = int(it_line.strip('[]'))
        train = float(train_line.split(':')[1])
        val = float(val_line.split(':')[1])

        results.append((it, train, val))
    
    columns = ['num_iter', 'train_auc', 'val_auc']
    df_results = pd.DataFrame(results, columns=columns)
    return df_results

In [48]:
%%capture output

#Experiment 2

watchlist = [(dtrain,'train'),(dval,'val')]

model = xgb.train(xgb_params,dtrain,num_boost_round=200,evals=watchlist)

In [49]:
df_score = parse_xgb_output(output)

In [96]:
plt.plot(df_score.num_iter, df_score.train_auc, label='train')
plt.plot(df_score.num_iter, df_score.val_auc, label='val')
plt.legend()

We can see that while score on validation data improves a bit initially, somewhere from the initial rounds onwards it starts decreasing while score on training continues to increase as number of rounds are increased (this is overfitting). Let us check till what number of rounds is the score actually increasing

In [97]:
plt.plot(df_score.num_iter, df_score.val_auc, label='val')
plt.legend()

In [98]:
df_score.sort_values(by='val_auc',ascending=False)

In [99]:
df_score[:20]

**Observations:** We can see that around num_boost_round 3 the score is highest and then it starts decreasing (due to possible overfitting). We will use num_boost_round=3

In [100]:
#Experiment 2 continued ..

model = xgb.train(xgb_params,dtrain,num_boost_round=3)

y_pred = model.predict(dval)
this_auc = roc_auc_score(y_val,y_pred)
this_auc

In [55]:
entry_desc = 'num_boost_round=3'
score_entry = {"algo": "xgb", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [101]:
exp_scores

**Observations** Limiting num_boost_round to 3 has increased the score

<a id='xgb-1'></a>
##### 5.4.1 Experiments to improve the score using XGBoost
[back to TOC](#toc)

Based on out EDA, we had seen that the features 'loan','housing','day_of_week','marital' had very less mutual information with the target. Let us check the score of model trained without these features.

In [61]:
#Function to do one-hot encoding, train the model and evaluate model on validation data.

#num_boost_round=3

def xgb_evaluate(new_features,df_train_copy,df_val_copy,xgb_params):
    df_train_copy = df_train_copy[new_features]
    df_val_copy = df_val_copy[new_features]
    
    dv = DictVectorizer(sparse=False)

    train_dict = df_train_copy.to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    feature_names = dv.get_feature_names()
    dtrain = xgb.DMatrix(X_train,label=y_train,feature_names=feature_names)

    val_dict = df_val_copy.to_dict(orient='records')
    X_val = dv.transform(val_dict)
    dval = xgb.DMatrix(X_val,label=y_val,feature_names=feature_names)

    model = xgb.train(xgb_params,dtrain,num_boost_round=3)

    y_pred = model.predict(dval)
    roc_auc_val = roc_auc_score(y_val,y_pred)

    #Returning dv also, so that we can use this to check decision made by xgboost
    return roc_auc_val, dv

In [102]:
#Experiment 3

drop_features = ['loan','housing','day_of_week','marital']   # Features observed to be least significant in EDA

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

#Defining description based on features being dropped
entry_desc = "EDA obs deleted features 'loan','housing','day_of_week','marital'"

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)
features_list = list(df_train_copy.columns.values)

this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)
print(this_auc)
print('%.6f' % (round(baseline_auc - this_auc,6)))
score_entry = {"algo": "xgb", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [103]:
exp_scores

The difference in score is small after removing the least important features as per EDA, which confirms our EDA observation

Let us evaluate scores by transforming the numerical features

In [70]:
#Experiment 4

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

scaler_std = preprocessing.StandardScaler()

df_train_copy = df_train.copy()
X_train_num = scaler_std.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
X_val_num = scaler_std.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)

score_entry = {"algo": "xgb", "desc": "standardscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [104]:
exp_scores.sort_values(by='score',ascending=False)

In [72]:
#Experiment 5

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

scaler_minmax = preprocessing.MinMaxScaler()
df_train_copy = df_train.copy()

X_train_num = scaler_minmax.fit_transform(df_train_copy[t_num_cols])
df_train_copy[t_num_cols] = pd.DataFrame(X_train_num,columns=t_num_cols)

df_val_copy = df_val.copy()
X_val_num = scaler_minmax.transform(df_val_copy[t_num_cols])
df_val_copy[t_num_cols] = pd.DataFrame(X_val_num,columns=t_num_cols)

features_list = df_train_copy.columns
this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)

score_entry = {"algo": "xgb", "desc": "minmaxscaler", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [105]:
exp_scores.sort_values(by='score',ascending=False)

In [74]:
#Experiment 6

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy = pd.concat([df_train_copy, df_train_poly], axis=1)

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy = pd.concat([df_val_copy, df_val_poly], axis=1)

features_list = df_train_copy.columns
this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)

score_entry = {"algo": "xgb", "desc": "polynomialfeatures", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [106]:
exp_scores.sort_values(by='score',ascending=False)

In [76]:
#Experiment 7

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
X_train_num = poly.fit_transform(df_train_copy[t_num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy.drop(t_num_cols,axis=1,inplace=True)
poly_cols = list(df_train_poly.columns.values)
df_train_copy[poly_cols] = df_train_poly[poly_cols]

df_val_copy = df_val.copy()
X_val_num = poly.fit_transform(df_val_copy[t_num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy.drop(t_num_cols,axis=1,inplace=True)
df_val_copy[poly_cols] = df_val_poly[poly_cols]

features_list = df_train_copy.columns
this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)

score_entry = {"algo": "xgb", "desc": "polynomialfeatures replacing org. features", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [107]:
exp_scores.sort_values(by='score',ascending=False)

In [78]:
#Experiment 8 - power transform 'campaign'
xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

cols_transform = ['campaign']
power_transformer = preprocessing.PowerTransformer(method='yeo-johnson')


df_train_copy = df_train.copy()
df_train_trans = pd.DataFrame(power_transformer.fit_transform(df_train_copy[cols_transform]),columns=cols_transform)
df_train_copy[cols_transform] = df_train_trans[cols_transform]


df_val_copy = df_val.copy()
df_val_trans = pd.DataFrame(power_transformer.transform(df_val_copy[cols_transform]),columns=cols_transform)
df_val_copy[cols_transform] = df_val_trans[cols_transform]

features_list = df_train_copy.columns
this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)

score_entry = {"algo": "xgb", "desc": "powertransform 'campaign'", "score": this_auc, "diff": this_auc-baseline_auc}
exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [108]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** From the experiments so far, we can see that using polynomial features increased the score in comparison with baseline of xgb, however all the xgb experiments have score better than baseline with logistic regression. Other transformations led to decreased score. Thus we will use polynomial features for further experiments.

In [80]:
# Modified function to do one-hot encoding, replace numerical features with polynomial features, train the model and evaluate model on validation data

def xgb_evaluate(new_features,df_train_copy,df_val_copy,xgb_params):

    xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

    poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

    df_train_copy = df_train_copy[new_features]
    df_val_copy = df_val_copy[new_features]
    
    num_cols = list(df_train_copy.columns[df_train_copy.dtypes != 'object'])
    X_train_num = poly.fit_transform(df_train_copy[num_cols])
    df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
    df_train_copy.drop(num_cols,axis=1,inplace=True)
    poly_cols = list(df_train_poly.columns.values)
    df_train_copy[poly_cols] = df_train_poly[poly_cols]

    X_val_num = poly.transform(df_val_copy[num_cols])
    df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
    df_val_copy.drop(num_cols,axis=1,inplace=True)
    df_val_copy[poly_cols] = df_val_poly[poly_cols]

    new_features = list(df_train_copy.columns.values)
    dict_train_new = df_train_copy[new_features].to_dict(orient='records')
    X_train_new = dv.fit_transform(dict_train_new)
 
    feature_names = dv.get_feature_names()
    dtrain = xgb.DMatrix(X_train_new,label=y_train,feature_names=feature_names)
    
    dict_val_new = df_val_copy[new_features].to_dict(orient='records')
    X_val_new = dv.transform(dict_val_new)
    dval = xgb.DMatrix(X_val_new,label=y_val,feature_names=feature_names)

    model = xgb.train(xgb_params,dtrain,num_boost_round=3)

    y_pred = model.predict(dval)
    roc_auc_val = roc_auc_score(y_val,y_pred)

    #Returning dv also, so that we can use this to check decision made by xgboost
    return roc_auc_val, dv

Evaluate effect on score by dropping one feature at a time

In [109]:
#Score for each feature being dropped and it's difference with baseline score

#Experiment 9 onwards

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

features_list = df_train_copy.columns
feature_roc_auc_scores = []

for drop_feature in list(features_list):
    new_features = list(features_list.drop(drop_feature))
    this_auc, dv = xgb_evaluate(new_features,df_train_copy,df_val_copy,xgb_params)
    feature_roc_auc_scores.append((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc)))
    print(str((drop_feature,this_auc,this_auc-baseline_auc,np.abs(this_auc-baseline_auc))))
    entry_desc = f"delete one feature {drop_feature}"
    score_entry = {"algo": "xgb", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [110]:
df_feature_roc_auc = pd.DataFrame(feature_roc_auc_scores,columns=['feature','new_roc_auc_score','diff_roc_auc_score','abs_diff_roc_auc_score'])
df_sorted = df_feature_roc_auc.sort_values(by='diff_roc_auc_score',ascending=False)
df_sorted

In [111]:
plt.figure(figsize=(10, 8))
plt.barh(df_sorted['feature'],df_sorted['diff_roc_auc_score'])

**Observations:** Score is higher when dropping any of the feature (one feature at a time). Lets us check these scores in comparison with score when using polynomial features

In [112]:
exp_scores.sort_values(by='score',ascending=False)

**Observations:** We can see that when compared to base score with polynomial features, score is higher only when dropping any of 'job', 'pdays', 'education'. Let us check if dropping any combination of these features further increases the score.

In [85]:
import itertools

#Function to get all combinations of all features
def drop_features(features):
    drop_features_list = []
    for L in range(2, len(features)+1):
        for subset in itertools.combinations(features, L):
            drop_features_list.append(list(subset))
    return drop_features_list

In [86]:
top_features = ['job', 'pdays', 'education']

In [113]:
drop_features(top_features)

In [114]:
#Experiment 28 onwards

xgb_params = {'seed': 42, 'eval_metric': 'auc', 'n_jobs': -1}

df_train_copy = df_train.copy()
df_val_copy = df_val.copy()

drop_features_list = drop_features(top_features)

for drop_features in drop_features_list:
    #Defining description based on features being dropped
    entry_desc = f'deleted feature {str(drop_features).replace("[","").replace("]","")}'

    df_train_copy = df_train.copy()
    df_train_copy.drop(drop_features,axis=1,inplace=True)
    df_val_copy = df_val.copy()
    df_val_copy.drop(drop_features,axis=1,inplace=True)
    features_list = list(df_train_copy.columns.values)

    this_auc, dv = xgb_evaluate(features_list,df_train_copy,df_val_copy,xgb_params)
    print(this_auc, this_auc - baseline_auc)
    score_entry = {"algo": "xgb", "desc": entry_desc, "score": this_auc, "diff": this_auc - baseline_auc}
    exp_scores = exp_scores.append(score_entry,ignore_index=True)

In [115]:
exp_scores.sort_values(by='score',ascending=False).head(20)

**Observations:** Dropping 'job' or 'pdays' or 'education' or 'job', 'pdays' or 'pdays', 'education' have better scores, with dropping only 'job' having the highest score. The scores are almost similar for all these top experiments, thus we can choose to use any of these. Will choose dropping 'job' and perform parameter tuning.

<a id='xgb-2'></a>
##### 5.4.2 Model tuning for XGBoost
[back to TOC](#toc)

Useful references
* https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
* https://www.kaggle.com/phunter/xgboost-with-gridsearchcv

In [78]:
def eval_model(parameters,X_full_train_new,y_full_train):
    model = xgb.XGBClassifier(use_label_encoder=False)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(model, param_grid=parameters, n_jobs=-1, cv=cv, scoring='roc_auc',error_score=0)
    grid_result = grid_search.fit(X_full_train_new, y_full_train)
    
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    columns = ['algo','mean_test_score','std_test_score','params']
    df_gridcv_results = pd.DataFrame(columns=columns)
    for mean, stdev, param in zip(means, stds, params):
        score_entry = {"algo": "xgb", "mean_test_score": mean, "std_test_score": stdev, "params": param}
        df_gridcv_results = df_gridcv_results.append(score_entry,ignore_index=True)
        
    return df_gridcv_results

In [80]:
#Performing evaluation of best parameters for XGB using below parameters was taking a very long time using CPUs [even AWS instance with 36 cores - c4.8xlarge - was taking more than 2/3 hours].
#Hence used Kaggle with GPU and hence some parameters like tree_method, gpu_id, predictor configured to use GPU.

# define models and parameters
dv = DictVectorizer(sparse=False)

drop_features = ['job']

boosters = ['gbtree']
eval_metric = ['auc']
objectives = ['binary:logistic']
random_state = [42]

tree_method = ['gpu_hist']
gpu_id = [0]
predictor = ['gpu_predictor']

max_depths = [3, 4, 5, 6]
min_child_weights = [4, 5, 6, 8, 10]
colsample_bytrees = [0.4, 0.6, 0.8]


# define grid search
parameters = {
                'random_state': random_state,
                'eval_metric': eval_metric,
                'booster': boosters,
                'objective': objectives,
                'tree_method': tree_method,
                'gpu_id': gpu_id,
                'predictor': predictor,
                'max_depth': max_depths,
                'min_child_weight': min_child_weights,
                'colsample_bytree': colsample_bytrees,
              }

df_full_train_copy = df_full_train.copy()
df_full_train_copy.drop(drop_features,axis=1,inplace=True)
y_full_train = df_full_train_copy['y']
del df_full_train_copy['y']

dict_full_train_new = df_full_train_copy.to_dict(orient='records')
X_full_train_new = dv.fit_transform(dict_full_train_new)

learning_params = [
    (1.0, 500),
    (0.3, 800),
    (0.1, 1000),
    (0.05, 1500),
    (0.01, 3000),
]


all_gridcv_results = pd.DataFrame()

for learning_rate, n_estimators in learning_params:
    print(learning_rate, n_estimators)
    parameters['learning_rate'] = [learning_rate]
    parameters['n_estimators'] = [n_estimators]
    df_gridcv_results = eval_model(parameters,X_full_train_new,y_full_train)
    all_gridcv_results = pd.concat([all_gridcv_results,df_gridcv_results],axis=0,ignore_index=True)



In [None]:
all_gridcv_results.to_csv('xgb-gridcv-results-30oct.csv')

<a id='final-model'></a>
### 6. Final model
[back to TOC](#toc)

<a id='choose-final-model'></a>
#### 6.1 Compare results from hyper-parameter tuning for the different models and choose final model
[back to TOC](#toc)

In [38]:
files = [
    'work-dump/project-xgb-gridcv-scores-30oct.csv',
    'work-dump/project-randomforest-gridcv-scores-30oct.csv',
    'work-dump/project-decisiontree-gridcv-scores-29oct.csv',
    'work-dump/project-logistic-regression-gridcv-scores-29oct.csv'
]

In [39]:
df_cons_scores = pd.DataFrame()

for file in files:
    df_tmp = pd.read_csv(file)
    df_cons_scores = pd.concat([df_cons_scores,df_tmp],axis=0,ignore_index=True)
del df_cons_scores['Unnamed: 0']
del df_tmp

In [116]:
df_cons_scores['algo'].value_counts()

Lets find top 4 scores of each model

In [117]:
df_cons_scores.sort_values(by='mean_test_score',ascending=False).groupby(['algo']).head(4)

In [118]:
df_cons_scores.sort_values(by='mean_test_score',ascending=False).head(20)

In [119]:
plt.scatter(df_cons_scores['std_test_score'],df_cons_scores['mean_test_score'])

Using Plotly since it provides interactive graphs - so that we can see what the values for the points of our interest are.

In [14]:
import plotly.express as px

In [120]:
fig = px.scatter(df_cons_scores, x="std_test_score", y="mean_test_score")
fig.show()

**Note** Plotly graphs appear blank when viewing previously run notebook

**Observations:** We can see that 'mean_test_score' of 0.803311 has 'std_test_score' of 0.005255258 seems to be the overall best score with least std. deviation. So we will check which algorithm and what parameters got this score. From the the df_cons_scores sorted above, we can see that it is the 2nd entry from top (index 241)

In [121]:
df_cons_scores.iloc[241].values

Thus, XGBoostClassifier with parameters ['booster': 'gbtree', 'colsample_bytree': 0.4, 'eval_metric': 'auc',  'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 3000, 'objective': 'binary:logistic', 'random_state': 42] got us the best score. Now we will train our final model using this.

In [51]:
def parse_xgb_output(output):
    results = []

    for line in output.stdout.strip().split('\n'):
        it_line, train_line, val_line = line.split('\t')

        it = int(it_line.strip('[]'))
        train = float(train_line.split(':')[1])
        val = float(val_line.split(':')[1])

        results.append((it, train, val))
    
    columns = ['num_iter', 'train_auc', 'val_auc']
    df_results = pd.DataFrame(results, columns=columns)
    return df_results

In [53]:
%%capture output

drop_features = 'job'

xgb_params = {
    'seed': 42, 
    'eval_metric': 'auc', 
    'n_jobs': -1,
    'booster': 'gbtree', 
    'colsample_bytree': 0.4,
    'learning_rate': 0.01, 
    'max_depth': 3, 
    'min_child_weight': 5, 
#     'n_estimators': 3000, 
    'objective': 'binary:logistic', 
    'random_state': 42,
}

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)

num_cols = list(df_train_copy.columns[df_train_copy.dtypes != 'object'])
X_train_num = poly.fit_transform(df_train_copy[num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy.drop(num_cols,axis=1,inplace=True)
poly_cols = list(df_train_poly.columns.values)
df_train_copy[poly_cols] = df_train_poly[poly_cols]

X_val_num = poly.transform(df_val_copy[num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy.drop(num_cols,axis=1,inplace=True)
df_val_copy[poly_cols] = df_val_poly[poly_cols]

new_features = list(df_train_copy.columns.values)
dict_train_new = df_train_copy[new_features].to_dict(orient='records')
X_train_new = dv.fit_transform(dict_train_new)

feature_names = dv.get_feature_names()
dtrain = xgb.DMatrix(X_train_new,label=y_train,feature_names=feature_names)

dict_val_new = df_val_copy[new_features].to_dict(orient='records')
X_val_new = dv.transform(dict_val_new)
dval = xgb.DMatrix(X_val_new,label=y_val,feature_names=feature_names)

watchlist = [(dtrain,'train'),(dval,'val')]

model = xgb.train(xgb_params,dtrain,num_boost_round=3000,evals=watchlist)

In [54]:
df_score = parse_xgb_output(output)

In [122]:
plt.plot(df_score.num_iter, df_score.train_auc, label='train')
plt.plot(df_score.num_iter, df_score.val_auc, label='val')
plt.legend()

In [123]:
plt.plot(df_score.num_iter, df_score.val_auc, label='val')
plt.legend()

In [124]:
fig = px.line(df_score, x="num_iter", y="val_auc")
fig.show()

**Note:** Plotly graphs appear blank when viewing previously run notebook

**Observations:** We can see that around iteration 335 the score is 0.79351, then it drops and then from iteration 1286 again it starts increasing upto iteration 2330 - where it is highest 0.79461 (but the increase is very minimal - 0.0011 compared to iter 335) and then its pretty stable. Thus, for practical purposes we will choose training upto iteration 335 as it will take lesser compute resources/time to train and still achieve optimal score possible.

In [125]:
df_score.sort_values(by='val_auc',ascending=False).head(20)

Performing a last training with iteration 335 and validation on validation dataset, before training the final model on full_train dataset

In [126]:
drop_features = 'job'

xgb_params = {
    'seed': 42, 
    'eval_metric': 'auc', 
    'n_jobs': -1,
    'booster': 'gbtree', 
    'colsample_bytree': 0.4,
    'learning_rate': 0.01, 
    'max_depth': 3, 
    'min_child_weight': 5, 
#     'n_estimators': 3000, 
    'objective': 'binary:logistic', 
    'random_state': 42,
}

dv = DictVectorizer(sparse=False)
poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_train_copy = df_train.copy()
df_train_copy.drop(drop_features,axis=1,inplace=True)
df_val_copy = df_val.copy()
df_val_copy.drop(drop_features,axis=1,inplace=True)

num_cols = list(df_train_copy.columns[df_train_copy.dtypes != 'object'])
X_train_num = poly.fit_transform(df_train_copy[num_cols])
df_train_poly = pd.DataFrame(X_train_num, columns=[f"poly_{i}" for i in range(X_train_num.shape[1])])
df_train_copy.drop(num_cols,axis=1,inplace=True)
poly_cols = list(df_train_poly.columns.values)
df_train_copy[poly_cols] = df_train_poly[poly_cols]

X_val_num = poly.transform(df_val_copy[num_cols])
df_val_poly = pd.DataFrame(X_val_num, columns=[f"poly_{i}" for i in range(X_val_num.shape[1])])
df_val_copy.drop(num_cols,axis=1,inplace=True)
df_val_copy[poly_cols] = df_val_poly[poly_cols]

new_features = list(df_train_copy.columns.values)
dict_train_new = df_train_copy[new_features].to_dict(orient='records')
X_train_new = dv.fit_transform(dict_train_new)

feature_names = dv.get_feature_names()
dtrain = xgb.DMatrix(X_train_new,label=y_train,feature_names=feature_names)

dict_val_new = df_val_copy[new_features].to_dict(orient='records')
X_val_new = dv.transform(dict_val_new)
dval = xgb.DMatrix(X_val_new,label=y_val,feature_names=feature_names)

model = xgb.train(xgb_params,dtrain,num_boost_round=335)

y_pred = model.predict(dval)
roc_auc_val = roc_auc_score(y_val,y_pred)
roc_auc_val

In [127]:
y_train_pred = model.predict(dtrain)
roc_auc_score(y_train,y_train_pred)

Let us have a look at various metrics like tp, tn, fp, fn, Precision, Recall, F1 score, TRP, FPR, AUC, Accuracy etc. to detemine which threshold should be used to make the decision on the prediction.

In [118]:
def all_metrics_dataframe(y_val,y_pred):
    thresholds = np.linspace(0,1,101)

    scores = []

    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)

        predicted_positive = (y_pred > t)
        predicted_negative = (y_pred <= t)

        tp = (actual_positive & predicted_positive).sum()
        tn = (actual_negative & predicted_negative).sum()
        fp = (predicted_positive & actual_negative).sum()
        fn = (predicted_negative & actual_positive).sum()

        scores.append((t, tp, fp, fn, tn))

    columns = ['threshold','tp','fp','fn','tn']
    df_scores = pd.DataFrame(scores, columns=columns)
    
    #'precision', 'recall', 'f1', 'tpr', 'fpr', 'auc', 'accuracy'

    df_scores['precision'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fp'])
    df_scores['recall'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fn'])
    df_scores['f1'] = 2 * (df_scores['precision'] * df_scores['recall']) / (df_scores['precision'] + df_scores['recall'])
    df_scores['tpr'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fn'])
    df_scores['fpr'] = df_scores['fp'] / (df_scores['fp'] + df_scores['tn'])
    df_scores['tnr'] = df_scores['tn'] / (df_scores['tn'] + df_scores['fp'])
    df_scores['fnr'] = df_scores['fn'] / (df_scores['fn'] + df_scores['tp'])
    df_scores['auc'] = auc(df_scores['fpr'], df_scores['tpr'])
    df_scores['accuracy'] = (df_scores['tp'] + df_scores['tn'])/(df_scores['tp'] + df_scores['fp'] + df_scores['tn'] + df_scores['fn'])
        
    return df_scores

In [119]:
df_all_scores = all_metrics_dataframe(y_val,y_pred)

In [128]:
df_all_scores[::2]

In [129]:
fig = px.line(df_all_scores, x="threshold", y=["auc","accuracy","f1","precision","recall"])
fig.show()

**Note:** Plotly graphs appear blank when viewing previously run notebook

**Observations:** In ideal scenario we would like our predictions to be always correct i.e. have fp and fn as 0. Practically it is not possible. We will see what threshold makes more sense to us to use for our predictions such that we have comparatively acceptable fp and fn.

In [130]:
fig = px.line(df_all_scores, x="threshold", y=["fp","fn"])
fig.show()

**Note:** Plotly graphs appear blank when viewing previously run notebook

**Observations:** 
* We can see that for threshold of 0, we have 0 fn but then we have 7260 fp (meaning we would be predicting that all customers will subscribe to Term deposit - which obviously is wrong). At the extreme right end where threshold is close to 1, we have 0 fp but 896 fn (meaning we would be predicting that none of the customers will be making Term deposit).
* At a threshold of 0.32 we see fp and fn crossing - where fp is around 440 and fn is around 487.

In [131]:
fig = px.line(df_all_scores, x="fpr", y="tpr")
fig.show()

Looking at the F1 scores and at what threshold is our F1 score the highest

In [132]:
def f1_dataframe(y_val,y_pred):
    thresholds = np.linspace(0,1,101)

    scores = []

    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)

        predicted_positive = (y_pred >= t)
        predicted_negative = (y_pred < t)

        tp = (actual_positive & predicted_positive).sum()
        tn = (actual_negative & predicted_negative).sum()
        fp = (predicted_positive & actual_negative).sum()
        fn = (predicted_negative & actual_positive).sum()

        scores.append((t, tp, fp, fn, tn))

    columns = ['threshold','tp','fp','fn','tn']
    df_scores = pd.DataFrame(scores, columns=columns)

    df_scores['precision'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fp'])
    df_scores['recall'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fn'])
    df_scores['f1'] = 2 * (df_scores['precision'] * df_scores['recall']) / (df_scores['precision'] + df_scores['recall'])
        
    return df_scores

df_f1_scores = f1_dataframe(y_val,y_pred)

print(df_f1_scores[df_f1_scores['f1'] == df_f1_scores['f1'].max()][['threshold','f1']])

plt.plot(df_f1_scores['threshold'], df_f1_scores['f1'], label='F1 score')
plt.xlabel('threshold')
plt.ylabel('F1 score')
plt.legend()

Looking at the TPR and FPR

In [133]:
def tpr_fpr_dataframe(y_val,y_pred):
    thresholds = np.linspace(0,1,101)

    scores = []

    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)

        predicted_positive = (y_pred >= t)
        predicted_negative = (y_pred < t)

        tp = (actual_positive & predicted_positive).sum()
        tn = (actual_negative & predicted_negative).sum()
        fp = (predicted_positive & actual_negative).sum()
        fn = (predicted_negative & actual_positive).sum()

    #     tpr = tp / (tp + fn)
    #     fpr = fp / (fp + tn)

        scores.append((t, tp, fp, fn, tn))

    columns = ['threshold','tp','fp','fn','tn']
    df_scores = pd.DataFrame(scores, columns=columns)

    df_scores['tpr'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fn'])
    df_scores['fpr'] = df_scores['fp'] / (df_scores['fp'] + df_scores['tn'])
        
    return df_scores

df_tpr_fpr_scores = tpr_fpr_dataframe(y_val,y_pred)

# print(round(auc(df_tpr_fpr_scores['fpr'],df_tpr_fpr_scores['tpr']),3))

plt.plot(df_tpr_fpr_scores['threshold'], df_tpr_fpr_scores['tpr'], label='TPR')
plt.plot(df_tpr_fpr_scores['threshold'], df_tpr_fpr_scores['fpr'], label='FPR')
plt.legend()

Looking at the Precision and Recall for our model

In [134]:
def precision_recall_dataframe(y_val,y_pred):
    thresholds = np.linspace(0,1,101)

    scores = []

    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)

        predicted_positive = (y_pred >= t)
        predicted_negative = (y_pred < t)

        tp = (actual_positive & predicted_positive).sum()
        tn = (actual_negative & predicted_negative).sum()
        fp = (predicted_positive & actual_negative).sum()
        fn = (predicted_negative & actual_positive).sum()

        scores.append((t, tp, fp, fn, tn))

    columns = ['threshold','tp','fp','fn','tn']
    df_scores = pd.DataFrame(scores, columns=columns)

    df_scores['precision'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fp'])
    df_scores['recall'] = df_scores['tp'] / (df_scores['tp'] + df_scores['fn'])
        
    return df_scores

df_pr_rec_scores = precision_recall_dataframe(y_val,y_pred)

plt.plot(df_pr_rec_scores['threshold'], df_pr_rec_scores['precision'], label='Precision')
plt.plot(df_pr_rec_scores['threshold'], df_pr_rec_scores['recall'], label='Recall')
plt.xlabel('threshold')
plt.ylabel('precision/recall')
plt.legend()

In [135]:
fig = px.line(df_pr_rec_scores, x="threshold", y=["precision","recall"])
# fig = px.line(df_pr_rec_scores, x="threshold", y="recall",labels='Recall')
fig.show()

In [136]:
#Precision Vs Recall
plt.plot(df_pr_rec_scores['recall'], df_pr_rec_scores['precision'], label='Precision Vs Recall')
plt.xlabel('recall')
plt.ylabel('precision')
plt.legend()

In [137]:
df_tpr_fpr_scores[::5]

Plotting the ROC curve

In [138]:
plt.figure(figsize=(5,5))
plt.plot(df_tpr_fpr_scores['fpr'],df_tpr_fpr_scores['tpr'],label='model')
plt.legend()

<a id='train-final-model'></a>
#### 6.2 Train final model
[back to TOC](#toc)

Training on full_train dataset for final model

In [139]:
drop_features = 'job'

xgb_params = {
    'seed': 42, 
    'eval_metric': 'auc', 
    'n_jobs': -1,
    'booster': 'gbtree', 
    'colsample_bytree': 0.4,
    'learning_rate': 0.01, 
    'max_depth': 3, 
    'min_child_weight': 5, 
#     'n_estimators': 3000, 
    'objective': 'binary:logistic', 
    'random_state': 42,
}

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)

df_full_train_copy = df_full_train.copy()
df_full_train_copy.drop(drop_features,axis=1,inplace=True)
y_full_train = df_full_train_copy['y']
del df_full_train_copy['y']

df_test_copy = df_test.copy()
df_test_copy.drop(drop_features,axis=1,inplace=True)

num_cols = list(df_full_train_copy.columns[df_full_train_copy.dtypes != 'object'])
X_full_train_num = poly.fit_transform(df_full_train_copy[num_cols])
df_full_train_poly = pd.DataFrame(X_full_train_num, columns=[f"poly_{i}" for i in range(X_full_train_num.shape[1])])
df_full_train_copy.drop(num_cols,axis=1,inplace=True)
poly_cols = list(df_full_train_poly.columns.values)
df_full_train_copy[poly_cols] = df_full_train_poly[poly_cols]

X_test_num = poly.transform(df_test_copy[num_cols])
df_test_poly = pd.DataFrame(X_test_num, columns=[f"poly_{i}" for i in range(X_test_num.shape[1])])
df_test_copy.drop(num_cols,axis=1,inplace=True)
df_test_copy[poly_cols] = df_test_poly[poly_cols]

new_features = list(df_full_train_copy.columns.values)
dict_full_train_new = df_full_train_copy[new_features].to_dict(orient='records')
X_full_train_new = dv.fit_transform(dict_full_train_new)

feature_names = dv.get_feature_names()
dfulltrain = xgb.DMatrix(X_full_train_new,label=y_full_train,feature_names=feature_names)

dict_test_new = df_test_copy[new_features].to_dict(orient='records')
X_test_new = dv.transform(dict_test_new)
dtest = xgb.DMatrix(X_test_new,label=y_test,feature_names=feature_names)

model = xgb.train(xgb_params,dfulltrain,num_boost_round=335)

y_pred_test = model.predict(dtest)
roc_auc_test = roc_auc_score(y_test,y_pred_test)
roc_auc_test

Save model to disk

In [47]:
model_output_file = f'xgb_model.bin'

with open(model_output_file,'wb') as f_out:
    pickle.dump((poly,dv,model),f_out)

## END of Notebook