# 1. Setup

In [1]:
!pip install -r requirements.txt



In [2]:
# general imports
import os
import wget
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
import time
# url data reader imports
from bs4 import BeautifulSoup
import requests

from zipfile import ZipFile

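# Scikit-Learn imports
# version checks compare parsed versions, not strings (assumes the `packaging` library is installed)
from packaging import version
import sklearn
assert version.parse(sklearn.__version__) >= version.parse("0.20")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# TensorFlow ≥2.0 is required
import tensorflow as tf
assert version.parse(tf.__version__) >= version.parse("2.0")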

# To plot pretty figures
%matplotlib inline

PROJECT_ROOT_DIR = "."
PROJECT = "Bank"
DATA_PATH = os.path.join(PROJECT_ROOT_DIR, "data", PROJECT)
os.makedirs(DATA_PATH, exist_ok=True)
OUT_PATH = os.path.join(PROJECT_ROOT_DIR, "output", PROJECT)
os.makedirs(OUT_PATH, exist_ok=True)

In [3]:
t0 = time.time()

# to make this notebook's output stable across runs
np.random.seed(42)

# Q3. Modelling Assessment

Build 2 binary classification models using any 2 of the following methods (in R or Python):
1. Logistic Regression
2. Random Forest
3. GBM
4. Xgboost
5. Neural Network

## 3.1 Download the Data

In [4]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00222/"
ext = 'zip'

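def list_files(url, ext=''):
    """Return the URLs of all links on the page at `url` whose href ends with `ext`."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # href=True skips anchors without an href attribute (node.get('href') would be None)
    return [url + node['href'] for node in soup.find_all('a', href=True) if node['href'].endswith(ext)]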

for fileurl in list_files(url, ext):
#     print(fileurl)
    filename = fileurl.split('/')[-1]
#     print(filename)
    wget.download(fileurl, out=os.path.join(DATA_PATH, filename))

100% [............................................................................] 579043 / 579043

In [5]:
os.listdir(DATA_PATH)

['bank (1).zip', 'bank-additional (1).zip', 'bank-additional.zip', 'bank.zip']
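
The listing above contains `bank (1).zip` and `bank-additional (1).zip`, which suggests that the download cell was run more than once and each file was saved again under a new name. A minimal sketch of one way to avoid that, assuming a hypothetical helper `download_if_missing` (not part of the original notebook) that skips files already on disk. The `fetch` argument defaults to the standard library's `urlretrieve` here; any callable taking a URL and a target path could be passed instead.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(fileurl, dest_dir, fetch=urlretrieve):
    """Download `fileurl` into `dest_dir` unless the file is already there.

    `fetch` is any callable taking (url, target_path). The helper and its
    name are illustrative assumptions, not part of the original notebook.
    """
    filename = fileurl.split('/')[-1]
    target = os.path.join(dest_dir, filename)
    if not os.path.exists(target):
        fetch(fileurl, target)
    return target
```

In the download loop, `download_if_missing(fileurl, DATA_PATH)` could take the place of the direct `wget.download` call.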

### 3.1.1 Read the Data

The description of the data states that there are four datasets.
We need
> (1) `bank-additional-full.csv` with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].[<sup>1</sup>](#fn1)

An examination of the archive `bank-additional.zip` shows that it contains a folder called `bank-additional` with the files we need.

<span id="fn1"><sup>1</sup>
    S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
</span>

In [6]:
zip_file = ZipFile(os.path.join(DATA_PATH, 'bank-additional.zip'))
df = pd.read_csv(
    zip_file.open('bank-additional/bank-additional-full.csv'),
    sep=';',
    quotechar='"',
)
df.head()

|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


## 3.2 Variable and Data Set Definitions

Let's define the output variable and split the data into train, dev, and test sets.

In [7]:
print(df['y'].value_counts())

no     36548
yes     4640
Name: y, dtype: int64


In [8]:
y = df['y']
X = df.drop(columns='y')

In [9]:
print(len(y))

41188


We see that the target variable is *yes* for approximately 11% of cases (4,640 of 41,188). If we simply split the data at random into 80-10-10% train/val/test sets, the share of positive examples in the small dev and test sets may drift away from 11% by chance. This, in turn, might bias our evaluation of model performance.

So, I think there are two options:
1. use stratified shuffling to ensure that the proportions of positive and negative examples are similar in the split data sets, or
2. use the test set from the website:
    > (2) `bank-additional.csv` with 10% of the examples (4119), randomly selected from (1), and 20 inputs.

I would personally prefer option 1.
But for the sake of completeness, let's see what the distribution of the target variable is in the test file from the website.

In [10]:
df_test_web = pd.read_csv(
    zip_file.open('bank-additional/bank-additional.csv'),
    sep=';',
    quotechar='"',
)
print(df_test_web['y'].value_counts())

no     3668
yes     451
Name: y, dtype: int64


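The share of positive examples in this file (451 of 4,119, about 10.9%) is close to the 11.3% in the full data set, so the file on the website is representative of the class balance, whether or not stratified shuffling was used to generate it.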

**Note:** In the PySpark exercise, the data was also split into train and test set. However, in this exercise, we are explicitly told that the test data is a subsample of the full data set.

Therefore, I will use stratified shuffling to split the data into train/dev/test sets.

In [11]:
print(X.shape)

(41188, 20)


In [12]:
print(X.dtypes)

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
dtype: object


In [13]:
# # I put this import into the Setup section at the beginning
# from sklearn.model_selection import train_test_split

In [14]:
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=42)

In [15]:
y_test.value_counts()

no     3655
yes     464
Name: y, dtype: int64

In [16]:
y_trainval.value_counts()

no     32893
yes     4176
Name: y, dtype: int64

In [17]:
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, stratify=y_trainval, test_size=0.1, random_state=42)

Now each of the train, dev, and test sets has roughly 11% positive outcomes.

**Note:** this splitting method helps us avoid the data mismatch problem, i.e. it makes our dev and test sets come from similar distributions.
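
The effect of `stratify` can be checked directly by comparing the share of positive labels in each subset. A minimal sketch on synthetic labels with roughly the same class balance as the bank data; the `y_toy*` names and the label counts are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with about 11% positives (illustrative values only)
y_toy = pd.Series(np.array(["yes"] * 1127 + ["no"] * 8873))

# Same two-step split as above: first hold out the test set, then the dev set
y_toy_trainval, y_toy_test = train_test_split(
    y_toy, stratify=y_toy, test_size=0.1, random_state=42)
y_toy_train, y_toy_val = train_test_split(
    y_toy_trainval, stratify=y_toy_trainval, test_size=0.1, random_state=42)

# Share of positive labels in each subset
pos_share = {
    name: (part == "yes").mean()
    for name, part in [("train", y_toy_train), ("val", y_toy_val), ("test", y_toy_test)]
}
print(pos_share)
```

With stratification, the three shares should differ only by rounding. The same dictionary comprehension applied to `y_train`, `y_val`, and `y_test` would give the check for the real data.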

Let's convert the dependent variable in all the sets into integers.

In [18]:
yesno_dict = {'yes': 1, 'no': 0}
y_train = y_train.map(yesno_dict).astype('int32')
y_val = y_val.map(yesno_dict).astype('int32')
y_test = y_test.map(yesno_dict).astype('int32')

## 3.3 Preprocessing Pipeline

### 3.3.1 Data Exploration

Let's first explore the data to see which variables are continuous, which need to be treated as categorical, and which should be transformed into dummy variables (one-hot encoded).

In [19]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33362 entries, 22373 to 31364
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             33362 non-null  int64  
 1   job             33362 non-null  object 
 2   marital         33362 non-null  object 
 3   education       33362 non-null  object 
 4   default         33362 non-null  object 
 5   housing         33362 non-null  object 
 6   loan            33362 non-null  object 
 7   contact         33362 non-null  object 
 8   month           33362 non-null  object 
 9   day_of_week     33362 non-null  object 
 10  duration        33362 non-null  int64  
 11  campaign        33362 non-null  int64  
 12  pdays           33362 non-null  int64  
 13  previous        33362 non-null  int64  
 14  poutcome        33362 non-null  object 
 15  emp.var.rate    33362 non-null  float64
 16  cons.price.idx  33362 non-null  float64
 17  cons.conf.idx   33362 non-n

We clearly see that the following variables (all of dtype `object`) are categorical or binary:
* job,
* marital,
* education,
* default,
* housing,
* loan,
* contact,
* month,
* day_of_week,
* poutcome.

Let's check what values they take.

In [20]:
for col in X_train.select_dtypes(exclude=['int64', 'float64']).columns:
    print(col)
    print(X_train[col].value_counts())
    print("\n")

job
admin.           8415
blue-collar      7510
technician       5443
services         3227
management       2386
retired          1383
entrepreneur     1187
self-employed    1142
housemaid         867
unemployed        823
student           720
unknown           259
Name: job, dtype: int64


marital
married     20220
single       9357
divorced     3721
unknown        64
Name: marital, dtype: int64


education
university.degree      9845
high.school            7711
basic.9y               4887
professional.course    4231
basic.4y               3372
basic.6y               1869
unknown                1430
illiterate               17
Name: education, dtype: int64


default
no         26415
unknown     6944
yes            3
Name: default, dtype: int64


housing
yes        17503
no         15040
unknown      819
Name: housing, dtype: int64


loan
no         27452
yes         5091
unknown      819
Name: loan, dtype: int64


contact
cellular     21145
telephone    12217
Name: contact, dtype: i

In [21]:
# a list of columns to be transformed from string to categorical numeric objects
str_cat_vars = X_train.select_dtypes(exclude=['int64', 'float64']).columns.values
str_cat_vars

array(['job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'poutcome'], dtype=object)

In [22]:
# a list of numeric columns to be scaled
num_vars = X_train.select_dtypes(include=['int64', 'float64']).columns.values
num_vars

array(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'],
      dtype=object)

### 3.3.2 Declare the Pipeline

I want my preprocessing pipeline to perform the following steps:
1. Transform text variables into categorical numeric;
2. Transform categorical variables into a set of binary variables (dummies, or one-hot encoded);
3. Standardize continuous variables.

**Note:** I treat variables of type `int64` as continuous. Sometimes such variables are better treated as categorical, in which case they would be transformed into a set of dummy variables as well.

In [23]:
# # I put these imports into the Setup section at the beginning
# from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
# from sklearn.compose import ColumnTransformer
# from sklearn.pipeline import Pipeline

# encoder = OrdinalEncoder()
# dummies = OneHotEncoder()

dummy_pipeline = Pipeline([
        ('encoder', OrdinalEncoder()),
        ('dummies', OneHotEncoder()),
    ])

# scaler = StandardScaler()

prep_pipeline = ColumnTransformer([
        ("cat", dummy_pipeline, str_cat_vars),
        ("num", StandardScaler(), num_vars),
    ])
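
`OneHotEncoder` in Scikit-Learn 0.20 and later accepts string columns directly, so the `OrdinalEncoder` step is not strictly required. A minimal sketch on made-up toy data (all `toy_*` names are hypothetical) that also sets `handle_unknown="ignore"`, so a category that appears only in the validation data is encoded as all zeros instead of raising an error. This is an alternative under those assumptions, not the pipeline used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: "retired" appears only in the validation frame
toy_train = pd.DataFrame({
    "job": ["admin.", "services", "admin.", "student"],
    "age": [30, 45, 52, 23],
})
toy_val = pd.DataFrame({"job": ["retired"], "age": [61]})

toy_prep = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
        ("num", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_train_prepared = toy_prep.fit_transform(toy_train)  # fit on training data only
toy_val_prepared = toy_prep.transform(toy_val)          # unseen "retired" -> all-zero dummies

print(toy_train_prepared.shape, toy_val_prepared.shape)
```

The output has three dummy columns for `job` followed by one scaled column for `age`.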

In [24]:
X_train_prepared = prep_pipeline.fit_transform(X_train)

In [25]:
X_train_prepared

array([[ 0.        ,  0.        ,  0.        , ...,  0.95603918,
         0.77476545,  0.84745082],
       [ 0.        ,  0.        ,  0.        , ..., -0.47320393,
         0.774189  ,  0.84745082],
       [ 1.        ,  0.        ,  0.        , ..., -1.23113588,
        -1.36905844, -0.9393463 ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.89107358,
         0.71250867,  0.33357351],
       [ 0.        ,  0.        ,  0.        , ..., -1.23113588,
        -1.3373536 , -0.9393463 ],
       [ 0.        ,  0.        ,  0.        , ..., -1.23113588,
        -1.31717779, -0.9393463 ]])

In [26]:
X_val_prepared = prep_pipeline.transform(X_val)  # we do not *fit* on the validation data!
X_test_prepared = prep_pipeline.transform(X_test)  # we do not *fit* on the test data!

## 3.4 Build a Model

The task is to build 2 binary classification models using any 2 of the following methods:
1. Logistic Regression,
2. Random Forest,
3. GBM,
4. Xgboost,
5. Neural Network.

Let's train the first four and see how they perform, keeping Neural Nets as the cherry on the cake! :)

**Note:** I train the basic models in this section. I will tune hyperparameters later.

### 3.4.1 Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression

In [28]:
log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X_train_prepared, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=42)
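
The warning above means that the `lbfgs` solver stopped at its iteration limit (`max_iter`, 100 by default) before converging. A minimal sketch of the first remedy the message suggests, raising `max_iter`, on synthetic data from `make_classification`. The value 1000 is an assumption chosen for illustration, not a setting from the original notebook, and the `*_toy` names are hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=42)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed
    warnings.simplefilter("error", category=ConvergenceWarning)
    log_reg_toy = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
    log_reg_toy.fit(X_toy, y_toy)

# Number of iterations the solver actually used
print(log_reg_toy.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged before hitting the limit.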

How well do we do on the training set?

In [29]:
log_reg.score(X_train_prepared, y_train)

0.9105868952700678

How well do we do on the validation set?

In [30]:
log_reg.score(X_val_prepared, y_val)

0.9163744267601834

I leave the evaluation on the test set for later, after I fine-tune the model's hyperparameters.

### 3.4.2 Random Forest

In [31]:
from sklearn.ensemble import RandomForestClassifier

In [32]:
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X_train_prepared, y_train)

RandomForestClassifier(random_state=42)

In [33]:
forest_clf.score(X_train_prepared, y_train)

1.0

In [34]:
forest_clf.score(X_val_prepared, y_val)

0.9128675478823847

### 3.4.3 GBM

In [35]:
from sklearn.ensemble import GradientBoostingClassifier

In [36]:
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train_prepared, y_train)

GradientBoostingClassifier(random_state=42)

In [37]:
# score the GBM itself; scoring forest_clf here only repeats the Random Forest results (1.0 / 0.9129)
gb_clf.score(X_train_prepared, y_train)

In [38]:
gb_clf.score(X_val_prepared, y_val)

### 3.4.4 XGBoost

In [39]:
from xgboost import XGBClassifier

In [40]:
xgb_clf = XGBClassifier(random_state=42)
xgb_clf.fit(X_train_prepared, y_train)

KeyError: 'base_score'


**WARNING.** `XGBClassifier` raises an error when the notebook prints the estimator returned by the `fit` method. It seems to be an issue with Scikit-Learn version 0.23.1 and higher; see the discussion here: https://github.com/dmlc/xgboost/issues/5668

I updated scikit-learn to version 0.23.2 and the error still appears...

In [41]:
xgb_clf.score(X_train_prepared, y_train)

0.9597146454049518

In [43]:
xgb_clf.score(X_val_prepared, y_val)

0.9174534664148908

### 3.4.5 Preliminary Summary

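The gap between training and validation accuracy differs by model. Random Forest clearly overfits (1.000 on the training set vs 0.913 on the validation set), XGBoost overfits to a lesser degree (0.960 vs 0.917), and Logistic Regression shows no gap (0.911 vs 0.916).
Therefore, we have to do some regularization, at least for the tree-based models.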
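
To put these scores in context, the sketch below computes the train-minus-validation gap from the accuracies printed in the cells above, together with the accuracy of always predicting *no* (from the class counts in section 3.2). GBM is left out because no separate GBM score is reported above. The variable names are illustrative and not part of the original notebook.

```python
# Accuracy scores copied from the cells above: (training, validation)
scores = {
    "Logistic Regression": (0.9105868952700678, 0.9163744267601834),
    "Random Forest": (1.0, 0.9128675478823847),
    "XGBoost": (0.9597146454049518, 0.9174534664148908),
}

# Train-minus-validation gap; a large positive gap suggests overfitting
gaps = {name: train - val for name, (train, val) in scores.items()}

# Accuracy of always predicting "no", from the counts 36548 "no" and 4640 "yes"
baseline_accuracy = 36548 / (36548 + 4640)

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Because the classes are imbalanced, the baseline is already high, so validation accuracies around 0.91 to 0.92 are only a few points above what a constant prediction achieves.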

## 3.5 Regularization and Hyperparameter Tuning

There are two classes in `Scikit-Learn` that help us tune hyperparameters:
1. `GridSearchCV`, and
2. `RandomizedSearchCV`.

Let's try both of them.

In [44]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
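
Before applying them to the bank data, here is a minimal sketch of how the two classes differ, run on synthetic data from `make_classification`. The candidate values for `C`, the number of folds, and all `*_toy` and `*_search` names are assumptions chosen for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
c_values = [0.01, 0.1, 1.0, 10.0, 100.0]  # inverse regularization strengths (illustrative)

# Exhaustive search: evaluates every value in the grid
grid_search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid={"C": c_values},
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X_toy, y_toy)

# Randomized search: evaluates only n_iter candidates sampled from the same list
random_search = RandomizedSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_distributions={"C": c_values},
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_toy, y_toy)

print(grid_search.best_params_, random_search.best_params_)
```

`GridSearchCV` fits all five candidates, while `RandomizedSearchCV` fits three of them, which matters more as the number of hyperparameters grows.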