# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

__The dataset covers 17 campaigns that occurred between May 2008 and November 2010.__

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [132]:
import pandas as pd
import time
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay


In [56]:
# Reading the Dataset 
df = pd.read_csv('C:/Users/natha/OneDrive/Escritorio/Berkeley/Modulo 17/module_17_starter/data/bank-additional-full.csv', sep = ';')

In [57]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [58]:
# Renaming some columns
df_new = df.rename(columns={"emp.var.rate": "empvarrate", "cons.price.idx": "conspriceidx","cons.conf.idx":"consconfidx","nr.employed":"nremployed"})

### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



No feature has null values. However, several categorical features use the category `unknown`, which effectively acts as a missing value. The categorical features must be encoded as numeric before modeling, and the target `y` should be converted to numeric as well.


In [60]:
df_new.select_dtypes('object').columns[:-1] #categorical features

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'day_of_week', 'poutcome'],
      dtype='object')
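
To quantify the `unknown` placeholders across these categorical columns in one pass, something like the following sketch could be used. It runs on a tiny made-up frame (`sample`) rather than the real data, and the helper name `unknown_share` is a hypothetical choice for illustration, not part of the original notebook.

```python
import pandas as pd

def unknown_share(frame, columns, token="unknown"):
    """Fraction of rows in each listed column that equal `token`."""
    return frame[columns].eq(token).mean().sort_values(ascending=False)

# Hypothetical stand-in for the real dataset
sample = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "job": ["housemaid", "services", "unknown", "admin."],
    "default": ["no", "unknown", "no", "unknown"],
})

print(sample.isna().sum())                        # true nulls per column
print(unknown_share(sample, ["job", "default"]))  # 'unknown' share per column
```

On the real data, passing the ten categorical column names listed above to the same helper should reproduce the `unknown` rows of the per-column `value_counts` outputs that follow.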

In [41]:
# Job entries classified as unknown will be discarded.
job = df_new['job'].value_counts(normalize = True)
job

admin.           0.253035
blue-collar      0.224677
technician       0.163713
services         0.096363
management       0.070992
retired          0.041760
entrepreneur     0.035350
self-employed    0.034500
housemaid        0.025736
unemployed       0.024619
student          0.021244
unknown          0.008012
Name: job, dtype: float64

In [42]:
# Marital entries classified as unknown will be discarded.
marital = df_new['marital'].value_counts(normalize = True)
marital

married     0.605225
single      0.280859
divorced    0.111974
unknown     0.001942
Name: marital, dtype: float64

In [61]:
# Education entries classified as unknown will be discarded.
education= df_new['education'].value_counts(normalize = True)
education         

university.degree      0.295426
high.school            0.231014
basic.9y               0.146766
professional.course    0.127294
basic.4y               0.101389
basic.6y               0.055647
unknown                0.042027
illiterate             0.000437
Name: education, dtype: float64

In [74]:
# About 21% of the entries in the default feature are classified as unknown; these entries will be discarded.
default = df_new['default'].value_counts(normalize = True)
default

no         0.791201
unknown    0.208726
yes        0.000073
Name: default, dtype: float64

In [75]:
# Housing entries classified as unknown will be discarded.
housing = df_new['housing'].value_counts(normalize = True)
housing

yes        0.523842
no         0.452122
unknown    0.024036
Name: housing, dtype: float64

In [76]:
# Loan entries classified as unknown will be discarded.

loan = df_new['loan'].value_counts(normalize = True)
loan

no         0.824269
yes        0.151695
unknown    0.024036
Name: loan, dtype: float64

In [77]:
contact = df_new['contact'].value_counts(normalize = True)
contact

cellular     0.634748
telephone    0.365252
Name: contact, dtype: float64

In [78]:
month = df_new['month'].value_counts(normalize = True)
month

may    0.334296
jul    0.174177
aug    0.149995
jun    0.129115
nov    0.099568
apr    0.063902
oct    0.017432
sep    0.013839
mar    0.013256
dec    0.004419
Name: month, dtype: float64

In [79]:
day_of_week = df_new['day_of_week'].value_counts(normalize = True)
day_of_week

thu    0.209357
mon    0.206711
wed    0.197485
tue    0.196416
fri    0.190031
Name: day_of_week, dtype: float64

In [80]:
poutcome = df_new['poutcome'].value_counts(normalize = True)
poutcome   

nonexistent    0.863431
failure        0.103234
success        0.033335
Name: poutcome, dtype: float64

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

__The business objective is to improve the conversion rate of the marketing campaign by more accurately identifying the clients most likely to subscribe to a term deposit.__

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.

In [81]:
#1 - age (numeric)
#2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
#3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
#4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
#5 - default: has credit in default? (categorical: 'no','yes','unknown')
#6 - housing: has housing loan? (categorical: 'no','yes','unknown')
#7 - loan: has personal loan? (categorical: 'no','yes','unknown')

In [82]:
# Cleaning the data: discard rows where any client feature is classified as unknown

client_cols = ['job', 'marital', 'education', 'housing', 'loan', 'default']
df_clean = df_new[(df_new[client_cols] != 'unknown').all(axis=1)]
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30488 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           30488 non-null  int64  
 1   job           30488 non-null  object 
 2   marital       30488 non-null  object 
 3   education     30488 non-null  object 
 4   default       30488 non-null  object 
 5   housing       30488 non-null  object 
 6   loan          30488 non-null  object 
 7   contact       30488 non-null  object 
 8   month         30488 non-null  object 
 9   day_of_week   30488 non-null  object 
 10  duration      30488 non-null  int64  
 11  campaign      30488 non-null  int64  
 12  pdays         30488 non-null  int64  
 13  previous      30488 non-null  int64  
 14  poutcome      30488 non-null  object 
 15  empvarrate    30488 non-null  float64
 16  conspriceidx  30488 non-null  float64
 17  consconfidx   30488 non-null  float64
 18  euribor3m     30488 non-nu

In [91]:
# Keep only features 1-7 and the target.
# reset_index(drop=True) renumbers the rows without modifying a slice of df_clean in place.

df_1 = df_clean[['age', 'job','marital','education','default','housing','loan','y']].reset_index(drop=True)
df_1

Unnamed: 0,age,job,marital,education,default,housing,loan,y
0,56,housemaid,married,basic.4y,no,no,no,no
1,37,services,married,high.school,no,yes,no,no
2,40,admin.,married,basic.6y,no,no,no,no
3,56,services,married,high.school,no,no,yes,no
4,59,admin.,married,professional.course,no,no,no,no
...,...,...,...,...,...,...,...,...
30483,73,retired,married,professional.course,no,yes,no,yes
30484,46,blue-collar,married,professional.course,no,no,no,no
30485,56,retired,married,university.degree,no,yes,no,no
30486,44,technician,married,professional.course,no,no,no,yes


In [100]:
X = df_1.drop('y', axis = 1)
y = df_1['y']

In [111]:
# One-hot encode the categorical features and standardize the remaining numeric feature (age)
transformer = make_column_transformer(
    (OneHotEncoder(drop = 'if_binary'), ['job', 'marital', 'education', 'default', 'housing', 'loan']),
    remainder = StandardScaler())
X_num = transformer.fit_transform(X)

# Encode the target: 'no' -> 0, 'yes' -> 1
y_num = np.where(y == 'no', 0, 1)


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X_num, y_num, random_state=42)
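
Note that the encoder and scaler above are fitted on all rows before the split, so statistics from the test rows influence the preprocessing. One way to avoid this is to split first and wrap the transformer and the model in a pipeline. The sketch below shows the idea on a small synthetic frame; `toy`, `target`, and the column subset are hypothetical stand-ins, and `stratify` is an optional addition that keeps the class balance equal in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the bank client features
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "housing": rng.choice(["yes", "no"], size=n),
    "marital": rng.choice(["married", "single", "divorced"], size=n),
})
target = rng.choice([0, 1], size=n, p=[0.85, 0.15])

# Split first so the test rows never influence the fitted preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, random_state=42, stratify=target
)

preprocess = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["housing", "marital"]),
    remainder=StandardScaler(),
)
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X_tr, y_tr)   # encoder and scaler are fitted on the training rows only
print(model.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `GridSearchCV`, so the preprocessing is refitted inside every cross-validation fold.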

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [115]:
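# Baseline for the model: class shares of the target.
# Stored under a new name so the target series y is not overwritten.
baseline = df_1['y'].value_counts(normalize = True)
baseline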

no     0.873426
yes    0.126574
Name: y, dtype: float64
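
These class shares give the baseline: a classifier that always predicts the majority class `no` is correct about 87.3% of the time, so a useful model should exceed that accuracy. A minimal sketch of the calculation, using a hypothetical helper `majority_baseline` and a small made-up label list rather than the real target:

```python
from collections import Counter

def majority_baseline(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels with an imbalance similar to the dataset
labels = ["no"] * 7 + ["yes"] * 1
label, acc = majority_baseline(labels)
print(label, acc)
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same number through the usual `fit`/`score` interface.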

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [117]:
log_reg = LogisticRegression().fit(X_train, y_train)


### Problem 9: Score the Model

What is the accuracy of your model?

In [119]:
acc = log_reg.score(X_test, y_test)
acc

0.8728680136447127

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [140]:
#Logistic Regression

start = time.time()
log_reg = LogisticRegression().fit(X_train, y_train)
stop = time.time()

acc_test = log_reg.score(X_test, y_test)
acc_train = log_reg.score(X_train, y_train)

print(f"Training time: {stop - start}s")
print(f"Train ACC: {acc_train}")
print(f"Test ACC: {acc_test}")


Training time: 0.14690232276916504s
Train ACC: 0.8736114755532232
Test ACC: 0.8728680136447127


In [141]:
#KNN

start = time.time()
knn = KNeighborsClassifier().fit(X_train, y_train)
stop = time.time()

acc_test = knn.score(X_test, y_test)
acc_train = knn.score(X_train, y_train)

print(f"Training time: {stop - start}s")
print(f"Train ACC: {acc_train}")
print(f"Test ACC: {acc_test}")



Training time: 0.003747224807739258s
Train ACC: 0.8768039884544739
Test ACC: 0.8663080556284439


In [142]:
#Decision Tree

start = time.time()
dt = DecisionTreeClassifier().fit(X_train, y_train)
stop = time.time()

acc_test = dt.score(X_test, y_test)
acc_train = dt.score(X_train, y_train)

print(f"Training time: {stop - start}s")
print(f"Train ACC: {acc_train}")
print(f"Test ACC: {acc_test}")

Training time: 0.3970613479614258s
Train ACC: 0.9018630280766203
Test ACC: 0.8518761479926529


In [143]:
#SVM

start = time.time()
svc = SVC().fit(X_train, y_train)
stop = time.time()

acc_test = svc.score(X_test, y_test)
acc_train = svc.score(X_train, y_train)

print(f"Training time: {stop - start}s")
print(f"Train ACC: {acc_train}")
print(f"Test ACC: {acc_test}")

Training time: 24.939914226531982s
Train ACC: 0.8736114755532232
Test ACC: 0.8728680136447127


In [144]:
data = {'Model': ['Logistic Regression','KNN', 'Decision Tree','SVM'],
       'Train Time': ['0.14690232276916504s','0.003747224807739258s','0.3970613479614258s','24.939914226531982s'],
       'Train Accuracy': ['0.8736114755532232','0.8768039884544739','0.9018630280766203','0.8736114755532232'],
       'Test Accuracy': ['0.8728680136447127','0.8663080556284439','0.8518761479926529','0.8728680136447127']}
       
result = pd.DataFrame(data)
result

Unnamed: 0,Model,Train Time,Train Accuracy,Test Accuracy
0,Logistic Regression,0.14690232276916504s,0.8736114755532232,0.8728680136447127
1,KNN,0.003747224807739258s,0.8768039884544739,0.8663080556284439
2,Decision Tree,0.3970613479614258s,0.9018630280766203,0.8518761479926529
3,SVM,24.939914226531982s,0.8736114755532232,0.8728680136447127
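
The four timing cells repeat the same steps, and the table above is typed in by hand as strings. A loop can collect the numbers directly as floats instead. The sketch below assumes only that each model exposes `fit` and `score`, as scikit-learn estimators do; `benchmark` and the stand-in estimator `AlwaysNo` are hypothetical names used for illustration.

```python
import time

def benchmark(models, X_train, y_train, X_test, y_test):
    """Fit each model once and record fit time and accuracies as floats."""
    rows = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        rows.append({
            "Model": name,
            "Train Time": elapsed,
            "Train Accuracy": model.score(X_train, y_train),
            "Test Accuracy": model.score(X_test, y_test),
        })
    return rows

class AlwaysNo:
    """Stand-in estimator that predicts 0 for every row."""
    def fit(self, X, y):
        return self

    def score(self, X, y):
        return sum(1 for label in y if label == 0) / len(y)

rows = benchmark(
    {"Always no": AlwaysNo()},
    [[0], [1], [2], [3]], [0, 0, 0, 1],   # made-up training data
    [[4], [5]], [0, 1],                   # made-up test data
)
print(rows)
```

With the notebook's objects, passing a dictionary such as `{'Logistic Regression': LogisticRegression(), 'KNN': KNeighborsClassifier(), ...}` together with `X_train, y_train, X_test, y_test`, and wrapping the result in `pd.DataFrame(...)`, would yield a numeric version of the table.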


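__Logistic Regression and SVM tie for the best test accuracy (0.873), but Logistic Regression trained in about 0.15 s versus about 25 s for SVM, so it offers the best trade-off between training time and accuracy. Neither model beats the majority-class baseline of about 0.873, and the Decision Tree shows the largest gap between train (0.902) and test (0.852) accuracy, which points to overfitting.__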

### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
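
On the last point: with roughly 87% of clients answering `no`, accuracy rewards a model that never predicts `yes`. Precision, recall, and F1 on the `yes` class are common alternatives. The sketch below computes them by hand on made-up labels; the helper `precision_recall_f1` and both prediction lists are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [0, 0, 0, 0, 0, 0, 1, 1]
always_no = [0] * 8                    # 75% accurate, finds no subscribers
some_yes = [0, 0, 0, 0, 0, 1, 1, 0]    # also 75% accurate, finds one of two

print(precision_recall_f1(y_true, always_no))
print(precision_recall_f1(y_true, some_yes))
```

Both prediction lists have the same accuracy, yet only the second identifies any subscriber. In scikit-learn the same quantities are available as `precision_score`, `recall_score`, and `f1_score` in `sklearn.metrics`, and `GridSearchCV(..., scoring="f1")` tunes on F1 instead of accuracy.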

__To try to improve the performance of the model, new features will be included.__

In [145]:
# Add two more features: duration and campaign.
# reset_index(drop=True) renumbers the rows without modifying a slice of df_clean in place.

df_2 = df_clean[['age', 'job','marital','education','default','housing','loan','duration','campaign','y']].reset_index(drop=True)
df_2

Unnamed: 0,age,job,marital,education,default,housing,loan,duration,campaign,y
0,56,housemaid,married,basic.4y,no,no,no,261,1,no
1,37,services,married,high.school,no,yes,no,226,1,no
2,40,admin.,married,basic.6y,no,no,no,151,1,no
3,56,services,married,high.school,no,no,yes,307,1,no
4,59,admin.,married,professional.course,no,no,no,139,1,no
...,...,...,...,...,...,...,...,...,...,...
30483,73,retired,married,professional.course,no,yes,no,334,1,yes
30484,46,blue-collar,married,professional.course,no,no,no,383,1,no
30485,56,retired,married,university.degree,no,yes,no,189,2,no
30486,44,technician,married,professional.course,no,no,no,442,1,yes


In [146]:
X2 = df_2.drop('y', axis = 1)
y2 = df_2['y']

In [150]:
# One-hot encode the categorical features and standardize the numeric ones (age, duration, campaign)
transformer2 = make_column_transformer(
    (OneHotEncoder(drop = 'if_binary'), ['job', 'marital', 'education', 'default', 'housing', 'loan']),
    remainder = StandardScaler())
X2_num = transformer2.fit_transform(X2)

# Encode the target: 'no' -> 0, 'yes' -> 1
y2_num = np.where(y2 == 'no', 0, 1)

In [151]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2_num, y2_num, random_state=42)

In [159]:
#Logistic Regression with more features

start = time.time()
lgr2 = LogisticRegression().fit(X2_train, y2_train)
stop = time.time()

acc_test = lgr2.score(X2_test, y2_test)
acc_train = lgr2.score(X2_train, y2_train)

print(f"Training time: {stop - start}s")
print(f"Train ACC: {acc_train}")
print(f"Test ACC: {acc_test}")

Training time: 0.14943504333496094s
Train ACC: 0.8798215691419575
Test ACC: 0.8808711624245604


In [160]:
# Parameters to use in the decision tree model

params = {'min_impurity_decrease': [0.01, 0.02, 0.03, 0.05],
         'max_depth': [2, 5, 10],
         'min_samples_split': [0.1, 0.2, 0.05]}
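
A grid search tries every combination of these values, so the grid size drives the search cost. The short sketch below counts the candidates, assuming the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not set; the final refit on the full training set is not counted.

```python
from itertools import product

params = {'min_impurity_decrease': [0.01, 0.02, 0.03, 0.05],
          'max_depth': [2, 5, 10],
          'min_samples_split': [0.1, 0.2, 0.05]}

# Every combination of one value per hyperparameter
combos = list(product(*params.values()))
n_folds = 5  # assumption: GridSearchCV default when cv is not specified

print(len(combos), len(combos) * n_folds)  # 36 candidates, 180 fits
```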

In [161]:
# Decision Tree with Grid Search CV and the two additional features

grid = GridSearchCV(DecisionTreeClassifier(random_state = 42), param_grid=params).fit(X2_train, y2_train)
grid_train_acc = grid.score(X2_train, y2_train)
grid_test_acc = grid.score(X2_test, y2_test)
best_params = grid.best_params_

print(f'Training Accuracy: {grid_train_acc: .3f}')
print(f'Test Accuracy: {grid_test_acc: .3f}')
print(f'Best parameters of tree: {best_params}')

Training Accuracy:  0.874
Test Accuracy:  0.873
Best parameters of tree: {'max_depth': 2, 'min_impurity_decrease': 0.01, 'min_samples_split': 0.1}


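Logistic Regression remained the best model: adding `duration` and `campaign` raised its test accuracy from 0.873 to 0.881. The decision tree tuned with Grid Search CV on the same features reached a test accuracy of 0.873, better than the default decision tree (0.852) but still below Logistic Regression. Because `duration` is only known after a call ends (see the data description in Problem 3), this gain should be read as a benchmark rather than as a realistic predictive result.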

##### Questions

As next steps, we could test other variables and other techniques, such as ensemble methods and neural networks, to improve this classification model.
As a recommendation, it would be useful to collect additional client characteristics, such as an investment profile: whether the client holds other investments and what kind they are (is the client a conservative or a risk-taking investor?).