# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section: K-Nearest Neighbors, Logistic Regression, Decision Trees, and Support Vector Machines. We will use a dataset related to marketing bank products over the telephone.



### Getting Started

Our dataset comes from the UCI Machine Learning Repository ([link](https://archive.ics.uci.edu/ml/datasets/bank+marketing)). The data is from a Portuguese banking institution and collects the results of multiple marketing campaigns. We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.
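
The file is semicolon-delimited, so `pd.read_csv` needs `sep=';'`. The sketch below illustrates this on a tiny in-memory sample rather than the real file; the rows and the name `sample_df` are made up for illustration and cover only a subset of the columns.

```python
import io

import pandas as pd

# Two illustrative rows in a semicolon-delimited layout (hypothetical sample, not the real file).
sample_csv = io.StringIO(
    "age;job;marital;y\n"
    "56;housemaid;married;no\n"
    "57;services;married;no\n"
)

# With the default comma separator each row would be read as a single column.
sample_df = pd.read_csv(sample_csv, sep=";")
print(sample_df.shape)
print(sample_df.dtypes)
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the separator was interpreted correctly.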

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_theme(style="whitegrid")

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Relative path; assumes the notebook is run from the project folder that contains data/
bank_df = pd.read_csv('data/bank-additional-full.csv', sep=';')

In [3]:
bank_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```
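
According to the description above, missing information is encoded as the string `'unknown'` in several categorical features and as the sentinel `999` in `pdays`, so a plain null check will not reveal it. A minimal sketch of how such placeholders could be counted, using a small made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature frame; the category labels follow the data description.
toy = pd.DataFrame({
    "job": ["housemaid", "unknown", "services", "admin."],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 999, 6, 999],
})

# Count 'unknown' placeholders in each categorical column.
cat_cols = ["job", "default"]
unknown_counts = (toy[cat_cols] == "unknown").sum()

# Share of clients never contacted before (sentinel value 999).
never_contacted = (toy["pdays"] == 999).mean()

print(unknown_counts)
print(never_contacted)
```

The same two expressions could be applied to the full DataFrame to decide whether `'unknown'` should be kept as its own category or treated as missing.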



### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [4]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [5]:
bank_df['y'].value_counts()

y
no     36548
yes     4640
Name: count, dtype: int64

In [6]:
bank_df.rename(columns = {'y':'subscribed'}, inplace = True)

In [7]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.
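
One possible way to encode the bank client features is a `ColumnTransformer` that one-hot encodes the categorical columns and scales the numeric one. This is only a sketch on a made-up four-row frame with a subset of the client columns; it is not the approach taken in the cells that follow, which use `pd.get_dummies`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows using a subset of the bank client columns.
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "job": ["housemaid", "services", "services", "admin."],
    "marital": ["married", "married", "single", "married"],
    "default": ["no", "unknown", "no", "no"],
    "y": ["no", "no", "yes", "no"],
})

categorical = ["job", "marital", "default"]
numeric = ["age"]

# One-hot encode the categorical features and standardize the numeric one.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

X_toy = preprocess.fit_transform(toy[categorical + numeric])
y_toy = (toy["y"] == "yes").astype(int)  # binary target: 1 = subscribed

print(X_toy.shape)
print(y_toy.tolist())
```

Keeping the encoding inside a transformer makes it easy to reuse the same fitted preprocessing on the test set later.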

In [8]:
bank_df['subscribed'] = bank_df['subscribed'].map( {'no': 0, 'yes': 1} ).astype(int)
bank_df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribed
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,1
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,0
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,0
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,1


In [9]:
bank_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,41188.0,40.02406,10.42125,17.0,32.0,38.0,47.0,98.0
duration,41188.0,258.28501,259.279249,0.0,102.0,180.0,319.0,4918.0
campaign,41188.0,2.567593,2.770014,1.0,1.0,2.0,3.0,56.0
pdays,41188.0,962.475454,186.910907,0.0,999.0,999.0,999.0,999.0
previous,41188.0,0.172963,0.494901,0.0,0.0,0.0,0.0,7.0
emp.var.rate,41188.0,0.081886,1.57096,-3.4,-1.8,1.1,1.4,1.4
cons.price.idx,41188.0,93.575664,0.57884,92.201,93.075,93.749,93.994,94.767
cons.conf.idx,41188.0,-40.5026,4.628198,-50.8,-42.7,-41.8,-36.4,-26.9
euribor3m,41188.0,3.621291,1.734447,0.634,1.344,4.857,4.961,5.045
nr.employed,41188.0,5167.035911,72.251528,4963.6,5099.1,5191.0,5228.1,5228.1


In [10]:
drop_cols = ['age','campaign','cons.conf.idx','pdays','nr.employed']

In [11]:
bank_df.drop(drop_cols, axis=1, inplace=True)

In [12]:
bank_df_encoded = pd.get_dummies(bank_df, columns=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'],  
                                 prefix=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

In [13]:
bank_df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 59 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   duration                       41188 non-null  int64  
 1   previous                       41188 non-null  int64  
 2   emp.var.rate                   41188 non-null  float64
 3   cons.price.idx                 41188 non-null  float64
 4   euribor3m                      41188 non-null  float64
 5   subscribed                     41188 non-null  int32  
 6   job_admin.                     41188 non-null  bool   
 7   job_blue-collar                41188 non-null  bool   
 8   job_entrepreneur               41188 non-null  bool   
 9   job_housemaid                  41188 non-null  bool   
 10  job_management                 41188 non-null  bool   
 11  job_retired                    41188 non-null  bool   
 12  job_self-employed              41188 non-null 

In [14]:
bank_df_encoded.describe()

Unnamed: 0,duration,previous,emp.var.rate,cons.price.idx,euribor3m,subscribed
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,258.28501,0.172963,0.081886,93.575664,3.621291,0.112654
std,259.279249,0.494901,1.57096,0.57884,1.734447,0.316173
min,0.0,0.0,-3.4,92.201,0.634,0.0
25%,102.0,0.0,-1.8,93.075,1.344,0.0
50%,180.0,0.0,1.1,93.749,4.857,0.0
75%,319.0,0.0,1.4,93.994,4.961,0.0
max,4918.0,7.0,1.4,94.767,5.045,1.0


In [15]:
correlation_matrix = bank_df_encoded.corr()

In [16]:
correlation_matrix

Unnamed: 0,duration,previous,emp.var.rate,cons.price.idx,euribor3m,subscribed,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
duration,1.0,0.02064,-0.027968,0.005312,-0.032897,0.405274,-0.008918,0.012992,0.003679,-0.004908,...,0.015454,0.018011,-0.010343,-0.023068,0.014666,0.00414,0.014537,-0.013311,-0.011665,0.044876
previous,0.02064,1.0,-0.420489,-0.20313,-0.454494,0.230181,0.018888,-0.054845,-0.013239,-0.011569,...,0.127754,0.157651,0.004404,-0.002012,0.001512,9e-05,-0.003929,0.682608,-0.878776,0.524045
emp.var.rate,-0.027968,-0.420489,1.0,0.775334,0.972245,-0.298334,-0.024572,0.057264,0.009363,0.036367,...,-0.213706,-0.170366,-0.015277,-0.018139,-0.007673,0.014582,0.026797,-0.381706,0.472501,-0.256886
cons.price.idx,0.005312,-0.20313,0.775334,1.0,0.68823,-0.136211,-0.04184,0.075322,0.009825,0.028335,...,-0.092174,-0.046905,0.002569,0.001273,-0.017143,0.001216,0.012479,-0.297718,0.304264,-0.077416
euribor3m,-0.032897,-0.454494,0.972245,0.68823,1.0,-0.307771,-0.023831,0.046775,0.018744,0.036392,...,-0.185937,-0.190321,-0.015371,-0.023279,-0.013757,0.022732,0.030201,-0.385417,0.488406,-0.281022
subscribed,0.405274,0.230181,-0.298334,-0.136211,-0.307771,1.0,0.031426,-0.074423,-0.016644,-0.006505,...,0.137366,0.126067,-0.006996,-0.021265,0.013888,0.008046,0.006302,0.031799,-0.193507,0.316269
job_admin.,-0.008918,0.018888,-0.024572,-0.04184,-0.023831,0.031426,1.0,-0.313313,-0.111417,-0.094595,...,0.006538,0.010407,0.009892,-0.000736,-0.00397,-0.001835,-0.003112,0.002771,-0.01556,0.025069
job_blue-collar,0.012992,-0.054845,0.057264,0.075322,0.046775,-0.074423,-0.313313,1.0,-0.10305,-0.087492,...,-0.049034,-0.054309,0.003329,-0.009754,-0.007062,-0.006829,0.020673,-0.013254,0.043843,-0.061403
job_entrepreneur,0.003679,-0.013239,0.009363,0.009825,0.018744,-0.016644,-0.111417,-0.10305,1.0,-0.031113,...,-0.010429,-0.009172,-0.001905,0.006828,0.005551,-0.007275,-0.00348,0.001595,0.007598,-0.017238
job_housemaid,-0.004908,-0.011569,0.036367,0.028335,0.036392,-0.006505,-0.094595,-0.087492,-0.031113,1.0,...,0.000611,-0.003503,-0.007595,0.003365,-0.009014,0.0115,0.001797,-0.017853,0.014629,0.002276


In [17]:
correlation_matrix["subscribed"].sort_values(ascending=True)

euribor3m                       -0.307771
emp.var.rate                    -0.298334
poutcome_nonexistent            -0.193507
contact_telephone               -0.144773
cons.price.idx                  -0.136211
month_may                       -0.108271
default_unknown                 -0.099293
job_blue-collar                 -0.074423
education_basic.9y              -0.045135
marital_married                 -0.043398
job_services                    -0.032301
month_jul                       -0.032230
education_basic.6y              -0.023517
day_of_week_mon                 -0.021265
job_entrepreneur                -0.016644
month_nov                       -0.011796
housing_no                      -0.011085
education_basic.4y              -0.010798
marital_divorced                -0.010608
month_jun                       -0.009182
month_aug                       -0.008813
education_high.school           -0.007452
day_of_week_fri                 -0.006996
job_housemaid                   -0

In [18]:
drop_cols = ['duration','job_admin.','job_blue-collar','job_entrepreneur','job_housemaid','job_management','job_retired',
             'job_self-employed','job_services','job_student','job_technician','job_unemployed','job_unknown','marital_divorced',
             'marital_married','marital_single','marital_unknown','education_basic.4y','education_basic.6y','education_basic.9y',
             'education_high.school','education_illiterate','education_professional.course','education_university.degree','education_unknown',
             'housing_no','housing_unknown','housing_yes','loan_no','loan_unknown','loan_yes','day_of_week_fri','day_of_week_mon','day_of_week_thu',
             'day_of_week_tue','day_of_week_wed']

In [19]:
bank_df_encoded.drop(drop_cols,axis=1,inplace=True)

In [20]:
bank_df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   previous              41188 non-null  int64  
 1   emp.var.rate          41188 non-null  float64
 2   cons.price.idx        41188 non-null  float64
 3   euribor3m             41188 non-null  float64
 4   subscribed            41188 non-null  int32  
 5   default_no            41188 non-null  bool   
 6   default_unknown       41188 non-null  bool   
 7   default_yes           41188 non-null  bool   
 8   contact_cellular      41188 non-null  bool   
 9   contact_telephone     41188 non-null  bool   
 10  month_apr             41188 non-null  bool   
 11  month_aug             41188 non-null  bool   
 12  month_dec             41188 non-null  bool   
 13  month_jul             41188 non-null  bool   
 14  month_jun             41188 non-null  bool   
 15  month_mar          

In [21]:
bank_df_encoded.columns

Index(['previous', 'emp.var.rate', 'cons.price.idx', 'euribor3m', 'subscribed',
       'default_no', 'default_unknown', 'default_yes', 'contact_cellular',
       'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul',
       'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct',
       'month_sep', 'poutcome_failure', 'poutcome_nonexistent',
       'poutcome_success'],
      dtype='object')

In [22]:
bank_df_encoded = bank_df_encoded[['subscribed', 'previous', 'emp.var.rate', 'cons.price.idx', 'euribor3m',
       'default_no', 'default_unknown', 'default_yes', 'contact_cellular',
       'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul',
       'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct',
       'month_sep', 'poutcome_failure', 'poutcome_nonexistent',
       'poutcome_success']]

In [23]:
bank_df_encoded.head(10)

Unnamed: 0,subscribed,previous,emp.var.rate,cons.price.idx,euribor3m,default_no,default_unknown,default_yes,contact_cellular,contact_telephone,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_nonexistent,poutcome_success
0,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
1,0,0,1.1,93.994,4.857,False,True,False,False,True,...,False,False,False,True,False,False,False,False,True,False
2,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
3,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
4,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
5,0,0,1.1,93.994,4.857,False,True,False,False,True,...,False,False,False,True,False,False,False,False,True,False
6,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
7,0,0,1.1,93.994,4.857,False,True,False,False,True,...,False,False,False,True,False,False,False,False,True,False
8,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
9,0,0,1.1,93.994,4.857,True,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [30]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

In [25]:
cols = bank_df_encoded.columns
target_col = 'subscribed'
feat_cols = [c for c in cols if c != target_col]

In [26]:
array = bank_df_encoded.values

In [27]:
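# Column 0 is the target; columns 1 through 22 hold the 22 features.
# Note: the slice 1:22 stops at column 21, so the last feature (poutcome_success)
# is left out; array[:, 1:] would include all 22 features.
X = array[:, 1:22]
y = array[:, 0]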
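
Later cells refer to `X_train`, `X_test`, `y_train`, and `y_test`, but the cell that creates them is not shown. A minimal sketch of a stratified split, run here on synthetic stand-in data; the `test_size` and `random_state` values are assumptions, not the settings used to produce the outputs in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 10% positive class (the real target is about 11% positive).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([1] * 20 + [0] * 180)

# stratify keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratifying matters for an imbalanced target like this one, since an unstratified split can leave the test set with noticeably more or fewer positives than the training set.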

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?
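
A common baseline for classification is to always predict the majority class. The sketch below rebuilds the target from the class counts printed earlier (36548 `no`, 4640 `yes`) and scores scikit-learn's `DummyClassifier`; the placeholder feature matrix is an assumption needed only to satisfy the API.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the target from the value counts shown above: 0 = 'no', 1 = 'yes'.
y_all = np.array([0] * 36548 + [1] * 4640)

# The dummy model ignores the features, so a column of zeros is enough.
X_placeholder = np.zeros((len(y_all), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)

print(round(baseline_accuracy, 4))  # about 0.8873
```

Any classifier worth keeping should beat roughly 88.7% accuracy, and because the classes are imbalanced, accuracy alone may not be the most informative metric.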

In [31]:
model = KNeighborsRegressor(n_neighbors=3, n_jobs=-1)
model.fit(X_train, y_train)

In [32]:
# gather the predictions that our model made for our test set
preds = model.predict(X_test)

# display the actuals and predictions for the test set
print('Actuals for test data set')
print(y_test)
print('Predictions for test data set')
print(preds)

Actuals for test data set
[0 0 0 ... 0 1 0]
Predictions for test data set
[0.0 0.0 0.0 ... 0.0 0.0 0.0]


### Problem 9: Score the Model

What is the accuracy of your model?

In [34]:
from sklearn.metrics import r2_score

print(r2_score(y_test,preds))

-0.022800436321068096


In [35]:
from sklearn.metrics import explained_variance_score

print(explained_variance_score(y_test,preds))

-0.021413008873439887
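
`r2_score` and `explained_variance_score` are regression metrics, whereas the question asks for accuracy. A sketch of how accuracy could be computed with a `KNeighborsClassifier`, shown on synthetic data from `make_classification` (the data, split, and `n_neighbors=3` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the encoded bank features.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.89, 0.11], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# Accuracy is the fraction of test rows whose predicted label matches the true label.
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"accuracy = {acc:.3f}")
```

For a classifier, `clf.score(X_te, y_te)` returns the same accuracy value.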


In [36]:
scores = []
print(f'Features: {feat_cols} \nTarget: {target_col}')

for k in range(2, 20):
    print(f'Evaluating {k} neighbors')
    
    model = KNeighborsRegressor(n_neighbors=k, n_jobs=-1)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

Features: ['previous', 'emp.var.rate', 'cons.price.idx', 'euribor3m', 'default_no', 'default_unknown', 'default_yes', 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success'] 
Target: subscribed
Evaluating 2 neighbors
Evaluating 3 neighbors
Evaluating 4 neighbors
Evaluating 5 neighbors
Evaluating 6 neighbors
Evaluating 7 neighbors
Evaluating 8 neighbors
Evaluating 9 neighbors
Evaluating 10 neighbors
Evaluating 11 neighbors
Evaluating 12 neighbors
Evaluating 13 neighbors
Evaluating 14 neighbors
Evaluating 15 neighbors
Evaluating 16 neighbors
Evaluating 17 neighbors
Evaluating 18 neighbors
Evaluating 19 neighbors


In [38]:
plt.plot(range(2, 20), scores)
plt.scatter(range(2, 20), scores)
plt.grid()
plt.xticks(range(2, 20))

[Figure: line and scatter plot of the KNN regressor's test-set score against the number of neighbors k, for k = 2 to 19]

In [39]:
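# Note: score() on a KNeighborsRegressor returns R^2, so the value printed below is R^2, not classification accuracy
print(f'\nMax accuracy = {(max(scores)*100)}%')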


Max accuracy = 18.741208673146414%


In [40]:
max(scores)

0.18741208673146414

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------- | ------------- |
|       |            |                |               |
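
A table of this shape could be assembled by timing each classifier's `fit` call and collecting its train and test accuracy. The sketch below does this with default settings on synthetic data; the data and the variable names (`candidates`, `comparison_df`) are illustrative assumptions, not results for the bank dataset.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and target.
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

# The four classifiers named in the task, all with default settings.
candidates = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
}

rows = []
for name, clf in candidates.items():
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start  # wall-clock fit time
    rows.append({
        "Model": name,
        "Train Time": fit_seconds,
        "Train Accuracy": clf.score(X_tr, y_tr),
        "Test Accuracy": clf.score(X_te, y_te),
    })

comparison_df = pd.DataFrame(rows)
print(comparison_df)
```

Comparing train and test accuracy side by side also makes overfitting visible, for example when an unconstrained decision tree scores far higher on the training set than on the test set.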

In [41]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

from matplotlib import pyplot

In [43]:
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR(gamma='auto')))

In [44]:
seed = 42
num_folds = 5
scoring = 'neg_mean_squared_error'

In [45]:
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

LR: -0.078575 (0.000864)
LASSO: -0.099801 (0.001004)
EN: -0.099801 (0.001004)
KNN: -0.090317 (0.000960)
CART: -0.087526 (0.000977)
SVR: -0.086241 (0.001427)


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric.
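
The tuning and metric suggestions in the list above can be combined in a single grid search. This is a minimal sketch on synthetic imbalanced data: the grid values, the `balanced_accuracy` scoring choice, and the pipeline step names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the encoded bank features.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.89, 0.11], random_state=0
)

# Scale inside the pipeline so each CV fold is scaled using only its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate neighbor counts; step-name prefix 'knn__' routes the parameter to the classifier.
param_grid = {"knn__n_neighbors": [3, 5, 11, 21]}

# balanced_accuracy averages recall over both classes, which suits an imbalanced target.
search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern extends to other models, for example a `max_depth` grid for a decision tree, by swapping the final pipeline step and the parameter grid.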

In [46]:
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()),('AB', AdaBoostRegressor())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM', GradientBoostingRegressor())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF', RandomForestRegressor(n_estimators=10))])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()),('ET', ExtraTreesRegressor(n_estimators=10))])))

In [47]:
results = []
names = []
for name, model in ensembles:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

ScaledAB: -0.081415 (0.000467)
ScaledGBM: -0.076960 (0.000532)
ScaledRF: -0.083346 (0.001521)
ScaledET: -0.086701 (0.001210)
