# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

The study was conducted over 2.5 years and covered 17 marketing campaigns.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [4]:
import pandas as pd
import plotly.graph_objects as go
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from datetime import datetime

import random

In [5]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [6]:
df.head()

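|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |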


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


## Input variables:
#### Bank client data:
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
#### Related with the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular','telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
#### Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
#### Social and economic context attributes
16. emp.var.rate: employment variation rate. quarterly indicator (numeric)
17. cons.price.idx: consumer price index. monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index. monthly indicator (numeric)
19. euribor3m: euribor 3 month rate. daily indicator (numeric)
20. nr.employed: number of employees. quarterly indicator (numeric)

## Output variable (desired target):
21. y. has the client subscribed a term deposit? (binary: 'yes','no')
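The description above encodes missing information as the category `'unknown'` and uses `999` in `pdays` as a sentinel, so a plain null check will not reveal either. The sketch below shows one way to count both; the `sample` frame is a small hypothetical stand-in, not the real dataset.

```python
import pandas as pd

# Tiny hypothetical sample standing in for the real bank data.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 999, 3, 999],
})

# Count the 'unknown' placeholder in each categorical column.
text_cols = ["job", "default"]
unknown_counts = (sample[text_cols] == "unknown").sum()
print(unknown_counts)

# Flag the 999 sentinel in pdays ("client was not previously contacted").
never_contacted = sample["pdays"].eq(999)
print(never_contacted.mean())
```

On the real data, the same two checks can be run on `df` once it is loaded.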



### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

The business objective of this project is to find a model that can explain the success of a contact in a bank direct marketing campaign, specifically in relation to bank deposit subscriptions.

The goal is to identify the main characteristics that affect the success of a contact and use this information to increase the efficiency of the campaign.

By understanding the factors that contribute to success, the bank can better manage its resources (such as human effort, phone calls, and time) and select a high-quality and affordable set of potential customers who are likely to subscribe to the deposit.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [8]:
df.describe()

|  | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [9]:
df.nunique()


age                 78
job                 12
marital              4
education            8
default              3
housing              3
loan                 3
contact              2
month               10
day_of_week          5
duration          1544
campaign            42
pdays               27
previous             8
poutcome             3
emp.var.rate        10
cons.price.idx      26
cons.conf.idx       26
euribor3m          316
nr.employed         11
y                    2
dtype: int64

In [10]:
# print a random record (randrange excludes the upper bound, so the index is always valid)
print(df.iloc[random.randrange(len(df))])


age                                48
job                       blue-collar
marital                       married
education         professional.course
default                            no
housing                           yes
loan                              yes
contact                     telephone
month                             may
day_of_week                       tue
duration                          231
campaign                            1
pdays                             999
previous                            0
poutcome                  nonexistent
emp.var.rate                      1.1
cons.price.idx                 93.994
cons.conf.idx                   -36.4
euribor3m                       4.857
nr.employed                    5191.0
y                                  no
Name: 5950, dtype: object


In [11]:
df.rename(columns=lambda x: x.replace('.', '_'), inplace=True)
df.rename(columns={'y': 'response'}, inplace=True)
df['response'] = df['response'].map({'no': 0, 'yes': 1})
# regex=False treats '.' as a literal dot rather than a "match any character" pattern
df['education'] = df['education'].str.replace('.', '_', regex=False)

In [12]:
# See if there are any missing values in the data
missing_values = df.isnull().sum()
print(missing_values)

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
response          0
dtype: int64


### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1-7), prepare the features and target column for modeling with appropriate encoding and transformations.
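One possible way to encode the seven bank-client columns is one-hot encoding, which avoids implying an order among unordered categories such as `job`. The sketch below is only an illustration under that assumption: the `clients` frame is hypothetical, and the choice of `ColumnTransformer` with `OneHotEncoder` is one option among several.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows using the seven bank-client columns.
clients = pd.DataFrame({
    "age": [56, 37, 40],
    "job": ["housemaid", "services", "admin."],
    "marital": ["married", "married", "single"],
    "education": ["basic.4y", "high.school", "basic.6y"],
    "default": ["no", "unknown", "no"],
    "housing": ["no", "yes", "no"],
    "loan": ["no", "no", "yes"],
})

categorical = ["job", "marital", "education", "default", "housing", "loan"]

# One-hot encode the categorical columns and pass age through unchanged.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
encoded = encoder.fit_transform(clients)
print(encoded.shape)
```

Each distinct category becomes its own 0/1 column, so the width of the encoded matrix grows with the number of categories seen during fitting.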

In [13]:
job_response_pct = pd.crosstab(df['response'], df['job']).apply(lambda x: x/x.sum() * 100)
job_response_pct = job_response_pct.transpose()

In [14]:


job_response_pct = df.groupby('job')['response'].mean() * 100

fig = go.Figure(data=[go.Bar(x=job_response_pct.index, y=job_response_pct.values)])
fig.update_layout(title='Subscription Rate by Job', xaxis_title='Job Category', yaxis_title='Subscription Rate (%)')

fig.show()


In [15]:
married_response_percent = df.groupby('marital')['response'].mean() * 100
education_response_percent = df.groupby('education')['response'].mean() * 100

# Plot the subscription rates by job, marital status, and education on one horizontal bar chart
fig = go.Figure()

fig.add_trace(go.Bar(y=job_response_pct.index, x=job_response_pct.values, name='Job', orientation='h'))
fig.add_trace(go.Bar(y=married_response_percent.index, x=married_response_percent.values, name='Marital', orientation='h'))
fig.add_trace(go.Bar(y=education_response_percent.index, x=education_response_percent.values, name='Education', orientation='h'))

fig.update_layout(title='Subscription Rate by Job, Marital Status, and Education',
                  xaxis_title='Subscription Rate (%)',
                  yaxis_title='Categories')

fig.show()


In [16]:

label_encoder = LabelEncoder()

for feature in df.columns:
    if df[feature].dtype == 'object':
        old_value = df[feature]
        df[feature] = label_encoder.fit_transform(df[feature])
        new_value = df[feature]
        print(f'{feature}: {dict(zip(new_value, old_value))}')
        
df.head()


job: {3: 'housemaid', 7: 'services', 0: 'admin.', 1: 'blue-collar', 9: 'technician', 5: 'retired', 4: 'management', 10: 'unemployed', 6: 'self-employed', 11: 'unknown', 2: 'entrepreneur', 8: 'student'}
marital: {1: 'married', 2: 'single', 0: 'divorced', 3: 'unknown'}
education: {0: 'basic_4y', 3: 'high_school', 1: 'basic_6y', 2: 'basic_9y', 5: 'professional_course', 7: 'unknown', 6: 'university_degree', 4: 'illiterate'}
default: {0: 'no', 1: 'unknown', 2: 'yes'}
housing: {0: 'no', 2: 'yes', 1: 'unknown'}
loan: {0: 'no', 2: 'yes', 1: 'unknown'}
contact: {1: 'telephone', 0: 'cellular'}
month: {6: 'may', 4: 'jun', 3: 'jul', 1: 'aug', 8: 'oct', 7: 'nov', 2: 'dec', 5: 'mar', 0: 'apr', 9: 'sep'}
day_of_week: {1: 'mon', 3: 'tue', 4: 'wed', 2: 'thu', 0: 'fri'}
poutcome: {1: 'nonexistent', 0: 'failure', 2: 'success'}


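|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | response |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 6 | 1 | ... | 1 | 999 | 0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 1 | 57 | 7 | 1 | 3 | 1 | 0 | 0 | 1 | 6 | 1 | ... | 1 | 999 | 0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 2 | 37 | 7 | 1 | 3 | 0 | 2 | 0 | 1 | 6 | 1 | ... | 1 | 999 | 0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 3 | 40 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 6 | 1 | ... | 1 | 999 | 0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 4 | 56 | 7 | 1 | 3 | 0 | 0 | 2 | 1 | 6 | 1 | ... | 1 | 999 | 0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |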


In [17]:
#show the encoded labels with their original values


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [18]:
# split the data into training and testing sets

X = df.drop('response', axis=1)
y = df['response']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# check the shape of the training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


X_train shape: (32950, 20)
X_test shape: (8238, 20)
y_train shape: (32950,)
y_test shape: (8238,)
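Two refinements may be worth considering for the split. The data description notes that `duration` is only known after the call and should be discarded for a realistic model, and a stratified split keeps the share of subscribers equal in both sets. The sketch below illustrates both on a hypothetical `toy` frame; it is not the split used for the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 100 rows with a 10% positive rate.
toy = pd.DataFrame({
    "age": range(100),
    "duration": range(100, 200),
    "response": [1] * 10 + [0] * 90,
})

# Drop the target and the leaky duration column from the features.
X_toy = toy.drop(columns=["response", "duration"])
y_toy = toy["response"]

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```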


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

## First let's set up a function to measure how our model performs

In [19]:
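def train_model(model, X_train, y_train):
    """Fit the model, predict on the global X_test, and time both steps.

    Returns the fitted model, the test-set predictions, and the elapsed
    time in milliseconds (fit plus predict, not fit alone).
    """
    start = datetime.now()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)  # relies on X_test from the train/test split cell
    end = datetime.now()
    elapsed = end - start
    return model, pred, elapsed.total_seconds()*1000.0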

In [20]:
dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy, dummy_pred, dummy_time = train_model(dummy, X_train, y_train)

dummy_test_accuracy = accuracy_score(y_test, dummy_pred)
dummy_train_accuracy = dummy.score(X_train, y_train)

performance_log = pd.DataFrame([['Dummy Classifier', dummy_time, dummy_test_accuracy, dummy_train_accuracy]],
                       columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

print(performance_log)
print("=============================================")
print(f"Training time for Dummy Classifier: {dummy_time} milliseconds")
print(f"Accuracy of the model: {dummy_test_accuracy*100:.2f}%")


              Model  Training Time  Test Accuracy  Train Accuracy
0  Dummy Classifier          3.002       0.799466        0.800637
Training time for Dummy Classifier: 3.002 milliseconds
Accuracy of the model: 79.95%
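The `stratified` strategy guesses labels at random in proportion to the class frequencies, so for a positive rate `p` its expected accuracy is about `p**2 + (1 - p)**2`. Always predicting the majority class is a stricter baseline on imbalanced data. The sketch below compares the two on hypothetical labels (`y_demo`) with a 10% positive rate; the numbers are illustrative and do not come from the bank dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 10% positive rate.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
stratified = DummyClassifier(strategy="stratified", random_state=42).fit(X_demo, y_demo)

p = y_demo.mean()
majority_acc = majority.score(X_demo, y_demo)      # share of the majority class
stratified_acc = stratified.score(X_demo, y_demo)  # random draw, near the value below
expected_stratified = p**2 + (1 - p)**2

print(majority_acc, stratified_acc, expected_stratified)
```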


In [21]:
dt = DecisionTreeClassifier(random_state=42)
dt, dt_pred, dt_time = train_model(dt, X_train, y_train)

dt_test_accuracy = accuracy_score(y_test, dt_pred)
dt_train_accuracy = dt.score(X_train, y_train)

new_row = pd.DataFrame([['Decision Tree Classifier', dt_time, dt_test_accuracy, dt_train_accuracy]], 
                       columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)
print(performance_log)
print("=============================================")
print(f"Training time for Decision Tree Classifier: {dt_time} milliseconds")
print(f"Accuracy of the Decision Tree model: {dt_test_accuracy*100:.2f}%")

                      Model  Training Time  Test Accuracy  Train Accuracy
0          Dummy Classifier          3.002       0.799466        0.800637
1  Decision Tree Classifier        142.124       0.889415        1.000000
Training time for Decision Tree Classifier: 142.124 milliseconds
Accuracy of the Decision Tree model: 88.94%


In [22]:
rf = RandomForestClassifier(random_state=42)
rf, rf_pred, rf_time = train_model(rf, X_train, y_train)

rf_test_accuracy = accuracy_score(y_test, rf_pred)
rf_train_accuracy = rf.score(X_train, y_train)

new_row = pd.DataFrame([['Random Forest Classifier', rf_time, rf_test_accuracy, rf_train_accuracy]],
                          columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)
print(performance_log)
print(f"Training time for Random Forest Classifier: {rf_time} milliseconds")
print(f"Accuracy of the Random Forest model: {rf_test_accuracy*100:.2f}%")

                      Model  Training Time  Test Accuracy  Train Accuracy
0          Dummy Classifier          3.002       0.799466        0.800637
1  Decision Tree Classifier        142.124       0.889415        1.000000
2  Random Forest Classifier       4195.106       0.913328        1.000000
Training time for Random Forest Classifier: 4195.106 milliseconds
Accuracy of the Random Forest model: 91.33%


# What's a better performance?
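Higher accuracy comes at a cost in compute. The Random Forest is the most accurate model so far (91.3% on the test set), but it takes about 4.2 seconds to fit and predict. The Dummy Classifier finishes in about 3 milliseconds, but its accuracy is only about 80%. Both tree-based models also reach a training accuracy of 1.0, which points to overfitting. Which model is better therefore depends on how much accuracy is needed and how much training time is acceptable.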

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [23]:
lr = LogisticRegression(random_state=42)
lr, lr_pred, lr_time = train_model(lr, X_train, y_train)

lr_test_accuracy = accuracy_score(y_test, lr_pred)

lr_train_accuracy = lr.score(X_train, y_train)

new_row = pd.DataFrame([['Logistic Regression', lr_time, lr_test_accuracy, lr_train_accuracy]],
                            columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)


print(f"Training time for Logistic Regression: {lr_time} milliseconds")
print(f"Accuracy of the Logistic Regression model: {lr_test_accuracy*100:.2f}%")


Training time for Logistic Regression: 342.233 milliseconds
Accuracy of the Logistic Regression model: 90.98%



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
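The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both, using synthetic data from `make_classification` as a stand-in (the data and the value `max_iter=1000` are assumptions, not settings from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is moved to a much larger scale.
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=42)
X_syn[:, 0] = X_syn[:, 0] * 1000 + 5000

# Scaling inside the pipeline puts every feature on a comparable range
# before the solver runs.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_syn, y_syn)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = scaled_lr[-1].n_iter_[0]
print(iterations_used)
```

Keeping the scaler inside the pipeline also means it is fitted on the training data only when the pipeline is later used with cross-validation.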



### Problem 9: Score the Model

What is the accuracy of your model?

In [24]:
print(performance_log)

                      Model  Training Time  Test Accuracy  Train Accuracy
0          Dummy Classifier          3.002       0.799466        0.800637
1  Decision Tree Classifier        142.124       0.889415        1.000000
2  Random Forest Classifier       4195.106       0.913328        1.000000
3       Logistic Regression        342.233       0.909808        0.908346


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [25]:
# create a knn model and train it
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn, knn_pred, knn_time = train_model(knn, X_train, y_train)

knn_test_accuracy = accuracy_score(y_test, knn_pred)
knn_train_accuracy = knn.score(X_train, y_train)

new_row = pd.DataFrame([['K-Nearest Neighbors', knn_time, knn_test_accuracy, knn_train_accuracy]],
                            columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)

print(performance_log)

print(f"Training time for K-Nearest Neighbors: {knn_time} milliseconds")
print(f"Accuracy of the K-Nearest Neighbors model: {knn_test_accuracy*100:.2f}%")

                      Model  Training Time  Test Accuracy  Train Accuracy
0          Dummy Classifier          3.002       0.799466        0.800637
1  Decision Tree Classifier        142.124       0.889415        1.000000
2  Random Forest Classifier       4195.106       0.913328        1.000000
3       Logistic Regression        342.233       0.909808        0.908346
4       K-Nearest Neighbors        787.720       0.902282        0.931472
Training time for K-Nearest Neighbors: 787.72 milliseconds
Accuracy of the K-Nearest Neighbors model: 90.23%


In [26]:
#create a support vector machine model and train it

svm = SVC(random_state=42)
svm, svm_pred, svm_time = train_model(svm, X_train, y_train)

svm_test_accuracy = accuracy_score(y_test, svm_pred)
svm_train_accuracy = svm.score(X_train, y_train)

new_row = pd.DataFrame([['Support Vector Machine', svm_time, svm_test_accuracy, svm_train_accuracy]],
                            columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)

print(performance_log)

print(f"Training time for Support Vector Machine: {svm_time} milliseconds")
print(f"Accuracy of the Support Vector Machine model: {svm_test_accuracy*100:.2f}%")

                      Model  Training Time  Test Accuracy  Train Accuracy
0          Dummy Classifier          3.002       0.799466        0.800637
1  Decision Tree Classifier        142.124       0.889415        1.000000
2  Random Forest Classifier       4195.106       0.913328        1.000000
3       Logistic Regression        342.233       0.909808        0.908346
4       K-Nearest Neighbors        787.720       0.902282        0.931472
5    Support Vector Machine      26396.007       0.894513        0.898483
Training time for Support Vector Machine: 26396.007 milliseconds
Accuracy of the Support Vector Machine model: 89.45%


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
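On the last point, accuracy can look strong on imbalanced data even when a model never finds a single subscriber. The sketch below uses hypothetical labels and a model that always predicts "no" to show how recall and the confusion matrix expose this.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 10 positives out of 100, and a model that always says "no".
y_true = [1] * 10 + [0] * 90
y_always_no = [0] * 100

acc = accuracy_score(y_true, y_always_no)
rec = recall_score(y_true, y_always_no)   # share of true subscribers that were found
cm = confusion_matrix(y_true, y_always_no)  # rows: true class, columns: predicted class

print(acc, rec)
print(cm)
```

If finding subscribers is the goal, recall, precision, or F1 on the positive class may be more informative than accuracy alone.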

In [27]:
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=5))])
knn_pipe, knn_pipe_pred, knn_pipe_time = train_model(knn_pipe, X_train, y_train)

knn_pipe_test_accuracy = accuracy_score(y_test, knn_pipe_pred)
knn_pipe_train_accuracy = knn_pipe.score(X_train, y_train)

new_row = pd.DataFrame([['K-Nearest Neighbors with Standard Scaler', knn_pipe_time, knn_pipe_test_accuracy, knn_pipe_train_accuracy]],
                            columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)

print(performance_log)

print(f"Training time for K-Nearest Neighbors with Standard Scaler: {knn_pipe_time} milliseconds")
print(f"Accuracy of the K-Nearest Neighbors with Standard Scaler model: {knn_pipe_test_accuracy*100:.2f}%")

                                      Model  Training Time  Test Accuracy  \
0                          Dummy Classifier          3.002       0.799466   
1                  Decision Tree Classifier        142.124       0.889415   
2                  Random Forest Classifier       4195.106       0.913328   
3                       Logistic Regression        342.233       0.909808   
4                       K-Nearest Neighbors        787.720       0.902282   
5                    Support Vector Machine      26396.007       0.894513   
6  K-Nearest Neighbors with Standard Scaler        766.363       0.899854   

   Train Accuracy  
0        0.800637  
1        1.000000  
2        1.000000  
3        0.908346  
4        0.931472  
5        0.898483  
6        0.926677  
Training time for K-Nearest Neighbors with Standard Scaler: 766.363 milliseconds
Accuracy of the K-Nearest Neighbors with Standard Scaler model: 89.99%


In [28]:
param_grid = {'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search, grid_search_pred, grid_search_time = train_model(grid_search, X_train, y_train)

grid_search_test_accuracy = accuracy_score(y_test, grid_search_pred)
grid_search_train_accuracy = grid_search.score(X_train, y_train)

new_row = pd.DataFrame([['K-Nearest Neighbors with Grid Search', grid_search_time, grid_search_test_accuracy, grid_search_train_accuracy]],
                            columns=['Model', 'Training Time', 'Test Accuracy', 'Train Accuracy'])

performance_log = pd.concat([performance_log, new_row], ignore_index=True)

print(performance_log)

print(f"Training time for K-Nearest Neighbors with Grid Search: {grid_search_time} milliseconds")
print(f"Accuracy of the K-Nearest Neighbors with Grid Search model: {grid_search_test_accuracy*100:.2f}%")



                                      Model  Training Time  Test Accuracy  \
0                          Dummy Classifier          3.002       0.799466   
1                  Decision Tree Classifier        142.124       0.889415   
2                  Random Forest Classifier       4195.106       0.913328   
3                       Logistic Regression        342.233       0.909808   
4                       K-Nearest Neighbors        787.720       0.902282   
5                    Support Vector Machine      26396.007       0.894513   
6  K-Nearest Neighbors with Standard Scaler        766.363       0.899854   
7      K-Nearest Neighbors with Grid Search     118144.060       0.908109   

   Train Accuracy  
0        0.800637  
1        1.000000  
2        1.000000  
3        0.908346  
4        0.931472  
5        0.898483  
6        0.926677  
7        1.000000  
Training time for K-Nearest Neighbors with Grid Search: 118144.06 milliseconds
Accuracy of the K-Nearest Neighbors with Grid S
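Two observations may help interpret this search. With `weights='distance'`, each training point is its own nearest neighbor at distance zero, so a training accuracy of 1.0 is expected and says little about generalization. The `algorithm` option generally changes how neighbors are found rather than which ones, so leaving it out shrinks the grid without changing the candidates. The sketch below searches a smaller grid over a scaled pipeline on synthetic data; the grid values and names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step-prefixed names route each parameter to the right pipeline step.
small_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, small_grid, cv=5)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best setting
```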

In [29]:
# Perform cross-validation
from sklearn.model_selection import cross_val_score

knn_cv = KNeighborsClassifier(n_neighbors=5)
cv_scores = cross_val_score(knn_cv, X_train, y_train, cv=5)
cv_mean = cv_scores.mean()

print(f"Cross-validation scores: {cv_scores}")

print(f"Mean Cross-validation score: {cv_mean}")

# Perform Grid Search with Cross-validation

param_grid = {'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_search_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

grid_search_cv, grid_search_cv_pred, grid_search_cv_time = train_model(grid_search_cv, X_train, y_train)

grid_search_cv_test_accuracy = accuracy_score(y_test, grid_search_cv_pred)

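# Log the grid search's own mean CV score (best_score_); cv_mean above belongs to the default KNN
new_row = pd.DataFrame([['K-Nearest Neighbors with Grid Search and Cross-validation', grid_search_cv_time, grid_search_cv_test_accuracy, grid_search_cv.best_score_]],
                            columns=['Model', 'Training Time', 'Test Accuracy', 'Cross-validation Mean'])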

performance_log = pd.concat([performance_log, new_row], ignore_index=True)

print(performance_log)

print(f"Training time for K-Nearest Neighbors with Grid Search and Cross-validation: {grid_search_cv_time} milliseconds")
print(f"Accuracy of the K-Nearest Neighbors with Grid Search and Cross-validation model: {grid_search_cv_test_accuracy*100:.2f}%")


Cross-validation scores: [0.90500759 0.9047041  0.90091047 0.90257967 0.91016692]
Mean Cross-validation score: 0.9046737481031867
                                               Model  Training Time  \
0                                   Dummy Classifier          3.002   
1                           Decision Tree Classifier        142.124   
2                           Random Forest Classifier       4195.106   
3                                Logistic Regression        342.233   
4                                K-Nearest Neighbors        787.720   
5                             Support Vector Machine      26396.007   
6           K-Nearest Neighbors with Standard Scaler        766.363   
7               K-Nearest Neighbors with Grid Search     118144.060   
8  K-Nearest Neighbors with Grid Search and Cross...     128056.705   

   Test Accuracy  Train Accuracy  Cross-validation Mean  
0       0.799466        0.800637                    NaN  
1       0.889415        1.000000           

In [42]:
# Let's predict the outcome of a random record using the models we have trained

def predict_with_model(model, record):
    name = type(model).__name__
    # Keep the record as a one-row DataFrame so the feature names seen at fit time are preserved
    record = record.to_frame().T
    prediction = model.predict(record)[0]
    if prediction == 1:
        return f'The customer is predicted to subscribe, using {name} model'
    else:
        return f'The customer is not predicted to subscribe, using {name} model'

# randrange excludes the upper bound, so the index is always valid
sample_record = X_test.iloc[random.randrange(len(X_test))]
print(sample_record)

print(predict_with_model(dummy, sample_record))
print(predict_with_model(dt, sample_record))
print(predict_with_model(rf, sample_record))
print(predict_with_model(lr, sample_record))
print(predict_with_model(knn_pipe, sample_record))
print(predict_with_model(grid_search, sample_record))
print(predict_with_model(grid_search_cv, sample_record))



age                 36.000
job                  0.000
marital              1.000
education            3.000
default              0.000
housing              0.000
loan                 0.000
contact              1.000
month                6.000
day_of_week          3.000
duration          1346.000
campaign             1.000
pdays              999.000
previous             0.000
poutcome             1.000
emp_var_rate        -1.800
cons_price_idx      92.893
cons_conf_idx      -46.200
euribor3m            1.266
nr_employed       5099.100
Name: 36052, dtype: float64
The customer is not predicted to subscribe, using DummyClassifier model
The customer is not predicted to subscribe, using DecisionTreeClassifier model
The customer is not predicted to subscribe, using RandomForestClassifier model
The customer is predicted to subscribe, using LogisticRegression model
The customer is predicted to subscribe, using Pipeline model
The customer is predicted to subscribe, using GridSearchCV model
The c





In [None]:
print(performance_log)

                                               Model  Training Time  \
0                                   Dummy Classifier          3.000   
1                           Decision Tree Classifier        166.833   
2                           Random Forest Classifier       4273.949   
3                                Logistic Regression        309.343   
4                                K-Nearest Neighbors        783.728   
5                             Support Vector Machine      19512.009   
6           K-Nearest Neighbors with Standard Scaler        825.618   
7               K-Nearest Neighbors with Grid Search     124460.786   
8  K-Nearest Neighbors with Grid Search and Cross...     119423.028   

   Test Accuracy  Train Accuracy  Cross-validation Mean  
0       0.799466        0.800637                    NaN  
1       0.889415        1.000000                    NaN  
2       0.913328        1.000000                    NaN  
3       0.909808        0.908346                    NaN  

Based on this exercise, the recommendations are as follows.

## Best model to use

The Logistic Regression model is the best choice. Its test accuracy is 0.91, and it trains in 309 milliseconds, which is not the fastest time but is still quick. Its training accuracy is also 0.91. Because the training and test accuracies are nearly equal, the model shows no sign of overfitting, unlike the tree-based models, and because both are high, it is not underfitting either.

## Second best model to use

If training time is not a significant concern, the Random Forest Classifier is a strong alternative, since it has the highest test accuracy (0.913). However, its training accuracy of 1.0 indicates overfitting.
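The overfitting argument can be made concrete by computing the gap between training and test accuracy for each model. The sketch below does this with figures copied from the performance log earlier in this notebook; the `Gap` column is an addition for illustration.

```python
import pandas as pd

# Figures copied from the performance log printed earlier.
log = pd.DataFrame({
    "Model": [
        "Decision Tree Classifier",
        "Random Forest Classifier",
        "Logistic Regression",
        "K-Nearest Neighbors",
        "Support Vector Machine",
    ],
    "Test Accuracy": [0.889415, 0.913328, 0.909808, 0.902282, 0.894513],
    "Train Accuracy": [1.000000, 1.000000, 0.908346, 0.931472, 0.898483],
})

# A large positive gap means the model does much better on data it has seen.
log["Gap"] = log["Train Accuracy"] - log["Test Accuracy"]
print(log.sort_values("Gap"))
```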

## Who should be the primary focus of future campaigns?

Based on the data, those most likely to subscribe are students and people with degrees. There is also a strong signal for those who are illiterate, but I'd want to dig into that more before acting on it.

Retired people should be the second focus.

Other factors could be derived as well, such as age and whether the client had a previous loan, but this needs further inquiry.
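When digging into a signal like the one for the illiterate group, it may help to look at group sizes next to the rates, since a high rate in a very small group is weak evidence. The sketch below shows the idea on a hypothetical `toy` frame in which one group is deliberately tiny; the counts are invented for illustration.

```python
import pandas as pd

# Hypothetical responses; the 'illiterate' group is deliberately tiny.
toy = pd.DataFrame({
    "education": ["university_degree"] * 50 + ["high_school"] * 40 + ["illiterate"] * 2,
    "response": [1] * 7 + [0] * 43 + [1] * 4 + [0] * 36 + [1] + [0],
})

# Report the subscription rate together with the number of clients in each group.
summary = (
    toy.groupby("education")["response"]
    .agg(rate="mean", n="count")
    .sort_values("rate", ascending=False)
)
print(summary)
```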



##### Questions