# Week 5 Lab: Supervised Learning

<img align="right" style="padding-right:10px;" src="figures_wk5/knn.png" width=300><br>

# Introduction

This week's assignment focuses on completing a KNN analysis and comparing its performance with other supervised algorithms.


## Our Dataset: 
**Dataset:** bank-additional-full.csv (Provided in folder assign_wk5)


[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001


Input variables:
#### bank client data:
   - **age** (numeric)
   - **job:** type of job 
       (categorical)
       - "admin."
       - "blue-collar"
       - "entrepreneur"
       - "housemaid"
       - "management"
       - "retired"
       - "self-employed"
       - "services"
       - "student"
       - "technician"
       - "unemployed"
       - **"unknown"** [x][missing]
   - **marital:** marital status 
       (categorical)
       - "divorced"
       - "married"
       - "single"
       - **"unknown"** [x][missing]
       **note:** "divorced" means divorced or widowed      
   - **education** 
       (categorical)
      - "basic.4y"
      - "basic.6y"
      - "basic.9y"
      - "high.school"
      - "illiterate"
      - "professional.course"
      - "university.degree"
      - **"unknown"** [x][missing]
   - **default:** has credit in default? 
       (categorical)
       - "no"
       - "yes"
       - **"unknown"** [x][missing]
   - **housing:** has housing loan? 
       (categorical)
       - "no"
       - "yes"
       - **"unknown"** [x][missing]
   - **loan:** has personal loan? 
       (categorical)
       - "no"
       - "yes"
       - **"unknown"** [x][missing]
       
#### Related with the Last Contact of the Current Campaign:
  
   - **contact:** contact communication type 
       (categorical)
       - "cellular"
       - "telephone"
       
   - **month:** last contact month of year 
       (categorical)
       - "jan"
       - "feb"
       - "mar"
       - "etc"
       
   - **day_of_week:** last contact day of the week 
      (categorical)
      - "mon"
      - "tue"
      - "wed"
      - "thu"
      - "fri"
      
   - **duration:** last contact duration, in seconds (numeric) 
     
      - **Important note:**  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, *__this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.__*
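
The note above can be sketched as two feature sets: one that keeps `duration` for benchmarking and one that drops it for a realistic model. This is only an illustration; the `toy` frame and its values are made up and stand in for the real data.

```python
import pandas as pd

# Hypothetical toy rows standing in for bank-additional-full.csv
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "duration": [261, 0, 151],
    "campaign": [1, 2, 1],
    "y": ["no", "no", "yes"],
})

# Benchmark features: everything except the target
X_benchmark = toy.drop(columns=["y"])

# Realistic features: also drop duration, which is unknown before the call
X_realistic = toy.drop(columns=["y", "duration"])

print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```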
      
#### Other Attributes:
   
   - **campaign:** number of contacts performed during this campaign and for this client (numeric) 
        - includes last contact
   - **pdays:** number of days that passed by after the client was last contacted from a previous campaign 
      (numeric)
      - 999 means client was not previously contacted
   - **previous:** number of contacts performed before this campaign and for this client 
      (numeric)
      
   - **poutcome:** outcome of the previous marketing campaign 
      (categorical)
      - "failure"
      - "nonexistent"
      - "success"
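
Because `pdays` uses 999 as a sentinel for "not previously contacted", a distance-based method such as KNN would treat it as a very large day count. One possible way to handle this, shown on a made-up `toy` frame, is to split the column into a flag and a cleaned value. The column names `previously_contacted` and `pdays_clean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 999 means the client was not previously contacted
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag whether the client was contacted in a previous campaign
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Replace the sentinel with NaN so it is not read as a real day count
toy["pdays_clean"] = toy["pdays"].replace(999, np.nan)

print(toy)
```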
      
#### Social and Economic Context Attributes:
   
   - **emp.var.rate:** employment variation rate - quarterly indicator 
      (numeric)
      
   - **cons.price.idx:** consumer price index - monthly indicator 
      (numeric)     
      
   - **cons.conf.idx:** consumer confidence index - monthly indicator 
      (numeric)   
      
   - **euribor3m:** euribor 3 month rate - daily indicator 
      (numeric)
      
   - **nr.employed:** number of employees - quarterly indicator 
      (numeric)

  #### Output variable (desired target):
  
   - **y:** has the client subscribed a term deposit? 
      (binary)
      - "yes"
      - "no"

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier  # y is binary (yes/no), so KNN is used as a classifier
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set()

In [2]:
df = pd.read_csv("assign_wk5/bank-additional-full.csv", delimiter=";")
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# Assignment Requirements

## Part 1: KNN Analysis
**Objective:** According to the dataset's text file, the target column is the last column in the dataset.

### - Clean up the dataset as you deem appropriate. As always, defend your reasoning!!!
       - Missing values?
       - Column names

According to the supporting text file:

There are several missing values in some categorical attributes, all coded with the "unknown" label. These missing values can be treated as a possible class label or using deletion or imputation techniques.
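
The three options named in the text file (keep "unknown" as a class label, delete, or impute) might look like this on a small made-up frame. The `toy` data and the choice of most-frequent-value imputation are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with "unknown" standing in for missing values
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "admin."],
    "housing": ["yes", "no", "unknown", "yes"],
})

# Option 1: keep "unknown" as its own category (no change needed)
as_category = toy.copy()

# Option 2: deletion, after converting "unknown" to NaN
as_nan = toy.replace("unknown", np.nan)
deleted = as_nan.dropna()

# Option 3: imputation with the most frequent value of each column
imputed = as_nan.fillna(as_nan.mode().iloc[0])

print(len(deleted))
print(imputed)
```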

In [3]:
df.replace('unknown', np.nan, inplace=True)

### First step: Missing Values (Quick removal)

I found a tutorial on the following technique: https://www.youtube.com/watch?v=DNgCfWJIW5A

I first dropped the rows with missing values in any feature where less than 1% of the values were missing.

In [4]:
features_completecase = [ feature for feature in df.columns if df[feature].isnull().mean() < 0.01 ]

In [5]:
df.shape

(41188, 21)

In [6]:
df[features_completecase].shape

(41188, 17)

In [7]:
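# Round-trip through CSV to get an independent working copy (to_csv returns None, so it is not assigned)
df.to_csv('assign_wk5/df_clean.csv', index=False)
df_clean = pd.read_csv('assign_wk5/df_clean.csv')
# Drop rows missing 'job' or 'marital'; each is missing in under 1% of rows
df_clean.dropna(axis=0, subset=['job', 'marital'], inplace=True)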

In [8]:
len(df_clean)/len(df)

0.9902641546081383

I still have 99% of the original dataframe!
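
As a rough sketch of the complete-case step above, the threshold logic can be written so that the low-missing columns are found automatically rather than typed by hand. The `toy` frame and its missing-value rates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'job' is 0.5% missing, 'default' is 25% missing
toy = pd.DataFrame({
    "job": ["admin."] * 199 + [np.nan],
    "default": ["no"] * 150 + [np.nan] * 50,
    "age": list(range(200)),
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Columns that have some missing values, but fewer than 1%
low_missing = missing_share[(missing_share > 0) & (missing_share < 0.01)].index.tolist()

# Drop only the rows that are missing in those columns
toy_clean = toy.dropna(subset=low_missing)

# Fraction of rows retained
retained = len(toy_clean) / len(toy)
print(low_missing, retained)
```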

### Second step: Missing Values (conditional imputation)

In [9]:
df_clean

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [10]:
# Categorical (object) columns of df_clean that still contain missing values
missing_var = [var for var in df_clean.columns if df_clean[var].isnull().mean()>0 
               and df_clean[var].dtypes == 'O']

In [11]:
missing_var

['education', 'default', 'housing', 'loan']

In [12]:
df_clean[missing_var].isnull().mean()

education    0.039130
default      0.206831
housing      0.024125
loan         0.024125
dtype: float64

In [13]:
df_clean.fillna('Missing', inplace = True)

In [14]:
df_clean['default'].value_counts()

no         32348
Missing     8436
yes            3
Name: default, dtype: int64

In [15]:
df[(df['default'] == 'yes')]

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
21580,48,technician,married,professional.course,yes,no,no,cellular,aug,tue,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,no
21581,48,technician,married,professional.course,yes,yes,no,cellular,aug,tue,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,no
24866,31,unemployed,married,high.school,yes,no,no,cellular,nov,tue,...,2,999,1,failure,-0.1,93.2,-42.0,4.153,5195.8,no


I suspect that clients who are both unemployed and in default would not answer this question. I probably wouldn't.

**debt is scary**

In [16]:
# Copy the subset so the edit below does not act on a view of df_clean
nan_default = df_clean.loc[(df_clean['default'] == 'Missing') 
             & (df_clean['job'] == 'unemployed')].copy()

In [17]:
nan_default['default'] = nan_default['default'].replace({'Missing': 'yes'})

In [18]:
df_clean.update(nan_default)

In [19]:
df_clean.default.value_counts()

no         32348
Missing     8199
yes          240
Name: default, dtype: int64
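
The same conditional imputation can also be written as a single boolean-mask assignment with `.loc`, which avoids building a separate subset and merging it back. This is an alternative sketch on made-up data, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical toy rows; "Missing" marks an unanswered default question
toy = pd.DataFrame({
    "job": ["unemployed", "admin.", "unemployed", "student"],
    "default": ["Missing", "Missing", "no", "yes"],
})

# Rows where default is missing AND the client is unemployed
mask = (toy["default"] == "Missing") & (toy["job"] == "unemployed")

# Assign in place; rows outside the mask are left untouched
toy.loc[mask, "default"] = "yes"

print(toy["default"].value_counts())
```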

In [20]:
df_clean['education'].value_counts().sort_values()

illiterate                18
Missing                 1596
basic.6y                2264
basic.4y                4118
professional.course     5225
basic.9y                6006
high.school             9464
university.degree      12096
Name: education, dtype: int64

In [21]:
# Copy the subset so later edits do not act on a view of df_clean
nan_education = df_clean.loc[(df_clean['education'] == 'Missing') 
             & (df_clean['job'] == 'student')
             & (df_clean['age'] > 18)].copy()

In [22]:
nan_education

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
383,30.0,student,single,Missing,Missing,no,no,telephone,may,tue,...,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3011,30.0,student,single,Missing,Missing,no,no,telephone,may,wed,...,4.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.859,5191.0,no
3442,31.0,student,single,Missing,Missing,yes,no,telephone,may,thu,...,5.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.860,5191.0,no
3620,26.0,student,single,Missing,Missing,no,no,telephone,may,fri,...,5.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.859,5191.0,no
5808,30.0,student,single,Missing,Missing,yes,no,telephone,may,mon,...,2.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40923,20.0,student,single,Missing,no,no,no,cellular,oct,mon,...,3.0,999.0,3.0,failure,-1.1,94.601,-49.5,0.977,4963.6,no
40929,20.0,student,single,Missing,no,yes,yes,cellular,oct,tue,...,1.0,3.0,4.0,success,-1.1,94.601,-49.5,0.982,4963.6,yes
40935,20.0,student,single,Missing,no,no,no,telephone,oct,tue,...,1.0,999.0,1.0,failure,-1.1,94.601,-49.5,0.982,4963.6,no
41002,45.0,student,single,Missing,no,yes,no,cellular,oct,wed,...,2.0,999.0,2.0,failure,-1.1,94.601,-49.5,1.016,4963.6,no


I will assume all students have completed high school and are working on post-secondary education or professional courses.

**In this world GEDs don't exist**

Also, let's stick to adults.

In [23]:
df_clean.drop(df_clean[df_clean['age'] == 17].index, inplace=True)

In [24]:
nan_education['education'] = nan_education['education'].replace({'Missing': 'high.school'})

In [25]:
df_clean.update(nan_education)

In [26]:
df_clean.education.value_counts()

university.degree      12096
high.school             9616
basic.9y                6003
professional.course     5225
basic.4y                4118
basic.6y                2264
Missing                 1442
illiterate                18
Name: education, dtype: int64

In [34]:
# Copy the subset so the edit below does not act on a view of df_clean
hs_student = df_clean[(df_clean['age'] == 18) & 
         (df_clean['housing'] == 'yes') & 
         (df_clean['education'] == 'Missing')].copy()

In [35]:
hs_student['education'] = hs_student['education'].replace({'Missing': 'high.school'})

In [36]:
df_clean.update(hs_student)

In [37]:
df_clean.education.value_counts()

university.degree      12096
high.school             9625
basic.9y                6003
professional.course     5225
basic.4y                4118
basic.6y                2264
Missing                 1433
illiterate                18
Name: education, dtype: int64

If a row has no answers to any of the loan questions (`housing`, `loan`, `default`), it adds little for drawing conclusions or for ML training. As far as possible, I want to keep uninformative rows out of what my algorithm learns from.

**boring data**

In [33]:
df_clean['loan'].value_counts().sort_values()

Missing      984
yes         6183
no         33620
Name: loan, dtype: int64

In [35]:
df_clean.drop(df_clean[(df_clean['housing'] == 'Missing') &
                       (df_clean['loan'] == 'Missing') &
                       (df_clean['default'] == 'Missing')].index, inplace=True)
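
A compact way to express "all three loan-related answers are missing" is a row-wise `.all(axis=1)` over the relevant columns. The sketch below uses a made-up `toy` frame and keeps the `'Missing'` placeholder from the cells above.

```python
import pandas as pd

# Hypothetical toy rows; only the first row is missing all three answers
toy = pd.DataFrame({
    "housing": ["Missing", "yes", "Missing"],
    "loan": ["Missing", "no", "Missing"],
    "default": ["Missing", "no", "no"],
})

loan_cols = ["housing", "loan", "default"]

# True for rows where every loan-related column is 'Missing'
all_missing = (toy[loan_cols] == "Missing").all(axis=1)

# Keep the rows that have at least one answer
toy_kept = toy[~all_missing]

print(toy_kept)
```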

Let's see how I did:

In [57]:
df_clean.replace('Missing', np.nan, inplace=True)

In [60]:
df_clean[missing_var].isnull().mean()

education    0.035197
default      0.115474
housing      0.018954
loan         0.018954
dtype: float64

In [59]:
len(df_clean)/len(df)

0.9850441876274643

**I still have 98% of the original data!**

### - Prepare the data for machine learning
       - A little EDA goes a long way
       - Do you need to do anything about data types?
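
On the data-type question: KNN works on numeric distances, so one common preparation is to map the target to 0/1, one-hot encode categorical columns, and standardize numeric ones. The sketch below is a minimal illustration on made-up rows. In a real analysis the scaler would normally be fit on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows with one categorical and two numeric predictors
toy = pd.DataFrame({
    "age": [56, 37, 40, 25],
    "euribor3m": [4.857, 4.857, 1.028, 0.977],
    "contact": ["telephone", "cellular", "cellular", "telephone"],
    "y": ["no", "no", "yes", "no"],
})

# Binary target as 0/1
y = toy["y"].map({"no": 0, "yes": 1})

# One-hot encode the categorical predictor
X = pd.get_dummies(toy.drop(columns=["y"]), columns=["contact"], dtype=int)

# Standardize numeric columns so no single feature dominates the distance
num_cols = ["age", "euribor3m"]
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

print(X)
```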

###  - KNN analysis
       - What is your objective from the analysis?
       - What is your optimal K? 
       - How about accuracy rate? 
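
One way to approach the optimal-K question is to fit `KNeighborsClassifier` over a range of odd K values and compare accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in for the cleaned bank data, and the K range is an arbitrary assumption. Picking K on the test split gives an optimistic estimate, so a validation split or cross-validation is usually preferable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data (stand-in for the real features)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Accuracy for each odd K from 1 to 29
scores = {}
for k in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    scores[k] = knn.score(X_test_s, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```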

###  - Discover any insights from this analysis? 
       - Include numbers/graphs corresponding to your conclusions
       - Discuss ways to improve the performance of your KNN model 
       - Defend and backup your thoughts!!!!!!

## Part 2: Comparison to other supervised algorithms

###  - At the end of part 1 you discussed ways to improve the performance of your KNN model. 
       - Implement one of those methods to improve your KNN model performance.
       - Rerun a KNN analysis for your improved dataset
       - Discuss the change in performance from the model in part 1

###  - Complete a K-fold cross-validation analysis for your improved model
       - You need to use at least three additional models
       - Discuss the difference in the performance of the 4 algorithms against your improved dataset.
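
A K-fold comparison of four models could be sketched as below. The data is synthetic, and the three additional models (logistic regression, decision tree, Gaussian naive Bayes) are an arbitrary choice for illustration; any other supervised classifiers could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (stand-in for the improved dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
}

# Same stratified 5-fold split for every model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in results.items():
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Using the same `cv` object for every model keeps the folds identical, so differences in mean accuracy reflect the models rather than the split.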

# Deliverables:


Upload your Jupyter Notebook to the corresponding location in WorldClass. 

**Note:** Make sure you have clearly indicated each assignment requirement within your notebook.
   
**Important:** Make sure you provide complete and thorough explanations for all of your analysis. You need to defend your thought processes and reasoning.