### Colab activity 12.1:  Introduction to K-Nearest Neighbors


**Expected Time: 60 Minutes**



This activity introduces the `KNeighborsClassifier` from scikit-learn. You will build a few versions of the model with different values of `k` (the `n_neighbors` parameter) and compare their performance. You will also scale the data during preprocessing, because KNN relies on distances and features measured on larger scales would otherwise dominate.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
default = pd.read_csv('data/default.csv')

In [3]:
default.head()

   Unnamed: 0 default student      balance        income
0           1      No      No   729.526495  44361.625074
1           2      No     Yes   817.180407  12106.134700
2           3      No      No  1073.549164  31767.138947
3           4      No      No   529.250605  35704.493935
4           5      No      No   785.655883  38463.495879


[Back to top](#-Index)

### Problem 1

#### Determine `X` and `y`


Define `X` as all columns except for `default` and `y` as `default` below.

In [9]:


X = default.drop(['default'], axis=1)
y = default[['default']]



# Answer check
print(X.head())
print('==============')
print(y.head())

   Unnamed: 0 student      balance        income
0           1      No   729.526495  44361.625074
1           2     Yes   817.180407  12106.134700
2           3      No  1073.549164  31767.138947
3           4      No   529.250605  35704.493935
4           5      No   785.655883  38463.495879
==============
  default
0      No
1      No
2      No
3      No
4      No
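
The cell above selects the target with double brackets, so `y` is a one-column DataFrame rather than a 1-D Series. scikit-learn classifiers generally expect a 1-D target and may emit a `DataConversionWarning` when given a column vector. The sketch below shows the difference on a small made-up frame; `toy`, `y_frame`, and `y_series` are illustrative names, not part of the activity.

```python
import pandas as pd

# Toy frame standing in for `default` (values are illustrative only)
toy = pd.DataFrame({
    "default": ["No", "No", "Yes"],
    "student": ["No", "Yes", "No"],
    "balance": [729.5, 817.2, 1900.0],
})

y_frame = toy[["default"]]   # double brackets -> DataFrame with one column
y_series = toy["default"]    # single brackets -> 1-D Series

print(y_frame.shape)   # (3, 1)
print(y_series.shape)  # (3,)
```

Either form holds the same labels; the Series form simply matches the 1-D shape that estimators such as `KNeighborsClassifier` expect for `y`.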


[Back to top](#-Index)

### Problem 2

#### Create train/test split



Use the `train_test_split` function to create a train/test split of `X` and `y`, with 25% of the data assigned to the test set. Set `random_state = 42` to ensure correct grading.

In [13]:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)



# Answer check
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7500, 4)
(2500, 4)
(7500, 1)
(2500, 1)
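
As a small illustration of what `test_size` and `random_state` do, the sketch below splits a made-up array twice with the same seed. The arrays `X_toy` and `y_toy` are assumptions for demonstration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up observations with 2 features each
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Two splits with the same seed
first = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)              # (15, 2) (5, 2)
print(np.array_equal(first[1], second[1]))  # True: same seed, same test rows
```

Fixing `random_state` is what makes the split, and therefore the reported accuracy, reproducible from run to run.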


[Back to top](#-Index)

### Problem 3

#### Column transformer for encoding `student` and scaling `['balance', 'income']`



Use `make_column_transformer` to create a column transformer. Pass it an instance of scikit-learn's `OneHotEncoder` with `drop` set to `'if_binary'`, applied to the `student` column. To the `remainder` columns, apply a `StandardScaler()` transformation.

Assign your column transformer to `transformer` below.

[Documentation for `make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)

In [20]:


transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['student']),
    (StandardScaler(), ['balance'])
)



# Answer check
print(transformer)

ColumnTransformer(transformers=[('onehotencoder',
                                 OneHotEncoder(drop='if_binary'), ['student']),
                                ('standardscaler', StandardScaler(),
                                 ['balance'])])
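
The instruction above refers to the `remainder` columns, while the transformer in the cell names `balance` explicitly, so any column not listed (here `income` and `Unnamed: 0`) falls under the default `remainder='drop'`. A minimal sketch of the `remainder=StandardScaler()` form on a small made-up frame follows; the frame `toy` and the name `remainder_transformer` are illustrative assumptions, not the activity's data or graded answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; only the column names mirror the activity's data
toy = pd.DataFrame({
    "student": ["No", "Yes", "No", "Yes"],
    "balance": [700.0, 800.0, 1100.0, 500.0],
    "income": [44000.0, 12000.0, 32000.0, 36000.0],
})

# Encode `student`; scale every column that is not named explicitly
remainder_transformer = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["student"]),
    remainder=StandardScaler(),
)

out = remainder_transformer.fit_transform(toy)
print(out.shape)  # (4, 3): one encoded column plus two scaled columns
print(np.allclose(out[:, 1:].mean(axis=0), 0.0))  # True: scaled columns are centered
```

On the full `default` data this form would also scale `Unnamed: 0`, which appears to be a row index, so dropping that column beforehand may be preferable.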


[Back to top](#-Index)

### Problem 4

#### Pipeline with KNN and `n_neighbors = 5`



Using the column `transformer` defined above, create a `Pipeline` named `fivepipe` with steps `transform` and `knn` that transforms the columns and then builds a KNN model using `KNeighborsClassifier()`, whose default is `n_neighbors=5`.

Use the `fit` method to fit the pipeline on the training data, then use the `.score` method of the fitted pipeline to determine its accuracy on the test data. Assign this to `fivepipe_acc` below.

In [31]:


fivepipe = Pipeline([
    ('transform', transformer),
    ('knn', KNeighborsClassifier())  # default n_neighbors=5
])

# Pass the target as a 1-D Series so scikit-learn does not warn about a column-vector y
fivepipe_model = fivepipe.fit(X_train, y_train['default'])
fivepipe_acc = fivepipe_model.score(X_test, y_test['default'])

# Answer check
print(fivepipe_acc)

0.9724





[Back to top](#-Index)

### Problem 5

#### Pipeline with `n_neighbors = 50`



Using the column `transformer` defined above, create a `Pipeline` named `fiftypipe` with steps `transform` and `knn` that transforms the columns and then builds a KNN model using `KNeighborsClassifier()`. Build the KNN model with `n_neighbors = 50`.

Use the `fit` method to fit the pipeline on the training data, then use the `.score` method of the fitted pipeline to determine its accuracy on the test data. Assign this to `fiftypipe_acc` below.


In [72]:


fiftypipe = Pipeline([
    ('transform', transformer),
    ('knn', KNeighborsClassifier(n_neighbors=50))
])

# Pass the target as a 1-D Series so scikit-learn does not warn about a column-vector y
fiftypipe_model = fiftypipe.fit(X_train, y_train['default'])
fiftypipe_acc = fiftypipe_model.score(X_test, y_test['default'])

# Answer check
print(fiftypipe_acc)

0.9728
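
Problems 4 and 5 differ only in `n_neighbors`, so the comparison can also be written as a loop. The sketch below does this on synthetic data from `make_classification`; the data, the step name `scale`, and the chosen values of `k` are assumptions for illustration, and the scores it prints say nothing about the `default` dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the `default` dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

scores = {}
for k in (1, 5, 50):
    pipe = Pipeline([
        ("scale", StandardScaler()),                   # scale before measuring distances
        ("knn", KNeighborsClassifier(n_neighbors=k)),  # vote among the k nearest points
    ])
    scores[k] = pipe.fit(X_tr, y_tr).score(X_te, y_te)

for k, acc in scores.items():
    print(f"k={k:>2}: test accuracy = {acc:.3f}")
```

Building a fresh `Pipeline` inside the loop keeps each fit independent, so the scaler is always re-estimated from the training split only.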




[Back to top](#-Index)

### Problem 6

#### False Predictions


Finally, compare the two pipelines by their total number of errors (FP + FN), that is, the test observations whose predicted label differs from the true label. Assign these counts as integers to `five_fp` and `fifty_fp`, respectively.

(Hint: Count the predictions on `X_test` that are not equal to `y_test`.)

In [73]:


# Total errors (FP + FN): test predictions that differ from the true labels
five_fp = int((fivepipe_model.predict(X_test) != y_test['default']).sum())
fifty_fp = int((fiftypipe_model.predict(X_test) != y_test['default']).sum())

# Answer check
print(f'Number of False Predictions with five neighbors: {five_fp}')
print(f'Number of False Predictions with fifty neighbors: {fifty_fp}')

Number of False Predictions with five neighbors: 69
Number of False Predictions with fifty neighbors: 68
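
The counts above combine false positives and false negatives. If the two are wanted separately, `confusion_matrix` from `sklearn.metrics` is one way to get them. The sketch below uses made-up labels, with `y_true` and `y_pred` as illustrative arrays rather than the activity's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels; "Yes" is treated as the positive class
y_true = np.array(["No", "No", "Yes", "Yes", "No", "Yes"])
y_pred = np.array(["No", "Yes", "Yes", "No", "No", "No"])

# With labels=["No", "Yes"], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["No", "Yes"]).ravel()

# Same total as counting mismatched predictions directly
total_errors = int((y_pred != y_true).sum())
print(fp, fn, total_errors)  # 1 2 3
```

Passing `labels` explicitly fixes the row and column order, so the unpacking does not depend on how the labels happen to sort.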
