### Codio Assignment 12.1: Introduction to K Nearest Neighbors


**Expected Time: 60 Minutes**

**Total Points: 50**


This activity introduces the `KNeighborsClassifier` from scikit-learn. You will build a few versions of the classifier with different values of `k` and compare their performance. You will also preprocess your data by scaling, which generally helps a distance-based classifier such as KNN.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [31]:
default = pd.read_csv('../data/default.csv')

In [33]:
default.head()

| Unnamed: 0.1 | Unnamed: 0 | default | student | balance | income |
|---|---|---|---|---|---|
| 0 | 1 | No | No | 729.526495 | 44361.625074 |
| 1 | 2 | No | Yes | 817.180407 | 12106.1347 |
| 2 | 3 | No | No | 1073.549164 | 31767.138947 |
| 3 | 4 | No | No | 529.250605 | 35704.493935 |
| 4 | 5 | No | No | 785.655883 | 38463.495879 |


[Back to top](#-Index)

### Problem 1

#### Determine `X` and `y`

**5 Points**

Define `X` as all columns except for `default` and `y` as `default` below.

In [37]:
### GRADED

# YOUR CODE HERE
X = default[['student', 'balance', 'income']]
y = default['default']

# Answer check
print(X.head())
print('==============')
print(y.head())

  student      balance        income
0      No   729.526495  44361.625074
1     Yes   817.180407  12106.134700
2      No  1073.549164  31767.138947
3      No   529.250605  35704.493935
4      No   785.655883  38463.495879
==============
0    No
1    No
2    No
3    No
4    No
Name: default, dtype: object
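
An equivalent way to separate features from the target is to drop columns by name rather than list the ones to keep. The sketch below assumes a small hypothetical frame, `toy`, that mimics the layout of `default.csv`; its values are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical miniature of the default data; values are illustrative only.
toy = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3],
    'default': ['No', 'No', 'Yes'],
    'student': ['No', 'Yes', 'No'],
    'balance': [729.5, 817.2, 1900.0],
    'income': [44361.6, 12106.1, 31767.1],
})

# Drop the target and the leftover row-number column to obtain the features.
X_toy = toy.drop(columns=['default', 'Unnamed: 0'])
y_toy = toy['default']

print(list(X_toy.columns))
print(y_toy.name)
```

Dropping by name keeps working if more feature columns are added later, whereas an explicit column list has to be updated by hand.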


[Back to top](#-Index)

### Problem 2

#### Create train/test split

**5 Points**

Use the `train_test_split` function to create a train/test split on `X` and `y` with 25% of the data assigned to the test set. Set `random_state = 42` to ensure correct grading.

In [41]:
### GRADED

# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Answer check
print(X_train.shape)
print(X_test.shape)

(7500, 3)
(2500, 3)
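
To see how `test_size` and `random_state` behave, here is a minimal sketch on a hypothetical 20-row array (`X_demo`, `y_demo` are illustrative names, not part of the assignment). It shows the 75/25 row counts and that a fixed seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, balanced binary labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# 25% of 20 rows go to the test set, the remaining 15 to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed yields the same test rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

The same proportions explain the shapes above: 25% of 10,000 rows is 2,500 test rows, leaving 7,500 for training.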


[Back to top](#-Index)

### Problem 3

#### Column transformer for encoding `student` and scaling `['balance', 'income']`

**10 Points**

Use `make_column_transformer` to create a column transformer. Inside `make_column_transformer`, specify an instance of the `OneHotEncoder` transformer from scikit-learn with `drop` set to `'if_binary'`, and apply it to the `student` column. On the `remainder` columns, apply a `StandardScaler()` transformation.

Assign your column transformer to `transformer` below.

[Documentation for `make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)

In [45]:
### GRADED

# YOUR CODE HERE
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['student']),
    remainder=StandardScaler()
)

# Answer check
print(transformer)

ColumnTransformer(remainder=StandardScaler(),
                  transformers=[('onehotencoder',
                                 OneHotEncoder(drop='if_binary'),
                                 ['student'])])
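
To make the effect of the transformer concrete, the sketch below applies the same kind of transformer to a hypothetical four-row frame (`toy_X`, with made-up values). It is only an illustration: with `drop='if_binary'` the `student` column becomes a single 0/1 column, and the remaining columns are standardized to mean 0 and unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy features; values are illustrative only.
toy_X = pd.DataFrame({
    'student': ['No', 'Yes', 'No', 'Yes'],
    'balance': [500.0, 1000.0, 1500.0, 2000.0],
    'income': [20000.0, 30000.0, 40000.0, 50000.0],
})

toy_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['student']),
    remainder=StandardScaler(),
)

toy_out = toy_transformer.fit_transform(toy_X)
# Convert to a dense array in case a sparse matrix is returned.
if hasattr(toy_out, 'toarray'):
    toy_out = toy_out.toarray()
toy_out = np.asarray(toy_out, dtype=float)

print(toy_out.shape)              # one encoded column + two scaled columns
print(toy_out[:, 0])              # 1.0 where student == 'Yes'
print(toy_out[:, 1:].mean(axis=0))
```

The encoded column comes first in the output because the explicitly listed transformer is applied before the `remainder` columns are appended.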


[Back to top](#-Index)

### Problem 4

#### Pipeline with KNN and `n_neighbors = 5`

**10 Points**

Using the column `transformer` defined above, create a `Pipeline` named `fivepipe` with steps `transform` and `knn` that transforms your columns and then builds a KNN model using `KNeighborsClassifier()` with `n_neighbors = 5`.

Use the `fit` method to fit the pipe on the training data, and use the `.score` method of the fitted pipe to determine the accuracy on the test data. Assign this to `fivepipe_acc` below.

In [49]:
### GRADED

# YOUR CODE HERE
# Create the pipeline
fivepipe = Pipeline([
    ('transform', transformer),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Fit the pipeline on the training data
fivepipe.fit(X_train, y_train)

# Compute the accuracy on the test data
fivepipe_acc = fivepipe.score(X_test, y_test)

print("Accuracy on test data:", fivepipe_acc)

Accuracy on test data: 0.968
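
For a classifier pipeline, `.score` reports accuracy: the fraction of test rows whose predicted label equals the true label. The sketch below checks this on synthetic data; `make_classification`, the plain `StandardScaler` step, and all `demo`/`syn` names are assumptions chosen for illustration rather than part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the default dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=4, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

demo_pipe = Pipeline([
    ('transform', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
demo_pipe.fit(X_a, y_a)

# .score on a classifier pipeline is the mean of (prediction == truth).
score_acc = demo_pipe.score(X_b, y_b)
manual_acc = np.mean(demo_pipe.predict(X_b) == y_b)
print(score_acc, manual_acc)
```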


[Back to top](#-Index)

### Problem 5

#### Pipeline with `n_neighbors = 50`

**10 Points**

Using the column `transformer` defined above, create a `Pipeline` named `fiftypipe` with steps `transform` and `knn` that transforms your columns and then builds a KNN model using `KNeighborsClassifier()`. Build the KNN model with `n_neighbors = 50`.

Use the `fit` method to fit the pipe on the training data, and use the `.score` method of the fitted pipe to determine the accuracy on the test data. Assign this to `fiftypipe_acc` below.


In [53]:
### GRADED

# YOUR CODE HERE
# Create the pipeline
fiftypipe = Pipeline([
    ('transform', transformer),
    ('knn', KNeighborsClassifier(n_neighbors=50))
])

# Fit the pipeline on the training data
fiftypipe.fit(X_train, y_train)

# Compute the accuracy on the test data
fiftypipe_acc = fiftypipe.score(X_test, y_test)

print("Accuracy on test data:", fiftypipe_acc)

Accuracy on test data: 0.9712
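
When comparing several values of `k`, one option is to reuse a single pipeline definition and change only the `knn` step through `set_params`. The following is a sketch on synthetic data, not the graded solution; `base_pipe`, `scores`, and the synthetic arrays are hypothetical names, and a plain `StandardScaler` stands in for the column transformer.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the default dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=4, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

base_pipe = Pipeline([
    ('transform', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

scores = {}
for k in (5, 50):
    # clone() returns an unfitted copy; the "step__param" syntax reaches
    # into the 'knn' step to set its n_neighbors.
    pipe_k = clone(base_pipe).set_params(knn__n_neighbors=k)
    pipe_k.fit(X_a, y_a)
    scores[k] = pipe_k.score(X_b, y_b)

print(scores)
```

Cloning gives each fitted pipeline its own copy of the preprocessing step instead of sharing one object between models.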


[Back to top](#-Index)

### Problem 6

#### False Predictions

**10 Points**

Finally, compare the two pipelines by their total number of errors (FP + FN): the test observations whose predicted label does not match the true label, whether the model wrongly predicted a default or wrongly predicted no default. Assign these values as integers to `five_fp` and `fifty_fp` respectively.

(Hint: Count the predictions on `X_test` that are not equal to `y_test`.)

In [57]:
### GRADED

# YOUR CODE HERE
# Predictions made by your model
five_y_pred = fivepipe.predict(X_test)
fifty_y_pred = fiftypipe.predict(X_test)

# Calculate the confusion matrix
five_cm = confusion_matrix(y_test, five_y_pred)
fifty_cm = confusion_matrix(y_test, fifty_y_pred)

# Total errors: false positives (row 0, col 1) plus false negatives (row 1, col 0)
five_fp = int(five_cm[0, 1] + five_cm[1, 0])
fifty_fp = int(fifty_cm[0, 1] + fifty_cm[1, 0])

# Answer check
print(f'Number of False Predictions with five neighbors: {five_fp}')
print(f'Number of False Predictions with fifty neighbors: {fifty_fp}')

Number of False Predictions with five neighbors: 80
Number of False Predictions with fifty neighbors: 72
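
The hint and the confusion-matrix approach give the same count, since the off-diagonal cells of a binary confusion matrix are exactly the mismatched predictions. The sketch below checks this on a small hypothetical pair of label arrays (`y_true_demo`, `y_pred_demo`), which are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels; illustrative only.
y_true_demo = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'])
y_pred_demo = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No'])

# Approach from the hint: count predictions that differ from the truth.
n_wrong = int((y_pred_demo != y_true_demo).sum())

# Confusion-matrix approach: rows are true labels, columns are predictions,
# with labels in sorted order ('No', 'Yes').
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(n_wrong, fp + fn)
print(1 - n_wrong / len(y_true_demo))  # accuracy implied by the error count
```

The same relation ties the results above together: 80 errors on 2,500 test rows gives 1 - 80/2500 = 0.968, and 72 errors gives 1 - 72/2500 = 0.9712, matching the accuracies reported in Problems 4 and 5.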
