### Codio Assignment 12.1: Introduction to K Nearest Neighbors


**Expected Time: 60 Minutes**

**Total Points: 50**


This activity introduces the `KNeighborsClassifier` from scikit-learn.  You will build a few versions of the classifier with different values of $k$ and compare their performance.  You will also preprocess your data by scaling it, which can improve the performance of a distance-based classifier.
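To see why scaling matters for a distance-based model, here is a minimal sketch using five made-up customers (the values are illustrative and are not drawn from `default.csv`; the helper `nearest_to_first` is hypothetical). On the raw values, the income column dominates the Euclidean distance simply because its numbers are larger. After standardizing, both columns contribute on a comparable scale and the nearest neighbor of the first row changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers; columns are [balance, income].
points = np.array([
    [800.0, 30000.0],   # row 0: the query point
    [820.0, 38000.0],   # row 1: nearly the same balance, different income
    [2200.0, 30500.0],  # row 2: very different balance, similar income
    [100.0, 60000.0],   # row 3
    [1500.0, 12000.0],  # row 4
])


def nearest_to_first(arr):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    dists = np.linalg.norm(arr[1:] - arr[0], axis=1)
    return int(np.argmin(dists)) + 1


raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(points))

print(f"Nearest neighbor on raw values:    row {raw_nearest}")
print(f"Nearest neighbor on scaled values: row {scaled_nearest}")
```

On the raw values the nearest neighbor is row 2, because a 500-unit income gap looks small next to the other income gaps. After scaling it is row 1, whose balance is almost identical to the query.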

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [14]:
default = pd.read_csv("data/default.csv")

In [15]:
default.head()

| Unnamed: 0.1 | Unnamed: 0 | default | student | balance | income |
|---|---|---|---|---|---|
| 0 | 1 | No | No | 729.526495 | 44361.625074 |
| 1 | 2 | No | Yes | 817.180407 | 12106.1347 |
| 2 | 3 | No | No | 1073.549164 | 31767.138947 |
| 3 | 4 | No | No | 529.250605 | 35704.493935 |
| 4 | 5 | No | No | 785.655883 | 38463.495879 |



### Problem 1

#### Determine `X` and `y`

**5 Points**

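First, define `X` as the feature columns `student`, `balance`, and `income` (every column except `default` and the leftover `Unnamed: 0` index column) and `y` as `default` below.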

In [16]:
# Select the feature columns explicitly. default.drop(columns="default")
# would also keep the leftover "Unnamed: 0" index column from the CSV.
X = default[["student", "balance", "income"]]
y = default["default"]

# Answer check
print(X.head())
print("==============")
print(y.head())

  student      balance        income
0      No   729.526495  44361.625074
1     Yes   817.180407  12106.134700
2      No  1073.549164  31767.138947
3      No   529.250605  35704.493935
4      No   785.655883  38463.495879
==============
0    No
1    No
2    No
3    No
4    No
Name: default, dtype: object



### Problem 2

#### Create train/test split

**5 Points**

Next, create a train/test split with 25% of the data assigned to the test set.  Set `random_state = 42` to ensure correct grading.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25
)

# Answer check
print(X_train.shape)
print(X_test.shape)

(7500, 3)
(2500, 3)



### Problem 3

#### Column transformer for encoding `student` and scaling `['balance', 'income']`

**10 Points**

Create a column transformer to binarize the `student` column and apply a `StandardScaler` to the numeric features.  Be sure to set `drop = 'if_binary'` in your `OneHotEncoder`.  Assign your column transformer to `transformer` below.

[Documentation for OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
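As a small illustration of what `drop='if_binary'` does, the sketch below encodes a toy `student` column (a made-up four-row frame named `toy`, not the assignment data) with and without the option. Without it, a two-category feature yields two redundant indicator columns. With it, the first category is dropped and a single 0/1 column remains.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame for illustration only.
toy = pd.DataFrame({"student": ["No", "Yes", "No", "Yes"]})

# Default behavior: one indicator column per category ("No" and "Yes").
full = OneHotEncoder().fit_transform(toy).toarray()

# drop="if_binary": for two-category features, keep a single column.
binary = OneHotEncoder(drop="if_binary").fit_transform(toy).toarray()

print(full.shape)      # (4, 2)
print(binary.shape)    # (4, 1)
print(binary.ravel())  # [0. 1. 0. 1.] -> 1 marks "Yes"
```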

In [18]:
transformer = make_column_transformer(
    (
        OneHotEncoder(drop="if_binary"),
        ["student"],
    ),
    remainder=StandardScaler(),
)

# Answer check
print(transformer)

ColumnTransformer(remainder=StandardScaler(),
                  transformers=[('onehotencoder',
                                 OneHotEncoder(drop='if_binary'),
                                 ['student'])])



### Problem 4

#### Pipeline with KNN and `n_neighbors = 5`

**10 Points**

Using your column transformer defined above, create a `Pipeline` named `fivepipe` below with steps `transform` and `knn` that transforms your columns and then builds a KNN model with `n_neighbors = 5`.  Fit the pipeline on the training data and use the `.score` method of the fitted pipeline to determine the accuracy on the test data.  Assign this to `fivepipe_acc` below.

In [32]:
fivepipe = Pipeline(
    [
        ("transform", transformer),
        # n_neighbors defaults to 5; it is left implicit to match the Codio grader
        ("knn", KNeighborsClassifier()),
    ]
).fit(X_train, y_train)

fivepipe_acc = fivepipe.score(X_test, y_test)

# Answer check
fivepipe_acc

0.968

In [24]:
transformer_ = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["student"]), remainder=StandardScaler()
)
fivepipe_ = Pipeline([("transform", transformer_), ("knn", KNeighborsClassifier())])
fivepipe_.fit(X_train, y_train)
fivepipe_acc_ = fivepipe_.score(X_test, y_test)
fivepipe_acc_ == fivepipe_acc

True
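The pipelines above rely on `KNeighborsClassifier()` using five neighbors by default. A quick way to confirm this, sketched here with scikit-learn's standard `get_params` method (the variable names are illustrative), is to compare the default estimator with one built explicitly with `n_neighbors=5`:

```python
from sklearn.neighbors import KNeighborsClassifier

# Read the n_neighbors setting from a default and an explicit estimator.
default_k = KNeighborsClassifier().get_params()["n_neighbors"]
explicit_k = KNeighborsClassifier(n_neighbors=5).get_params()["n_neighbors"]

print(default_k, explicit_k)  # 5 5
```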


### Problem 5

#### Pipeline with `n_neighbors = 50`

**10 Points**

Using your column transformer defined above, create a `Pipeline` named `fiftypipe` below with steps `transform` and `knn` that transforms your columns and then builds a KNN model with `n_neighbors = 50`.  Fit the pipeline on the training data and use the `.score` method of the fitted pipeline to determine the accuracy on the test data.  Assign this to `fiftypipe_acc` below.

In [20]:
fiftypipe = Pipeline(
    [
        ("transform", transformer),
        ("knn", KNeighborsClassifier(n_neighbors=50)),
    ]
).fit(X_train, y_train)

fiftypipe_acc = fiftypipe.score(X_test, y_test)

# Answer check
print(fiftypipe_acc)

0.9712
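Problems 4 and 5 differ only in the value of `n_neighbors`, so the same comparison can be written as a loop. The sketch below is one possible way to do that. To stay self-contained it uses synthetic data from `make_classification` rather than `default.csv`, and every name in it (`X_demo`, `accuracies`, the step name `scale`) is illustrative, so the accuracies it prints will not match the assignment's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, weights=[0.9, 0.1], random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, random_state=42, test_size=0.25
)

accuracies = {}
for k in (1, 5, 50):
    # Scale first, then classify, mirroring the transform -> knn structure.
    pipe = Pipeline(
        [
            ("scale", StandardScaler()),
            ("knn", KNeighborsClassifier(n_neighbors=k)),
        ]
    )
    pipe.fit(Xtr, ytr)
    accuracies[k] = pipe.score(Xte, yte)

for k, acc in accuracies.items():
    print(f"k={k:>2}: test accuracy = {acc:.4f}")
```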



### Problem 6

#### False Predictions

**10 Points**

Finally, compare the two pipelines based on the number of **False Predictions**, that is, the test observations whose predicted label does not match the true label. Note that this counts every misclassification, not only observations incorrectly predicted to default. Assign these values as integers to `five_fp` and `fifty_fp` respectively.

(Hint: Add up the predictions of X_test that are not equal to y_test)

In [41]:
five_fp = ((fivepipe.predict(X_test) == "Yes") != (y_test == "Yes")).sum()
fifty_fp = ((fiftypipe.predict(X_test) == "Yes") != (y_test == "Yes")).sum()

# Answer check
print(f"Number of False Predictions with five neighbors: {five_fp}")
print(f"Number of False Predictions with fifty neighbors: {fifty_fp}")

Number of False Predictions with five neighbors: 80
Number of False Predictions with fifty neighbors: 72
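These counts cover every misclassification, which is why they equal (1 - accuracy) × 2500: 0.032 × 2500 = 80 and 0.0288 × 2500 = 72. To separate false positives from false negatives, one option is scikit-learn's `confusion_matrix`. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative arrays, not the pipelines' predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only.
y_true = np.array(["No", "No", "Yes", "No", "Yes", "No"])
y_pred = np.array(["No", "Yes", "No", "No", "Yes", "No"])

# Total misclassifications, as counted in this problem.
total_wrong = int((y_true != y_pred).sum())

# With labels=["No", "Yes"], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["No", "Yes"]).ravel()

print(f"total wrong: {total_wrong}")  # 2
print(f"false positives: {fp}")       # 1 (predicted Yes, actually No)
print(f"false negatives: {fn}")       # 1 (predicted No, actually Yes)
```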
