### Codio Activity 12.2: Identifying the Best K

This activity focuses on identifying the "best" number of neighbors, meaning the value that maximizes the accuracy of a `KNeighborsClassifier` estimator. The ideal number of neighbors will be selected through cross-validation and a grid search over the `n_neighbors` parameter. As before, you will want to scale the data in a `Pipeline` before building the model.

**Expected Time: 60 Minutes**

**Total Points: 50**

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

### The Dataset

Again, you will use the credit default dataset to predict default: yes or no. The data is loaded and split into train and test sets for you below. You will again build a column transformer to encode the `student` feature. Note that scikit-learn's `KNeighborsClassifier` handles a string target, so we do not need to encode the `default` column.

In [2]:
df = pd.read_csv("data/default.csv", index_col=0)

In [3]:
df.head(2)

| | default | student | balance | income |
|---|---|---|---|---|
| 1 | No | No | 729.526495 | 44361.625074 |
| 2 | No | Yes | 817.180407 | 12106.1347 |


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("default", axis=1), df["default"], random_state=42
)

[Back to top](#-Index)

### Problem 1

#### Baseline for Models

**5 Points**

Before starting the modeling process, you should have a baseline to determine whether your model is any good.  For your classifier, consider a baseline model as one in which you always guess the most frequently occurring class.  What would the accuracy of such a classifier be?  Enter your answer as a float to `baseline` below.

In [5]:
baseline = (df["default"] == df["default"].mode()[0]).sum() / len(df["default"])
print(baseline)

0.9667
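
As a cross-check on the idea of a majority-class baseline, here is a minimal sketch on made-up data (the `toy_target` and `toy_features` names and values are hypothetical, not the credit default data). It compares the share of the most frequent class with the accuracy of scikit-learn's `DummyClassifier` using `strategy="most_frequent"`.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy target: 29 "No" labels and 1 "Yes" label.
toy_target = pd.Series(["No"] * 29 + ["Yes"] * 1, name="default")
# A placeholder feature; the dummy model ignores feature values.
toy_features = pd.DataFrame({"balance": range(30)})

# Share of the most frequent class, computed directly from the labels.
majority_share = toy_target.value_counts(normalize=True).max()

# The same baseline expressed as an estimator that always predicts the mode.
dummy = DummyClassifier(strategy="most_frequent").fit(toy_features, toy_target)
dummy_acc = dummy.score(toy_features, toy_target)

print(majority_share, dummy_acc)
```

Both numbers agree because always guessing the mode is correct exactly as often as the mode occurs.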


[Back to top](#-Index)

### Problem 2

#### Column transforms and KNN

**10 Points**


Hopefully you are getting the hang of using the column transformers and estimators in scikit-learn. Below, use `make_column_transformer` to binarize `student` with a `OneHotEncoder` set to `drop = 'if_binary'`, and scale the remaining columns with a `StandardScaler`. Assign the result to `transformer` below.

Next, build a `Pipeline` named `knn_pipe` with named steps `transform` and `knn`. Be sure to leave all the settings in `knn` at their defaults.

In [6]:
transformer = make_column_transformer(
    (
        OneHotEncoder(drop="if_binary"),
        ["student"],
    ),
    remainder=StandardScaler(),
)

knn_pipe = Pipeline(
    [
        ("transform", transformer),
        ("knn", KNeighborsClassifier()),
    ]
)

# Answer check
knn_pipe
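
To see what a transformer of this shape produces, the following sketch applies the same construction to a small made-up frame (the `toy` data and the `toy_transformer` name are hypothetical). It assumes the binary `student` column collapses to a single 0/1 column and that the remaining columns come out standardized.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with the same column names as the activity data.
toy = pd.DataFrame(
    {
        "student": ["No", "Yes", "No", "Yes"],
        "balance": [700.0, 800.0, 900.0, 1000.0],
        "income": [40000.0, 12000.0, 30000.0, 18000.0],
    }
)

toy_transformer = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["student"]),
    remainder=StandardScaler(),
)

encoded = toy_transformer.fit_transform(toy)
# The result may be sparse or dense depending on density; force a dense array.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded.shape)                 # one student column plus two scaled columns
print(encoded[:, 0])                 # 0/1 indicator for student == "Yes"
print(encoded[:, 1:].mean(axis=0))   # scaled columns have mean close to 0
```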

[Back to top](#-Index)

### Problem 3

#### Parameter grid

**10 Points**

Now that your pipeline is ready, you are to construct a parameter grid to search over.  Consider two things:

- `n_neighbors` cannot exceed the number of samples the model was fit on; otherwise scikit-learn raises an error at prediction time. This sets an upper bound on `k`. In this example, too high a `k` will also slow down the computation, so only consider `k = [1, 3, 5, ..., 21]`.
- Ties in voting are decided somewhat arbitrarily, so for speed and clarity you should consider only odd values for the number of neighbors.
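
The upper bound on `k` can be illustrated with a minimal sketch on made-up data (`X_fit`, `y_fit`, and `raised` are hypothetical names). It assumes scikit-learn raises a `ValueError` when more neighbors are requested than there are fitted samples.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four fitted samples, but five neighbors requested.
X_fit = np.array([[0.0], [1.0], [2.0], [3.0]])
y_fit = np.array(["No", "No", "Yes", "Yes"])

raised = False
try:
    too_many = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
    too_many.predict(np.array([[1.5]]))
except ValueError as err:
    # scikit-learn reports that n_neighbors exceeds the fitted sample count.
    raised = True
    print(err)

# With k no larger than the number of fitted samples, prediction works.
ok = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)
print(ok.predict(np.array([[0.2]])))
```

During cross-validation each model is fit on only part of the training data, so the effective limit is the size of a training fold.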

Below, construct a parameter grid that searches your `knn` step in the pipeline for `n_neighbors` ranging from 1 to 21, including only odd values. Assign your dictionary to `params` below.

In [7]:
params = {"knn__n_neighbors": list(range(1, 23, 2))}
list(params.values())[0]

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]

[Back to top](#-Index)

### Problem 4

#### Grid search `k`

**10 Points**

Finally, use your pipeline and parameter grid to perform a grid search over the training data with `cv = 5`. Assign your grid as `knn_grid` below.  

Identify the optimal value of `n_neighbors` and assign as an integer to `best_k` below.  

Also, assign your best model's accuracy on the test data as a float to `best_acc`.

In [14]:
knn_grid = GridSearchCV(
    estimator=knn_pipe,
    param_grid=params,
    cv=5,
).fit(X_train, y_train)

best_k = knn_grid.best_params_["knn__n_neighbors"]
best_acc = knn_grid.score(X_test, y_test)

# Answer check
print(best_acc)
print(best_k)

0.9708
11
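
To look beyond the single best value, the cross-validated score for every `k` can be read from `cv_results_`. The sketch below is one way to do this on synthetic data from `make_classification` (the `demo_*` names and the data are hypothetical stand-ins, not the credit default data).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
demo_params = {"knn__n_neighbors": list(range(1, 23, 2))}
demo_grid = GridSearchCV(demo_pipe, param_grid=demo_params, cv=5).fit(X_demo, y_demo)

# One row per candidate k, with the mean and spread of the fold accuracies.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[["param_knn__n_neighbors", "mean_test_score", "std_test_score"]]
print(summary.sort_values("mean_test_score", ascending=False).head())
```

Comparing `mean_test_score` against `std_test_score` gives a sense of whether the winning `k` is clearly better or within noise of its neighbors.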


[Back to top](#-Index)

### Problem 5

#### Other parameters to consider

**10 Points**

The number of neighbors is not the only parameter in the scikit-learn implementation. For example, you can also weight points by their distance, change the distance metric, and search over different powers of a metric such as Minkowski. See the excerpt from the `KNeighborsClassifier` docstring below.

```
weights : {'uniform', 'distance'} or callable, default='uniform'
    Weight function used in prediction.  Possible values:

p : int, default=2
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
    
```
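
As a worked example of the `p` parameter, the sketch below computes the Minkowski distance between two points by hand with NumPy. The `minkowski_distance` helper here is a hypothetical illustration written for this example, not the function scikit-learn uses internally.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Return (sum_i |a_i - b_i|^p)^(1/p) for two equal-length points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(9 + 16) = 5

print(manhattan, euclidean)
```

The same pair of points is 7 apart under `p = 1` and 5 apart under `p = 2`, so changing `p` can change which training points count as nearest.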

Create a new parameter grid and consider both weightings as well as `p = [1, 2]`.  Assign this as a dictionary to `params2` below.  

Search over these parameters in your `knn_pipe` with a `GridSearchCV` named `weight_grid` below. Also, consider `n_neighbors` as in [Problem 4](#-Problem-4). Did your new grid search perform better than the earlier one? Assign this grid's accuracy on the test data to `weights_acc` below.

In [11]:
params2 = {
    "knn__n_neighbors": list(range(1, 23, 2)),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

weight_grid = GridSearchCV(
    estimator=knn_pipe,
    param_grid=params2,
    cv=5,
).fit(X_train, y_train)

weights_acc = weight_grid.score(X_test, y_test)

# Answer check
print(weights_acc)

0.9708


In [12]:
weight_grid.best_params_

{'knn__n_neighbors': 11, 'knn__p': 2, 'knn__weights': 'uniform'}

[Back to top](#-Index)

### Problem 6

#### Further considerations

**5 Points**

When performing your grid search, you also want to be sensitive to the number of parameters you are searching and the number of different models being built. How many models were constructed in [Problem 5](#-Problem-5)? Enter your answer as an integer to `ans6` below. You might use the grid's `.cv_results_` attribute to determine this.

In [13]:
# cv_results_ holds one entry per parameter combination; each is fit once per fold (cv=5)
ans6 = len(weight_grid.cv_results_["std_test_score"]) * 5

# Answer check
print(ans6)

220
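
The same count can be reached without fitting anything. The sketch below assumes a grid identical to the one in Problem 5 (rebuilt here as the hypothetical `grid_spec`) and uses scikit-learn's `ParameterGrid` to enumerate the combinations.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as Problem 5, rebuilt here so the example is self-contained.
grid_spec = {
    "knn__n_neighbors": list(range(1, 23, 2)),  # 11 values
    "knn__weights": ["uniform", "distance"],    # 2 values
    "knn__p": [1, 2],                           # 2 values
}

n_combinations = len(ParameterGrid(grid_spec))  # 11 * 2 * 2
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best combination once more on the full training data, which is not included in this count.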
