### Codio Assignment 12.1: Introduction to K Nearest Neighbors

This activity is meant to introduce you to the `KNeighborsClassifier` from scikit-learn.  You will build a few different versions changing values for `k` and examining performance.  You will also preprocess your data by scaling so as to improve the performance of your classifier. 

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [4]:
default = pd.read_csv('codio_12_1_solution/data/default.csv')

In [5]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  10000 non-null  int64  
 1   default     10000 non-null  object 
 2   student     10000 non-null  object 
 3   balance     10000 non-null  float64
 4   income      10000 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 390.8+ KB


In [6]:
default.head()

Unnamed: 0.1,Unnamed: 0,default,student,balance,income
0,1,No,No,729.526495,44361.625074
1,2,No,Yes,817.180407,12106.1347
2,3,No,No,1073.549164,31767.138947
3,4,No,No,529.250605,35704.493935
4,5,No,No,785.655883,38463.495879


### Problem 1

#### Determine `X` and `y`

Define `X` as all columns except for `default` and `y` as `default` below.

In [7]:
X = default[['student','balance','income']]
X

Unnamed: 0,student,balance,income
0,No,729.526495,44361.625074
1,Yes,817.180407,12106.134700
2,No,1073.549164,31767.138947
3,No,529.250605,35704.493935
4,No,785.655883,38463.495879
...,...,...,...
9995,No,711.555020,52992.378914
9996,No,757.962918,19660.721768
9997,No,845.411989,58636.156984
9998,No,1569.009053,36669.112365


In [8]:
y = default['default']
y

0       No
1       No
2       No
3       No
4       No
        ..
9995    No
9996    No
9997    No
9998    No
9999    No
Name: default, Length: 10000, dtype: object

### Problem 2

#### Create train/test split

Use the `train_test_split` function to split `X` and `y` into training and test sets, with 25% of the data assigned to the test set.  Set `random_state = 42` to ensure correct grading.

In [11]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.25,random_state = 42)
print(X_train.shape)
X_test.shape

(7500, 3)


(2500, 3)

### Problem 3

#### Column transformer for encoding `student` and scaling `['balance', 'income']`

Use the `make_column_transformer` to create a column transformer. Inside the `make_column_transformer` specify an instance of the `OneHotEncoder` transformer from scikit-learn. Inside `OneHotEncoder` set `drop` equal to `'if_binary'`. Apply this transformation to the `student` column. On the `remainder` columns, apply a `StandardScaler()` transformation.

 Assign your column transformer to `transformer` below.

[Documentation for `make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)

In [12]:
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'),['student']),  
                                     remainder = StandardScaler())
transformer
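
To make the effect of this transformer concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` data and the name `toy_transformer` are made up for illustration and are not part of the assignment). It assumes the same column names as the default data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that reuse the column names of the default data.
toy = pd.DataFrame({
    "student": ["No", "Yes", "No", "Yes"],
    "balance": [500.0, 1500.0, 1000.0, 1000.0],
    "income": [20000.0, 40000.0, 30000.0, 30000.0],
})

toy_transformer = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["student"]),
    remainder=StandardScaler(),
)
toy_array = toy_transformer.fit_transform(toy)

# Column 0 is a single 0/1 indicator for `student`;
# columns 1 and 2 are `balance` and `income` after standardization.
print(toy_array.shape)
print(toy_array[:, 0])
print(toy_array[:, 1:].mean(axis=0))
```

With `drop='if_binary'`, a two-category column becomes one indicator column rather than two, and the scaled columns end up with mean 0, so no single feature dominates the distance computation in KNN.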

### Problem 4

#### Pipeline with KNN and `n_neighbors = 5`

Using your column `transformer` defined above, create a `Pipeline` named `fivepipe` below with steps `transform` and `knn` that transform your columns and subsequently build a KNN model using `KNeighborsClassifier()`.  

Use the `fit` function to fit the pipe on the training data and use the `.score` method of the fit pipe to determine the accuracy on the test data.  Assign this to `fivepipe_acc` below.

In [13]:
fivepipe = Pipeline([('transform',transformer),
                    ('knn',KNeighborsClassifier())])
fivepipe

In [14]:
fivepipe.fit(X_train,y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [15]:
fivepipe_acc = fivepipe.score(X_test,y_test)
fivepipe_acc

0.968

### Problem 5

#### Pipeline with `n_neighbors = 50`

Using your column `transformer` defined above, create a `Pipeline` named `fiftypipe` below with steps `transform` and `knn` that transform your columns and subsequently build a KNN model using `KNeighborsClassifier()`. Build the KNN model with `n_neighbors = 50`

Use the `fit` function to fit the pipe on the training data and use the `.score` method of the fit pipe to determine the accuracy on the test data.  Assign this to `fiftypipe_acc` below.

In [17]:
fiftypipe = Pipeline([('transform',transformer),
                     ('knn',KNeighborsClassifier(n_neighbors = 50))])
fiftypipe

In [18]:
fiftypipe.fit(X_train,y_train)




In [20]:
fiftypipe_acc = fiftypipe.score(X_test,y_test)
fiftypipe_acc

0.9712

### Problem 6

#### False Predictions

Finally, compare the two pipelines based on the total number of errors (FP + FN), that is, the test observations whose predicted label differs from the true label. Assign these values as integers to `five_fp` and `fifty_fp` respectively.   

(Hint: Add up the predictions of X_test that are not equal to y_test)

In [21]:
five_fp = sum(fivepipe.predict(X_test) != y_test)
five_fp

80

In [22]:
fifty_fp = sum(fiftypipe.predict(X_test) != y_test)
fifty_fp

72
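
The counts above lump false positives and false negatives together. As a rough sketch of how the two error types could be separated, the example below uses `confusion_matrix` on hypothetical label arrays (`y_true_toy` and `y_pred_toy` are made-up names and values, not the assignment data).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true_toy = np.array(["No", "No", "Yes", "Yes", "No", "Yes"])
y_pred_toy = np.array(["No", "Yes", "No", "Yes", "No", "No"])

# Total errors, computed the same way as in the cells above.
total_errors = int((y_pred_toy != y_true_toy).sum())

# With labels=["No", "Yes"], "Yes" is treated as the positive class,
# and ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(
    y_true_toy, y_pred_toy, labels=["No", "Yes"]
).ravel()

print(total_errors, fp, fn)
```

Here `fp + fn` equals `total_errors`, which is the quantity stored in `five_fp` and `fifty_fp`.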

### Codio Activity 12.2: Identifying the Best K

This activity focuses on identifying the "best" number of neighbors, the one that optimizes the accuracy of a `KNeighborsClassifier` estimator. The ideal number of neighbors will be selected through cross validation and a grid search over the `n_neighbors` parameter.  Again, prior to building the model you will want to scale the data in a `Pipeline`.

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)


In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

### The Dataset

Again, you will use the credit default dataset to predict default -- yes or no.  The data is loaded and split into train and test sets for you below.  You will again build a column transformer to encode the `student` feature.  Note that scikit-learn's `KNeighborsClassifier` accepts string class labels as the target, so you do not need to encode this column.

In [29]:
df = pd.read_csv('codio_12_2_solution/data/default.csv', index_col=0)

In [30]:
df.head()

Unnamed: 0,default,student,balance,income
1,No,No,729.526495,44361.625074
2,No,Yes,817.180407,12106.1347
3,No,No,1073.549164,31767.138947
4,No,No,529.250605,35704.493935
5,No,No,785.655883,38463.495879


In [31]:
X = df.drop('default', axis = 1)
y = df['default']

In [32]:
X_train,X_test, y_train, y_test = train_test_split(X,y,random_state = 42)
X_train.head()

Unnamed: 0,student,balance,income
4902,Yes,465.583629,15625.633529
4376,No,357.996305,30217.021287
6699,Yes,1230.714628,18581.274613
9806,No,1260.154869,35733.465854
1102,No,850.548099,44501.915038


### Problem 1

#### Baseline for Models

Before starting the modeling process, you should have a baseline to determine whether your model is any good. 

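Consider the `default` column of `df`. Perform a `value_counts` operation with the argument `normalize` equal to `True` to see the proportion of each class. 

A simple baseline is a classifier that always predicts the majority class. What would the accuracy of such a classifier be?  Enter your answer as a float to `baseline` below.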

In [34]:
# normalize=True returns class proportions instead of raw counts;
# .iloc[0] selects the largest proportion by position
baseline = df['default'].value_counts(normalize = True).iloc[0]
baseline


np.float64(0.9667)
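
A majority-class baseline can also be expressed as an estimator. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels (nine `'No'` and one `'Yes'`; `X_toy` and `y_toy` are made up for illustration) to show that its accuracy equals the majority-class proportion.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 90% "No", 10% "Yes".
y_toy = np.array(["No"] * 9 + ["Yes"])
# The features are ignored by this strategy, so a placeholder column suffices.
X_toy = np.zeros((len(y_toy), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_acc = dummy.score(X_toy, y_toy)
print(dummy_acc)
```

Any model worth keeping should beat this number, which is why a high raw accuracy on imbalanced data is not impressive on its own.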

### Problem 2

#### Column transforms and KNN

Use the `make_column_transformer` to create a column `transformer`. Inside the `make_column_transformer` specify an instance of the `OneHotEncoder` transformer from scikit-learn. Inside `OneHotEncoder` set `drop` equal to `'if_binary'`. Apply this transformation to the `student` column. On the `remainder` columns, apply a `StandardScaler()` transformation.
 

Next, build a `Pipeline` named `knn_pipe` with  steps `transform` and `knn`. Set `transform` equal to `transformer` and `knn` equal to `KNeighborsClassifier()`. Be sure to leave all the settings in `knn` to default.  

In [36]:
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'),['student']),
                                      remainder = StandardScaler())
transformer

In [37]:
knn_pipe = Pipeline([('transform',transformer),
                    ('knn',KNeighborsClassifier())])
knn_pipe

In [38]:
list(knn_pipe.named_steps.keys())

['transform', 'knn']

### Problem 3

#### Parameter grid

Now that your pipeline is ready, you are to construct a parameter grid to search over.  Consider two things:

- scikit-learn cannot make predictions when `n_neighbors` exceeds the number of samples the model was fit on, which puts an upper bound on `k`.  In this example, too high a `k` will also slow down the computation, so only consider `k = [1, 3, 5, ..., 21]`. 
- Ties in voting are decided somewhat arbitrarily, so for speed and clarity you should consider only odd values for the number of neighbors.

Create a dictionary called `params` that specifies hyperparameters for the KNN classifier. 

- The key of your dictionary will be `knn__n_neighbors`
- The values in your dictionary will be `list(range(1, 22, 2))`


In [39]:
params = {'knn__n_neighbors': list(range(1,22,2))}
params

{'knn__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
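
The key `knn__n_neighbors` follows the `<step name>__<parameter>` convention that scikit-learn pipelines use. The following sketch, built on a hypothetical two-step pipeline named `demo_pipe` (not the assignment's `knn_pipe`), lists the addressable KNN parameters and sets one of them by its double-underscore name.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline used only to illustrate parameter naming.
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Every parameter of the "knn" step is exposed as "knn__<parameter>".
knn_keys = sorted(k for k in demo_pipe.get_params() if k.startswith("knn__"))
print(knn_keys)

# Grid search sets parameters the same way that set_params does here.
demo_pipe.set_params(knn__n_neighbors=7)
print(demo_pipe.named_steps["knn"].n_neighbors)
```

If the step name in the pipeline and the prefix in the grid do not match exactly (including case), the grid search raises an error about an invalid parameter.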

### Problem 4

#### Grid search `k`

- Use `GridSearchCV` with the `knn_pipe` and `param_grid` equal to `params`. Assign the result to `knn_grid`.
- Use the `fit` function on `knn_grid` to train your model on `X_train` and `y_train`.
- Retrieve the best value for the hyperparameter `k` from the `best_params_` attribute of the grid search object `knn_grid`. Assign the result to `best_k`.
- Use the `score` function to calculate the accuracy of the `knn_grid` classifier on the test dataset. Assign your best model's accuracy on the test data as a float to `best_acc`.

In [40]:
knn_grid = GridSearchCV(knn_pipe, param_grid=params)
knn_grid

In [41]:
knn_grid.fit(X_train, y_train)




In [42]:
best_k = knn_grid.best_params_['knn__n_neighbors']
best_k

11

In [43]:
best_acc = knn_grid.score(X_test, y_test)
best_acc

0.9708
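
Beyond the single best value, it can help to look at how every candidate `k` scored during cross validation. The sketch below runs a small grid search on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the default dataset) and tabulates `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only.
X_syn, y_syn = make_classification(n_samples=300, n_features=4, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [1, 5, 9]})
grid.fit(X_syn, y_syn)

# One row per candidate: the value tried, its mean CV accuracy, and its rank.
results = pd.DataFrame(grid.cv_results_)[
    ["param_knn__n_neighbors", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

The row with `rank_test_score == 1` corresponds to `best_params_`, and its `mean_test_score` is the cross-validated accuracy, which is a different quantity from the test-set accuracy returned by `score`.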

### Problem 5

#### Other parameters to consider

The number of neighbors is not the only parameter in the implementation from scikit-learn.  For example, you can also consider different weightings of points based on their distance, change the distance metric, and search over alternative versions of certain metrics like Minkowski.  See the docstring from `KNeighborsClassifier` below. 

```
weights : {'uniform', 'distance'} or callable, default='uniform'
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.
      
===========================

p : int, default=2
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
    
```
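
To illustrate the two parameters in the docstring, here is a small NumPy sketch. The helper names `minkowski` and `vote`, and the points, distances, and labels, are hypothetical and only mimic the idea; they are not scikit-learn's internal implementation.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two points for a given power p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
manhattan = minkowski([0, 0], [3, 4], p=1)
euclidean = minkowski([0, 0], [3, 4], p=2)
print(manhattan, euclidean)

def vote(labels, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[str(label)] = totals.get(str(label), 0.0) + float(w)
    return max(totals, key=totals.get)

# Hypothetical neighbors of one query point: one close "Yes", two farther "No".
distances = np.array([1.0, 2.0, 4.0])
labels = np.array(["Yes", "No", "No"])

uniform_winner = vote(labels, np.ones_like(distances))  # each neighbor counts once
distance_winner = vote(labels, 1.0 / distances)         # closer neighbors count more
print(uniform_winner, distance_winner)
```

With uniform weights the two `'No'` neighbors outvote the single `'Yes'`, while inverse-distance weights let the closest neighbor win, which is why the choice of `weights` can change predictions even when `n_neighbors` stays fixed.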

Create a new parameter grid and consider both weightings as well as `p = [1, 2]`.  Assign this as a dictionary to `params2` below.  

Search over these parameters in your `knn_pipe` with a `GridSearchCV` named `weight_grid` below. Also, consider `n_neighbors` as in [Problem 4](#Problem-4).  Did your new grid search perform better than the earlier one?  Assign this grid's accuracy to `weights_acc` below.

In [48]:
params2 = {'knn__n_neighbors': list(range(1,22,2)),
           'knn__weights':['uniform','distance'],
           'knn__p': [1,2]
    }
params2

{'knn__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
 'knn__weights': ['uniform', 'distance'],
 'knn__p': [1, 2]}

In [49]:
weight_grid = GridSearchCV(knn_pipe, param_grid = params2)
weight_grid

In [50]:
weight_grid.fit(X_train,y_train)





In [51]:
weights_acc = weight_grid.score(X_test, y_test)
weights_acc

0.9708

### Problem 6

#### Further considerations

When performing your grid search you also want to be mindful of the number of parameter values you are searching over and the number of different models being built.  How many models were constructed in [Problem 5](#Problem-5)?  Enter your answer as an integer to `ans6` below.  You might use the grid's `.cv_results_` attribute to determine this.

In [53]:
weight_grid.cv_results_

{'mean_fit_time': array([0.01253672, 0.005621  , 0.00555463, 0.00554333, 0.00553603,
        0.00562282, 0.00556707, 0.00559659, 0.00554533, 0.0055696 ,
        0.00577717, 0.00554457, 0.00557604, 0.00558023, 0.00559068,
        0.00563297, 0.0055891 , 0.00558114, 0.00561285, 0.00559106,
        0.00562096, 0.00562463, 0.00559492, 0.00558548, 0.00564656,
        0.00559034, 0.00558896, 0.00558529, 0.005584  , 0.00562563,
        0.00558562, 0.00558705, 0.00558152, 0.0055882 , 0.0056026 ,
        0.00564437, 0.00559716, 0.00558591, 0.00560083, 0.00559101,
        0.00565014, 0.00562634, 0.00560622, 0.00560064]),
 'std_fit_time': array([1.25018987e-02, 1.02846645e-04, 5.88003056e-05, 6.59303204e-05,
        7.77863625e-05, 9.27016567e-05, 5.90206144e-05, 3.36288201e-05,
        5.89275815e-05, 1.32193159e-05, 3.85522161e-04, 1.51995163e-05,
        2.56113495e-05, 1.70641092e-05, 2.04745902e-05, 8.16481960e-05,
        1.67053690e-05, 1.21504620e-05, 3.65049911e-05, 2.19099280e-05,
     

In [56]:
# Multiply the number of parameter combinations (11 * 2 * 2 = 44)
# by the number of cross-validation folds (5, the GridSearchCV default).
ans6 = 44*5

In [55]:
ans6

220
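
The same count can be checked without reading through `cv_results_`. This sketch uses `ParameterGrid` to enumerate the combinations; the fold count of 5 is an assumption that matches `GridSearchCV`'s default when no `cv` argument is passed.

```python
from sklearn.model_selection import ParameterGrid

params2 = {
    "knn__n_neighbors": list(range(1, 22, 2)),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

# Number of distinct hyperparameter combinations in the grid.
n_combinations = len(ParameterGrid(params2))

# Assumed number of folds: the GridSearchCV default when cv is not specified.
n_folds = 5

n_cv_fits = n_combinations * n_folds
print(n_combinations, n_cv_fits)
```

Note that with the default `refit=True`, `GridSearchCV` also fits the best combination once more on the full training set, so the total number of fits is one higher than the cross-validation count.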

### Codio Activity 12.3: Decision Boundaries 

This activity focuses on the effect of changing your decision threshold on the resulting predictions.  Again, you will use the `KNeighborsClassifier`, but this time you will explore the `predict_proba` method of the fitted estimator to change the thresholds for classifying observations.  You will explore how changing the decision threshold affects the false negative rate of the classifier on the credit default data.  Here, we suppose the most important thing is to avoid predicting that somebody will not default when they actually do.  

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn import set_config

set_config(display="diagram")

### The Dataset

You continue to use the default example, and the data is again loaded and split for you below. 

In [3]:
default = pd.read_csv('codio_12_3_solution/data/default.csv')

In [4]:
default.head()

Unnamed: 0.1,Unnamed: 0,default,student,balance,income
0,1,No,No,729.526495,44361.625074
1,2,No,Yes,817.180407,12106.1347
2,3,No,No,1073.549164,31767.138947
3,4,No,No,529.250605,35704.493935
4,5,No,No,785.655883,38463.495879


In [5]:
X_train, X_test, y_train, y_test = train_test_split(default.drop('default', axis = 1), 
                                                    default['default'],
                                                   random_state=42)

In [8]:
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'),['student']),
                                   remainder = StandardScaler())

### Problem 1

#### Basic Pipeline

Use the `Pipeline` class to create a pipeline `base_pipe` with steps `transformer` and `knn`. Assign `transformer` to `'transformer'` and assign a `KNeighborsClassifier()` with `n_neighbors = 10` to `'knn'`. 

In [9]:
base_pipe = Pipeline([('transformer', transformer),
                      ('knn', KNeighborsClassifier(n_neighbors = 10))])
base_pipe

In [10]:
names = list(base_pipe.named_steps.keys())
names

['transformer', 'knn']

### Problem 2

#### Accuracy of KNN with 50% probability boundary

- Use the `fit` function to train `base_pipe` on `X_train` and `y_train`.
- Use the `score` function to calculate the accuracy of `base_pipe` on the test set. Assign the result to `base_acc`.
- Use the `predict` function on `base_pipe` to make predictions on `X_test`. Assign the result to `preds`.
- Initialize the `base_fn` variable to `0`.
- Use a `for` loop to loop over `zip(preds, y_test)`. Inside the `for` loop:
    - Use an `if` block to count the false negatives, that is, the observations predicted as `'No'` whose true label is `'Yes'`. Add each one to `base_fn`.  

In [12]:
base_pipe.fit(X_train,y_train)




In [13]:
base_acc = base_pipe.score(X_test,y_test)
base_acc

0.9712

In [14]:
preds = base_pipe.predict(X_test)
preds

array(['No', 'No', 'No', ..., 'No', 'No', 'No'],
      shape=(2500,), dtype=object)

In [16]:
base_fn = 0
for i,j in zip(preds,y_test):
    if i == 'No':
        if j == 'Yes':
            base_fn += 1
print(base_fn)

65
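
The loop above can also be written as a vectorized comparison. The sketch below uses hypothetical arrays (`preds_toy` and `y_toy` are made up for illustration); with the real data, `y_test` is a pandas Series, so converting it with `y_test.to_numpy()` first keeps the comparison purely positional.

```python
import numpy as np

# Hypothetical predictions and true labels, for illustration only.
preds_toy = np.array(["No", "No", "Yes", "No", "Yes", "No"])
y_toy = np.array(["No", "Yes", "Yes", "Yes", "No", "No"])

# A false negative is a predicted "No" whose true label is "Yes".
fn_mask = (preds_toy == "No") & (y_toy == "Yes")
fn_count = int(fn_mask.sum())
print(fn_count)
```

The boolean mask can also be used to index the test rows, which makes it easy to inspect exactly which observations were missed.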


### Problem 3

#### Prediction probabilities

As demonstrated in Video 12.5, your fit estimator has a `predict_proba` method that will output a probability for each observation.  


Use the `predict_proba` function on `base_pipe` to predict the probabilities on `X_test`. Assign the predicted probabilities as an array using the test data to `base_probs` below. 

In [17]:
base_probs = base_pipe.predict_proba(X_test)
base_probs

array([[1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       ...,
       [1. , 0. ],
       [1. , 0. ],
       [0.9, 0.1]], shape=(2500, 2))
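
The columns of the `predict_proba` output follow the order of the estimator's `classes_` attribute, and with uniform weights each entry is the fraction of the `k` nearest neighbors carrying that label. A minimal sketch on a hypothetical one-dimensional dataset (all names and values below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training data: four "No" points and two "Yes" points.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array(["No", "No", "No", "No", "Yes", "Yes"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# classes_ gives the column order of predict_proba.
print(knn.classes_)

# Query 2.5 has three "No" neighbors; query 9.0 has two "Yes" and one "No".
probs = knn.predict_proba(np.array([[2.5], [9.0]]))
print(probs)
```

In the assignment's pipeline the final step has `n_neighbors = 10`, which is why the probabilities in `base_probs` appear in steps of 0.1, and column 0 corresponds to `'No'` because the class labels are sorted.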

### Problem 4

#### A Stricter `default` estimation

As discussed in the previous assignment, if you aim to minimize the number of predictions that miss actual defaults, you may consider raising the probability threshold required to predict 'No'.  Accordingly, use your probabilities from the last problem to predict 'No' only if the probability of this label is higher than 70%.  Assign your new predictions as an array to `strict_preds`.  Determine the number of false negative predictions here and assign it to `strict_fn` below.  

In [18]:
strict_preds = np.where(base_probs[:, 0] > .7, 'No', 'Yes')
strict_fn = 0
for i, j in zip(strict_preds, y_test):
    if i == 'No':
        if j == 'Yes':
            strict_fn += 1


In [19]:
strict_fn

44

In [24]:
# zip(a, b) yields tuples that pair elements by position: (a[0], b[0]), (a[1], b[1]), ...
for i in zip(strict_preds, y_test):
    print(i)

(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('Yes'), 'Yes')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'Yes')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'Yes')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('Yes'), 'Yes')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('Yes'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
(np.str_('No'), 'No')
...

### Problem 5

#### Minimizing False Negatives

Consider a 50%, 70%, and 90% decision boundary for predicting "No".  Which of these minimizes the number of false negatives?  Assign your solution as an integer -- 50, 70, or 90 -- to `ans5` below.


In [25]:
stricter_preds = np.where(base_probs[:,0] > 0.9, 'No','Yes')
stricter_fn = 0
for i,j in zip(stricter_preds, y_test):
    if i == 'No':
        if j == 'Yes':
            stricter_fn += 1
            

In [26]:
stricter_fn

22

In [27]:
ans5 = 90
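
Raising the threshold for predicting `'No'` reduces false negatives, but it typically does so at the cost of more false positives. The following sketch sweeps the three thresholds over hypothetical probabilities and labels (`prob_no` and `y_true_toy` are made-up values, not the output of `base_pipe`) and records both error types.

```python
import numpy as np

# Hypothetical P("No") values and true labels, for illustration only.
prob_no = np.array([1.0, 0.9, 0.8, 0.6, 1.0, 0.7, 0.95, 0.5])
y_true_toy = np.array(["No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"])

errors_by_threshold = {}
for threshold in (0.5, 0.7, 0.9):
    # Predict "No" only when its probability strictly exceeds the threshold.
    preds = np.where(prob_no > threshold, "No", "Yes")
    fn = int(((preds == "No") & (y_true_toy == "Yes")).sum())
    fp = int(((preds == "Yes") & (y_true_toy == "No")).sum())
    errors_by_threshold[threshold] = (fn, fp)

for threshold, (fn, fp) in errors_by_threshold.items():
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

On this toy data the false negatives fall from 3 to 2 to 1 as the threshold rises, while the false positives rise from 0 to 1 to 2, so the right threshold depends on which mistake is more costly.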

### Problem 6

#### Visualizing decision boundaries

For this exercise, a visualization of the decision boundary using a synthetic dataset is created and plotted below.  Which of these would you choose for minimizing the number of false negatives?  Enter your choice as an integer -- 1, 20, or 50 -- to `ans6` below.

<center>
    <img src = images/dbounds.png />
</center>

In [28]:
ans6 = 1