# <font color='#31394d'>Practical Exercise: k-Nearest Neighbours</font>

In this notebook, we are going to train a k-nearest neighbours model using the [`scikit-learn`](https://scikit-learn.org) library. k-nearest neighbours is a *supervised learning* technique that is suitable for both *regression* (a continuous/numerical outcome) and *classification* (a categorical outcome). In this notebook, we will use it for regression.  

We begin by importing modules for data wrangling:

<!-- 
Even though its name is scikit-learn, it is imported as `sklearn`. It has many submodules.
For example, the `datasets` submodule has a group of simple datasets that can be used to evaluate models without having to use external files.

The Boston Housing dataset is available as a scikitlearn dataset.-->

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

## <font color='#31394d'> Get and Explore the Data </font>

We'll be using the [Boston Housing](https://www.kaggle.com/c/boston-housing) dataset from Kaggle. This dataset consists of information about houses in the Boston area. Our goal is to **predict the typical price of a house**.

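We import the data from the `datasets` submodule of `sklearn` as follows. Note that `load_boston` was removed in scikit-learn 1.2, so this cell requires an earlier version: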

In [None]:
from sklearn import datasets
boston = datasets.load_boston()

`sklearn` datasets behave like a dictionary. Let's see what this dictionary contains:

In [None]:
boston.keys()

The `DESCR` key includes a description of the dataset:

In [None]:
print(boston["DESCR"])

The `target` key holds the target/outcome variable; in this case, the median house value in thousands of dollars.

In [None]:
boston["target"]

The names of the features/independent variables are stored under the `feature_names` key:

In [None]:
boston["feature_names"]

Finally, the values of the features are stored as a numpy array under the `data` key:

In [None]:
boston["data"]

Let's put the Boston data into a pandas dataframe:

In [None]:
df = pd.DataFrame(boston["data"], columns=boston["feature_names"])

df["PRICE"] = boston["target"]

df.head()

## <font color='#31394d'> Training a KNN Model </font>

The algorithms for fitting k-nearest neighbours models are in the `neighbors` submodule of `sklearn`. There are `KNeighborsRegressor` and `KNeighborsClassifier` classes for KNN regression and classification, respectively. Since we're dealing with a continuous outcome variable (`PRICE`), we'll import the `KNeighborsRegressor` class and create (instantiate) an *estimator* object. Note that this is the standard procedure for any machine learning algorithm available in `sklearn`.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

The object that we have called `knn` is our as yet untrained machine learning model. After training, it will be updated to contain all the information that is needed to make predictions on new data. Since we did not specify anything between the brackets in `KNeighborsRegressor()`, the object will be instantiated with the default parameters. It is usually a good idea to inspect these defaults (so that you understand the specifics of the model you are fitting) and to change them if needs be:

In [None]:
?knn

🚀 <font color='#d9c4b1'> Exercise: </font> What is the default value of k?
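To make concrete what k controls, here is a minimal from-scratch sketch of KNN regression. It assumes Euclidean distance and an unweighted average of the neighbours' targets; the function `knn_predict` and the toy arrays are hypothetical names for illustration, not part of `sklearn`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict each new row as the mean target of its k closest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    # Pairwise Euclidean distances, shape (n_new, n_train)
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k smallest distances for each new row
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Average the targets of those neighbours
    return y_train[nearest].mean(axis=1)

# Toy data (hypothetical): one feature, target equal to the feature
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

# The three closest training points to 1.1 are 1.0, 2.0 and 0.0
toy_pred = knn_predict(X_toy, y_toy, np.array([[1.1]]), k=3)
print(toy_pred)
```

Nothing is "learned" in the usual sense: the training data is stored, and all the work happens at prediction time.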

Let's fit the model to our Boston dataset using the `knn` object's `fit` method:

In [None]:
knn.fit(X=df.iloc[:,:-1], y=df.PRICE)

The `knn` object has now been updated so it is ready to make predictions:

In [None]:
y_hat = knn.predict(X=df.iloc[:,:-1])
y_hat

Let's see how our predicted prices look relative to the true prices:

In [None]:
sns.scatterplot(x=df.PRICE, y=y_hat)

## <font color='#31394d'> Model Evaluation </font>

The data that was used to train a model should never be used to evaluate its performance. Instead, we should evaluate our model on new, unseen data. We can do this in one of two ways: (1) split the data into training and test sets or (2) use cross-validation. Let's see how we would implement these.
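A small sketch of why training error is misleading, using k=1 as the extreme case. The data here is synthetic (generated with `make_regression` as a stand-in, which is an assumption and not the Boston data). With one neighbour, every training point is its own nearest neighbour, so the training error looks perfect while the error on held-out data does not.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the Boston dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# With k=1 the model simply memorises the training set
one_nn = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

train_mse = mean_squared_error(y_tr, one_nn.predict(X_tr))
test_mse = mean_squared_error(y_te, one_nn.predict(X_te))
print("MSE on training data:", train_mse)
print("MSE on unseen data:  ", test_mse)
```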

### <font color='#31394d'> 1. Train/Test Split </font>

Before we train our KNN model (i.e. call the `fit` method), we can split the data into training and test sets. We then call the `fit` method on the training set and the `predict` method on the test set. Usually, the training set is larger than the test set (75%/25% and 80%/20% splits are common).

We can use the `train_test_split` function from the `sklearn.model_selection` submodule to easily split the dataset into training and test subsets. We use the argument `test_size` to set the proportion of rows assigned to the test set (e.g. `0.2` for 20%).

The full dataset is divided row-wise into training and test sets *at random*. This means that each time we run `train_test_split`, we will get different datasets. In order to make sure that we get the same splits again and again, we can fix the *random seed*; that is, the number that numpy uses to start its random number generation (used to calculate the splits). We can use the argument `random_state` to set the random seed for `train_test_split`.
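As a quick check of the reproducibility claim, the sketch below splits a small made-up dataframe (`toy` is a hypothetical name) twice with the same `random_state` and confirms that the same rows land in the test set both times.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A made-up 10-row dataframe, purely for illustration
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Two independent calls with the same seed
a_train, a_test = train_test_split(toy, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(toy, test_size=0.2, random_state=0)

print("Same test rows:", a_test.index.equals(b_test.index))
print("Train rows:", len(a_train), "Test rows:", len(a_test))
```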

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=12345)

print('Training set has', train.shape[0], 'rows')
print('Test set has', test.shape[0], 'rows')

Now we <font color='#d9c4b1'>**FIT**</font> the model on the <font color='#d9c4b1'>**TRAINING DATA**</font>:

In [None]:
model = KNeighborsRegressor()
model.fit(X=train.iloc[:,:-1], y=train.PRICE)

And <font color='#d9c4b1'>**PREDICT**</font> on the <font color='#d9c4b1'>**TEST DATA**</font>:

In [None]:
y_hat = model.predict(X=test.iloc[:,:-1])

Now we can calculate our evaluation metrics using scikit-learn's `metrics` submodule. These functions take the actual values and predicted values of the outcome variable as arguments:

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
mean_squared_error(y_true=test['PRICE'], y_pred=y_hat)

In [None]:
mean_absolute_error(y_true=test['PRICE'], y_pred=y_hat)

The magnitudes of the MSE and MAE are dependent on how the outcome variable is measured. They are therefore not comparable across datasets, but are useful for model and feature selection on a given dataset.
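The scale dependence can be illustrated with a small sketch on made-up numbers (the arrays below are hypothetical, not taken from the Boston data). Expressing the same prices in different units changes the MSE and MAE, whereas the R² score, which is also available in the `metrics` submodule, is unaffected.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual and predicted values
y_true_toy = np.array([20.0, 25.0, 30.0, 35.0])
y_pred_toy = np.array([22.0, 24.0, 27.0, 36.0])

results = {}
for scale in (1, 1000):  # e.g. thousands of dollars vs. dollars
    yt, yp = y_true_toy * scale, y_pred_toy * scale
    results[scale] = (
        mean_squared_error(yt, yp),   # grows with scale squared
        mean_absolute_error(yt, yp),  # grows with scale
        r2_score(yt, yp),             # unit-free
    )
    print(scale, results[scale])
```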

#### <font color='#d9c4b1'> Choosing k </font>

We can use the train/test split to compare candidate values of k and pick the one with the lowest test error:

In [None]:
knn1 = KNeighborsRegressor(n_neighbors = 1) # K = 1
knn1.fit(X=train.iloc[:,:-1], y=train.PRICE)
y_hat1 = knn1.predict(X=test.iloc[:,:-1])

knn3 = KNeighborsRegressor(n_neighbors = 3) # K = 3
knn3.fit(X=train.iloc[:,:-1], y=train.PRICE)
y_hat3 = knn3.predict(X=test.iloc[:,:-1])

knn5 = KNeighborsRegressor(n_neighbors = 5) # K = 5
knn5.fit(X=train.iloc[:,:-1], y=train.PRICE)
y_hat5 = knn5.predict(X=test.iloc[:,:-1])

In [None]:
print('MSE')
print('K = 1\t', mean_squared_error(y_true=test['PRICE'], y_pred=y_hat1))
print('K = 3\t', mean_squared_error(y_true=test['PRICE'], y_pred=y_hat3))
print('K = 5\t', mean_squared_error(y_true=test['PRICE'], y_pred=y_hat5))

The KNN model with k=3 has the lowest MSE on new data and is therefore the best. 
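The same comparison can be written more compactly as a loop over candidate values of k. The sketch below is self-contained and therefore uses synthetic data from `make_regression` as a stand-in (an assumption), so the winning k here says nothing about the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the Boston dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=12345)

# Fit one model per candidate k and record its test MSE
mse_by_k = {}
for k in (1, 3, 5):
    candidate = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    mse_by_k[k] = mean_squared_error(y_te, candidate.predict(X_te))

# The best k is the one with the smallest test MSE
best_k = min(mse_by_k, key=mse_by_k.get)
print(mse_by_k)
print("Best k:", best_k)
```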

🚀 <font color='#d9c4b1'> Exercise: </font> Try the MAE metric and see if you reach the same conclusion.

In [None]:
# your code goes here

### <font color='#31394d'> 2. Cross Validation </font>

Cross validation is an alternative approach to evaluating out-of-sample model performance. We split the data into *K* folds (not to be confused with the number of neighbours, k) and, for each fold, train the model on the other *K*-1 folds and evaluate it on the held-out fold. That way, we get out-of-sample predictions and errors for every data point, so we don't rely on a single test set.

For example, a 5-fold cross validation would look like this:

![title](media/cross_validation.png)
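The fold structure can also be inspected directly. The following is a minimal sketch using `KFold` from `sklearn.model_selection` on ten made-up rows (`X_demo` is a hypothetical name). It prints the row indices used for training and testing in each fold and counts how often each row appears in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up rows, one feature
X_demo = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
test_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    print("Fold", fold, "train:", train_idx, "test:", test_idx)
    test_counts[test_idx] += 1

# Every row is held out exactly once across the five folds
print(test_counts)
```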

The `cross_val_score` function in `scikit-learn` computes your choice of evaluation metric for each fold. To use this function, we first need to see what "scoring methods" are available:

In [None]:
from sklearn.metrics import SCORERS
SCORERS.keys()

The scoring methods are defined such that "bigger is better". So, if we want to use MSE, for example, we need to choose "neg_mean_squared_error" (the negative MSE), because a lower MSE means a better model.
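A brief sketch of the sign convention, using `get_scorer` from `sklearn.metrics` on a handful of made-up points (the arrays and the names `X_s`, `y_s` and `est` are hypothetical). The scorer takes a fitted estimator together with the features and targets, and returns the MSE with its sign flipped.

```python
import numpy as np
from sklearn.metrics import get_scorer, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: one feature, six rows
X_s = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_s = np.array([0.0, 1.5, 1.8, 3.2, 4.1, 4.9])

est = KNeighborsRegressor(n_neighbors=2).fit(X_s, y_s)

# Scorers are called as scorer(estimator, X, y)
neg_mse = get_scorer("neg_mean_squared_error")(est, X_s, y_s)
mse = mean_squared_error(y_s, est.predict(X_s))

print("Scorer output:", neg_mse)
print("Plain MSE:    ", mse)
```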

Note that we do **NOT** do a train/test split. We use the full dataset in the dataframe `df`.

In [None]:
from sklearn.model_selection import cross_val_score

knn_mod_5 = KNeighborsRegressor(n_neighbors = 5)

cv_scores = cross_val_score(estimator=knn_mod_5, X=df.iloc[:,:-1], y=df.PRICE, scoring="neg_mean_squared_error", cv=5)
cv_scores

The cross-validated MSE for the KNN model with $K=5$ is therefore:

In [None]:
-cv_scores.mean()

🚀 <font color='#d9c4b1'> Exercise: </font> Repeat the above for $K=1$ and $K=3$, and determine which value is best using 5-fold cross validation.

In [None]:
# your code goes here

If we want to get more information about each split, we can use the `cross_validate` function instead. It also accepts multiple scoring functions/evaluation metrics. Think of `cross_val_score` as the simplified version of `cross_validate`.

In [None]:
from sklearn.model_selection import cross_validate
scoring_functions = {"negMSE": "neg_mean_squared_error", "negMAE": "neg_mean_absolute_error"}
cv_info = cross_validate(estimator=model, X=df.iloc[:,:-1], y=df.PRICE, scoring=scoring_functions, cv=5, return_train_score=False)
cv_df = pd.DataFrame(cv_info)
cv_df

We get results for each one of the folds:
- fit time = how long it took to train the model
- score time = how long it took to make predictions and compute the score
- test scores are given for each one of the scoring functions (train scores are included as well if `return_train_score=True`)
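For comparison, here is a sketch of the same call with `return_train_score=True`, which adds one `train_` column per scoring function. It runs on synthetic stand-in data from `make_regression` (an assumption, so that the example is self-contained), and `demo_df` is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the Boston dataset)
X_syn, y_syn = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

demo_info = cross_validate(
    estimator=KNeighborsRegressor(n_neighbors=5),
    X=X_syn,
    y=y_syn,
    scoring={"negMSE": "neg_mean_squared_error"},
    cv=5,
    return_train_score=True,  # also report scores on the training folds
)

# One row per fold; columns are fit_time, score_time, test_* and train_*
demo_df = pd.DataFrame(demo_info)
print(demo_df.columns.tolist())
```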

In [None]:
cv_df.loc[:,cv_df.columns.str.startswith('test')].mean()