### Cross Validation and K-Nearest Neighbors

**Objectives**


- Use `KNeighborsRegressor` to model regression problems with scikit-learn
- Use `StandardScaler` to prepare data for KNN models
- Use `Pipeline` to combine preprocessing and modeling steps
- Use cross validation to evaluate models


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.datasets import make_blobs
from sklearn import set_config
#render estimators as diagrams when displayed
set_config(display='diagram')

### A Second Regression Model

In [None]:
#creating synthetic dataset
x = np.linspace(0, 5, 100)
y = 3*x + 4 + np.random.normal(scale = 3, size = len(x))
df = pd.DataFrame({'x': x, 'y': y})
df.head()

In [None]:
#plot data and new observation
plt.scatter(x, y)
plt.axvline(2, color='red', linestyle = '--', label = 'new input')
plt.grid()
plt.legend()
plt.title(r'What do you think $y$ should be?');

### K-Nearest Neighbors

To predict for a new input, average the $y$ values of its $k$ nearest neighbors. One way to think about "nearest" is Euclidean distance. We can compute the distance between each data point and the new data point at $x = 2$ with `np.linalg.norm`. With a single feature this reduces to $|x_i - 2|$, but `np.linalg.norm` also handles vectors with many features.

In [None]:
#compute distance from each point 
#to new observation
df['distance from x = 2'] = np.linalg.norm(df[['x']] - 2, axis = 1)
df.head()

In [None]:
#five nearest points
df.nsmallest(5, 'distance from x = 2')

In [None]:
#average of five nearest points
df.nsmallest(5, 'distance from x = 2')['y'].mean()

In [None]:
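#predicted value with 5 neighbors
#computed from the data rather than hard-coded, since y is random
prediction = df.nsmallest(5, 'distance from x = 2')['y'].mean()
plt.scatter(x, y)
plt.plot(2, prediction, 'ro', label = 'Prediction with 5 neighbors')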
plt.grid()
plt.legend();

#### Using `sklearn`

The `KNeighborsRegressor` estimator can be used to build the KNN model.  

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
#predict for all data
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(x.reshape(-1, 1), y)
predictions = knn.predict(x.reshape(-1, 1))
plt.scatter(x, y)
plt.step(x, predictions, '--r', label = 'predictions')
plt.grid()
plt.legend()
plt.title(r'Predictions with $k = 5$');
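
As a sanity check, the hand-computed average of the five nearest neighbors should agree with what `KNeighborsRegressor` predicts at $x = 2$. The sketch below rebuilds the synthetic data with a fixed seed (the seed and the variable names `manual_prediction` and `sklearn_prediction` are assumptions added for reproducibility, not part of the original notebook).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# same synthetic setup as above, with an assumed seed for reproducibility
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = 3 * x + 4 + rng.normal(scale=3, size=len(x))

# manual KNN: average y over the 5 points whose x is closest to 2
nearest_idx = np.argsort(np.abs(x - 2))[:5]
manual_prediction = y[nearest_idx].mean()

# the same prediction from scikit-learn
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(x.reshape(-1, 1), y)
sklearn_prediction = knn.predict(np.array([[2.0]]))[0]

print(manual_prediction, sklearn_prediction)
```

The two numbers match because, with the default uniform weights, the estimator simply averages the targets of the nearest training points.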

In [None]:
from ipywidgets import interact 
import ipywidgets as widgets

In [None]:
def knn_explorer(n_neighbors):
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    knn.fit(x.reshape(-1, 1), y)
    predictions = knn.predict(x.reshape(-1, 1))
    plt.scatter(x, y)
    plt.step(x, predictions, '--r', label = 'predictions')
    plt.grid()
    plt.legend()
    plt.title(f'Predictions with $k = {n_neighbors}$')
    plt.show();

In [None]:
#explore how predictions change as you change k
#IntSlider takes min and max for its bounds
interact(knn_explorer, n_neighbors = widgets.IntSlider(value = 1, 
                                                       min = 1, 
                                                       max = len(x)));

### Cross Validation

*"Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice."* -- [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

In short, this is a way for us to better understand the quality of the predictions made by our estimator. 

#### K-Fold Cross Validation

![](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict

In [None]:
#linear regression model


In [None]:
#knn model


In [None]:
#cross validate linear model


In [None]:
#cross validate knn model


#### Exercise: Predicting Bike Riders

Below, a dataset is loaded using the `fetch_openml` function. The objective is to predict the rider count.

In [None]:
from sklearn.datasets import fetch_openml

In [None]:
bikes = fetch_openml(data_id = 44063)

In [None]:
print(bikes.DESCR)

In [None]:
bikes.frame.info()

In [None]:
data = bikes.frame

In [None]:
data.head()

In [None]:
data['holiday'].value_counts()

In [None]:
data['weather'].value_counts()

In [None]:
data['season'].value_counts()

In [None]:
data['year'].value_counts()

In [None]:
#split data


In [None]:
#encode categorical features


In [None]:
#encode categorical features and scale others


In [None]:
#fit and transform training data


In [None]:
#transform test data


In [None]:
#linear regression model


In [None]:
#cross validate


In [None]:
#knn model with 5 neighbors


In [None]:
#cross validate
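
The steps outlined in the cells above (split, encode, scale, model, cross validate) can be sketched end to end on a small stand-in dataset. Everything about the data here is hypothetical: the toy frame, its `temp`, `humidity`, and `count` columns, and the seed are made up for illustration and are not the columns of the bikes data. Only the workflow is meant to carry over.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical stand-in data: one categorical and two numeric features
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'season': rng.choice(['spring', 'summer', 'fall', 'winter'], size=n),
    'temp': rng.uniform(0, 35, size=n),
    'humidity': rng.uniform(20, 100, size=n),
})
toy['count'] = (100 + 5 * toy['temp'] - 0.5 * toy['humidity']
                + rng.normal(scale=10, size=n))

# split data
X = toy.drop(columns='count')
y = toy['count']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# encode categorical features and scale the remaining numeric ones
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['season']),
    remainder=StandardScaler(),
)

# pipeline: preprocessing followed by a KNN model with 5 neighbors
pipe = Pipeline([('transform', transformer),
                 ('knn', KNeighborsRegressor(n_neighbors=5))])

# cross validate on the training data only (default scoring is R^2)
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# fit on all training data, then score on the held-out test set
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(cv_scores.mean(), test_r2)
```

Putting the transformer inside the `Pipeline` means the encoder and scaler are refit on the training portion of each fold, so no information from the validation fold leaks into preprocessing.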


#### Other Uses of KNN

Another use of the nearest-neighbors idea is imputing missing data. Here, we use the nearest data points to fill in missing values. Scikit-learn has a `KNNImputer` that fills each missing value with the average of that feature across the $n$ nearest neighbors.
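
A minimal sketch of the imputer on a tiny array, assuming `n_neighbors=2`. The array values are illustrative only. Distances are computed over the features that both rows have present, and each gap is filled with the mean of that feature among the two closest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# small array with two missing entries
X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]], dtype=float)

# fill each gap with the mean of the 2 nearest rows' values for that column
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

For the first row, the two nearest rows have 3 and 5 in the last column, so the gap becomes 4. For the third row, the nearest rows have 3 and 8 in the first column, giving 5.5.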

In [None]:
from sklearn.impute import KNNImputer

In [None]:
titanic = sns.load_dataset('titanic')
titanic.info()

In [None]:
# instantiate


In [None]:
# fit and transform


In [None]:
# encoder


In [None]:
# pipeline


In [None]:
# fit on train


In [None]:
# score on train and test
