# K-Nearest Neighbors

K-nearest neighbors (KNN) makes a prediction in two steps:

1. For each observation in the test data, find the $k$ nearest observations in the training data based on the input features **x**.
2. To predict the label for the test observation, average the labels of these $k$ nearest training observations.
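
The two steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `knn_regress` and the toy data are made up here, and Euclidean distance is assumed as the notion of "nearest".

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    """Predict a label for x_new by averaging the labels of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Step 1: Euclidean distance from x_new to every training observation.
    dists = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    # Indices of the k smallest distances (a stable sort keeps ties in their original order).
    nearest = np.argsort(dists, kind="stable")[:k]
    # Step 2: average the labels of those neighbors.
    return y_train[nearest].mean()

# Toy data, made up for illustration only.
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y_toy = [10.0, 20.0, 30.0, 100.0]

prediction = knn_regress(X_toy, y_toy, x_new=[0.1, 0.1], k=3)
print(prediction)  # 20.0, the mean of the labels 10, 20, and 30
```

The far-away point with label 100 is ignored because it is not among the three nearest neighbors.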


- **Training data:** The data for which we know the label.
- **Test data:** The data for which we don't know the label and want to predict it.

In [9]:
import pandas as pd

df = pd.read_csv("https://dlsun.github.io/pods/data/bordeaux.csv", index_col="year")
df.head()

year,price,summer,har,sep,win,age
1952,37.0,17.1,160,14.3,600,40
1953,63.0,16.7,80,17.3,690,39
1955,45.0,17.1,130,16.8,502,37
1957,22.0,16.1,110,16.2,420,35
1958,18.0,16.4,187,19.1,582,34


In [10]:
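# Note: label-based .loc slicing includes both endpoints,
# so 1980 appears in both the training and the test data.
df_train = df.loc[:1980].copy()
df_test = df.loc[1980:].copy()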

print(df_train.shape)
print(df_test.shape)

(27, 6)
(12, 6)


In [13]:
X_train = df_train[["win", "summer"]]
y_train = df_train["price"]

# Standardize the features.
X_train_mean = X_train.mean()
X_train_sd = X_train.std()
X_train_scaled = (X_train - X_train_mean) / X_train_sd

X_train_scaled.head()

year,win,summer
1952,-0.065156,0.965533
1953,0.632329,0.352135
1955,-0.82464,0.965533
1957,-1.460127,-0.56796
1958,-0.204653,-0.107912
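
Standardizing matters because KNN relies on distances, and a feature with large numeric values can dominate a raw distance. The sketch below uses the 1952 and 1953 rows copied from the tables above; the helper `squared_share` is a made-up name for illustration.

```python
import numpy as np

# (win, summer) for 1952 and 1953, copied from the tables above.
raw_1952 = np.array([600.0, 17.1])
raw_1953 = np.array([690.0, 16.7])
scaled_1952 = np.array([-0.065156, 0.965533])
scaled_1953 = np.array([0.632329, 0.352135])

def squared_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

raw_share = squared_share(raw_1952, raw_1953)
scaled_share = squared_share(scaled_1952, scaled_1953)

print(raw_share.round(4))     # win accounts for essentially all of the raw distance
print(scaled_share.round(4))  # after scaling, both features contribute comparably
```

On the raw scale, `summer` has almost no influence on which years count as "near"; after standardization both features play a role.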


In [15]:
X_test = df_test[["win", "summer"]]
X_test_scaled = (X_test - X_train_mean) / X_train_sd
X_test_scaled.head()

year,win,summer
1980,-0.235652,-0.72131
1981,-0.568896,0.812183
1982,0.802826,1.425581
1983,1.833554,1.425581
1984,-0.134905,0.045437


After standardizing, we can compute the Euclidean distance between a test observation (here, 1986) and every training observation to find its nearest neighbors.
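
For reference, the Euclidean distance between a test observation $\mathbf{x}$ and a training observation $\mathbf{x}'$ can be written as follows. The symbols $\mathbf{x}'$ and $p$ are notation introduced here for illustration; in this example $p = 2$, with the standardized `win` and `summer` as the two features.

$$
d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{p} \left(x_j - x'_j\right)^2}
$$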

In [16]:
import numpy as np

dists = np.sqrt(
    ((X_test_scaled.loc[1986] - X_train_scaled) ** 2).sum(axis=1)
)

dists

year
1952    1.259860
1953    1.159726
1955    1.314727
1957    1.149883
1958    0.212597
1959    1.936933
1960    1.557535
1961    2.575503
1962    1.038478
1963    0.983970
1964    1.976971
1965    1.412851
1966    2.007525
1967    1.180230
1968    0.395207
1969    0.320488
1970    0.765065
1971    0.772366
1972    2.004492
1973    1.898753
1974    0.085248
1975    0.922736
1976    2.288442
1977    2.269387
1978    1.729248
1979    1.203287
1980    0.474508
dtype: float64

Now we can find the nearest neighbors ($k=5$) by sorting the distances and keeping the five smallest.

In [17]:
index_nearest = dists.sort_values().index[:5]
index_nearest

Index([1974, 1958, 1969, 1968, 1980], dtype='int64', name='year')

To make a prediction, we average the labels of the nearest neighbors in the training data.

In [18]:
y_train[index_nearest].mean()

np.float64(13.2)

### KNN in Scikit-Learn

Scikit-learn provides a built-in model `KNeighborsRegressor` that fits KNN regression models.

- But first, we need to scale the training and test data.

In [19]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Scale the test data using a scaler that was fit to the training data!
X_test_scaled = scaler.transform(X_test)

In [20]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5)
model.fit(X=X_train_scaled, y=y_train)
model.predict(X=X_test_scaled)

array([14.2, 35.8, 54. , 52.2, 18.4, 35.6, 13.2, 37. , 51.4, 36.6, 36.6,
       40.6])
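
One detail is worth knowing when comparing the manual calculation with scikit-learn: pandas `.std()` divides by $n - 1$ by default, whereas `StandardScaler` uses the population standard deviation (dividing by $n$). The scaled values differ slightly, but every feature is rescaled by the same constant factor, so all distances are multiplied by that constant and the ranking of neighbors is unchanged. A minimal sketch with made-up data (the function name `neighbor_order` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) * [100.0, 1.0]  # made-up training features on different scales
x_new = np.array([50.0, 0.5])                # made-up test observation

def neighbor_order(ddof):
    """Rank training rows by distance to x_new after standardizing with the given ddof."""
    mean, sd = X.mean(axis=0), X.std(axis=0, ddof=ddof)
    dists = np.sqrt((((X - mean) / sd - (x_new - mean) / sd) ** 2).sum(axis=1))
    return np.argsort(dists, kind="stable"), dists

order_sample, dists_sample = neighbor_order(ddof=1)  # same convention as pandas .std()
order_pop, dists_pop = neighbor_order(ddof=0)        # same convention as StandardScaler

n = len(X)
# Same neighbor ranking under both conventions.
print(np.array_equal(order_sample, order_pop))
# Distances differ only by the constant factor sqrt(n / (n - 1)).
print(np.allclose(dists_pop, dists_sample * np.sqrt(n / (n - 1))))
```

This is consistent with the prediction of 13.2 for 1986 appearing in both the manual calculation and the scikit-learn output.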

### Pipelines in Scikit-Learn

Real machine learning workflows typically involve many more preprocessing steps than a single scaler.

Scikit-Learn's `Pipeline` allows us to chain steps together.

**Pipeline:** A structure that runs multiple processing steps sequentially as if they were a single model.

In [22]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5)
)

- We can use a pipeline like any other machine learning model.

In [23]:
pipeline.fit(X=X_train, y=y_train)
pipeline.predict(X_test)

array([14.2, 35.8, 54. , 52.2, 18.4, 35.6, 13.2, 37. , 51.4, 36.6, 36.6,
       40.6])
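
To see what the pipeline does internally, the sketch below compares it with the manual scale-then-fit approach on made-up data (all variable names and values here are illustrative assumptions). Calling `fit` on the pipeline fits the scaler on the training data only, and `predict` applies that already-fitted scaler to the new data before passing it to the KNN model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(30, 2)) * [100.0, 1.0]  # made-up training features
y_tr = rng.normal(size=30)                      # made-up training labels
X_te = rng.normal(size=(5, 2)) * [100.0, 1.0]   # made-up test features

# Manual version: fit the scaler on training data, then fit the model on scaled data.
scaler = StandardScaler().fit(X_tr)
model = KNeighborsRegressor(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
manual_pred = model.predict(scaler.transform(X_te))

# Pipeline version: the same two steps, chained into one object.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
pipe_pred = pipe.fit(X_tr, y_tr).predict(X_te)

print(np.allclose(manual_pred, pipe_pred))  # both approaches give the same predictions
print(list(pipe.named_steps))               # step names generated by make_pipeline
```

`make_pipeline` names each step after its lowercased class name, so the individual steps can be retrieved through `pipe.named_steps` if needed.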