KNN finds the nearest neighbors by measuring distances between data points, so feature scaling is very relevant to it (both regression and classification). Let's do it without scaling first to see what happens. We are using the good old diamonds dataset again.
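
To make the role of distances concrete, here is a minimal sketch with made-up numbers (the six rows below are illustrative assumptions, not rows from the diamonds dataset, and `nearest_to_first` is a hypothetical helper). It suggests how a feature on a larger scale can dominate the Euclidean distance until each column is standardized.

```python
import numpy as np

# Made-up points: columns are carat (roughly 0.3 to 2) and table (roughly 53 to 66).
X = np.array([
    [0.30, 57.0],  # query point
    [0.32, 60.0],  # almost the same carat, different table
    [2.00, 57.5],  # very different carat, similar table
    [1.00, 53.0],
    [0.50, 66.0],
    [1.50, 62.0],
])

def nearest_to_first(points):
    """Return the index of the row (other than row 0) closest to row 0."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

# Standardize each column: subtract its mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

nearest_raw = nearest_to_first(X)
nearest_scaled = nearest_to_first(X_scaled)
print("Nearest neighbor without scaling:", nearest_raw)
print("Nearest neighbor with scaling:   ", nearest_scaled)
```

Without scaling, the table column decides the neighbor simply because its numbers are bigger; after standardizing, the point with nearly identical carat becomes the nearest one.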

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
import sklearn.metrics

## Let's start without scaling!

In [3]:
diamonds = sns.load_dataset('diamonds')

In [4]:
diamonds.head()

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75


Let's pick out our non-numeric columns.

In [12]:
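# Some seaborn versions may load these columns as 'category' instead of 'object', so select both
categorical = list(diamonds.select_dtypes(include=['object', 'category']).columns)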
print(categorical)

['cut', 'color', 'clarity']


In [15]:
dummies = []
for categorical_column in categorical:
    dummies.append(pd.get_dummies(diamonds[categorical_column], drop_first=True))

In [19]:
for dummy in dummies:
    print(dummy.head(), "\n")

   Good  Ideal  Premium  Very Good
0     0      1        0          0
1     0      0        1          0
2     1      0        0          0
3     0      0        1          0
4     1      0        0          0 

   E  F  G  H  I  J
0  1  0  0  0  0  0
1  1  0  0  0  0  0
2  1  0  0  0  0  0
3  0  0  0  0  1  0
4  0  0  0  0  0  1 

   IF  SI1  SI2  VS1  VS2  VVS1  VVS2
0   0    0    1    0    0     0     0
1   0    1    0    0    0     0     0
2   0    0    0    1    0     0     0
3   0    0    0    0    1     0     0
4   0    0    1    0    0     0     0 
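
As an alternative to the loop above, `pd.get_dummies` can also encode several columns in one call through its `columns` parameter. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration; only the column names mirror the diamonds data). Note that this variant prefixes each dummy column with its source column name, so the result is not named identically to the loop version.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the diamonds data.
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29],
    "cut": ["Ideal", "Premium", "Good", "Premium"],
    "color": ["E", "E", "I", "J"],
})

# Encode both categorical columns at once; the numeric column passes through untouched.
encoded = pd.get_dummies(toy, columns=["cut", "color"], drop_first=True)
print(list(encoded.columns))
```

The prefixes (`cut_`, `color_`) make it easier to tell which original column a dummy came from once everything sits in one DataFrame.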



We now want to concatenate our DataFrames. For that, we need **one** flat list of DataFrames. Just writing `[diamonds, dummies]` would result in a nested list: `[diamonds, [dummy1, dummy2, ...]]`. So we need to **unpack** the dummy list. We can do so by writing an asterisk before the list's name.

In [26]:
data = pd.concat([diamonds.drop(categorical, axis=1), *dummies], axis=1)
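
The unpacking used above can be seen in isolation with plain lists. In this small sketch, strings stand in for the DataFrames (the names `base`, `parts`, `nested`, and `flat` are placeholders chosen for illustration):

```python
# Strings stand in for DataFrames here; the behavior of * is the same.
base = "diamonds"
parts = ["dummy1", "dummy2", "dummy3"]

nested = [base, parts]   # the second element is itself a list
flat = [base, *parts]    # the asterisk spreads the list's elements in place

print(nested)
print(flat)
```

`pd.concat` expects the flat form, which is why the asterisk is needed in the cell above.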

In [27]:
X = data.drop("price", axis=1)
y = data["price"]

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [29]:
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [30]:
y_pred = knn.predict(X_test)

In [32]:
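# mean_squared_error expects (y_true, y_pred); the value is the same either way, but this is the documented order
print("RMSE: ", np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred)))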

RMSE:  1037.2553377657998


## Now with scaling!

In [33]:
st_scaler = StandardScaler()

In [34]:
st_scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [35]:
X_train_scaled = st_scaler.transform(X_train)

In [38]:
knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

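We can actually shorten the notation here, if we want to, by transforming the test set inline. Note that we reuse the scaler fitted on the training data and do not fit it again on the test data!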

In [39]:
y_pred = knn.predict(st_scaler.transform(X_test))

In [40]:
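# mean_squared_error expects (y_true, y_pred); the value is the same either way, but this is the documented order
print("RMSE: ", np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred)))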

RMSE:  862.1279180790867
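
If we want to shorten things even further, scikit-learn's `make_pipeline` can bundle the scaler and the regressor into a single estimator. The following is only a sketch under stated assumptions: it uses synthetic stand-in data (not the diamonds data) so that it runs on its own, and the name `model` is chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch):
# one small-scale feature that drives the target, one large-scale noise feature.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])
y = 10 * X[:, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# fit() fits the scaler on the training data only; predict() reuses that fitted scaler.
model = make_pipeline(StandardScaler(), KNeighborsRegressor())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print("RMSE: ", rmse)
```

A pipeline like this should give the same predictions as scaling by hand, while making it harder to forget the transform step on the test set.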


#### Scaling brought the RMSE down from about 1037 to about 862, bravo!