# kNN Regression with Python
## Boston Housing data

R data sets are available from the [Rdatasets collection](http://vincentarelbundock.github.io/Rdatasets/datasets.html).

This notebook will use the Boston Housing data.

In [1]:
# load the data
import pandas as pd
df = pd.read_csv('data/Boston.csv')

In [2]:
# train test split
from sklearn.model_selection import train_test_split

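# predictors: the first 12 columns (positions 0 through 11)
X = df.iloc[:, 0:12]
# target: the column at position 13 (the column at position 12 is used in neither X nor y)
y = df.iloc[:, 13]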

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

print('train size:', X_train.shape)
print('test size:', X_test.shape)

train size: (404, 12)
test size: (102, 12)


In [3]:
# train the algorithm
from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                    weights='uniform')
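
The parameters printed above (`weights='uniform'`, `metric='minkowski'` with `p=2`, which is Euclidean distance) suggest what the regressor computes for each query point: the plain average of the targets of the `n_neighbors` closest training rows. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation. The function `knn_predict` and the toy data `X_toy` and `y_toy` are hypothetical names introduced only for illustration.

```python
import math

def knn_predict(X_train, y_train, query, k=3):
    """Predict a value for `query` as the unweighted mean of the k nearest targets."""
    # Euclidean distance (Minkowski with p=2) from the query to every training row
    distances = [math.dist(row, query) for row in X_train]
    # indices of the k smallest distances
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # uniform weights: each neighbor contributes equally
    return sum(y_train[i] for i in nearest) / k

# hypothetical toy data: two features, five training rows
X_toy = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [8.0, 8.0], [9.0, 9.0]]
y_toy = [10.0, 20.0, 30.0, 80.0, 90.0]

prediction = knn_predict(X_toy, y_toy, query=[2.0, 2.5], k=3)
print(prediction)  # 20.0, the mean of the targets 20, 30 and 10
```

Because the prediction depends only on distances, no model is fitted in the usual sense; `fit` mainly stores the training data in a structure that makes the neighbor search efficient.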

In [4]:
# make predictions

y_pred = regressor.predict(X_test)

In [5]:
# evaluation
from sklearn.metrics import mean_squared_error, r2_score
print('mse=', mean_squared_error(y_test, y_pred))
# r2_score returns the coefficient of determination (R^2), not a correlation
print('r2=', r2_score(y_test, y_pred))

mse= 66.1533442265795
r2= 0.35553728726247713
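
The second number printed above comes from `r2_score`, which computes the coefficient of determination, R² = 1 − SS_res / SS_tot, rather than a correlation coefficient. The sketch below recomputes both metrics by hand on hypothetical toy values (`y_true`, `y_good` and `y_bad` are made up for illustration) to show one practical difference: R² becomes negative when predictions are worse than always predicting the mean.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# hypothetical toy values
y_true = [3.0, 5.0, 7.0, 9.0]
y_good = [2.5, 5.5, 7.0, 8.0]   # close to the truth
y_bad = [9.0, 7.0, 5.0, 3.0]    # the truth in reverse order

print(mse_by_hand(y_true, y_good), r2_by_hand(y_true, y_good))  # 0.375 0.925
print(mse_by_hand(y_true, y_bad), r2_by_hand(y_true, y_bad))    # 20.0 -3.0
```

The reversed predictions are perfectly (negatively) correlated with the truth, yet their R² is −3.0, which is why the two quantities should not be used interchangeably.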


The MSE of 66.15 was considerably higher than the MSE of 27.85 for linear regression, and the R² score was much lower. Trying k=5 and k=7 resulted in worse performance.
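
Comparing several values of k can be scripted rather than rerun by hand. The sketch below is one possible approach under stated assumptions: it uses leave-one-out error on the training rows instead of the test-set comparison described above, so the test set is not used for tuning, and it runs on synthetic data rather than the Boston data. The names `knn_predict`, `loo_mse`, `X_syn` and `y_syn` are hypothetical.

```python
import math
import random

def knn_predict(X_train, y_train, query, k):
    """Unweighted mean of the targets of the k nearest training rows."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    return sum(y_train[i] for i in order[:k]) / k

def loo_mse(X, y, k):
    """Leave-one-out MSE: predict each row from all the other rows."""
    total = 0.0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        total += (y[i] - knn_predict(X_rest, y_rest, X[i], k)) ** 2
    return total / len(X)

# synthetic data (an assumption, not the Boston data): y = 2*x1 + x2 + noise
rng = random.Random(1234)
X_syn = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(60)]
y_syn = [2 * a + b + rng.gauss(0, 1) for a, b in X_syn]

# one leave-one-out MSE per candidate k; the smallest wins
scores = {k: loo_mse(X_syn, y_syn, k) for k in (3, 5, 7)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Which k wins depends on the data, so the result on synthetic rows says nothing about the Boston data; the point is only the selection loop.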

### Scaling

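In R, much improved results were achieved by scaling the data for kNN. Because kNN ranks neighbors by distance, features measured on large scales dominate the distance when the data is left unscaled. The following code tries kNN again after standardizing the features.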

In [6]:
from sklearn import preprocessing

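# fit the scaler on the training data only, so no test-set statistics leak in
scaler = preprocessing.StandardScaler().fit(X_train)

# apply the same training-set means and standard deviations to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)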
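
To see what this step does, the sketch below standardizes hypothetical toy rows by hand: each column has its training mean subtracted and is divided by its training standard deviation (the population standard deviation, which is the convention `StandardScaler` is understood to use). The helper names `fit_standardizer`, `standardize` and `nearest_index` and the toy values are assumptions made for illustration. The example also shows how the nearest neighbor of a query can change once a large-valued column stops dominating the distance.

```python
import math
import statistics

def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest_index(rows, point):
    """Index of the row with the smallest Euclidean distance to `point`."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], point))

# hypothetical toy rows: column 0 is in the hundreds, column 1 lies between 0 and 1
train = [[300.0, 0.1], [320.0, 0.9], [400.0, 0.2], [500.0, 0.8]]
query = [[305.0, 0.85]]

# statistics come from the training rows only and are reused for the query
means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
query_scaled = standardize(query, means, stds)

raw_nearest = nearest_index(train, query[0])
scaled_nearest = nearest_index(train_scaled, query_scaled[0])
print(raw_nearest, scaled_nearest)  # 0 1
```

Before scaling, row 0 wins because it is only 5 units away in column 0, even though it is far away in column 1. After scaling, both columns contribute comparably and row 1 becomes the nearest neighbor.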

In [7]:
regressor2 = KNeighborsRegressor(n_neighbors=3)
regressor2.fit(X_train_scaled, y_train)

# make predictions
y_pred2 = regressor2.predict(X_test_scaled)

# evaluation
print('mse=', mean_squared_error(y_test, y_pred2))
# r2_score returns the coefficient of determination (R^2), not a correlation
print('r2=', r2_score(y_test, y_pred2))

mse= 19.192755991285402
r2= 0.8130250898777173


Much better results than the non-scaled version: the MSE dropped from 66.15 to 19.19 and the R² score rose from 0.36 to 0.81. This also improves on the linear regression MSE of 27.85.