# Predicting house prices using k-nearest neighbors regression
In this notebook, you will implement k-nearest neighbors regression. You will:
  * Find the k-nearest neighbors of a given query input
  * Predict the output for the query input using the k-nearest neighbors
  * Choose the best value of k using a validation set

In [1]:
import numpy as np
import pandas as pd

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [9]:
train = pd.read_csv('data/kc_house_data_small_train.csv', dtype=dtype_dict)
test = pd.read_csv('data/kc_house_data_small_test.csv', dtype=dtype_dict)
validation = pd.read_csv('data/kc_house_data_validation.csv', dtype=dtype_dict)

In [5]:
def get_numpy_data(data, features, output):
    data['constant'] = 1  # add a constant (intercept) column; this modifies `data` in place

    # prepend 'constant' to the features list
    features = ['constant'] + features

    # select the feature columns and convert them to a 2-D NumPy array
    features_matrix = data[features].to_numpy()

    # convert the target column to a 1-D NumPy array
    output_array = data[output].to_numpy()

    return features_matrix, output_array

In [6]:
def normalize_features(features):
    norms = np.linalg.norm(features, axis=0)
    normalized_features = features/norms
    return (normalized_features, norms)

In [15]:
features = ['bedrooms',  
                'bathrooms',  
                'sqft_living',  
                'sqft_lot',  
                'floors',
                'waterfront',  
                'view',  
                'condition',  
                'grade',  
                'sqft_above',  
                'sqft_basement',
                'yr_built',  
                'yr_renovated',  
                'lat',  
                'long',  
                'sqft_living15',  
                'sqft_lot15']
output = 'price'
features_train, output_train = get_numpy_data(train, features, output)
features_test, output_test = get_numpy_data(test, features, output)
features_valid, output_valid = get_numpy_data(validation, features, output)

In [16]:
features_train, norms = normalize_features(features_train)
features_test = features_test / norms
features_valid = features_valid / norms
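
The cell above divides the test and validation features by the norms computed on the training set, so all three sets share one scale. A minimal sketch of that idea on made-up numbers (the `toy_*` arrays are illustrative and not part of the housing data):

```python
import numpy as np

def normalize_features(features):
    # column-wise 2-norms, as in the notebook's helper
    norms = np.linalg.norm(features, axis=0)
    return features / norms, norms

# hypothetical toy data: 2 training rows, 1 query row, 2 features
toy_train = np.array([[3.0, 5.0],
                      [4.0, 12.0]])
toy_query = np.array([[6.0, 13.0]])

toy_train_normalized, toy_norms = normalize_features(toy_train)
# reuse the TRAINING norms for the query; do not recompute them on the query
toy_query_normalized = toy_query / toy_norms

print(toy_norms)             # [ 5. 13.]
print(toy_query_normalized)  # [[1.2 1. ]]
```

After normalization each training column has unit norm, while query values may fall outside that range because they were scaled with norms they did not contribute to.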

In [None]:
print(features_test[0])
print(features_train[9])

[ 0.01345102  0.01551285  0.01807473  0.01759212  0.00160518  0.017059
  0.          0.05102365  0.0116321   0.01564352  0.01362084  0.02481682
  0.01350306  0.          0.01345387 -0.01346922  0.01375926  0.0016225 ]
[ 0.01345102  0.01163464  0.00602491  0.0083488   0.00050756  0.01279425
  0.          0.          0.01938684  0.01390535  0.0096309   0.
  0.01302544  0.          0.01346821 -0.01346251  0.01195898  0.00156612]


8. Quiz Question: What is the Euclidean distance between the query house (features_test[0]) and the 10th house of the training set (features_train[9])?

In [25]:
dist = np.linalg.norm(features_test[0]-features_train[9])
dist



0.05972359371398078

In [26]:
dist2 = np.sqrt(np.sum((features_test[0]-features_train[9])**2))
dist2


0.05972359371398078

In [35]:
query_10dist = [np.sqrt(np.sum((features_test[0]-features_train[i])**2)) for i in range(10)]

In [36]:
query_10dist 

[0.06027470916295592,
 0.08546881147643746,
 0.06149946435279315,
 0.05340273979294363,
 0.05844484060170442,
 0.059879215098128345,
 0.05463140496775461,
 0.055431083236146074,
 0.052383627840220305,
 0.05972359371398078]

10. Quiz Question: Among the first 10 training houses, which house is the closest to the query house?

In [37]:
np.argmin(query_10dist)

8

In [39]:
diff = features_test[0]-features_train[0:10]
diff

array([[ 0.00000000e+00,  3.87821276e-03,  1.20498190e-02,
         1.05552733e-02, -2.08673616e-04,  8.52950206e-03,
         0.00000000e+00,  5.10236549e-02,  0.00000000e+00,
         3.47633726e-03,  5.50336860e-03,  2.48168183e-02,
         1.63756198e-04,  0.00000000e+00,  1.70254220e-05,
        -1.29876855e-05,  5.14364795e-03, -6.69281453e-04],
       [ 0.00000000e+00,  3.87821276e-03,  4.51868214e-03,
         2.26610387e-03, -7.19763456e-04,  0.00000000e+00,
         0.00000000e+00,  5.10236549e-02,  0.00000000e+00,
         3.47633726e-03, -1.30705004e-03,  1.45830788e-02,
         1.91048898e-04, -6.65082271e-02, -4.23090220e-05,
        -6.16364736e-06,  2.89330197e-03, -1.47606982e-03],
       [ 0.00000000e+00,  7.75642553e-03,  1.20498190e-02,
         1.30002801e-02, -1.60518166e-03,  8.52950206e-03,
         0.00000000e+00,  5.10236549e-02,  0.00000000e+00,
         5.21450589e-03,  8.32384500e-03,  2.48168183e-02,
         3.13866046e-04,  0.00000000e+00, -4.70885840e

In [41]:
np.sqrt(np.sum(diff**2, axis=1))

array([0.06027471, 0.08546881, 0.06149946, 0.05340274, 0.05844484,
       0.05987922, 0.0546314 , 0.05543108, 0.05238363, 0.05972359])

In [46]:
def compute_distances(features_instances, features_query):
    distances = np.sqrt(np.sum((features_query-features_instances)**2, axis=1))
    return distances
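
compute_distances relies on NumPy broadcasting: subtracting a 1-D query of shape (d,) from a 2-D matrix of shape (n, d) subtracts the query from every row. A small sketch on made-up points (the `toy_*` names are illustrative), checked against np.linalg.norm:

```python
import numpy as np

def compute_distances(features_instances, features_query):
    # (n, d) - (d,) broadcasts the query across all n rows
    diff = features_instances - features_query
    return np.sqrt(np.sum(diff**2, axis=1))

# hypothetical toy data: three 2-D points and one query point
toy_instances = np.array([[0.0, 0.0],
                          [3.0, 4.0],
                          [6.0, 8.0]])
toy_query = np.array([3.0, 4.0])

toy_distances = compute_distances(toy_instances, toy_query)
print(toy_distances)  # [5. 0. 5.]

# same result from the built-in row-wise norm
toy_check = np.linalg.norm(toy_instances - toy_query, axis=1)
print(np.allclose(toy_distances, toy_check))  # True
```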

In [47]:
compute_distances(features_train[0:10],features_test[0])

array([0.06027471, 0.08546881, 0.06149946, 0.05340274, 0.05844484,
       0.05987922, 0.0546314 , 0.05543108, 0.05238363, 0.05972359])

16. Quiz Question: Take the query house to be the third house of the test set (features_test[2]).  What is the index of the house in the training set that is closest to this query house?

In [50]:
np.argmin(compute_distances(features_train,features_test[2]))

382

In [65]:
def k_nearest_neighbors(k, features_train, features_query):
    # use the features_train argument rather than a notebook-level global
    distances = compute_distances(features_train, features_query)
    sorted_indices = distances.argsort()
    neighbors = features_train[sorted_indices[0:k]]
    return neighbors, sorted_indices[0:k]
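
Sorting every distance is more work than needed when k is small. One possible alternative, sketched here with a hypothetical helper `k_nearest_indices` that is not part of the notebook, uses np.argpartition to pick the k smallest distances and then sorts only those k:

```python
import numpy as np

def k_nearest_indices(k, distances):
    # hypothetical helper: indices of the k smallest distances, nearest first
    # argpartition places the k smallest values in the first k slots (unordered)
    candidates = np.argpartition(distances, k - 1)[:k]
    # order just those k candidates by their distance
    return candidates[np.argsort(distances[candidates])]

# made-up distances for illustration
toy_distances = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
toy_nearest = k_nearest_indices(3, toy_distances)
print(toy_nearest)  # [1 3 2]
```

When there are no ties this returns the same indices as `distances.argsort()[0:k]`; with tied distances the order among the tied entries may differ.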

In [66]:
k_nearest_neighbors(1, features_train, features_test[2])

(array([[ 0.01345102,  0.01163464,  0.00903736,  0.01013783,  0.00264759,
          0.0085295 ,  0.        ,  0.        ,  0.0116321 ,  0.01216718,
          0.006948  ,  0.0176532 ,  0.01344165,  0.        ,  0.01341151,
         -0.0134471 ,  0.00925857,  0.00340725]]),
 array([382]))

In [63]:
features_train[382]

array([ 0.01345102,  0.01163464,  0.00903736,  0.01013783,  0.00264759,
        0.0085295 ,  0.        ,  0.        ,  0.0116321 ,  0.01216718,
        0.006948  ,  0.0176532 ,  0.01344165,  0.        ,  0.01341151,
       -0.0134471 ,  0.00925857,  0.00340725])

In [86]:
output_train[382]

249000.0

19. Quiz Question: Take the query house to be the third house of the test set (features_test[2]).  What are the indices of the 4 training houses closest to the query house?

In [67]:
k_nearest_neighbors(4, features_train, features_test[2])

(array([[ 0.01345102,  0.01163464,  0.00903736,  0.01013783,  0.00264759,
          0.0085295 ,  0.        ,  0.        ,  0.0116321 ,  0.01216718,
          0.006948  ,  0.0176532 ,  0.01344165,  0.        ,  0.01341151,
         -0.0134471 ,  0.00925857,  0.00340725],
        [ 0.01345102,  0.01163464,  0.01204982,  0.010436  ,  0.00160197,
          0.0085295 ,  0.        ,  0.        ,  0.0116321 ,  0.01216718,
          0.00653525,  0.02046748,  0.0130732 ,  0.        ,  0.01348083,
         -0.01346768,  0.01093025,  0.00202813],
        [ 0.01345102,  0.01163464,  0.01204982,  0.00811027,  0.00080259,
          0.0085295 ,  0.        ,  0.        ,  0.0116321 ,  0.01216718,
          0.00502182,  0.01611814,  0.01355083,  0.        ,  0.01348326,
         -0.0134679 ,  0.01048018,  0.00202813],
        [ 0.01345102,  0.01163464,  0.00903736,  0.01079381,  0.00163086,
          0.0085295 ,  0.        ,  0.        ,  0.0116321 ,  0.01216718,
          0.00708559,  0.01995579,  0.0

In [68]:
def predict_output_of_query(k, features_train, output_train, features_query):
    neighbors, index = k_nearest_neighbors(k, features_train, features_query)
    prediction = np.mean(output_train[index])
    return prediction

21. Quiz Question: Again taking the query house to be the third house of the test set (features_test[2]), predict the value of the query house using k-nearest neighbors with k=4 and the simple averaging method described and implemented above.

In [69]:
predict_output_of_query(4, features_train, output_train, features_test[2])

413987.5

In [70]:
def predict_output(k, features_train, output_train, features_query):
    predictions = []
    for query in features_query:
        predictions.append(predict_output_of_query(k, features_train, output_train, query))
    return predictions
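
To see the averaging step in isolation, here is a compact sketch on made-up one-feature data. The helper `knn_predict` and the `toy_*` arrays are illustrative assumptions; the helper folds the distance, neighbor-selection, and averaging steps into one function:

```python
import numpy as np

def knn_predict(k, features_train, output_train, features_query):
    # hypothetical compact helper: distance -> k nearest -> mean of their outputs
    distances = np.sqrt(np.sum((features_train - features_query)**2, axis=1))
    nearest = np.argsort(distances)[:k]
    return np.mean(output_train[nearest])

# made-up data: one feature, four training houses
toy_features = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_prices = np.array([100.0, 200.0, 300.0, 1000.0])
toy_queries = np.array([[0.9], [9.0]])

# collect predictions in an array so they can be subtracted from true outputs
toy_predictions = np.array(
    [knn_predict(2, toy_features, toy_prices, q) for q in toy_queries]
)
print(toy_predictions)  # [150. 650.]
```

The first query averages the prices of its two nearest houses (200 and 100); the second averages 1000 and 300, which shows how a distant second neighbor can pull the prediction when the training data is sparse.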

23. Quiz Question: Make predictions for the first 10 houses in the test set, using k=10. What is the index of the house in this query set that has the lowest predicted value?  What is the predicted value of this house?

In [82]:
pred10=predict_output(10, features_train, output_train, features_test[0:10])

In [83]:
np.argmin(pred10)

6

In [85]:
pred10[6]

350032.0

## Choosing the best value of k using a validation set

In [98]:
RSS_list = []
for k in range(1, 16):
    predictions = np.array(predict_output(k, features_train, output_train, features_valid))
    residuals = predictions - output_valid
    RSS = np.sum(residuals**2)
    print("k: ", k, "RSS: ", RSS)
    RSS_list.append(RSS)

k:  1 RSS:  105453830251561.0
k:  2 RSS:  83445073504025.5
k:  3 RSS:  72692096019202.56
k:  4 RSS:  71946721652091.69
k:  5 RSS:  69846517419718.6
k:  6 RSS:  68899544353180.836
k:  7 RSS:  68341973450051.09
k:  8 RSS:  67361678735491.5
k:  9 RSS:  68372727958976.09
k:  10 RSS:  69335048668556.74
k:  11 RSS:  69523855215598.83
k:  12 RSS:  69049969587246.17
k:  13 RSS:  70011254508263.69
k:  14 RSS:  70908698869034.34
k:  15 RSS:  71106928385945.16
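
The best k can be read off programmatically instead of by eye. A short sketch, with the validation RSS values copied from the output above into a list named `rss_by_k` (a name chosen here for illustration):

```python
import numpy as np

# validation RSS for k = 1..15, copied from the printed output above
rss_by_k = [
    105453830251561.0, 83445073504025.5, 72692096019202.56,
    71946721652091.69, 69846517419718.6, 68899544353180.836,
    68341973450051.09, 67361678735491.5, 68372727958976.09,
    69335048668556.74, 69523855215598.83, 69049969587246.17,
    70011254508263.69, 70908698869034.34, 71106928385945.16,
]

# argmin is 0-based while k starts at 1, so shift by one
best_k = int(np.argmin(rss_by_k)) + 1
print(best_k)  # 8
```

In the notebook itself the same line works on RSS_list directly. The validation RSS drops quickly up to k=3, reaches its minimum at k=8, and rises slowly afterwards.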


25. Quiz Question: What is the RSS on the TEST data using the value of k found above?  To be clear, sum over all houses in the TEST set.

In [78]:
pred = predict_output(8, features_train, output_train, features_test)

In [92]:
residuals = output_test - pred
squared = residuals**2
RSS = squared.sum()


In [93]:
print("{:.2E}".format(RSS))

1.33E+14
