# Predicting house prices using k-nearest neighbors regression
In this notebook, you will implement k-nearest neighbors regression. You will:
  * Find the k-nearest neighbors of a given query input
  * Predict the output for the query input using the k-nearest neighbors
  * Choose the best value of k using a validation set

# Fire up Turi Create

In [1]:
import turicreate as tc
import scipy.stats as stats
import numpy as np
import pandas as pd  # used below to wrap NumPy arrays before converting them to SFrames
import matplotlib.pyplot as plt
%matplotlib inline

### Load house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [65]:
df_sales = tc.SFrame('../../../data/kc_house_data.gl/')

In [66]:
df_sales.head()

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900,3.0,1.0,1180.0,5650,1.0,0
6414100192,2014-12-09 00:00:00+00:00,538000,3.0,2.25,2570.0,7242,2.0,0
5631500400,2015-02-25 00:00:00+00:00,180000,2.0,1.0,770.0,10000,1.0,0
2487200875,2014-12-09 00:00:00+00:00,604000,4.0,3.0,1960.0,5000,1.0,0
1954400510,2015-02-18 00:00:00+00:00,510000,3.0,2.0,1680.0,8080,1.0,0
2008000270,2015-01-15 00:00:00+00:00,291850,3.0,1.5,1060.0,9711,1.0,0
2414600126,2015-04-15 00:00:00+00:00,229500,3.0,1.0,1780.0,7470,1.0,0
1736800520,2015-04-03 00:00:00+00:00,662500,3.0,2.5,3560.0,9796,1.0,0
9297300055,2015-01-24 00:00:00+00:00,650000,4.0,3.0,2950.0,5000,2.0,0
6865200140,2014-05-29 00:00:00+00:00,485000,4.0,1.0,1600.0,4300,1.5,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,6,770,0,1933,0,98028,47.73792661
0,5,7,1050,910,1965,0,98136,47.52082
0,3,8,1680,0,1987,0,98074,47.61681228
0,3,7,1060,0,1963,0,98198,47.40949984
0,3,7,1050,730,1960,0,98146,47.51229381
0,3,8,1860,1700,1965,0,98007,47.60065993
3,3,9,1980,970,1979,0,98126,47.57136955
0,4,7,1600,0,1916,0,98103,47.66478645

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0
-122.39318505,1360.0,5000.0
-122.04490059,1800.0,7503.0
-122.31457273,1650.0,9711.0
-122.33659507,1780.0,8113.0
-122.14529566,2210.0,8925.0
-122.37541218,2140.0,4000.0
-122.34281613,1610.0,4300.0


Because the features in this dataset have very different scales (e.g. price is in the hundreds of thousands while the number of bedrooms is in the single digits), it is important to normalize the features.

To efficiently compute pairwise distances among data points, we will convert the SFrame into a 2D Numpy array. First import the numpy library and then copy and paste `get_numpy_data()` from the second notebook of Week 2.

#### Takes the data frame imported from the sales table and splits it into arrays, separating the features from the output

In [81]:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 
    features = ['constant'] + features 
    features_sframe = data_sframe[features] 
    feature_matrix = features_sframe.to_numpy()
    output_sarray = data_sframe[output]
    output_array = output_sarray.to_numpy()
    return feature_matrix, output_array

In [82]:
(example_features, example_output) = get_numpy_data(df_sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print (example_features[0,:]) # this accesses the first row of the data the ':' indicates 'all columns'
print (example_output[0]) # and the corresponding output

[1.00e+00 1.18e+03]
221900


In [83]:
train_and_validation, test = df_sales.random_split(.8, seed=1) # initial train/test split
train, validation = train_and_validation.random_split(.8, seed=1) # split training set into training and validation sets

# Extract features and normalize

Using all of the numerical inputs listed in `feature_list`, transform the training, test, and validation SFrames into Numpy arrays:

In [78]:
feature_list = ['bedrooms',  
                'bathrooms',  
                'sqft_living',  
                'sqft_lot',  
                'floors',
                'waterfront',  
                'view',  
                'condition',  
                'grade',  
                'sqft_above',  
                'sqft_basement',
                'yr_built',  
                'yr_renovated',  
                'lat',  
                'long',  
                'sqft_living15',  
                'sqft_lot15']
features_train, output_train = get_numpy_data(train, feature_list, 'price')
features_test, output_test = get_numpy_data(test, feature_list, 'price')
features_valid, output_valid = get_numpy_data(validation, feature_list, 'price')

In [None]:
feature_list

In computing distances, it is crucial to normalize features. Otherwise, for example, the `sqft_living` feature (typically on the order of thousands) would exert a much larger influence on distance than the `bedrooms` feature (typically on the order of ones). We divide each column of the training feature matrix by its 2-norm, so that the transformed column has unit norm.

IMPORTANT: Make sure to store the norms of the features in the training set. The features in the test and validation sets must be divided by these same norms, so that the training, test, and validation sets are normalized consistently.
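
As a minimal sketch of the rule above, the snippet below uses small made-up matrices (`toy_train` and `toy_test` are illustrative names, not part of the notebook's data). The column norms are computed once on the training data and then reused for the other set:

```python
import numpy as np

# Toy data (hypothetical values): 3 training rows, 2 feature columns
toy_train = np.array([[3.0, 10.0],
                      [4.0, 20.0],
                      [0.0, 20.0]])
toy_test = np.array([[6.0, 15.0]])

# 2-norm of each training column (5 and 30 for this toy data)
toy_norms = np.linalg.norm(toy_train, axis=0)

# Divide every set by the TRAINING norms so all sets share one scale
toy_train_normalized = toy_train / toy_norms
toy_test_normalized = toy_test / toy_norms

print(toy_norms)
print(np.linalg.norm(toy_train_normalized, axis=0))  # each training column now has unit norm
print(toy_test_normalized)                           # test columns generally do NOT have unit norm
```

Note that only the training columns end up with unit norm; the test and validation columns are merely put on the same scale.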

### There are several kinds of scaling; the one used here divides each column by its 2-norm, so every entry has magnitude at most 1

In [79]:
def normalize_features(feature_matrix):
    norms = np.linalg.norm(feature_matrix, axis=0)
    normalized_features = feature_matrix / norms
    return normalized_features, norms

In [90]:
features_train, norms = normalize_features(features_train) # normalize training set features (columns)
features_train= tc.SFrame(data=pd.DataFrame(features_train))
features_test = features_test / norms # normalize test set by training set norms
features_test = tc.SFrame(data=pd.DataFrame(features_test))
features_valid = features_valid / norms # normalize validation set by training set norms
features_valid = tc.SFrame(data=pd.DataFrame(features_valid))

# Model

In [94]:
#model = tc.nearest_neighbors.create(features_train)
model = tc.nearest_neighbors.create(features_train, features=['0', '1', '2'])

In [95]:
model.summary()

Class                          : NearestNeighborsModel

Attributes
----------
Method                         : ball_tree
Number of distance components  : 1
Number of examples             : 5527
Number of feature columns      : 3
Number of unpacked features    : 3
Total training time (seconds)  : 1.0425

Ball Tree Attributes
--------------------
Tree depth                     : 4
Leaf size                      : 1000



To retrieve the five closest neighbors for new data points or a subset of the original reference data, we query the model with the query method. Query points must also be contained in an SFrame, and must have columns with the same names as those used to construct the model (additional columns are allowed, but ignored). The result of the query method is an SFrame with four columns: query label, reference label, distance, and rank of the reference point among the query point's nearest neighbors.

In [96]:
knn = model.query(features_train[:5], k=5)
knn.head()

query_label,reference_label,distance,rank
0,0,0.0,1
0,6,0.0,2
0,15,0.0,3
0,21,0.0,4
0,70,0.0,5
1,27,0.0,1
1,63,0.0,2
1,115,0.0,3
1,120,0.0,4
1,161,0.0,5


In some cases the query dataset is the reference dataset. To construct the similarity graph on the reference data, use the model's `similarity_graph` method. For brute force models it can be almost twice as fast as a regular query, depending on the data sparsity and chosen distance function. By default, `similarity_graph` returns an SGraph whose vertices are the rows of the reference dataset and whose edges indicate a nearest neighbor match: the destination vertex of an edge is a nearest neighbor of the source vertex. `similarity_graph` can also return results in the same form as the `query` method if desired.

In [97]:
sim_graph = model.similarity_graph(k=3)

In [100]:
sim_graph.summary()

{'num_edges': 16581, 'num_vertices': 5527}
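
The edge count in the summary is consistent with one edge per (vertex, neighbor) pair: 5527 vertices times k = 3 gives 16581 edges. The sketch below reproduces that bookkeeping in plain NumPy on random toy points. It assumes self-edges are excluded, and `knn_edge_list` is a hypothetical helper, not a Turi Create function:

```python
import numpy as np

def knn_edge_list(points, k):
    """Return (source, destination) pairs linking each row to its k nearest OTHER rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # assumption: a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [(src, int(dst)) for src in range(len(points)) for dst in nearest[src]]

rng = np.random.default_rng(0)
toy_points = rng.random((6, 2))                # 6 toy vertices
edges = knn_edge_list(toy_points, k=3)
print(len(edges))                              # 6 vertices * 3 neighbors = 18 edges
```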

### Distance functions
The most critical choice in computing nearest neighbors is the distance function that measures the dissimilarity between any pair of observations.

For numeric data, the options are euclidean, manhattan, cosine, and transformed_dot_product. For data in dictionary format (i.e. sparse data), jaccard and weighted_jaccard are also options, in addition to the numeric distances. For string features, use levenshtein distance, or use the text analytics toolkit's count_ngrams feature to convert strings to dictionaries of words or character shingles, then use Jaccard or weighted Jaccard distance. Leaving the distance parameter set to its default value of auto tells the model to choose the most reasonable distance based on the type of features in the reference data. In the following output cell, the second line of the model summary confirms our choice of Manhattan distance.
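
To make the numeric options concrete, here is a small sketch of three of them written directly in NumPy. These are the textbook definitions, offered as an illustration rather than as Turi Create's own implementations, and the vectors `u` and `v` are made up:

```python
import numpy as np

def euclidean(a, b):
    # square root of the summed squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute differences
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # one minus the cosine of the angle between the vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 0.0, 2.0])
v = np.array([4.0, 4.0, 2.0])

print(euclidean(u, v))        # 5.0
print(manhattan(u, v))        # 7.0
print(cosine_distance(u, v))
```

The same pair of points can rank differently against other points depending on which function is used, which is why the choice matters.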

In [102]:
model = tc.nearest_neighbors.create(features_train, features=['0', '1', '2'],
                                    distance='manhattan')
model.summary()

Class                          : NearestNeighborsModel

Attributes
----------
Method                         : ball_tree
Number of distance components  : 1
Number of examples             : 5527
Number of feature columns      : 3
Number of unpacked features    : 3
Total training time (seconds)  : 0.0269

Ball Tree Attributes
--------------------
Tree depth                     : 4
Leaf size                      : 1000



Distance functions are also exposed in the turicreate.distances module. This allows us not only to specify the distance argument for a nearest neighbors model as a distance function (rather than a string), but also to use that function for any other purpose.

In the following snippet we use a nearest neighbors model to find the closest reference points to the first three rows of our dataset, then confirm the results by computing a couple of the distances manually with the Manhattan distance function.

In [103]:
model = tc.nearest_neighbors.create(features_train, features=['0', '1', '2'],
                                    distance=tc.distances.manhattan)
knn = model.query(features_train[:3], k=3)
knn.print_rows()

sf_check = features_train[['0', '1', '2']]
print ("distance check 1:", tc.distances.manhattan(sf_check[2], sf_check[10]))
print ("distance check 2:", tc.distances.manhattan(sf_check[2], sf_check[14]))

+-------------+-----------------+----------+------+
| query_label | reference_label | distance | rank |
+-------------+-----------------+----------+------+
|      0      |       2809      |   0.0    |  1   |
|      0      |       2810      |   0.0    |  2   |
|      0      |       2808      |   0.0    |  3   |
|      1      |        1        |   0.0    |  1   |
|      1      |        27       |   0.0    |  2   |
|      1      |        63       |   0.0    |  3   |
|      2      |       1428      |   0.0    |  1   |
|      2      |        2        |   0.0    |  2   |
|      2      |       5470      |   0.0    |  3   |
+-------------+-----------------+----------+------+
[9 rows x 4 columns]

distance check 1: 0.016793789802370986
distance check 2: 0.01528756242317695


### Search methods
Another important choice in model creation is the method. The brute_force method computes the distance between a query point and each of the reference points, with a run time linear in the number of reference points. Creating a model with the ball_tree method takes more time, but leads to much faster queries by partitioning the reference data into successively smaller balls and searching only those that are relatively close to the query. The default method is auto which chooses a reasonable method based on both the feature types and the selected distance function. The method parameter is also specified when the model is created. The third row of the model summary confirms our choice to use the ball tree in the next example.
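
For intuition, a brute-force search can be sketched in a few lines of NumPy: compute the distance from the query to every reference row, then sort. This is only an illustration of the idea under a Euclidean-distance assumption; `brute_force_query` is a hypothetical helper and says nothing about how Turi Create implements its methods internally:

```python
import numpy as np

def brute_force_query(query, reference, k):
    """Return (indices, distances) of the k reference rows closest to query."""
    # one distance per reference row, so the cost grows linearly with the row count
    distances = np.sqrt(np.sum((reference - query) ** 2, axis=1))
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]

reference_points = np.array([[0.0, 0.0],
                             [1.0, 0.0],
                             [0.0, 2.0],
                             [3.0, 4.0]])
idx, dist = brute_force_query(np.array([0.0, 0.1]), reference_points, k=2)

for rank, (i, d) in enumerate(zip(idx, dist), start=1):
    print("reference_label", int(i), "distance", float(d), "rank", rank)
```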

In [105]:
model = tc.nearest_neighbors.create(features_train, features=['0', '1', '2'],
                                    method='ball_tree', leaf_size=5)
model.summary()

Class                          : NearestNeighborsModel

Attributes
----------
Method                         : ball_tree
Number of distance components  : 1
Number of examples             : 5527
Number of feature columns      : 3
Number of unpacked features    : 3
Total training time (seconds)  : 0.0454

Ball Tree Attributes
--------------------
Tree depth                     : 12
Leaf size                      : 5



*** QUIZ QUESTION ***

Again taking the query house to be the third house of the test set (`features_test[2]`), predict the value of the query house using k-nearest neighbors with `k=4` and the simple averaging method described and implemented above.

1 nearest 249000   
4 nearest 413987

Compare this predicted value using 4-nearest neighbors to the predicted value using 1-nearest neighbor computed earlier.

In [43]:
# combined predict function: find the k nearest neighbors, then average their outputs
def KNNPredict (k, query_vector, train_matrix, output_train):
    knn_idx = kNearest(k, query_vector, train_matrix)
    knn_predict = kNearestPred(knn_idx, output_train)
    return knn_predict

In [44]:
KNNPredict (4, features_test[2], features_train, output_train)

413987.5

## Make multiple predictions

Write a function to predict the value of *each and every* house in a query set. (The query set can be any subset of the dataset, be it the test set or validation set.) The idea is to have a loop where we take each house in the query set as the query house and make a prediction for that specific house. The new function should take the following parameters:
 * the value of k;
 * the feature matrix for the training houses;
 * the output values (prices) of the training houses; and
 * the feature matrix for the query set.
 
The function should return a set of predicted values, one for each house in the query set.

**Hint**: To get the number of houses in the query set, use the `.shape` field of the query features matrix. See [the documentation](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ndarray.shape.html).
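
One possible shape for such a function is sketched below on toy data. It assumes Euclidean distance and simple averaging; the name `predict_knn_batch` and all toy values are illustrative and not taken from the notebook:

```python
import numpy as np

def predict_knn_batch(k, train_matrix, train_outputs, query_matrix):
    """Average the outputs of the k nearest training rows for every query row."""
    num_queries = query_matrix.shape[0]        # number of houses in the query set
    predictions = np.zeros(num_queries)
    for i in range(num_queries):
        # distance from query i to every training row
        distances = np.sqrt(np.sum((train_matrix - query_matrix[i]) ** 2, axis=1))
        nearest = np.argsort(distances)[:k]
        predictions[i] = train_outputs[nearest].mean()
    return predictions

toy_train = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_prices = np.array([100.0, 200.0, 300.0, 1000.0])
toy_queries = np.array([[0.9], [9.0]])

toy_predictions = predict_knn_batch(2, toy_train, toy_prices, toy_queries)
print(toy_predictions)  # averages of (200, 100) and (1000, 300): 150 and 650
```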

#### Now we can do something similar, but instead of a single query vector we use a subset made of the first q rows of the test set

In [45]:
#takes the first q to make predictions
def qPredictKNN (q, k, features_test, features_train, output_train):
    predict_vector=[]
    subset = features_test[0:q]
    for j in range(len(subset)):  # range works in both Python 2 and 3
        knnpredict = KNNPredict (k, subset[j], features_train, output_train)
        predict_vector.append(knnpredict)
    return predict_vector

In [46]:
qPredictKNN(10, 10, features_test, features_train, output_train )

[881300.0,
 431860.0,
 460595.0,
 430200.0,
 766750.0,
 667420.0,
 350032.0,
 512800.70000000001,
 484000.0,
 457235.0]

*** QUIZ QUESTION ***

Make predictions for the first 10 houses in the test set using k-nearest neighbors with `k=10`. 

1. What is the index of the house in this query set that has the lowest predicted value? 
2. What is the predicted value of this house?

In [48]:
7, 350032

(7, 350032)

## Choosing the best value of k using a validation set

There remains a question of choosing the value of k to use in making predictions. Here, we use a validation set to choose this value. Write a loop that does the following:

* For `k` in [1, 2, ..., 15]:
    * Makes predictions for each house in the VALIDATION set using the k-nearest neighbors from the TRAINING set.
    * Computes the RSS for these predictions on the VALIDATION set
    * Stores the RSS computed above in `rss_all`
* Report which `k` produced the lowest RSS on VALIDATION set.

(Depending on your computing environment, this computation may take 10-15 minutes.)
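
The loop described above can be sketched as follows. This version runs on synthetic data generated on the spot (not the house data), uses a vectorized helper `knn_predict_all` that is hypothetical, and assumes Euclidean distance with simple averaging:

```python
import numpy as np

def knn_predict_all(k, train_x, train_y, query_x):
    # pairwise distances, shape (num_queries, num_train), via broadcasting
    dist = np.sqrt(((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(dist, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)

# Synthetic stand-in data: a noisy linear relationship
rng = np.random.default_rng(0)
train_x = rng.random((200, 1))
train_y = 3.0 * train_x[:, 0] + rng.normal(0.0, 0.1, 200)
valid_x = rng.random((50, 1))
valid_y = 3.0 * valid_x[:, 0] + rng.normal(0.0, 0.1, 50)

rss_by_k = []
for k in range(1, 16):                         # k = 1, 2, ..., 15
    residuals = valid_y - knn_predict_all(k, train_x, train_y, valid_x)
    rss_by_k.append(float((residuals ** 2).sum()))

best_k = int(np.argmin(rss_by_k)) + 1          # +1 because k starts at 1
print(best_k, rss_by_k[best_k - 1])
```

Computing all pairwise distances at once trades memory for speed, which may or may not be acceptable for the full validation set.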

In [173]:
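# get_residual_sum_of_squares
# NOTE: as written, this returns the square ROOT of the RSS;
# remove np.sqrt to get the RSS itself (the plain sum of squared residuals).
def get_rss(predictions, real_outcome):
    RSS = np.sqrt(((real_outcome - predictions)**2).sum())
    return RSS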

In [174]:
q = len(features_valid)
k_predict = qPredictKNN(q, 0, features_valid, features_train, output_train )

IndexError: too many indices for array

In [175]:
get_rss(k_predict, output_valid)

13559459.125031011

In [170]:
qPredictKNN(10, 4, features_valid, features_train, output_train)

[456750.0,
 308625.0,
 619000.0,
 482250.0,
 365962.5,
 548862.5,
 189945.0,
 326612.5,
 576125.0,
 198345.0]

In [172]:
ks = 15
q = len(features_valid)
rss_all = []
for k in range(1, ks + 1):  # k = 1, 2, ..., 15 (k = 0 has no neighbors to average)
    # q predictions, one per house in the validation set
    k_predict = qPredictKNN(q, k, features_valid, features_train, output_train)
    rss_all.append(get_rss(k_predict, output_valid))
print(rss_all)

In [None]:
L1_penalty = np.logspace(1, 7, num=13)
RSS = np.zeros(len(L1_penalty))
for i in range(len(L1_penalty)):
    lp = L1_penalty[i]
    m = tc.linear_regression.create(training, target='price', features=all_features, validation_set=None, verbose=False, l2_penalty=0., l1_penalty=lp)
    p = m.predict(validation)
    res = p - validation['price']
    RSS[i] = (res*res).sum()
    print('i = ', i, 'L1P = ', lp, 'RSS = ', RSS[i])

In [None]:
L1_penalty = np.logspace(1, 7, num=13)
RSS = np.zeros(len(L1_penalty))
for i in range(len(L1_penalty)):
    lp = L1_penalty[i]
    get_rss(model, data[i], outcome[i])
    print('i = ', i, 'L1P = ', lp, 'RSS = ', RSS[i])

In [None]:
TODO: wrap the function in a loop so that it generates rss_all
TODO: check why the reference RSS has no sqrt

To visualize the performance as a function of `k`, plot the RSS on the VALIDATION set for each considered `k` value:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

kvals = range(1, 16)
plt.plot(kvals, rss_all,'bo-')

*** QUIZ QUESTION ***

What is the RSS on the TEST data using the value of k found above?  To be clear, sum over all houses in the TEST set.
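
For reference, RSS is the plain sum of squared residuals over all houses, with no square root. A minimal sketch on made-up numbers (`residual_sum_of_squares` and the toy values are illustrative, not results from this notebook):

```python
import numpy as np

def residual_sum_of_squares(predictions, true_values):
    # accept lists or arrays; residual = true value minus prediction
    residuals = np.asarray(true_values, dtype=float) - np.asarray(predictions, dtype=float)
    return float(np.sum(residuals ** 2))

toy_true = [300000.0, 450000.0, 500000.0]
toy_pred = [310000.0, 440000.0, 520000.0]

# residuals are -10000, 10000, -20000, so RSS = 1e8 + 1e8 + 4e8 = 6e8
print(residual_sum_of_squares(toy_pred, toy_true))
```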

In [80]:
# Between 8e13 and 2e14 (correct)