# k-means rooms
In this notebook, you will cluster Airbnb listings with k-means. You will:
  * Normalize the numeric features of each listing
  * Choose the number of clusters K with a rule of thumb and fit a k-means model
  * Inspect the resulting clusters and flag single-member clusters as potential anomalies

In [1]:
import turicreate as tc
import scipy.stats as stats
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Load Airbnb listings data

The dataset contains Airbnb room listings in Belgium.

In [2]:
df_rooms = tc.SFrame('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,str,str,int,float,int,float,float,str,float,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [69]:
df_rooms_ = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/airbnb-belgium.csv')

In [7]:
df_rooms.head()

room_id,host_id,room_type,borough,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms
14054734,33267800,Shared room,Brussel,Brussel,1,0.0,2,1.0
16151530,105088596,Shared room,Brussel,Brussel,1,0.0,1,1.0
14678546,30043608,Shared room,Brussel,Brussel,14,4.5,2,1.0
8305401,43788729,Shared room,Namur,Namur,12,4.5,2,1.0
14904339,15277691,Shared room,Namur,Gembloux,1,0.0,6,1.0
16228753,61781546,Shared room,Antwerpen,Antwerpen,3,4.5,2,1.0
643309,3216639,Shared room,Roeselare,Roeselare,6,4.0,6,1.0
3879691,19998594,Shared room,Brugge,Knokke-Heist,1,0.0,12,1.0
3710876,18917692,Shared room,Antwerpen,Antwerpen,11,3.0,3,1.0
5141135,20676997,Shared room,Gent,Gent,9,4.5,2,1.0

price,minstay,latitude,longitude,last_modified
55.0,,50.847703,4.379786,2016-12-31 14:49:05.125349 ...
42.0,,50.821832,4.366557,2016-12-31 14:49:05.112730 ...
43.0,,50.847657,4.348675,2016-12-31 14:49:05.110143 ...
48.0,,50.462592,4.818974,2016-12-31 14:49:05.107436 ...
59.0,,50.562263,4.693185,2016-12-31 14:49:05.101899 ...
53.0,,51.203401,4.392493,2016-12-31 14:49:05.096266 ...
22.0,,50.941016,3.123627,2016-12-31 14:49:03.811667 ...
33.0,,51.339016,3.273554,2016-12-31 14:49:02.743608 ...
33.0,,51.232425,4.424612,2016-12-31 14:49:02.710383 ...
38.0,,51.034197,3.714149,2016-12-31 14:49:02.705108 ...


In [9]:
df_rooms[['reviews','overall_satisfaction','accommodates','bedrooms','latitude','longitude']]

reviews,overall_satisfaction,accommodates,bedrooms,latitude,longitude
1,0.0,2,1.0,50.847703,4.379786
1,0.0,1,1.0,50.821832,4.366557
14,4.5,2,1.0,50.847657,4.348675
12,4.5,2,1.0,50.462592,4.818974
1,0.0,6,1.0,50.562263,4.693185
3,4.5,2,1.0,51.203401,4.392493
6,4.0,6,1.0,50.941016,3.123627
1,0.0,12,1.0,51.339016,3.273554
11,3.0,3,1.0,51.232425,4.424612
9,4.5,2,1.0,51.034197,3.714149


Because the features in this dataset have very different scales (e.g. price is in the tens to hundreds while the number of bedrooms is in the single digits), it is important to normalize the features.

To efficiently compute pairwise distances among data points, we will convert the SFrame into a 2D NumPy array with the `get_numpy_data()` helper, which prepends a constant column and returns the feature matrix together with the output array.

In [18]:
numeric_features = ['price','reviews','overall_satisfaction','accommodates','bedrooms','latitude','longitude']

In [19]:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 
    features = ['constant'] + features 
    features_sframe = data_sframe[features] 
    feature_matrix = features_sframe.to_numpy()
    output_sarray = data_sframe[output]
    output_array = output_sarray.to_numpy()
    return feature_matrix, output_array

Using all of the numerical inputs listed in `numeric_features`, transform the SFrame into NumPy arrays:

In [31]:
normalized_rooms, output_rooms = get_numpy_data(df_rooms, numeric_features, 'price')

In computing distances, it is crucial to normalize features. Otherwise, for example, the `price` feature (typically in the tens to hundreds) would exert a much larger influence on distance than the `bedrooms` feature (typically on the order of ones). We divide each column of the feature matrix by its 2-norm, so that the transformed column has unit norm.

In [29]:
def normalize_features(feature_matrix):
    norms = np.linalg.norm(feature_matrix, axis=0)
    normalized_features = feature_matrix / norms
    return normalized_features, norms
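
As a quick sanity check of the normalization described above, the sketch below applies the same function to a small made-up matrix (the values in `toy` are illustrative, not taken from the dataset) and confirms that each column ends up with a 2-norm of 1.

```python
import numpy as np

def normalize_features(feature_matrix):
    # Column-wise 2-norms; each column is divided by its own norm.
    norms = np.linalg.norm(feature_matrix, axis=0)
    return feature_matrix / norms, norms

# Toy matrix (made-up values): constant, price-like, and bedrooms-like columns.
toy = np.array([[1.0, 55.0, 1.0],
                [1.0, 206.0, 4.0],
                [1.0, 38.0, 2.0]])

toy_normalized, toy_norms = normalize_features(toy)

# After normalization, every column should have unit 2-norm.
column_norms = np.linalg.norm(toy_normalized, axis=0)
print(column_norms)
```

Note that a column consisting only of zeros has norm 0, so the division would produce NaN values; the constant column added by `get_numpy_data()` avoids this for the intercept, but other all-zero columns would need separate handling.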

In [32]:
normalized_rooms, norms = normalize_features(normalized_rooms) # scale each feature column to unit 2-norm
normalized_rooms = tc.SFrame(data=pd.DataFrame(normalized_rooms))

In [73]:
K = int(np.sqrt(normalized_rooms.num_rows() / 2.0))

In [74]:
normalized_rooms.num_rows()

10037

In [38]:
print('Number of Clusters K: {}'.format(K))

Number of Clusters K: 70
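
The cells above pick K as the integer part of sqrt(n / 2), a rough rule of thumb rather than a tuned choice. The sketch below wraps that arithmetic in a hypothetical helper, `rule_of_thumb_k`, which is not part of the notebook or of Turi Create, and reproduces the value for the 10037 rows reported above.

```python
import numpy as np

def rule_of_thumb_k(n_rows):
    # Heuristic only: K ~ sqrt(n / 2), truncated to an integer.
    return int(np.sqrt(n_rows / 2.0))

# 10037 is the row count reported by normalized_rooms.num_rows().
k_example = rule_of_thumb_k(10037)
print('Number of Clusters K: {}'.format(k_example))
```

Because the heuristic ignores the structure of the data, it is best treated as a starting point; the many tiny clusters in the results below suggest that a smaller K may also be worth trying.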


In [39]:
kmeans_model = tc.kmeans.create(normalized_rooms, num_clusters=K)
kmeans_model.summary()

Class                            : KmeansModel

Schema
------
Number of clusters               : 70
Number of examples               : 10037
Number of feature columns        : 8
Number of unpacked features      : 8
Row label name                   : row_id

Training Summary
----------------
Training method                  : elkan
Number of training iterations    : 10
Batch size                       : 10037
Total training time (seconds)    : 1.1669

Accessible fields
-----------------
cluster_info                    : An SFrame containing the cluster centers.
cluster_id                      : An SFrame containing the cluster assignments.


The model summary shows the usual fields about model schema, training time, and training iterations. It also shows that the K-means results are returned in two SFrames contained in the model: `cluster_id` and `cluster_info`. The `cluster_info` SFrame contains the final cluster centers, one per row, in terms of the same features used to create the model.

The last three columns of the `cluster_info` SFrame hold metadata about the corresponding cluster: ID number, number of points in the cluster, and the within-cluster sum of squared distances to the center.
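
To make the two summary statistics concrete, here is a minimal NumPy sketch that computes the size and the within-cluster sum of squared distances for a toy example. The points, labels, and centers are made up, and `cluster_stats` is a hypothetical helper, not a Turi Create function.

```python
import numpy as np

# Toy data (made up): five 2-D points, their cluster labels, and two centers.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 11.0]])

def cluster_stats(points, labels, centers):
    # Returns one (cluster_id, size, sum_squared_distance) tuple per center.
    rows = []
    for cid, center in enumerate(centers):
        members = points[labels == cid]
        ssd = float(np.sum((members - center) ** 2))
        rows.append((cid, len(members), ssd))
    return rows

toy_stats = cluster_stats(points, labels, centers)
print(toy_stats)
```

When a center is the mean of its members, a cluster with a single member has a sum of squared distances of exactly 0, because the center coincides with that one point.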

In [54]:
cluster_info = kmeans_model.cluster_info[['cluster_id', 'size', 'sum_squared_distance']]
cluster_info.print_rows(num_rows=70, num_columns=3)

+------------+------+------------------------+
| cluster_id | size |  sum_squared_distance  |
+------------+------+------------------------+
|     0      | 142  | 0.0025081520896037546  |
|     1      | 537  |  0.003650761933201352  |
|     2      |  12  | 0.0008939443705457961  |
|     3      | 1266 | 0.0037716348996497118  |
|     4      |  83  | 0.0014725575267675595  |
|     5      |  12  | 0.0012036304324283265  |
|     6      | 146  | 0.0023938884569361107  |
|     7      |  76  | 0.0008792436790372449  |
|     8      | 272  | 0.0020642310127527708  |
|     9      | 109  | 0.0033534230260556797  |
|     10     | 158  | 0.0015577881564468044  |
|     11     | 237  |  0.00372449719600354   |
|     12     |  1   |          0.0           |
|     13     | 524  |  0.005082670406174827  |
|     14     |  1   |          0.0           |
|     15     |  3   | 0.0003956782675231807  |
|     16     |  93  | 0.0013953722102542088  |
|     17     |  9   |  0.001517144813988125  |
|    ...     | ...  |          ...           |
+------------+------+------------------------+

The `cluster_id` field of the model shows the cluster assignment for each input data point, along with the Euclidean distance from the point to its assigned cluster's center.
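
The assignment and distance described above can be sketched in plain NumPy. This is an illustration on made-up points and centers, not Turi Create's implementation: each point is assigned to its nearest center, and the Euclidean distance to that center is kept alongside the assignment.

```python
import numpy as np

# Toy data (made up): three 2-D points and two cluster centers.
points = np.array([[0.0, 0.0], [3.0, 4.0], [9.0, 10.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# Pairwise Euclidean distances, shape (n_points, n_centers), via broadcasting.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Index of the nearest center for each point, and the distance to it.
assigned = dists.argmin(axis=1)
dist_to_center = dists[np.arange(len(points)), assigned]

print(assigned)
print(dist_to_center)
```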

In [62]:
# Clusters with zero within-cluster spread (here, single-member clusters) are candidate anomalies
cluster_info[cluster_info['sum_squared_distance'] == 0]

cluster_id,size,sum_squared_distance
12,1,0.0
14,1,0.0
41,1,0.0
48,1,0.0


In [63]:
clusters = kmeans_model.cluster_id

In [71]:
row_ids = clusters[(clusters['cluster_id'] == 12) |
                  (clusters['cluster_id'] == 14) |
                  (clusters['cluster_id'] == 41) |
                  (clusters['cluster_id'] == 48)]['row_id']

In [72]:
df_rooms_.loc[row_ids]

Unnamed: 0,room_id,host_id,room_type,borough,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,minstay,latitude,longitude,last_modified
244,4733497,24249803,Entire home/apt,Antwerpen,Antwerpen,25,4.5,6,4.0,206.0,,51.217845,4.399271,2016-12-31 12:17:07.697407
247,1623135,8631558,Entire home/apt,Brussel,Brussel,0,0.0,4,2.0,271.0,,50.856277,4.351357,2016-12-31 12:16:52.900154
255,12805464,69782366,Entire home/apt,Bastogne,Bertogne,0,0.0,4,2.0,135.0,,50.037622,5.595746,2016-12-31 12:11:26.840680
7587,16568730,26291239,Entire home/apt,Gent,Gent,0,0.0,3,1.0,54.0,,51.03961,3.731732,2016-12-31 01:26:44.567521
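
The four cluster IDs above were typed in by hand. As a sketch of how the same selection could be derived from the cluster sizes instead, the pandas example below uses small made-up stand-ins for the cluster summary and the per-row assignments; the names `info`, `assignments`, `singleton_ids`, and `singleton_rows` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (made up) for the cluster summary and the per-row assignments.
info = pd.DataFrame({'cluster_id': [0, 1, 2], 'size': [3, 1, 1]})
assignments = pd.DataFrame({'row_id': [0, 1, 2, 3, 4],
                            'cluster_id': [0, 0, 1, 0, 2]})

# IDs of clusters that contain exactly one point.
singleton_ids = info.loc[info['size'] == 1, 'cluster_id']

# Row IDs of the points assigned to those single-member clusters.
in_singleton = assignments['cluster_id'].isin(singleton_ids)
singleton_rows = assignments.loc[in_singleton, 'row_id'].tolist()
print(singleton_rows)
```

Filtering on size rather than on hard-coded IDs keeps the selection valid if the model is retrained, since k-means cluster numbering can change between runs.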
