# Predicting Housing Prices with k-Nearest Neighbors

<img src="neighbors-talking-over-fence-min.jpg" alt="Neighbors talking over fence" style="width: 700px;"/>

### How would you predict the price of an apartment that is about to go on the market?

<img src='For-sale-sign.jpg' alt="For sale sign"/>

## Similar apartments should be similar in price

* Location, location, location!
* Square footage
* Number of bedrooms and bathrooms
* Floor of the building
* etc.

## Distance as a measure of similarity

How 'far away' are apartments from each other given all of their features?

## What is K-Nearest Neighbors?

**_K-Nearest Neighbors_** (or KNN, for short) is a supervised learning algorithm that can be used for both **_Classification_** and **_Regression_** tasks. KNN is a distance-based algorithm, meaning that it implicitly assumes that the smaller the distance between two points, the more similar they are. In KNN, each column acts as a dimension. In a dataset with two columns, we can easily visualize this by treating values for one column as X coordinates and the other as Y coordinates. Since this is a **_Supervised Learning Algorithm_**, we must also have the labels for each point in our dataset, or else we can't use this algorithm for prediction.

## Making Predictions with K

KNN takes a point that we want a prediction for and calculates the distances between that point and every single point in the training set. It then finds the `K` closest points, or **_Neighbors_**, and examines the values of each. You can think of each of the K closest points getting a 'vote' about the predicted value. For a regression task like ours, the prediction is often the mean of the neighbors' values.

In the following animation, K=3.

<img src='knn.gif'>
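
The prediction step described above can be sketched in a few lines of NumPy. This is only a minimal illustration on made-up one-dimensional data; the arrays and the `knn_predict` helper are hypothetical names, not part of the dataset used later.

```python
import numpy as np

# Hypothetical toy data: one feature (x) and a numeric target (y).
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y_train = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])

def knn_predict(x_new, k=3):
    """Predict a value for x_new as the mean target of its k nearest neighbors."""
    distances = np.abs(x_train - x_new)   # distance to every training point
    nearest = np.argsort(distances)[:k]   # indices of the k closest points
    return y_train[nearest].mean()        # each neighbor gets an equal 'vote'

# The three closest points to 2.5 are x = 1, 2 and 3.
prediction = knn_predict(2.5, k=3)
print(prediction)
```

With `k=3`, the three nearest training points to 2.5 have targets 10, 12 and 14, so the prediction is their mean.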

## Our Dataset

We'll be using a dataset scraped from apartment listings in June 2021. We'll also use the `pandas` library to load everything into a spreadsheet-like format:

In [1]:
import pandas as pd
df = pd.read_csv("apartment_listings.csv")
df

Unnamed: 0,Price,Street,Avenue,Bedrooms,Bathrooms,SquareFootage,Unit,Address
0,2300,15,1,2,1.0,900,1C,"342 E 15th St APT 1C, New York, NY 10003"
1,3000,14,1,0,1.0,550,9H,"333 E 14th St APT 9H, New York, NY 10003"
2,2250,15,1,1,1.0,650,18,"328 E 15th St APT 18, New York, NY 10003"
3,2950,14,2,2,1.0,800,1A,"321 E 14th St APT 1A, New York, NY 10003"
4,2700,14,3,1,1.0,800,7,"211 E 14th St APT 7, New York, NY 10003"
...,...,...,...,...,...,...,...,...
324,4865,28,11,1,1.0,653,,"282 11th Ave, New York, NY 10001"
325,5795,28,11,2,2.0,1001,,"282 11th Ave, New York, NY 10001"
326,6155,28,11,2,2.0,1185,,"282 11th Ave, New York, NY 10001"
327,6350,28,11,2,2.0,1094,,"282 11th Ave, New York, NY 10001"


All of these are currently labeled with prices. Let's split them into training and test sets so that we can evaluate how well our guesses work:

In [2]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, random_state=42)

train

Unnamed: 0,Price,Street,Avenue,Bedrooms,Bathrooms,SquareFootage,Unit,Address
31,3975,16,5,2,1.0,800,7,"10 E 16th St APT 7, New York, NY 10003"
223,3202,27,2,1,1.0,626,0-7N,"240 E 27th St, New York, NY 10016"
196,3935,25,6,0,1.0,542,5E,"55 W 25th St, New York, NY 10010"
152,5382,24,7,1,1.0,542,1-17C,"160 W 24th St, New York, NY 10011"
227,3465,27,2,1,1.0,764,-4B,"240 E 27th St, New York, NY 10016"
...,...,...,...,...,...,...,...,...
188,5149,25,9,2,1.0,824,5I,"401 W 25th St, New York, NY 10001"
71,2000,21,3,1,1.0,700,2D,"202 E 21st St APT 2D, New York, NY 10010"
106,5850,21,6,1,1.0,813,1102,"37 W 21st St, New York, NY 10010"
270,3799,27,6,0,1.0,446,20H,"800 6th Ave, New York, NY 10001"


### Find the 10 Nearest Neighbors Based on Just Street

In [3]:
test_sample = test.sample(n=1, random_state=100)
test_sample

Unnamed: 0,Price,Street,Avenue,Bedrooms,Bathrooms,SquareFootage,Unit,Address
318,3355,28,11,0,1.0,514,,"282 11th Ave, New York, NY 10001"


In [4]:
import numpy as np
train_with_distances = train.copy()

train_with_distances["Street Distance"] = np.abs(train_with_distances["Street"] - test_sample["Street"].values[0])
train_with_distances

Unnamed: 0,Price,Street,Avenue,Bedrooms,Bathrooms,SquareFootage,Unit,Address,Street Distance
31,3975,16,5,2,1.0,800,7,"10 E 16th St APT 7, New York, NY 10003",12
223,3202,27,2,1,1.0,626,0-7N,"240 E 27th St, New York, NY 10016",1
196,3935,25,6,0,1.0,542,5E,"55 W 25th St, New York, NY 10010",3
152,5382,24,7,1,1.0,542,1-17C,"160 W 24th St, New York, NY 10011",4
227,3465,27,2,1,1.0,764,-4B,"240 E 27th St, New York, NY 10016",1
...,...,...,...,...,...,...,...,...,...
188,5149,25,9,2,1.0,824,5I,"401 W 25th St, New York, NY 10001",3
71,2000,21,3,1,1.0,700,2D,"202 E 21st St APT 2D, New York, NY 10010",7
106,5850,21,6,1,1.0,813,1102,"37 W 21st St, New York, NY 10010",7
270,3799,27,6,0,1.0,446,20H,"800 6th Ave, New York, NY 10001",1


In [5]:
ten_nearest = train_with_distances.sort_values(by="Street Distance")[:10]
ten_nearest

Unnamed: 0,Price,Street,Avenue,Bedrooms,Bathrooms,SquareFootage,Unit,Address,Street Distance
313,4336,28,10,1,1.0,761,,"525 W 28th St, New York, NY 10001",0
327,6350,28,11,2,2.0,1094,,"282 11th Ave, New York, NY 10001",0
316,4675,28,10,1,1.0,657,,"525 W 28th St, New York, NY 10001",0
323,4730,28,11,1,1.0,719,,"282 11th Ave, New York, NY 10001",0
284,3335,28,4,0,1.0,440,1-6A,"50 E 28th St, New York, NY 10016",0
287,8250,28,4,2,2.0,1293,1-18K,"50 E 28th St, New York, NY 10016",0
294,2500,28,3,1,1.0,890,5F,"200 E 28th St APT 5F, New York, NY 10016",0
317,3125,28,11,0,1.0,505,,"282 11th Ave, New York, NY 10001",0
298,4100,28,2,1,1.0,675,15G,"247 E 28th St APT 15G, New York, NY 10016",0
285,3855,28,4,0,1.0,431,1-3F,"50 E 28th St, New York, NY 10016",0


In [6]:
ten_nearest["Price"].mean()

4525.6

We have a prediction! Was it close?

In [7]:
test_sample["Price"].values[0]

3355

In [8]:
ten_nearest[ten_nearest["Bedrooms"] == 0]["Price"].mean()

3438.3333333333335

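Averaging all ten neighbors overshoots the actual price by more than $1,000, but averaging only the zero-bedroom neighbors lands much closer. That suggests features other than street matter, so let's try to do better by adding more dimensions.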

## Distance Metrics

Once we start working in multiple dimensions, we have to choose a **_distance metric_**. For KNN, we can use **_Manhattan_**, **_Euclidean_**, or **_Minkowski Distance_**. From an algorithmic standpoint, it doesn't matter which! From a practical standpoint, however, the choice can affect our results and our overall model performance.
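
To make the relationship between these metrics concrete, here is a small sketch written directly in NumPy. The `minkowski_distance` helper and the two example points are illustrative assumptions, not the SciPy functions used in the cells that follow. Manhattan and Euclidean distance are the Minkowski distance with `p=1` and `p=2`, respectively.

```python
import numpy as np

def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two 1-D points."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# Two illustrative (Street, Avenue) points.
a = [28, 11]
b = [24, 4]

manhattan_d = minkowski_distance(a, b, p=1)  # |4| + |7|
euclidean_d = minkowski_distance(a, b, p=2)  # sqrt(4**2 + 7**2)
print(manhattan_d, euclidean_d)
```

Manhattan distance adds up the block-by-block differences, while Euclidean distance measures the straight line between the two points, so it is never larger.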

In [9]:
from scipy.spatial.distance import euclidean as euc, cityblock as manhattan, minkowski

In [10]:
train.iloc[31]

Price                                         3496
Street                                          24
Avenue                                           4
Bedrooms                                         1
Bathrooms                                        1
SquareFootage                                  891
Unit                                            1D
Address          124 E 24th St, New York, NY 10010
Name: 157, dtype: object

In [11]:
euc(test_sample[["Street", "Avenue"]].iloc[0], train.iloc[31][["Street", "Avenue"]])

8.06225774829855

In [12]:
manhattan(test_sample[["Street", "Avenue"]].iloc[0], train.iloc[31][["Street", "Avenue"]])

11

In [13]:
euc(test_sample[["Street"]].iloc[0], train.iloc[31][["Street"]])

4.0

In [14]:
euc(test_sample[["Street", "Avenue", "Bedrooms", "Bathrooms", "SquareFootage"]].iloc[0], train.iloc[31][["Street", "Avenue", "Bedrooms", "Bathrooms", "SquareFootage"]])

377.0875229969828
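
The jump from about 8 to about 377 is worth a closer look. The sketch below breaks the squared Euclidean distance into per-feature contributions, using the values of the test sample and of `train.iloc[31]` shown above. The variable names are illustrative.

```python
import numpy as np

features = ["Street", "Avenue", "Bedrooms", "Bathrooms", "SquareFootage"]
test_point = np.array([28, 11, 0, 1.0, 514])   # the test sample shown above
train_point = np.array([24, 4, 1, 1.0, 891])   # train.iloc[31] shown above

# Each feature's squared difference and its share of the total.
squared_diffs = (test_point - train_point) ** 2
share = squared_diffs / squared_diffs.sum()

for name, sq, frac in zip(features, squared_diffs, share):
    print(f"{name:>13}: squared diff = {sq:>9.0f}, share = {frac:.4%}")

total_distance = np.sqrt(squared_diffs.sum())
print(total_distance)
```

Nearly all of the distance comes from `SquareFootage`, simply because it is measured on a much larger scale than the other columns. Rescaling the features before computing distances is a common remedy, although this notebook does not do so.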

In [15]:
class KNN():
    
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        
    def predict(self, X_test, k=3):
        
        predictions = np.zeros(X_test.shape[0])
        
        for i, point in enumerate(X_test):
            distances = self._get_distances(point)
            k_nearest = self._get_k_nearest(distances, k)
            prediction = self._get_predicted_value(k_nearest)
            predictions[i] = prediction
            
        return predictions
    
    #helper functions
    def _get_distances(self, x):
        '''Take a single point and return an array of distances to every point in our dataset'''
        distances = np.zeros(self.X_train.shape[0])
        for i, point in enumerate(self.X_train):
            distances[i] = euc(x, point)
        return distances
    
    def _get_k_nearest(self, distances, k):
        '''Take in an array of distances and return the indices of the k nearest points'''
        nearest = np.argsort(distances)[:k]
        return nearest
    
    def _get_predicted_value(self, k_nearest):
        '''Takes in the indices of the k nearest points and returns the mean of their target values'''
        return np.mean(self.y_train[k_nearest])
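
The class above loops over every training point for every test point in pure Python, which is easy to read but slow. As a hedged alternative, the same mean-of-k-nearest idea can be written with NumPy broadcasting. The `knn_predict_vectorized` function and the toy arrays below are assumptions for illustration, not a replacement the notebook itself uses.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_test, k=3):
    """Mean-of-k-nearest prediction using broadcasting instead of Python loops."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # diffs[i, j, :] is the feature-wise difference between test i and train j.
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))    # shape (n_test, n_train)
    nearest = np.argsort(distances, axis=1)[:, :k]   # k closest per test row
    return y_train[nearest].mean(axis=1)

# Hypothetical toy data: four training points, two test points.
X_tr = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
y_tr = np.array([100.0, 200.0, 300.0, 1000.0])
X_te = np.array([[0.2, 0.1], [9.0, 8.0]])

vec_preds = knn_predict_vectorized(X_tr, y_tr, X_te, k=2)
print(vec_preds)
```

The trade-off is memory: the intermediate `diffs` array has shape `(n_test, n_train, n_features)`, which is fine for a few hundred listings but may be too large for big datasets.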

In [16]:
my_knn = KNN()
my_knn.fit(train.drop(["Price", "Unit", "Address"], axis=1).values, train["Price"].values)

## Let's Make Predictions!

In [17]:
# This may take a while: every test point is compared to every training point in a Python loop
preds = my_knn.predict(test.drop(["Price", "Unit", "Address"], axis=1).values, k=10)

result_df = pd.DataFrame()
result_df["Predicted Price"] = preds
result_df["Actual Price"] = test.reset_index()["Price"]
result_df

Unnamed: 0,Predicted Price,Actual Price
0,4459.3,4250
1,2793.5,1900
2,11276.8,9500
3,8192.0,10000
4,18474.5,20000
...,...,...
78,3954.8,3844
79,4060.3,2250
80,4066.3,3198
81,4351.5,3250


In [18]:
from sklearn.metrics import mean_squared_error

In [19]:
# Root mean squared error (RMSE); `squared=False` is no longer accepted by recent scikit-learn releases
np.sqrt(mean_squared_error(test.reset_index()["Price"], preds))

2294.1004262374067
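
The number above is a root mean squared error (RMSE): roughly, the typical size of a prediction error in dollars, with large misses weighted more heavily. A minimal NumPy sketch of the same calculation is shown below. The `rmse` helper is a hypothetical name, and the example uses only the first three rows of the results table above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# First three rows of the results table above.
actual_prices = [4250, 1900, 9500]
predicted_prices = [4459.3, 2793.5, 11276.8]

example_rmse = rmse(actual_prices, predicted_prices)
print(example_rmse)
```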

In [20]:
test.reset_index()

Unnamed: 0,index,Price,Street,Avenue,Bedrooms,Bathrooms,SquareFootage,Unit,Address
0,9,4250,14,4,1,1.0,685,P7E,"1 Irving Pl, New York, NY 10003"
1,164,1900,24,2,1,1.0,500,5B,"238 E 24th St APT 5B, New York, NY 10010"
2,139,9500,22,6,2,2.5,1894,1B,"125 W 22nd St APT 1B, New York, NY 10011"
3,46,10000,17,10,2,2.0,1401,1222,"450 W 17th St, New York, NY 10011"
4,94,20000,21,11,2,2.5,2469,6A,"551 W 21st St #6A, New York, NY 10011"
...,...,...,...,...,...,...,...,...,...
78,321,3844,28,11,1,1.0,707,,"282 11th Ave, New York, NY 10001"
79,126,2250,21,1,1,1.0,700,2W,"339 E 21st St #2W, New York, NY 10010"
80,307,3198,28,10,0,1.0,525,,"525 W 28th St, New York, NY 10001"
81,114,3250,21,5,1,1.0,800,510,"15 E 21st St APT 510, New York, NY 10010"


In [21]:
preds = my_knn.predict(test.drop(["Price", "Unit", "Address"], axis=1).values, k=72)

result_df = pd.DataFrame()
result_df["Predicted Price"] = preds
result_df["Actual Price"] = test.reset_index()["Price"]
result_df

Unnamed: 0,Predicted Price,Actual Price
0,4088.638889,4250
1,3651.472222,1900
2,9525.625000,9500
3,7487.625000,10000
4,10627.513889,20000
...,...,...
78,4293.250000,3844
79,4228.222222,2250
80,3651.472222,3198
81,4462.277778,3250


In [22]:
# Root mean squared error (RMSE); `squared=False` is no longer accepted by recent scikit-learn releases
np.sqrt(mean_squared_error(test.reset_index()["Price"], preds))

3240.9388698528037
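
In the two runs above, raising `k` from 10 to 72 increased the RMSE from about 2294 to about 3241. With a large `k`, each prediction is averaged over many dissimilar apartments and drifts toward the overall mean price. One common way to choose `k` is to try several values and compare the error on held-out data. The sketch below does this on small synthetic data. The data and the helper names are assumptions for illustration, not the apartment dataset.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Predict the mean target of the k nearest training points (Euclidean)."""
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic data (an assumption for illustration): the target grows linearly with x.
x = np.arange(20, dtype=float)
y = 100.0 * x
is_train = np.arange(20) % 2 == 0            # even x values train, odd x values test
X_train, y_train = x[is_train].reshape(-1, 1), y[is_train]
X_test, y_test = x[~is_train].reshape(-1, 1), y[~is_train]

# Compare held-out error for a few candidate values of k.
rmse_by_k = {k: rmse(y_test, knn_predict(X_train, y_train, X_test, k)) for k in (2, 4, 10)}
for k, err in rmse_by_k.items():
    print(f"k={k:>2}: RMSE = {err:.1f}")
```

On this toy data the error grows as `k` grows, but that is not a general rule: a very small `k` can be noisy on real data, so the best value usually has to be found empirically.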