# KNN for regression


Let's apply the KNN regression algorithm that you've created to the challenge of predicting housing prices from home characteristics. Our goal will be to predict the sale price of each home as closely as possible using key characteristics such as the number of bedrooms and bathrooms, the size of the home, its age, and its year of renovation.

The data we'll be using come from home sales in King County, in the U.S. state of Washington, that occurred between May 2014 and May 2015. The data can be found [here](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction).

A full description of the metadata associated with this dataset is included below in Table 1. Note that our target variable is the sale price of the home (`price`).

*Table 1. Metadata for the King County home price dataset. The target variable is the sale price (`price`) and the rest are potential features.*

| Variable | Description |
| -------- | ----------- |
| id | Identifier |
| date | Date of sale |
| price | Sale price |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Size of living area of the home in square feet |
| sqft_lot | Lot size in square feet |
| floors | Number of floors in the home |
| waterfront | Binary variable representing waterfront status (‘1’ if the property is on the waterfront, ‘0’ if not) |
| view | A qualitative assessment of the view from the property (values range from 0 to 4 with 4 being the best) |
| condition | Condition of the house (1 to 5 with 5 being the best) |
| grade | Construction quality (lowest is 1, highest is 13) |
| sqft_above | Square feet above ground |
| sqft_basement | Square feet below ground |
| yr_built | Year built |
| yr_renovated | Year renovated. ‘0’ if never renovated |
| zipcode | ZIP code |
| lat | Latitude |
| long | Longitude |
| sqft_living15 | Average size of interior housing living space for the closest 15 houses, in square feet |
| sqft_lot15 | Average size of land lots for the closest 15 houses, in square feet |

Note that some of the feature variables, such as `waterfront`, `view`, `condition`, and `grade`, are categorical. Fortunately, each of these categorical variables is ordinal, in that the degree of the category is meaningful (e.g. for `condition`, 2 is better than 1 and 5 is better than 4). Some categorical variables have no ordering. For example, if we had house color as a categorical variable, there would be no universal ordering of green vs. blue vs. grey in terms of quality (you may have your preferences, but they may differ from those of others).

Because of this, we'll treat each of the features as numerical data here, but know that there are techniques, such as one-hot encoding, for handling categorical data for prediction.
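
As a brief illustration of one-hot encoding, here is a minimal sketch using `pandas.get_dummies`. The `color` column and its values are hypothetical (no such column exists in this dataset); the point is only that each category becomes its own 0/1 indicator column, so no ordering is implied.

```python
import pandas as pd

# Hypothetical data: `color` is NOT a column in the King County dataset
homes = pd.DataFrame({"color": ["green", "blue", "grey", "blue"]})

# One indicator column per category, so no ordering is implied
encoded = pd.get_dummies(homes, columns=["color"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, which is where the name "one-hot" comes from.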

Let's start by loading the data:

In [33]:
import pandas as pd

data = pd.read_csv("data/kc_house_data.csv")
data

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,...,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,...,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,...,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,...,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


Let's drop the `id` column, since it's simply a unique identifier assigned to each home and we won't be using it for prediction, as well as the `date` column, since these homes were all sold within a year of one another.

In [34]:
data = data.drop(columns=["id", "date"])
data

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,221900.0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,180000.0,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,604000.0,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,510000.0,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,360000.0,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,400000.0,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,402101.0,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,400000.0,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


We have 21,613 observations across 18 feature variables and one target variable. Let's start by dividing our data into separate train and test datasets. We'll use roughly 80% of the data for training (17,000 observations) and the remaining 4,613 for testing.

In [35]:
N_train = 17000
data_randomized = data.sample(frac=1)
# .copy() gives independent DataFrames that are safe to modify later
data_train = data_randomized.iloc[:N_train].copy()
data_test = data_randomized.iloc[N_train:].copy()
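
Note that the shuffle above is not seeded, so each run produces a different split. If you'd like the split to be repeatable, one option is the `random_state` argument of `DataFrame.sample`. The sketch below uses a small toy table rather than the housing data, and the seed value `0` and the names `toy`, `n_train`, `toy_train`, and `toy_test` are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for the housing data (10 rows instead of 21,613)
toy = pd.DataFrame({"price": range(10), "bedrooms": range(10, 20)})

n_train = 8  # hypothetical split size for the toy data
# random_state fixes the shuffle so the split is the same on every run
shuffled = toy.sample(frac=1, random_state=0)
toy_train = shuffled.iloc[:n_train].copy()
toy_test = shuffled.iloc[n_train:].copy()

print(len(toy_train), len(toy_test))
```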

Looking at the data above, one thing is striking: the scale of the feature variables differs dramatically from variable to variable. For example, let's look at the median number of `bathrooms` and the median size of the living space, `sqft_living`:

In [36]:
data_train["bathrooms"].median()

2.25

In [37]:
data_train["sqft_living"].median()

1920.0

There is a difference of roughly three orders of magnitude between those two median values. The KNN algorithm that we've implemented relies on Euclidean distance, so a variable with much larger values will tend to dominate the distance between observations. A solution is to place all of the features on the same scale, so that each one contributes comparably to the distance.
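
To make the domination effect concrete, here is a small sketch with three made-up homes described only by `(bathrooms, sqft_living)`. The values and the helper name `euclidean` are hypothetical and chosen purely for illustration.

```python
import numpy as np

# Three hypothetical homes as (bathrooms, sqft_living)
home_a = np.array([1.0, 2000.0])
home_b = np.array([4.0, 2010.0])  # three more bathrooms, nearly the same size
home_c = np.array([1.0, 2100.0])  # same bathrooms, 100 sqft larger


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((u - v) ** 2))


print(euclidean(home_a, home_b))  # mostly determined by the 10 sqft gap
print(euclidean(home_a, home_c))
```

On the raw scale, `home_b` counts as the closer neighbor of `home_a` even though it has three more bathrooms, because differences in square footage swamp differences in bathroom count.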

So how do we scale our data? The most common approach is to standardize each feature: subtract the feature's mean so that the new mean is zero, then divide by the feature's standard deviation so that the new standard deviation is 1. The name comes from the standard normal distribution, which has the same mean and standard deviation, although standardizing does not by itself make the data normally distributed.
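
Written as a formula, standardizing a feature with values $x_1, \dots, x_N$ looks like the following. The symbols $x_i$, $z_i$, $\mu$, $\sigma$, and $N$ are notation introduced here for illustration, and the standard deviation is assumed to be the population form (dividing by $N$), which is what `np.std` computes by default.

$$
z_i = \frac{x_i - \mu}{\sigma},
\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
$$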

Let's create a class that does this for a single feature. We use a class so that we can first compute the mean and standard deviation from the training data and then transform both the training and test features with those same values. This prevents what's known as data snooping: allowing any information from the test dataset to influence the training data or the model-fitting process.

In [38]:
import numpy as np


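class Standardize:
    """Standardize a single feature using statistics learned in fit()."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, feature):
        # Learn the mean and (population) standard deviation of the feature
        self.mean = np.mean(feature)
        self.std = np.std(feature)

    def transform(self, feature):
        # Apply the stored statistics; fit() must be called first
        return (feature - self.mean) / self.std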

Now let's test our `Standardize` class on some simple data before applying it to our actual dataset:

In [39]:
original_feature = np.array([152.6, 97.2, 352.4, 221.8])
scaler = Standardize()
scaler.fit(original_feature)
standardized_feature = scaler.transform(original_feature)

print(f"Mean = {np.mean(original_feature)}, Std = {np.std(original_feature)}")

print(f"Mean = {np.mean(standardized_feature)}, Std = {np.std(standardized_feature)}")

Mean = 206.0, Std = 95.3566987683613
Mean = -4.163336342344337e-17, Std = 1.0


Great! The mean of the standardized feature is $-4.16\times 10^{-17}$, which is zero up to floating-point rounding error, and the standard deviation is 1, as we wanted.



Now, let's apply this to each feature in our dataset. To keep the implementation simple, we'll loop through the feature columns and, for each one, compute the mean and standard deviation from the training dataset, then apply that standardization to both the training and test versions of the feature.

In [40]:
data_train.loc[:, "price"]

9610     760000.0
11433    245000.0
11933    485500.0
5557     465425.0
14295    783500.0
           ...   
13815    168000.0
11349    490000.0
7234     300000.0
15539    329950.0
7881     575000.0
Name: price, Length: 17000, dtype: float64

In [41]:
scaler = Standardize()

for column in data_train.columns:
    if column != "price":  # Exclude the target variable from standardization
        scaler.fit(data_train.loc[:, column])
        data_train.loc[:, column] = scaler.transform(data_train.loc[:, column])
        data_test.loc[:, column] = scaler.transform(data_test.loc[:, column])
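
The same idea can also be expressed without an explicit loop by letting pandas operate on all feature columns at once. The sketch below uses tiny made-up tables in place of `data_train` and `data_test`; the names `train`, `test`, `features`, and the `sqft` column are hypothetical. It is an alternative formulation, not a replacement for the loop above.

```python
import pandas as pd

# Toy stand-ins for data_train / data_test (hypothetical values)
train = pd.DataFrame(
    {"price": [100.0, 200.0, 300.0], "sqft": [1000.0, 2000.0, 3000.0]}
)
test = pd.DataFrame({"price": [150.0], "sqft": [2500.0]})

features = train.columns.drop("price")  # every column except the target
mean = train[features].mean()
std = train[features].std(ddof=0)  # ddof=0 matches np.std's default

# The training statistics are applied to both sets
train[features] = (train[features] - mean) / std
test[features] = (test[features] - mean) / std
print(test)
```

The key point is unchanged: `mean` and `std` are computed from the training rows only and then reused for the test rows.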

Now let's verify that the data do appear standardized:

In [42]:
data_train

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
9610,760000.0,2.800704,2.103374,1.867632,-0.213162,0.932627,-0.08502,-0.305481,0.901267,0.285534,2.427016,-0.656732,0.202589,-0.208303,0.475613,0.969472,-0.896823,-0.232684,-0.278448
11433,245000.0,-0.393873,-0.471715,-0.651099,-0.265992,-0.915236,-0.08502,-0.305481,0.901267,-0.560861,-0.374337,-0.656732,0.304723,-0.208303,-1.024946,-1.887764,-1.088896,-0.232684,-0.212439
11933,485500.0,0.670986,0.493944,-0.305179,-0.251904,0.932627,-0.08502,-0.305481,-0.631653,0.285534,0.010398,-0.656732,0.951574,-0.208303,-0.049582,0.290310,1.422273,-0.276234,-0.300979
5557,465425.0,0.670986,0.815830,0.375852,0.128939,-0.915236,-0.08502,-0.305481,-0.631653,-0.560861,0.767845,-0.656732,-0.137859,-0.208303,-0.105853,1.553464,0.533049,0.856052,0.508354
14295,783500.0,-0.393873,0.493944,0.829872,-0.190152,0.932627,-0.08502,-0.305481,-0.631653,1.978324,0.238835,1.284395,0.304723,-0.208303,0.831996,0.313382,-0.534020,0.420558,-0.230357
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13815,168000.0,-1.458731,-0.793601,-0.932159,-0.273787,0.008696,-0.08502,-0.305481,0.901267,0.285534,-0.686935,-0.656732,0.168544,-0.208303,2.257528,-1.291516,-0.740320,-1.176256,-0.330613
11349,490000.0,0.670986,0.172058,-0.067358,-0.241103,0.932627,-0.08502,-0.305481,-0.631653,-0.560861,-0.097809,0.042977,-1.125158,-0.208303,0.738211,0.916840,-1.103123,-0.581080,-0.299177
7234,300000.0,0.670986,0.172058,-0.002498,-0.176768,-0.915236,-0.08502,-0.305481,-0.631653,-0.560861,-0.410406,0.765257,-0.103815,-0.208303,-0.368451,-0.898583,0.312521,-0.131069,-0.175524
15539,329950.0,-0.393873,-1.437373,-0.359229,-0.174420,-0.915236,-0.08502,-0.305481,0.901267,-0.560861,-0.771096,0.697543,-0.512352,-0.208303,1.657304,-0.720501,-0.918165,-0.348816,-0.190665


In [43]:
data_test

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
19229,395000.0,-0.393873,-0.793601,-1.083499,-0.288532,0.008696,-0.08502,-0.305481,0.901267,-0.560861,-0.855256,-0.656732,-1.737965,-0.208303,0.700697,0.891606,-0.640727,-0.856893,-0.285478
18425,420000.0,1.735845,1.137716,0.073172,-0.213937,-0.915236,-0.08502,-0.305481,-0.631653,-0.560861,-0.506590,1.103825,1.087753,-0.208303,0.531884,-0.155254,-1.010644,-0.000420,-0.244596
10801,436000.0,-0.393873,0.172058,-0.651099,-0.325067,2.780491,-0.08502,-0.305481,-0.631653,0.285534,-0.374337,-0.656732,1.257977,-0.208303,0.475613,0.947842,-0.911051,-0.987542,-0.415295
10669,245000.0,0.670986,-0.793601,-0.251128,-0.180361,0.008696,-0.08502,-0.305481,0.901267,-0.560861,0.070513,-0.656732,-0.376173,-0.208303,2.257528,-1.356404,-0.676296,-0.377849,-0.190052
542,318888.0,0.670986,-0.471715,0.256942,-0.075805,-0.915236,-0.08502,-0.305481,0.901267,0.285534,0.575478,-0.543876,0.066409,-0.208303,-0.856133,-1.406872,-0.469996,0.188294,-0.107028
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11052,398000.0,-0.393873,-0.793601,-0.834869,-0.287217,0.932627,-0.08502,-0.305481,-0.631653,-0.560861,-0.578728,-0.656732,0.917529,-0.208303,1.244650,0.204514,-0.605158,-0.987542,-0.355163
12354,450000.0,-0.393873,0.493944,-0.445709,-0.296774,0.932627,-0.08502,-0.305481,-0.631653,0.285534,-0.145901,-0.656732,0.985619,-0.208303,-0.949918,-0.202839,1.187518,-0.464948,-0.358768
6192,346000.0,-1.458731,-1.437373,-1.094309,-0.306001,-0.915236,-0.08502,-0.305481,0.901267,-0.560861,-1.095716,-0.227878,-1.840099,-0.208303,1.244650,0.094204,-0.669182,-1.205289,-0.333425
2801,525000.0,-0.393873,1.137716,0.419092,0.498159,0.932627,-0.08502,-0.305481,0.901267,0.285534,0.815937,-0.656732,0.304723,-0.208303,-0.949918,-0.673637,1.543208,0.464107,0.811178


In [44]:
data_train["bathrooms"].mean()

-2.8129133009796907e-16

In [45]:
data_train["bathrooms"].std()

1.0000294130623473
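
The standard deviation reported above is very slightly larger than 1. This is expected: `np.std`, used inside `Standardize`, divides by $N$ by default (`ddof=0`), whereas the pandas `.std()` method divides by $N - 1$ by default (`ddof=1`). A small sketch of the difference, reusing the four test values from earlier:

```python
import numpy as np

values = np.array([152.6, 97.2, 352.4, 221.8])
z = (values - values.mean()) / values.std()  # np.std defaults to ddof=0

population_std = z.std(ddof=0)  # divides by N
sample_std = z.std(ddof=1)  # divides by N - 1, the pandas default

print(population_std, sample_std)
print(np.sqrt(17000 / 16999))  # ratio expected for the 17,000 training rows
```

With 17,000 rows the two definitions differ only by a factor of $\sqrt{17000/16999}$, which accounts for the small excess over 1.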

Excellent! Now we're ready to apply our KNN regression algorithm. Let's start by extracting separate NumPy arrays for our features and target variables so we can use our `Knn_regression` class. As always, we'll check the shape of our data to make sure we know what's going into our algorithm.

In [46]:
# Separate features from targets
x_train = data_train.drop(columns=["price"]).values
y_train = data_train["price"].values
x_test = data_test.drop(columns=["price"]).values
y_test = data_test["price"].values

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(17000, 18)
(17000,)
(4613, 18)
(4613,)


Let's make our predictions for k = 3 nearest neighbors!

In [47]:
from src.knn import Knn_regression

# Initialize the KNN model
myknn = Knn_regression()

# Train the model
myknn.fit(x_train, y_train)

# Make predictions on the test data
k = 3
y_prediction = myknn.predict(x_test, k)
y_prediction

array([514380., 347960., 400290., ..., 419790., 471490., 496758.])
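
As a reminder of what a KNN regression prediction involves, here is a minimal NumPy sketch on toy data. It is not the `Knn_regression` class from `src.knn` (whose code isn't shown in this section); the function name `knn_predict` and the toy arrays are hypothetical, and the sketch assumes plain Euclidean distance with an unweighted average of the neighbors' targets.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k):
    """Average the targets of the k nearest training points (Euclidean)."""
    predictions = np.empty(len(x_query))
    for i, point in enumerate(x_query):
        distances = np.sqrt(np.sum((x_train - point) ** 2, axis=1))
        nearest = np.argsort(distances)[:k]  # indices of the k closest rows
        predictions[i] = y_train[nearest].mean()
    return predictions


# Toy data: one feature, target equals 10 * feature
x_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
toy_prediction = knn_predict(x_toy, y_toy, np.array([[2.1]]), k=3)
print(toy_prediction)
```

For the query point 2.1, the three nearest training points are 2, 3, and 1, so the prediction is the mean of their targets.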

We'll save our predictions and target variables for further analysis in the next section, where we'll investigate the performance of our regression model. We save them as a binary file using the `pickle` package. There are more storage-efficient alternatives, such as Parquet and Feather, but we'll keep it simple and use `pickle` for this small set of arrays.

In [48]:
import pickle

# Saving the objects:
savefile = "data/regression.pkl"
with open(savefile, "wb") as f:
    pickle.dump([y_test, y_train, y_prediction, x_test, x_train], f)
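
Reading the file back works the same way in reverse: `pickle.load` returns the list in the order it was dumped, so the unpacking order has to match. The sketch below round-trips two small stand-in arrays through a temporary file so that it doesn't touch `data/regression.pkl`; the array names and the file name are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-ins for the arrays saved above
y_test_demo = np.array([1.0, 2.0])
y_pred_demo = np.array([1.5, 2.5])

# A temporary path is used so the sketch does not touch data/regression.pkl
path = os.path.join(tempfile.mkdtemp(), "regression_demo.pkl")
with open(path, "wb") as f:
    pickle.dump([y_test_demo, y_pred_demo], f)

# Loading returns the list in the same order it was dumped
with open(path, "rb") as f:
    y_test_loaded, y_pred_loaded = pickle.load(f)

print(np.array_equal(y_test_demo, y_test_loaded))
```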