<a href="https://colab.research.google.com/github/helloworld53/projects/blob/main/KNN_continuous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **KNN for regression problem**
*continuous variables*

Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Subset of columns used for this example: LotArea, OverallQual, YearBuilt, SalePrice.

All columns are treated as continuous variables, including the ordinal OverallQual rating.

In [None]:
# importing necessary libraries
import numpy as np
import csv
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# reading given data and visualizing it
df = pd.read_csv("mixedDataExample.tsv",sep='\t')
df = df[['LotArea', 'OverallQual', 'YearBuilt', 'SalePrice']]
df.head()


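KNN is a distance-based method. Records are usually compared using the Euclidean distance. Features measured on very different scales contribute unequally to that distance: a feature with a large numeric range (such as LotArea) dominates one with a small range (such as OverallQual). Standardization (subtracting the mean of each feature and dividing by its standard deviation) is therefore required.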
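
To make the effect of scale concrete, here is a minimal sketch on three made-up records (the values are hypothetical and not taken from the dataset). It compares the Euclidean distance to the first record before and after z-score standardization; `euclidean_to_first` is a helper name introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy records; columns are LotArea, OverallQual, YearBuilt.
records = np.array([
    [8500.0, 5.0, 2000.0],  # reference house
    [8600.0, 9.0, 1950.0],  # similar area, very different quality and age
    [9500.0, 5.0, 2001.0],  # larger area, nearly identical quality and age
])

def euclidean_to_first(data):
    """Euclidean distance from every row to row 0."""
    return np.linalg.norm(data - data[0], axis=1)

# On raw values LotArea dominates, so the second record looks closest.
raw = euclidean_to_first(records)

# z-score standardization per column (ddof=1 matches the pandas default for std()).
standardized = (records - records.mean(axis=0)) / records.std(axis=0, ddof=1)
scaled = euclidean_to_first(standardized)

print("raw distances:         ", raw)
print("standardized distances:", scaled)
```

On the raw values the second record is the nearest neighbour of the first, because a 100-unit difference in LotArea is small next to the 1000-unit difference of the third record. After standardization the third record becomes the nearest, since quality and year now carry comparable weight.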

In [None]:
# standardize a column in place using the mean and standard deviation of the
# initial (train) dataset, so the same transformation can be applied to new records
def standardize_data(processed_dataset, initial_dataset, column_name):
  mean = initial_dataset[column_name].mean()
  std = initial_dataset[column_name].std()
  processed_dataset[column_name] = (processed_dataset[column_name] - mean) / std

In [None]:
# standardize data
prepared_data = df.copy()
standardize_data(prepared_data, df, "LotArea")
standardize_data(prepared_data, df, "OverallQual")
standardize_data(prepared_data, df, "YearBuilt")
prepared_data.head()


In [None]:
# initialize K value (number of nearest neighbours)
K = 800

In [None]:
# query example
query =  pd.DataFrame(np.array([[8500, 5, 2000]]),
                   columns=["LotArea","OverallQual", "YearBuilt"])
#standardize query
prepared_query = query.copy()
standardize_data(prepared_query, df, "LotArea")
standardize_data(prepared_query, df, "OverallQual")
standardize_data(prepared_query, df, "YearBuilt")

In [None]:
prepared_data

Unnamed: 0,LotArea,OverallQual,YearBuilt,SalePrice
0,-0.207071,0.651256,1.050634,208500
1,-0.091855,-0.071812,0.156680,181500
2,0.073455,0.651256,0.984415,223500
3,-0.096864,0.651256,-1.862993,140000
4,0.375020,1.374324,0.951306,250000
...,...,...,...,...
1455,-0.260471,-0.071812,0.918196,175000
1456,0.266316,-0.071812,0.222899,210000
1457,-0.147760,0.651256,-1.002149,266500
1458,-0.080133,-0.794879,-0.704164,142125


In [None]:
# calculate distance for each record in data and query
distances_indices = []

for i in range(np.shape(prepared_data)[0]):
    # use  Euclidean distance
    distance = np.linalg.norm(prepared_data.iloc[[i],:-1].to_numpy()-prepared_query.iloc[[0]].to_numpy())
    distances_indices.append((distance, i))

sorted_distances_indices = sorted(distances_indices)
k_nearest_distances_indices = sorted_distances_indices[:K]
# print k nearest distances and indices
print(k_nearest_distances_indices)
# print k most similar records
indices = list(zip(*k_nearest_distances_indices))[1]

df_nearest = df.loc[indices,:]
df_nearest

[(0.13689914249206395, 376), (0.19481153834817005, 838), (0.1986891328279999, 117), (0.20055785815601382, 1079), (0.20323160697296797, 1307), (0.21221561737670938, 1047), (0.2219761902134229, 880), (0.23197367002520608, 613), (0.2340382964440972, 1379), (0.30874306000835094, 36), (0.3713598148799435, 1131), (0.41864437759851125, 606), (0.4882233445657455, 738), (0.49743262204175154, 1407), (0.5109665774574296, 1452), (0.5340572677373959, 213), (0.5370605745409696, 18), (0.5520141646851141, 1176), (0.564179895427815, 908), (0.5669678207362413, 42), (0.5962604508278487, 274), (0.5964540465222999, 587), (0.6084261071186344, 5), (0.608944028261344, 770), (0.6256316116761874, 592), (0.6256316116761874, 1327), (0.6748751165937174, 1129), (0.7106825317300408, 569), (0.7109741294749193, 102), (0.7109741294749193, 188), (0.7109741294749193, 894), (0.7109741294749193, 897), (0.7240534449789837, 379), (0.7240639253549318, 370), (0.7243919011026483, 1108), (0.7261781148289136, 1455), (0.7265253183

Unnamed: 0,LotArea,OverallQual,YearBuilt,SalePrice
376,8846,5,1996,148000
838,9525,5,1995,144000
117,8536,5,2006,155000
1079,8775,5,1994,126000
1307,8072,5,1994,138000
...,...,...,...,...
279,10005,7,1977,192000
782,16285,7,2001,187100
1151,17755,5,1959,149900
1361,16158,7,2005,260000


In [None]:
k_nearest_distances_indices

[(0.13689914249206395, 376),
 (0.19481153834817005, 838),
 (0.1986891328279999, 117),
 (0.20055785815601382, 1079),
 (0.20323160697296797, 1307),
 (0.21221561737670938, 1047),
 (0.2219761902134229, 880),
 (0.23197367002520608, 613),
 (0.2340382964440972, 1379),
 (0.30874306000835094, 36),
 (0.3713598148799435, 1131),
 (0.41864437759851125, 606),
 (0.4882233445657455, 738),
 (0.49743262204175154, 1407),
 (0.5109665774574296, 1452),
 (0.5340572677373959, 213),
 (0.5370605745409696, 18),
 (0.5520141646851141, 1176),
 (0.564179895427815, 908),
 (0.5669678207362413, 42),
 (0.5962604508278487, 274),
 (0.5964540465222999, 587),
 (0.6084261071186344, 5),
 (0.608944028261344, 770),
 (0.6256316116761874, 592),
 (0.6256316116761874, 1327),
 (0.6748751165937174, 1129),
 (0.7106825317300408, 569),
 (0.7109741294749193, 102),
 (0.7109741294749193, 188),
 (0.7109741294749193, 894),
 (0.7109741294749193, 897),
 (0.7240534449789837, 379),
 (0.7240639253549318, 370),
 (0.7243919011026483, 1108),
 (0.726

After the distances are calculated and the K nearest neighbours are selected, the predicted price is the mean SalePrice of those K neighbours.
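
The same rule can be written as one small function. This is a vectorized sketch rather than the notebook's own loop; `knn_predict` and the toy columns `x1`, `x2` are names assumed here for illustration, and the features are taken to be already standardized.

```python
import numpy as np
import pandas as pd

def knn_predict(train_features, train_targets, query_features, k):
    """Return the mean target of the k training rows closest to the query."""
    diffs = train_features.to_numpy(dtype=float) - np.asarray(query_features, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)            # Euclidean distance per row
    nearest = np.argsort(distances, kind="stable")[:k]   # positions of the k smallest
    return train_targets.to_numpy(dtype=float)[nearest].mean()

# Hypothetical toy data: four records with two features and a price.
toy = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 10.0],
    "x2": [0.0, 0.0, 0.0, 0.0],
    "SalePrice": [100.0, 200.0, 300.0, 1000.0],
})

prediction = knn_predict(toy[["x1", "x2"]], toy["SalePrice"], [0.9, 0.0], k=2)
print(prediction)  # 150.0: the two nearest rows have prices 200 and 100
```

Computing all distances in one array operation avoids the per-row `iloc` lookups, which tends to matter once the dataset grows beyond a few thousand rows.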

In [None]:
print("Predicted price for ")
print(query)
print("predicted price: {0:7.2f}".format(df_nearest["SalePrice"].mean()))

Predicted price for 
   LotArea  OverallQual  YearBuilt
0     8500            5       2000
predicted price: 166567.05


**To do**:
Evaluate the accuracy of the method by splitting the data into train and test datasets.
Use the train dataset to calculate the standardization parameters and make SalePrice predictions for the houses in the test dataset.
Calculate the mean squared error between the predicted and actual SalePrice values of the test dataset.
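
One possible way to approach these steps is sketched below. It runs on synthetic data so that it is self-contained; the generated values, the 80/20 split, the random seed, and `K_EVAL = 5` are all assumptions made for illustration and are not part of the original exercise. To apply it to the real data, replace `data` with the housing DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (hypothetical values).
n = 200
data = pd.DataFrame({
    "LotArea": rng.uniform(5000, 15000, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "YearBuilt": rng.integers(1900, 2011, n).astype(float),
})
data["SalePrice"] = (20000 * data["OverallQual"] + 5 * data["LotArea"]
                     + rng.normal(0, 10000, n))

features = ["LotArea", "OverallQual", "YearBuilt"]

# Shuffle, then split 80/20 into train and test sets.
shuffled = data.sample(frac=1.0, random_state=0).reset_index(drop=True)
split = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:split], shuffled.iloc[split:]

# Standardize both sets with statistics computed on the train set only.
mean, std = train[features].mean(), train[features].std()
train_x = ((train[features] - mean) / std).to_numpy()
test_x = ((test[features] - mean) / std).to_numpy()
train_y = train["SalePrice"].to_numpy()
test_y = test["SalePrice"].to_numpy()

K_EVAL = 5  # assumed number of neighbours for this sketch

# Pairwise Euclidean distances, shape (n_test, n_train).
distances = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
nearest = np.argsort(distances, axis=1)[:, :K_EVAL]

# Each prediction is the mean SalePrice of the K_EVAL nearest train records.
predictions = train_y[nearest].mean(axis=1)
mse = np.mean((predictions - test_y) ** 2)
print(f"Test MSE: {mse:.2f}")
```

Taking the mean and standard deviation from the train set alone keeps information about the test records out of the transformation, which is what the second step of the exercise asks for.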