# Boston Housing
This is another popular dataset used in pattern recognition literature. The data set comes from the real estate industry in Boston (US). This is a regression problem. The data has 506 rows and 14 columns. Thus, it’s a fairly small data set where you can attempt any technique without worrying about your laptop’s memory being overused.
<br>
<br>
<b>Problem:</b> Predict the median value of owner occupied homes.

In [1]:
import numpy as np
import pandas as pd
from classifiers import KNearestNeighbor
from classifiers import baseModel
from classifiers import weightedKNN
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
%load_ext autoreload
%autoreload 2

## K Nearest Neighbor for Classification
The Dataset is fairly simple and small with no missing values. So we will not focus too much on Data Munging. Instead we will focus on building a KNN Classifier from scratch and running on this data. Now, since the Boston Housing is a regression problem and we want to build KNN for both Classification and Regression we will take another simple dataset for Classification - <b>the Wisconsin Breast Cancer data</b>.
<br>
<br>
We will first build the KNN classifier for Classification task and run it on Breast Cancer data and check the accuracy. Next we will build the KNN classifier for Regression task and run it on the Boston Housing data. Finanlly I want to build one more flaver of KNN known as <b>Weighted KNN</b> and run it on Breast Cancer Data.
<br>
<br>
The implementation of all these classifiers can be found in the <b>classifier.py</b> file.

In [23]:
Breast_cancer_data = pd.read_csv('breast-cancer-wisconsin.data.txt')

the Breast Cancer data doesn't need much pre-processing. Two operation we will do before we run on our classifier on it are:
<br>
1) We will drop the first column called 'ID'. 
<br>
2)There are few missing values, shown as '?'. We will replace those with a value -99999. This is large number would make these values as outlier and hence they will not play any role in classification


In [24]:
Breast_cancer_data.drop('ID',1,inplace=True)
Breast_cancer_data.replace('?',-99999,inplace=True)
Breast_cancer_data = Breast_cancer_data.apply(pd.to_numeric)
Breast_cancer_data.head()

Unnamed: 0,clump_thickness,uniform_cell_size,uniform_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,normal_nucleoli,mitoses,class
0,5,1,1,1,2,1,3,1,1,2
1,5,4,4,5,7,10,3,2,1,2
2,3,1,1,1,2,2,3,1,1,2
3,6,8,8,1,3,4,3,7,1,2
4,4,1,1,3,2,1,3,1,1,2


Next we will instentiate the KNN classifier and run it on the Breast Cancer data and check the accuracy we get.

In [25]:
knn = KNearestNeighbor(k=3)

KNN object instantiated


In [26]:
knn.fit(Breast_cancer_data)
accuracy = knn.score_class(Breast_cancer_data)
print("accuracy",accuracy)

accuracy 0.9785408


The accuracy of 97% is great. 

## Base Model
Next we will load the Boston housing data. but instead of directly running it on the KNN classifier, we will first run it on a <b>Base Model</b>. This base mode does no learning. Instead it just retrun the mean value of the lable as prediction. It's always a good idea to build such a model and use it a benchmark. If the performance (RMSE in case of Regression) of the our classifier is better then that of the Base Model then we would know that we are moving in right direction.

In [27]:
boston_df = pd.read_csv('boston.csv')
boston_df = boston_df.apply(pd.to_numeric)
boston_train,boston_test = train_test_split(boston_df,test_size =0.2)
boston_train.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
239,0.09252,30.0,4.93,0,0.428,6.606,42.200001,6.1899,6,300,16.6,383.779999,7.37,23.299999
36,0.09744,0.0,5.96,0,0.499,5.841,61.400002,3.3779,5,279,19.200001,377.559998,11.41,20.0
30,1.13081,0.0,8.14,0,0.538,5.713,94.099998,4.233,4,307,21.0,360.170013,22.6,12.7
468,15.5757,0.0,18.1,0,0.58,5.926,71.0,2.9084,24,666,20.200001,368.73999,18.129999,19.1
308,0.49298,0.0,9.9,0,0.544,6.635,82.5,3.3175,4,304,18.4,396.899994,4.54,22.799999


In [28]:
baseMod = baseModel()
baseMod.fit(boston_train)
rmse_train = baseMod.score_reg(boston_train)
rmse_test = baseMod.score_reg(boston_train)
print('rmse_train',rmse_train)
print('rmse_test',rmse_test)

base model object instantiated
rmse_train [9.39836537]
rmse_test [9.39836537]


## K Nearest Neighbor for Regression
<br>
Now that we have the RMSE number for the base model, we will run the Boston housing data on the KNN classifier for Regression and compare the results with the Base Model. But before we do so let us normalize the features so that all of the feature values are between 0 and 1. 

In [29]:
min_val = boston_train['crim'].min()
max_val = boston_train['crim'].max()
boston_train['crim'] = (boston_train['crim']- min_val)/(max_val -min_val)
boston_test['crim'] = (boston_test['crim']- min_val)/(max_val -min_val)

min_val = boston_train['zn'].min()
max_val = boston_train['zn'].max()
boston_train['zn'] = (boston_train['zn']- min_val)/(max_val -min_val)
boston_test['zn'] = (boston_test['zn']- min_val)/(max_val -min_val)

min_val = boston_train['indus'].min()
max_val = boston_train['indus'].max()
boston_train['indus'] = (boston_train['indus']- min_val)/(max_val -min_val)
boston_test['indus'] = (boston_test['indus']- min_val)/(max_val -min_val)

min_val = boston_train['rm'].min()
max_val = boston_train['rm'].max()
boston_train['rm'] = (boston_train['rm']- min_val)/(max_val -min_val)
boston_test['rm'] = (boston_test['rm']- min_val)/(max_val -min_val)

min_val = boston_train['age'].min()
max_val = boston_train['age'].max()
boston_train['age'] = (boston_train['age']- min_val)/(max_val -min_val)
boston_test['age'] = (boston_test['age']- min_val)/(max_val -min_val)

min_val = boston_train['dis'].min()
max_val = boston_train['dis'].max()
boston_train['dis'] = (boston_train['dis']- min_val)/(max_val -min_val)
boston_test['dis'] = (boston_test['dis']- min_val)/(max_val -min_val)

min_val = boston_train['rad'].min()
max_val = boston_train['rad'].max()
boston_train['rad'] = (boston_train['rad']- min_val)/(max_val -min_val)
boston_test['rad'] = (boston_test['rad']- min_val)/(max_val -min_val)

min_val = boston_train['tax'].min()
max_val = boston_train['tax'].max()
boston_train['tax'] = (boston_train['tax']- min_val)/(max_val -min_val)
boston_test['tax'] = (boston_test['tax']- min_val)/(max_val -min_val)

min_val = boston_train['ptratio'].min()
max_val = boston_train['ptratio'].max()
boston_train['ptratio'] = (boston_train['ptratio']- min_val)/(max_val -min_val)
boston_test['ptratio'] = (boston_test['ptratio']- min_val)/(max_val -min_val)

min_val = boston_train['black'].min()
max_val = boston_train['black'].max()
boston_train['black'] = (boston_train['black']- min_val)/(max_val -min_val)
boston_test['black'] = (boston_test['black']- min_val)/(max_val -min_val)

min_val = boston_train['lstat'].min()
max_val = boston_train['lstat'].max()
boston_train['lstat'] = (boston_train['lstat']- min_val)/(max_val -min_val)
boston_test['lstat'] = (boston_test['lstat']- min_val)/(max_val -min_val)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the cavea

In [30]:
boston_train.describe()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0,404.0
mean,0.048611,0.120124,0.387364,0.076733,0.554613,0.524467,0.671651,0.24427,0.368704,0.418175,0.610149,0.895322,0.301689,22.912624
std,0.110542,0.241165,0.252302,0.266497,0.11618,0.137437,0.293247,0.192525,0.381925,0.322887,0.236907,0.231962,0.200523,9.410019
min,0.0,0.0,0.0,0.0,0.385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
25%,0.001022,0.0,0.169538,0.0,0.44875,0.44635,0.408342,0.087975,0.130435,0.171756,0.457447,0.938777,0.14305,17.275
50%,0.003465,0.0,0.317632,0.0,0.538,0.506611,0.770855,0.199602,0.173913,0.272901,0.659574,0.984719,0.254043,21.650001
75%,0.04999,0.185,0.646628,0.0,0.62575,0.588427,0.939238,0.371284,1.0,0.914122,0.808511,0.996835,0.41844,25.525
max,1.0,1.0,1.0,1.0,0.871,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0


We have build two options for measuring the distance between the quary point and neighboring point - <b>Euclidean distance</b> and <b> Manhattan distance</b>. We will use both these options and check RMSE.

In [31]:
knn.fit(boston_train)
rmse_train = knn.score_reg(boston_train)
rmse_test = knn.score_reg(boston_test)
print('Performance using Euclidean distance')
print('rmse_train',rmse_train)
print('rmse_test',rmse_test)

Performance using Euclidean distance
rmse_train [3.09857266]
rmse_test [3.84422591]


In [32]:
rmse_train = knn.score_reg(boston_train,dist_type='manhattan')
rmse_test = knn.score_reg(boston_test,dist_type='manhattan')
print('Performance using manhattan distance')
print('rmse_train',rmse_train)
print('rmse_test',rmse_test)

Performance using manhattan distance
rmse_train [3.02491446]
rmse_test [3.59733477]


The RMSE is less then that of Base Model which means our classifier is definitely. Furthormore, The RMSE is more or less the same for distance measures we have used. So for this data either of the two can be used.

## Weighted K Nearest Neighbor
<br>
The implementation of the Weighted KNN is done by referring to the method defined at below link:
<br>
https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf
<br>
<br>
We have implemented two Kernal method <b> Gaussian </b> and <b>inversion</b>

In [33]:
wKNN = weightedKNN(k=10)

weighted KNN object instantiated


In [34]:
wKNN.fit(Breast_cancer_data)
accuracy = wKNN.score_class(Breast_cancer_data,kernal='gauss')
print("accuracy",accuracy)

accuracy 0.9756795


In [35]:
accuracy = wKNN.score_class(Breast_cancer_data,kernal='inversion')
print("accuracy",accuracy)

accuracy 1.0


Both Kernal give acceptable perfromance on the data. 