# Data Loading and Preprocessing

We consider the same dataset used in the labs, containing house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms) plus its price, which is what we would like to predict.

## TO DO: Insert your ID number ("numero di matricola") below

In [1]:
#put here your ``numero di matricola''
numero_di_matricola = 2021445

Load the required packages

In [2]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the data, remove data samples/points with missing values (NaN), and print some statistics.

In [3]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0
mean,4645240000.0,535435.8,3.381163,2.071903,2070.027813,15250.54,1.434893,0.009798,0.244311,3.459229,7.615676,1761.252212,308.775601,1967.489254,94.668774,98077.125158,47.557868,-122.212337,1982.544564,13176.302465
std,2854203000.0,380900.4,0.895472,0.768212,920.251879,42544.57,0.507792,0.098513,0.776298,0.682592,1.166324,815.934864,458.977904,28.095275,424.439427,54.172937,0.140789,0.139577,686.25667,25413.180755
min,1000102.0,75000.0,0.0,0.0,380.0,649.0,1.0,0.0,0.0,1.0,3.0,380.0,0.0,1900.0,0.0,98001.0,47.1775,-122.514,620.0,660.0
25%,2199775000.0,315000.0,3.0,1.5,1430.0,5453.75,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1950.0,0.0,98032.0,47.459575,-122.32425,1480.0,5429.5
50%,4027701000.0,445000.0,3.0,2.0,1910.0,8000.0,1.0,0.0,0.0,3.0,7.0,1545.0,0.0,1969.0,0.0,98059.0,47.5725,-122.226,1830.0,7873.0
75%,7358175000.0,640250.0,4.0,2.5,2500.0,11222.5,2.0,0.0,0.0,4.0,8.0,2150.0,600.0,1990.0,0.0,98117.0,47.68025,-122.124,2360.0,10408.25
max,9839301000.0,5350000.0,8.0,6.0,8010.0,1651359.0,3.5,1.0,4.0,5.0,12.0,6720.0,2620.0,2015.0,2015.0,98199.0,47.7776,-121.315,5790.0,425581.0


Get the feature matrix and the vector of target values. We want to predict the price by using features other than id as input.

In [4]:
Data = df.values
# m = number of input samples
m = Data.shape[0]
print("Amount of data:",m)
Y = Data[:m,2]
X = Data[:m,3:]

feature_names = df.columns[3:]

Amount of data: 3164


We split the $m$ samples of the data into 3 parts: one will be used for training and choosing the parameters, one for choosing among different models, and one for testing. The part for training and choosing the parameters will consist of $m_{train}=2/3 m$ samples, the one for choosing among different models will consist of $m_{val}= (m - m_{train})/2$ samples, while the other part consists of $m_{test}=m - m_{train} - m_{val}$ samples.

In [5]:
# Split data into train (2/3 of samples), validation (1/6 of samples), and test data (the rest)
m_train = int(2./3.*m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val
print("Amount of data for training and deciding parameters:",m_train)
print("Amount of data for validation (choosing among different models):",m_val)
print("Amount of data for test:",m_test)
from sklearn.model_selection import train_test_split

#Xtrain_and_val, Ytrain_and_val is the part of data for training and validation
#Xtest, Ytest is the part of data for testing
Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)

#if you need to consider a specific training and validation split, use
#Xtrain, Ytrain for training and Xval, Yval for validation
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=m_val/(m_train+m_val), random_state=numero_di_matricola)

Amount of data for training and deciding parameters: 2109
Amount of data for validation (choosing among different models): 527
Amount of data for test: 528


Let's scale the data.

In [6]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtrain_and_val_scaled = scaler.transform(Xtrain_and_val)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
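
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization does: the per-feature mean and standard deviation are estimated on the training part only and then reused for every other part. The toy arrays and the helper `standardize` are hypothetical names introduced for illustration; the sketch assumes no feature has zero variance.

```python
import numpy as np

# Hypothetical toy data standing in for Xtrain / Xtest (not the King County data)
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# The statistics are estimated on the training part only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0)

def standardize(X, mu, sigma):
    """Subtract the training mean and divide by the training std, per feature."""
    return (X - mu) / sigma

# The same mu and sigma are applied to every part of the data
X_train_toy_scaled = standardize(X_train_toy, mu, sigma)
X_test_toy_scaled = standardize(X_test_toy, mu, sigma)
```

After this transformation the training features have zero mean and unit standard deviation, while the test features are only approximately standardized, because they are scaled with the training statistics.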

# Neural Networks

Let's learn the best neural network with 1 hidden layer and between 1 and 9 hidden nodes, choosing the best number of hidden nodes with cross-validation.

In [7]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

mlp_cv = MLPRegressor()
param_grid = {'hidden_layer_sizes': [i for i in range(1,10)],
              'activation': ['relu'],
              'solver': ['lbfgs'], 
              'random_state': [numero_di_matricola]
             }
mlp_GS = GridSearchCV(mlp_cv, param_grid=param_grid, 
                   cv=5, verbose=True)
mlp_GS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

GridSearchCV(cv=5, error_score=nan,
             estimator=MLPRegressor(activation='relu', alpha=0.0001,
                                    batch_size='auto', beta_1=0.9, beta_2=0.999,
                                    early_stopping=False, epsilon=1e-08,
                                    hidden_layer_sizes=(100,),
                                    learning_rate='constant',
                                    learning_rate_init=0.001, max_fun=15000,
                                    max_iter=200, momentum=0.9,
                                    n_iter_no_change=10,
                                    nesterovs_momentum=True, power_t=0.5,
                                    random_state=None, shuffle=True,
                                    solver='adam', tol=0.0001,
                                    validation_fraction=0.1, verbose=False,
                                    warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'activation'

Now let's check which parameter is best, and compare the best NN with the linear model (learned on train and validation) on test data.

In [8]:
#let's print the best model according to grid search
print("Best model: ",mlp_GS.best_estimator_)
#let's print the error 1-R^2 for the best model
print("Error (1-R^2) of best model: ",1. - mlp_GS.best_score_)

Best model:  MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=7, learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=200,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=2021445, shuffle=True, solver='lbfgs',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)
Error (1-R^2) of best model:  0.19499652629058806
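
For reference, the error 1-R^2 reported throughout is the residual sum of squares divided by the total sum of squares of the targets. The helper below is a small sketch of that definition; `one_minus_r2` and the toy vector are hypothetical names, not part of the homework code. Note that `best_score_` is the mean R^2 over the cross-validation folds, so the value printed above is an average of per-fold errors.

```python
import numpy as np

def one_minus_r2(y_true, y_pred):
    """Return 1 - R^2 = SS_res / SS_tot for the given targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
    return ss_res / ss_tot

y_toy = np.array([1.0, 2.0, 3.0, 4.0])
print(one_minus_r2(y_toy, y_toy))                     # perfect predictions: prints 0.0
print(one_minus_r2(y_toy, np.full(4, y_toy.mean())))  # always predicting the mean: prints 1.0
```

An error of 1 therefore means the model is no better than always predicting the mean target, and values above 1 are possible for models that are worse than that.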


Let's learn the best NN using all of training and validation, and then compare the error of the best NN on train and validation and on test data.

In [9]:
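#NOTE: the grid search above selected hidden_layer_sizes=7, while 6 hidden nodes are used here;
#the errors printed below therefore refer to the 6-node network
best_mlp = MLPRegressor(hidden_layer_sizes=(6,), activation='relu', solver='lbfgs', random_state = numero_di_matricola)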
best_mlp.fit(Xtrain_and_val_scaled,Ytrain_and_val)

print("Error best model on train and validation: ",1. - best_mlp.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ",1. - best_mlp.score(Xtest_scaled,Ytest))

Error best model on train and validation:  0.15640023941577474
Error best model on test data:  0.15780180798281307


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


# Linear Regression

Now let's learn the linear model on train and validation, and get error (1-R^2) on train and validation and on test data.

In [10]:
from sklearn import linear_model
#LR the linear regression model
LR = linear_model.LinearRegression()

#fit the model on training and validation data
LR.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("1 - coefficient of determination on training data:"+str(1 - LR.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("1 - coefficient of determination on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))

1 - coefficient of determination on training data:0.2739738379137041
1 - coefficient of determination on test data:0.32454538377939723


# k-Nearest Neighbours

You will now explore the k-Nearest Neighbours (kNN) method for regression. To do this, you will need to load *neighbors.KNeighborsRegressor* from the scikit-learn package.

k-Nearest Neighbours for regression works as follows: the predicted value $h(\textbf{x})$ for an instance $\textbf{x}$ is obtained by first finding the $\ell$ instances *in the training set* that are closest to $\textbf{x}$; the predicted value $h(\textbf{x})$ is then the mean of the targets of such $\ell$ instances. $\ell$ is a parameter of the method. The targets of the $\ell$ instances used for prediction can be weighted by the (inverse of) their distance to $\textbf{x}$.
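
The description above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the scikit-learn implementation: `knn_predict` and the toy data are hypothetical names, the distance is Euclidean, and `n_neighbors` plays the role of $\ell$. When the query coincides with one or more of the selected training points, the sketch returns the mean of their targets, since the inverse distance would be infinite.

```python
import numpy as np

def knn_predict(X_train, y_train, x, n_neighbors=5, weights="uniform"):
    """Predict the target of x from its n_neighbors nearest training instances."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from x to every training instance
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:n_neighbors]
    if weights == "uniform":
        return y_train[nearest].mean()
    # weights == "distance": weight each neighbour by the inverse of its distance
    d = dists[nearest]
    if np.any(d == 0):
        # x coincides with a training point: use the exact matches only
        return y_train[nearest][d == 0].mean()
    w = 1.0 / d
    return np.sum(w * y_train[nearest]) / np.sum(w)

X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])

print(knn_predict(X_toy, y_toy, [1.2], n_neighbors=2))                      # 15.0, the mean of 10 and 20
print(knn_predict(X_toy, y_toy, [1.2], n_neighbors=2, weights="distance"))  # about 12.0, pulled towards the closer point
print(knn_predict(X_toy, y_toy, [1.0], n_neighbors=2, weights="distance"))  # 10.0, the target of the matching training point
```

The last call shows why distance weighting gives an almost zero error on the data used for training: every training instance is its own nearest neighbour at distance 0, so its own target dominates the prediction.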

## TO DO: load the package for kNN regression, learn the model with default parameters using the training and validation scaled data, and print the error (1-R^2) on the data used to train the model and on the test data.

In [11]:
#TO DO: import package
from sklearn.neighbors import KNeighborsRegressor

#TO DO: learn model
knn = KNeighborsRegressor().fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error on training and validation data:"+str(1 - knn.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error on test data:"+str(1 - knn.score(Xtest_scaled,Ytest)))

Error on training and validation data:0.1521765394969138
Error on test data:0.334522404655212


## TO DO: repeat the point (including the printing instructions) above using the kNN version where points are weighted by the inverse of their distance 

In [12]:
#TO DO: import package
from sklearn.neighbors import KNeighborsRegressor

#TO DO: learn model 
#distance : weight points by the inverse of their distance. 
knn = KNeighborsRegressor(weights = 'distance').fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error on training and validation data:"+str(1 - knn.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error on test data:"+str(1 - knn.score(Xtest_scaled,Ytest)))

Error on training and validation data:0.0006495792904336328
Error on test data:0.3315370231490401


## TO DO: use cross validation to choose the best number of neighbours (between 2 and 20)

In [13]:
from sklearn.model_selection import GridSearchCV

knn2 = KNeighborsRegressor()
params = {'n_neighbors': np.arange(2, 21)}
knn_gs = GridSearchCV(knn2, params, cv=5).fit(Xtrain_and_val_scaled, Ytrain_and_val)

## TO DO: print the best model according to cross validation above, and print the score of the best model 

In [14]:
#let's print the best model according to grid search
print("Best model: ", knn_gs.best_estimator_)
#let's print the score (mean cross-validated R^2) of the best model
print("Score of best model: ",knn_gs.best_score_)

Best model:  KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                    weights='uniform')
Score of best model:  0.7535968802115536
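
The score printed above is an average over the 5 cross-validation folds. The sketch below shows how such a mean held-out score can be computed by hand: the data are cut into contiguous folds without shuffling, each fold is held out once, and the R^2 values are averaged. The helpers `kfold_mean_score`, `fit_ols` and `r2_ols` and the toy data are hypothetical; an ordinary least-squares model stands in for kNN to keep the example short.

```python
import numpy as np

def kfold_mean_score(X, y, fit, score, n_splits=5):
    """Average held-out score over n_splits contiguous folds (no shuffling)."""
    all_idx = np.arange(len(X))
    scores = []
    for held_out in np.array_split(all_idx, n_splits):
        train = np.setdiff1d(all_idx, held_out)  # every index not in the held-out fold
        model = fit(X[train], y[train])
        scores.append(score(model, X[held_out], y[held_out]))
    return float(np.mean(scores))

# Hypothetical stand-in model: ordinary least squares with an intercept
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2_ols(coef, X, y):
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Toy data with a nearly linear relation between features and target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

cv_score = kfold_mean_score(X_toy, y_toy, fit_ols, r2_ols)
print("Mean held-out R^2:", cv_score)
```

A grid search repeats this computation for every candidate value of the parameter and keeps the candidate with the highest mean score.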


## TO DO: learn the best model on all of the training and validation scaled data, and print the error on training and validation scaled data, and on test scaled data

In [15]:
#TO DO: learn model
knn3 = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                    weights='uniform').fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("Error best model on train and validation: ",1. - knn3.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ",1. - knn3.score(Xtest_scaled,Ytest))

Error best model on train and validation:  0.1568521666199003
Error best model on test data:  0.3073877968909494


## TO DO: compare the error on test data of the best kNN model with the error on test data of linear regression and of NNs. Describe what you observe and give a potential explanation.
## [USE MAX 10 LINES]

Error on test data linear regression = 0.32454538377939723                                                                     
Error on test data NN = 0.15780180798281307                                                                                     
Error on test data kNN = 0.3073877968909494                                                                                     

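kNN and linear regression are easy to implement, but on test data they are clearly worse than the NN: their errors (0.307 and 0.325) are about twice the NN error (0.158). A possible explanation is that the price depends on the features in a nonlinear way, which a single linear model cannot capture, while kNN uses an unweighted Euclidean distance over all the features and overfits (its training error is 0.157, about half its test error). The NN can learn a nonlinear relation and its training and test errors are almost equal (0.156 vs 0.158), so on this dataset it is the best of the three models.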

# Clustering and "Local" Linear Models

You are now going to explore the use of clustering to identify groups of *similar* instances, and then learning models that are specific to each group.

Once you have clustered the data, and then learned a model for each cluster, the prediction for a new instance is obtained by using the model of the cluster that is the closest to the instance, where the distance of a cluster to the instance is defined as the distance of the *center* of the cluster to the instance.

**Note**: in this part you are not explicitly told which part of the data to use; deciding which one is correct is part of the homework!
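
The assignment rule described above (pick the cluster whose center is closest to the instance) can be written in vectorized form. This is a minimal sketch with hypothetical names (`nearest_center`, `centers_toy`, `X_new_toy`) and Euclidean distance; for a fitted scikit-learn `KMeans` object, `predict` applies the same nearest-center rule to `cluster_centers_`.

```python
import numpy as np

def nearest_center(X, centers):
    """Return, for each row of X, the index of the closest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # dists[i, j] = Euclidean distance between sample i and center j
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

centers_toy = np.array([[0.0, 0.0], [10.0, 10.0]])
X_new_toy = np.array([[1.0, -1.0], [9.0, 12.0], [4.0, 4.0]])
print(nearest_center(X_new_toy, centers_toy))  # prints [0 1 0]
```

The returned index selects which cluster-specific model is used to predict each new instance.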

## TO DO: use k-means in sklearn to learn a clustering with 5 clusters.

In [16]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5).fit(Xtrain_and_val_scaled)

## TO DO: for each cluster, learn a linear model using the elements of the cluster. For each model, print the error on the data used to learn it.

In [17]:
from sklearn.linear_model import LinearRegression
X_cluster = [[],[],[],[],[]] #X_cluster[k] contains the samples assigned to cluster k
y_cluster = [[],[],[],[],[]]
n = Xtrain_and_val_scaled.shape[0]
#predict the cluster of every sample once, instead of re-running predict at every iteration
labels = kmeans.predict(Xtrain_and_val_scaled)
for i in range(n):
    index = labels[i]
    X_cluster[index].append(Xtrain_and_val_scaled[i])
    y_cluster[index].append(Ytrain_and_val[i])

In [18]:
lm = []
for i in range(0,5):
    lm.append(LinearRegression().fit(X_cluster[i], y_cluster[i]))
    print("Error on training cluster["+str(i)+"] = "+str(1. - lm[i].score(X_cluster[i], y_cluster[i])))

Error on training cluster[0] = 0.22642647465108223
Error on training cluster[1] = 0.3710165545079471
Error on training cluster[2] = 0.3325090068855213
Error on training cluster[3] = 0.31229142341460125
Error on training cluster[4] = 0.054734280385335454


## TO DO: *compute* the error (1 - R^2) on the data not used to learn the models.
For each instance not used to learn the model, the prediction is done by:
- finding the cluster C whose center is the closest to the instance
- using the model learned for cluster C to make the prediction

In [19]:
X_cluster_test = [[],[],[],[],[]]
y_cluster_test = [[],[],[],[],[]]
for i in range(Xtest_scaled.shape[0]):
    distances = []
    for j in range(0,5):
        distances.append(np.linalg.norm(Xtest_scaled[i]-kmeans.cluster_centers_[j])) #Euclidean distance
    index = distances.index(min(distances))
    X_cluster_test[index].append(Xtest_scaled[i])
    y_cluster_test[index].append(Ytest[i])

## TO DO: *print* the error (1-R^2) on the data not used to learn the models

In [20]:
for i in range(0,5):
    print("Error on test cluster["+str(i)+"] = "+str(1. - lm[i].score(X_cluster_test[i], y_cluster_test[i])))

Error on test cluster[0] = 0.1874460283946796
Error on test cluster[1] = 0.31358063613194154
Error on test cluster[2] = 0.2648153198557609
Error on test cluster[3] = 0.426351816022963
Error on test cluster[4] = 0.10349550974351973
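
The values above are per-cluster errors, each normalized by the variance of the prices inside that cluster. To obtain a single 1-R^2 for the whole "clustering + linear models" predictor, the predictions of all local models can be pooled before computing the error. The sketch below illustrates this on toy data; all names (`pooled_local_error`, `make_toy`, `centers_toy`, ...) are hypothetical, the centers are fixed by hand instead of coming from k-means, and every cluster is assumed to contain at least a few fitting samples.

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def nearest_center(X, centers):
    # index of the closest center (Euclidean distance) for each row of X
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)

def pooled_local_error(X_fit, y_fit, X_new, y_new, centers):
    """1 - R^2 over all of X_new, each sample predicted by its cluster's linear model."""
    fit_labels = nearest_center(X_fit, centers)
    new_labels = nearest_center(X_new, centers)
    y_pred = np.empty(len(X_new))
    for k in range(len(centers)):
        # least-squares fit on the samples assigned to cluster k
        in_k = fit_labels == k
        coef, *_ = np.linalg.lstsq(add_intercept(X_fit[in_k]), y_fit[in_k], rcond=None)
        # predict the new samples that fall in cluster k
        mask = new_labels == k
        y_pred[mask] = add_intercept(X_new[mask]) @ coef
    ss_res = np.sum((y_new - y_pred) ** 2)
    ss_tot = np.sum((y_new - y_new.mean()) ** 2)
    return ss_res / ss_tot

# Toy data: two groups of points whose targets follow different slopes
rng = np.random.default_rng(0)
centers_toy = np.array([[-5.0], [5.0]])

def make_toy(n):
    X = np.concatenate([rng.normal(-5.0, 1.0, size=(n, 1)),
                        rng.normal(5.0, 1.0, size=(n, 1))])
    y = np.where(X[:, 0] < 0, 2.0 * X[:, 0], -3.0 * X[:, 0])
    return X, y + rng.normal(scale=0.1, size=2 * n)

X_fit_toy, y_fit_toy = make_toy(100)
X_new_toy, y_new_toy = make_toy(50)
err = pooled_local_error(X_fit_toy, y_fit_toy, X_new_toy, y_new_toy, centers_toy)
print("Pooled error (1-R^2):", err)
```

A pooled error computed this way is normalized by the variance of all held-out targets, so it can be compared directly with the test error of a single global model.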


## TO DO: compare the error of the model "clustering + linear models" and of the linear model (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

Error on test data linear regression = 0.32454538377939723

Four of the five local models have a lower test error than the single linear regression (0.325): clusters 0, 1, 2 and 4 reach 0.187, 0.314, 0.265 and 0.103. A possible explanation is that the samples within a cluster are very similar to each other, so a linear model fitted on them only has to describe a homogeneous group, provided the cluster contains enough data. Cluster 3 is worse (0.426), possibly because the number of samples within this cluster is too low to build a good model. These are per-cluster errors, so the comparison with the global model is only indicative.

## TO DO: compare the error of the model "clustering + linear models" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

Error on test data kNN = 0.3073877968909494 

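Clusters 0, 2 and 4 have a lower test error (0.187, 0.265 and 0.103) than kNN (0.307), cluster 1 is comparable (0.314) and cluster 3 is worse (0.426). Computing the distances in kNN is computationally expensive and grows with the size of the dataset, but this affects the prediction time rather than the error. A more likely cause of the higher kNN error is that it only averages the prices of the 6 nearest houses, while each local linear model also fits how the price changes with the features inside its cluster.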

# Clustering and "Local" NNs

Repeat the same as above, but using neural networks instead of linear models.

**Note**: we are not telling you which parameters to use for NNs. You have to decide how to select the parameters.

## TO DO: clearly explain how you decided to set the parameters, motivating the choice of your strategy.

I decided to use GridSearchCV to find the best parameters, as we did in HW2. The search was run on the validation data with 5-fold cross-validation over four architectures. The best result obtained with this technique corresponds to two hidden layers with 50 neurons each.

## TO DO: repeat the analysis in part "Clustering and "Local" Linear Models" using NNs instead of linear models.

In [21]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
hl_parameters = {'hidden_layer_sizes': [(10,), (50,), (10,10,), (50,50,)]}

from sklearn.model_selection import KFold
#Increase max_iter because the optimization hasn't converged
cv = KFold(n_splits=5)
mlp_cv = MLPRegressor(max_iter=3000)
grid = GridSearchCV(mlp_cv, hl_parameters, cv=cv)
grid.fit(Xval_scaled.tolist(), Yval.tolist())

print ('RESULTS FOR NN\n')
print("Best parameters set found:")
print(grid.best_params_)
print("Score with best parameters:")
print(grid.best_score_)
print("\nAll scores on the grid:")
for a,b in zip(grid.cv_results_['mean_test_score'], grid.cv_results_['params']):
    print("%0.3f - %r"% (a, b))



RESULTS FOR NN

Best parameters set found:
{'hidden_layer_sizes': (50, 50)}
Score with best parameters:
0.622922940643712

All scores on the grid:
-2.421 - {'hidden_layer_sizes': (10,)}
-2.265 - {'hidden_layer_sizes': (50,)}
-0.050 - {'hidden_layer_sizes': (10, 10)}
0.623 - {'hidden_layer_sizes': (50, 50)}




In [22]:
mlp = []
for i in range(0,5):
    mlp.append(MLPRegressor(hidden_layer_sizes=(50,50), max_iter=2000).fit(X_cluster[i], y_cluster[i]))
    print("NN Error on training cluster["+str(i)+"] = "+str(1. - mlp[i].score(X_cluster[i], y_cluster[i])))
    print("NN Error on test cluster    ["+str(i)+"] = "+str(1. - mlp[i].score(X_cluster_test[i], y_cluster_test[i]))+"\n")



NN Error on training cluster[0] = 0.8508240313174289
NN Error on test cluster    [0] = 0.7415850925021548

NN Error on training cluster[1] = 0.3607398907366741
NN Error on test cluster    [1] = 0.3750387086138194

NN Error on training cluster[2] = 0.3581145318079466
NN Error on test cluster    [2] = 0.31013152852642256

NN Error on training cluster[3] = 0.40368722864321327
NN Error on test cluster    [3] = 0.4492354652015865

NN Error on training cluster[4] = 0.8368352680707952
NN Error on test cluster    [4] = 0.7703618282544413

## TO DO: compare the error of the model "clustering + NNs" and of NNs (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

Error on test data NN = 0.15780180798281307

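On test data every cluster model is worse than the single NN (0.158): the errors range from 0.310 (cluster 2) to 0.770 (cluster 4). A possible cause is that the number of samples within each cluster is not enough to train a network with two hidden layers of 50 neurons. The training errors are also high (up to 0.851 for cluster 0), so the networks are underfitting, most clearly in clusters 0 and 4.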

## TO DO: compare the error of the model "clustering + NNs" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

Test data error kNN = 0.3073877968909494

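As before, kNN is computationally expensive at prediction time, since the distance computation grows with the size of the dataset, but its test error (0.307) is lower than that of every cluster NN: only cluster 2 (0.310) comes close, while the others range from 0.375 to 0.770. In general the cluster NNs perform worse, probably because a neural network requires a lot of training data to produce a good model and each cluster only contains a fraction of the samples.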

## TO DO: compare the error of the model "clustering + NNs" and of "clustering + Linear Models". Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

Linear regression error of test data = 0.32454538377939723

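The local linear models have a lower test error than the local NNs in every cluster. The difference is small in clusters 1, 2 and 3 (0.314 vs 0.375, 0.265 vs 0.310, 0.426 vs 0.449) and very large in clusters 0 and 4 (0.187 vs 0.742, 0.103 vs 0.770). A possible explanation is that the number of samples within a cluster is too low to train a NN, while a linear model has far fewer parameters to estimate. For the NNs of clusters 0, 2 and 4 the test error is lower than the training error; this usually indicates that a model is not overfitting, although here the training errors themselves are high.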