# Regression on House Pricing Dataset with SVM
We consider a reduced version of a dataset containing house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms, etc.) plus its price, which is what we would like to predict.

# Overview

In the notebook you will first:
- split the data into training, validation, and test
- standardize the data

You will then be asked to learn various SVM models, in particular:
- for each of the kernels ‘linear’, ‘poly’, ‘rbf’, and ‘sigmoid’, you will learn the best model by choosing among several values of some hyperparameters; the hyperparameters must be chosen with 5-fold cross-validation
- choose the best kernel, using a validation approach (not cross-validation)
- learn the best SVM model overall

You will then be asked to estimate the generalization error of the best SVM model you report.

At the end, just for comparison, you will also be asked to learn a standard linear regression model (with squared loss), and estimate its generalization error.

### IMPORTANT
- Note that in each of the above steps you will have to choose the appropriate split of the data (see the first bullet point above).
- The code should run without modifications even if the best choice of some hyperparameters changes; for example, you should not pass the best values of the hyperparameters "manually" (i.e., typing the values as input parameters to the models). The only exception is in the TO DO titled 'ANSWER THE FOLLOWING'.
- For SVM, since the values to be predicted are all in the thousands of dollars, you will need to always set epsilon=1000
- Do not change the printing instructions (other than adding the correct variable name for your code), and do not add printing instructions!
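
The role of `epsilon` can be seen from the epsilon-insensitive loss used by support vector regression: residuals smaller than `epsilon` cost nothing. The following is a minimal sketch with made-up prices; the helper name `epsilon_insensitive_loss` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss max(0, |y_true - y_pred| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Made-up house prices (in dollars) and predictions, for illustration only
y_true = np.array([300000.0, 450000.0, 520000.0])
y_pred = np.array([300500.0, 452000.0, 519900.0])

# With a tiny epsilon almost every residual is penalized
loss_small_eps = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
# With epsilon=1000, errors below 1000 dollars are ignored
loss_eps_1000 = epsilon_insensitive_loss(y_true, y_pred, epsilon=1000)
```

In scikit-learn the same quantity is controlled by the `epsilon` argument of `svm.SVR`, whose default is 0.1.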

## TO DO - INSERT YOUR NUMERO DI MATRICOLA (STUDENT ID NUMBER) BELOW

In [None]:
# put your "numero di matricola" (student ID number) here
numero_di_matricola = 2087643

The following code loads all required packages

In [None]:
#import all packages needed
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import svm
from sklearn import model_selection
from sklearn import linear_model

The code below loads the data and removes samples with missing values. It also prints the number of samples in the dataset.

In [None]:
#load the data - do not change the path below!
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna()

Data = df.values
m = Data.shape[0]
Y = Data[:m,2]
X = Data[:m,3:]

print("Total number of samples:",m)

Total number of samples: 3164


# Data preprocessing

## TO DO - SPLIT DATA INTO TRAINING, VALIDATION, AND TESTING, WITH THE FOLLOWING PERCENTAGES: 60%, 20%, 20%

Use the train_test_split function from sklearn.model_selection to do it; in every call fix random_state to your numero_di_matricola. At the end, you should store the data in the following variables:
- Xtrain, Ytrain: training data
- Xval, Yval: validation data
- Xtrain_val, Ytrain_val: training and validation data
- Xtest, Ytest: test data

The code then prints the number of samples in Xtrain, Xval, Xtrain_val, and Xtest

IMPORTANT:
- first split the data into training+validation and test; the first part of the data in output from train_test_split must correspond to the training+validation
- then split training+validation into training and validation; the first part of the data in output from train_test_split must correspond to the training
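
The two-step procedure above can be sketched on synthetic data as follows. The arrays `X_demo` and `Y_demo` and the value of `seed` are placeholders chosen for illustration; the notebook uses the real data and `numero_di_matricola` instead. Since 20% of the whole dataset is 25% of the remaining 80%, the second call uses `test_size=0.25`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 0  # placeholder; the notebook fixes random_state to numero_di_matricola
X_demo = np.arange(2000, dtype=float).reshape(1000, 2)  # synthetic features
Y_demo = np.arange(1000, dtype=float)                   # synthetic targets

# Step 1: hold out 20% of all samples as the test set
X_tv, X_te, Y_tv, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=seed)

# Step 2: split the remaining 80% into training (60% of total) and validation (20% of total)
X_tr, X_va, Y_tr, Y_va = train_test_split(X_tv, Y_tv, test_size=0.25, random_state=seed)

sizes = (X_tr.shape[0], X_va.shape[0], X_te.shape[0])
```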


In [None]:
# Sample counts for each part of the data
# NOTE: these counts give roughly a 2/3, 1/6, 1/6 split (see the sizes printed below)
m_train = int(2./3 * m)
m_val = int((m - m_train)/2.)
m_test = m - m_train - m_val

# First split: training+validation vs. test (test_size = fraction of all samples placed in the test set)
Xtrain_val, Xtest, Ytrain_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)
# Second split: training vs. validation (test_size = fraction of training+validation samples placed in the validation set)
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_val, Ytrain_val, test_size=m_val/(m_val+m_train), random_state=numero_di_matricola)



print("Training size: ", Xtrain.shape[0])
print("Validation size: ", Xval.shape[0])
print("Training and validation size",Xtrain_val.shape[0])
print("Test size",Xtest.shape[0])

Training size:  2109
Validation size:  527
Training and validation size 2636
Test size 528


## TO DO - STANDARDIZE THE DATA

Standardize the data using preprocessing.StandardScaler from scikit-learn.

If V is the name of the variable storing part of the data, the corresponding standardized version should be stored in V_scaled. For example, the scaled version of Xtrain should be stored in Xtrain_scaled
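
As a minimal sketch of what standardization does, the example below fits a `StandardScaler` on synthetic training data and applies the same transformation to a second array. The data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic "training" data
X_te = rng.normal(loc=50.0, scale=10.0, size=(50, 3))   # synthetic "test" data

# Learn the per-feature mean and standard deviation from the training data only
scaler_demo = preprocessing.StandardScaler().fit(X_tr)

# Apply the same shift and scale to both arrays
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
```

After the transformation each training feature has mean 0 and standard deviation 1; the other array is only approximately standardized, because it is rescaled with the training statistics.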

In [None]:
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain) #scaled training data
Xtrain_val_scaled = scaler.transform(Xtrain_val) #scaled training and validation data
Xval_scaled = scaler.transform(Xval) #scaled validation data
Xtest_scaled = scaler.transform(Xtest) #scaled test data

# SVM models: learning the best model for each kernel

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR LINEAR KERNEL

Consider svm.SVR and linear kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000

Leave all other input parameters to default.

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters (they are in the attribute best_params_ from GridSearchCV)
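
A minimal sketch of this kind of search, run on synthetic data from `make_regression` rather than on the house data, could look as follows. With no `scoring` argument, `GridSearchCV` ranks the candidates by the estimator's own `score` method, which for `svm.SVR` is the coefficient of determination R^2.

```python
from sklearn import svm, model_selection
from sklearn.datasets import make_regression

# Synthetic regression problem, for illustration only
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Candidate values of C; every other SVR parameter keeps its default
param_grid = {'C': [0.1, 1, 10, 100, 1000]}

# 5-fold cross-validation over the grid
search = model_selection.GridSearchCV(svm.SVR(kernel='linear'), param_grid, cv=5)
search.fit(X_demo, y_demo)

best_C = search.best_params_['C']
best_cv_score = search.best_score_  # mean R^2 over the 5 folds for the best C
```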

In [None]:
print("\nLinear SVM")

# Define parameters
param_grid_linear = {'C': [0.1, 1, 10, 100, 1000], 'kernel': ['linear']}

# Create an SVR object
linear_svr = svm.SVR()

# Create a GridSearchCV object with 5-fold cross-validation
grid_search_linear = model_selection.GridSearchCV(linear_svr, param_grid_linear, cv=5)

grid_search_linear.fit(Xtrain_scaled, Ytrain)

best_param_linear = grid_search_linear.best_params_

print("Best value for hyperparameters: ", best_param_linear)



Linear SVM
Best value for hyperparameters:  {'C': 1000, 'kernel': 'linear'}


## TO DO - LEARN A MODEL WITH LINEAR KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.
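
One way to avoid hard-coded values, sketched here on synthetic data, is to unpack the `best_params_` dictionary directly into the estimator. The data and the small grid are placeholders chosen for illustration.

```python
from sklearn import svm, model_selection
from sklearn.datasets import make_regression

# Synthetic regression problem, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

param_grid = {'C': [1, 10, 100], 'kernel': ['linear']}
search = model_selection.GridSearchCV(svm.SVR(), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Unpack the selected hyperparameters instead of typing their values
model = svm.SVR(**search.best_params_)
model.fit(X_demo, y_demo)

training_r2 = model.score(X_demo, y_demo)  # R^2 on the data used for fitting
```

With the default `refit=True`, `search.best_estimator_` is also available: it is the best configuration already refit on all the data passed to `search.fit`.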

In [None]:
best_linear_svm_model = svm.SVR(kernel="linear", C=best_param_linear['C'])
best_linear_svm_model.fit(Xtrain_scaled, Ytrain)
# SVR.score returns R^2, so the value stored here is 1 - R^2 (smaller means a better fit)
best_linear_svm_model_training_score = 1 - best_linear_svm_model.score(Xtrain_scaled, Ytrain)

print("Training score: ", best_linear_svm_model_training_score)

Training score:  0.328223583595309


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR POLY KERNEL

Consider svm.SVR and polynomial kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- degree: 2, 3, 4

Leave all other input parameters to default.

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nPoly SVM")

# Define parameters
param_grid_poly = {'C': [0.1, 1, 10, 100, 1000], 'kernel': ['poly'], 'degree':[2, 3, 4]}

# Create an SVR object
poly_svr = svm.SVR()

# Create a GridSearchCV object with 5-fold cross-validation
grid_search_poly = model_selection.GridSearchCV(poly_svr, param_grid_poly, cv=5)

grid_search_poly.fit(Xtrain_scaled, Ytrain)

best_param_poly = grid_search_poly.best_params_

print("Best value for hyperparameters: ",  best_param_poly)


Poly SVM
Best value for hyperparameters:  {'C': 1000, 'degree': 3, 'kernel': 'poly'}


## TO DO - LEARN A MODEL WITH POLY KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [None]:
best_poly_svm_model = svm.SVR(kernel="poly", C=best_param_poly['C'], degree=best_param_poly['degree'])
best_poly_svm_model.fit(Xtrain_scaled, Ytrain)
best_poly_svm_model_training_score = 1 - best_poly_svm_model.score(Xtrain_scaled, Ytrain)

print("Training score: ", best_poly_svm_model_training_score)

Training score:  0.43843809706826264


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR RBF KERNEL

Consider svm.SVR and RBF kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- gamma: 0.01

Leave all other input parameters to default.

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nRBF SVM")

# Define parameters
param_grid_rbf = {'C': [0.1, 1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma':[0.01]}

# Create an SVR object
rbf_svr = svm.SVR()

# Create a GridSearchCV object with 5-fold cross-validation
grid_search_rbf = model_selection.GridSearchCV(rbf_svr, param_grid_rbf, cv=5)

grid_search_rbf.fit(Xtrain_scaled, Ytrain)

best_param_rbf = grid_search_rbf.best_params_


print("Best value for hyperparameters: ", best_param_rbf)


RBF SVM
Best value for hyperparameters:  {'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}


## TO DO - LEARN A MODEL WITH RBF KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [None]:
best_rbf_svm_model = svm.SVR(kernel="rbf", C=best_param_rbf['C'], gamma=best_param_rbf['gamma'])
best_rbf_svm_model.fit(Xtrain_scaled, Ytrain)
best_rbf_svm_model_training_score = 1 - best_rbf_svm_model.score(Xtrain_scaled, Ytrain)

print("Training score: ", best_rbf_svm_model_training_score)

Training score:  0.8625983785245206


## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR SIGMOID KERNEL

Consider svm.SVR and sigmoid kernel. Consider the following hyperparameters and their values:
- C: 0.1, 1, 10, 100, 1000
- gamma: 0.01
- coef0: 0, 1

Leave all other input parameters to default.

Find the best value of the hyperparameters using 5-fold cross validation. Use model_selection.GridSearchCV to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nSigmoid SVM")
# Define parameters
param_grid_sigmoid = {'C': [0.1, 1, 10, 100, 1000], 'kernel': ['sigmoid'], 'gamma':[0.01], 'coef0':[0,1]}

# Create an SVR object
sigmoid_svr = svm.SVR()

# Create a GridSearchCV object with 5-fold cross-validation
grid_search_sigmoid = model_selection.GridSearchCV(sigmoid_svr, param_grid_sigmoid, cv=5)

grid_search_sigmoid.fit(Xtrain_scaled, Ytrain)

best_param_sigmoid = grid_search_sigmoid.best_params_

print("Best value for hyperparameters: ", best_param_sigmoid)


Sigmoid SVM
Best value for hyperparameters:  {'C': 1000, 'coef0': 0, 'gamma': 0.01, 'kernel': 'sigmoid'}


## TO DO - LEARN A MODEL WITH SIGMOID KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the training score of the best model.

In [None]:
best_sigmoid_svm_model = svm.SVR(kernel="sigmoid", C=best_param_sigmoid['C'], gamma=best_param_sigmoid['gamma'], coef0=best_param_sigmoid['coef0'])
best_sigmoid_svm_model.fit(Xtrain_scaled, Ytrain)
best_sigmoid_svm_model_training_score = 1 - best_sigmoid_svm_model.score(Xtrain_scaled, Ytrain)


print("Training score: ", best_sigmoid_svm_model_training_score)

Training score:  0.8757985981828674


## TO DO - USE VALIDATION TO CHOOSE THE BEST MODEL AMONG THE ONES LEARNED FOR THE VARIOUS KERNELS

Use validation to choose the best model among the four ones (one for each kernel) you have learned above.

Print, following exactly the order described here, with 1 value for each line:
- the validation score of SVM with linear kernel (the template below does not include such print)
- the validation score of SVM with polynomial kernel (the template below does not include such print)
- the validation score of SVM with rbf kernel (the template below does not include such print)
- the validation score of SVM with sigmoid kernel (the template below does not include such print)
- the best kernel (e.g., sigmoid)
- the validation score of the best kernel

For the first 4 prints, use the format: "kernel validation score: ". For example, for linear kernel "Linear validation score: ", for rbf "rbf validation score: "
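
A minimal sketch of kernel selection by validation, on synthetic data and with arbitrary placeholder hyperparameters, is given below. `SVR.score` returns R^2, for which higher is better, so the best kernel is the one that maximizes it; if the quantity compared were 1 - R^2 instead, the best kernel would be the one with the smallest value.

```python
from sklearn import svm
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression problem, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# One candidate per kernel; the hyperparameter values are arbitrary placeholders
candidates = {
    'linear': svm.SVR(kernel='linear', C=10),
    'poly': svm.SVR(kernel='poly', C=10, degree=3),
    'rbf': svm.SVR(kernel='rbf', C=10, gamma=0.01),
    'sigmoid': svm.SVR(kernel='sigmoid', C=10, gamma=0.01, coef0=0),
}

# Fit each candidate on the training part and compute R^2 on the validation part
val_r2 = {name: model.fit(X_tr, y_tr).score(X_va, y_va) for name, model in candidates.items()}

# Higher R^2 means a better fit, so the best kernel maximizes it
best_kernel = max(val_r2, key=val_r2.get)
```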

In [None]:
print("\nVALIDATION TO CHOOSE SVM KERNEL")

val_score_linear = 1 - best_linear_svm_model.score(Xval_scaled,Yval)
val_score_poly = 1 - best_poly_svm_model.score(Xval_scaled,Yval)
val_score_rbf = 1 - best_rbf_svm_model.score(Xval_scaled,Yval)
val_score_sigmoid = 1 - best_sigmoid_svm_model.score(Xval_scaled,Yval)

print("Linear validation score:", val_score_linear)
print("Polynomial validation score:", val_score_poly)
print("rbf validation score:", val_score_rbf)
print("Sigmoid validation score:", val_score_sigmoid)

# Kernel names must match the strings tested when the final model is built
models_parameters_and_score = [("Linear", val_score_linear, best_param_linear)
                                , ("Polynomial", val_score_poly, best_param_poly)
                                , ("RBF", val_score_rbf, best_param_rbf)
                                , ("sigmoid", val_score_sigmoid, best_param_sigmoid)
                              ]
# NOTE: the values compared are 1 - R^2 (SVR.score returns R^2), so max() selects the largest 1 - R^2, i.e. the lowest R^2
best_kernel_tuple = max(models_parameters_and_score, key=lambda x: x[1])

print("Best kernel: ", best_kernel_tuple[0])
print("Validation score of best kernel: ", best_kernel_tuple[1])


VALIDATION TO CHOOSE SVM KERNEL
Linear validation score: 0.39721680602394904
Polynomial validation score: 0.3132530598697222
rbf validation score: 0.8814999765226084
Sigmoid validation score: 0.8794327593708887
Best kernel:  RBF
Validation score of best kernel:  0.8814999765226084


## TO DO - LEARN THE FINAL MODEL FOR WHICH YOU WANT TO ESTIMATE THE GENERALIZATION ERROR

Learn the final model (i.e., the one you would use to make predictions about future data).

Print the score of the model on the data used to learn it.

In [None]:
print("\nTRAINING SCORE BEST MODEL")

if best_kernel_tuple[0] == "Linear":
  best_model = svm.SVR(kernel="linear", C=best_kernel_tuple[2]['C'])

elif best_kernel_tuple[0] == "Polynomial":
  best_model = svm.SVR(kernel="poly", C=best_kernel_tuple[2]['C'], degree=best_kernel_tuple[2]['degree'])

elif best_kernel_tuple[0] == "RBF":
  best_model = svm.SVR(kernel="rbf", C=best_kernel_tuple[2]['C'], gamma=best_kernel_tuple[2]['gamma'])

else:
  best_model = svm.SVR(kernel="sigmoid", C=best_kernel_tuple[2]['C'], gamma=best_kernel_tuple[2]['gamma'], coef0=best_kernel_tuple[2]['coef0'])

best_model.fit(Xtrain_val_scaled, Ytrain_val)
best_model_training_score = 1 - best_model.score(Xtrain_val_scaled, Ytrain_val)

print("Score of the best model on the data used to learn it: ", best_model_training_score)


TRAINING SCORE BEST MODEL
Score of the best model on the data used to learn it:  0.8378934821404114


## TO DO - PRINT THE ESTIMATE  OF THE GENERALIZATION ERROR FOR THE FINAL MODEL

Print the estimate of the generalization "score" for the final model. The generalization "score" is the score computed on the data used to estimate the generalization error.
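
For reference, the value returned by `score` for a regressor is R^2 = 1 - SS_res / SS_tot. The sketch below, on synthetic data with placeholder hyperparameters, computes it by hand on a held-out test set and compares it with the value from `score`.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression problem, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_fit, X_te, y_fit, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Placeholder model standing in for the final SVM
model = svm.SVR(kernel='linear', C=10).fit(X_fit, y_fit)

# R^2 computed by hand on data never used for fitting
y_hat = model.predict(X_te)
ss_res = np.sum((y_te - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # total sum of squares
manual_r2 = 1.0 - ss_res / ss_tot

# The same quantity as returned by scikit-learn
test_r2 = model.score(X_te, y_te)
```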

In [None]:
print("\nGENERALIZATION SCORE BEST MODEL")
best_model_generalization_score = 1 - best_model.score(Xtest_scaled, Ytest)

print("Estimate of the generalization score for best SVM model: ", best_model_generalization_score)


GENERALIZATION SCORE BEST MODEL
Estimate of the generalization score for best SVM model:  0.7461346917155992


## TO DO - ANSWER THE FOLLOWING

Print the training score (score on data used to train the model) and the generalization score (score on data used to estimate the generalization error) of the final SVM model THAT YOU OBTAIN WHEN YOU RUN THE CODE, one per line, printing the smallest one first. NOTE: THE VALUES HERE SHOULD BE HARDCODED

Print your answer (yes/no) to the following question: does the relation (i.e., smaller, larger) between the training score and the generalization score agree with the theory?

Print your motivation for the yes/no answer above, using at most 500 characters.

In [None]:
print("\nANSWER")

print("Generalization score: 0.7461346917155992")
print("Training score: 0.8378934821404114")

motivation = "The relation between the training score and the generalization score agrees with the theory, since we generally expect the model to perform better on data that it has already seen and on which it has optimized its parameters."

print(motivation)


ANSWER
Generalization score: 0.7461346917155992
Training score: 0.8378934821404114
The relation between the training score and the generalization score agrees with the theory, since we generally expect the model to perform better on data that it has already seen and on which it has optimized its parameters.


## TO DO - LEARN A STANDARD LINEAR MODEL
Learn a standard linear model using scikit-learn.

Print the score of the model on the data used to learn it.

Print the generalization "score" of the model.
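
A minimal sketch of this comparison on synthetic data is shown below; the variable names are placeholders. The score on the data used for fitting and the score on held-out data are kept in two separate variables so that each can be reported on its own line.

```python
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression problem, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_fit, X_te, y_fit, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Ordinary least squares (squared loss)
lr = linear_model.LinearRegression().fit(X_fit, y_fit)

lr_fit_r2 = lr.score(X_fit, y_fit)  # R^2 on the data used to learn the model
lr_test_r2 = lr.score(X_te, y_te)   # R^2 on held-out data
```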

In [None]:
print("\nLR MODEL")
# Create a Linear Regression model
linear_regression_model = linear_model.LinearRegression()

# Train the model on the training and validation data
linear_regression_model.fit(Xtrain_val_scaled, Ytrain_val)

# Make predictions on the data used to learn the model
y_pred_train = linear_regression_model.predict(Xtrain_val_scaled)

# Make predictions on the test set
y_pred_test = linear_regression_model.predict(Xtest_scaled)

# Score (R^2) on the data used to learn the model
score_train = linear_regression_model.score(Xtrain_val_scaled, Ytrain_val)

# Score (R^2) on the test data
score_test = linear_regression_model.score(Xtest_scaled, Ytest)


print("Score of LR model on data used to learn it: ", score_train)
print("Generalization score of LR model: ", score_test)