# Regression on Diamonds Price Dataset with SVM

The **Diamonds dataset** from Kaggle contains the physical and pricing attributes of nearly 54,000 diamonds. It is commonly used for tasks such as regression analysis, feature engineering, and exploratory data analysis.

We will consider a **reduced version** of the dataset, containing 4000 samples, and without categorical features.

### Key Features:
- **Carat**: The weight of the diamond.
- **Depth**: The total depth percentage (z / mean(x, y)).
- **Table**: Width of the diamond's top as a percentage of its widest point.
- **Price**: Price in US dollars.
- **X, Y, Z**: Dimensions of the diamond in mm (length, width, depth).

This dataset is useful for exploring relationships between physical attributes and pricing, and for building predictive models to estimate diamond prices based on their features.

For more information see: https://www.kaggle.com/datasets/shivam2503/diamonds.

# Overview

In this notebook you will carry out a complete machine learning pipeline for a regression task. First, you will:
- split the data into training, validation, and test;
- standardize the data.

You will then be asked to learn various SVM models, in particular:
- for each of the kernels *linear*, *poly*, *rbf*, and *sigmoid*, you will learn the best model, choosing among some fixed values of the considered hyperparameters. In particular, the choice of hyperparameters must be done with **5-fold cross-validation**, as we have seen in the labs.

Then, from the models trained with the best hyperparameters selected as above, you will:
- choose the best kernel, using a validation approach (not cross-validation), and
- learn the best SVM model overall.

Furthermore, you will then be asked to estimate the generalization error of the best SVM model you report. 

At the end, just for comparison, you will also be asked to learn a standard linear regression model (with squared loss), and estimate its generalization error.

### IMPORTANT
- Note that in each of the above steps you will have to choose the appropriate split of the data (see the first bullet point above);
- The code should run without requiring modifications even if the best choice of parameters changes; for example, you should not pass the best values of the hyperparameters "manually" (i.e., by hard-coding them as input parameters to the models). The only exception is the TO DO titled 'ANSWER THE FOLLOWING'.
- $\texttt{epsilon}$ parameter: For SVM, since the values to be predicted are all in the thousands of dollars, you will need to always set $\texttt{epsilon} = 100$
- Do not change the printing instructions (other than adding the correct variable name for your code), and do not add printing instructions!
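
Regarding the $\texttt{epsilon}$ parameter mentioned above: SVR uses the $\epsilon$-insensitive loss, $\max(0, |y - \hat{y}| - \texttt{epsilon})$, so errors smaller than $\texttt{epsilon}$ cost nothing. The sketch below illustrates this with made-up prices and predictions (not taken from the dataset); the helper name `epsilon_insensitive_loss` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample epsilon-insensitive loss: max(0, |y - y_hat| - epsilon)."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Made-up prices and predictions in US dollars (illustrative only)
y_true = np.array([3000.0, 5200.0, 800.0])
y_pred = np.array([3050.0, 4900.0, 810.0])

# With epsilon = 100, errors below 100 dollars are not penalized
loss_100 = epsilon_insensitive_loss(y_true, y_pred, epsilon=100)
# With the SVR default epsilon = 0.1, nearly every error is penalized
loss_default = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)

print(loss_100)
print(loss_default)
```

With targets in the thousands of dollars, a tube of width 0.1 would treat almost every sample as an error, which is why a larger value is fixed for this notebook.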

## TO DO - INSERT YOUR ID NUMBER (NUMERO DI MATRICOLA) BELOW

In [None]:
# -- put here your ID number (numero di matricola)
numero_di_matricola = 2110403 # COMPLETE

The following code loads all required packages.

In [None]:
# -- import all packages needed
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import svm
from sklearn import model_selection
from sklearn import linear_model
from sklearn.model_selection import KFold
from itertools import product

The code below loads the data and removes samples with missing values. It also prints the shape of the dataset and a brief description of it.

In [None]:
# -- load the data - do not change the path below!
df = pd.read_csv('diamonds.csv', sep = ',')

# -- remove the data samples with missing values (NaN)
df = df.dropna()
# -- let's drop the column containing the id of the data
df = df.drop(columns=['Unnamed: 0'])

In [None]:
print('Dataset shape:', df.shape)
# -- description of dataset
print(df.describe())

In [None]:
print('First 5 samples of the dataset:\n\n', df.head(5))

In the following cell, we convert our (pandas) dataframe into the set X (containing our features) and the set Y (containing our target, i.e., the price).

In [None]:
m = df.shape[0]

# -- let's compute X and Y sets
X = df.drop(columns=['price'])
Y = df['price']

print("Total number of samples:", m)

X = X.values
Y = Y.values

# -- print shapes
print('X shape: ', X.shape)
print('Y shape: ', Y.shape)

# Data preprocessing

## TO DO - SPLIT DATA INTO TRAINING, VALIDATION, AND TESTING, WITH THE FOLLOWING PERCENTAGES: 60%, 20%, 20%

Use the $\texttt{train\_test\_split}$ function from sklearn.model_selection to do it; in every call fix $\texttt{random\_state}$ to your numero_di_matricola. 
At the end, you should store the data in the following variables:
- X_train, Y_train: training data;
- X_val, Y_val: validation data;
- X_train_val, Y_train_val: training and validation data;
- X_test, Y_test: test data.

The code then prints the number of samples in X_train, X_val, X_train_val, and X_test.

**IMPORTANT:**
- first split the data into training+validation and test; the first part of the data in output from $\texttt{train\_test\_split}$ must correspond to the training+validation;
- then split training+validation into training and validation; the first part of the data in output from $\texttt{train\_test\_split}$ must correspond to the training
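
The two-step split described above can be sketched on synthetic data as follows. This is only an illustration: the arrays, the `seed` value, and the variable names are stand-ins (the notebook uses `numero_di_matricola` as the seed), and the test fractions 0.2 and then 0.25 are one way to reach 60%/20%/20%, since 25% of the remaining 80% is 20% of the whole.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (100 samples, 3 features); not the diamonds data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
Y_demo = rng.normal(size=100)
seed = 12345  # hypothetical seed; the notebook uses numero_di_matricola

# First split: 80% training+validation, 20% test
X_tv, X_te, Y_tv, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=seed)

# Second split: 25% of the remaining 80% becomes the validation set
X_tr, X_va, Y_tr, Y_va = train_test_split(
    X_tv, Y_tv, test_size=0.25, random_state=seed)

print(X_tr.shape[0], X_va.shape[0], X_tv.shape[0], X_te.shape[0])
```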


In [None]:
# -- split the data into training + validation and test
# -- TODO
# -- split the training + validation data into training and validation
# -- TODO

print("Training size:", X_train.shape[0])
print("Validation size:", X_val.shape[0])
print("Training and validation size:", X_train_val.shape[0])
print("Test size:", X_test.shape[0])

## TO DO - STANDARDIZE THE DATA

Standardize the data using the $\texttt{preprocessing.StandardScaler}$ from scikit-learn.

If V is the name of the variable storing part of the data, the corresponding standardized version should be stored in V_scaled. For example, the scaled version of X_train should be stored in X_train_scaled.
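
As a minimal sketch of how $\texttt{StandardScaler}$ works, the example below fits the scaler on one synthetic array and reuses the learned mean and standard deviation on a second one. Fitting on one part of the data and only transforming the other is a common convention and an assumption here; which split to fit on in the notebook is for you to decide. All array names are hypothetical.

```python
import numpy as np
from sklearn import preprocessing

# Synthetic stand-in data with non-zero mean and non-unit variance
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(60, 3))
X_other = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Learn per-feature mean and standard deviation from X_fit only
scaler = preprocessing.StandardScaler().fit(X_fit)

# Apply the same transformation to both arrays
X_fit_scaled = scaler.transform(X_fit)
X_other_scaled = scaler.transform(X_other)

print(X_fit_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_fit_scaled.std(axis=0))   # approximately 1 for every feature
```

The second array is close to, but not exactly, zero mean and unit variance, because it is rescaled with statistics computed on the first one.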

In [None]:
# -- TODO

# SVM models: learning the best model for each kernel

The following function, $\texttt{k\_fold\_cross\_validation}$, performs $k$-fold cross-validation (with $k$ = 5 by default). Look carefully at its signature: it takes as input the sets X and Y, the number of folds, and variable-length keyword arguments that define the hyperparameter values with which the SVM model will be trained in the cross-validation phase. If you are not familiar with this notation, look up kwargs in the Python documentation.

In the first lines of the function, the keyword arguments collected in $\texttt{param\_grid}$ are converted into a Python list of dictionaries by means of a Cartesian product. The resulting list, $\texttt{param\_list}$, is the one you need to iterate over to perform $k$-fold cross-validation, using the $\texttt{KFold}$ object from scikit-learn.

At the end, note that you need to return $\texttt{best\_param}$, that is the best set of parameters you found with the cross-validation procedure. 
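
To make the procedure concrete, here is one possible sketch of a grid search with $k$-fold cross-validation on synthetic data. It is not the reference solution: the function name `cv_grid_search_sketch` is hypothetical, and choosing the candidate with the highest mean $R^2$ over the folds is an assumption about the selection criterion.

```python
import numpy as np
from itertools import product
from sklearn import svm
from sklearn.model_selection import KFold

def cv_grid_search_sketch(X, Y, num_folds=5, **param_grid):
    """Hypothetical helper: return the parameter dict with the best mean fold score."""
    keys = list(param_grid.keys())
    param_list = [dict(zip(keys, combo)) for combo in product(*param_grid.values())]

    kf = KFold(n_splits=num_folds)
    best_param, best_score = None, -np.inf
    for params in param_list:
        fold_scores = []
        for train_idx, val_idx in kf.split(X):
            # Train on k-1 folds, evaluate on the held-out fold
            model = svm.SVR(**params).fit(X[train_idx], Y[train_idx])
            fold_scores.append(model.score(X[val_idx], Y[val_idx]))  # R^2
        mean_score = np.mean(fold_scores)
        if mean_score > best_score:
            best_param, best_score = params, mean_score
    return best_param

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
Y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)

best = cv_grid_search_sketch(X_demo, Y_demo,
                             kernel=['linear'], epsilon=[0.1], C=[0.1, 1, 10])
print(best)
```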

In [None]:
def k_fold_cross_validation(X, Y, num_folds = 5, **param_grid):

    # -- grid of hyperparams into list
    param_keys = list(param_grid.keys())
    param_values = list(param_grid.values())
    
    # Generate Cartesian product of values
    combinations = product(*param_values)
    
    # Create a list of dictionaries from combinations
    param_list = [dict(zip(param_keys, combination)) for combination in combinations]

    # -- TODO

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR LINEAR KERNEL

For the SVM, consider $\texttt{svm.SVR}$ class. We will begin by training the SVM with linear kernel. For the latter, consider the following hyperparameters and their values:

- $C: [0.1, 1, 10, 100, 1000]$

Remember that both the $\texttt{kernel}$ type and the value of $\texttt{epsilon}$ are considered as parameters to pass to the above method. Leave all other input parameters to default. 
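
Because the function above expands its keyword arguments with $\texttt{product(*param\_values)}$, every argument must be an iterable of candidate values. One way to pass fixed settings such as the kernel and $\texttt{epsilon}$, assumed here, is to wrap them in single-element lists. The helper `expand_grid` below is hypothetical and only reproduces the grid-expansion lines of the function.

```python
from itertools import product

def expand_grid(**param_grid):
    """Hypothetical helper mirroring the grid expansion in k_fold_cross_validation."""
    keys = list(param_grid.keys())
    return [dict(zip(keys, combo)) for combo in product(*param_grid.values())]

# Fixed parameters are wrapped in one-element lists
linear_grid = expand_grid(kernel=['linear'], epsilon=[100],
                          C=[0.1, 1, 10, 100, 1000])

print(len(linear_grid))  # 5 candidate settings
print(linear_grid[0])    # {'kernel': 'linear', 'epsilon': 100, 'C': 0.1}
```

Passing a bare string such as `kernel='linear'` would be iterated character by character, and a bare number such as `epsilon=100` would raise a `TypeError`, since it is not iterable.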

Find the best value of the hyperparameters using 5-fold cross validation. Use the function defined above to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nLinear SVM:")
# -- TODO
print("Best value for hyperparameters: ", # -- TODO)

## TO DO - LEARN A MODEL WITH LINEAR KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the **training score** (that is, the $R^2$ coefficient) of the best model, trained with the best parameters found in the cell above.
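
The sketch below shows, on synthetic data, how a dictionary of parameters can be unpacked into $\texttt{svm.SVR}$ with `**` instead of typing the values by hand, and checks that $\texttt{score}$ matches the definition $R^2 = 1 - SS_{res} / SS_{tot}$. The dictionary `best_param` is written out only to keep the example self-contained; in the notebook it would come from the cross-validation function.

```python
import numpy as np
from sklearn import svm

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 2))
Y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

# Hypothetical output of a cross-validation search
best_param = {'kernel': 'linear', 'epsilon': 0.1, 'C': 10}

# Unpack the dictionary into keyword arguments of the constructor
model = svm.SVR(**best_param).fit(X_demo, Y_demo)
r2_from_score = model.score(X_demo, Y_demo)

# R^2 computed from its definition: 1 - SS_res / SS_tot
residuals = Y_demo - model.predict(X_demo)
r2_manual = 1.0 - np.sum(residuals ** 2) / np.sum((Y_demo - Y_demo.mean()) ** 2)

print(r2_from_score, r2_manual)
```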

In [None]:
# -- TODO
print("Training score:", # -- TODO)

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR POLY KERNEL

Now, let's consider $\texttt{svm.SVR}$ with polynomial kernel. Consider the following hyperparameters and their values:
- $C: [0.1, 1, 10, 100, 1000]$
- $degree: [2, 3, 4]$

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use the function defined above to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nPoly SVM")
# -- TODO
print("Best value for hyperparameters: ", # -- TODO)

## TO DO - LEARN A MODEL WITH POLY KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the **training score** (that is, the $R^2$ coefficient) of the best model, trained with the best parameters found in the cell above.

In [None]:
# -- TODO
print("Training score:", # -- TODO)

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR RBF KERNEL

Consider $\texttt{svm.SVR}$ with RBF kernel. Consider the following hyperparameters and their values:
- $C: [0.1, 1, 10, 100, 1000]$
- $gamma: [0.01, 0.03, 0.04, 0.05]$

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use the function defined above to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nRBF SVM")
# -- TODO
print("Best value for hyperparameters: ", # -- TODO)

## TO DO - LEARN A MODEL WITH RBF KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the **training score** (that is, the $R^2$ coefficient) of the best model, trained with the best parameters found in the cell above.

In [None]:
# -- TODO
print("Training score:", # -- TODO)

## TO DO - CHOOSE THE BEST HYPERPARAMETERS FOR SIGMOID KERNEL

Consider $\texttt{svm.SVR}$ with sigmoid kernel. Consider the following hyperparameters and their values:
- $C: [0.1, 1, 10, 100, 1000]$
- $gamma: [0.01, 0.05, 0.1]$
- $coef0: [0, 1]$

Leave all other input parameters to default. 

Find the best value of the hyperparameters using 5-fold cross validation. Use the function defined above to perform the cross-validation.

Print the best value of the hyperparameters.

In [None]:
print("\nSigmoid SVM")
# -- TODO
print("Best value for hyperparameters: ", # -- TODO)

## TO DO - LEARN A MODEL WITH SIGMOID KERNEL AND BEST CHOICE OF HYPERPARAMETERS

This model will be compared with the best models with other kernels using validation (not cross validation).

DO NOT PASS PARAMETERS BY HARD-CODING THEM IN THE CODE.

Print the **training score** (that is, the $R^2$ coefficient) of the best model, trained with the best parameters found in the cell above.

In [None]:
# -- TODO
print("Training score:", # -- TODO)

## TO DO - USE VALIDATION TO CHOOSE THE BEST MODEL AMONG THE ONES LEARNED FOR THE VARIOUS KERNELS

Use validation to choose the best model among the four (one for each kernel) you have learned above.

Print, following exactly the order described here, with 1 value for each line:
- the validation score of SVM with linear kernel (the template below does not include such print)
- the validation score of SVM with polynomial kernel (the template below does not include such print)
- the validation score of SVM with rbf kernel (the template below does not include such print)
- the validation score of SVM with sigmoid kernel (the template below does not include such print)
- the best kernel (e.g., sigmoid) 
- the validation score of the best kernel 

For the first 4 prints, use the format: "*kernel* validation score: ". For example, for linear kernel "linear validation score: ", for rbf "rbf validation score: "
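
A validation-based comparison can be sketched as follows: each candidate is fitted on one part of the data and scored on a held-out part, and the kernel with the highest held-out $R^2$ is kept. The data is synthetic, the models use default hyperparameters rather than tuned ones, and the names `models`, `val_scores`, and `best_kernel` are hypothetical.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
Y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)
X_tr, X_va, Y_tr, Y_va = train_test_split(X_demo, Y_demo,
                                          test_size=0.25, random_state=0)

# One model per kernel, fitted on the training part only
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
models = {k: svm.SVR(kernel=k).fit(X_tr, Y_tr) for k in kernels}

# Score every model on the held-out validation part
val_scores = {k: m.score(X_va, Y_va) for k, m in models.items()}
for kernel, score in val_scores.items():
    print(f"{kernel} validation score: ", score)

# Keep the kernel with the highest validation score
best_kernel = max(val_scores, key=val_scores.get)
print("Best kernel: ", best_kernel)
```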

In [None]:
print("\nVALIDATION TO CHOOSE SVM KERNEL:")

# -- TODO

print("\n---\nBest kernel: ", # -- TODO)
print("Validation score of best kernel: ", # -- TODO)

## TO DO - LEARN THE FINAL MODEL FOR WHICH YOU WANT TO ESTIMATE THE GENERALIZATION SCORE

Learn the final model (i.e., the one you would use to make predictions about future data).

Print the **final model hyperparameters** and the **score** of the model on the data used to learn it.
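
One way to reuse the hyperparameters of an already selected model without retyping them is $\texttt{sklearn.base.clone}$, which copies the hyperparameters but not the fitted state; $\texttt{get\_params}$ then returns them as a dictionary. The sketch below uses synthetic data and hypothetical names. Which data the final model should be fitted on is left to you; refitting on a larger set here is only for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone

# Synthetic stand-ins for a smaller and a larger training set
rng = np.random.default_rng(0)
X_small = rng.normal(size=(60, 2))
Y_small = 3.0 * X_small[:, 0] - 2.0 * X_small[:, 1]
X_large = rng.normal(size=(80, 2))
Y_large = 3.0 * X_large[:, 0] - 2.0 * X_large[:, 1]

# Hypothetical model selected in an earlier step
selected = svm.SVR(kernel='linear', C=10, epsilon=0.1).fit(X_small, Y_small)

# clone() gives an unfitted estimator with the same hyperparameters
final_model = clone(selected).fit(X_large, Y_large)

print(final_model.get_params())
print(final_model.score(X_large, Y_large))
```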

In [None]:
print("\nBEST MODEL:")

# -- TODO

print("Best model hyperparameters:", # -- TODO)
print("Score of the best model on the data used to learn it: ", # -- TODO)

## TO DO - PRINT THE ESTIMATE OF THE GENERALIZATION SCORE FOR THE FINAL MODEL

Print the estimate of the generalization **score** for the final model. The generalization "score" is the score computed on the data used to estimate the generalization error.

In [None]:
print("\nGENERALIZATION SCORE BEST MODEL:")

# -- TODO

print("Estimate of the generalization score for best SVM model: ", # -- TODO)

## TO DO - ANSWER THE FOLLOWING

Print the **training score** (score on data used to train the model) and the **generalization score** (score on data used to assess generalization) of the final SVM model THAT YOU OBTAIN WHEN YOU RUN THE CODE, one per line, printing the smallest one first. 

NOTE: THE VALUES HERE SHOULD BE HARDCODED.

Print your answer (YES/NO) to the following question: does the relation (i.e., smaller, larger) between the training score and the generalization score agree with the theory?

Print your motivation for the YES/NO answer above, using at most 500 characters.

In [None]:
print("\nANSWER")

# -- TODO

# -- note that you may have to invert the order of the following 2 lines: print the smallest one first
print("Generalization score: ", # -- TODO)
print("Training score: ", # -- TODO)

# -- the following is a string with your answer
motivation = "TODO"

print(motivation)

## TO DO - LEARN A STANDARD LINEAR MODEL

Learn a standard linear model using scikit-learn.

Print the **score** of the model on the data used to learn it.

Print the **generalization score** of the model.
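
As a minimal sketch on synthetic data, $\texttt{linear\_model.LinearRegression}$ fits an ordinary least-squares model (squared loss), and $\texttt{score}$ returns the $R^2$ coefficient on whatever data it is given. The split and the names `X_fit`, `X_held`, `fit_score`, and `held_score` are hypothetical stand-ins for the notebook's own variables.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic linear problem with a little noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
Y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_fit, X_held, Y_fit, Y_held = train_test_split(X_demo, Y_demo,
                                                test_size=0.2, random_state=0)

# Ordinary least squares fitted on one part of the data
lr = linear_model.LinearRegression().fit(X_fit, Y_fit)

# R^2 on the data used for fitting and on held-out data
fit_score = lr.score(X_fit, Y_fit)
held_score = lr.score(X_held, Y_held)

print("Score on data used to learn the model: ", fit_score)
print("Score on held-out data: ", held_score)
```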

In [None]:
print("\nLR MODEL")
# -- TODO
print("Score of LR model on data used to learn it: ", # -- TODO)
print("Generalization score of LR model: ", # -- TODO)