# Support Vector Regressor model


In [2]:
import pandas as pd
import numpy as np

# For importing and reloading the local helper module
from notebooks import utility
import importlib

# For optimization
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Data import
Let's import the previously cleaned data.

In [3]:
X_train = pd.read_csv("../DWMProjectData/formodel/X_train.csv")
y_train = pd.read_csv("../DWMProjectData/formodel/y_train.csv")
X_valid = pd.read_csv("../DWMProjectData/formodel/X_valid.csv")
y_valid = pd.read_csv("../DWMProjectData/formodel/y_valid.csv")
X_test = pd.read_csv("../DWMProjectData/formodel/X_test.csv")
y_test = pd.read_csv("../DWMProjectData/formodel/y_test.csv")
# Flatten every y into a 1-dimensional array (avoids a shape warning when fitting the model)
y_train = np.ravel(y_train)
y_valid = np.ravel(y_valid)
y_test = np.ravel(y_test)

## Scale data
For SVR, scaling the data leads to much better results, although it is not strictly required. The reasons are explained [here](https://www.baeldung.com/cs/svm-feature-scaling) and [here](https://scikit-learn.org/stable/modules/svm.html) (they refer to SVM, but the same reasoning applies to regression).

In [4]:
# Reload first so that edits to utility.py are picked up, then call the function through the module
importlib.reload(utility)
X_train, X_valid, X_test = utility.scale(X_train, X_valid, X_test)
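The body of `scale` lives in `utility.py` and is not shown in this notebook. As a rough sketch of what such a helper could look like, assuming it standardizes features with `StandardScaler` (the name `scale_sketch` and the synthetic data are hypothetical, only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_sketch(X_train, X_valid, X_test):
    """Hypothetical stand-in for utility.scale: standardize using training statistics only."""
    scaler = StandardScaler()
    # Learn mean and standard deviation from the training split only
    X_train_s = scaler.fit_transform(X_train)
    # Reuse the training statistics on the other splits to avoid leakage
    X_valid_s = scaler.transform(X_valid)
    X_test_s = scaler.transform(X_test)
    return X_train_s, X_valid_s, X_test_s


# Tiny synthetic example with two features on very different scales
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(100, 2))
X_va = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(20, 2))
X_te = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(20, 2))

X_tr_s, X_va_s, X_te_s = scale_sketch(X_tr, X_va, X_te)
print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps information from the validation and test sets out of the preprocessing step.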

## Score function

I defined the score functions used to evaluate the regression. To keep the notebook tidy, I wrote the function `print_metrics` in the file `utility.py`. It prints the following values so that models can be compared:
- mean absolute error
- mean squared error
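- $r^2$, where the best score is 1 and values above 0.7 are considered good (it can be negative when the model does worse than always predicting the mean)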
- explained variance score, where the best score is 1
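The four values listed above map directly onto functions in `sklearn.metrics`. The sketch below is not the actual `print_metrics` from `utility.py` (which is not shown here). It is a hypothetical helper, `compute_metrics`, applied to a small hand-made example:

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)


def compute_metrics(y_true, y_pred):
    """Hypothetical helper: return the four comparison metrics as a dict."""
    return {
        "mean absolute error": mean_absolute_error(y_true, y_pred),
        "mean squared error": mean_squared_error(y_true, y_pred),
        "r^2": r2_score(y_true, y_pred),
        "explained variance score": explained_variance_score(y_true, y_pred),
    }


# Hand-made targets and predictions, not the project data
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.5, 1.0, 2.0, 2.5])

metrics = compute_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:<26} {value:.3f}")
```

Returning a dict rather than printing directly makes it easy to collect the scores of several models in one table later.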

In [5]:
# Reload first so that edits to utility.py are picked up, then bind the function
importlib.reload(utility)
print_metrics = utility.print_metrics

## Model building

In [6]:
from sklearn.svm import SVR
model_base = SVR(verbose=True)

model_base.fit(X_train, y_train)
print_metrics(y_test, model_base.predict(X_test))

+--------------------------+--------+
|          Method          | Value  |
+--------------------------+--------+
| mean absolute error      | 0.078  |
+--------------------------+--------+
| mean squared error       | 0.031  |
+--------------------------+--------+
| r^2                      | -0.013 |
+--------------------------+--------+
| explained variance score | -0.012 |
+--------------------------+--------+


An exhaustive `GridSearchCV` is far too slow here, so I opted for `RandomizedSearchCV` for hyperparameter tuning.
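To get a feel for the difference in cost, the number of model fits can be counted without training anything. This is a small sketch using the same grid as the next cell. The `random_state=0` is an assumption added here for reproducibility, and 5 folds is scikit-learn's default for both search classes:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10, 100],
}
n_folds = 5  # default number of cross-validation folds

# Every combination of kernel and C
n_grid = len(ParameterGrid(param_grid))
# A random subset of 5 combinations (random_state is an assumption for reproducibility)
sampled = list(ParameterSampler(param_grid, n_iter=5, random_state=0))

print(f"Grid search:       {n_grid} candidates, {n_grid * n_folds} fits")
print(f"Randomized search: {len(sampled)} candidates, {len(sampled) * n_folds} fits")
```

With this grid, the exhaustive search needs 16 candidates (80 fits), while the randomized search with `n_iter=5` needs 5 candidates (25 fits), at the price of possibly missing the best combination.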

In [8]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10, 100]
}

model_fitted = RandomizedSearchCV(model_base, param_grid, n_jobs=1, n_iter=5, verbose=4)
# model_fitted = GridSearchCV(model_base, param_grid, n_jobs=1, verbose=4)

model_fitted.fit(X_train, y_train)
print(f"Best params are {model_fitted.best_params_}")

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END ..............C=0.1, kernel=linear;, score=0.002 total time=   7.0s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-0.000 total time=   6.1s
[CV 3/5] END ..............C=0.1, kernel=linear;, score=0.002 total time=   6.6s
[CV 4/5] END ..............C=0.1, kernel=linear;, score=0.001 total time=   5.3s
[CV 5/5] END ..............C=0.1, kernel=linear;, score=0.003 total time=   5.1s
[CV 1/5] END ...C=100, kernel=sigmoid;, score=-5282389040.299 total time=  20.5s
[CV 2/5] END ...C=100, kernel=sigmoid;, score=-5324908737.687 total time=  18.7s
[CV 3/5] END ...C=100, kernel=sigmoid;, score=-7377524383.201 total time=  25.5s
[CV 4/5] END ...C=100, kernel=sigmoid;, score=-3804498421.723 total time=  18.4s
[CV 5/5] END ...C=100, kernel=sigmoid;, score=-4126266682.532 total time=  20.0s
[CV 1/5] END .................C=1, kernel=poly;, score=-2.998 total time=  19.6s
[CV 2/5] END .................C=1, kernel=poly;, 

## Model re-building with best parameters + Metrics

In [11]:
model_final = SVR(**model_fitted.best_params_)

# Refit on training + validation data, since the hyperparameters are now fixed
X_train_n = np.concatenate([X_train, X_valid])
y_train_n = np.concatenate([y_train, y_valid])

model_final.fit(X_train_n, y_train_n)
print_metrics(y_test, model_final.predict(X_test))

+--------------------------+--------+
|          Method          | Value  |
+--------------------------+--------+
| mean absolute error      | 0.073  |
+--------------------------+--------+
| mean squared error       | 0.030  |
+--------------------------+--------+
| r^2                      | -0.004 |
+--------------------------+--------+
| explained variance score | -0.004 |
+--------------------------+--------+
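As a reading aid for the $r^2$ values above, which are very close to zero: a constant prediction equal to the mean of the evaluated targets scores exactly 0, and anything worse than that is negative. The sketch below shows this on synthetic targets (not the project data):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)  # synthetic targets, not the project data

# Predicting the mean of the evaluated targets gives r^2 = 0 by definition
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# A constant prediction that is off from the mean scores below 0
shifted_pred = np.full_like(y_true, y_true.mean() + 0.1)
r2_shifted = r2_score(y_true, shifted_pred)

print(f"constant at the mean:  {r2_mean:.3f}")
print(f"constant off the mean: {r2_shifted:.3f}")
```

Under that reading, a score around zero suggests the model explains almost none of the variance in the target, even when the absolute errors look small.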
