# Notebook 3: Support Vector Regressor Model

**Model 2**: Support Vector Regressor

**Reason for model**: simple yet powerful non-linear model with a geometric interpretation.

**Metric**: RMSE

**Reason for metric**: the focus is on penalising large errors more than small ones, so RMSE is the better choice.
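As a small illustration of that choice, the sketch below compares RMSE with MAE on two made-up error patterns that share the same mean absolute error. The arrays and the helper names `rmse` and `mae` are illustrative only and are not part of the project package.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))


# Made-up example: both prediction sets have the same total absolute error.
y_true = np.zeros(4)
even_errors = np.array([10.0, 10.0, 10.0, 10.0])    # four errors of 10
one_large_error = np.array([0.0, 0.0, 0.0, 40.0])   # a single error of 40

print("even errors     -> MAE:", mae(y_true, even_errors), "RMSE:", rmse(y_true, even_errors))
print("one large error -> MAE:", mae(y_true, one_large_error), "RMSE:", rmse(y_true, one_large_error))
```

MAE is identical for both patterns, while RMSE doubles when the same total error is concentrated in one prediction.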

**Metrics of last best model**: Base model

***RMSE Train:*** 208.41857342335229

***RMSE Val:*** 208.42276109810624

In [1]:
ROOT_PATH_FROM_NOTEBOOK = ".."
DATA_PATH = "data"
PROCESSED_DATA_PATH = "processed"
SAMPLE_DATASET_NAME = "data_sample.parquet"

df_path = f"{ROOT_PATH_FROM_NOTEBOOK}/{DATA_PATH}/{PROCESSED_DATA_PATH}/{SAMPLE_DATASET_NAME}"

In [2]:
import sys
import os
from joblib import dump

import pandas as pd
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

from assignment2_pkg_11919925.metrics.regression import print_regressor_scores_from_gridsearchcv

In [3]:
# Set Pandas option to show all columns in prints
pd.set_option('display.max_columns', None)

In [4]:
# Get the current working directory
current_dir = os.getcwd()

# Add the src directory to sys.path to use custom functions
sys.path.append(os.path.abspath(os.path.join(current_dir, '..', 'src')))

In [5]:
df = pd.read_parquet(df_path)

In [6]:
df.head()

| | flightDayOfWeekSin | flightDayOfWeekCos | flightMonthSin | flightMonthCos | flightHourSin | flightHourCos | flightMinuteSin | flightMinuteCos | timeDeltaDays | travelDurationDay | totalTravelDistance | totalFare | isBasicEconomy | isRefundable | isNonStop | numLegs | business | coach | first | premium coach |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.974928 | -0.222521 | 0.5 | -0.866025 | -0.5 | -0.8660254 | -0.951057 | -0.309017 | 15 | 0.195139 | 1191.0 | 294.6 | -1 | -1 | -1 | 2 | -1 | 1 | -1 | -1 |
| 1 | -0.433884 | -0.900969 | 0.5 | -0.866025 | 0.258819 | 0.9659258 | -0.5 | -0.866025 | 37 | 0.095139 | 762.0 | 262.6 | -1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 |
| 2 | 0.781831 | 0.62349 | 0.866025 | -0.5 | -1.0 | -1.83697e-16 | 0.5 | 0.866025 | 1 | 0.127083 | 1235.0 | 234.59 | -1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 |
| 3 | 0.974928 | -0.222521 | 0.5 | -0.866025 | -0.5 | -0.8660254 | -0.104528 | 0.994522 | 34 | 0.101389 | 762.0 | 118.6 | -1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 |
| 4 | 0.433884 | -0.900969 | 0.5 | -0.866025 | -0.965926 | -0.258819 | -0.669131 | 0.743145 | 17 | 0.333333 | 2618.0 | 446.6 | -1 | -1 | -1 | 2 | -1 | 1 | -1 | -1 |


In [7]:
y = df.pop('totalFare')
X = df

## Support Vector Regressor with Cross Validation

In [8]:
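# Custom scaler from the local src package (imported here, but the pipeline below uses sklearn's StandardScaler)
from models.preprocessing import CustomStandardScaler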

In [9]:
standardscaler_transformer = Pipeline(
    steps=[
        ('standard_scaler', StandardScaler()),
    ]
)

In [10]:
# Build the ColumnTransformer
encoder = ColumnTransformer(
    transformers=[
        ("standard_cols", standardscaler_transformer, ['timeDeltaDays', 'travelDurationDay', 'totalTravelDistance'])
    ]
)

In [11]:
svr_pipe = Pipeline(
    steps=[
        ("scaler", encoder),
        ("regressor", SVR(kernel="rbf"))
    ]
)

svr_search = GridSearchCV(
    estimator=svr_pipe,
    param_grid={
        "regressor__C": list(np.logspace(-3, 3, 7)),
        "regressor__epsilon": list(np.logspace(-3, 3, 7)),
    },
    cv=5,
    scoring="neg_root_mean_squared_error",
    refit=True,
    return_train_score=True,
    verbose=5,
)
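
To make the size of this search concrete, the following sketch expands the same grid with scikit-learn's `ParameterGrid` and counts the candidates and fits for 5-fold cross-validation. The variable names `n_candidates` and `n_fits` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above: 7 log-spaced values per hyperparameter.
param_grid = {
    "regressor__C": list(np.logspace(-3, 3, 7)),
    "regressor__epsilon": list(np.logspace(-3, 3, 7)),
}

c_values = param_grid["regressor__C"]          # powers of ten from 1e-3 to 1e3
n_candidates = len(ParameterGrid(param_grid))  # every (C, epsilon) combination
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)

print("C values:", c_values)
print("candidates:", n_candidates, "fits:", n_fits)
```

The final refit on the full data (`refit=True`) is one additional fit on top of the cross-validation fits counted here.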

In [12]:
svr_search.fit(X, y)

Fitting 5 folds for each of 49 candidates, totalling 245 fits
[CV 1/5] END regressor__C=0.001, regressor__epsilon=0.001;, score=(train=-208.903, test=-208.239) total time=  32.9s
[CV 2/5] END regressor__C=0.001, regressor__epsilon=0.001;, score=(train=-208.352, test=-210.260) total time=  34.8s
[CV 3/5] END regressor__C=0.001, regressor__epsilon=0.001;, score=(train=-208.322, test=-210.021) total time=  33.5s
[CV 4/5] END regressor__C=0.001, regressor__epsilon=0.001;, score=(train=-209.031, test=-207.152) total time=  35.0s
[CV 5/5] END regressor__C=0.001, regressor__epsilon=0.001;, score=(train=-208.854, test=-207.796) total time=  35.6s
[CV 1/5] END regressor__C=0.001, regressor__epsilon=0.01;, score=(train=-208.904, test=-208.240) total time=  37.1s
[CV 2/5] END regressor__C=0.001, regressor__epsilon=0.01;, score=(train=-208.352, test=-210.259) total time=  31.8s
[CV 3/5] END regressor__C=0.001, regressor__epsilon=0.01;, score=(train=-208.322, test=-210.021) total time=  32.0s
[CV 4

In [13]:
print_regressor_scores_from_gridsearchcv(svr_search)

RMSE Train: 167.2884452552443
RMSE Val: 167.7360076114627
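
The source of `print_regressor_scores_from_gridsearchcv` is not shown in this notebook. One plausible way to obtain such numbers, sketched below as an assumption rather than the package's actual implementation, is to read the mean train and validation scores of the best candidate from `cv_results_` and flip their sign, since the scorer is negated RMSE. The helper `rmse_from_search` and the synthetic data are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def rmse_from_search(search):
    """Hypothetical helper: mean train/validation RMSE of the best candidate."""
    best = search.best_index_
    # Scores are negated RMSE, so flip the sign back.
    rmse_train = -search.cv_results_["mean_train_score"][best]
    rmse_val = -search.cv_results_["mean_test_score"][best]
    return float(rmse_train), float(rmse_val)


# Small synthetic problem so the sketch runs quickly; not the flight data.
X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    estimator=Pipeline([("scaler", StandardScaler()), ("regressor", SVR(kernel="rbf"))]),
    param_grid={"regressor__C": [1.0, 100.0], "regressor__epsilon": [0.1, 1.0]},
    cv=3,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,  # required for the mean_train_score key
)
demo_search.fit(X_demo, y_demo)

rmse_train, rmse_val = rmse_from_search(demo_search)
print(f"RMSE Train: {rmse_train}")
print(f"RMSE Val: {rmse_val}")
```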


In [14]:
# Best parameters for SVR
svr_search.best_params_

{'regressor__C': np.float64(1000.0), 'regressor__epsilon': np.float64(100.0)}

In [15]:
dump(svr_search.best_estimator_, '../models/nicholas_svr_pipe_sample_dataset.joblib')

['../models/nicholas_svr_pipe_sample_dataset.joblib']

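**Observations**: The best SVR uses weak regularisation (C = 1000) and beats the base model on RMSE: about 167.74 on validation versus 208.42. The selected epsilon (100) implies a wide epsilon-tube, hence fewer support vectors and "flatter" predictions (https://kernelsvm.tripod.com/). Because C = 1000 is the upper bound of the searched grid, larger values of C may be worth exploring.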
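
The link between epsilon and the number of support vectors can be checked on a toy problem. The sketch below fits an RBF SVR to a noisy sine curve with several tube widths and counts the support vectors each time. The data, the value C = 10 and the epsilon values are arbitrary choices for illustration and are unrelated to the flight dataset.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X_toy = np.linspace(0, 10, 200).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

support_counts = {}
for epsilon in (0.01, 0.1, 0.5):
    model = SVR(kernel="rbf", C=10.0, epsilon=epsilon).fit(X_toy, y_toy)
    # Points strictly inside the epsilon-tube carry no loss and do not become support vectors.
    support_counts[epsilon] = len(model.support_)
    print(f"epsilon={epsilon}: {support_counts[epsilon]} support vectors")
```

A wider tube leaves more points inside it with zero loss, so fewer of them end up as support vectors.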

**Next model**: Random Forest Regressor.