# HOML Ch.5 Exercise 10

## Exercise 10

### Exercise: Train an SVM regressor on the California housing dataset.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
assert sklearn.__version__ >= "0.20"

Let's begin by loading the California housing dataset and looking at a sample of the data.

In [3]:
# Load the dataset using scikit-learn
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = housing["data"]
y = housing["target"]

In [4]:
# View the dataset and its attributes
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------

The data is already clean, so we can move straight to splitting it, standardizing the features, and fitting a linear SVR model.

In [5]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

In [6]:
# Scale the data
from sklearn.preprocessing import StandardScaler

st_scaler = StandardScaler()
X_train_st_scaler = st_scaler.fit_transform(X_train)
X_test_st_scaler = st_scaler.transform(X_test)
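The scaling cell above fits the scaler on the training set only and then reuses those statistics for the test set. As a rough illustration of what that means, here is a minimal pure-Python sketch; the helper names `fit_standardizer` and `standardize` and the toy data are hypothetical, not part of the notebook or of scikit-learn.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Return per-column (means, stds) computed from the given rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation (ddof=0), the convention StandardScaler uses.
    # A constant column would give 0, so fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical toy data: two features on very different scales.
train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
test = [[2.0, 700.0]]

means, stds = fit_standardizer(train)            # statistics come from train only
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)     # test reuses the train statistics
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.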

In [7]:
# Import and fit linear SVR on training data
from sklearn.svm import LinearSVR

lsvr = LinearSVR(random_state=99)
lsvr.fit(X_train_st_scaler, y_train)



LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=99, tol=0.0001, verbose=0)
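The repr above shows `loss='epsilon_insensitive'` with `epsilon=0.0`. As a hedged sketch of what that loss measures for a single prediction (the function name and the numbers below are made up for illustration, and the regularization term of the full objective is left out):

```python
def epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """Zero loss inside the epsilon tube, linear growth outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical target and prediction, half a unit apart.
y_true, y_pred = 2.0, 2.5

loss_no_tube = epsilon_insensitive(y_true, y_pred, epsilon=0.0)    # plain absolute error
loss_wide_tube = epsilon_insensitive(y_true, y_pred, epsilon=1.0)  # error falls inside the tube
```

With `epsilon=0.0`, as in the fitted `LinearSVR`, every nonzero residual is penalized. A larger `epsilon` ignores residuals smaller than the tube width.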

In [8]:
# Predictions for training set
y_pred_lsvr = lsvr.predict(X_train_st_scaler)

Now that we have predictions on the training set, let's compute the mean squared error (MSE) and the root mean squared error (RMSE) to get a sense of how well the model fits the data. For both metrics, lower is better. The advantage of RMSE is that it is expressed in the same units as the target.

In [9]:
# Find the mean squared and root mean squared errors
from sklearn.metrics import mean_squared_error

mse_lsvr = mean_squared_error(y_train, y_pred_lsvr)
rmse_lsvr = np.sqrt(mse_lsvr)
print('MSE: ' + str(mse_lsvr))
print('RMSE: ' +str(rmse_lsvr))

MSE: 0.9235205896401963
RMSE: 0.9609997864933146


Because RMSE is in the same units as the target, and the target is the median house value expressed in units of $100,000, an RMSE of about 0.96 means our predictions are typically off by roughly $96,000, which isn't great. Let's try another SVR model, this time with an RBF kernel, to see if we can reduce the error.

In [12]:
# Import and fit an SVR with the default 'rbf' kernel
from sklearn.svm import SVR

svr = SVR()
svr.fit(X_train_st_scaler, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)
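The RBF kernel scores the similarity of two points as `exp(-gamma * ||a - b||^2)`. The sketch below computes it by hand on made-up points; `rbf_kernel` here is a hypothetical local helper, not the scikit-learn function of the same name. In the scikit-learn release that printed `gamma='auto_deprecated'`, that setting behaves like `1 / n_features`, which would be 0.125 for the eight features here.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian RBF similarity between two equal-length vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical points in a 2-D standardized feature space.
origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [3.0, 4.0]

sim_near_small_gamma = rbf_kernel(origin, near, gamma=0.125)
sim_near_large_gamma = rbf_kernel(origin, near, gamma=1.0)
sim_far_small_gamma = rbf_kernel(origin, far, gamma=0.125)
```

A larger `gamma` makes similarity decay faster with distance, so each training point influences a smaller neighborhood and the fitted function can become more wiggly.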

In [13]:
# Training set predictions
y_pred_svr = svr.predict(X_train_st_scaler)

In [14]:
# Calculate MSE and RMSE
mse_svr = mean_squared_error(y_train, y_pred_svr)
rmse_svr = np.sqrt(mse_svr)
print('MSE: ' + str(mse_svr))
print('RMSE: ' +str(rmse_svr))

MSE: 0.3438182495435124
RMSE: 0.5863601704955005


The MSE and RMSE for the RBF-kernel SVR are considerably lower than those of the linear SVR, with the training RMSE dropping to about 0.59, or roughly $59,000 at $100,000 per target unit. Let's tune the RBF SVR model to see if we can lower the error further.

In [16]:
# Import GridSearchCV and tune gamma and C
from sklearn.model_selection import GridSearchCV

params = {'C': [4, 6, 8], 'gamma': [0.01, 0.1, 1]}

svr = SVR()
grid_search = GridSearchCV(svr, params, cv=3, verbose=3)
grid_search.fit(X_train_st_scaler, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] C=4, gamma=0.01 .................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ........ C=4, gamma=0.01, score=0.6756234654033157, total=  22.6s
[CV] C=4, gamma=0.01 .................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   31.5s remaining:    0.0s


[CV] ........ C=4, gamma=0.01, score=0.6699661097878262, total=  22.1s
[CV] C=4, gamma=0.01 .................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.0min remaining:    0.0s


[CV] ........ C=4, gamma=0.01, score=0.6803028800440785, total=  23.6s
[CV] C=4, gamma=0.1 ..................................................
[CV] ......... C=4, gamma=0.1, score=0.7397280347318287, total=  24.8s
[CV] C=4, gamma=0.1 ..................................................
[CV] ......... C=4, gamma=0.1, score=0.7389211810033438, total=  25.8s
[CV] C=4, gamma=0.1 ..................................................
[CV] ......... C=4, gamma=0.1, score=0.7395810433063787, total=  25.4s
[CV] C=4, gamma=1 ....................................................
[CV] ............ C=4, gamma=1, score=0.749676258724927, total=  47.2s
[CV] C=4, gamma=1 ....................................................
[CV] ............ C=4, gamma=1, score=0.755795056412663, total=  46.6s
[CV] C=4, gamma=1 ....................................................
[CV] ........... C=4, gamma=1, score=0.7468779028206765, total=  45.1s
[CV] C=6, gamma=0.01 .................................................
[CV] .

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed: 20.3min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [4, 6, 8], 'gamma': [0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [17]:
# Best parameter values from grid search
grid_search.best_params_

{'C': 4, 'gamma': 1}
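Two details of the search above are worth keeping in mind. With `scoring=None`, `GridSearchCV` uses the estimator's default `score` method, which is R² for regressors, so the scores in the log are not RMSE values. Also, the scaler was fitted on the full training set before cross-validation, so each validation fold influenced the scaling statistics. One common alternative, sketched below on synthetic stand-in data (not the housing data), wraps the scaler and the SVR in a `Pipeline` so that scaling is refitted inside each fold. The names `X_demo`, `y_demo`, and the step labels are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data so the sketch runs quickly and without a download.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=99)

# The scaler is refitted on the training part of every CV split.
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"svr__C": [4, 6, 8], "svr__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_rmse = -search.best_score_      # flip the sign to get a positive RMSE
tuned_model = search.best_estimator_    # already refitted on all of X_demo
```

Because `refit=True` is the default, `best_estimator_` is already trained on the full data passed to `fit`, so a separate manual refit with the best parameters is optional.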

In [18]:
# Fit the tuned model
svr_params = SVR(C=4, gamma=1)
svr_params.fit(X_train_st_scaler, y_train)

SVR(C=4, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=1,
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [19]:
# MSE and RMSE values of the tuned model
y_pred_svr_params = svr_params.predict(X_train_st_scaler)
mse_svr_params = mean_squared_error(y_train, y_pred_svr_params)
rmse_svr_params = np.sqrt(mse_svr_params)
print('MSE: ' + str(mse_svr_params))
print('RMSE: ' +str(rmse_svr_params))

MSE: 0.2032781729709123
RMSE: 0.4508638075637834


In [20]:
# Run the tuned model on the test data
y_pred_svr_params = svr_params.predict(X_test_st_scaler)
mse_svr_params = mean_squared_error(y_test, y_pred_svr_params)
rmse_svr_params = np.sqrt(mse_svr_params)
print('MSE: ' + str(mse_svr_params))
print('RMSE: ' +str(rmse_svr_params))

MSE: 0.2862598968262924
RMSE: 0.5350326128623305


Our tuned model did considerably better on the training set than the original RBF model, reducing the training RMSE to about 0.45, or roughly $45,000 at $100,000 per target unit. However, the test RMSE of about 0.54 (roughly $54,000) is higher than the training RMSE, which suggests some overfitting. Note also that the best `gamma` (1) sits at the upper edge of the search grid and the best `C` (4) at its lower edge, so a wider grid might find better values. We could keep tuning to reduce overfitting, but we'll stop here since we've met the requirements of this exercise.
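Since the target is the median house value in units of $100,000, the RMSE values printed in this notebook can be converted to dollars with a single multiplication. The short sketch below does that for the four values reported above; the dictionary labels are just descriptive names chosen here.

```python
UNIT_DOLLARS = 100_000  # the target is expressed in hundreds of thousands of dollars

# RMSE values copied from the notebook outputs above.
rmse_by_model = {
    "LinearSVR (train)": 0.9609997864933146,
    "SVR RBF default (train)": 0.5863601704955005,
    "SVR RBF tuned (train)": 0.4508638075637834,
    "SVR RBF tuned (test)": 0.5350326128623305,
}

rmse_dollars = {name: round(value * UNIT_DOLLARS)
                for name, value in rmse_by_model.items()}

# Gap between test and train RMSE for the tuned model, in target units.
train_test_gap = (rmse_by_model["SVR RBF tuned (test)"]
                  - rmse_by_model["SVR RBF tuned (train)"])

for name, dollars in rmse_dollars.items():
    print(f"{name}: ${dollars:,}")
```

The gap of roughly 0.08 target units, about $8,400, between the tuned model's train and test RMSE is the overfitting signal discussed above.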