Exercise: train an SVM regressor on the California housing dataset.

### Load data

In [1]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = housing["data"]
y = housing["target"]

### Train / test split

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Scale data for SVM

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
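One way to keep the scaling step and the model together is a scikit-learn pipeline, so that the scaler is fit on the training data only and reused unchanged at prediction time. The sketch below is an illustration, not part of the exercise: it uses synthetic data from `make_regression` as a stand-in for the housing data so that it runs without a download, and the names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic stand-in for the housing data (8 features, like the real dataset).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales the training data and trains the regressor in one call;
# predict() applies the already-fitted scaler before predicting.
model = make_pipeline(StandardScaler(), LinearSVR(random_state=42, max_iter=10_000))
model.fit(X_tr, y_tr)
test_predictions = model.predict(X_te)
```

With this layout there is no separate `X_test_scaled` variable to keep in sync, which removes the risk of calling `fit_transform` on the test set by mistake.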

### Train first model

Train without tuning parameters.

In [4]:
from sklearn.svm import SVR, LinearSVR

svr = LinearSVR()
svr.fit(X_train_scaled, y_train)



LinearSVR()

In [5]:
svr.get_params()

{'C': 1.0,
 'dual': True,
 'epsilon': 0.0,
 'fit_intercept': True,
 'intercept_scaling': 1.0,
 'loss': 'epsilon_insensitive',
 'max_iter': 1000,
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}

### Evaluate model

In [6]:
from sklearn.metrics import mean_squared_error

y_train_pred = svr.predict(X_train_scaled)

mse = mean_squared_error(y_train, y_train_pred)
mse

0.7485175595906046

In [7]:
from numpy import sqrt

rmse = sqrt(mse)
rmse

0.8651690930625091

Not that bad for an untuned model. Let's see how it does on the test set.

In [8]:
X_test_scaled = scaler.transform(X_test)
y_pred_test = svr.predict(X_test_scaled)
sqrt(mean_squared_error(y_test, y_pred_test))

0.7707869323285762

Roughly 0.77. Since the target is expressed in units of $100,000, the RMSE corresponds to a typical error of roughly $77,000. Not great. Let's try to achieve half of that error.
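To make the conversion explicit: the scikit-learn California housing target is the median house value in units of $100,000, so an RMSE in target units can be turned into dollars by a single multiplication. A small sketch, where `DOLLARS_PER_UNIT` and the other variable names are illustrative and the RMSE is copied from the output above:

```python
# Test RMSE reported above, in target units.
rmse_test = 0.7707869323285762

# Assumption stated in the lead-in: one target unit is $100,000.
DOLLARS_PER_UNIT = 100_000

rmse_dollars = rmse_test * DOLLARS_PER_UNIT
halved_goal = rmse_test / 2  # the "half of that error" goal, in target units

print(f"Test RMSE: about ${rmse_dollars:,.0f}")
print(f"Goal RMSE: {halved_goal:.3f} target units")
```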

### Tune parameters in new model

In [9]:
from sklearn.model_selection import GridSearchCV
from numpy import linspace

new_svr = SVR()

parameters = {"gamma": linspace(0.001, 0.1, 4), "C": linspace(1, 10, 4)}
gridsearch = GridSearchCV(new_svr, parameters, verbose=2, cv=3)

In [10]:
gridsearch.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] END .................................C=1.0, gamma=0.001; total time=  15.8s
[CV] END .................................C=1.0, gamma=0.001; total time=  17.2s
[CV] END .................................C=1.0, gamma=0.001; total time=   9.9s
[CV] END .................................C=1.0, gamma=0.034; total time=   9.6s
[CV] END .................................C=1.0, gamma=0.034; total time=   9.6s
[CV] END .................................C=1.0, gamma=0.034; total time=   9.6s
[CV] END .................................C=1.0, gamma=0.067; total time=   9.6s
[CV] END .................................C=1.0, gamma=0.067; total time=   9.5s
[CV] END .................................C=1.0, gamma=0.067; total time=   9.6s
[CV] END ...................................C=1.0, gamma=0.1; total time=   9.5s
[CV] END ...................................C=1.0, gamma=0.1; total time=   9.5s
[CV] END ...................................C=1.

GridSearchCV(cv=3, estimator=SVR(),
             param_grid={'C': array([ 1.,  4.,  7., 10.]),
                         'gamma': array([0.001, 0.034, 0.067, 0.1  ])},
             verbose=2)
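The "16 candidates, totalling 48 fits" line in the log follows directly from the grid: four values of `gamma` times four values of `C`, each evaluated on three folds. A short sketch that reproduces the count with `ParameterGrid`, reusing the `parameters` dictionary from above (the names `candidates`, `n_candidates`, and `n_fits` are illustrative):

```python
from numpy import linspace
from sklearn.model_selection import ParameterGrid

parameters = {"gamma": linspace(0.001, 0.1, 4), "C": linspace(1, 10, 4)}

# Every (C, gamma) combination that the grid search will try.
candidates = list(ParameterGrid(parameters))
n_candidates = len(candidates)  # 4 values of C x 4 values of gamma
n_fits = n_candidates * 3       # cv=3 folds per candidate

print(parameters["gamma"])
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one more fit on the whole training set with the best combination once the cross-validation fits are done.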

### Test tuned model

In [11]:
gridsearch.best_estimator_

SVR(C=10.0, gamma=0.1)
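Both selected values, `C=10.0` and `gamma=0.1`, are the largest values in their ranges, which can be a hint that better settings lie outside the grid. A minimal sketch of that check, with the grid and the selected values copied from above and a hypothetical helper named `on_grid_edge`:

```python
from numpy import isclose, linspace

parameters = {"gamma": linspace(0.001, 0.1, 4), "C": linspace(1, 10, 4)}
best_params = {"C": 10.0, "gamma": 0.1}  # copied from the best estimator above


def on_grid_edge(name):
    """Return True if the selected value is the smallest or largest grid value."""
    values = parameters[name]
    best = best_params[name]
    return bool(isclose(best, values.min()) or isclose(best, values.max()))


# Map each tuned parameter to whether its best value sits on the grid boundary.
edge_hits = {name: on_grid_edge(name) for name in best_params}
print(edge_hits)
```

If a parameter sits on the edge, widening its range in that direction and rerunning the search is a reasonable next experiment.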

In [12]:
y_pred = gridsearch.best_estimator_.predict(X_test_scaled)

In [13]:
mse = mean_squared_error(y_test, y_pred)
sqrt(mse)

0.5917040732347578