# Model engineering 004: Support Vector Regression

In this part of the project, which predicts the photovoltaic production of solar cells on a roof, we consider a support vector regression (SVR) model.

We will treat this as a regression problem and do not take the temporal aspect (i.e. time series forecasting) into account. Data mining and missing value treatment of the weather data from the DarkSky API and of the data from the photovoltaic system were covered in other notebooks:

- Data mining and EDA of photovoltaic data: https://kyso.io/heiko/predicting-solar-panel-output-eda-of-photovoltaic-data

- Data mining and EDA of weather data: https://kyso.io/heiko/predicting-solar-panel-output-eda-of-weather-data

- Missing value treatment: https://kyso.io/heiko/predicting-solar-panel-output-missing-value-treatment-of-weather-data

## Methodology

We will apply support vector regression by following these steps:

1. Load the data into one dataframe

2. Select the features we will use for the prediction. We look at the correlation matrix and remove redundant features that are strongly correlated with each other. Multicollinearity undermines the statistical significance of an independent variable. While it should not have a major impact on the model's accuracy, it increases the variance associated with the prediction and makes the contribution of each independent variable harder to interpret. In other words, any explanation of how the model turns the inputs into the output becomes unreliable. (You can read more about this here: https://towardsdatascience.com/multicollinearity-why-is-it-a-problem-398b010b77ac.) We simply drop one column from each group of correlated columns. We could instead engineer a new feature that combines the correlated ones, but we do not consider that here.

3. Consider missing values. This is still part of the feature selection process. We remove features with many missing values that we could not interpolate.

4. Model selection. We use `RandomizedSearchCV` followed by `GridSearchCV` to tune the hyperparameters of the `SVR` model.
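Step 2 can be sketched as follows. This is a minimal illustration on synthetic data, not the feature selection that was actually run for this project: the helper `drop_correlated`, the threshold of 0.9, and the toy column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one column from every pair whose absolute Pearson correlation exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical): temp_apparent is almost a copy of temp
rng = np.random.default_rng(0)
temp = rng.normal(15, 8, size=200)
toy = pd.DataFrame({
    'temp': temp,
    'temp_apparent': temp + rng.normal(0, 0.5, size=200),
    'uv': rng.uniform(0, 10, size=200),
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Of each correlated pair, the column that appears later in the dataframe is dropped. Which one to keep is a modelling decision that this sketch does not try to make.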

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV

import eli5
sns.set()
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Load dataset

This is the cleaned dataset that was also used for the multiple linear regression model.

In [17]:
data = pd.read_csv('../02.Prepared_data/dataset.Model_eng.001.csv')

# Feature selection

We select and preprocess the features in a pipeline, as we did for the linear model.

In [19]:
# Features: every column except the first
X = data.iloc[:, 1:]
X = X.reset_index(drop=True)
# Target: the first column
y = data.iloc[:, 0]
y = y.reset_index(drop=True)

### Numeric features: StandardScaler

We standardize the numeric features by removing the mean and scaling to unit variance.

In [20]:
numeric_features = ['apparentTemperatureHigh', 'precipProbability', 'uvIndex', 'precipIntensityMax_cm', 'sun_uptime']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
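As a quick check of what the scaler does, the sketch below compares `StandardScaler` with the explicit formula (x - mean) / std on a made-up column. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature column
x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(x)

# Same result by hand; StandardScaler uses the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Inside the pipeline the mean and standard deviation are learned from the training data only and then reused to transform the test data.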

### Categorical features: OneHotEncoder

This preprocessing step encodes each value of the `season` and `precipType` columns as a one-hot vector with one entry per category in that column. For example, in a column with three categories, the first category is encoded as [1,0,0], the second as [0,1,0], and the third as [0,0,1].

In [21]:
categorical_features = ['season', 'precipType']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
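The effect of the encoder can be seen on a small made-up column. The season labels below are assumptions for the example and need not match the categories present in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with four distinct values
toy = pd.DataFrame({'season': ['winter', 'spring', 'summer', 'autumn', 'winter']})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_[0])  # categories are sorted alphabetically
print(encoded.shape)           # one column per distinct category
print(encoded[0])              # encoding of the first row ('winter')
```

The length of each encoded vector equals the number of distinct categories seen during fitting, and the position of the 1 follows the sorted category order.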

In [22]:
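# The SVR step is named 'classifier' although it is a regressor;
# the name only matters as the prefix of the hyperparameter keys below
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', SVR())])

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)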

### Randomized search and grid search to tune the hyperparameters

SVRs are flexible, work well on nonlinear problems, and are generally not strongly biased by outliers. However, they are harder to interpret than many other models.

An SVR model fits a tube of half-width epsilon around the regression function. Points inside the tube do not contribute to the loss; the points on or outside the boundary of the tube determine what the tube (and hence the prediction) looks like. These points are called support vectors.
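To make the role of epsilon concrete, the sketch below writes out the epsilon-insensitive loss and fits two SVR models to a made-up noisy sine curve. The data, the helper name `epsilon_insensitive_loss`, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVR

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube, linear in the excess distance outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Toy data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, size=80)

# Same model with a narrow and a wide tube
narrow = SVR(kernel='rbf', C=10, epsilon=0.01).fit(X_toy, y_toy)
wide = SVR(kernel='rbf', C=10, epsilon=0.5).fit(X_toy, y_toy)

# support_ holds the indices of the training points used as support vectors
print(len(narrow.support_), len(wide.support_))
```

A wider tube leaves more points strictly inside it, so fewer points end up as support vectors and the fitted function becomes smoother.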

In [24]:
# first access the parameter keys of the individual estimators
list(clf.get_params().keys())

['memory',
 'steps',
 'verbose',
 'preprocessor',
 'classifier',
 'preprocessor__n_jobs',
 'preprocessor__remainder',
 'preprocessor__sparse_threshold',
 'preprocessor__transformer_weights',
 'preprocessor__transformers',
 'preprocessor__verbose',
 'preprocessor__num',
 'preprocessor__cat',
 'preprocessor__num__memory',
 'preprocessor__num__steps',
 'preprocessor__num__verbose',
 'preprocessor__num__scaler',
 'preprocessor__num__scaler__copy',
 'preprocessor__num__scaler__with_mean',
 'preprocessor__num__scaler__with_std',
 'preprocessor__cat__memory',
 'preprocessor__cat__steps',
 'preprocessor__cat__verbose',
 'preprocessor__cat__onehot',
 'preprocessor__cat__onehot__categories',
 'preprocessor__cat__onehot__drop',
 'preprocessor__cat__onehot__dtype',
 'preprocessor__cat__onehot__handle_unknown',
 'preprocessor__cat__onehot__sparse',
 'classifier__C',
 'classifier__cache_size',
 'classifier__coef0',
 'classifier__degree',
 'classifier__epsilon',
 'classifier__gamma',
 'classifier__ke

In [25]:
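# Kernel. Note: 'precomputed' expects a square kernel matrix as input, so
# candidates that sample it are expected to fail on this feature matrix and
# to be scored as NaN (scikit-learn's default error_score)
kernel = ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']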
# Tolerance for stopping criterion
tol = [1e-3, 1e-4, 1e-5]
# Regularization parameter.
C = [0.0001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
# Epsilon in the epsilon-SVR model.
epsilon = [0.0001, 0.001, 0.01, 0.1, 1, 10]


# Create the random grid
random_grid = {'classifier__kernel': kernel,
               'classifier__tol': tol,
               'classifier__C': C,
               'classifier__epsilon': epsilon}


# Random search over the grid: 100 sampled parameter combinations,
# 3-fold cross-validation, using all available cores
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
clf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 223 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 277 out of 300 | elapsed:    4.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    7.0s finished


RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(transformers=[('num',
                                                                               Pipeline(steps=[('scaler',
                                                                                                StandardScaler())]),
                                                                               ['apparentTemperatureHigh',
                                                                                'precipProbability',
                                                                                'uvIndex',
                                                                                'precipIntensityMax_cm',
                                                                                'sun_uptime']),
                                                                              ('cat',
                  

In [26]:
clf_random.best_params_

{'classifier__tol': 0.0001,
 'classifier__kernel': 'rbf',
 'classifier__epsilon': 0.001,
 'classifier__C': 10}

The RBF (radial basis function, Gaussian) kernel appears to be the best kernel for this problem, so we fix it and refine the remaining hyperparameters with a grid search.
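`best_params_` only reports the winning candidate. To compare candidates, the `cv_results_` attribute of the fitted search can be loaded into a dataframe. The sketch below shows this on synthetic data with a reduced pipeline and a reduced search space; the data and the parameter values are assumptions for the example, not the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy[:, 0] ** 2 + 0.5 * X_toy[:, 1] + rng.normal(0, 0.1, size=120)

pipe = Pipeline(steps=[('scaler', StandardScaler()), ('classifier', SVR())])

# Keys follow the <step name>__<parameter name> convention
search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions={
        'classifier__kernel': ['linear', 'rbf'],
        'classifier__C': [0.1, 1, 10, 100],
        'classifier__epsilon': [0.01, 0.1, 1],
    },
    n_iter=10, cv=3, random_state=42)
search.fit(X_toy, y_toy)

# One row per sampled candidate, sorted by cross-validated score
results = pd.DataFrame(search.cv_results_)
top = results.sort_values('rank_test_score')[['params', 'mean_test_score']].head(3)
print(top)
```

For a regressor the default score is R², so a higher `mean_test_score` is better.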

In [27]:
# Kernel
kernel = ['rbf']
# Tolerance for stopping criterion
tol = [1e-3, 1e-4, 1e-5]
# Regularization parameter.
C = [0.0001, 0.01, 0.1, 1, 10, 100]
# Epsilon in the epsilon-SVR model.
epsilon = [0.00001, 0.0001, 0.001, 0.01, 0.1]


# Create the parameter grid
param_grid = {'classifier__kernel': kernel,
              'classifier__tol': tol,
              'classifier__C': C,
              'classifier__epsilon': epsilon}


# Instantiate the grid search model
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, 
                          cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:    1.3s finished


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['apparentTemperatureHigh',
                                                                          'precipProbability',
                                                                          'uvIndex',
                                                                          'precipIntensityMax_cm',
                                                                          'sun_uptime']),
                                                                        ('cat',
                                                                         Pipeline(st
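The number of candidates reported in the log follows directly from the grid: it is the product of the lengths of the value lists. As a small check, `ParameterGrid` can enumerate the combinations without fitting anything. The grid values are copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'classifier__kernel': ['rbf'],
              'classifier__tol': [1e-3, 1e-4, 1e-5],
              'classifier__C': [0.0001, 0.01, 0.1, 1, 10, 100],
              'classifier__epsilon': [0.00001, 0.0001, 0.001, 0.01, 0.1]}

# 1 kernel x 3 tolerances x 6 values of C x 5 values of epsilon
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per cross-validation fold (cv=3)
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 90 270
```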

In [28]:
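# best_estimator_ has already been refit on the full training set (refit=True by default),
# so the explicit fit below is redundant but harmless
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)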

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['apparentTemperatureHigh',
                                                   'precipProbability',
                                                   'uvIndex',
                                                   'precipIntensityMax_cm',
                                                   'sun_uptime']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder())]),
                                                  ['season', 'precipType'])])),
                ('classifier', SVR(C=10))])

### Evaluate the model: r2 score and MSE

In [29]:
y_pred = best_model.predict(X_test)
r2_score(y_test, y_pred)

0.7944537346006463

In [30]:
mean_squared_error(y_test, y_pred)

40.02518288147158
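Because the MSE is expressed in squared units of the target, its square root (RMSE) is often easier to read; for the value above, the square root of 40.03 is about 6.33. The sketch below shows how the three metrics relate, using made-up numbers rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_true, y_hat)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, r2)
```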

# Conclusion

We applied support vector regression to the dataset to predict the solar panel output for different weather conditions.
For the model with an RBF kernel, we found an MSE (mean squared error) of 40.0, which corresponds to an RMSE (root mean squared error) of about 6.3, and an r2_score of 0.79 on the test set.

