___
<h1> Machine Learning </h1>
<h2> M. Sc. in Electrical and Computer Engineering </h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[MEEC](https://ise.ualg.pt/en/curso/1477) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)
___

_Note: running this notebook will probably take a few hours._

# Support vector machines (SVMs) 

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.

The advantages of support vector machines are:
* Effective in high dimensional spaces.
* Still effective in cases where the number of dimensions is greater than the number of samples.
* Use a subset of the training points in the decision function (called support vectors), so they are also memory efficient.
* Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

See https://scikit-learn.org/stable/modules/svm.html for an explanation of the module and https://scikit-learn.org/stable/modules/svm.html#svm-mathematical-formulation for the mathematical formulation.
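
As a minimal sketch of the last two advantages, the following passes a Python callable as the kernel and counts how many training points end up as support vectors. The function name `linear_kernel` is ours (an assumption for illustration); it simply computes a dot-product Gram matrix.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def linear_kernel(A, B):
    # illustrative custom kernel: Gram matrix of dot products, shape (len(A), len(B))
    return A @ B.T

clf = SVC(kernel=linear_kernel).fit(X, y)

# number of support vectors per class and the fraction of training points kept
n_sv = clf.n_support_
print(n_sv, n_sv.sum() / len(X))
```

Only the support vectors enter the decision function, which is why the fraction printed above is well below 1 on this dataset.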



## Classification

Let us start with a simple example of classification using SVM. We will use the iris dataset and, as usual, we will split the dataset into training and test sets.

In [1]:
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target, 
                                                    random_state=10)

Now, let us train an SVM classifier using the training set and test it using the test set.

In [2]:
svm = SVC(C=.1, 
          kernel='poly', 
          degree=4).fit(X_train, y_train)

score = svm.score(X_test, y_test)
score

0.9736842105263158

By changing parameters such as C and the kernel, we can get different, and sometimes better, results. Here we only lower C.

In [3]:
svm = SVC(
    C=.01, 
    kernel='poly', 
    degree=4).fit(X_train, y_train)

score = svm.score(X_test, y_test)
print(score)
print('"1.0!! Pure luck"!! try with other random state value (train_test_split)!')

1.0
"1.0!! Pure luck"!! try with other random state value (train_test_split)!


See also https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html#sphx-glr-auto-examples-svm-plot-iris-svc-py
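
To check how much a single score depends on the split, one option is to repeat the experiment over several values of `random_state` and look at the spread. The sketch below reuses the same `SVC` settings as the last cell; the choice of 20 seeds is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

scores = []
for seed in range(20):  # 20 different train/test splits (arbitrary choice)
    X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                              random_state=seed)
    clf = SVC(C=.01, kernel='poly', degree=4).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

scores = np.array(scores)
print(f"mean={scores.mean():.3f}  min={scores.min():.3f}  max={scores.max():.3f}")
```

The mean over many splits is a more trustworthy estimate than the score of any single split.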

## Regression

Next we present a few examples of regression using SVM. 

Let us consider the Seoul Bike Sharing Demand dataset. The dataset contains the hourly count of rental bikes between years 2017 and 2018 in Seoul, Korea with the corresponding weather and seasonal information. The dataset can be downloaded from https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand but we have already downloaded it and saved it in the data folder.

Let us start by loading the dataset into a pandas dataframe. 

In [4]:
import pandas as pd
df = pd.read_csv('./data/SeoulBikeData.csv')
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


By calling the dataframe's info method, we can see that there are no missing values but there are some categorical columns.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Date                      8760 non-null   object 
 1   Rented Bike Count         8760 non-null   int64  
 2   Hour                      8760 non-null   int64  
 3   Temperature(C)            8760 non-null   float64
 4   Humidity(%)               8760 non-null   int64  
 5   Wind speed (m/s)          8760 non-null   float64
 6   Visibility (10m)          8760 non-null   int64  
 7   Dew point temperature(C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)   8760 non-null   float64
 9   Rainfall(mm)              8760 non-null   float64
 10  Snowfall (cm)             8760 non-null   float64
 11  Seasons                   8760 non-null   object 
 12  Holiday                   8760 non-null   object 
 13  Functioning Day           8760 non-null   object 
dtypes: float

The categorical columns need to be converted into, for example, dummy variables. 

A dummy variable is a numerical variable that takes the value 0 or 1 to indicate whether a sample belongs to a given category. For example, the Seasons column has four categories: Spring, Summer, Autumn, and Winter. We can convert this column into four columns, one for each category, and use 0 or 1 to indicate whether the sample belongs to that category. To achieve this, we can use the pandas get_dummies method. With drop_first=True, as used below, the first category of each column is dropped because it is implied when all the others are 0, so Seasons yields three columns instead of four. Recent pandas versions store these indicators as booleans (False/True).

In [6]:
df = pd.get_dummies(df, columns=['Seasons', 'Holiday', 'Functioning Day'], drop_first=True)
df

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons_Spring,Seasons_Summer,Seasons_Winter,Holiday_No Holiday,Functioning Day_Yes
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,False,False,True,True,True
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,False,False,True,True,True
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,False,False,True,True,True
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,False,False,True,True,True
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,False,False,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8755,30/11/2018,1003,19,4.2,34,2.6,1894,-10.3,0.0,0.0,0.0,False,False,False,True,True
8756,30/11/2018,764,20,3.4,37,2.3,2000,-9.9,0.0,0.0,0.0,False,False,False,True,True
8757,30/11/2018,694,21,2.6,39,0.3,1968,-9.9,0.0,0.0,0.0,False,False,False,True,True
8758,30/11/2018,712,22,2.1,41,1.0,1859,-9.8,0.0,0.0,0.0,False,False,False,True,True
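
The effect of `drop_first` is easier to see on a tiny example. The dataframe `toy` below is made up for illustration and is not part of the bike dataset.

```python
import pandas as pd

# made-up frame with one row per season
toy = pd.DataFrame({'Seasons': ['Spring', 'Summer', 'Autumn', 'Winter']})

full = pd.get_dummies(toy, columns=['Seasons'])
reduced = pd.get_dummies(toy, columns=['Seasons'], drop_first=True)

print(list(full.columns))     # four indicator columns
print(list(reduced.columns))  # 'Seasons_Autumn' is dropped
```

In `reduced`, an Autumn row is the one where all three remaining indicators are False, so no information is lost.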


The Date column is still stored as a string. We can split it into three numeric columns: month, day, and day of week. To achieve this, we can use the pandas to_datetime method as follows:

In [7]:
# make sure the date column is in datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# create new columns for month, day, and day of week
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['day_of_week'] = df['Date'].dt.day_of_week

# drop the original date column
df.drop('Date', axis=1, inplace=True)
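
The explicit `format` matters because the dates are written day-first. As a small sketch on two dates taken from the dataset, the following shows the extracted features and what happens to an ambiguous string when no format is given.

```python
import pandas as pd

dates = pd.Series(['01/12/2017', '30/11/2018'])
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

features = pd.DataFrame({
    'month': parsed.dt.month,
    'day': parsed.dt.day,
    'day_of_week': parsed.dt.day_of_week,  # Monday=0, ..., Sunday=6
})
print(features)

# without a format, an ambiguous string is interpreted month-first
ambiguous = pd.to_datetime('01/12/2017')
print(ambiguous.month, ambiguous.day)  # 1 12
```

Both sample dates fall on a Friday, so `day_of_week` is 4 for each.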

Let us now call the dataframe's info method again to confirm that all columns are numeric or boolean.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Rented Bike Count         8760 non-null   int64  
 1   Hour                      8760 non-null   int64  
 2   Temperature(C)            8760 non-null   float64
 3   Humidity(%)               8760 non-null   int64  
 4   Wind speed (m/s)          8760 non-null   float64
 5   Visibility (10m)          8760 non-null   int64  
 6   Dew point temperature(C)  8760 non-null   float64
 7   Solar Radiation (MJ/m2)   8760 non-null   float64
 8   Rainfall(mm)              8760 non-null   float64
 9   Snowfall (cm)             8760 non-null   float64
 10  Seasons_Spring            8760 non-null   bool   
 11  Seasons_Summer            8760 non-null   bool   
 12  Seasons_Winter            8760 non-null   bool   
 13  Holiday_No Holiday        8760 non-null   bool   
 14  Function

Since the target variable is the Rented Bike Count, we can split the dataframe into two dataframes: one with the target variable and another with the remaining variables.

In [9]:
X = df.drop('Rented Bike Count', axis=1)
y = df['Rented Bike Count']

Following the usual procedure, we can split the dataset into training and test sets. Shuffling the dataset is important to avoid any ordering bias.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    shuffle=True,
                                                    random_state=42,
                                                    test_size=0.2)

To train an SVM regressor, we can use the SVR class from the sklearn.svm module. Furthermore, we can use the GridSearchCV class to perform a grid search to find the best parameters for the SVR model.

In [11]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def create_model_with_GSCV(X_train, y_train):
    grid_search_parameters = [
        {'kernel': ['linear'], 'C': [10**i for i in range(-2, 3)]},
        {'kernel': ['rbf'], 'C': [10**i for i in range(-2, 3)], 'gamma': [10**i for i in range(-2, 3)]},
        {'kernel': ['poly'], 'C': [10**i for i in range(-2, 3)], 'degree': [2]}
    ]
    
    # create the model
    svr = SVR()
    
    # create grid search and fit it to the training data
    gs_model = GridSearchCV(estimator=svr,
                            param_grid=grid_search_parameters,
                            cv=5,
                            n_jobs=-1,
                            verbose=1).fit(X_train, y_train)
    return gs_model
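
SVMs are sensitive to the scale of the features, and here the columns range from 0/1 indicators to visibility values in the thousands. A possible variant, not used in this notebook, is to wrap a `StandardScaler` and the `SVR` in a `Pipeline` and search over the pipeline. The sketch below uses synthetic stand-in data (an assumption, so it runs in seconds); the step names `scaler` and `svr` are our own.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 2000, 200), rng.uniform(0, 1, 200)])
y_demo = 0.01 * X_demo[:, 0] + 5 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# scaling is fitted inside each CV fold, so no information leaks from validation data
pipe = Pipeline([('scaler', StandardScaler()), ('svr', SVR())])

# parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'svr__kernel': ['rbf'], 'svr__C': [1, 10, 100], 'svr__gamma': [0.1, 1]}

gs = GridSearchCV(pipe, param_grid, cv=5).fit(X_demo, y_demo)
print(gs.best_params_, gs.best_score_)
```

The same `svr__` prefix would be needed on the keys of `grid_search_parameters` if the function above were adapted to a pipeline.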

In [None]:
gdcv_model = create_model_with_GSCV(X_train, y_train)

Fitting 5 folds for each of 35 candidates, totalling 175 fits
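
The count of 35 candidates in the message above follows from the grid: 5 values of C for the linear kernel, 5 × 5 combinations of C and gamma for rbf, and 5 values of C for poly. As a quick check, `ParameterGrid` can enumerate the same grid without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

powers = [10**i for i in range(-2, 3)]  # 0.01, 0.1, 1, 10, 100

grid = [
    {'kernel': ['linear'], 'C': powers},
    {'kernel': ['rbf'], 'C': powers, 'gamma': powers},
    {'kernel': ['poly'], 'C': powers, 'degree': [2]},
]

n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_fits)  # 35 175
```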


The best parameters and score can be obtained as follows:

In [None]:
gdcv_model.best_params_

In [None]:
gdcv_model.best_score_ 

Note that refit defaults to True, which means that GridSearchCV refits an estimator with the best parameters found on the whole training set passed to fit. The best estimator can be obtained as follows:

In [None]:
model = gdcv_model.best_estimator_

On the test set, we can obtain the score as follows. For a regressor, score returns the coefficient of determination (R²), which indicates how well the model generalizes to unseen data.

In [None]:
model.score(X_test, y_test)
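
Since R² is unitless, it can help to report an error in the units of the target as well. The sketch below, on synthetic data (an assumption, chosen so it runs quickly), checks that `score` matches `r2_score` and also computes the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.svm import SVR

# synthetic stand-in data: noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 200)

# first 150 samples for training, last 50 for testing
reg = SVR().fit(X_demo[:150], y_demo[:150])
pred_demo = reg.predict(X_demo[150:])

r2 = r2_score(y_demo[150:], pred_demo)              # same value as reg.score(...)
mae = mean_absolute_error(y_demo[150:], pred_demo)  # in the units of the target
print(r2, mae)
```

For the bike data, the mean absolute error would read as an average number of bikes per hour by which the prediction is off.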

We can make predictions on the test set as follows and compare the predicted values with the actual values by plotting them. In a perfect regression, all points would lie on the diagonal.

In [None]:
import matplotlib.pyplot as plt

# make predictions over the test set
pred = model.predict(X_test)

# plot pred vs actual
plt.figure(figsize=(10,10))
plt.plot(y_test.values, pred, c='g', marker='o', linestyle='None')
plt.plot([0,3500], [0, 3500], c='r')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Actual vs Predicted")
