# Lecture 18: Support Vector Machine Part 2

### 1. SVM for Classification

Load modules.

scikit-learn version 1.1.3 is used in this lecture. DecisionBoundaryDisplay, imported below, requires version 1.1 or later.

Run *pip install scikit-learn==1.1.3* in a terminal.

In [None]:
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.inspection import DecisionBoundaryDisplay

We create 40 linearly separable points in two clusters.

In [None]:
X, y = make_blobs(n_samples=40, centers=2, random_state=6)

plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

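Fit the linear SVM model for classification (SVC). A large value of C penalises margin violations heavily, so the fit approximates a hard-margin classifier.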

In [None]:
clf = svm.SVC(kernel="linear", C=1000)
clf.fit(X, y)
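
For a linear kernel, the fitted separating line can be read directly from the estimator. The following is a minimal sketch that regenerates the same blobs so it runs on its own; the names `w`, `b` and `margin_width` are chosen here for illustration and are not part of the lecture code. The margin width is computed as 2 / ||w||, the distance between the two dashed lines in the plot further down.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Same data and model as in the cells above
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = svm.SVC(kernel="linear", C=1000).fit(X, y)

# For a linear kernel the decision function is f(x) = w . x + b
w = clf.coef_[0]        # normal vector of the separating line
b = clf.intercept_[0]   # offset

# Distance between the two margin lines f(x) = -1 and f(x) = +1
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w)
print("b =", b)
print("support vectors per class:", clf.n_support_)
print("margin width:", margin_width)
```

Note that `coef_` is only available when the kernel is linear.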

In [None]:
# plot the decision function
ax = plt.gca()
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    plot_method="contour",
    colors="k",
    levels=[-1, 0, 1],
    alpha=0.5,
    linestyles=["--", "-", "--"],
    ax=ax,
)
# plot support vectors
ax.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
plt.show()
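
The contour levels -1, 0 and 1 in the plot are values of the decision function: 0 is the separating line and ±1 are the margins. As a rough check, and assuming the same data and model as above, we can evaluate the decision function at the support vectors. For this separable data the values should be close to ±1, up to the solver tolerance. The variable names are illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Same data and model as in the cells above
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = svm.SVC(kernel="linear", C=1000).fit(X, y)

# Decision function values at the support vectors
sv_scores = clf.decision_function(clf.support_vectors_)

# Labels of the support vectors, mapped to -1 / +1
sv_labels = y[clf.support_]
sv_signs = np.where(sv_labels == clf.classes_[1], 1, -1)

for score, sign in zip(sv_scores, sv_signs):
    print(f"class sign {sign:+d}: decision function = {score:.3f}")
```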

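### 2. SVM for Regression

Support vector machines can also be applied to regression problems. Instead of maximising the margin between two classes, support vector regression fits a function that keeps as many training points as possible within a tube of fixed width around it, and penalises only the points that fall outside the tube.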
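
To make the idea concrete, here is a minimal sketch of the epsilon-insensitive loss that support vector regression is built on: residuals smaller than epsilon cost nothing, larger ones are penalised linearly. The helper `epsilon_insensitive_loss` and the toy numbers are hypothetical and only for illustration; the default `epsilon=0.1` mirrors the default of `sklearn.svm.SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Toy values (made up for illustration)
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred)
print(losses)  # the first point lies inside the tube, so its loss is zero
```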

We will work on the mammals dataset again.

Load the modules.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

We sort the data by body size with the method .sort_values().

In [None]:
mammals = pd.read_csv('mammals.csv').sort_values('body')

body = mammals[ ['body'] ].values
brain = mammals['brain'].values


The svm module of scikit-learn contains a class for Support Vector Regression (SVR).

We run a model with a linear kernel and one with a Gaussian kernel.

In [None]:
from sklearn import svm

svm_lm = svm.SVR(kernel='linear', C=1e1)
svm_rbf = svm.SVR(kernel='rbf', C=1e1)

We train both models on the logarithms of the body and brain variables.

In [None]:
svm_lm.fit(np.log(body), np.log(brain))
svm_rbf.fit(np.log(body), np.log(brain))

For comparison, we will also fit an ordinary linear regression model to the log-transformed variables.

In [None]:
from sklearn.linear_model import LinearRegression

logfit = LinearRegression().fit(np.log(body), np.log(brain))

Obtain the predictions for all the trained models.

In [None]:
mammals['log_regr'] = np.exp(logfit.predict(np.log(body)))
mammals['linear_svm'] = np.exp(svm_lm.predict(np.log(body)))
mammals['rbf_svm'] = np.exp(svm_rbf.predict(np.log(body)))

print(mammals['log_regr'])
print(mammals['linear_svm'])
print(mammals['rbf_svm'])

Plot the regression curves for comparison.

In [None]:
plt.plot(body, mammals['log_regr'])
plt.plot(body, mammals['linear_svm'])
plt.plot(body, mammals['rbf_svm'])

plt.legend(['Linear regression (log transformed)', 'Linear SVM', 'Gaussian SVM'])
plt.scatter(body, brain)
plt.show()

In this case, the linear SVM performs no better than the simple linear regression model, and arguably slightly worse.
The Gaussian kernel did not require us to make any further explicit transformations and, in this case, gets closer to the values observed in the original dataset.

Note that close to the origin the Gaussian kernel produced some wiggles. Overfitting is still an adversary that needs to be considered.

I leave the tuning of the hyperparameter C as well as the implementation of cross-validation as a take-home exercise.
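
As one possible starting point for the exercise, the sketch below tunes C for the Gaussian-kernel SVR with 5-fold cross-validation using `GridSearchCV`. It is an outline under stated assumptions, not a reference solution: the data are synthetic stand-ins for log(body) and log(brain) (the slope, intercept and noise level are arbitrary), and the grid of C values is an arbitrary choice. Replace `log_body` and `log_brain` with `np.log(body)` and `np.log(brain)` from the cells above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the mammals dataset); values are arbitrary
rng = np.random.default_rng(0)
log_body = rng.uniform(-5, 9, size=(60, 1))
log_brain = 2.0 + 0.75 * log_body.ravel() + rng.normal(scale=0.7, size=60)

# Candidate values for C (an arbitrary grid, to be refined)
param_grid = {"C": [0.1, 1, 10, 100]}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",  # higher (closer to zero) is better
)
search.fit(log_body, log_brain)

print("best C:", search.best_params_["C"])
print("cross-validated MSE (log scale):", -search.best_score_)
```

The same grid can be extended with `gamma` or `epsilon` entries to control how wiggly the Gaussian-kernel fit is allowed to be.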