# Support Vector Machine (SVM) Machine Learning Models


Decision tree-based models tend to perform well for many problems. However, depending on our problem, other algorithms may work better. One widely used machine learning algorithm is the support vector
machine (SVM).

Like linear and logistic regression, SVMs have been around for a
while, since 1963. SVMs can be used for both regression and classification; the corresponding models are sometimes called support vector regressors (SVRs) and support vector classifiers (SVCs).

Although SVMs have been around for a while and have become less popular with
the rise of other ML algorithms, it's still worth trying SVMs as one of your ML
algorithms for supervised learning problems.

SVMs have some advantages:
- They work well with a high number of dimensions (many features)
- They are memory-efficient, since they only use a subset of datapoints to classify new ones
- Using kernels to transform data can make them more flexible for higher-dimension and complex feature spaces

However, there are some disadvantages:
- Vanilla SVM implementations do not scale well with increasing features and datapoints (although there are implementations for big data with Spark and other software)
- Probability estimates for class predictions need to be found with cross-validation, which is computationally expensive
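The memory-efficiency point in the list of advantages can be illustrated with a minimal sketch. It assumes a synthetic dataset from `make_classification` (not the data used in this chapter) and simply counts how many training points the fitted model keeps as support vectors.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, fairly well-separated two-class data (an assumption for illustration).
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)

model = SVC(kernel="linear").fit(X, y)

# Only the support vectors are kept and used when classifying new points.
n_sv = model.support_vectors_.shape[0]
print(f"{n_sv} of {len(X)} training points are support vectors")
```

On data like this, the support vectors are typically a small fraction of the training set; on heavily overlapping classes the fraction can be much larger.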


There are four common kernels we'll see:
- Linear – the first example we saw
- Polynomial – can work for slightly more complex data
- Radial basis function (RBF) – works for very complex data
- Sigmoid – can work for complex data
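As a rough way to compare these four kernels, the sketch below fits `sklearn.svm.SVC` with each one on a small nonlinear toy dataset. The `make_moons` data and all variable names are assumptions for illustration, and the ranking of kernels will differ on other data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: a simple dataset that is not linearly separable.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel}: test accuracy = {scores[kernel]:.3f}")
```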

## SVM for regression

SVMs work a little differently for regression and are called SVRs. Instead of trying to maximize the margin between the hyperplane and points of different classes, SVRs in essence fit a hyperplane to the data. This is similar to how linear regression works, although we are optimizing a different function with SVRs. Essentially, we try to minimize the difference between predictions of datapoints from the hyperplane and actual values.
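As a rough sketch of that "different function": the formula below is the standard textbook objective for a linear SVR, not something stated in this chapter, and the symbols $w$, $b$, $C$, and $\varepsilon$ are assumed notation.

$$
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\bigl(0,\; \lvert y_i - (w^\top x_i + b) \rvert - \varepsilon \bigr)
$$

Prediction errors smaller than $\varepsilon$ cost nothing, and larger errors are penalized linearly. Ordinary least-squares linear regression, by contrast, minimizes the sum of squared errors $\sum_{i} \bigl(y_i - (w^\top x_i + b)\bigr)^2$.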

## SVM with sklearn

The sklearn package has a few different SVC and SVR implementations:
- Linear SVMs (svm.LinearSVC, LinearSVR, linear_model.SGDClassifier, and SGDRegressor)
- General SVMs (svm.SVC and SVR)
- Nu SVMs (svm.NuSVC and NuSVR)

The linear SVM can be implemented with svm.LinearSVC and svm.SVC, although the
LinearSVC implementation is better (because it scales better to large datasets and has more flexibility, as described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). The SVC implementation allows any kernel to be used, and has pre-made options for using different kernels: polynomial (poly), RBF (rbf), and sigmoid (sigmoid).

The Nu SVMs introduce a hyperparameter, `nu`, which is an upper bound on the
fraction of misclassified points (margin errors) for classification, and a lower bound on the fraction of support vectors.

Using these is the same as any other sklearn supervised learning algorithm: create
the model (with any chosen hyperparameters), train it, then use and evaluate it. First,
let's load the credit card default data we've used previously and create train and test
sets:

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_excel(r'data\sample - default of credit card clients.xls', skiprows=1, index_col='ID')

target_column = 'default payment next month'

features = df.drop(target_column, axis = 1)

targets = df[target_column]

train_x, test_x, train_y, test_y = train_test_split(features, targets, stratify=targets, random_state=42)

In [4]:
scaler = StandardScaler()

scaled_train_x = scaler.fit_transform(train_x)
scaled_test_x = scaler.transform(test_x)

We load the data and split it into train and test sets, using `stratify` in `train_test_split` to make sure the balance of binary targets stays the same between the train and test sets. We also prepare some scaled features. For SVMs, there are a few caveats. One is that it is helpful to scale the data; this is due to how the math behind SVMs works (which, again, is complex, and will require deeper/further study to fully understand). One other caveat is that getting probability predictions is not built in, and is not available with LinearSVC.
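To make the effect of `stratify` and `StandardScaler` concrete, here is a minimal sketch on synthetic data. The imbalanced `make_classification` dataset and the variable names are assumptions for illustration, not the credit card data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced binary data (roughly 78% / 22%), an assumption for illustration.
X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# With stratify, the positive-class rate is nearly identical in both sets.
train_rate, test_rate = y_train.mean(), y_test.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
```

Fitting the scaler on the training set only, as the cells above do, avoids leaking information from the test set into the model.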

Let's look first at using LinearSVC with the default hyperparameters. Almost all of
the configuration parameters have to do with the iterative solver for the function,
such as `max_iter` (the maximum number of iterations) and `tol` (the tolerance that
stops the iteration once the optimization no longer improves enough between
iterations). The main hyperparameter to tune here is `C`, which is a regularization
coefficient. Higher values of C mean less regularization (less prevention of overfitting),
or a smaller-margin hyperplane (less separation between points and the hyperplane).
A good visual explanation of this can be found here: https://stats.stackexchange.com/a/159051/120921.
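The relationship between `C` and the margin can be checked numerically. The sketch below is an illustration on synthetic data (the dataset and the `margins` dictionary are assumptions); it uses the fact that the distance from the hyperplane to the margin boundary is 1 / ||w||.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data, standardized as recommended for SVMs.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

margins = {}
for C in [0.001, 0.1, 10]:
    model = LinearSVC(C=C, max_iter=10000).fit(X, y)
    # Larger coefficients mean a narrower margin.
    margins[C] = 1.0 / np.linalg.norm(model.coef_)
    print(f"C={C}: margin = {margins[C]:.3f}")
```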

In [5]:
from sklearn.svm import LinearSVC

lsvc = LinearSVC()
lsvc.fit(scaled_train_x, train_y)
print(lsvc.score(scaled_train_x, train_y))
print(lsvc.score(scaled_test_x, test_y))

0.7968
0.7877333333333333




This is the same pattern we followed with other sklearn models. Our accuracy here
is 78.7% on the test set, slightly better than the no-information rate of 78.3% (from
`targets.value_counts(normalize=True)`). If we instead use the non-scaled data, our
accuracy is much lower (around 50%).
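For reference, the no-information rate is just the accuracy of always predicting the majority class. The sketch below uses hypothetical labels with a 78.3% / 21.7% split standing in for `targets` (the real data is not loaded here) and checks the rate against sklearn's `DummyClassifier`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 78.3% / 21.7% split, standing in for `targets`.
targets = pd.Series([0] * 783 + [1] * 217)
no_info_rate = targets.value_counts(normalize=True).max()

# The dummy model ignores the features, so a placeholder column is enough.
placeholder_x = np.zeros((len(targets), 1))
dummy = DummyClassifier(strategy="most_frequent").fit(placeholder_x, targets)
dummy_accuracy = dummy.score(placeholder_x, targets)

print(no_info_rate, dummy_accuracy)
```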

We can also implement a linear SVC with a few other methods: using the `linear_model.SGDClassifier` with the default value of `loss='hinge'`, or using the SVC or
NuSVC models with `kernel='linear'`. SGD stands for stochastic gradient descent,
since it is using gradient descent to optimize the model (the hyperplane for our SVC).
With the SGD model, we do not have the C hyperparameter, though we can fine-tune
the gradient descent and the L1 and L2 penalties used with the model in more detail.

We also have an `alpha` parameter for the L1 and L2 regularization that penalizes bigger values in the w vector from our hyperplane equation.

While C penalizes misclassified points, alpha penalizes bigger coefficients in w. Both increase the margin (the distance from the hyperplane to the nearest points) as regularization gets stronger (smaller C or bigger alpha values).
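A minimal sketch of the `alpha` effect with `SGDClassifier` follows. The synthetic dataset and the `coef_norms` dictionary are assumptions for illustration; the point is only that a larger `alpha` shrinks the coefficient vector w.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data with one cluster per class, standardized before fitting.
X, y = make_classification(
    n_samples=1000, n_features=10, n_clusters_per_class=1, random_state=0
)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for alpha in [1e-4, 1.0]:
    # loss='hinge' makes this a linear SVM trained with stochastic gradient descent.
    model = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha, random_state=0)
    model.fit(X, y)
    coef_norms[alpha] = np.linalg.norm(model.coef_)
    print(f"alpha={alpha}: ||w|| = {coef_norms[alpha]:.3f}")
```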

For the SVC and NuSVC models, a different algorithm is used for the linear SVC that does not scale as well with bigger data. However, we can get probability estimates for predictions by setting the probability=True parameter:

In [6]:
from sklearn.svm import SVC

svc = SVC(probability=True)
svc.fit(scaled_train_x, train_y)
print(svc.score(scaled_train_x, train_y))
print(svc.score(scaled_test_x, test_y))

0.8272
0.8181333333333334


After setting `probability=True`, we can then use the `predict_proba` method of the
model (such as `svc.predict_proba(scaled_test_x)`) to predict probabilities of
classes. However, these probabilities are estimated from a cross-validation method,
so may not always agree with the actual predictions. Interestingly, the SVC model
performs slightly better than the LinearSVC model here, with a test accuracy
of 81.8%. Note that `SVC()` differs from `LinearSVC` in two ways: it uses a different
solver (libsvm instead of liblinear), and its default kernel is `rbf` rather than linear,
so the improvement cannot be attributed to the solver alone.
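The sketch below shows what `predict_proba` returns and how to check whether the probabilities agree with the hard predictions. It runs on synthetic data, and names such as `labels_from_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# probability=True adds an internal cross-validated calibration step during fit.
svc = SVC(probability=True, random_state=0).fit(X_train, y_train)

proba = svc.predict_proba(X_test)   # one column per class, rows sum to 1
hard_labels = svc.predict(X_test)   # based on the decision function, not on proba

# Count test points where the most probable class differs from the prediction.
labels_from_proba = svc.classes_[proba.argmax(axis=1)]
n_disagree = int((hard_labels != labels_from_proba).sum())
print(proba[:3], n_disagree)
```

Any disagreement tends to occur for points close to the decision boundary, where the estimated probabilities are near 0.5.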

We could also try using the SGDClassifier and the NuSVC models for
comparison. The NuSVC model works in almost the same way as the SVC model,
although it has the `nu` hyperparameter, which should be greater than 0 and less
than or equal to 1. The `nu` hyperparameter sets an upper bound on the fraction of
misclassified points (margin errors) and a lower bound on the fraction of support vectors.
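The lower bound on support vectors can be observed directly. This is a small sketch on synthetic, roughly balanced data (the dataset and the `sv_fraction` dictionary are assumptions for illustration); very large `nu` values can be infeasible when classes are heavily imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

# Roughly balanced synthetic classes, standardized before fitting.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

sv_fraction = {}
for nu in [0.1, 0.3, 0.5]:
    model = NuSVC(nu=nu).fit(X, y)
    # Fraction of training points kept as support vectors.
    sv_fraction[nu] = len(model.support_) / len(X)
    print(f"nu={nu}: support vector fraction = {sv_fraction[nu]:.3f}")
```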


To use SVMs for regression in sklearn, the process is the same, although we
don't have the `probability` or `predict_proba` options available and use the SVR-based
classes (or SGDRegressor) instead of the SVC versions. We also don't have
one hyperparameter that we had for classification, `class_weight`, which, if set to
`'balanced'`, weights points inversely according to their class frequency. We do have
another hyperparameter with SVR, which is `epsilon`. This determines a distance
from the hyperplane within which wrong predictions are not penalized. A bigger value
of epsilon means the model will have more bias (less overfitting), while a smaller
epsilon makes the model fit the data more exactly (more variance).
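A minimal sketch of the `epsilon` effect with `sklearn.svm.SVR` follows. The noisy sine data and the variable names are assumptions for illustration. Points that fall inside the epsilon tube are not penalized and do not become support vectors, so a wider tube leaves fewer of them.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve as a toy regression problem.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_sv = {}
for epsilon in [0.01, 0.5]:
    model = SVR(kernel="rbf", epsilon=epsilon).fit(X, y)
    n_sv[epsilon] = len(model.support_)
    print(f"epsilon={epsilon}: {n_sv[epsilon]} support vectors")
```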

## Tuning SVMs with PyCaret

As with other models, we can easily tune SVMs using pycaret. By default, pycaret
uses the SGDClassifier or regressor for a linear SVM and uses SVC or SVR for an RBF
kernel SVM. We can also supply our own search space to the tuner; so, if we want to tune
the `C` hyperparameter of the RBF kernel SVC over a custom range, we can do it like so:

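In [19]:
from pycaret.distributions import UniformDistribution
from pycaret.classification import setup, create_model, tune_model

# Set up the experiment; normalize=True standardizes the features
clf_setup = setup(data=df,
                  target='default payment next month',
                  normalize=True)

# Create the RBF kernel SVC (sklearn.svm.SVC with kernel='rbf')
rbfsvmc = create_model('rbfsvm')

# Tune C with Bayesian search over a custom range
tuned_rbfsvmc = tune_model(rbfsvmc,
                           search_library='scikit-optimize',
                           custom_grid={'C': UniformDistribution(0, 500)})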

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8171 | 0.7178 | 0.3036 | 0.6538 | 0.4146 | 0.3231 | 0.3565 |
| 1 | 0.8076 | 0.6615 | 0.3009 | 0.6071 | 0.4024 | 0.3029 | 0.3295 |
| 2 | 0.821 | 0.683 | 0.3451 | 0.661 | 0.4535 | 0.3588 | 0.3859 |
| 3 | 0.821 | 0.7132 | 0.2743 | 0.7209 | 0.3974 | 0.3163 | 0.3675 |
| 4 | 0.819 | 0.6918 | 0.292 | 0.6875 | 0.4099 | 0.3231 | 0.3645 |
| 5 | 0.8419 | 0.7029 | 0.3363 | 0.8261 | 0.478 | 0.4037 | 0.4606 |
| 6 | 0.8076 | 0.7245 | 0.2212 | 0.6579 | 0.3311 | 0.2499 | 0.3009 |
| 7 | 0.8305 | 0.7437 | 0.3717 | 0.7 | 0.4855 | 0.3953 | 0.4237 |
| 8 | 0.8248 | 0.7328 | 0.3274 | 0.6981 | 0.4458 | 0.3575 | 0.3937 |
| 9 | 0.834 | 0.7536 | 0.375 | 0.7119 | 0.4912 | 0.4032 | 0.4328 |


We import the necessary functions from the pycaret.classification module,
then set up our pycaret space. We leave the defaults for the detected numeric
or categorical columns and normalize our features (this normalization uses
standardization by default). Then we create our SVM - Radial Kernel Classifier model (`rbfsvm`) and tune its `C` hyperparameter with Bayesian search over the range given in `custom_grid`. The result is a C value of around 2, with a mean accuracy of about 82% across the 10 cross-validation folds.


The tuned model reaches a mean accuracy of 82.24% across the ten folds. Without a custom grid, `rbfsvm`, which uses `sklearn.svm.SVC` with `kernel='rbf'`, searches the C hyperparameter from 0 to 50, and tries using `class_weight='balanced'` and `None`.


With `'balanced'`, class weights are set inversely proportional to class prevalence. rbfsvm takes much longer to run than LinearSVC or SGDClassifier and ends up with an accuracy of 81.6%. In this case, it looks like the best SVC model to use would be SGDClassifier (the `svm` model from pycaret), since it has similar accuracy to the other models but runs the fastest.


The RBF kernel does have one more hyperparameter, which is `gamma`. This is set
to `auto` with pycaret, although we could try tuning it as well. The gamma value
affects how the data is transformed into a new dimension, with larger values of
gamma tending to cause overfitting and smaller values causing underfitting. A very small value of gamma ends up being like a linear SVM. More details on the RBF hyperparameters are provided in the sklearn documentation: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html.
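The over- and underfitting behavior of `gamma` can be seen by comparing train and test accuracy across a few values. This is a sketch on the `make_moons` toy dataset (an assumption for illustration, not the credit card data), and the exact numbers will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

accuracy = {}
for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each gamma.
    accuracy[gamma] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"gamma={gamma}: train={accuracy[gamma][0]:.3f}, test={accuracy[gamma][1]:.3f}")
```

A large gap between train and test accuracy at high `gamma` is the usual sign of overfitting.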

Using pycaret to optimize SVMs for regression is similar to classification, except we
use the pycaret.regression module instead. The regression models do not
have the `class_weight` hyperparameter, but do have an `epsilon` hyperparameter.
This epsilon hyperparameter determines a distance from the hyperplane within which
wrong predictions are not penalized, and pycaret searches a range of 1 to 2 for this.
There is also not a linear SVM available for SVR by default, although we can create
our own LinearSVR model and use that.