# Support Vector Machines

A support vector machine (SVM) is a powerful and versatile machine
learning model, capable of performing linear or nonlinear classification,
regression, and even novelty detection. SVMs shine with small to
medium-sized nonlinear datasets (i.e., hundreds to thousands of instances), especially
for classification tasks. However, they don’t scale very well to very large
datasets, as you will see.

## Linear SVM Classification

Linear SVM classification is a supervised learning algorithm for binary classification problems. It looks for the hyperplane that separates the two classes of data points with the largest possible margin. You can picture this margin as the widest possible street between the classes, which is why the approach is also called large margin classification.

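SVMs are sensitive to feature scales: a feature with a much larger range than the others dominates the width of the street, so scale the features (for example, with Scikit-Learn’s StandardScaler) before training.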
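
A minimal sketch of scaling features before fitting a linear SVM, assuming Scikit-Learn is installed. The iris subset, the two query points, and the hyperparameter values (C=1, max_iter=10_000) are illustrative assumptions rather than settings taken from the text.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data[:, 2:4]        # petal length and petal width (cm)
y = (iris.target == 2)       # True for Iris virginica, False otherwise

# The scaler is fitted on the training data and then applied automatically
# whenever the pipeline predicts, so both features end up on comparable scales.
svm_clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1, max_iter=10_000, random_state=42),
)
svm_clf.fit(X, y)

X_new = [[5.5, 1.7], [5.0, 1.5]]              # hypothetical flowers to classify
predictions = svm_clf.predict(X_new)
scores = svm_clf.decision_function(X_new)     # signed distance-like scores
train_accuracy = svm_clf.score(X, y)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to the training set, which avoids accidentally scaling new data differently.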

Unlike LogisticRegression, LinearSVC doesn’t have a predict_proba()
method to estimate the class probabilities. That said, if you use the SVC class
(discussed shortly) instead of LinearSVC, and if you set its probability
hyperparameter to True, then the model will fit an extra model at the end of
training to map the SVM decision function scores to estimated probabilities.
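
The difference between the two classes can be sketched as follows, assuming Scikit-Learn is installed. The dataset and the values C=1 and kernel="linear" are illustrative assumptions; the point is only that SVC exposes predict_proba() once probability=True is set.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = load_iris()
X = iris.data[:, 2:4]                    # petal length and petal width (cm)
y = (iris.target == 2).astype(int)       # 1 for Iris virginica, 0 otherwise

# LinearSVC offers decision_function() but no predict_proba() method.
linear_svc_has_proba = hasattr(LinearSVC(), "predict_proba")

# With probability=True, SVC fits an extra calibration model after training.
svc_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=1, probability=True, random_state=42),
)
svc_clf.fit(X, y)

X_new = [[5.5, 1.7], [5.0, 1.5]]          # hypothetical flowers to classify
scores = svc_clf.decision_function(X_new)
probas = svc_clf.predict_proba(X_new)     # one row per instance, one column per class
```

Because the probabilities come from a separate model fitted on top of the scores, training takes longer than with probability=False.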


### Soft Margin Classification

If we strictly impose that all instances must be off the street and on the
correct side, this is called **hard margin classification**. There are two main
issues with hard margin classification. First, it only works if the data is
linearly separable. Second, it is sensitive to outliers.

To avoid these issues, we need to use a more flexible model. The objective is
to find a good balance between keeping the street as large as possible and
limiting the margin violations (i.e., instances that end up in the middle of the
street or even on the wrong side). This is called **soft margin classification**.

To do that, you can tune the regularization hyperparameter C. Reducing C makes
the street wider, but it also leads to more margin violations. In other
words, reducing C results in more instances supporting the street, so there’s
less risk of overfitting. But if you reduce it too much, the model ends up
underfitting. For instance, a model trained with C=100 can generalize better
than one trained with C=1 if the smaller value already underfits the data.
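
The effect of C on the street can be checked numerically. The sketch below assumes Scikit-Learn and NumPy are installed. It uses a synthetic, deliberately overlapping dataset and two arbitrary values of C, and it relies on SVC(kernel="linear") only because that class exposes both the weight vector and the support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable.
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    street_width = 2 / np.linalg.norm(w)    # distance between the two margin lines
    # Support vectors lie on the edge of the street, inside it, or on the wrong side.
    n_support = len(clf.support_)
    results[C] = (street_width, n_support)
    print(f"C={C}: street width={street_width:.2f}, support vectors={n_support}")
```

The smaller C yields the wider street and the larger number of supporting instances, which is the trade-off described above.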

## Nonlinear SVM Classification

Although linear SVM classifiers are efficient and often work surprisingly
well, many datasets are not even close to being linearly separable. One
approach to handling nonlinear datasets is to add more features, such as
polynomial features.
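
As a sketch of this approach, assuming Scikit-Learn is installed, the pipeline below adds polynomial features before a linear SVM. The moons dataset, degree=3, and C=10 are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Two interleaving half circles: not linearly separable in the original 2D space.
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),     # adds all monomials up to degree 3
    StandardScaler(),                 # SVMs are sensitive to feature scales
    LinearSVC(C=10, max_iter=10_000, random_state=42),
)
polynomial_svm_clf.fit(X, y)
train_accuracy = polynomial_svm_clf.score(X, y)
```

The classifier is still linear, but it is linear in the expanded feature space, which corresponds to a curved boundary in the original one.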


### Polynomial Kernel

Adding polynomial features is simple to implement and can work great with
all sorts of machine learning algorithms (not just SVMs). That said, at a low
polynomial degree this method cannot deal with very complex datasets, and
with a high polynomial degree it creates a huge number of features, making
the model too slow.
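
To get a feel for how fast the feature count grows, the following sketch counts the monomials of total degree at most d in n variables, constant term included, which is the binomial coefficient C(n + d, d). The helper name n_polynomial_features and the sample sizes are made up for illustration.

```python
from math import comb


def n_polynomial_features(n_features: int, degree: int) -> int:
    """Count monomials of total degree <= degree, including the constant term."""
    return comb(n_features + degree, degree)


# Feature counts for a hypothetical dataset with 100 input features.
for degree in (2, 3, 10):
    print(f"degree={degree}: {n_polynomial_features(100, degree):,} features")
```

With 100 input features, degree 2 already gives 5,151 features and degree 3 gives 176,851.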

Fortunately, when using SVMs you can apply an almost miraculous
mathematical technique called the kernel trick (which is explained later in
this chapter). The kernel trick makes it possible to get the same result as if
you had added many polynomial features, even with a very high degree,
without actually having to add them. This means there’s no combinatorial
explosion of the number of features. This trick is implemented by the SVC
class.
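
A minimal sketch of the polynomial kernel with the SVC class, assuming Scikit-Learn is installed. The dataset and the values degree=3, coef0=1, and C=5 are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# kernel="poly" behaves like a degree-3 polynomial model without
# materializing any of the polynomial features.
poly_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5),
)
poly_kernel_svm_clf.fit(X, y)

n_input_features = poly_kernel_svm_clf[-1].n_features_in_   # the SVC still sees 2 features
train_accuracy = poly_kernel_svm_clf.score(X, y)
```

Here coef0 controls how much the model is influenced by high-degree terms versus low-degree terms; it is worth tuning together with degree and C.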

### Similarity Features

Another technique to tackle nonlinear problems is to add features computed
using a similarity function, which measures how much each instance
resembles a particular landmark.
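
For a concrete picture, the sketch below assumes the similarity function is a Gaussian RBF, exp(-γ (x - ℓ)²), applied to a one-dimensional instance. The instance, the two landmarks, and γ = 0.3 are illustrative values.

```python
import math


def gaussian_rbf(x: float, landmark: float, gamma: float) -> float:
    """Similarity in (0, 1]: equals 1 at the landmark and decays with distance."""
    return math.exp(-gamma * (x - landmark) ** 2)


x = -1.0
landmarks = (-2.0, 1.0)
similarity_features = [gaussian_rbf(x, lm, gamma=0.3) for lm in landmarks]
print([round(s, 2) for s in similarity_features])   # [0.74, 0.3]
```

The instance at x = -1 is closer to the first landmark, so its first similarity feature is larger than its second.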

You may wonder how to select the landmarks. The simplest approach is to
create a landmark at the location of each and every instance in the dataset.
Doing that creates many dimensions and thus increases the chances that the
transformed training set will be linearly separable. The downside is that a
training set with m instances and n features gets transformed into a training
set with m instances and m features (assuming you drop the original features).
If your training set is very large, you end up with an equally large number of
features.
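
The change of shape can be sketched with NumPy, again assuming a Gaussian RBF similarity function. The helper rbf_similarity_features, the random data, and γ = 0.3 are hypothetical.

```python
import numpy as np


def rbf_similarity_features(X: np.ndarray, gamma: float) -> np.ndarray:
    """Use every instance as a landmark: entry [i, j] is the similarity of instance i to landmark j."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)


rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))            # m = 50 instances, n = 3 features
X_sim = rbf_similarity_features(X, gamma=0.3)
print(X.shape, "->", X_sim.shape)       # (50, 3) -> (50, 50)
```

Memory for the transformed set grows with m², which is why this approach becomes impractical for large training sets.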

### Gaussian RBF Kernel

Just like the polynomial features method, the similarity features method can
be useful with any machine learning algorithm, but it may be computationally
expensive to compute all the additional features (especially on large training
sets). Once again the kernel trick does its SVM magic, making it possible to
obtain a similar result as if you had added many similarity features, but
without actually doing so.
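
One way to see this, sketched under the assumption that Scikit-Learn and NumPy are installed, is to rebuild a fitted binary SVC’s decision function by hand. The values gamma=0.5 and C=1 are arbitrary, and rbf_kernel is used only to compute similarities between the inputs and the support vectors.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
gamma = 0.5
svc = SVC(kernel="rbf", gamma=gamma, C=1).fit(X, y)

# Similarities are needed only against the support vectors,
# not against a landmark at every training instance.
K = rbf_kernel(X, svc.support_vectors_, gamma=gamma)    # shape (100, n_support_vectors)
manual_scores = K @ svc.dual_coef_[0] + svc.intercept_[0]

matches = np.allclose(manual_scores, svc.decision_function(X), atol=1e-6)
print(f"{len(svc.support_)} support vectors, scores match: {matches}")
```

The model never stores an m × m feature matrix; predictions only require kernel evaluations against the support vectors.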

The behavior of the RBF kernel depends on the hyperparameters gamma (γ) and
C. Increasing gamma makes the bell-shaped curve narrower, so each instance’s
range of influence is smaller and the decision boundary ends up more
irregular, wiggling around individual instances.
Conversely, a small gamma value makes the bell-shaped
curve wider: instances have a larger range of influence, and the decision
boundary ends up smoother. So γ acts like a regularization hyperparameter: if
your model is overfitting, you should reduce γ; if it is underfitting, you should
increase γ (similar to the C hyperparameter).
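
The influence of γ can be explored with a small experiment, assuming Scikit-Learn is installed. The noisy moons dataset, the train/test split, and the three gamma values are illustrative assumptions, and the exact accuracies will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

accuracy = {}
for gamma in (0.1, 5, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1))
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each gamma.
    accuracy[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"gamma={gamma}: train={accuracy[gamma][0]:.2f}, test={accuracy[gamma][1]:.2f}")
```

A large gap between training and test accuracy at high gamma is a sign of overfitting, while low accuracy on both at small gamma suggests underfitting.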

## SVM Regression

To use SVMs for regression instead of classification, the trick is to tweak the
objective: instead of trying to fit the largest possible street between two
classes while limiting margin violations, SVM regression tries to fit as many
instances as possible on the street while limiting margin violations (i.e.,
instances off the street).

The width of the street is controlled by a hyperparameter, ϵ. Reducing ϵ increases
the number of support vectors, which regularizes the
model. Moreover, if you add more training instances within the margin, it
will not affect the model’s predictions; thus, the model is said to be
ϵ-insensitive.
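
The link between ϵ and the number of support vectors can be sketched as follows, assuming Scikit-Learn and NumPy are installed. The synthetic linear data and the values C=1, ϵ=0.1, and ϵ=1.5 are illustrative assumptions; SVR(kernel="linear") is used because it exposes the support vector indices.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=200)     # noisy linear target

n_support = {}
for epsilon in (0.1, 1.5):
    svr = SVR(kernel="linear", C=1, epsilon=epsilon).fit(X, y)
    # Instances on the edge of the street or outside it become support vectors.
    n_support[epsilon] = len(svr.support_)
    print(f"epsilon={epsilon}: {n_support[epsilon]} support vectors out of {len(X)}")
```

Instances strictly inside the street incur no loss, so they do not appear among the support vectors and do not influence the fitted model.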

To tackle nonlinear regression tasks, you can use a kernelized SVM model.
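
As a sketch, assuming Scikit-Learn and NumPy are installed, the example below compares a linear SVR with a degree-2 polynomial kernel SVR on synthetic quadratic data. The data-generating formula and the values C=10, epsilon=0.3, and coef0=1 are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3                            # inputs in [-3, 3)
y = 2 + 0.5 * X[:, 0] ** 2 + 0.5 * rng.normal(size=200)     # noisy quadratic target

linear_svr = make_pipeline(
    StandardScaler(), SVR(kernel="linear", C=10, epsilon=0.3))
poly_svr = make_pipeline(
    StandardScaler(), SVR(kernel="poly", degree=2, coef0=1, C=10, epsilon=0.3))

# score() returns the coefficient of determination (R^2) on the given data.
linear_r2 = linear_svr.fit(X, y).score(X, y)
poly_r2 = poly_svr.fit(X, y).score(X, y)
print(f"linear R^2={linear_r2:.2f}, polynomial-kernel R^2={poly_r2:.2f}")
```

The linear model cannot capture the curvature, whereas the kernelized model can, without any explicit polynomial feature engineering.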