# Support Vector Machines

A **Support Vector Machine (SVM)** is a very powerful and versatile Machine Learning
model, capable of performing linear or nonlinear classification, regression, and even
outlier detection.

SVMs are particularly well suited for **classification of complex but small- or medium-sized datasets.**

## Linear SVM Classification
In the left plot, the model whose decision boundary is represented by the dashed line is so bad that it
does not even separate the classes properly. The other **two models work perfectly on
this training set, but their decision boundaries come so close to the instances that
these models will probably not perform as well on new instances.**

In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier;
this line not only **separates the two classes but also stays as far away from the
closest training instances as possible.** You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes.
This is called **large margin classification.**

<img src="images/svm_concept.jpg" width='800' />

Notice that adding more training instances “off the street” will not affect the decision
boundary at all: it is fully determined (or “supported”) by the instances located on the
edge of the street. These instances are called the **support vectors** (they are circled in the figure above).

#### SVMs are sensitive to the feature scales.
- On the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal.

- After feature scaling the decision boundary looks much better.

<img src="images/sensitivity_to_feature_scales.jpg" width='800' />




## Soft Margin Classification

If we strictly impose that all instances be off the street and on the right side, this is
called **hard margin classification.** 

Hard margin classification has two main issues:
- It only works if the data is linearly separable.
- It is quite sensitive to outliers.

<img src="images/hard_soft_margin.png" width='600' />

To avoid these issues it is preferable to use a more flexible model. The objective is to
find a good balance between keeping the street as large as possible and limiting the
margin violations (i.e., instances that end up in the middle of the street or even on the
wrong side). This is called **soft margin classification.**

##### C hyperparameter:
- A smaller C value leads to a wider street but more margin violations.
- A higher C value leads to fewer margin violations but a smaller margin.

However, it seems likely that the **first classifier (the one with the smaller C) will generalize better:** in fact even on this training
set it makes fewer prediction errors, since most of the margin violations are
actually on the correct side of the decision boundary.

<img src="images/large_margin_vs_lower_margin.jpg" width='800' />

- If your SVM model is overfitting, you can try regularizing it by reducing C.
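The effect of C on the street width can be checked numerically. The sketch below is only an illustration: it assumes the Iris petal features used later in these notes, picks two arbitrary C values (0.1 and 100), and uses SVC(kernel="linear") because it exposes the weight vector through `coef_`. The street width is computed as 2 / ‖w‖.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

street_widths = {}
for C in (0.1, 100):  # illustrative values, not tuned
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X, y)
    w = clf[-1].coef_[0]                        # weight vector in scaled feature space
    street_widths[C] = 2 / np.linalg.norm(w)    # distance between the two margin lines
    print(f"C={C}: street width = {street_widths[C]:.2f}")
```

The smaller C should report the wider street, which matches the regularization effect described above.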

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


In [2]:
iris = datasets.load_iris()

X = iris['data'][:, (2, 3)]  # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # 1.0 if Iris-Virginica, else 0.0

# Scale the features first: SVMs are sensitive to feature scales
svm_clf = Pipeline([('scaler', StandardScaler()),
                    ('linear_svc', LinearSVC(C=1, loss='hinge'))])

svm_clf.fit(X, y)


Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=1, loss='hinge'))])

In [3]:
svm_clf.predict([[5.5, 1.7]])

array([1.])

- Unlike Logistic Regression classifiers, SVM classifiers do not output
probabilities for each class.
- Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it
is much slower, especially with large training sets.
- Another
option is to use the SGDClassifier class, with **SGDClassifier(loss="hinge",
alpha=1/(m*C))**, where m is the number of training instances. This applies regular Stochastic Gradient Descent to
train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it
can be useful to handle huge datasets that do not fit in memory (out-of-core training),
or to handle online classification tasks.
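As a minimal sketch of the SGDClassifier option, the cell below trains a linear SVM on the same Iris petal features with `alpha = 1 / (m * C)`. The values of `max_iter`, `tol`, and `random_state` are assumptions chosen for reproducibility, not settings taken from these notes.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

m = len(X)  # number of training instances
C = 1       # same C as the LinearSVC example

sgd_clf = make_pipeline(
    StandardScaler(),
    # hinge loss + L2 penalty (the default) gives a linear SVM
    SGDClassifier(loss="hinge", alpha=1 / (m * C),
                  max_iter=1000, tol=1e-3, random_state=42),
)
sgd_clf.fit(X, y)
pred = sgd_clf.predict([[5.5, 1.7]])
print(pred)
```

For out-of-core or online training, SGDClassifier also provides `partial_fit`, which updates the model one mini-batch at a time (the first call needs the `classes` argument).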

## Nonlinear SVM Classification

Adding features to make a dataset linearly separable

<img src="images/adding_features_make_dataset_lin_separable.jpg" width='600' />

The dataset on the left is not linearly separable, but if we add a second feature
$x_2 = (x_1)^2$, the resulting 2D dataset on the right is perfectly linearly separable.
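A tiny numerical illustration of this idea, using a made-up 1D dataset (the values and the threshold 6.25 are hypothetical, chosen only for the demonstration): no single threshold on $x_1$ separates the classes, but a horizontal line in the $(x_1, x_2)$ plane does.

```python
import numpy as np

# Hypothetical 1D dataset: the positive class sits in the middle
x1 = np.linspace(-4, 4, 9)            # -4, -3, ..., 4
y = (np.abs(x1) < 2.5).astype(int)    # 1 for -2..2, else 0

# Add the second feature x2 = x1 ** 2
X2D = np.c_[x1, x1 ** 2]

# In 2D, the horizontal line x2 = 6.25 separates the two classes
threshold = 6.25
pred = (X2D[:, 1] < threshold).astype(int)
print(np.array_equal(pred, y))  # True
```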

In [4]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_svm_clf = Pipeline([("poly_features", PolynomialFeatures(degree=3)),
                               ("scaler", StandardScaler()),
                               ("svm_clf", LinearSVC(C=10, loss="hinge"))])
polynomial_svm_clf.fit(X, y)

Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=10, loss='hinge'))])

## Polynomial Kernel

- Adding polynomial features is simple to implement and can work great with all sorts of ML algorithms, but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
- When using SVMs we can apply an almost miraculous mathematical technique called the **kernel trick**. It makes it possible to
  get the same result as if we added many polynomial features, even with very high-degree polynomials, without actually having to add them.

This trick is implemented by the SVC class.

In [5]:
from sklearn.svm import SVC

poly_kernel_svmclf = Pipeline([('scaler', StandardScaler()),
                              ('svm_clf', SVC(C=5, kernel='poly', degree=3, coef0=1))])

poly_kernel_svmclf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=5, coef0=1, kernel='poly'))])

If our model is overfitting, we might want to reduce the polynomial degree. Conversely, if it is underfitting, we can try increasing it. <br>
The **hyperparameter coef0** controls how much the model is influenced by high-degree
polynomials versus low-degree polynomials.

<img src="images/svm_clf_poly_kernel.jpg" width='600' />

A common approach to find the right hyperparameter values is to
use grid search.
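A grid search over the polynomial-kernel hyperparameters could look like the sketch below. The grid values and `cv=3` are arbitrary assumptions for illustration; a real search would usually cover a wider range.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("svm_clf", SVC(kernel="poly"))])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "svm_clf__degree": [2, 3],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)
```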

## Adding Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a
similarity function that measures how much each instance resembles a particular
landmark.
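As a small sketch of similarity features, the cell below uses the Gaussian RBF, $\exp(-\gamma \lVert x - \ell \rVert^2)$, as the similarity function. The instance, the two landmarks, and $\gamma = 0.3$ are illustrative values, not data from these notes.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity between each row of x and a landmark: exp(-gamma * ||x - landmark||^2)."""
    return np.exp(-gamma * np.linalg.norm(x - landmark, axis=-1) ** 2)

x = np.array([[-1.0]])                  # one 1D instance (illustrative)
landmarks = np.array([[-2.0], [1.0]])   # two landmarks (illustrative)
gamma = 0.3

# One new feature per landmark
features = np.column_stack([gaussian_rbf(x, lm, gamma) for lm in landmarks])
print(features.round(2))  # roughly 0.74 and 0.30
```

The instance is at distance 1 from the first landmark and 2 from the second, so the first similarity feature is larger.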

### Gaussian Radial Basis Function (RBF) Kernel

Just like the polynomial features method, the similarity features method can be useful
with any Machine Learning algorithm, but it may be computationally expensive to
compute all the additional features, especially on large training sets. However, once
again the kernel trick does its SVM magic: it makes it possible to obtain a similar
result as if you had added many similarity features, without actually having to add
them.

In [6]:
rbf_kernel_svm_clf = Pipeline([("scaler", StandardScaler()),
                               ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))])

rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=0.001, gamma=5))])

As a rule of thumb, you should always try the linear
kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")),
especially if the training set is very large or if it
has plenty of features. If the training set is not too large, you should
try the Gaussian RBF kernel as well; it works well in most cases.

### Computational Complexity

<img src="images/complexity_table.png" width='800' />

## SVM Regression

The SVM algorithm also supports linear and nonlinear
regression. The trick is to reverse the objective: instead of trying to fit the largest possible
street between two classes while limiting margin violations, SVM Regression
tries to fit as many instances as possible on the street while limiting margin violations
(i.e., instances off the street). The width of the street is controlled by a hyperparameter
ϵ.

<img src="images/svm_reg.jpg" width='600' />

Adding more training instances within the margin does not affect the model’s predictions;
thus, the model is said to be ϵ-insensitive.
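The ϵ-insensitive behavior can be sketched with the corresponding loss, $\max(0, |y - \hat{y}| - \epsilon)$, which is the default loss of LinearSVR. The function name and the sample values below are hypothetical.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street (|error| <= epsilon), linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 5.0])
y_pred = np.array([1.5, 4.0, 5.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)  # [0.  0.5 0. ]
```

Instances whose error is at most ϵ contribute nothing to the loss, which is why adding more of them does not change the fitted model.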

We can use Scikit-Learn’s LinearSVR class to perform linear SVM Regression.

In [7]:
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

LinearSVR(epsilon=1.5)

To tackle nonlinear regression tasks, we can use a kernelized SVM model. Scikit-Learn’s SVR class (which supports the kernel trick) is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent
of the LinearSVC class. The LinearSVR class scales linearly with the size of the training
set (just like the LinearSVC class), while the SVR class gets much too slow when
the training set grows large (just like the SVC class).

In [10]:
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, degree=2, kernel='poly')

## Under the Hood
### Decision Function and Predictions

The linear SVM classifier model predicts the class of a new instance **x** by simply computing
the decision function $w^T x + b = w_1 x_1 + \cdots + w_n x_n + b$: if the result is positive,
the predicted class **ŷ** is the positive class (1), or else it is the negative class (0).

<img src="images/lin_svm_pred.png" width='400' />
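The prediction rule can be reproduced by hand from a fitted model. The sketch below assumes the Iris petal features from earlier in these notes; `random_state` and `max_iter` are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, [2, 3]])
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

clf = LinearSVC(C=1, loss="hinge", random_state=42, max_iter=10000).fit(X, y)

# Decision function computed manually: w^T x + b
w, b = clf.coef_[0], clf.intercept_[0]
scores = X @ w + b

# Positive score -> positive class (1), otherwise negative class (0)
y_pred = (scores > 0).astype(int)

print(np.allclose(scores, clf.decision_function(X)))
print(np.array_equal(y_pred, clf.predict(X)))
```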



## Training Objective

Consider the slope of the decision function: it is equal to the norm of the weight vector,
$\lVert w \rVert$. If we divide this slope by 2, the points where the decision function is equal
to $\pm 1$ are going to be twice as far away from the decision boundary. In other words,
dividing the slope by 2 will multiply the margin by 2.

<img src="images/w_vs_margin.jpg" width='400' />

The smaller the weight vector w, the larger the margin.
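This relationship is easy to verify: the points where the decision function equals $\pm 1$ lie at distance $1 / \lVert w \rVert$ from the decision boundary. The helper name and the weight vector below are hypothetical.

```python
import numpy as np

def margin_half_width(w):
    """Distance from the decision boundary to the points where w^T x + b = +/-1."""
    return 1.0 / np.linalg.norm(w)

w = np.array([1.0, 0.0])  # illustrative weight vector

print(margin_half_width(w))      # 1.0
print(margin_half_width(w / 2))  # 2.0
```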

So we want to minimize $\lVert w \rVert$ to get a large margin. However, if we also want to avoid
any margin violation (hard margin), then we need the decision function to be greater
than 1 for all positive training instances, and lower than -1 for negative training
instances. If we define $t^i = -1$ for negative instances (if $y^i = 0$) and $t^i = 1$ for positive
instances (if $y^i = 1$), then we can express this constraint as $t^i(w^T x^i + b) \ge 1$ for all
instances.

<img src="images/svm_eq.png" width='400' />

We can therefore express the hard margin linear SVM classifier objective as the constrained
optimization problem


<img src="images/svm_eq_2.jpg" width='600' />

We are minimizing $\frac{1}{2} w^T w$, which is equal to $\frac{1}{2} \lVert w \rVert^2$, rather than
minimizing $\lVert w \rVert$. Indeed, $\frac{1}{2} \lVert w \rVert^2$ has a nice and simple derivative
(it is just $w$), while $\lVert w \rVert$ is not differentiable at $w = 0$. Optimization
algorithms work much better on differentiable functions.

To get the **soft margin** objective, we need to introduce a **slack variable** $\zeta^i \ge 0$ for each
instance. $\zeta^i$ measures how much the ith instance is allowed to violate the margin. We
now have two conflicting objectives: making the slack variables as small as possible to
reduce the margin violations, and making $\frac{1}{2} w^T w$ as small as possible to increase the
margin. This is where the **C hyperparameter** comes in: it allows us to define the trade-off between these two objectives. This gives us the following constrained optimization problem:

<img src="images/svm_eq_3.png" width='600' />
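To make the slack variables concrete, the sketch below recovers them from a fitted soft margin classifier as $\zeta^i = \max(0, 1 - t^i(w^T x^i + b))$ and evaluates $\frac{1}{2} w^T w + C \sum_i \zeta^i$. It assumes the Iris petal features used earlier and C = 1; the 0.01 cutoff used for counting violations is an arbitrary tolerance.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, [2, 3]])
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0
t = 2 * y - 1                          # targets recoded as -1 / +1

C = 1
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack of each instance: how far it is on the wrong side of its margin line
slack = np.maximum(0.0, 1.0 - t * (X @ w + b))
objective = 0.5 * w @ w + C * slack.sum()

print("margin violations:", int((slack > 0.01).sum()))
print("objective value:", round(float(objective), 3))
```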

## Kernelized SVM

Consider the following 2nd-degree polynomial mapping ϕ, applied to a two-dimensional training set. <br>

<img src="images/kernelized.png" width='600' />

Notice that the transformed vector is three-dimensional instead of two-dimensional. <br>
Now let’s look at what happens to a couple of two-dimensional vectors, **a** and **b**, if we
apply this 2nd-degree polynomial mapping and then compute the dot product of the
transformed vectors.

<img src="images/kernelized_2.png" width='600' />

The dot product of the transformed vectors is equal to the square of
the dot product of the original vectors: $ ϕ(a)^T ϕ(b) = (a^T b)^2 $.

So we don’t actually need to transform the training instances at all: just replace the dot
product by its square. The result will be strictly the same as if you
went through the trouble of actually transforming the training set then fitting a linear
SVM algorithm, but this trick makes the whole process much more computationally
efficient. This is the essence of the **kernel trick.**
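The identity $ϕ(a)^T ϕ(b) = (a^T b)^2$ can be checked numerically. The sketch assumes the usual 2nd-degree mapping $ϕ(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, which is three-dimensional as described above; the vectors `a` and `b` are arbitrary.

```python
import numpy as np

def phi(x):
    """Assumed 2nd-degree polynomial mapping from 2D to 3D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = phi(a) @ phi(b)  # dot product in the transformed space
kernel = (a @ b) ** 2       # square of the original dot product

print(explicit, kernel)     # both are 121, up to floating-point rounding
```

The kernel version never builds the 3D vectors, which is where the computational savings come from.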

## Online SVMs

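Online learning means learning incrementally, typically as new instances arrive.
For linear SVM classifiers, one approach is to use Gradient Descent (e.g., with SGDClassifier) to minimize the following cost function: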

<img src="images/loss.png" width='600' />

The first sum in the cost function will push the model to have a small weight vector
$w$, leading to a larger margin. The second sum computes the total of all margin violations.
An instance’s margin violation is equal to 0 if it is located off the street and on
the correct side, or else it is proportional to the distance to the correct side of the
street. Minimizing this term ensures that the model makes the margin violations as
small and as few as possible.
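A minimal sketch of such a cost function, assuming the standard form $J(w, b) = \frac{1}{2} w^T w + C \sum_i \max(0, 1 - t^i(w^T x^i + b))$ (the figure is not reproduced here, so this form is an assumption). The function name and the toy data are hypothetical.

```python
import numpy as np

def linear_svm_cost(w, b, X, t, C):
    """Regularization term plus C times the total hinge loss (margin violations)."""
    margins = t * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # 0 when off the street on the correct side
    return 0.5 * w @ w + C * hinge.sum()

# Toy 1D data: targets are -1 / +1
X = np.array([[2.0], [-2.0], [0.5]])
t = np.array([1, -1, 1])

cost = linear_svm_cost(np.array([1.0]), 0.0, X, t, C=1.0)
print(cost)  # 1.0
```

Only the third instance lies inside the street, so it alone contributes to the second term (0.5), and the first term contributes the remaining 0.5.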

