# Support Vector Machines

## Linear SVM Classification

An SVM is capable of performing linear or nonlinear classification, regression, and even outlier detection.
You can think of an SVM classifier as fitting the widest possible street between the classes.
This is called large margin classification.
SVMs are sensitive to feature scales, so scale the features (for example with StandardScaler) before training.
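
To make the scaling point concrete, here is a minimal sketch of standardizing features before they reach an SVM. The toy array `X_toy` is made up for illustration; only `StandardScaler` comes from the notebook itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: feature 0 ranges over a few units,
# feature 1 over tens of units, so feature 1 would dominate distances.
X_toy = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has zero mean and unit standard deviation.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

With both features on the same scale, neither one dominates the width of the street.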

### Soft Margin Classification

If we strictly require that all instances be off the street and on the correct side, this is called hard margin classification.
It only works if the data is linearly separable, and it is sensitive to outliers.
To avoid these issues, use a more flexible model.
The objective is to find a good balance between keeping the street as wide as possible and limiting margin violations (instances that end up in the middle of the street or on the wrong side). This is called soft margin classification.
The hyperparameter C controls this trade-off: a low value gives a wider street with more margin violations, while a high value gives a narrower street with fewer violations.
If your SVM model is overfitting, you can try regularizing it by reducing C.
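
The effect of C on the street can be checked numerically. The sketch below is an illustration, not part of the original notebook: it uses `SVC(kernel="linear")` instead of `LinearSVC` so that the weight vector is easy to read off, and it assumes the street width is measured as 2 / ||w|| in the scaled feature space. The variable names are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]             # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)   # iris virginica vs. the rest
X_iris_scaled = StandardScaler().fit_transform(X_iris)

margin_widths = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X_iris_scaled, y_iris)
    w = clf.coef_[0]
    # Distance between the two margin boundaries in the scaled space.
    margin_widths[C] = 2 / np.linalg.norm(w)
    print(f"C={C}: street width = {margin_widths[C]:.3f}")
```

The smallest C should report the widest street, which is the regularization effect described above.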

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [5]:
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)] # petal length, petal width
y = (iris['target'] == 2).astype(np.float64) # iris virginica

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C = 1, loss = 'hinge'))
])

svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=1, loss='hinge'))])

In [6]:
svm_clf.predict([[5.5, 1.7]])

array([1.])

### Nonlinear SVM Classification

Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable.
One approach to handling nonlinear datasets is to add more features, such as polynomial features.


In [8]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples = 100, noise = 0.15)
polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree = 3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C = 10, loss = 'hinge'))
])

polynomial_svm_clf.fit(X, y)

Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=10, loss='hinge'))])

### Polynomial Kernel

Adding polynomial features is simple to implement and can work great.
But a low polynomial degree cannot deal with very complex datasets, and a high polynomial degree creates a huge number of features, making the model too slow.
The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them.
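
The idea behind the kernel trick can be verified by hand for a small case. The sketch below assumes a degree-2 polynomial kernel on 2D vectors, K(a, b) = (a · b)^2, and the matching explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). The function name `phi` and the sample vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Route 1: map both vectors to the 3D feature space, then take the dot product.
explicit = phi(a) @ phi(b)

# Route 2: apply the kernel in the original 2D space, no mapping needed.
kernel = (a @ b) ** 2

print(explicit, kernel)  # both are 121.0, up to floating-point rounding
```

Both routes give the same number, but the second never builds the expanded features, which is why high degrees stay cheap.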

In [9]:
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel = 'poly', degree = 3, coef0 = 1, C = 5))
])

poly_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=5, coef0=1, kernel='poly'))])

If your model is overfitting, reduce the polynomial degree.
If your model is underfitting, increase the polynomial degree.
coef0 controls how much the model is influenced by high-degree terms versus low-degree terms.
A common approach to finding the right hyperparameter values is to use grid search.
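
As a rough sketch of such a grid search, the code below tunes `degree`, `coef0`, and `C` for a polynomial-kernel pipeline with `GridSearchCV`. The grid values, `cv=3`, and `random_state=42` are arbitrary choices for illustration, not recommendations from the notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "svm_clf__degree": [2, 3, 4],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_moons, y_moons)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

A coarse grid like this is usually a first pass; a finer grid around the best values can follow.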

### Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a similarity function, which measures how much each instance resembles a particular landmark.
A similarity function assesses how similar two data points are: given two data points, it outputs a similarity score.
Arguably the simplest example is the linear kernel, also called the dot product. Given two vectors, the similarity is the length of the projection of one vector onto the other, multiplied by the length of the other vector.
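
To show what similarity features look like in practice, the sketch below computes them with a Gaussian RBF similarity function, exp(-gamma * ||x - landmark||^2), which is one common choice. The 1D dataset, the two landmarks, and `gamma = 0.3` are made-up values for illustration.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity of each row of x to the landmark; 1 at the landmark, near 0 far away."""
    return np.exp(-gamma * np.linalg.norm(x - landmark, axis=-1) ** 2)

X_1d = np.array([[-1.0], [0.0], [2.0]])   # hypothetical 1D instances
landmarks = np.array([[-2.0], [1.0]])     # hypothetical landmarks
gamma = 0.3

# One new feature per landmark: how much each instance resembles it.
features = np.column_stack([gaussian_rbf(X_1d, lm, gamma) for lm in landmarks])
print(features)
```

Each instance is now described by its similarity to the landmarks instead of its original coordinate, which can make a dataset linearly separable.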

### Gaussian RBF Kernel

Once again the kernel trick does its SVM magic, making it possible to obtain a similar result as if you had added many similarity features.

In [10]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ('svm_clf', SVC(kernel = 'rbf', gamma = 5, C = 0.001))
])

rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=0.001, gamma=5))])

Increasing gamma makes the bell-shaped curve narrower, so each instance's range of influence is smaller.
A small gamma makes the bell-shaped curve wider, so instances have a larger range of influence.
If your model is overfitting, reduce gamma; if it is underfitting, increase it (gamma acts like a regularization hyperparameter, similar to C).
You should always try the linear kernel first (LinearSVC is much faster than SVC(kernel = 'linear')), especially if the training set is very large or if it has plenty of features.
If the training set is not too large, you should also try the Gaussian RBF kernel.
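
The effect of gamma on an instance's range of influence can be seen directly on the kernel values. This sketch evaluates `rbf_kernel` from `sklearn.metrics.pairwise` for two fixed points at distance 2; the points and the gamma values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[2.0, 0.0]])  # squared distance to a is 4

# rbf_kernel computes exp(-gamma * ||a - b||^2).
similarities = {g: rbf_kernel(a, b, gamma=g)[0, 0] for g in (0.1, 1.0, 5.0)}

for g, s in similarities.items():
    print(f"gamma={g}: similarity = {s:.6f}")
```

As gamma grows, the similarity between the same two points collapses toward zero, so each instance only influences its immediate neighborhood.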

### Computational Complexity

The algorithm takes longer if you require very high precision. This is controlled by the tolerance hyperparameter (called tol in Scikit-Learn).

## SVM Regression

The trick is to reverse the objective:
instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations.
The width of the street is controlled by a hyperparameter, epsilon.
Adding more training instances within the margin does not affect the model's predictions; thus, the model is said to be epsilon-insensitive.
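
One way to see why instances inside the street do not matter is to look at the epsilon-insensitive loss, max(0, |y - y_pred| - epsilon), which is zero for any error smaller than epsilon. The sketch below is a plain NumPy illustration with made-up targets and predictions; the function name is hypothetical.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-instance loss: zero inside the street, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 3.0, 6.0])   # absolute errors: 0.2, 1.0, 3.0

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)  # only the last instance lies outside the street
```

The first two instances fall within epsilon of the prediction and contribute nothing, so adding more points like them leaves the fitted model unchanged.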

In [11]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon = 1.5)
svm_reg.fit(X, y)

LinearSVR(epsilon=1.5)

In [13]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel = 'poly', degree = 2, C = 100, epsilon = 0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, degree=2, kernel='poly')

The SVR class is the regression equivalent of the SVC class, and LinearSVR class is the regression equivalent of the LinearSVC class.
The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class), while the SVR class gets much too slow when the training set grows large (just like the SVC class).