In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn

%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Support Vector Machines

SVMs are very popular in machine learning because they are versatile classifiers. They use _large margin classification_, which means separating the classes with the widest possible "street" (margin) between them.

Aurelien notes that SVMs are sensitive to feature scales, so it is advisable to scale the features first, for example with scikit-learn's StandardScaler (or MinMaxScaler).

If we require every instance to be off the street and on the correct side, we get _hard margin classification_. It only works when the data is linearly separable, and it is very sensitive to outliers. Thus, we prefer a more flexible model, one that keeps the street as wide as possible while limiting the number of margin violations (instances that end up in the middle of the street or on the wrong side). This is _soft margin classification_.

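The hyperparameter that controls this balance is **C**. A smaller **C** gives a wider street with more margin violations; a larger **C** gives a narrower street with fewer violations. We can reduce **C** if our model is overfitting.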
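
To make the effect of **C** concrete, here is a minimal sketch that fits the same linear SVM with a small and a large **C** and compares the street width, computed as 2 / ||w|| in the scaled feature space. The helper `street_width` and the two **C** values are my own choices for illustration, not something from the book.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]  # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

def street_width(C):
    """Hypothetical helper: fit a scaled linear SVM and return 2 / ||w||."""
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC(C=C, loss="hinge", dual=True,
                                 max_iter=100000, random_state=42)),
    ])
    clf.fit(X_iris, y_iris)
    w = clf.named_steps["linear_svc"].coef_[0]
    return 2 / np.linalg.norm(w)

width_small_C = street_width(0.01)  # strong regularization
width_large_C = street_width(100)   # weak regularization
print("street width with C=0.01:", width_small_C)
print("street width with C=100: ", width_large_C)
```

The small-**C** model should report the wider street, which is the "more regularized" end of the trade-off.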

In [2]:
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42))
])

svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=42, tol=0.0001, verbose=0))])

In [3]:
svm_clf.predict([[5.5, 1.7]])

array([ 1.])

Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
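
What a linear SVM does give you is a signed score per instance through `decision_function`; the predicted class is just the sign of that score. A small sketch, rebuilding the same pipeline so that it runs on its own (the variable names are mine):

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]  # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=True,
                             max_iter=10000, random_state=42)),
])
clf.fit(X_iris, y_iris)

scores = clf.decision_function(X_iris)  # signed scores, not probabilities
preds = clf.predict(X_iris)             # 1.0 where the score is positive

# LinearSVC has no predict_proba method at all
has_proba = hasattr(clf.named_steps["linear_svc"], "predict_proba")
print(scores[:3], preds[:3], has_proba)
```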

from sklearn.svm import SVC

You can also use the regular SVC class with kernel="linear" and C=1, but it is much slower, especially with large training sets.

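Another option is SGDClassifier(loss="hinge", alpha=1/(m*C)), which uses stochastic gradient descent to train a linear SVM classifier (here _m_ is the number of training instances). It does not converge as fast as LinearSVC, but it is useful for large datasets that don't fit in memory (out-of-core training).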
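
A minimal sketch of both alternatives on the same iris features, assuming C=1 as above. The pipeline and variable names are mine; the only part taken from the notes is the `alpha = 1 / (m * C)` relationship.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]  # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

m = len(X_iris)  # number of training instances
C = 1

# SVC with a linear kernel: same kind of model, slower on large datasets
svc_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear", C=C)),
])

# SGDClassifier with hinge loss: a linear SVM trained by SGD
sgd_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42)),
])

svc_clf.fit(X_iris, y_iris)
sgd_clf.fit(X_iris, y_iris)

sample = [[5.5, 1.7]]
print(svc_clf.predict(sample), sgd_clf.predict(sample))
```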

Aurelien makes further suggestions on how to improve the use of this classifier in a note.

## Nonlinear SVM Classification

Much like regression models, SVMs don't have to be limited to linear decision boundaries. In fact, many datasets are not even close to linearly separable. One way to handle this is to add more features: a dataset that cannot be separated with its original features may become separable once polynomial features are added, so we can use PolynomialFeatures just as in polynomial regression.
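
As a toy illustration of why adding features helps, here is a one-feature dataset (made up by me) that no single threshold on x can separate, but that becomes separable once x² is added:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1D data: the inner points are class 1, the outer points class 0
X1D = np.array([[-4.0], [-2.0], [0.0], [2.0], [4.0]])
y1D = np.array([0, 1, 1, 1, 0])

# Adds the column x^2 next to x (no bias column)
X2D = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X1D)
print(X2D)

# In the new space, the horizontal line x^2 = 10 separates the classes
separable = np.array_equal(X2D[:, 1] < 10, y1D == 1)
print("separable with x^2 < 10:", separable)
```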

In [4]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_smv_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("smv_clf", LinearSVC(C=10, loss="hinge", random_state=42))
])

polynomial_smv_clf.fit(X, y)

Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('smv_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=42, tol=0.0001, verbose=0))])

We can use a polynomial kernel in the SVC class to get the same result as if we had added the polynomial features with PolynomialFeatures, without actually having to add them. This is known as the _kernel trick_. It matters because explicitly adding polynomial features has a drawback at both ends: a low polynomial degree can't deal with very complex datasets, and a high polynomial degree creates a huge number of features, which makes the model slow.
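
The idea behind the kernel trick can be checked numerically on a tiny case. This is the standard textbook illustration for a degree-2 polynomial mapping of 2D vectors, not code from the book: the dot product of the mapped vectors equals the squared dot product of the original vectors, so the mapping never has to be computed.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial mapping of a 2D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)  # map to 3D first, then take the dot product
kernel = (a @ b) ** 2       # stay in 2D: square the original dot product

print(explicit, kernel)
```

Both values agree, which is why the SVC can work with the kernel function alone.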

To reduce overfitting, lower the polynomial degree; if the model is underfitting, raise it. The coef0 parameter controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

Use grid search to find good hyperparameter values. Start with a coarse search, then run a finer search around the best values found. It also helps to understand what each hyperparameter does so that you search in the right part of the hyperparameter space.
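
A rough sketch of the coarse-then-fine idea with GridSearchCV on the moons data. The grids below are arbitrary values I picked for illustration, and the refinement only varies **C** around the coarse winner to keep the example short.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Coarse search: values spread far apart
coarse_grid = {
    "svm_clf__degree": [2, 3, 4],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 1, 10],
}
coarse = GridSearchCV(pipe, coarse_grid, cv=3)
coarse.fit(X_moons, y_moons)

# Fine search: keep the best degree and coef0, zoom in around the best C
best = coarse.best_params_
fine_grid = {
    "svm_clf__degree": [best["svm_clf__degree"]],
    "svm_clf__coef0": [best["svm_clf__coef0"]],
    "svm_clf__C": [best["svm_clf__C"] / 2, best["svm_clf__C"], best["svm_clf__C"] * 2],
}
fine = GridSearchCV(pipe, fine_grid, cv=3)
fine.fit(X_moons, y_moons)

print("coarse:", coarse.best_params_, coarse.best_score_)
print("fine:  ", fine.best_params_, fine.best_score_)
```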

In [5]:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

Another technique is to add features computed with a _similarity function_, which measures how much each instance resembles a particular _landmark_ (a chosen point in feature space, for example one particular instance). In this case we'll use the Gaussian _radial basis function_ (RBF). I can't find a very intuitive explanation on the internet, but from putting two and two together: the function is a bell curve centered on the landmark, so the similarity is 1 at the landmark and falls toward 0 the farther the instance is from it. The similarities are not class decisions by themselves; they become new features, and the classifier is then trained on those features.

He defines this function, but I do not think it presents any value to replicate it here.

The simplest way to choose landmarks is to add one at the location of each and every instance in the dataset. This creates many dimensions and so increases the chance that the transformed data will become linearly separable. But this becomes very computationally expensive: a training set with _m_ instances and _n_ features becomes a set with _m_ instances and _m_ features (assuming the original features are dropped).
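
For reference, a minimal NumPy sketch of the landmark approach, assuming the Gaussian RBF has the usual form exp(-gamma * ||x - landmark||²). The function name and the gamma value are my own choices. It shows the feature count growing from _n_ to _m_ when every instance is used as a landmark.

```python
import numpy as np
from sklearn.datasets import make_moons

def gaussian_rbf_features(X, landmarks, gamma):
    """Similarity of every row of X to every landmark: exp(-gamma * squared distance)."""
    diffs = X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X_moons, _ = make_moons(n_samples=100, noise=0.15, random_state=42)

# One landmark per training instance
X_sim = gaussian_rbf_features(X_moons, landmarks=X_moons, gamma=0.3)

print(X_moons.shape, "->", X_sim.shape)  # n features -> m features
```

Each instance has similarity 1 to its own landmark, so the diagonal of `X_sim` is all ones.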

Just like the polynomial kernel, there is also a Gaussian RBF kernel trick.

In [7]:
rbf_kernel_svm_clf = Pipeline((
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
))
rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))

Here we used the gamma hyperparameter, which controls how narrow the bell curve is around each instance. The smaller the value, the wider each instance's range of influence; conversely, the larger the value, the smaller its range of influence and the more irregular the decision boundary. So gamma acts like a regularization hyperparameter: if you suspect your model is overfitting, try reducing it, and if it is underfitting, try increasing it.
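
A quick numeric check of that claim, assuming the kernel value is exp(-gamma * distance²): at a fixed distance from an instance, a larger gamma leaves much less similarity, which means a narrower bell curve. The distance and gamma values are arbitrary.

```python
import numpy as np

distance = 1.0  # fixed distance from the instance at the center of the bell curve

similarities = {}
for gamma in (0.1, 5):
    similarities[gamma] = np.exp(-gamma * distance ** 2)
    print("gamma =", gamma, "similarity at distance 1:", similarities[gamma])
```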

Other kernels are available, but they have niche uses like classifying text documents or DNA sequences.

As a rule of thumb, start with the linear kernel (LinearSVC is much faster than SVC with kernel="linear"), especially if there are many instances or many features. If the training set is not too large, try the Gaussian RBF kernel as well. After that, cross-validation and grid search can help narrow down whether another kernel works better.
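
One way to follow this rule of thumb is to cross-validate a linear model and an RBF model side by side. A small sketch on the moons data; the hyperparameter values are untuned placeholders of mine.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

candidates = {
    "linear": LinearSVC(C=1, loss="hinge", dual=True, max_iter=10000, random_state=42),
    "rbf": SVC(kernel="rbf", gamma=1, C=1),
}

mean_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("svm_clf", model)])
    scores = cross_val_score(pipe, X_moons, y_moons, cv=5)  # accuracy per fold
    mean_scores[name] = scores.mean()
    print(name, mean_scores[name])
```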

A helpful table from the book:


In [8]:
pd.DataFrame(
    {'Class': ['LinearSVC', 'SGDClassifier', 'SVC'],
     'Time complexity': ['O(m x n)', 'O(m x n)', 'O(m^2 x n) - O(m^3 x n)'],
     'Out-of-core support': ['No', 'Yes', 'No'],
     'Scaling required': ['Yes', 'Yes', 'Yes'],
     'Kernel trick': ['No', 'No', 'Yes']
    }
)

|   | Class | Kernel trick | Out-of-core support | Scaling required | Time complexity |
|---|---|---|---|---|---|
| 0 | LinearSVC | No | No | Yes | O(m x n) |
| 1 | SGDClassifier | No | Yes | Yes | O(m x n) |
| 2 | SVC | Yes | No | Yes | O(m^2 x n) - O(m^3 x n) |
