
Overview

Support Vector Machines (SVM) classify data by finding the best hyperplane that separates all data points of one class from those of another class. Generally, the best hyperplane for an SVM is the one with the largest margin between the two classes. The margin (often represented by epsilon) is determined by the support vectors, which are the data points closest to the hyperplane, lying on the lines parallel to the hyperplane at epsilon distance. Therefore, the margin is the maximal width, parallel to the hyperplane, that contains no interior data points.

Today, support vector machines are used to model complex, real-world problems such as text and image classification, handwriting recognition, and bioinformatics (or biosequence) analysis.

Note: when the space has n dimensions, the hyperplane has n-1 dimensions. For a trivial example, given a line, pick any point. The chosen point divides the line into two parts. Now, in a two-dimensional plane, pick any line. The chosen line divides the plane into two parts.

Inseparable Data

Though supplied data is often assumed to be separable with a hyperplane, there are many cases where it is not. For example, consider a circle in a two-dimensional space, with dots along the perimeter in alternating colors (red, blue). There is no hyperplane (in two dimensions) that can separate the dots.

Now, consider another (simpler) inseparable example: a one-dimensional line with two classes of dots (red, blue) that no single threshold can split. In a one-dimensional space, the hyperplane is a single point, which cannot separate such dots. However, if the line is projected into a two-dimensional space, f: x -> (x, x^2), a separating hyperplane (a line) can be determined. Therefore, the projected (higher-dimensional) space can be used by the support vector machine to determine a separating hyperplane. The function (f) responsible for the transformation is the feature map, and it gives rise to the kernel function discussed below.
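
As a minimal sketch (the sample points below are assumptions, not part of the original example), the projection idea can be verified directly: points that cannot be split by a single threshold in one dimension become linearly separable after mapping each point x to (x, x^2):

import numpy as np
from sklearn import svm

# one-dimensional points: no single threshold separates the two classes
x = np.array([-3.0, -1.0, 1.0, 3.0])
y = np.array([1, 0, 0, 1])

# project into two dimensions with f: x -> (x, x^2)
X_2d = np.column_stack((x, x ** 2))

# a linear SVM can now find a separating hyperplane (a horizontal line)
clf = svm.SVC(kernel='linear')
clf.fit(X_2d, y)
print(clf.predict(X_2d))  # [1 0 0 1]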

Kernel Functions

Support vector machines generally implement kernel functions to transform a given vector space to a higher dimension. More specifically, a kernel is the inner product of two feature vectors in the higher-dimensional space being projected into. Usually, this process (as discussed earlier) resolves cases of inseparable data.
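
As a hedged illustration (the explicit feature map below is an assumption chosen for the degree-2 polynomial kernel, not a library API), the kernel value K(x, y) = (x . y)^2 can be checked against the inner product of the projected feature vectors:

import numpy as np

def poly2_kernel(x, y):
    # degree-2 polynomial kernel, evaluated in the original space
    return np.dot(x, y) ** 2

def feature_map(x):
    # explicit projection for the degree-2 kernel on 2-d inputs:
    # (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# both quantities are identical: the kernel is the inner product
# of the two feature vectors in the higher-dimensional space
print(poly2_kernel(x, y))                      # 121.0
print(np.dot(feature_map(x), feature_map(y)))  # 121.0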

The scikit-learn library always defines a default kernel function. For example, the SVC class, by default, implements the rbf kernel.

Note: this wiki assumes the use of the SVC class. However, scikit-learn provides additional classes, including (non-exhaustively) NuSVC and LinearSVC, discussed further below.

Though the scikit-learn SVM classes define a default kernel, another kernel can always be specified explicitly. For example, the following kernels are supported within the library:

  • linear: svc=svm.SVC(kernel='linear')
  • polynomial: requires a polynomial degree, svc=svm.SVC(kernel='poly', degree=3)
  • radial basis function (rbf): default for SVC class, svc=svm.SVC(kernel='rbf')
  • sigmoid: function having an "S" shaped curve, svc=svm.SVC(kernel='sigmoid')
  • custom: a custom kernel can be defined either via a python function, or by precomputing the Gram matrix

In general, each of the above kernels allows additional keyword arguments. But, in most cases, since the SVC class defines default values, the additional arguments can be omitted.
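
A minimal sketch of instantiating the kernels above with explicit keyword arguments (the toy data and the specific parameter values are assumptions for illustration; the defaults could be used instead):

from sklearn import svm

# toy training data (assumed for illustration)
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# explicit keyword arguments; in most cases the defaults suffice
linear_svc = svm.SVC(kernel='linear', C=1.0)
poly_svc = svm.SVC(kernel='poly', degree=3, coef0=1.0)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.5)

for clf in (linear_svc, poly_svc, rbf_svc):
    clf.fit(X, y)
    print(clf.kernel, clf.predict([[0.5, 0.5], [2.5, 2.5]]))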

Note: a kernel simply transforms the vector space, while the SVM determines the hyperplane(s) by finding the support vectors with the largest epsilon distance (margin) to the hyperplane.

Note: implementing a kernel (other than linear) is what the term kernel trick refers to.

Custom Kernels

Python function used for a custom SVM kernel:

import numpy as np
from sklearn import svm

def my_kernel(x, y):
    # linear kernel: the inner product of the two argument matrices
    return np.dot(x, y.T)

...
# X, Y are the training samples and labels (elided above)
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

Note: y.T gives the transpose of y, and is required to define the above linear kernel (equivalent to kernel='linear').
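
As a hedged usage sketch (the sample data is an assumption), the custom function above should give the same predictions as the built-in linear kernel:

import numpy as np
from sklearn import svm

def my_kernel(x, y):
    return np.dot(x, y.T)

X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
Y = np.array([0, 0, 1, 1])

custom = svm.SVC(kernel=my_kernel).fit(X, Y)
builtin = svm.SVC(kernel='linear').fit(X, Y)

# both classifiers agree on the training samples
print(custom.predict(X))   # e.g. [0 0 1 1]
print(builtin.predict(X))  # e.g. [0 0 1 1]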

An alternative to using a python function for a custom SVM kernel is to use the Gram matrix. A Gram matrix can often be thought of as a matrix of pairwise similarities, with the kernel being the similarity measure. Specifically, the Gram matrix has the dimension n_samples x n_samples, where every entry is the result of the kernel function applied to two samples.

In the context of SVM, the Gram matrix can be represented as G_ij = K(X_i, X_j). A specific example of the Gram matrix for a custom SVM kernel can be seen as follows:

import numpy as np
from sklearn import svm

X = np.array([[0, 0], [1, 1]])
y = [0, 1]
clf = svm.SVC(kernel='precomputed')

# linear kernel computation: gram[i, j] = np.dot(X[i], X[j])
gram = np.dot(X, X.T)
clf.fit(gram, y)

# predict on training examples (pass the Gram matrix, not X)
clf.predict(gram)
# array([0, 1])

This means the np.dot method is the kernel function, while X and X.T are the corresponding kernel arguments. It is important to know that the two points (arguments) to the kernel function are needed to compute the inner product, since the kernel is a function of two data points. However, the arguments need not be transposes of one another (except in the linear kernel case). Also, other inner-product implementations could be used instead of the standard inner-product function, np.dot.

Note: the general definition of a Gram matrix is G = K(X, X.T), where X.T is the transpose of X, and K is the inner-product operation applied to X and X.T.

Note: when using the Gram matrix, remember to set kernel='precomputed', and pass the Gram matrix instead of X in the fit method.
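
One practical detail, shown as a hedged sketch (the test data below is an assumption): when predicting on new samples with a precomputed kernel, the matrix passed to predict must contain the kernel values between the test samples and the training samples:

import numpy as np
from sklearn import svm

X_train = np.array([[0, 0], [1, 1]])
y_train = [0, 1]
X_test = np.array([[0.1, 0.2], [0.9, 0.8]])

clf = svm.SVC(kernel='precomputed')
clf.fit(np.dot(X_train, X_train.T), y_train)

# kernel between test and training samples: shape (n_test, n_train)
gram_test = np.dot(X_test, X_train.T)
print(clf.predict(gram_test))  # e.g. [0 1]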

Kernel Suggestion

Sometimes choosing a kernel is not trivial, and can be a difficult choice. For these cases, the scikit-learn library provides the Grid Search functionality, which essentially performs kernel suggestion. However, the implementation requires various ranges (i.e. kernels, C, gamma, degree) to be explicitly defined, so that each combination can be tested and a suggestion provided.
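
A minimal sketch, assuming the GridSearchCV interface from sklearn.model_selection and the iris dataset; the parameter ranges below are arbitrary choices, not recommendations:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# candidate kernels and their hyperparameter ranges (arbitrary values)
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
    {'kernel': ['poly'], 'C': [0.1, 1, 10], 'degree': [2, 3]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

# the "suggested" kernel and hyperparameters
print(search.best_params_)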

Classification

SVMs were originally designed for binary classification. However, there are two general approaches to implement multi-class classification. The first is to combine the results of multiple binary classifiers:

  • one-against-one (OAO)
  • one-against-all (OAA)

The second approach is to treat the overall problem as a single optimization problem:

  • Crammer Singer

In general, the scikit-learn library implements multi-class classification as a series of binary classifiers. Specifically, the one-against-one approach is used for the SVC and NuSVC classes. The LinearSVC class, on the other hand, can implement either the one-against-all or the Crammer Singer strategy.
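
A minimal sketch of the corresponding constructor options (the iris data is assumed for illustration; decision_function_shape and multi_class are standard scikit-learn parameters):

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# SVC/NuSVC: one-against-one under the hood;
# decision_function_shape only changes the shape of decision_function
ovo_clf = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)

# LinearSVC: one-against-all ...
ova_clf = svm.LinearSVC(multi_class='ovr').fit(X, y)

# ... or the Crammer-Singer single-optimization formulation
cs_clf = svm.LinearSVC(multi_class='crammer_singer').fit(X, y)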

Classifiers

Classifiers are used to predict the class of samples from a trained data set. As discussed earlier, an SVM is a binary classifier, and can only distinguish between two classes. However, multi-class classification is simply a series of binary classifiers. Depending on the approach (i.e. OAO, OAA, Crammer Singer), the number of classifiers varies.

Specifically, the OAO approach has n_class * (n_class - 1) / 2 classifiers, where each one trains on data from two classes. However, the OAA approach has the same number of classifiers as classes, except in the trivial binary case (two classes), which consists of a single classifier.
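
As a quick illustration (the four-class toy data below is an assumption), the two counts can be observed from the shape of decision_function:

import numpy as np
from sklearn import svm

# toy data with 4 classes (assumed for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [2, 2], [2, 3], [3, 2], [3, 3]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# OAO: 4 * (4 - 1) / 2 = 6 pairwise classifiers
ovo_clf = svm.SVC(decision_function_shape='ovo').fit(X, y)
print(ovo_clf.decision_function(X).shape)  # (8, 6)

# OAA: one classifier per class, i.e. 4
ovr_clf = svm.LinearSVC(multi_class='ovr').fit(X, y)
print(ovr_clf.decision_function(X).shape)  # (8, 4)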