# Support Vector Machines

[Support Vector Machines](https://en.wikipedia.org/wiki/Support_vector_machine), or SVMs for short, are supervised binary classification algorithms suitable for numerical data.

## The setup

Our dataset $\{x^{(i)}\}$ consists of points in a vector space $\mathbb{R}^n$; in other words, each data point $x^{(i)}$ has $n$ components.  We also have a binary labelling scheme: each data point $x^{(i)}$ carries a label $y^{(i)}$ that takes one of two values. For convenience, we choose these values to be $+1$ and $-1$.

![img](images/svm.png)

In the simplest version of the setup, the data is linearly separable, i.e. there is a hyperplane that sits between the class of points labelled $+1$ and the class of points labelled $-1$.  Our aim is to find such a hyperplane with *the largest margin*.  The margin is the width of the empty band around the hyperplane that fits between the two classes of points.

## Some linear algebra

Recall from linear algebra that each hyperplane is determined by a normal vector $\mathbf{w}$ and an offset $b$; the hyperplane sits at signed distance $-b/\|\mathbf{w}\|$ from the origin, measured along $\mathbf{w}$.  The set of points on the hyperplane determined by $\mathbf{w}$ and $b$ is given by

$$ \mathbf{w}\cdot\mathbf{x} + b = 0 $$

This hyperplane splits the rest of the space into two disjoint open half-spaces, one on each side of the hyperplane.  More importantly for us, each side is determined by the sign of the displaced inner product:

$$ H_+ = \{ \mathbf{x}\in\mathbb{R}^n\mid \mathbf{w}\cdot\mathbf{x} + b > 0 \} $$
and
$$ H_- = \{ \mathbf{x}\in\mathbb{R}^n\mid \mathbf{w}\cdot\mathbf{x} + b < 0 \} $$
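As a small illustration of the two half-spaces, the sketch below evaluates the sign of $\mathbf{w}\cdot\mathbf{x}+b$ for a few points. The hyperplane ($\mathbf{w}=(1,1)$, $b=-1$) and the points are made-up values chosen only for this example.

```python
import numpy as np

# Hypothetical hyperplane in R^2: w . x + b = 0 with w = (1, 1) and b = -1.
w = np.array([1.0, 1.0])
b = -1.0

# Made-up test points.
points = np.array([
    [2.0, 2.0],   # w.x + b =  3, so the point lies in H_+
    [0.0, 0.0],   # w.x + b = -1, so the point lies in H_-
    [0.5, 0.5],   # w.x + b =  0, so the point lies on the hyperplane
])

scores = points @ w + b                    # displaced inner products
sides = np.sign(scores)                    # +1 for H_+, -1 for H_-, 0 on the hyperplane
signed_dist = scores / np.linalg.norm(w)   # signed Euclidean distance to the hyperplane

print(sides)
print(signed_dist)
```

Dividing the score by $\|\mathbf{w}\|$ turns it into a signed distance, which is the quantity the margin is measured in.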


## The optimization problem

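From the argument above, a hyperplane separates the two classes exactly when every point labelled $+1$ lies in $H_+$ and every point labelled $-1$ lies in $H_-$. In other words, $\mathbf{w}$ and $b$ need to satisfy the following constraint for every point in our dataset: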

$$ y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)}+b) > 0 $$

Notice that if $\mathbf{w}$ and $b$ satisfy the constraint above, then for every $\lambda>0$ we also have

$$ y^{(i)}(\lambda\mathbf{w}\cdot\mathbf{x}^{(i)}+\lambda b) > 0 $$

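So the pair of parameters $\mathbf{w}$ and $b$ is determined only up to a positive scaling. We remove this freedom by choosing a *normalization*: we rescale $\mathbf{w}$ and $b$ so that the points closest to the hyperplane satisfy the constraint with equality, which gives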

$$ y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)}+b) \geq 1 $$

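With this normalization, the closest points on either side lie on the hyperplanes $\mathbf{w}\cdot\mathbf{x}+b=\pm 1$, which are a distance $2/\|\mathbf{w}\|$ apart. Maximizing the margin is therefore the same as minimizing $\|\mathbf{w}\|$ subject to the condition we gave above.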

[Here](http://fourier.eng.hmc.edu/e161/lectures/svm/node1.html) is another good mathematical explanation of how and why SVM works.
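The normalized constraint can be checked numerically on a tiny example. The sketch below uses four made-up points and approximates the hard-margin problem by giving `svm.SVC` a very large `C`; that choice of `C` is an assumption of this sketch, not something the derivation prescribes.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 2.0]])
y = np.array([-1, -1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the hard-margin problem.
clf = svm.SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the fitted hyperplane
b = clf.intercept_[0]   # offset of the fitted hyperplane

functional_margins = y * (X @ w + b)      # y_i (w . x_i + b), expected to be >= 1 up to solver tolerance
margin_width = 2.0 / np.linalg.norm(w)    # distance between the hyperplanes w . x + b = +1 and -1

print(functional_margins)
print(margin_width)
```

For this data the two classes sit on the lines $x_2=0$ and $x_2=2$, so the widest possible gap is 2 and the reported margin width should be close to that value.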

## An example

Let us look at the [Sonar Dataset](http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks)) from the [UCI data repository](http://archive.ics.uci.edu/ml/index.php).

In [15]:
import pandas as pd
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data",sep=",",header=None)
data.head(10)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.009 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.018 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.015 | 0.0085 | 0.0073 | 0.005 | 0.0044 | 0.004 | 0.0117 | R |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.011 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | R |
| 5 | 0.0286 | 0.0453 | 0.0277 | 0.0174 | 0.0384 | 0.099 | 0.1201 | 0.1833 | 0.2105 | 0.3039 | ... | 0.0045 | 0.0014 | 0.0038 | 0.0013 | 0.0089 | 0.0057 | 0.0027 | 0.0051 | 0.0062 | R |
| 6 | 0.0317 | 0.0956 | 0.1321 | 0.1408 | 0.1674 | 0.171 | 0.0731 | 0.1401 | 0.2083 | 0.3513 | ... | 0.0201 | 0.0248 | 0.0131 | 0.007 | 0.0138 | 0.0092 | 0.0143 | 0.0036 | 0.0103 | R |
| 7 | 0.0519 | 0.0548 | 0.0842 | 0.0319 | 0.1158 | 0.0922 | 0.1027 | 0.0613 | 0.1465 | 0.2838 | ... | 0.0081 | 0.012 | 0.0045 | 0.0121 | 0.0097 | 0.0085 | 0.0047 | 0.0048 | 0.0053 | R |
| 8 | 0.0223 | 0.0375 | 0.0484 | 0.0475 | 0.0647 | 0.0591 | 0.0753 | 0.0098 | 0.0684 | 0.1487 | ... | 0.0145 | 0.0128 | 0.0145 | 0.0058 | 0.0049 | 0.0065 | 0.0093 | 0.0059 | 0.0022 | R |
| 9 | 0.0164 | 0.0173 | 0.0347 | 0.007 | 0.0187 | 0.0671 | 0.1056 | 0.0697 | 0.0962 | 0.0251 | ... | 0.009 | 0.0223 | 0.0179 | 0.0084 | 0.0068 | 0.0032 | 0.0035 | 0.0056 | 0.004 | R |


In [3]:
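# Linear SVM on the sonar data: columns 0-59 hold the features, column 60 holds the label.
classifier = svm.SVC(kernel='linear')
xs = data.iloc[:,0:60]
ys = data.iloc[:,60]

# Hold out 25% of the rows for testing. The split is random, so results vary between runs.
train_xs, test_xs, train_ys, test_ys = train_test_split(xs,ys,test_size=0.25)
classifier.fit(train_xs,train_ys)

# Rows of the confusion matrix are true labels, columns are predicted labels.
predicted_ys = classifier.predict(test_xs)
confusion_matrix(test_ys,predicted_ys)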

array([[25,  3],
       [ 6, 18]])

## Soft margin

In some cases the data may not be perfectly linearly separable:

![img](images/svm2.png)

and we would like to allow some points to lie inside the margin. We call this an SVM classifier with *a soft margin*.  In mathematical terms, we do not strictly insist on the condition

$$ y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)}+b) \geq 1 $$

and allow points to transgress the boundary with some error $\xi_i\geq 0$

$$ y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)}+b) \geq 1-\xi_i $$

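This time we minimize

$$ \|\mathbf{w}\| + C \sum_{i=1}^N \xi_i^2 $$

where $N$ is the number of data points and $C$ is a hyper-parameter we tune for the application at hand: a larger $C$ penalizes margin violations more heavily. Note that `svm.SVC`, which we use below, minimizes the closely related objective $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i$.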

Please read [these lecture notes](http://fourier.eng.hmc.edu/e161/lectures/svm/node5.html) and [sklearn SVM with soft margins](https://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html) from the [sklearn documentation](https://scikit-learn.org/stable/user_guide.html).
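To see the effect of $C$, the sketch below fits a linear SVM twice on the same overlapping synthetic clusters (generated with `make_blobs`, not the sonar data) and compares the margin width $2/\|\mathbf{w}\|$ and the total slack. The values `C=0.01` and `C=100` and the helper `fit_and_measure` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Two overlapping synthetic clusters, so no hyperplane separates them perfectly.
X, y01 = make_blobs(n_samples=200, centers=[[-1.0, -1.0], [1.0, 1.0]],
                    cluster_std=1.0, random_state=0)
y = 2 * y01 - 1  # relabel {0, 1} as {-1, +1}

def fit_and_measure(C):
    """Hypothetical helper: fit a linear SVM, return (margin width, total slack)."""
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    # Slack of each point: xi_i = max(0, 1 - y_i (w . x_i + b)).
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 2.0 / np.linalg.norm(w), slack.sum()

margin_small_C, slack_small_C = fit_and_measure(0.01)
margin_large_C, slack_large_C = fit_and_measure(100.0)

print("C=0.01:", margin_small_C, slack_small_C)
print("C=100: ", margin_large_C, slack_large_C)
```

A small $C$ tolerates more violations in exchange for a wider margin, while a large $C$ narrows the margin to reduce the total slack.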

In [4]:
# C sets the penalty for margin violations (the sklearn default is C=1.0).
classifier_sm = svm.SVC(kernel='linear', C=2.05)

train_xs, test_xs, train_ys, test_ys = train_test_split(xs,ys,test_size=0.25)
classifier_sm.fit(train_xs,train_ys)

predicted_ys = classifier_sm.predict(test_xs)
confusion_matrix(test_ys,predicted_ys)

array([[20,  4],
       [ 7, 21]])

## SVM with different kernels

It is also possible that the data is not linearly separable at all, but points become linearly separable when we map them to a higher dimensional space:

![img](images/svm3.png)
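A minimal sketch of this idea, assuming synthetic concentric circles from `make_circles` and a hand-picked embedding that appends the squared radius as a third coordinate (both are choices made for this illustration only):

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles

# Synthetic concentric circles: no straight line separates the two classes in R^2.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Hypothetical explicit embedding R^2 -> R^3: (x1, x2) -> (x1, x2, x1^2 + x2^2).
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])

# Training accuracy of a linear SVM before and after the embedding.
acc_raw = svm.SVC(kernel='linear').fit(X, y).score(X, y)
acc_lifted = svm.SVC(kernel='linear').fit(X_lifted, y).score(X_lifted, y)

print("linear SVM in R^2:", acc_raw)
print("linear SVM in R^3:", acc_lifted)
```

In the lifted space the inner and outer circles sit at different heights along the third coordinate, so a horizontal plane can separate them.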

Instead of looking for a suitable higher dimensional space and an embedding function, we play with the inner product function and replace it with a suitable *kernel*.

We did not mention it above, but the inner product we used in the computations above is the standard *Euclidean* inner product of two vectors:

$$ \mathbf{x}\cdot\mathbf{y} = \sum_{i=1}^n x_iy_i $$

We replace this inner product with a kernel $k(\mathbf{x},\mathbf{y})$. Some of the useful kernels we use are

* linear kernels (ordinary inner products) $\mathbf{x}\cdot\mathbf{y}$
* polynomial kernels $(\mathbf{x}\cdot\mathbf{y}+r)^d$
* sigmoid kernels $\tanh(\mathbf{x}\cdot\mathbf{y}+r)$
* radial basis functions $\exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)$

Please read the [*kernels*](https://scikit-learn.org/stable/modules/svm.html#svm-kernels) section of the sklearn documentation to see how these are used.
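The kernels in the list above can be written out directly in NumPy and compared against the helpers in `sklearn.metrics.pairwise`. The data and the parameter values below are made up. Note that sklearn's polynomial and sigmoid kernels also scale the inner product by `gamma` (setting `gamma=1` recovers the unscaled forms), and that its RBF kernel uses the squared distance.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 made-up points in R^3
Y = rng.normal(size=(4, 3))   # 4 made-up points in R^3

gamma, r, d = 0.5, 1.0, 3     # illustrative parameter values

inner = X @ Y.T                                                 # x . y for every pair
sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)    # ||x - y||^2 for every pair

K_linear = inner
K_poly = (gamma * inner + r) ** d
K_sigmoid = np.tanh(gamma * inner + r)
K_rbf = np.exp(-gamma * sq_dist)

# Compare the hand-written kernels with sklearn's implementations.
print(np.allclose(K_linear, linear_kernel(X, Y)))
print(np.allclose(K_poly, polynomial_kernel(X, Y, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_sigmoid, sigmoid_kernel(X, Y, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(X, Y, gamma=gamma)))
```

In `svm.SVC` the same parameters appear as `gamma`, `coef0` (the constant $r$) and `degree` (the exponent $d$).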

In [5]:
# RBF kernel: gamma controls how quickly the similarity decays with distance.
classifier_kernel = svm.SVC(kernel='rbf', gamma=3.0)

train_xs, test_xs, train_ys, test_ys = train_test_split(xs,ys,test_size=0.25)
classifier_kernel.fit(train_xs,train_ys)

predicted_ys = classifier_kernel.predict(test_xs)
confusion_matrix(test_ys,predicted_ys)

array([[24,  4],
       [ 3, 21]])

## Multiclass SVM

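SVM is designed as a binary classifier, but it can be extended to a multiclass classifier as well, typically by combining several binary SVMs.  The standard SVM libraries for R and Python already do this; for example, sklearn's `svm.SVC` trains one binary classifier for each pair of classes (the one-vs-one scheme).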
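As a rough illustration of how binary SVMs are combined, the sketch below fits `svm.SVC` on synthetic data with 4 classes (generated with `make_blobs`, an assumption of this example) and inspects the shape of the decision function. With `decision_function_shape='ovo'` there is one score per pair of classes; with `'ovr'` the pairwise scores are aggregated into one score per class.

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic data with 4 classes (made up for illustration).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# One-vs-one output: one binary decision value per pair of classes, 4*3/2 = 6 in total.
ovo = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
ovo_scores = ovo.decision_function(X)

# One-vs-rest shaped output: one aggregated score per class.
ovr = svm.SVC(kernel='linear', decision_function_shape='ovr').fit(X, y)
ovr_scores = ovr.decision_function(X)

print(ovo_scores.shape)  # (200, 6)
print(ovr_scores.shape)  # (200, 4)
```

In both cases `SVC` trains the pairwise classifiers internally; the option only changes how the scores are reported.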

In [6]:
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",sep=",", header=None)
iris.head(10)

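|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |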


In [7]:
# Columns 0-3 hold the four measurements, column 4 holds the species label.
xs = iris.iloc[:,0:4]
ys = iris.iloc[:,4]
train_xs, test_xs, train_ys, test_ys = train_test_split(xs,ys,test_size=0.25)

classifier = svm.SVC(kernel='rbf', gamma=1.0)
classifier.fit(train_xs,train_ys)

predicted_ys = classifier.predict(test_xs)
confusion_matrix(test_ys,predicted_ys)

array([[11,  0,  0],
       [ 0, 12,  2],
       [ 0,  0, 13]])

In [8]:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",sep=",", header=None)
wine.head(10)


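|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| 5 | 1 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 |
| 6 | 1 | 14.39 | 1.87 | 2.45 | 14.6 | 96 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290 |
| 7 | 1 | 14.06 | 2.15 | 2.61 | 17.6 | 121 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295 |
| 8 | 1 | 14.83 | 1.64 | 2.17 | 14.0 | 97 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045 |
| 9 | 1 | 13.86 | 1.35 | 2.27 | 16.0 | 98 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045 |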


In [9]:
# Column 0 holds the class label, columns 1-13 hold the thirteen features.
xs = wine.iloc[:,1:14]
ys = wine.iloc[:,0]
train_xs, test_xs, train_ys, test_ys = train_test_split(xs,ys,test_size=0.25)

classifier = svm.SVC(kernel='linear')
classifier.fit(train_xs,train_ys)

predicted_ys = classifier.predict(test_xs)
confusion_matrix(test_ys,predicted_ys)

array([[18,  1,  0],
       [ 0, 14,  0],
       [ 0,  0, 12]])

## One large example (MNIST)

[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits stored as grayscale images of size 28x28 pixels.

![MNIST sample](images/MNIST-sample.png)

For this example, you need to install the library [mlxtend](https://github.com/rasbt/mlxtend).

In [11]:
from mlxtend.data import loadlocal_mnist

X, y = loadlocal_mnist(
        images_path='data/t10k-images-idx3-ubyte', 
        labels_path='data/t10k-labels-idx1-ubyte')


In [20]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size=0.25)

classifier = svm.SVC(kernel='linear',C=2.0)
classifier.fit(Xtrain,ytrain)

predicted_ys = classifier.predict(Xtest)
confusion_matrix(ytest,predicted_ys)

array([[241,   0,   0,   0,   0,   1,   2,   0,   0,   0],
       [  0, 272,   1,   0,   1,   2,   1,   0,   0,   0],
       [  1,   3, 230,   3,   3,   1,   3,   2,   1,   2],
       [  1,   2,   6, 242,   1,  10,   0,   3,   9,   1],
       [  0,   1,   2,   0, 241,   0,   1,   0,   1,   8],
       [  2,   3,   0,   8,   2, 187,   2,   1,   4,   1],
       [  1,   1,   4,   0,   1,   0, 222,   1,   1,   0],
       [  3,   1,   3,   0,   3,   0,   0, 222,   0,  11],
       [  1,   8,   1,   9,   1,   7,   2,   0, 225,   3],
       [  2,   2,   0,   4,  10,   2,   0,   8,   0, 232]])

In [21]:
print(classification_report(ytest,predicted_ys))

             precision    recall  f1-score   support

          0       0.96      0.99      0.97       244
          1       0.93      0.98      0.95       277
          2       0.93      0.92      0.93       249
          3       0.91      0.88      0.89       275
          4       0.92      0.95      0.93       254
          5       0.89      0.89      0.89       210
          6       0.95      0.96      0.96       231
          7       0.94      0.91      0.92       243
          8       0.93      0.88      0.90       257
          9       0.90      0.89      0.90       260

avg / total       0.93      0.93      0.93      2500
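The columns of the classification report can be recomputed from the confusion matrix. The sketch below copies the 10x10 matrix printed above (rows are true digits, columns are predicted digits) and derives precision, recall, F1 and support with NumPy. It is meant as a check of how the numbers relate, not as a replacement for `classification_report`.

```python
import numpy as np

# Confusion matrix copied from the MNIST run above
# (rows: true digit, columns: predicted digit).
cm = np.array([
    [241,   0,   0,   0,   0,   1,   2,   0,   0,   0],
    [  0, 272,   1,   0,   1,   2,   1,   0,   0,   0],
    [  1,   3, 230,   3,   3,   1,   3,   2,   1,   2],
    [  1,   2,   6, 242,   1,  10,   0,   3,   9,   1],
    [  0,   1,   2,   0, 241,   0,   1,   0,   1,   8],
    [  2,   3,   0,   8,   2, 187,   2,   1,   4,   1],
    [  1,   1,   4,   0,   1,   0, 222,   1,   1,   0],
    [  3,   1,   3,   0,   3,   0,   0, 222,   0,  11],
    [  1,   8,   1,   9,   1,   7,   2,   0, 225,   3],
    [  2,   2,   0,   4,  10,   2,   0,   8,   0, 232],
])

correct = np.diag(cm)                 # correctly classified samples per digit
support = cm.sum(axis=1)              # number of true samples per digit (row sums)
recall = correct / support            # correct / all samples of that digit
precision = correct / cm.sum(axis=0)  # correct / all predictions of that digit (column sums)
f1 = 2 * precision * recall / (precision + recall)

for digit in range(10):
    print(digit, round(precision[digit], 2), round(recall[digit], 2),
          round(f1[digit], 2), support[digit])
```

For the digit 0 this gives a precision of 241/252 ≈ 0.96 and a recall of 241/244 ≈ 0.99, matching the first row of the report.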

