# SVM - Support Vector Machines

_Learn by maximizing margin separation_

# Why SVMs?

* One of the better _off-the-shelf_ algorithms

* Optimize an intuitive notion of separation

* Non-linear behaviour with linear runtimes

![SVM](images/svm.png)

![svm-example](images/svm-example.png)

# Optimal Margin Classifier

Objective:

$$\max_{\omega,b} \gamma$$

which is equivalent to:

$$\min_{\omega,b} \frac{1}{2}\|\omega\|^2$$

subject to:

$$y_i(\omega^T x_i+b)\ge 1, \quad i=1,\dots,m$$

Points that satisfy the constraint with equality (functional margin $\gamma=1$) are the ones _closest_ to the decision boundary
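The step between the two objectives can be spelled out. This is a sketch of the usual argument; the symbol $\hat\gamma$ for the functional margin is introduced here and is not part of the original slides. Because $(\omega, b)$ can be rescaled without moving the boundary, we are free to fix $\hat\gamma = 1$:

$$\begin{align}
\gamma &= \frac{\hat\gamma}{\|\omega\|}, \qquad \hat\gamma = \min_i \, y_i(\omega^T x_i + b) \\
\max_{\omega,b} \gamma &\iff \max_{\omega,b} \frac{1}{\|\omega\|} \quad \text{(with } \hat\gamma = 1 \text{)} \\
&\iff \min_{\omega,b} \frac{1}{2}\|\omega\|^2
\end{align}$$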

# Are we done?

* We _can_ just leave it here
* However, the boundary depends only on the _closest_ points ($\gamma=1$)
    * This gives us an additional constraint
    * Active points are marked by a multiplier $\alpha \ne 0$
    * Under the **Karush-Kuhn-Tucker** (KKT) conditions, we can find the support vectors _and_ $\omega$
   

# Support Vectors

![support-vectors](images/svm-support-vectors.png)

# Margins for Support Vectors

$$\omega = \sum_{i=1}^m \alpha_i y_i x_i$$

subject to

$$\sum_{i=1}^m \alpha_i y_i = 0$$

Now, if we have $\alpha$, we can find $\omega$ easily!
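A minimal sketch of this relation using scikit-learn, assuming a made-up toy dataset and a very large `C` to approximate the hard-margin classifier. In `SVC`, `dual_coef_` holds the products $\alpha_i y_i$ for the support vectors only, so $\omega$ can be rebuilt from them and compared with `coef_`:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only;
# all other training points have alpha_i = 0 and drop out of the sum.
alpha_times_y = clf.dual_coef_[0]
omega = alpha_times_y @ clf.support_vectors_

print("support vector indices:", clf.support_)
print("omega rebuilt from alphas:", omega)
print("omega reported by sklearn:", clf.coef_[0])
print("sum of alpha_i * y_i:", alpha_times_y.sum())
```

Only a subset of the training points shows up in `support_`, which is the practical meaning of "most $\alpha$s are zero".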

# Intermezzo:

## Separability in higher dimensions

[Eric Kim's Kernels Page](http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html)

![hd-sep](images/data_2d_to_3d_hyperplane.png)
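To make the idea concrete, here is a small sketch with synthetic data: two concentric rings cannot be split by a line in 2D, but adding a third coordinate $x_1^2 + x_2^2$ lets a flat plane separate them. The data, the lift, and the threshold are all assumptions chosen for illustration and may differ from the mapping used in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(n, r_min, r_max):
    """Sample n points with radius in [r_min, r_max] around the origin."""
    radius = rng.uniform(r_min, r_max, size=n)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

inner = ring(50, 0.0, 1.0)   # class A: inside the unit circle
outer = ring(50, 2.0, 3.0)   # class B: a ring around class A

def lift(points):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([points, np.sum(points ** 2, axis=1)])

inner_3d, outer_3d = lift(inner), lift(outer)

# In 3D the horizontal plane z = 2.5 separates the two classes.
threshold = 2.5
separable = inner_3d[:, 2].max() < threshold < outer_3d[:, 2].min()
print("separable by a plane in 3D:", separable)
```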

# Kernel Trick

$$\begin{align}
\omega^T x + b &= \left( \sum_{i=1}^m \alpha_i y_i x_i \right)^T x + b \\
&= \sum_{i=1}^m \alpha_i y_i \langle x_i,x\rangle + b
\end{align}$$

Remember that most $\alpha$s are zero.

Functions of dot products are often cheaper to compute than dot products of transformed features
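The expansion above means a prediction needs only kernel evaluations against the support vectors. A sketch with scikit-learn, assuming synthetic data and a hand-written RBF kernel with a fixed `gamma` so that it can be recomputed outside the library:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
# Non-linear toy labels: +1 outside the unit circle, -1 inside.
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

gamma = 0.5  # set explicitly so the kernel below matches the classifier
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

def rbf(a, b):
    """Gaussian RBF kernel between two vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision(x):
    """sum_i alpha_i y_i K(x_i, x) + b, summed over support vectors only."""
    k = np.array([rbf(sv, x) for sv in clf.support_vectors_])
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_new = np.array([0.3, -1.2])
manual = decision(x_new)
library = clf.decision_function(x_new.reshape(1, -1))[0]
print(manual, library)
```

The two printed values should agree up to floating-point error, and the transformed features $\phi(x)$ are never computed.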

# Kernels

_Kernels are small functions_

For SVMs, a _kernel_ is defined as the inner product of the feature transformations $\phi$ of two points:

$$K(x, z) = \phi(x)^T \phi(z)$$

Kernels allow an SVM to learn in a high-dimensional feature space without computing $\phi$ explicitly.

# Kernel Example

In general:

$$K(x, z) = \phi(x)^T \phi(z)$$

Let's say we want to fit a polynomial transformation:

$$ K(x,z) = (x^T z)^2 $$

We can calculate $\phi$ directly, but that is expensive: $\phi$ has $n^2$ entries. For $n=2$:

$$K(x,z)=\begin{bmatrix} x_1 x_1 & x_1 x_2 & x_2 x_1 & x_2 x_2\end{bmatrix}
\begin{bmatrix}z_1 z_1 \\ z_1 z_2 \\ z_2 z_1 \\ z_2 z_2 \end{bmatrix}$$

Expanding the kernel shows that it gives the same result without ever building $\phi$:

$$\begin{align} K(x,z) &=\left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) \\
&=\sum_{i,j=1}^n (x_i x_j) (z_i z_j)
\end{align} $$
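The identity can be checked numerically. In this sketch `phi` and `poly_kernel` are illustrative helper names, and the vectors are arbitrary:

```python
import numpy as np

def phi(v):
    """Explicit feature map: all pairwise products v_i * v_j (n^2 entries)."""
    return np.outer(v, v).ravel()

def poly_kernel(x, z):
    """Same quantity computed from a single dot product in the input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = phi(x) @ phi(z)     # dot product in the 4-dimensional feature space
implicit = poly_kernel(x, z)   # one dot product in 2 dimensions, then a square

print(explicit, implicit)      # both are 121.0
```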



# Common Kernels

* Linear 
    * $\langle x, z \rangle$
* Polynomial 
    * $(\gamma \langle x, z \rangle + r)^d$
* Gaussian Radial Basis (RBF) 
    * $e^{-\gamma(\| x - z \|^2)}$
* Sigmoid
    * $\tanh(\gamma\langle x, z \rangle+r)$
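The four formulas translate almost directly into NumPy. This is a sketch for single vectors; the function names and default parameter values are assumptions made for illustration. In scikit-learn's `SVC` the corresponding parameters are `gamma`, `coef0` (for $r$) and `degree` (for $d$).

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=3):
    return (gamma * (x @ z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    return np.tanh(gamma * (x @ z) + r)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print("linear:    ", linear_kernel(x, z))
print("polynomial:", polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2))
print("rbf:       ", rbf_kernel(x, z, gamma=0.5))
print("sigmoid:   ", sigmoid_kernel(x, z, gamma=0.1))
```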

# Summary

## Pros

* Scale well
* Non-linear separation
* Fast

## Cons

* Could be hard to interpret
* Arbitrary transformations

# Additional materials

Andrew Ng's lectures for [CS229](http://cs229.stanford.edu/notes/cs229-notes3.pdf), Stanford

# Exercise

```python
# Import the classifier and the evaluation metrics
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score
from sklearn.svm import SVC

# Import the functions to use the dataset
from pathogenicity_predictor import prepare_variants, concat_training_data, partition_into_training_and_test, plot_line_graph
```

```python
data, feature_names = prepare_variants('../data/variants.json.gz')
variants, labels = concat_training_data(data)
training_vars, test_vars, training_labels, test_labels = partition_into_training_and_test(variants, labels, 0.8)
```

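```python
# Run SVM classification (probability=True enables predict_proba)
svm_classifier = SVC(kernel='rbf', probability=True).fit(training_vars, training_labels)

# Get probabilities for the second class (classes_[1]) on the training set;
# pass test_vars instead to evaluate on held-out data
pathogenicity_probs = svm_classifier.predict_proba(training_vars)[:,1]
```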

# Question 1

Examine the support vectors for the classifier (`svm_classifier.support_vectors_`). What percentage of the data is used?

# Question 2

Compare different kernels. The classifier above uses `rbf` for Gaussian distances. Try `sigmoid` for logistic-like properties. Do the results improve?

# Question 3

Build a ROC curve for the classifier. Is the SVM performing well?