# SVM - Support Vector Machines

_Learn by maximizing margin separation_

![SVM](images/svm.png)

Support vector machines (SVMs) are a robust, widely used family of machine learning algorithms.

Here, we will consider a supervised classification problem with two classes in some detail.

Intuitively, SVMs try to find a boundary separating the different classes of observations. In the basic formulation this boundary is linear (a hyperplane); kernels, covered later, allow it to be nonlinear in the original input space. The larger the gap separating the classes, the more confidence we have in the prediction.

We call this separation gap a margin, and our objective is to choose the largest margin. In the figure above, the margin is shown in green, and our objective is to maximize the distance from the separating boundary to the closest data points.

![svm-example](images/svm-example.png)

# Optimal Margin Classifier

$$\max_{\omega,b}\gamma$$

which is the same as:

$$\min_{\omega,b}\frac{1}{2}\|\omega\|^2$$

subject to:

$$y_i(\omega^T x_i+b)\ge 1, \quad i=1,\dots,m$$

We want to find the values of $\omega$ and $b$ that maximize the margin $\gamma$. This turns out to be equivalent to minimizing the squared norm of the vector $\omega$, as above.
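
One way to see the equivalence, sketched here under the usual assumption that the data are linearly separable: the distance from a point $x_i$ to the boundary is

$$\gamma_i = \frac{y_i(\omega^T x_i + b)}{\|\omega\|}$$

Rescaling $(\omega, b)$ by a positive constant does not move the boundary, so we may fix the scale such that the closest points satisfy $y_i(\omega^T x_i + b) = 1$. Then

$$\gamma = \min_i \gamma_i = \frac{1}{\|\omega\|}$$

and maximizing $1/\|\omega\|$ is the same as minimizing $\frac{1}{2}\|\omega\|^2$.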

In addition, we constrain the quantity $y_i(\omega^T x_i+b)$ to be greater than or equal to $1$ for every point. The points for which it equals exactly $1$ are the ones closest to the boundary.
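
As a small illustration, the sketch below checks the constraint on a toy dataset. The data points and the candidate values of `w` and `b` are made up for this example (they are not produced by a solver), and the helper `dot` is just a plain inner product.

```python
import math

# Hypothetical toy data: two points per class in 2D.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Candidate hyperplane, assumed for illustration only.
w = (0.5, 0.5)
b = -1.0


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


# y_i (w^T x_i + b) for every point; the constraint requires each to be >= 1.
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]
feasible = all(m >= 1.0 for m in margins)

# With this scaling, the distance from the boundary to the closest points
# is 1 / ||w||.
gamma = 1.0 / math.sqrt(dot(w, w))

print(margins)   # [1.0, 2.0, 1.0, 1.5]
print(feasible)  # True
print(gamma)
```

The two points whose value is exactly `1.0` are the ones sitting closest to the boundary.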

# Support Vectors

![support-vectors](images/svm-support-vectors.png)

We could leave it at that and simply optimize the formulation above. However, notice that the margin depends only on the points closest to the separation boundary, that is, the points where $y_i(\omega^T x_i + b) = 1$. Exploiting this adds an extra step to the problem setup, but it turns out to simplify the problem: we can optimize with additional constraints, under which only some points are _active_. These are our support vectors.

Using Lagrange multipliers, a general approach to constrained optimization, together with the **Karush-Kuhn-Tucker** conditions, we can identify the support vectors: they are the points whose multipliers are non-zero.

# Margins for Support Vectors

$$\omega = \sum_{i=1}^m \alpha_i y_i x_i$$

subject to:

$$\sum_{i=1}^m \alpha_i y_i = 0$$

Given the values $\alpha_i$, the Lagrange multipliers (non-zero only for the support vectors), we can now compute $\omega$ quickly.
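
A minimal sketch of this step, assuming the multipliers are already known. The toy data and the `alpha` values are hypothetical, chosen by hand so that only two points act as support vectors. Recovering `b` from a support vector uses the fact that $y_s(\omega^T x_s + b) = 1$ holds there.

```python
# Hypothetical toy data and hand-picked multipliers (not from a solver).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Constraint: sum_i alpha_i * y_i must be zero.
balance = sum(a * yi for a, yi in zip(alpha, y))

# omega = sum_i alpha_i * y_i * x_i, computed coordinate by coordinate.
n_features = len(X[0])
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
     for k in range(n_features)]

# Support vectors are the points with non-zero multipliers.
support = [i for i, a in enumerate(alpha) if a > 0]

# At a support vector s, y_s (w^T x_s + b) = 1, so b = y_s - w^T x_s
# (using y_s in {+1, -1}).
s = support[0]
b = y[s] - dot(w, X[s])

print(balance, w, support, b)
```

Only the entries of `alpha` that are non-zero contribute to the sum, which is why the computation stays cheap even for large training sets.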

# Separability in higher dimensions

[Eric Kim's Kernels Page](http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html)

![hd-sep](images/data_2d_to_3d_hyperplane.png)

# Kernel Trick

$$\begin{align}
\omega^T x + b &= \left( \sum_{i=1}^m \alpha_i y_i x_i \right)^T x + b \\
&= \sum_{i=1}^m \alpha_i y_i \langle x_i,x\rangle + b
\end{align}$$

To classify a new point $x$, we evaluate the decision function $\omega^T x + b$ and take its sign. Using the expression for $\omega$ in terms of $\alpha$ from before, this becomes a sum of inner products between $x$ and the training points. Most of the $\alpha_i$ are zero, so only the support vectors contribute and the sum is cheap to evaluate. In addition, it is often possible to compute the inner product of two transformed vectors without computing the transformed vectors themselves. This is what gives SVMs much of their power.
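
The sketch below evaluates the decision function purely through inner products with the support vectors. The support vectors, labels, multipliers and `b` are hypothetical values picked for illustration, and the `kernel` argument is an assumed hook showing where a different inner product could be swapped in.

```python
# Hypothetical support vectors with their labels and multipliers.
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
support_labels = [1, -1]
support_alphas = [0.25, 0.25]
b = -1.0


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision(x, kernel=dot):
    """Evaluate sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    total = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(support_alphas, support_labels,
                                     support_vectors))
    return total + b


def predict(x, kernel=dot):
    """Classify by the sign of the decision function."""
    return 1 if decision(x, kernel) >= 0 else -1


print(decision((3.0, 1.0)), predict((3.0, 1.0)))
print(decision((0.0, 1.0)), predict((0.0, 1.0)))
```

Note that `decision` never builds $\omega$ explicitly: it only needs inner products between the new point and the support vectors.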

# Kernels

_Kernels are small functions_

For SVMs, a _kernel_ is defined as the inner product of the feature transformations $\phi$ of two points:

$$K(x, z) = \phi(x)^T \phi(z)$$

Kernels allow SVMs to learn in a high-dimensional feature space.

# Kernel Example

Let's say we want to fit a polynomial transformation:

$$ K(x,z) = (x^T z)^2 $$

We can expand this to see which feature transformation it corresponds to:

$$\begin{align} K(x,z) &=\left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) \\
&=\sum_{i,j=1}^n (x_i x_j) (z_i z_j)
\end{align} $$

On the other hand, calculating $\phi$ explicitly is expensive: it has $n^2$ components, whereas the kernel needs only the $n$-term inner product $x^T z$. For $n=2$:

$$\phi(x)=\begin{bmatrix}x_1 x_1 \\ x_1 x_2 \\ x_2 x_1 \\ x_2 x_2 \end{bmatrix}$$
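
A quick numerical check of this identity, as a sketch: the function names `poly2_kernel` and `phi` and the sample vectors are chosen here for illustration.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def poly2_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed with a single n-term inner product."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map: all n^2 pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]


x = (1.0, 2.0)
z = (3.0, 4.0)

via_kernel = poly2_kernel(x, z)
via_features = dot(phi(x), phi(z))

print(phi(x))                    # [1.0, 2.0, 2.0, 4.0]
print(via_kernel, via_features)  # 121.0 121.0
```

Both routes give the same number, but the kernel route never materializes the $n^2$-dimensional vectors.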

# Common Kernels

* Linear 
    * $\langle x, z \rangle$
* Polynomial 
    * $(\gamma \langle x, z \rangle + r)^d$
* Gaussian Radial Basis (RBF) 
    * $e^{-\gamma(\| x - z \|^2)}$
* Sigmoid
    * $\tanh(\gamma\langle x, z \rangle+r)$
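
The four kernels above can be sketched as plain Python functions. The function names and the default values for `gamma`, `r` and `d` are assumptions made for this example, not recommended settings.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(x, z):
    """<x, z>"""
    return dot(x, z)


def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=2):
    """(gamma * <x, z> + r)^d"""
    return (gamma * dot(x, z) + r) ** d


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    """tanh(gamma * <x, z> + r)"""
    return math.tanh(gamma * dot(x, z) + r)


x = (1.0, 2.0)
z = (3.0, 4.0)
print(linear_kernel(x, z))
print(polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2))
print(rbf_kernel(x, z, gamma=0.5))
print(sigmoid_kernel(x, z, gamma=0.1, r=0.0))
```

Each function takes two points and returns a single number, so any of them could stand in for the inner product $\langle x_i, x \rangle$ in the decision function.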

# Additional materials

Andrew Ng's lectures for CS229, Stanford