# SVMs from Scratch #
    - Matt Robinson

I have an opinion that is perhaps better left unsaid, but I'll say it anyway: most machine learning books are absolutely terrible at explaining SVMs (Support Vector Machines). Most books mention things like the maximum margin and Lagrange multipliers, but fail to adequately explain many of the steps in the math. To be fair, many of the books are not purporting to be lengthy math treatises. However, I do think the steps should be clear and there should be more intuition for choosing the "widest street" besides "that one looks the best" or "that's the line you would probably draw."

The problem with saying all of these things is that now I am going to try to explain SVMs, and I'll probably do a poor job of it. My tip to all readers is to go read *Pattern Recognition and Machine Learning* by Bishop (2006) or *Learning from Data* by Abu-Mostafa *et al.* (2012). These books are excellent and do by far the best job of explaining the concepts.

Since these books are amazing, I am actually just going to cover the material that I found extra-confusing about *Hard-margin* SVMs. I repeat, this tutorial will not be extensive, but will instead try to clear up the tricky points. If you are confused, much of the material is derived from the aforementioned books -- I again suggest you give them a read.

## Why is a Fatter Margin Better? ##

*Learning from Data* (2012) includes a great, lengthy discussion on this topic. However, I will include just one figure detailing what I find to be the most intuitively pleasing reason:

Note the following figures are taken from the slides from the Learning from Data [course website](https://work.caltech.edu/lectures.html#lectures). The book has better figures, but it has frequent warnings not to distribute the material anywhere...
<img src="img/learning_from_data_svm_no_noise.png" width="800" height="600">


Now most of us would say the linear discriminant on the right is the most reasonable choice. But why? Well, let's imagine that each data point has some associated noise or measurement error. Let's now plot how much measurement noise/error in the samples each margin allows (noise/error is represented by the black circles):

<img src="img/learning_from_data_svm_noise.png" width="800" height="600">

The rightmost margin (the maximum margin) is clearly more robust to error or noise. Even a little bit of measurement error could result in a data point being incorrectly classified in the leftmost plot. However, a point with slight measurement error will still likely be classified correctly by the classifier in the right plot.

All of this can be made more formal, but we will settle for this justification for now. 

# The Math #

Okay, time to get into the thick of it -- the part where most of the textbooks lost me. There are a lot of small tricks and assumptions behind the math of SVMs; I'll try to make most of them explicit.

At its core, a hard-margin SVM is really just a simple linear discriminant. We have $N$ $D$-dimensional input vectors $\mathbf{x}_1,...,\mathbf{x}_N \in \mathbb{R}^D$. We use a linear function $y(\mathbf{x})$ to map each vector to a scalar score, which can then be used to classify it.

$$
y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b
$$

where $\mathbf{w}$ is the weights vector and $b$ is the bias.

All points for which $y(\mathbf{x}) \geq 0$ are classified as $+1$ and all points for which $y(\mathbf{x}) < 0$ are classified as $-1$. The decision boundary, therefore, consists of all those points for which $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = 0$. Let's now dig into some of the math questions that stumped me.
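To make the classification rule concrete, here is a minimal sketch in plain Python. The weights, bias, and sample points are made up for illustration, and the helper names `dot`, `y`, and `classify` are my own. Assigning points that sit exactly on the boundary to $+1$ is just a convention I picked for the sketch.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def y(x, w, b):
    """Linear discriminant y(x) = w^T x + b."""
    return dot(w, x) + b


def classify(x, w, b):
    """Return +1 if y(x) >= 0, otherwise -1 (tie-break is an assumption)."""
    return 1 if y(x, w, b) >= 0 else -1


# Hypothetical 2-D example: the decision boundary is the line x1 + x2 = 3.
w = [1.0, 1.0]
b = -3.0

points = [[3.0, 2.0], [0.5, 1.0], [1.0, 2.0]]
labels = [classify(x, w, b) for x in points]
print(labels)  # [1, -1, 1]; the last point lies exactly on the boundary
```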

#### Why is $\mathbf{w}$ orthogonal to the decision boundary? ####

Let's start by considering the following figure from Bishop, which depicts a linear discriminant function in two dimensions.

<img src="img/Bishop_2D_discriminant.png" width="600" height="600">

The decision surface is shown in red. Note that Bishop uses $w_0$ to denote the bias parameter, which we are calling $b$.

Now if we pick two generic inputs $\mathbf{x}_1$ and $\mathbf{x}_2$ that lie on the decision surface, we automatically have that $y(\mathbf{x}_1)=y(\mathbf{x}_2)=0$ since they are on the surface. Furthermore, the vector going from $\mathbf{x}_1$ to $\mathbf{x}_2$ is in the direction of $(\mathbf{x}_2 - \mathbf{x}_1)$ and is parallel to the decision surface. But we also know that, 

$$
y(\mathbf{x}_1)=\mathbf{w}^T\mathbf{x}_1 + b =0=\mathbf{w}^T\mathbf{x}_2 + b = y(\mathbf{x}_2) \\ \implies \mathbf{w}^T(\mathbf{x}_2-\mathbf{x}_1)=0
$$

which shows that $\mathbf{w}$ is orthogonal to every vector lying in the decision surface.
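If you want to convince yourself numerically, here is a small check in plain Python. The values of `w` and `b` and the two boundary points are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical discriminant: y(x) = 2*x1 + 1*x2 - 4
w = [2.0, 1.0]
b = -4.0

# Two points chosen by hand so that y(x) = 0, i.e. both lie on the boundary.
x1 = [0.0, 4.0]
x2 = [2.0, 0.0]
y1 = dot(w, x1) + b
y2 = dot(w, x2) + b

# Vector from x1 to x2, which lies in the decision surface.
direction = [q - p for p, q in zip(x1, x2)]  # [2.0, -4.0]

print(y1, y2)             # 0.0 0.0
print(dot(w, direction))  # 0.0, so w is orthogonal to (x2 - x1)
```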

#### Why does $\frac{1}{||\mathbf{w}||}$  represent the width of the margin? ####

Consider again the figure above. The generic vector $\mathbf{x}$ can be decomposed as follows:

$$
\mathbf { x } = \mathbf { x } _ { \perp } + r \frac { \mathbf { w } } { \| \mathbf { w } \| }
$$

where $\mathbf { x } _ { \perp }$ is the orthogonal projection of $\mathbf{x}$ on the decision surface and $r$ is a scalar giving the perpendicular distance from the decision surface to $\mathbf{x}$. Note that $r$ is multiplied by the unit vector $\frac { \mathbf { w } } { \| \mathbf { w } \| }$, which is necessarily orthogonal to the decision surface, as shown above.

Let's work with this equation a bit:

$$
\mathbf { x } = \mathbf { x } _ { \perp } + r \frac { \mathbf { w } } { \| \mathbf { w } \| } \\
\begin{align*}
&\implies \mathbf{w}^T\mathbf { x } = \mathbf{w}^T\mathbf { x } _ { \perp } + r { \| \mathbf { w } \| } \\
&\implies \mathbf{w}^T\mathbf { x } + b = \mathbf{w}^T\mathbf { x } _ { \perp } + b + r { \| \mathbf { w } \| } \\
&\implies y(\mathbf{x}) = y(\mathbf { x } _ { \perp }) + r { \| \mathbf { w } \| } \\
&\implies y(\mathbf{x}) = 0 + r { \| \mathbf { w } \| } \\
&\implies r = \frac{y(\mathbf { x })}{ \| \mathbf { w } \| } \\
\end{align*}
$$

where we used the fact that $y(\mathbf { x } _ { \perp })=0$ because $\mathbf { x } _ { \perp }$ is on the decision surface.
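Here is a quick sanity check of the result $r = y(\mathbf{x}) / \|\mathbf{w}\|$ in plain Python. The numbers are made up (I picked $\mathbf{w} = (3, 4)$ so that $\|\mathbf{w}\| = 5$), and the helper names are my own. The sketch computes $r$, then recovers $\mathbf{x}_\perp$ from the decomposition and confirms that it lands on the decision surface.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def y(x, w, b):
    """Linear discriminant y(x) = w^T x + b."""
    return dot(w, x) + b


# Hypothetical values: ||w|| = 5, so the arithmetic is easy to check by hand.
w = [3.0, 4.0]
b = -5.0
x = [3.0, 4.0]

norm_w = math.hypot(*w)   # Euclidean norm of w
r = y(x, w, b) / norm_w   # signed perpendicular distance: 20 / 5 = 4

# Rearranging the decomposition: x_perp = x - r * w / ||w||
x_perp = [xi - r * wi / norm_w for xi, wi in zip(x, w)]

print(r)                  # 4.0
print(y(x_perp, w, b))    # approximately 0 (up to floating-point rounding)
```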

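This value $r$ will be positive on the side of the decision boundary that $\mathbf{w}$ points toward and negative on the other side, as dictated by the sign of $y(\mathbf { x })$. The trick now is to choose a scaling of $\mathbf{w}$ and $b$ such that $|y(\mathbf { x })|=1$ for the closest points to the boundary. We are free to do this because multiplying $\mathbf{w}$ and $b$ by the same positive constant changes neither the boundary nor any distance $r$. This will ensure that our margin can simply be represented as $\frac{1}{||\mathbf{w}||}$.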
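The rescaling trick is easier to believe once you see it with numbers. Below is a minimal sketch in plain Python with a made-up toy data set and a hand-picked separating `w` and `b` (not necessarily the maximum-margin solution). It divides `w` and `b` by the smallest $|y(\mathbf{x}_n)|$ and checks that the distance to the closest point is unchanged and now equals $1/\|\mathbf{w}\|$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def y(x, w, b):
    """Linear discriminant y(x) = w^T x + b."""
    return dot(w, x) + b


# Hypothetical toy data: the first two points are on the positive side,
# the last two on the negative side of the chosen boundary.
X = [[3.0, 3.0], [4.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
w = [2.0, 2.0]
b = -6.0

# Distance from the boundary to the closest point, before rescaling.
min_abs_y = min(abs(y(x, w, b)) for x in X)      # 4.0
margin_before = min_abs_y / math.hypot(*w)

# Rescale so that the closest points satisfy |y(x)| = 1.
w_scaled = [wi / min_abs_y for wi in w]          # [0.5, 0.5]
b_scaled = b / min_abs_y                         # -1.5

min_abs_y_scaled = min(abs(y(x, w_scaled, b_scaled)) for x in X)
margin_after = 1.0 / math.hypot(*w_scaled)

print(min_abs_y_scaled)              # 1.0
print(margin_before, margin_after)   # both approximately sqrt(2)
```

Nothing about the classifier changed here: the rescaled `w_scaled` and `b_scaled` describe the same boundary, just in units where the margin is easy to read off.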