#Support Vector Machines

1. Support Vector Machines: Introduction & Motivation
1. SVM Concepts
1. Linear SVMs
1. Nonlinear SVMs
1. Applications of SVM

##Support Vector Machines: Introduction & Motivation

Support Vector Machines (SVMs) are among the most important tools you will learn about in your study of machine learning. Although SVMs are not always the best choice, they can handle many of the tasks that other models handle, and in some cases they outperform more sophisticated tools.

The SVM is by definition a *classifier* and a discriminative model. At the most basic level, it is a **binary** classifier, meaning it distinguishes between two classes; think (+) or (-). More advanced SVMs can discriminate between many classes.

Because of the intellectual importance of SVMs to a foundation in machine learning, we shall motivate and discuss them in some detail. 


###SVMs: Motivation

In your previous classwork, you observed the use of Lasso regression against an essentially binary dataset. Other forms of regularized regression reduced the emphasis of the regression on the x-value, but Lasso eliminated dependence on the x-value completely. Suppose that, instead of fitting a regression to predict the next point, you cared more about finding an **optimal boundary of discrimination** between two classes within the sample space.

We shall define this boundary as follows:

**Given an n-dimensional feature space, the optimal boundary is the (n-1)-dimensional hyperplane that maximizes the distance between the two classes. This, in turn, means we seek to maximize the distance between the plane and the closest training examples of each class.**


###SVMs: Derivation

Instead of projecting output variables into the feature space as we normally do, we are going to define the general equation of a (hyper)plane in the n-dimensional feature space (in two dimensions a plane is a line):

$$\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n = z$$

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} = z$$

Our plane is going to divide the two classes in hyperspace. There are infinitely many ways to choose the coefficients that represent the same plane, because scaling $\beta_0$ and $\beta$ by a common nonzero factor leaves the plane unchanged. We pick the simplest scaling:

$$ |\beta_0 + \textbf{$\beta^{T}$}\textbf{x}| = 1 $$

In this case $\textbf{x}$ represents those training examples closest to the hyperplane. This is an important element of what we are doing, because ideally we want to find the plane that maximizes distance between the positive and negative training examples. 
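To make the plane equation concrete, here is a minimal sketch in plain Python. The coefficients `beta0` and `beta` and the sample points are hypothetical values chosen for illustration, not values from the classwork. The sign of $z$ tells us which side of the plane a point falls on, and points with $|z| = 1$ play the role of the closest training examples.

```python
def plane_value(x, beta0, beta):
    """Return z = beta0 + beta^T x for a single point x."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

# Hypothetical plane in a 2-dimensional feature space: x1 + x2 - 3 = 0.
beta0 = -3.0
beta = [1.0, 1.0]

# Hypothetical points: one on each side of the plane and one on it.
points = [(1.0, 1.0), (2.0, 2.0), (1.5, 1.5)]
values = [plane_value(p, beta0, beta) for p in points]

for p, z in zip(points, values):
    if z > 0:
        side = "positive side"
    elif z < 0:
        side = "negative side"
    else:
        side = "on the plane"
    print(p, z, side)
```

The first two points give $z = -1$ and $z = +1$, so under this scaling they sit exactly at $|\beta_0 + \beta^{T}\textbf{x}| = 1$.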

###Distance between a point and a plane

Now we use vector math to calculate the distance between any point (a vector in n-dimensional space) and any plane. To illustrate this, take a plane in 3 dimensions, where $z$ is the constant offset:

$$ax_0+bx_1+cx_2+z=0$$

The normal vector to the plane is:

$$\textbf{v} = \left[ \begin{array}{c}
a  \\
b  \\
c  \end{array} \right]$$


Given a point $\textbf{u} = (u_0,u_1,u_2)$, a vector $\textbf{w}$ from a point $\textbf{x}$ on the plane to $\textbf{u}$ is given by:

$$\textbf{w} = \textbf{u}-\textbf{x} = - \left[ \begin{array}{c}
x_0-u_0  \\
x_1-u_1  \\
x_2-u_2  \end{array} \right]$$

Now we can take the projection of $\textbf{w}$ onto $\textbf{v}$ to give the distance from $\textbf{u}$ to the plane:

$$D=|\mathrm{proj}_{\textbf{v}}\textbf{w}|$$

$$D=\frac{|\textbf{v}\cdot\textbf{w}|}{|\textbf{v}|}$$


Expanding the dot product, and using the fact that $\textbf{x}$ lies on the plane so that $ax_0+bx_1+cx_2=-z$, gives the result we want:

$$D = \frac{|a(x_0-u_0)+b(x_1-u_1)+c(x_2-u_2)|}{|\textbf{v}|}$$

$$D = \frac{|ax_0-au_0+bx_1-bu_1+cx_2-cu_2|}{|\textbf{v}|}$$

$$D = \frac{|-z-au_0-bu_1-cu_2|}{|\textbf{v}|}$$

$$D = \frac{|au_0+bu_1+cu_2+z|}{|\textbf{v}|}$$
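The derivation above can be checked numerically. The following is a minimal sketch in plain Python; the plane coefficients, the point `u`, and the point `x_on_plane` are hypothetical values chosen so the arithmetic is easy to follow. It computes the distance once with the final formula and once with the projection form, and the two should agree.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def point_plane_distance(u, normal, offset):
    """Distance from point u to the plane normal . x + offset = 0."""
    return abs(dot(normal, u) + offset) / math.sqrt(dot(normal, normal))

# Hypothetical plane: 1*x0 + 2*x1 + 2*x2 - 3 = 0,
# so (a, b, c) = (1, 2, 2) and the offset z = -3.
normal = (1.0, 2.0, 2.0)
offset = -3.0
u = (3.0, 3.0, 3.0)

# Final formula: D = |a*u0 + b*u1 + c*u2 + z| / |v|
d_formula = point_plane_distance(u, normal, offset)

# Projection form: D = |v . w| / |v| with w = u - x for a point x on the plane.
x_on_plane = (3.0, 0.0, 0.0)  # satisfies 1*3 + 2*0 + 2*0 - 3 = 0
w = tuple(ui - xi for ui, xi in zip(u, x_on_plane))
d_projection = abs(dot(normal, w)) / math.sqrt(dot(normal, normal))

print(d_formula, d_projection)
```

Both expressions evaluate to 4 here: the numerator is 12 and $|\textbf{v}| = 3$. Any other point on the plane could be used for `x_on_plane` without changing the result.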

###Definition of the SVM equations

Applying this result to our hyperplane, with $\beta$ as the normal vector and $\beta_0$ as the offset, the distance between a support vector $\textbf{x}$ and the hyperplane is:

$$D_{hyperplane} = \frac{|\beta_0 + \textbf{$\beta^{T}$}\textbf{x}|}{\|\beta\|}$$

Because we scaled the coefficients so that $|\beta_0 + \beta^{T}\textbf{x}| = 1$ at the support vectors, this distance simplifies to:

$$D_{hyperplane} = \frac{|\beta_0 + \textbf{$\beta^{T}$}\textbf{x}|}{\|\beta\|} = \frac{1}{\|\beta\|}$$

The total **margin** between the two classes, measured across both sides of the plane, is given by **M**:

$$M = \frac{2}{\|\beta\|}$$

With an SVM, we seek to **maximize M**.
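As a quick illustration of why maximizing $M$ amounts to making $\|\beta\|$ small, here is a minimal sketch that evaluates the margin for two hypothetical coefficient vectors. The vectors are made up for this example, and the sketch assumes both are already scaled so that the support vectors satisfy $|\beta_0 + \beta^{T}\textbf{x}| = 1$.

```python
import math

def margin_width(beta):
    """Return M = 2 / ||beta|| for a coefficient vector beta."""
    norm = math.sqrt(sum(b * b for b in beta))
    return 2.0 / norm

# Two hypothetical coefficient vectors pointing in the same direction.
beta_large = [3.0, 4.0]   # ||beta|| = 5
beta_small = [1.5, 2.0]   # ||beta|| = 2.5

m_large = margin_width(beta_large)
m_small = margin_width(beta_small)

print(m_large, m_small)
```

Halving the norm doubles the margin (0.4 versus 0.8 here), which is why the search for the widest margin can be read as a search for the smallest $\|\beta\|$ that still separates the classes.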

###Solution to the SVM

####Part 1

Recall that we are classifying the data into a (+) group and a (-) group. We treat these class labels as the output variables $\textbf{y}$ of our method, with each $y_i$ equal to either $+1$ or $-1$. Going back to our equation, the support vectors satisfy

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} = 1$$

for the positive class, and

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} = -1$$

for the negative class.

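Also recall that all data points on the positive side of the plane belong to the positive class, and all data points on the negative side belong to the negative class. Since the support vectors are the points closest to the plane, every positive example satisfies

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} \geq 1$$

and every negative example satisfies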

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} \leq -1$$

Multiplying each inequality by its label $y_i$ ($+1$ or $-1$) combines both cases into a single constraint:

$$y_i(\beta_0 + \beta^{T}x_{i}) \geq 1$$

for all datapoints $i=1,2, \cdots , m$
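The combined constraint is easy to verify on a small example. The sketch below uses a hypothetical four-point dataset and hypothetical coefficients, all invented for illustration. It evaluates $y_i(\beta_0 + \beta^{T}x_i)$ for each point, checks that every value is at least 1, and flags the points where the value equals 1 as support vectors.

```python
import math

def functional_margin(x, y, beta0, beta):
    """Return y * (beta0 + beta^T x) for one labelled training example."""
    return y * (beta0 + sum(b * xi for b, xi in zip(beta, x)))

# Hypothetical toy data: two positive and two negative examples.
X = [(2.0, 2.0), (3.0, 3.0), (1.0, 1.0), (0.0, 0.0)]
y = [1, 1, -1, -1]

# Hypothetical separating plane: x1 + x2 - 3 = 0.
beta0 = -3.0
beta = [1.0, 1.0]

margins = [functional_margin(xi, yi, beta0, beta) for xi, yi in zip(X, y)]

# The constraint holds when every value is >= 1 (with a float tolerance).
constraint_holds = all(m >= 1.0 or math.isclose(m, 1.0) for m in margins)

# Points that meet the constraint with equality are the support vectors.
is_support_vector = [math.isclose(m, 1.0) for m in margins]

print(margins, constraint_holds, is_support_vector)
```

In this toy case the points $(2, 2)$ and $(1, 1)$ meet the constraint with equality, so they are the support vectors, while the other two points lie further from the plane.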

####Part 2

It turns out that the above equation is miserably difficult to solve directly. This is because it involves 

##SVM Concepts

###Hypothesis

The variables can be most clearly classified in terms of a plane separating them.

**We can define the plane in terms of the training points closest to the plane. These points are called *support vectors***.

The coefficients of the plane are scaled so that

$$|\beta_0 + \textbf{$\beta^{T}$}\textbf{x}|=1$$

where the $\textbf{x}$ are the **support vectors**.


###Cost Function

Recall

###Optimization


###Reasoning


