# Support Vector Machines

### Contents
1. Introduction  
2. The maximal margin classifier


## Introduction

The support vector machine is an approach to classification that was developed in the 1990s as a generalization of a simpler classifier called the ***maximal margin classifier***. In the first part of this notebook, we assume that the training data points are linearly separable and belong to only two classes (binary classification).

Unlike other classification algorithms such as logistic regression, an SVM does not merely classify data points (each of which is represented by a $p$-dimensional vector): it also aims to find the best possible boundary, that is, the $(p-1)$-dimensional *hyperplane* that gives the largest separation between the classes.

**Definition:** a *hyperplane* of a $p$-dimensional space $V$ is a flat affine subspace of dimension $p-1$, or equivalently, of codimension $1$ in $V$.

**Example:** if $V$ is the vector space $\mathbb{R}^3$, then a hyperplane is a flat two-dimensional subspace, that is, a plane.

In a $p$-dimensional space, a hyperplane has the form

$$\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p = 0.$$

This means that if a point $X = [X_1,\dots,X_p]^\top$ satisfies this equation, then $X$ lies on the hyperplane. If the equation is not satisfied, then $X$ lies on one side of the hyperplane when $\beta_0 + \beta_1X_1 + \dots + \beta_pX_p > 0$ and on the other side when $\beta_0 + \beta_1X_1 + \dots + \beta_pX_p < 0$.
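As a small illustration of this sign test, the following sketch evaluates the left-hand side of the hyperplane equation for a few points. The function names `decision_value` and `side`, as well as the example coefficients, are assumptions made here for illustration and do not come from the text.

```python
def decision_value(beta0, beta, x):
    """Return beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def side(beta0, beta, x):
    """Return +1 or -1 for the two sides of the hyperplane, 0 if x lies on it."""
    value = decision_value(beta0, beta, x)
    # Exact comparison with zero is fine for this toy example; with real
    # floating-point data a small tolerance would be safer.
    return (value > 0) - (value < 0)


# Illustrative hyperplane in two dimensions: 1 + 2*X1 + 3*X2 = 0
beta0, beta = 1.0, [2.0, 3.0]

for point in [(0.0, 0.0), (-1.0, -1.0), (1.0, -1.0)]:
    print(point, side(beta0, beta, point))
```

The first point gives a positive value, the second a negative one, and the third satisfies the equation exactly, so it lies on the hyperplane.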

The best hyperplane is the one for which the distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is called the **maximum-margin hyperplane**, and the linear classifier it defines is called the *maximum-margin classifier*.


~The cost function to minimize contains two contributions: one due to the *classification error* and the other due to the *margin error*.~

## The maximal margin classifier
Consider a binary classification problem where the dataset consists of $n$ samples, each with $p$ features, so that the data matrix is $X \in \mathbb{R}^{n\times p}$. The training samples are:

$$x_1 = \begin{bmatrix}x_{11} \\ \vdots \\ x_{1p}\end{bmatrix}, \dots, x_n = \begin{bmatrix}x_{n1} \\ \vdots \\ x_{np}\end{bmatrix},$$

and they fall into two classes, $y_1, \dots, y_n \in \{-1,1\}$. We want to predict the label of a new sample $x^* = \begin{bmatrix}x_1^* & \dots & x_p^*\end{bmatrix}^\top$. As in previous chapters, we seek a *separating hyperplane*, which has the property that, for all $i = 1, \dots, n$,

$$y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) > 0$$

which is a compact way of writing:

$$\begin{cases}
\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} > 0 & \text{if } y_i = 1 \\
\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} < 0 & \text{if } y_i = -1
\end{cases}$$

If a separating hyperplane exists, we can build a simple classifier based on which side of the hyperplane the observation falls on, that is, on the sign of $f(x^*) = \beta_0 + \beta_1x_1^* + \dots + \beta_px_p^*$. The *magnitude* of $f(x^*)$ can be used as a measure of confidence in the class prediction. But which of the infinitely many separating hyperplanes should be chosen? A natural choice is the *maximal margin hyperplane*, which is the separating hyperplane farthest from the training observations. For a fixed separating hyperplane, we can compute the perpendicular distance from each data point to it. The smallest of these distances is known as the ***margin***. The maximal margin hyperplane is the separating hyperplane for which the margin is largest.
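The quantities described above can be sketched in a few lines of plain Python. The helper names (`f`, `is_separating`, `margin`, `classify`) and the toy dataset are assumptions introduced for illustration only; the two candidate hyperplanes are hand-picked rather than computed.

```python
import math


def f(beta0, beta, x):
    """Evaluate beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def is_separating(beta0, beta, X, y):
    """Check y_i * f(x_i) > 0 for every training sample."""
    return all(yi * f(beta0, beta, xi) > 0 for xi, yi in zip(X, y))


def margin(beta0, beta, X):
    """Smallest perpendicular distance from a training sample to the hyperplane."""
    norm = math.sqrt(sum(b * b for b in beta))
    return min(abs(f(beta0, beta, xi)) / norm for xi in X)


def classify(beta0, beta, x_new):
    """Predict the label of a new sample from the sign of f."""
    return 1 if f(beta0, beta, x_new) > 0 else -1


# Hypothetical, linearly separable toy data in two dimensions
X = [(0.0, 0.0), (0.0, 2.0), (-1.0, 1.0), (2.0, 1.0), (3.0, 0.0), (4.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]

# Two hand-picked separating hyperplanes: x1 = 1 and x1 = 1.5
candidates = {"x1 = 1": (-1.0, [1.0, 0.0]), "x1 = 1.5": (-1.5, [1.0, 0.0])}
for name, (b0, b) in candidates.items():
    print(name, is_separating(b0, b, X, y), margin(b0, b, X))

print(classify(-1.0, [1.0, 0.0], (5.0, 5.0)))
```

Both candidates separate the two classes, but the first has margin 1 while the second has margin 0.5, so the first is the better choice under the maximal margin criterion.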

**Observation:** the maximal margin classifier with coefficients $\beta_0, \dots, \beta_p$ can lead to overfitting when $p$ is large.

In the following image we can see the maximal margin hyperplane found in a two-dimensional space, together with the corresponding margin.

<img src='images/svm/support-vectors.png' alt='maximal margin hyperplane' style='width: 300px;'/>

Three points are equidistant from the hyperplane, at a distance equal to the margin. These are called ***support vectors***. The chosen separating hyperplane depends directly on these three points but not on the other observations: moving even one support vector may change the position of the hyperplane, whereas moving any other observation has no effect as long as it does not cross the margin.
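A minimal sketch of how the support vectors could be identified once a hyperplane is given: they are the training points whose distance to the hyperplane equals the margin. The data, the hyperplane $x_1 = 1$, and the function names below are hypothetical and are not the ones shown in the figure; the sketch also assumes that the supplied hyperplane is already the maximal margin one.

```python
import math


def distances(beta0, beta, X):
    """Perpendicular distance from each sample to the hyperplane."""
    norm = math.sqrt(sum(b * b for b in beta))
    return [abs(beta0 + sum(b * xj for b, xj in zip(beta, xi))) / norm
            for xi in X]


def support_vectors(beta0, beta, X, tol=1e-9):
    """Return the samples whose distance equals the margin (up to tol)."""
    d = distances(beta0, beta, X)
    m = min(d)  # the margin of this hyperplane
    return [xi for xi, di in zip(X, d) if di - m <= tol]


# Hypothetical toy data; the hyperplane x1 = 1 is assumed to be the
# maximal margin hyperplane for it.
X = [(0.0, 0.0), (0.0, 2.0), (-1.0, 1.0), (2.0, 1.0), (3.0, 0.0), (4.0, 2.0)]
beta0, beta = -1.0, [1.0, 0.0]

print(support_vectors(beta0, beta, X))
```

For this toy data the function returns three points, `(0.0, 0.0)`, `(0.0, 2.0)` and `(2.0, 1.0)`, each at distance 1 from the hyperplane.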

## Constructing the maximal margin classifier
The maximal margin hyperplane based on a set of $n$ observations is the solution to the optimization problem:

$$\begin{align}
&\max_{\beta_0,\beta_1,\dots,\beta_p, M}{M} \notag\\
&\text{subject to } \sum_{j=1}^{p}{\beta_j^2} = 1, \notag\\
&y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) \ge M \quad \forall i=1,\dots,n
\end{align}$$

The last constraint guarantees that every observation lies on the correct side of the hyperplane, at a distance of at least $M$ from it, where $M$ is the margin of the hyperplane, which we want to maximize. The first constraint ensures that the perpendicular distance from the *i*-th observation to the hyperplane is given by $y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip})$. This follows from the formula for the [distance from a point to a hyperplane](https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_plane), whose denominator $\lVert \beta \rVert = \sqrt{\beta_1^2 + \dots + \beta_p^2}$ equals $1$ under the first constraint:

$$\frac{\lvert\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}\rvert}{\lVert \beta \rVert}.$$
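To make the optimization problem concrete, here is a rough sketch that solves it by brute force in two dimensions. It is not how SVM libraries solve the problem (they rely on quadratic programming); it only enumerates unit normal vectors $(\cos t, \sin t)$, which satisfy the norm constraint by construction, and for each one picks the $\beta_0$ that places the hyperplane midway between the two classes. The function name `max_margin_2d`, the grid resolution, and the toy data are assumptions made for illustration.

```python
import math


def max_margin_2d(X, y, steps=3600):
    """Grid search over unit normals; return (M, beta0, beta) with the largest M."""
    pos = [xi for xi, yi in zip(X, y) if yi == 1]
    neg = [xi for xi, yi in zip(X, y) if yi == -1]
    best = None
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        beta = (math.cos(t), math.sin(t))  # beta_1^2 + beta_2^2 = 1
        lo = min(beta[0] * p[0] + beta[1] * p[1] for p in pos)  # nearest positive
        hi = max(beta[0] * p[0] + beta[1] * p[1] for p in neg)  # nearest negative
        # For this direction the constraints read M <= beta0 + lo and
        # M <= -beta0 - hi, so the best choice is the midpoint.
        beta0 = -(lo + hi) / 2.0
        M = (lo - hi) / 2.0
        if best is None or M > best[0]:
            best = (M, beta0, beta)
    return best


# Hypothetical, linearly separable toy data
X = [(0.0, 0.0), (0.0, 2.0), (-1.0, 1.0), (2.0, 1.0), (3.0, 0.0), (4.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]

M, beta0, beta = max_margin_2d(X, y)
print(M, beta0, beta)
```

For this data the search returns a margin of 1 with the hyperplane $x_1 = 1$. If the returned $M$ is not positive, no direction on the grid separates the classes.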

In many cases no separating hyperplane exists, and the optimization problem we have just presented then has no solution with $M > 0$. Fortunately, the concept of a separating hyperplane can be generalized to one that *almost* separates the classes, using a *soft margin*. This is what a support vector classifier does.