<a href="https://colab.research.google.com/github/pietroventurini/machine-learning-notes/blob/master/8%20-%20Support%20Vector%20Machines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Support Vector Machines
### Contents
1. **[Introduction](#Introduction)**  
2. **[The maximal margin classifier](#The-maximal-margin-classifier)**  
3. **[The support vector classifier](#The-support-vector-classifier)**  
    3.1. Bias-variance tradeoff  
    3.2. Robustness  
    3.3. Alternative formulation
4. **[Support Vector Machines](#Support-Vector-Machines-(SVM))**  
    4.1. Kernels  
    4.2. Multi-class classification

## Introduction

The support vector machine is an approach to classification that was developed in the 1990s as a generalization of a simpler classifier called the ***maximal margin classifier***. In the first part of this notebook, we will assume that the training data points are linearly separable and belong to only two classes (binary classification). Unlike other classification algorithms (for example logistic regression), SVMs do not merely classify data points (each of which is represented by a $p$-dimensional vector): they aim to find the best possible boundary, that is, the $(p-1)$-dimensional *hyperplane* that achieves the largest separation between the classes.

**Definition:** a ***hyperplane*** of a $p$-dimensional space $V$ is a flat affine subspace of dimension $p-1$, or equivalently, of codimension $1$ in $V$.

**Example:** if $V$ is the vector space $\mathbb{R}^3$, then a hyperplane is a flat two dimensional subspace, that is, a plane.

In a $p$-dimensional space, a hyperplane has the form

$$\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p = 0.$$

That is, if a point $\mathbf{x} = [x_1,\dots,x_p]^\top$ satisfies this equation, then $\mathbf{x}$ lies on the hyperplane. If the equation is not satisfied, then $\mathbf{x}$ lies on one of the two sides of the hyperplane: the side where $\beta_0 + \beta_1x_1 + \dots + \beta_px_p > 0$, or the side where $\beta_0 + \beta_1x_1 + \dots + \beta_px_p < 0$.
Using vectorized notation, a hyperplane is defined by:

$$\beta_0 + \mathbf{\beta}^\top \mathbf{x} = 0.$$
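As a quick illustration, the sign of $\beta_0 + \mathbf{\beta}^\top \mathbf{x}$ tells us on which side of the hyperplane a point lies. The sketch below uses NumPy and a made-up hyperplane in $\mathbb{R}^2$; the coefficient values and the helper name `side` are only assumptions chosen for the example.

```python
import numpy as np

# Hypothetical hyperplane in R^2: 1 + 2*x1 + 3*x2 = 0
beta0 = 1.0
beta = np.array([2.0, 3.0])

def side(x):
    """Return +1 or -1 for the two sides of the hyperplane, 0 if x lies on it."""
    return int(np.sign(beta0 + beta @ x))

print(side(np.array([1.0, 1.0])))    # 1 + 2 + 3 = 6 > 0
print(side(np.array([-1.0, -1.0])))  # 1 - 2 - 3 = -4 < 0
print(side(np.array([1.0, -1.0])))   # 1 + 2 - 3 = 0, on the hyperplane
```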

---
#### Extra: finding the hyperplane

A hyperplane is identified by a vector $\mathbf{\beta}=[\beta_1\dots\beta_p]^\top$ perpendicular to the hyperplane itself, and by a point $\mathbf{p}=[p_1\dots p_p]^\top$ belonging to the hyperplane. A point $\mathbf{x}=[x_1\dots x_p]^\top$ belongs to the hyperplane if the vector $\mathbf{x}-\mathbf{p}$ is orthogonal to the direction identified by $\mathbf{\beta}$, that is, if $\langle \mathbf{x}-\mathbf{p},\mathbf{\beta}\rangle=0$ (meaning that the displacement $\mathbf{x}-\mathbf{p}$ is parallel to the hyperplane).

<img src="images/svm/point_of_hyperplane.png" alt="point belonging to a hyperplane" style="display: block; margin-left: auto; margin-right: auto; width:20em"/>

If we develop it, we obtain:

$$
\begin{split}
    & \langle \mathbf{\beta}, \mathbf{x}-\mathbf{p}\rangle=0 \\
    &\Leftrightarrow \mathbf{\beta}^\top (\mathbf{x}-\mathbf{p})=0, \\
    &\Leftrightarrow \beta_1(x_1-p_1) + \beta_2(x_2-p_2) + \dots + \beta_p(x_p-p_p)=0, \\
    &\Leftrightarrow \beta_1x_1 + \dots + \beta_px_p=\beta_1p_1 + \dots + \beta_pp_p, \\
    &\Leftrightarrow \mathbf{\beta}^\top \mathbf{x} = \mathbf{\beta}^\top\mathbf{p},
\end{split}
$$

if we now call $\beta_0 = -\mathbf{\beta}^\top\mathbf{p}$, we obtain:

$$\beta_0 + \mathbf{\beta}^\top \mathbf{x} = 0.$$

Notice that, by changing the value of $\beta_0$ (i.e. by moving $\mathbf{p}$), we are moving the hyperplane along the $\mathbf{\beta}$ direction.
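The construction above can be checked numerically. In this small sketch the normal vector and the point $\mathbf{p}$ are arbitrary illustrative values: we compute $\beta_0 = -\mathbf{\beta}^\top\mathbf{p}$, then verify that moving from $\mathbf{p}$ in a direction orthogonal to $\mathbf{\beta}$ keeps us on the hyperplane, while moving along $\mathbf{\beta}$ does not.

```python
import numpy as np

beta = np.array([1.0, 2.0])   # normal vector (illustrative values)
p = np.array([3.0, 1.0])      # a point assumed to lie on the hyperplane

beta0 = -beta @ p             # beta_0 = -beta^T p

d = np.array([2.0, -1.0])     # direction orthogonal to beta (beta @ d = 0)
x_on = p + 4.0 * d            # still on the hyperplane
x_off = p + beta              # moved along the normal, leaves the hyperplane

on_value = beta0 + beta @ x_on     # equals 0
off_value = beta0 + beta @ x_off   # equals ||beta||^2
print(beta0, on_value, off_value)
```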

---

The best hyperplane, which we are seeking, is the one for which the distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is called the **maximum-margin hyperplane** and the linear classifier it identifies is called the **maximum-margin classifier**.

Classifiers having decision boundaries with large margins are preferred since they tend to have lower generalization error (less prone to overfitting).

## The maximal margin classifier
Consider a binary classification problem where our dataset consists of $n$ samples, each of which consists of $p$ features. Therefore, $X \in \mathbb{R}^{n\times p}$. The training samples (rows of $X$) are:

$$x_1 = \begin{bmatrix}x_{11} \\ \vdots \\ x_{1p}\end{bmatrix}, \; \dots \;, x_n = \begin{bmatrix}x_{n1} \\ \vdots \\ x_{np}\end{bmatrix},$$

and they fall into two classes, $y_i \in \{-1,1\}$. We want to predict the output label of a new sample $\mathbf{x^*} = \begin{bmatrix}x_1^* & \dots & x_p^*\end{bmatrix}^\top$. Similarly to what we have already seen in the previous chapters, we seek a *separating hyperplane*, which has the property that, for all $i=1,\dots,n$,

$$y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) > 0$$

which is a compact way of writing:

$$\begin{cases}
\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} > 0 & \text{if } y_i = 1 \\
\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} < 0 & \text{if } y_i = -1
\end{cases}$$

In other words, every observation in our dataset must be on the correct side of the separating hyperplane. If a separating hyperplane exists, then we can build a simple classifier according to the side on which the observation we want to classify falls, that is, the sign of $f(\mathbf{x^*}) = \beta_0 + \mathbf{\beta}^\top \mathbf{x^*}= \beta_0 + \beta_1x_1^* + \dots + \beta_px_p^*$. The *magnitude* of $f(\mathbf{x^*})$ can be used as a confidence measure about the class prediction. But which of the infinitely many separating hyperplanes should be chosen? A natural choice is the *maximal margin hyperplane*, which is the separating hyperplane farthest from the training observations.

For a fixed separating hyperplane, we can compute the perpendicular distance of each data point from it. The smallest distance is known as the ***margin*** (or gutter). The **maximal margin hyperplane** is the separating hyperplane whose margin is largest, that is, the one with the largest minimum distance from the observations.
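To make the definition of margin concrete, the following sketch computes it for two candidate separating hyperplanes on a tiny made-up dataset. The data, the two hyperplanes and the helper `margin` are assumptions for illustration only; the second hyperplane sits midway between the classes and therefore has the larger margin.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

def margin(beta0, beta, X, y):
    """Smallest signed distance y_i * f(x_i) / ||beta|| over the training set.
    A positive value means the hyperplane separates the two classes."""
    signed = y * (beta0 + X @ beta) / np.linalg.norm(beta)
    return signed.min()

m_a = margin(0.0, np.array([1.0, 1.0]), X, y)    # hyperplane x1 + x2 = 0
m_b = margin(-1.0, np.array([1.0, 1.0]), X, y)   # hyperplane x1 + x2 = 1
print(m_a, m_b)  # m_b is larger than m_a
```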

**Remark:** the maximal margin classifier with coefficients $\beta_0, \dots, \beta_p$ can lead to overfitting when $p$ is large.

In the following image we can observe the maximal margin hyperplane that has been found in a two dimensional space, with the corresponding margin.

<img src='images/svm/support-vectors.png' alt='maximal margin hyperplane' style="display: block; margin-left: auto; margin-right: auto; width:20em"/>

There are 3 points which are equidistant from the hyperplane. Those are called ***support vectors***. The chosen separating hyperplane depends directly on those three points but not on the other observations: moving around even one support vector may affect the position of the hyperplane.

### Constructing the maximal margin classifier

The maximal margin hyperplane based on a set of $n$ observations is the solution to the optimization problem:

$$\begin{align}
&\max_{\beta_0,\beta_1\dots\beta_p, M}{M} \notag\\
\text{subject to: } & \sum_{j=1}^{p}{\beta_j^2} = 1, \notag\\
&y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) \ge M \quad \forall i=1\dots n
\end{align}$$

The last constraint guarantees every observation to be on the right side of the hyperplane, where $M$ represents the margin of the hyperplane which we want to maximize. 

The first constraint, which is equivalent to $\Vert\mathbf{\beta}\Vert=1$, ensures that:
1. each hyperplane is described by a unique set of parameters; otherwise we could obtain the same hyperplane by scaling the parameters by a common factor (for instance $3x_1+2x_2-1=0$ and $6x_1+4x_2-2=0$ both define the same hyperplane),
2. the perpendicular distance from the $i$-th observation to the hyperplane is given by $y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip})$. This fact comes from the formula of the perpendicular [distance from a point $\mathbf{x}$ to a hyperplane](https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_plane):

$$\frac{\vert \mathbf{x}^\top\mathbf{\beta} + \beta_0\vert}{\Vert \mathbf{\beta} \Vert}.$$
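A short numerical check of both points: the two parametrizations $3x_1+2x_2-1=0$ and $6x_1+4x_2-2=0$ give the same distance, and once the coefficients are rescaled so that $\Vert\mathbf{\beta}\Vert=1$ the division by the norm is no longer needed. The test point and the helper name `distance` are arbitrary choices for this sketch.

```python
import numpy as np

def distance(beta0, beta, x):
    """Perpendicular distance from x to the hyperplane beta0 + beta^T x = 0."""
    return abs(x @ beta + beta0) / np.linalg.norm(beta)

x = np.array([1.0, 1.0])
d1 = distance(-1.0, np.array([3.0, 2.0]), x)   # 3*x1 + 2*x2 - 1 = 0
d2 = distance(-2.0, np.array([6.0, 4.0]), x)   # 6*x1 + 4*x2 - 2 = 0

# Rescale the first parametrization so that ||beta|| = 1.
norm = np.linalg.norm([3.0, 2.0])
beta_unit = np.array([3.0, 2.0]) / norm
beta0_unit = -1.0 / norm
d3 = abs(x @ beta_unit + beta0_unit)           # no division needed any more

print(d1, d2, d3)  # all three values coincide
```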

This problem can be solved after reformulating it as a quadratic programming optimization problem, subject to linear constraints, in which the objective function is **convex** (therefore, it has a single global minimum). 

First we remove the constraint $\Vert\mathbf{\beta}\Vert=1$ by rewriting the other constraints as $\frac{1}{\Vert\mathbf{\beta}\Vert}y_i(\mathbf{x_i}^\top\mathbf{\beta} + \beta_0 )\ge M$. Since these constraints are unaffected by a positive rescaling of $(\mathbf{\beta},\beta_0)$, we can arbitrarily set $\Vert\mathbf{\beta}\Vert=1/ M$ (meaning that the distance from the hyperplane to either of the two margins is $M=\frac{1}{\Vert \beta\Vert}$). The two margins then become $\mathbf{\beta}^\top\mathbf{x} + \beta_0=1$ and $\mathbf{\beta}^\top\mathbf{x} + \beta_0=-1$, and the gap between them is $\frac{2}{\Vert \mathbf{\beta} \Vert}$. Maximizing this gap is the same as minimizing $\Vert \mathbf{\beta} \Vert$ and, since squaring is monotonically increasing on non-negative numbers, the same as minimizing $\frac{1}{2}\Vert \mathbf{\beta} \Vert^2$, which is a convex quadratic function:

$$
\begin{split}
&\min_{\beta_0,\mathbf{\beta}}{\frac{1}{2}\Vert \mathbf{\beta}\Vert^2} \\
\text{subject to: } &\; y_i(\mathbf{x_i}^\top \mathbf{\beta} + \beta_0) \ge 1 \quad \forall i=1\dots n \\
\end{split}
$$

This constrained minimization problem can be recast into an unconstrained one by means of Lagrange multipliers. First we write the Lagrange primal function.

$$L_P = \frac{1}{2}\Vert\mathbf{\beta}\Vert^2 - \sum_{i=1}^{n} \alpha_i\left[y_i(\mathbf{x_i}^\top\mathbf{\beta}+\beta_0)-1\right]\qquad (*)$$

In order to minimize it, we compute the partial derivatives with respect to $\beta$ and $\beta_0$ and set them equal to zero:

$$
\begin{split}
&\frac{\partial}{\partial \mathbf{\beta}} L_P = \mathbf{\beta} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x_i} = 0 \quad\Leftrightarrow \quad\mathbf{\beta} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x_i} \qquad (**) \\
&\frac{\partial}{\partial \beta_0} L_P = - \sum_{i=1}^{n} \alpha_i y_i = 0 \quad \Leftrightarrow \quad 0 = \sum_{i=1}^{n} \alpha_i y_i\\
\end{split}
$$

By substituting these two values back into $(*)$, we obtain the Wolfe dual:

$$
\begin{split}
&L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k y_i y_k \mathbf{x_i}^\top \mathbf{x_k} \\
&\text{subject to: } \alpha_i \ge 0 \quad \forall i=1\dots n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.
\end{split}
$$

The solution is obtained by maximizing $L_D$ with respect to $\alpha$ under these constraints, which is a simpler convex optimization problem that standard quadratic programming software can handle. In addition to $(**)$, the solution must satisfy the complementary slackness condition:

$$\alpha_i\left[y_i(\mathbf{x_i}^\top\mathbf{\beta}+\beta_0)-1\right]=0 \quad \forall i. \qquad (***)$$

Therefore:
- if $\alpha_i > 0$, then $|\mathbf{x_i}^\top\mathbf{\beta}+\beta_0| = 1$, meaning that $\mathbf{x_i}$ is on the boundary of the margin, that is, a support vector;
- if $|\mathbf{x_i}^\top\mathbf{\beta}+\beta_0| > 1$, then $\mathbf{x_i}$ is not on the boundary of the margin, and $\alpha_i=0$ (this will happen for most of the points).

From this fact and from equation $(**)$, it follows that the solution vector $\hat{\mathbf{\beta}}$ depends only on the points $\mathbf{x_i}$ that lie on the boundary of the slab, i.e. the support vectors.
The optimal value of $\hat{\beta}_0$ can be found by solving $(***)$ for any of the support vectors $\mathbf{x_i}$.
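The relations above can be verified on a toy problem. In the sketch below, the three data points and the dual solution `alpha` are assumptions: the multipliers were worked out by hand for this specific dataset rather than produced by a solver. From them we recover $\hat{\mathbf{\beta}}$ through $(**)$ and $\hat{\beta}_0$ through $(***)$, and check that only the support vectors have $\alpha_i > 0$.

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])

# Dual solution worked out by hand for this toy set (an assumption of the example).
alpha = np.array([0.25, 0.25, 0.0])

# Equation (**): beta = sum_i alpha_i y_i x_i
beta = (alpha * y) @ X

# Equation (***) for a support vector s: y_s (x_s^T beta + beta_0) = 1
s = int(np.argmax(alpha > 0))   # index of the first support vector
beta0 = y[s] - X[s] @ beta      # uses 1 / y_s = y_s, since y_s is +1 or -1

margins = y * (X @ beta + beta0)
print(beta, beta0)               # hyperplane parameters
print(margins)                   # equal to 1 on support vectors, larger elsewhere
print(alpha @ y)                 # the constraint sum_i alpha_i y_i = 0
print(alpha * (margins - 1.0))   # complementary slackness: all zeros
```

The third point lies strictly beyond the margin, so its multiplier is zero and removing it would leave the hyperplane unchanged.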

**Question:** why not just solve the original problem? Because the dual lets us solve the problem by computing just the inner products of $x_i$ and $x_k$ (which will be very important later on when we want to solve non-linearly separable classification problems).


In many real problems the data points are not separable, so the optimization problem we have just presented has no solution. Fortunately, the concept of separating hyperplane can be generalized to one that *almost* separates the classes, using a ***soft margin***. This is what a support vector classifier does.

## The support vector classifier

A classifier based on a maximal margin separating hyperplane may not be desirable, since it would be extremely sensitive to changes in individual observations, which is a sign of overfitting. Indeed, we may face a situation in which the positive training samples are all concentrated in a particular area of, say, $\mathbb{R}^2$, except one, which is further away from the others and close to the negative training points. That outlier will considerably affect the hyperplane and its margin, which can become tiny, resulting in an unsatisfactory classifier.

To cope with this issue, we seek a hyperplane that does not necessarily separate the two classes perfectly, aiming at greater robustness to individual observations and better classification of most of the training observations. This is what a ***support vector classifier*** (also known as *soft margin classifier*) does: it allows some observations to be on the incorrect side of the margin, or even on the wrong side of the hyperplane. The classification still depends on which side of the hyperplane the point lies.

It is the solution to the following optimization problem:

$$\begin{align}
&\max_{\beta_0,\beta_1,\dots,\beta_p, \epsilon_1,\dots,\epsilon_n, M}{M} \notag\\
\text{subject to: } & \sum_{j=1}^{p}{\beta_j^2} = 1, \notag\\
&y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) \ge M(1-\epsilon_i) \quad \forall i=1,\dots,n, \\
&\epsilon_i \ge 0, \quad \sum_{i=1}^{n}{\epsilon_i} \le \Gamma
\end{align}$$

where $\Gamma\ge 0$ is a tuning parameter and $M$ is the distance of each margin to the hyperplane. $\epsilon_i$ are called **slack variables** and allow individual observations to be on the wrong side of the margin ($\epsilon_i$ is the proportional amount by which the prediction is on the wrong side of the margin):
- If $\epsilon_i = 0$ then the $i$-th observation lies on the correct side of the margin.
- If $\epsilon_i > 0$ then the $i$-th observation *has violated the margin* because it is on the wrong side.
- If $\epsilon_i > 1$ then it is on the wrong side of the hyperplane.

The parameter $\Gamma$ bounds the sum of the $\epsilon_i$ terms, and so it determines how much the violation is tolerated. Note that if $\Gamma>0$, then no more than $\Gamma$ observations can be on the wrong side of the hyperplane.
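For a given hyperplane and margin, the smallest slack that satisfies the constraints is $\epsilon_i = \max\left(0,\, 1 - y_i f(\mathbf{x_i})/M\right)$. The sketch below computes it for a made-up hyperplane with $\Vert\mathbf{\beta}\Vert=1$ and labels each observation according to the three cases listed above; the data, the hyperplane and the helper `slack` are illustrative assumptions.

```python
import numpy as np

def slack(beta0, beta, M, X, y):
    """epsilon_i = max(0, 1 - y_i f(x_i) / M), assuming ||beta|| = 1."""
    signed_dist = y * (beta0 + X @ beta)
    return np.maximum(0.0, 1.0 - signed_dist / M)

# Hypothetical hyperplane x1 = 0 with margin M = 1 (beta has unit norm).
beta0, beta, M = 0.0, np.array([1.0, 0.0]), 1.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0], [-3.0, 1.0]])
y = np.array([1, 1, 1, -1])

eps = slack(beta0, beta, M, X, y)
for e in eps:
    if e == 0:
        status = "correct side of the margin"
    elif e <= 1:
        status = "violates the margin"
    else:
        status = "wrong side of the hyperplane"
    print(f"epsilon = {e:.2f}: {status}")

print("total slack:", eps.sum())  # this hyperplane is feasible only if Gamma >= total slack
```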

The only points that affect the hyperplane are those which lie on the margin or that violate it. Therefore, if we change the position of a point that strictly lies on the correct side of the margin (and we leave it on that side), then it won't affect the hyperplane, hence the classifier. The observations that are on the margin or on its wrong side are called ***support vectors*** and their position affects the classifier.

### Bias-variance tradeoff

The parameter $\Gamma$ controls the *bias-variance tradeoff*. 
- **If $\Gamma$ is small**, we seek narrow margins that are rarely violated, so there will be fewer support vectors. The resulting classifier fits the training data closely, possibly resulting in **low bias but high variance**. 
- **If $\Gamma$ is large**, then the margin is wider and more observations are allowed to violate it. Consequently there will be many support vectors, meaning that many observations are involved in determining the hyperplane. The resulting classifier will potentially have **low variance but high bias**.

### Robustness of the SVC

Unlike other classification methods that depend on all the observations, such as [linear discriminant analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis), the support vector classifier depends only on a subset of points, so it is robust to observations that are far away from the hyperplane. In this respect it is similar to logistic regression, which has low sensitivity to points far away from the decision boundary.

### Alternative formulation

Similarly to what we've done with the maximal margin classifier, we can drop the $\Vert\mathbf{\beta}\Vert=1$ constraint by changing the other constraints to $\frac{1}{\Vert\mathbf{\beta}\Vert}y_i(\mathbf{x_i}^\top\mathbf{\beta}+\beta_0)\ge M(1-\epsilon_i) \quad \forall i=1\dots n$ and then defining $M=\frac{1}{\Vert\mathbf{\beta}\Vert}$, thus obtaining an equivalent formulation:

$$
\begin{split}
&\min_{\beta_0,\mathbf{\beta}, \epsilon_1,\dots,\epsilon_n}{\frac{1}{2}\Vert\mathbf{\beta}\Vert^2 + C\sum_{i=1}^n{\epsilon_i}} \\
\text{subject to: }\; &y_i(\mathbf{x_i}^\top \mathbf{\beta} + \beta_0) \ge 1-\epsilon_i \quad \forall i=1,\dots,n, \\
&\epsilon_i \ge 0 \quad \forall i=1,\dots,n
\end{split}
$$

The hyperparameter $C$ works in an opposite way to $\Gamma$: the larger $C$ is, the fewer are the violations of the margin that we allow.
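To see how $C$ trades margin width against violations, we can evaluate the objective $\frac{1}{2}\Vert\mathbf{\beta}\Vert^2 + C\sum_i \epsilon_i$ for two hand-picked candidate hyperplanes on a one-dimensional toy set with an outlier. Everything here (data, candidates, the helper `objective`) is an assumption made for the example, and the comparison is only between these two candidates, not a search for the optimum.

```python
import numpy as np

# 1-D toy data: the positive point at -1.5 is an outlier close to the negatives.
X = np.array([[2.0], [3.0], [-1.5], [-2.0], [-3.0]])
y = np.array([1, 1, 1, -1, -1])

def objective(beta0, beta, C):
    """0.5 * ||beta||^2 + C * sum_i epsilon_i, using the smallest feasible slacks."""
    eps = np.maximum(0.0, 1.0 - y * (X @ beta + beta0))
    return 0.5 * (beta @ beta) + C * eps.sum()

narrow = (7.0, np.array([4.0]))   # f(x) = 4x + 7: no violations, margin 0.25
wide = (0.0, np.array([0.5]))     # f(x) = 0.5x: margin 2, misclassifies the outlier

for C in (1.0, 10.0):
    print(f"C = {C}: narrow = {objective(*narrow, C)}, wide = {objective(*wide, C)}")
```

With the small value of $C$ the wide margin has the lower objective despite the misclassified outlier, while with the large value the narrow, violation-free hyperplane is preferred.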

This constrained optimization problem can be translated into an unconstrained one using Lagrange multipliers. First we write the  Lagrange primal objective function (to be minimized)

$$
\begin{split}
& L_P = \frac{1}{2}\Vert\mathbf{\beta}\Vert^2 + C\sum_{i=1}^n{\epsilon_i} - 
\sum_{i=1}^n \alpha_i\left[y_i(\mathbf{x_i}^\top \mathbf{\beta}+\beta_0)-(1-\epsilon_i)\right] -
\sum_{i=1}^n \mu_i\epsilon_i \qquad (*) \\
\text{subject to: } & \alpha_i\ge 0, \mu_i \ge 0, \epsilon_i \ge 0 \quad \forall i=1\dots n \\
\end{split}
$$

We can minimize it by computing the derivatives with respect to $\mathbf{\beta}, \beta_0$ and $\epsilon_i$.

$$
\begin{split}
\mathbf{\beta} &= \sum_{i=1}^{n} \alpha_i y_i \mathbf{x_i}, \\
0 &= \sum_{i=1}^{n} \alpha_i y_i,\\
\alpha_i &= C - \mu_i \quad \forall i=1\dots n.
\end{split}
$$

Now, if we substitute these three expressions into $(*)$ we obtain the Lagrangian dual objective function (to be maximized):

$$
L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n} \alpha_i\alpha_k y_i y_k \mathbf{x_i}^\top \mathbf{x_k}
$$

In addition to the non-negativity constraints of $\alpha_i, \mu_i$ and $\epsilon_i$, we have to impose the constraints coming from the conditions on the partial derivatives (for them to be zero), and the constraints:

$$
\begin{split}
\alpha_i\left[ y_i (\mathbf{x_i}^\top\beta + \beta_0) - (1-\epsilon_i)\right] & = 0 \quad \forall i = 1\dots n \\
\mu_i\epsilon_i &= 0 \quad \forall i = 1\dots n\\
y_i (\mathbf{x_i}^\top\beta + \beta_0) - (1-\epsilon_i) & \ge 0 \quad \forall i = 1\dots n
\end{split}
$$

From the condition on the first partial derivative, we see that a solution has the form $\hat{\mathbf{\beta}} = \sum_{i=1}^{n} \hat{\alpha}_i y_i \mathbf{x_i}$ and has nonzero $\hat{\alpha}_i$ coefficients only for those points $\mathbf{x_i}$ such that $y_i (\mathbf{x_i}^\top\beta + \beta_0) - (1-\epsilon_i) = 0$. They are called support vectors, since $\mathbf{\beta}$ can be represented in terms of them alone. 
- Those with $\hat{\epsilon}_i=0$ lie on the edge of the margin, and have $0 < \hat{\alpha}_i < C$. 
- Those with $\hat{\epsilon}_i>0$ have $\hat{\alpha}_i = C$. 

Any margin point ($0<\hat{\alpha}_i,\hat{\epsilon}_i=0$) can be used to find $\beta_0$ by solving $y_i f(x_i)=1$ where $f(x) = \mathbf{x}^\top \mathbf{\beta} + \beta_0$.

A new point $\mathbf{x}$ will be classified according to:

$$
\begin{split}
G(\mathbf{x}) &= \text{sign}f(\mathbf{x}) \\
&=\text{sign}(\mathbf{x}^\top \hat{\beta} + \hat{\beta}_0) \\
&=\text{sign}(\mathbf{x}^\top \sum_{i=1}^{n} \hat{\alpha}_i y_i \mathbf{x_i} + \hat{\beta}_0)
\end{split}
$$

**Remark:** We don't really need all the $n$ points to make a prediction, but just those with $\hat{\alpha}_i > 0$. 
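The remark can be checked directly: evaluating $f(\mathbf{x})$ with all the training points or with the support vectors alone gives the same value. The toy data, the multipliers `alpha_hat` and the intercept are assumed values for a small separable problem, not the output of a solver.

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
alpha_hat = np.array([0.25, 0.25, 0.0])   # assumed dual solution for this toy set
beta0_hat = 0.0                           # assumed intercept

def f(x, X, y, alpha, beta0):
    """f(x) = sum_i alpha_i y_i <x, x_i> + beta_0."""
    return (alpha * y) @ (X @ x) + beta0

x_new = np.array([0.5, 2.0])

sv = alpha_hat > 0   # boolean mask selecting the support vectors
f_all = f(x_new, X, y, alpha_hat, beta0_hat)
f_sv = f(x_new, X[sv], y[sv], alpha_hat[sv], beta0_hat)
label = int(np.sign(f_sv))

print(f_all, f_sv, label)  # the two decision values coincide
```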

# Support Vector Machines (SVM)

Support vector machines (SVM) are an extension of the support vector classifier that results from **enlarging the feature space using _kernels_**. Enlarging the feature space is done when there is not a linear relationship between the predictors $X_1\dots X_p$ and the output, in order to accommodate non-linear boundaries between the classes. More precisely, the feature space is enlarged by using ***basis expansions***, considering functions of the predictors, such as polynomial terms, for example $X_1^3 \dots X_p^3$, or interaction terms, for example $X_iX_j$ with $i\neq j$. We choose the basis functions $h_m(x),\; m=1\dots M$, and fit the classifier with transformed input features $h(x_i) = \left(h_1(x_i), \dots, h_M(x_i)\right)$, producing the *nonlinear* function $\hat{f}(x) = h(x)^\top\hat{\beta}+\hat{\beta}_0$. Here $h$ is a function mapping from a $p$-dimensional space to an $M$-dimensional space. Usually, $M$ is much larger than $p$.

For instance, we may have $h:\mathbb{R}^2 \rightarrow \mathbb{R}^5$ defined as $h((x_1,x_2)) = (x_1,x_2,x_1^2,x_2^2,x_1x_2)$.
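A small sketch of why such an expansion helps: points labelled by whether they fall inside or outside the unit circle cannot be separated by a line in $\mathbb{R}^2$, but after applying the five-component map above, the circle $x_1^2+x_2^2=1$ becomes a hyperplane in the enlarged space. The data and the coefficients `beta`, `beta0` are chosen by hand for illustration.

```python
import numpy as np

def h(X):
    """Map each row (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Toy labels: +1 outside the unit circle, -1 inside.
X = np.array([[0.0, 0.0], [0.5, -0.5], [-0.3, 0.4],
              [2.0, 0.0], [0.0, -2.0], [-1.5, 1.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# In the enlarged space, x1^2 + x2^2 - 1 = 0 is a linear equation.
beta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
beta0 = -1.0

pred = np.sign(h(X) @ beta + beta0).astype(int)
print(pred)  # matches y
```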

SVMs allow the dimension of the enlarged space to become very large, and perform this trick in an efficient way avoiding a large number of computations that would otherwise come from a large number of features.

## Kernels

First of all, notice that, for the support vector classifier, both the Lagrange dual function $L_D$ and the solution function $f(x)$ can be written by means of inner products, directly for the transformed feature vectors $h(\mathbf{x_i})$, recalling that the inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$ of $p$ components is defined as $\langle \mathbf{x},\mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{p}{x_i y_i}$.


$$
\begin{split}
L_D &= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n} \alpha_i\alpha_k y_i y_k \langle h(\mathbf{x_i}), h(\mathbf{x_k}) \rangle \\
f(\mathbf{x}) &= h(\mathbf{x})^\top \mathbf{\beta} + \beta_0 \\
&= \sum_{i=1}^{n} \alpha_i y_i \langle h(\mathbf{x}), h(\mathbf{x_i}) \rangle + \beta_0
\end{split}
$$

Now we want to extend the solution of the support vector classifier, by replacing the inner product with a *generalization* of the form $K(x,x')$, which takes the name of ***kernel***, where $x\in\mathbb{R}^p$ and $x'\in\mathbb{R}^p$. For all $x,x'$ in the input space, the kernel function $K(x,x')$ can be expressed as an inner product in another space.

**Kernel trick:** we don't need to perform the feature transformation step explicitly, nor do we have to specify the transformation $h(x)$ at all; we only need the kernel function $K(x,x') = \langle h(x),h(x')\rangle$, which has to be a symmetric positive (semi-)definite function.

**Definition:** let $\mathcal{X}$ be a set. A symmetric function $K:\mathcal{X}\times \mathcal{X} \rightarrow\mathbb{R}$ is a **[positive definite kernel function](https://en.wikipedia.org/wiki/Positive-definite_kernel)** on $\mathcal{X}$ if and only if $\forall n \in \mathbb{N}$, $\forall x_1,\dots, x_n \in \mathcal{X}$ and $\forall c_1,\dots, c_n \in \mathbb{R}$, it holds:

$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i,x_j) \ge 0.$$
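In matrix terms, the condition says that the Gram matrix with entries $K(x_i,x_j)$ has a non-negative quadratic form $c^\top K c$ for every $c$, which is the same as having no negative eigenvalues. The sketch below checks this numerically on random points for a Gaussian kernel and, for contrast, for a symmetric function that is not a valid kernel. The helper names and the choice of a negated dot product as counterexample are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 arbitrary points in R^3
c = rng.normal(size=8)        # arbitrary real coefficients

def gram(kernel, X):
    """Matrix of kernel values K(x_i, x_j) for all pairs of rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def negated_dot(a, b):
    """Symmetric, but not a valid kernel: its Gram matrix is negative semidefinite."""
    return -(a @ b)

for name, k in [("rbf", rbf), ("negated dot product", negated_dot)]:
    K = gram(k, X)
    quad_form = c @ K @ c                   # sum_i sum_j c_i c_j K(x_i, x_j)
    min_eig = np.linalg.eigvalsh(K).min()   # non-negative (up to rounding) for a valid kernel
    print(f"{name}: quadratic form = {quad_form:.3f}, smallest eigenvalue = {min_eig:.3f}")
```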

Some [typical kernel functions](http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/) are listed below. 

#### Linear kernel

In the case of the SVC, we had

$$K(x,x') = \langle x,x' \rangle = x^\top x' = \sum_{j=1}^{p} x_jx_j' $$

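which is called the *linear kernel*. It essentially quantifies the similarity of a pair $(x,x')$ of observations using Pearson (standard) correlation: for observations that have been centered and scaled to unit norm, the inner product coincides with their correlation.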

#### Polynomial kernel

A polynomial kernel of degree $d>0$ has the form:

$$K(x,x') = \left(c + \langle x,x' \rangle \right)^d = \left( c + \sum_{j=1}^{p} x_jx_j' \right)^d, $$

with $c\ge0$.
Using a polynomial kernel with $d>1$ in the support vector classifier amounts to fitting it in a higher-dimensional space involving polynomials of degree $d$. Combining the support vector classifier (SVC) with a *non-linear* kernel results in a so-called **support vector machine (SVM)**, where the non-linear function has the form:

$$f(x) = \beta_0 + \sum_{i\in\mathcal{S}}\alpha_i y_i K(x , x_i),$$

where $\mathcal{S}$ is the set of indices of the support vectors.

#### Gaussian kernel (or Radial basis)

A radial kernel has the form:

$$
K(x,x') = \mathrm{exp} \left(-\gamma \Vert x-x' \Vert^2 \right)
= \mathrm{exp}\left(-\gamma \sum_{j=1}^p (x_j - x_j')^2 \right)
$$

where $\gamma > 0$ is a constant. If $\gamma$ is very small, the model behaves like a linear SVM. On the contrary, if $\gamma$ is large, we face the risk of overfitting.
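The role of $\gamma$ can be seen by evaluating the kernel for a nearby and a distant point. In this sketch the points and the three values of $\gamma$ are arbitrary: with a small $\gamma$ both points look almost equally similar to $x$, whereas with a large $\gamma$ only the very close point keeps a non-negligible kernel value, which is what makes the decision boundary very local.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-gamma * (diff @ diff))

x = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far = np.array([2.0, 0.0])

for gamma in (0.01, 1.0, 100.0):
    print(f"gamma = {gamma}: K(x, near) = {rbf_kernel(x, near, gamma):.4g}, "
          f"K(x, far) = {rbf_kernel(x, far, gamma):.4g}")
```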

<img src="images/svm/gaussian_kernel_gamma.png" alt="Comparison of decision boundaries using a Gaussian kernel with different values of gamma" style="display: block; margin-left: auto; margin-right: auto; width:50em"/>

Geometrically, the Gaussian kernel corresponds to a "bump" or a "cavity" centered at a training point $x'$. The result is a combination of bumps and cavities.

| <img src="images/svm/gaussian_kernel.png" alt="Training points using a Gaussian kernel correspond to bumps or cavities" style="display: block; margin-left: auto; margin-right: auto; width:20em"/> | <img src='images/svm/gaussian_kernel_hyperplane.png' alt="Resulting hyperplane after feature transformation using gaussian kernel" style="display: block; margin-left: auto; margin-right: auto; width:20em"/> |
|--------------------------------------------------------|-------------------------------------------------------------------------------------|
| Each training point corresponds to a bump or a cavity. | The resulting hyperplane after feature transformation using a Gaussian kernel. |

### Advantages of using kernels

The main advantage of using kernels instead of explicitly enlarging the feature space is that we only need to compute $K(x,x')$ for the $\binom{n}{2}$ distinct pairs of training observations, without ever working in the enlarged space. In some cases, such as with the radial kernel, the feature space is *infinite dimensional*, so computing $h(x)$ explicitly is not even possible.
Normally, calculating $\langle h(x), h(x')\rangle$ requires us to compute $h(x)$ and $h(x')$ first, and then take their dot product. These two steps can be quite expensive, as they involve manipulations in an $M$-dimensional space, even though the result of the dot product is just a scalar. With a clever choice of kernel, we can avoid these computations.

#### Example
Consider a feature space consisting of two variables $X_1$ and $X_2$, and consider the polynomial kernel with $c=1$ and $d=2$:

$$
\begin{split}
K(\mathbf{X},\mathbf{X'}) &= \left( 1 + \langle \mathbf{X}, \mathbf{X'}\rangle\right)^2 \\
&= \left( 1 + X_1X_1' + X_2X_2'\right)^2 \\
&= 1 + 2X_1X_1' + 2X_2X_2' + (X_1X_1')^2 + (X_2X_2')^2 + 2X_1X_1'X_2X_2'.
\end{split}
$$

This kernel can be seen as a dot product in a different feature space, indeed, if we define:

$$
h(\mathbf{X}) = \begin{bmatrix} h_1(\mathbf{X}) \\ \vdots \\ h_6(\mathbf{X})\end{bmatrix} = \begin{bmatrix} 1 \\ \sqrt{2}X_1 \\ \sqrt{2}X_2 \\ X_1^2 \\ X_2^2 \\ \sqrt{2}X_1X_2\end{bmatrix},
$$

then we can rewrite the kernel function as $K(\mathbf{X},\mathbf{X'}) = \langle h(\mathbf{X}), h(\mathbf{X'})\rangle$.

Therefore, rather than computing both $h(\mathbf{X})$ and $h(\mathbf{X'})$, and then performing the dot product $\langle h(\mathbf{X}), h(\mathbf{X'})\rangle$, it is more convenient to compute directly the kernel using $K(\mathbf{X},\mathbf{X'}) = \left( 1 + \langle \mathbf{X}, \mathbf{X'}\rangle\right)^2$. In this way, we avoided the computation of all the features in the higher-dimensional space.
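The equivalence in this example is easy to verify numerically. The sketch below evaluates the kernel directly and through the explicit six-dimensional map $h$ defined above; the two input vectors are arbitrary, and the function names are just labels for this example.

```python
import numpy as np

def poly_kernel(x, x_prime, c=1.0, d=2):
    """K(x, x') = (c + <x, x'>)^d."""
    return (c + x @ x_prime) ** d

def h(x):
    """Explicit feature map matching the kernel for c = 1, d = 2 and two variables."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1**2, x2**2, r2 * x1 * x2])

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

k_direct = poly_kernel(x, x_prime)   # one dot product in R^2, then a square
k_mapped = h(x) @ h(x_prime)         # dot product in the 6-dimensional space

print(k_direct, k_mapped)  # the two values agree up to rounding
```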

### Kernels describe similarity

Since the kernel is a dot product $\langle h(x), h(x')\rangle$ in a certain feature space, we can think of its geometrical interpretation: the inner product is the (signed) length of the projection of $h(x)$ onto $h(x')$, multiplied by the length of $h(x')$. Therefore, it describes _how much overlap $x$ and $x'$ have in the feature space_. In other words, how similar they are.

## Multi-class classification

When the target classes are more than two, there are different approaches that we can follow.

### One vs all (OVA)

With the OVA approach, if there are $k$ target classes, we construct $k$ different classifiers, each of which considers the points belonging to one of the $k$ classes as positive examples, and the remaining as negative. 

<img src="images/svm/ova_classifiers.png" alt="Four different classifiers learned on a k=4 classes problem" style="display: block; margin-left: auto; margin-right: auto; width:18em"/>

To make a prediction for a new point, we evaluate every classifier and choose the class whose classifier labels the point as positive. However, more than one classifier may predict the positive class for the same point, leading to ambiguous decisions (Figure A). For this reason, among the classes that resulted in a positive classification, we may take into consideration the distance between the point and each hyperplane, choosing the class whose hyperplane is farthest from the point (Figure B).

| <img src="images/svm/ova_ensemble.png" alt="Four different classifiers learned on a k=4 classes problem" style="display: block; margin-left: auto; margin-right: auto; width:18em"/> | <img src="images/svm/ova_heuristic.png" alt="Four different classifiers learned on a k=4 classes problem" style="display: block; margin-left: auto; margin-right: auto; width:18em"/> | 
|:------------------------------------------------------:|:-----------------------------------------------------------------------------------:|
| Figure A | Figure B |

**Problem:** every classifier is trained on an imbalanced set, where the negative training examples far outnumber the positive ones. Moreover, since the classifiers are trained independently, there is no guarantee that the quantities returned by their decision functions have the same scale.
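A minimal sketch of the OVA decision rule, assuming three already-trained linear classifiers whose weights `W` and intercepts `b` are invented for the example. The chosen point is labelled positive by two of the classifiers, and the tie is resolved by picking the class whose hyperplane is farthest from the point on the positive side.

```python
import numpy as np

# Hypothetical linear one-vs-all classifiers for k = 3 classes in R^2:
# row j of W and entry j of b define f_j(x) = W[j] @ x + b[j].
W = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])
b = np.array([0.0, 0.5, -0.5])

def predict_ova(x):
    """Return the class with the largest signed distance, plus the raw scores."""
    scores = W @ x + b                                # f_j(x) for every class j
    distances = scores / np.linalg.norm(W, axis=1)    # signed distance to each hyperplane
    return int(np.argmax(distances)), scores

label, scores = predict_ova(np.array([1.0, 2.0]))
print(scores)   # two classifiers return a positive value
print(label)    # class chosen by the distance heuristic
```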

### One vs one (OVO)

In the one vs one approach, we define separating hyperplanes for every pair of classes. This will lead to a larger number of classifiers, equal to $\binom{k}{2}=k(k-1)/2$, where $k$ is the number of classes. 

<img src="images/svm/ovo_classifiers.png" alt="Different classifiers learned on a k=4 classes problem using the One vs One approach" style="display: block; margin-left: auto; margin-right: auto; width:25em"/>

Predictions are made according to a *voting strategy*: the new observation we want to predict is passed through each classifier, and the predicted class is recorded. The class having the most votes is the one being assigned to the example.

<img src="images/svm/ovo_heuristic.png" alt="resulting decision boundary using a One vs One approach" style="display: block; margin-left: auto; margin-right: auto; width:14em"/>
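The voting strategy can be sketched as follows for $k=4$ classes. To keep the example self-contained, each pairwise classifier is replaced by a simple stand-in that picks the nearer of two hypothetical class centers (a linear rule, but not a trained SVM); only the enumeration of pairs and the vote counting reflect the OVO scheme itself.

```python
from itertools import combinations

import numpy as np

k = 4
pairs = list(combinations(range(k), 2))   # k(k-1)/2 pairwise classifiers

# Hypothetical class centers used by the stand-in pairwise classifiers.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])

def pairwise_winner(x, a, b):
    """Stand-in for the (a, b) classifier: choose the class with the nearer center."""
    da = np.linalg.norm(x - centers[a])
    db = np.linalg.norm(x - centers[b])
    return a if da <= db else b

def predict_ovo(x):
    """Let every pairwise classifier vote and return the most voted class."""
    votes = np.zeros(k, dtype=int)
    for a, b in pairs:
        votes[pairwise_winner(x, a, b)] += 1
    return int(np.argmax(votes)), votes

label, votes = predict_ovo(np.array([3.5, 1.0]))
print(len(pairs), votes, label)
```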

### OVA vs OVO

If the number of classes is not too large, choose OVO; otherwise, choose OVA, since the number of OVO classifiers grows quadratically with $k$.