# Supervised Machine Learning - Support Vector Machines


<!-- TOC START min:2 max:4 link:true asterisk:true update:true -->
* [What you will learn in this class](#what-you-will-learn-in-this-class)
* [Introduction](#introduction)
  * [What are Support Vector Machines (SVM)?](#what-are-support-vector-machines-svm)
  * [Hyperplane](#hyperplane)
    * [Vector space](#vector-space)
    * [Hyperplane](#hyperplane-1)
    * [Support Vector Machine: graphic representation](#support-vector-machine--graphic-representation)
  * [Non-linear SVM (Kernel-based estimators)](#non-linear-svm-kernel-based-estimators)
    * [Polynomial kernel](#polynomial-kernel)
    * [Radial Kernel](#radial-kernel)
    * [SVM in Python](#svm-in-python)
<!-- TOC END -->




## What you will learn in this class

This course is dedicated to the exploration of a family of models called Support Vector Machines, more commonly _SVM_ for short. These models were first invented for binary classification, but versions now exist for multi-class classification and for regression problems.


## Introduction

The idea behind support vector machines is to separate points belonging to different classes with a hyperplane, that is, a vector subspace of dimension $d-1$, where $d$ is the dimension of the space containing the data (in other words, the number of explanatory variables). This hyperplane is chosen to stand as far as possible from all the points to be classified, so that the area between the two classes that contains no data points, called the margin, is as large as possible, as shown in the figure below:


![intr_svm](https://drive.google.com/uc?export=view&id=1_HvXcNAQp0TSWncMStK19qQ22wi7-eX3)

Note that the red line in the figure on the right is a hyperplane of the two-dimensional observation space (i.e. a straight line), whereas the red line in the figure on the left is not a hyperplane of the observation space. We will see later that Support Vector Machines are able to separate the data in a non-linear way thanks to a technique known as the kernel trick (kernel-based estimators).



### What are Support Vector Machines (SVM)?

This term refers to a supervised machine learning model used primarily for classification, although regression versions also exist and are available in scikit-learn. In the SVM algorithm, observations are represented as a scatter plot in a $d$-dimensional space, where $d$ is the number of explanatory variables used for modeling. The coordinates of an observation in this space are the values taken by each explanatory variable for that observation. From there, we perform the classification by finding a hyperplane that separates the two classes.



### Hyperplane
#### Vector space

A vector space built on the set of real numbers $\mathbb{R}$ is a set $E$ whose elements are called vectors. It is equipped with two laws:

* A law of internal composition $+$ (think addition for simplicity), called the vector sum: for every pair of elements of $E$, their sum still belongs to $E$.

$\forall{x,x'}\in{E},\ x+x'\in{E}$

* A law of external composition (multiplication by a scalar): you can multiply any element of $E$ by any real number and the result will still be in $E$.

$\forall{a}\in\mathbb{R},\ \forall{x}\in{E},\ ax\in{E}$

The only vector spaces we will see in this course are built on $\mathbb{R}$. For example, $\mathbb{R}^2$ is the vector space of vectors with two real components, such as $[2.5, 5.77]$ or $[0, 3.6]$. More generally, the vector spaces we are dealing with are of the form $\mathbb{R}^d$, whose vectors have $d$ real components, just like the observations of a dataset with $d$ numerical explanatory variables.



#### Hyperplane

Hyperplanes are vector subspaces of dimension $d - 1$, where $d$ is the dimension of the original vector space. The name comes from the fact that in $\mathbb{R}^3$, the real three-dimensional vector space, hyperplanes are two-dimensional vector spaces, in other words planes.

A hyperplane of $\mathbb{R}^d$ is a vector space $H$ included in $\mathbb{R}^d$, meaning that all the elements of $H$ are in $\mathbb{R}^d$ ($\forall{h}\in{H},\ h\in\mathbb{R}^d$), and $H$ is of dimension $d - 1$.

One notion helps to understand hyperplanes: a hyperplane can be defined through its supplementary subspace of dimension one, which is a line. This line $D$ has a base vector $\overrightarrow{v}$ that gives its direction in $\mathbb{R}^d$. The hyperplane supplementary to $D$ is the set of vectors $\overrightarrow{x}$ of $\mathbb{R}^d$ whose scalar product with $\overrightarrow{v}$ is zero, in other words the vectors orthogonal to $\overrightarrow{v}$. This is called the dual representation of a hyperplane:

$H=\{\overrightarrow{x}\in{\mathbb{R}^d},\ \overrightarrow{x}\cdot\overrightarrow{v}=0\}$

These notions of supplementary subspace and base vector are important to understand how Support Vector Machines work.
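The dual representation can be checked numerically. The sketch below is only an illustration: the base vector `v` and the test vectors are hypothetical values chosen so that the arithmetic is easy to follow, and `in_hyperplane` is a helper introduced here, not a library function.

```python
import numpy as np

# Hypothetical base vector of the line D in R^3 (illustrative values).
v = np.array([1.0, 2.0, -1.0])

def in_hyperplane(x, v, tol=1e-9):
    """Return True when the scalar product of x and v is (numerically) zero."""
    return bool(abs(np.dot(x, v)) < tol)

x_a = np.array([1.0, 0.0, 1.0])    # 1*1 + 0*2 + 1*(-1) = 0, so x_a is in H
x_b = np.array([2.0, -1.0, 0.0])   # 2*1 + (-1)*2 + 0*(-1) = 0, so x_b is in H
x_out = np.array([1.0, 1.0, 1.0])  # 1 + 2 - 1 = 2, so x_out is not in H

print(in_hyperplane(x_a, v), in_hyperplane(x_b, v), in_hyperplane(x_out, v))

# H is a vector space: sums and scalar multiples of its elements stay in H.
print(in_hyperplane(x_a + x_b, v), in_hyperplane(3.5 * x_a, v))

# For a vector outside H, the sign of the scalar product tells on which
# side of the hyperplane it lies, which is what a linear classifier uses.
side = np.sign(np.dot(x_out, v))
print(side)
```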

#### Support Vector Machine: graphic representation

To summarize, SVMs are a way of separating data belonging to two distinct classes with a hyperplane in the observation space (the space defined by the explanatory variables), i.e. a linear classifier that separates two groups of observations. What differentiates SVM from other linear classifiers such as the two examples shown below?


![svm2](https://drive.google.com/uc?export=view&id=1tC7kyx0Cuku9v1vDHndgwo3BcYjINwCE)


In your opinion, which of these two lines best separates the data? If you think it is the yellow line, you are right, because it is the line (hyperplane) that the SVM algorithm builds. Let's go into more detail to explain why.

The green line in the example is very close to the class of red observations, although it does separate the data from each class. Imagine we now look at a new set of observations from the blue and red classes: it is likely that some red dots will find themselves on the wrong side of the green line. This means that the classification model defined by the green line is likely to generalize poorly to new observations, and generalizing well is generally the main objective of supervised machine learning.

Why then is the yellow line better suited to our requirements? Here comes the notion of support vectors: these are the observations that are closest to the separation line. The distance between the support vectors and the separation line is called the margin. The purpose of Support Vector Machines is to maximize the margin, as shown in the example below:



![svm3](https://drive.google.com/uc?export=view&id=1QLcdwSQ8V1lxhfRpbxg6oaaanbgOPyD4)


This separation hyperplane is optimal in terms of generalization because it is the hyperplane that is the farthest from both the red and the blue classes. If both classes are really limited to these areas, then the risk of a red or blue observation ending up on the wrong side of the line is the lowest possible given the information at our disposal.
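The notions of support vectors and margin can be inspected on a fitted model. The sketch below uses a tiny dataset invented for this illustration, and a very large `C` as an assumption to approximate the perfectly separating case described above. For a linear SVM in scikit-learn, the distance between the hyperplane and the support vectors is $1/\lVert w \rVert$, where $w$ is stored in `coef_`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (invented for this illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for training errors,
# which approximates the maximum-margin separator of separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                      # vector orthogonal to the separating hyperplane
margin = 1.0 / np.linalg.norm(w)      # distance from the hyperplane to the support vectors

print("support vectors:\n", clf.support_vectors_)
print("margin:", margin)              # roughly 1.77 on this toy dataset
```

Only the observations closest to the boundary appear in `support_vectors_`; moving any other point slightly would leave the fitted hyperplane unchanged.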

This is a very simple example in which the data are linearly separable, since a hyperplane can be drawn that separates them perfectly. This is not always the case in practice. Let's now learn how SVMs are able to deal with data that are not linearly separable.



### Non-linear SVM (Kernel-based estimators)

#### Polynomial kernel

To produce non-linear classifiers from a linear model, we do not work on the data as is to produce the estimates. We first transform them using a function called a kernel. This function aims to take the observations into a space of higher dimension, in which we hope that the observations will be linearly separable.

For example, if we observe a group of cows and know their weight and height, we can produce a new dimension with a polynomial transformation of these two variables, for example $z=(weight \times height)^p$, where $p \geq 2$ is the order of the chosen polynomial. This new dimension is not a linear combination of the explanatory variables and thus increases the dimension of the observation space from two to three. Strictly speaking, the polynomial kernel function $k$ is computed between two observations $\overrightarrow{x}$ and $\overrightarrow{x'}$, for instance $k(\overrightarrow{x},\overrightarrow{x'})=(\overrightarrow{x}\cdot\overrightarrow{x'})^p$, and gives their scalar product in the enlarged space without building that space explicitly. A linear separator can be computed as an SVM in this new space, and once we have the equation of the separating hyperplane in this space, we can also represent it in the original space. However, it will not be represented by a straight line but by a non-linear separator, as in the figure below:


![](https://drive.google.com/uc?export=view&id=1_HvXcNAQp0TSWncMStK19qQ22wi7-eX3)
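To make the link between a kernel and a larger space concrete, here is a small numerical check. It assumes one common form of the polynomial kernel, $k(x, x') = (x \cdot x')^p$ with $p = 2$, which the course does not spell out in this form. The function `phi` is a hypothetical explicit mapping written for this illustration: it replaces the two original coordinates with three polynomial ones, so the kernel value computed in two dimensions equals a plain scalar product in three dimensions.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 mapping of a 2-D observation (x1, x2) into 3-D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b, p=2):
    """Polynomial kernel (a . b)^p, computed in the original 2-D space."""
    return np.dot(a, b) ** p

# Two hypothetical observations (e.g. standardized weight and height).
a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

kernel_value = poly_kernel(a, b)          # (1*3 + 2*0.5)^2 = 4^2 = 16
explicit_value = np.dot(phi(a), phi(b))   # 9 + 6 + 1 = 16

print(kernel_value, explicit_value)
```

This is why kernels are convenient: the SVM only needs scalar products between observations, so it can work in the larger space without ever computing the new coordinates.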

#### Radial Kernel

In the example below, it is not possible to separate the data with a hyperplane (here, a line):


![svm4](https://drive.google.com/uc?export=view&id=1xWz8mJS9YQh1hG6dJAlhul-klPAZZIjN)


However, this problem can be transformed into a linearly separable problem in a larger space. If we add a third dimension $z=x^2+y^2$, which represents the squared distance from a point to the origin of the orthonormal coordinate system, and view the data on the axes $x$ and $z$, we obtain the following figure:


![svm5](https://drive.google.com/uc?export=view&id=1qAHLPe_CxlJr7AsGu6xdOk5UohJtSg-x)


In this space, it is very simple to separate the data with a hyperplane, for example one of equation $z = k$ for a suitable constant $k$. When we represent this hyperplane in the original space, we obtain a non-linear separator:



![svm6](https://drive.google.com/uc?export=view&id=1YLQm6N4rHigrUphQOTudHrHB4vejXopX)
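The same idea can be reproduced in a few lines. This is a sketch under assumptions: the concentric data come from scikit-learn's `make_circles` generator rather than from the figures above, and the names `X_lifted`, `acc_2d` and `acc_3d` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic concentric classes, similar in spirit to the figures above.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension z = x^2 + y^2 (squared distance to the origin).
z = X[:, 0] ** 2 + X[:, 1] ** 2
X_lifted = np.column_stack([X, z])

# A linear SVM struggles in the original 2-D space...
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)
# ...but separates the classes easily once z is available.
acc_3d = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

print("training accuracy in 2-D:", acc_2d)
print("training accuracy in 3-D:", acc_3d)
```

In practice the radial kernel (`kernel="rbf"`) lets the SVM perform this kind of separation without creating the extra column by hand.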


This example gives the impression that using kernels is very simple. In practice, however, it is quite difficult to find the right kernel, the one that lets a linear classifier separate the data easily. That said, we can always make the problem linearly separable by adding a certain number of dimensions. However, remember the golden rule of statistics: simplicity is preferable to performance. The more dimensions you add to the original space, the more you risk overfitting and getting an unstable SVM model, i.e. one that will not generalize well to unknown data.



#### SVM in Python

To import, define and estimate the parameters of the SVM model in Python, just use the three commands below. The names of the objects ```clf```, $X$ and $y$ should be replaced with your own local variables.

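```python
from sklearn.svm import SVC

# X: explanatory variables, shape (n_samples, n_features); y: class labels
clf = SVC(kernel='linear')  # define the model
clf.fit(X, y)               # estimate its parameters (fit returns the model itself)
```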
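For a version that runs end to end, the sketch below adds data and an evaluation step around those commands. The dataset is synthetic (generated with `make_blobs`), and the split proportions and random seeds are arbitrary choices made for this illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class dataset with two explanatory variables.
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

# Keep part of the data aside to measure how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)   # share of correct predictions
predictions = clf.predict(X_test[:5])       # predicted classes for 5 new points

print("test accuracy:", test_accuracy)
print("first predictions:", predictions)
```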

### Fight overfitting

For SVM models, as for any model with naturally high variance and low bias, it is essential to find a good bias-variance trade-off in order to avoid overfitting. Two main penalization parameters are typically used with SVM in order to limit overfitting.

#### C penalty

SVM uses the C parameter to penalize the errors committed by the model on the training data. When the data are not linearly separable even after using a kernel, some errors are unavoidable. The C parameter influences the model in the following way: larger values of C force the model to prioritize classifying training examples correctly over finding a smooth decision boundary with a large margin. Conversely, smaller values of C lead the model to prioritize smoother decision boundaries, even at the cost of training errors.
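The effect of C can be observed on overlapping classes. The sketch below is illustrative only: the data are synthetic, the two values of `C` are arbitrary, and the margin is computed as $1/\lVert w \rVert$, which applies to a linear kernel.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic classes, so some training errors are unavoidable.
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.0, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])   # distance hyperplane to margin edge
    n_support = len(clf.support_)                 # number of support vectors
    results[C] = (margin, n_support, clf.score(X, y))
    print(f"C={C}: margin={margin:.2f}, support vectors={n_support}, "
          f"train accuracy={results[C][2]:.2f}")
```

A small C yields a wide margin that many observations fall inside, while a large C shrinks the margin to reduce training errors.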

#### Gamma parameter

When using a radial kernel (```kernel="rbf"```) in Python, the influence of each observation on the construction of the decision boundary can be interpreted as a circular area of influence centered on the observation. The size of this area is controlled by the ```gamma``` parameter, which determines how far the influence of each observation reaches when building the decision boundary. High values of ```gamma``` result in very small areas of influence. Consequently, the decision boundary is free to fit the data as closely as it can, because it is not constrained by an observation's influence unless it gets really close to it. Conversely, when ```gamma``` is low (close to zero), the area of influence of each observation is very large, causing the model to draw its decision boundary where the influences of different observations average out, which results in a much smoother model.
In other words, high values of gamma result in high-variance, low-bias models, and low gamma values cause the model to have higher bias and lower variance.
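This area of influence can be read directly from the radial kernel, which scikit-learn computes as $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below evaluates it for two hypothetical points and three arbitrary values of `gamma`: a value close to 1 means the two observations still influence each other, a value close to 0 means they do not.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical observations whose squared distance is 1^2 + 1^2 = 2.
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

values = {}
for gamma in (0.01, 1.0, 100.0):
    # rbf_kernel returns a matrix of kernel values; here it has a single entry.
    values[gamma] = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(f"gamma={gamma}: k(a, b)={values[gamma]:.3g}")
```

With a low `gamma` the kernel stays close to 1 even for distant points (large area of influence), while with a high `gamma` it drops to almost 0 as soon as the points are apart.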


#### Influence of C and gamma
The figure below shows the decision boundaries obtained on the training set with an SVM model with a radial kernel, for different values of `C` and `gamma`:

![SVM_C_gamma](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/SVM_C_Gamma.png)

It seems that `gamma` mostly drives how closely the model's predictions fit the data, while `C` influences how many observations are misclassified in the training set. A higher `C` lets fewer points be misclassified, and a higher `gamma` lets the model draw more and more complicated decision boundaries and a steeper decision function.

The behavior of the model is very sensitive to the `gamma` parameter. If
`gamma` is too large, the radius of the area of influence of the support
vectors only includes the support vector itself and no amount of
regularization with `C` will be able to prevent overfitting.

When `gamma` is very small, the model is too constrained and cannot capture
the complexity or "shape" of the data. The region of influence of any selected
support vector would include the whole training set. The resulting model will
behave similarly to a linear model with a set of hyperplanes that separate the
centers of high density of any pair of two classes.
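Because the two parameters interact, they are usually tuned together with cross-validation. The sketch below is one possible way to do it with scikit-learn's `GridSearchCV`. The dataset (`make_moons`), the grid of values and the number of folds are assumptions made for this illustration, not recommendations from the course.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, non-linearly separable data.
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Candidate values on a logarithmic scale (arbitrary illustrative grid).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

# Each (C, gamma) pair is scored by 5-fold cross-validation,
# so the selection relies on data not used for fitting.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Comparing `search.best_score_` with the training accuracy of the same model is a quick way to spot a `gamma` that is too large.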

## Resources 📚📚
* [Support vector machines in scikit-learn](https://scikit-learn.org/stable/modules/svm.html)
* [SVM classifier tutorial](https://www.kaggle.com/code/prashant111/svm-classifier-tutorial)