# Support Vector Machines

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

+ Effective in high dimensional spaces.
+ Still effective in cases where the number of dimensions is greater than the number of samples.
+ Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
+ Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

+ If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.
+ SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
+ SVMs do not perform very well when the data set is noisy, i.e. when the target classes overlap.

### Why do we need SVC when we have Logistic Regression?

Logistic regression for classification separates the instances into two classes. However, when the classes are separable there are infinitely many decision boundaries that do so, and logistic regression does not choose among them based on how close the instances lie to the boundary. The boundary it picks may therefore pass close to some instances, which makes us less confident in predictions near it. A better boundary maximizes the distance between itself and the nearest instances, i.e. it maximizes the margin. That is what the SVM algorithm does. Moreover, if the true decision boundary is non-linear, plain logistic regression, which is a linear classifier, cannot find it.

![](./fig/svm.png)

### Support Vector, Hyperplane and Margin

The data points closest to the hyperplane are known as support vectors, because only these points contribute to the result of the algorithm; the other points do not. If a data point is not a support vector, removing it has no effect on the model. On the other hand, deleting a support vector will change the position of the hyperplane.

The dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

The distance from the hyperplane to the closest points of either class is called the margin. We would like to choose the hyperplane that maximises the margin between the classes.

When the support vectors of both classes are at the same distance from the maximum-margin hyperplane, it is called a good margin; in other cases it is called a bad margin.

![](./fig/margin.png)
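As a rough illustration of why the centred hyperplane is preferred, the sketch below computes the margin of two candidate separating lines. The points, the two lines and the helper names (`signed_distance`, `margin`) are invented for this example and are not part of any library.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points):
    """Smallest absolute distance from any point to the hyperplane."""
    return min(abs(signed_distance(w, b, x)) for x in points)

# Hypothetical 2-D data: one class has x1 <= 2, the other has x1 >= 3.
points = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 4.0)]

# Two candidate separating lines, each written as (w, b) for w.x + b = 0.
centered = ((1.0, 0.0), -2.5)    # the line x1 = 2.5
off_center = ((1.0, 0.0), -2.9)  # the line x1 = 2.9

margin_centered = margin(*centered, points)
margin_off_center = margin(*off_center, points)
print(margin_centered, margin_off_center)
```

Both lines separate the two classes, but the centred one keeps the closest points of each class equally far away and therefore has the larger margin.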

The maximum margin classifier is very sensitive to outliers in the data. Allowing some misclassification in exchange for a more robust boundary is an example of the bias-variance tradeoff, and that is why we have soft margins. To compare soft margins we use cross-validation: we allow some observations to fall inside the margin or on the wrong side of it, and observe how well the resulting classifier does on validation or test data. The soft margin is controlled by a tuning parameter, called simply C, that defines how much violation of the margin is allowed in total. If C = 0, no violation is allowed and we have the simple maximal-margin classifier. The larger the value of C, the more violations of the margin are permitted. Note that this describes C as a budget for violations; scikit-learn's `SVC` uses the inverse convention, where `C` is the penalty on violations, so a larger `C` permits fewer of them.
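The budget view of C described above can be sketched as follows. This is a minimal illustration, assuming a unit-length weight vector `w`, a fixed margin width `M` and a handful of invented 1-D points; `slacks` and `within_budget` are hypothetical helpers. A point's slack is 0 when it respects the margin, between 0 and 1 when it sits inside the margin, and above 1 when it is on the wrong side of the hyperplane.

```python
def slacks(w, b, M, X, y):
    """Slack of each point relative to a margin of width M.

    Assumes w has unit length, so y_i * (w.x_i + b) is the signed
    distance of x_i from the hyperplane (positive on the correct side).
    """
    out = []
    for xi, yi in zip(X, y):
        signed_dist = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        out.append(max(0.0, 1.0 - signed_dist / M))
    return out

def within_budget(slack_values, C):
    """True if the total margin violation does not exceed the budget C."""
    return sum(slack_values) <= C

# Invented 1-D data; the hyperplane is the point x = 3 with margin width 1.
X = [(0.0,), (1.0,), (2.5,), (3.5,), (4.0,), (5.0,)]
y = [-1, -1, -1, -1, 1, 1]
w, b, M = (1.0,), -3.0, 1.0

slack_values = slacks(w, b, M, X, y)
print(slack_values)                      # per-point violations
print(within_budget(slack_values, 0.0))  # hard margin: no violation allowed
print(within_budget(slack_values, 2.0))  # soft margin with a budget of 2
```

Here the point at 2.5 is inside the margin and the point at 3.5 is misclassified, so this boundary is only admissible when the budget is large enough to cover both violations.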

We have seen the support vector classifier for data that is linearly separable. But what if our data is not linearly separable? This is the time to introduce **support vector machines**. Suppose our data looks like this.

![](./fig/nl.png)


### Calculating High-Dimensional Relationships

Suppose we add a y-axis whose value is the dosage squared. If we plot the data in these two dimensions, it looks like the figure below, and we can see that a classifier can now separate our data.

![](./fig/hd.png)
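To make the idea concrete, here is a small sketch with invented dosage values in which the treatment only works in a middle range. On the dosage axis alone no single cut-off separates the classes, but after adding dosage squared as a second coordinate a straight line does. The data, the line coefficients and the function names are assumptions chosen for illustration.

```python
# Hypothetical dosages: label 1 (works) only for the middle range.
dosages = [1, 2, 3, 9, 10, 11, 18, 19, 20]
labels = [0, 0, 0, 1, 1, 1, 0, 0, 0]

def threshold_separable(xs, ys):
    """True if a single cut-off on the 1-D axis splits the two classes."""
    ordered = [label for _, label in sorted(zip(xs, ys))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

def lift(x):
    """Map a dosage to the 2-D point (dosage, dosage squared)."""
    return (x, x * x)

# A straight line in the lifted space: w . (x, x^2) + b = 0.
# It is the chord of the parabola between x = 6 and x = 15.
w, b = (-21.0, 1.0), 90.0

def predict(x):
    f1, f2 = lift(x)
    return 1 if w[0] * f1 + w[1] * f2 + b < 0 else 0

print(threshold_separable(dosages, labels))  # one cut-off is not enough
print([predict(x) for x in dosages])         # the lifted line recovers the labels
```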

Now we might ask: why choose dosage squared? Why not the cube, or any other function? Support vector machines use something called a **kernel** in order to find support vector classifiers in **higher dimensions**. In the example above we used a **polynomial kernel**, which has a parameter $d$ that stands for the degree of the polynomial. When $d = 1$, the polynomial kernel computes the relationships between each pair of observations in one dimension, and these relationships are used to find a support vector classifier. When $d = 2$, we get two-dimensional relationships between each pair of observations, and those relationships are used to find a support vector classifier in two dimensions. We can try increasing values of $d$, finding a support vector classifier for each, and pick a good value of $d$ using cross-validation.

Another commonly used kernel is the **radial kernel**, also known as the **RBF kernel (radial basis function)**. The RBF kernel finds support vector classifiers in infinite dimensions, so it is hard for us to visualize. However, when classifying a new observation it behaves like a weighted nearest-neighbour model. In other words, the closest observations have more influence on the classification than observations that are far away from the data point.

Note: Kernel functions only calculate the relationships between data points as if they were in higher dimensions. They do not actually do the transformation. This trick of calculating the high-dimensional relationships between the data points without actually transforming the data is also referred to as the **kernel trick**.

**Polynomial Kernel**

Polynomial kernel = $(a \times b + r)^d$, where $a$ and $b$ refer to two different observations in the data set, $r$ determines the coefficient of the polynomial, and $d$ sets the degree of the polynomial. Let's set $r = 1/2$ and $d = 2$.

$(a \times b + 1/2)^2 = (a \times b + 1/2)(a \times b + 1/2) = ab + a^2b^2 + 1/4 = (a, a^2, 1/2) \cdot (b, b^2, 1/2)$

Now we have the dot product of two vectors. The first terms are the x-coordinates, the second terms are the y-coordinates, and the third terms are the z-coordinates. Alternatively, if we look at

$(a \times b + 1)^2 = (a \times b + 1)(a \times b + 1) = 2ab + a^2b^2 + 1 = (\sqrt{2}a, a^2, 1) \cdot (\sqrt{2}b, b^2, 1)$

The new x-axis coordinates are scaled by a factor of $\sqrt{2}$ and the y-axis coordinates are the squared values. We can ignore the z-axis coordinate in both of the above cases, since it is a constant. We see that to calculate the high-dimensional relationship, all we have to do is evaluate the kernel, which equals the dot product of the two transformed points. We need not transform the data: just plug in the values and we get the relationship.
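The two identities above can be checked numerically. The following sketch, with arbitrarily chosen observations `a` and `b` and hypothetical helper names, compares the kernel value with the dot product of the explicitly transformed points.

```python
import math

def poly_kernel(a, b, r, d):
    """Polynomial kernel (a*b + r)^d for two scalar observations."""
    return (a * b + r) ** d

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi_half(x):
    """Explicit coordinates matching r = 1/2, d = 2."""
    return (x, x * x, 0.5)

def phi_one(x):
    """Explicit coordinates matching r = 1, d = 2."""
    return (math.sqrt(2) * x, x * x, 1.0)

a, b = 3.0, 5.0  # two arbitrary observations

k_half = poly_kernel(a, b, r=0.5, d=2)
explicit_half = dot(phi_half(a), phi_half(b))

k_one = poly_kernel(a, b, r=1.0, d=2)
explicit_one = dot(phi_one(a), phi_one(b))

print(k_half, explicit_half)
print(k_one, explicit_one)
```

In both cases the kernel needs one multiplication, one addition and one square, while the explicit route first builds the three coordinates for each observation. The gap grows quickly with the degree and with the number of original features.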

**Radial Kernel**

Radial kernel = $e^{-\gamma(a-b)^2}$, where $a$ and $b$ refer to two different observations in the data set and $(a-b)^2$ is the squared distance between them. $\gamma$, which is determined by cross-validation, scales the squared distance and thus scales the influence. The kernel equals the dot product of two infinite-dimensional vectors, which can be seen using a Taylor series expansion.
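A small sketch of both claims, under the assumption of scalar observations: the kernel value shrinks as observations move apart, and a truncated version of the infinite feature vector already reproduces the kernel closely. The truncation length of 30, the sample values and the helper names are choices made for this example. The coordinates come from expanding $e^{2\gamma ab}$ as a Taylor series.

```python
import math

def rbf_kernel(a, b, gamma):
    """Radial kernel for two scalar observations."""
    return math.exp(-gamma * (a - b) ** 2)

def rbf_features(x, gamma, n_terms):
    """First n_terms coordinates of the infinite feature vector for x.

    Uses exp(-g(a-b)^2) = exp(-g a^2) exp(-g b^2) sum_n (2g)^n (ab)^n / n!.
    """
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2 * gamma) ** n / math.factorial(n)) * x ** n
            for n in range(n_terms)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b, gamma = 1.0, 1.5, 0.5
exact = rbf_kernel(a, b, gamma)
approx = dot(rbf_features(a, gamma, 30), rbf_features(b, gamma, 30))
print(exact, approx)

# Influence decays with distance: a near neighbour counts far more.
near = rbf_kernel(0.0, 0.5, gamma)
far = rbf_kernel(0.0, 3.0, gamma)
print(near, far)
```

Increasing `gamma` makes the decay steeper, so only very close observations keep any noticeable influence.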

### Hinge Loss

The hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).

![](./fig/hinge.png)

The x-axis represents the distance of any single instance from the boundary, and the y-axis represents the loss size, or penalty, that the instance will incur depending on that distance.

There are a few things to note here:

+ The dotted line on the x-axis marks the value 1. This means that when an instance’s distance from the boundary is greater than or equal to 1, our loss size is 0. Keep this in mind, as it will really help in understanding the maths of the function.

+ If the distance from the boundary is 0 (meaning that the instance is literally on the boundary), then we incur a loss size of 1.

+ We see that correctly classified points will have a small loss (or none), while incorrectly classified instances will have a high loss.

+ A negative distance from the boundary incurs a high hinge loss. This essentially means that we are on the wrong side of the boundary, and that the instance will be classified incorrectly.

+ On the flip side, a positive distance from the boundary incurs a low hinge loss, or no hinge loss at all, and the further we are from the boundary (on the right side of it), the lower our hinge loss will be.
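The observations in this list follow directly from the usual form of the binary hinge loss, $\max(0, 1 - \text{distance})$. A minimal sketch, where `hinge_loss` is a name chosen for this example and the distance is taken as positive on the correct side of the boundary:

```python
def hinge_loss(distance):
    """Hinge loss for a signed distance from the boundary.

    The distance is positive on the correct side of the boundary
    and negative on the wrong side.
    """
    return max(0.0, 1.0 - distance)

for d in (-2.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"distance={d:+.1f}  loss={hinge_loss(d):.1f}")
```

The loss is 1 exactly on the boundary, grows linearly on the wrong side, and is 0 once the distance reaches 1.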

Let's look at a multi-class classification example to understand hinge loss. We have three images and their predicted scores, say from a neural network or an SVM.

![](./fig/multiclass.jpg)

**Hinge loss function** is given by
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

To calculate the loss for the $i^{th}$ image, whose correct category is $y_i$, we take the score $s_j$ of every other category $j$, subtract the score $s_{y_i}$ of the correct category, add 1, and take the maximum of 0 and the result. We then sum these terms over all the other categories.

**Calculating Loss for Image-1** 

$= max(0, (1.49 - (-0.39) + 1)) + max(0,(4.21 - (-0.39) + 1))$  
$= max(0, 2.88) + max(0, 5.6)$  
$= 2.88 + 5.6 = 8.48$  


**Calculating Loss for Image-2** 

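Using the Image-2 scores from the figure, where cat is the correct category and its score exceeds each of the other two scores by at least 1:

$= max(0, (s_{dog} - s_{cat} + 1)) + max(0, (s_{horse} - s_{cat} + 1))$  
$= 0 + 0 = 0$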

**Calculating Loss for Image-3** 

$= max(0, (1.03 - (-2.27) + 1)) + max(0,(-2.37 - (-2.27) + 1))$  
$= max(0, 4.3) + max(0, 0.9)$  
$= 4.3 + 0.9 = 5.2$  
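The same arithmetic can be written as a short function. This is a sketch under stated assumptions: `multiclass_hinge_loss` is a hypothetical helper, and the position of each score within the lists is arbitrary, since only the index of the correct category matters. The scores are the ones used in the Image-1 and Image-3 calculations above.

```python
def multiclass_hinge_loss(scores, true_index, margin=1.0):
    """Sum of max(0, s_j - s_true + margin) over all wrong categories j."""
    true_score = scores[true_index]
    return sum(max(0.0, s - true_score + margin)
               for j, s in enumerate(scores) if j != true_index)

# Image 1: the correct category has score -0.39.
image_1 = multiclass_hinge_loss([-0.39, 1.49, 4.21], true_index=0)

# Image 3: the correct category has score -2.27.
image_3 = multiclass_hinge_loss([1.03, -2.37, -2.27], true_index=2)

print(f"{image_1:.2f} {image_3:.2f}")  # prints: 8.48 5.20
```

Whenever the correct category's score beats every other score by at least the margin of 1, every term is clipped to 0 and the loss vanishes.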

We see that the model wrongly classifies the dog as a horse in image 1, so the loss is high (8.48), whereas in image 2 it correctly gives cat the highest score and the loss turns out to be zero. Similarly, in image 3 the loss is high because of the wrong prediction.