# Support Vector Machines

## An optimization problem:
Let a hyperplane be given by $W^{T}x + b = 0$   
Then its distance from the origin is given by: $\frac{|b|}{||W||}$   
And the distance of any general point $\textbf{x}$ from the hyperplane is given by: $\frac{|W^{T}x + b|}{||W||}$
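
As a quick numerical check of the two distance formulas, here is a minimal NumPy sketch. The function name `distance_to_hyperplane` and the example values of `W` and `b` are illustrative choices, not part of the derivation.

```python
import numpy as np

def distance_to_hyperplane(W, b, x):
    """Distance from the point x to the hyperplane W^T x + b = 0."""
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.abs(x @ W + b) / np.linalg.norm(W)

# Illustrative hyperplane: 3*x1 + 4*x2 - 5 = 0, so ||W|| = 5
W = np.array([3.0, 4.0])
b = -5.0

# Distance of the origin: |b| / ||W|| = 5 / 5
origin_distance = distance_to_hyperplane(W, b, np.zeros(2))

# Distance of the point (3, 4): |9 + 16 - 5| / 5
point_distance = distance_to_hyperplane(W, b, np.array([3.0, 4.0]))

print(origin_distance, point_distance)
```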
  
Let $y$ be the vector of class labels, with $y_i$ the label of the data point $x_i$.   
Let $y_i \in \{1, -1\}$ denote the two classes (let's assume binary classification for now) 
  
Then, for a perfect classifier, every point must lie strictly on the correct side of the hyperplane:
$$y_i\cdot(W^{T}x_{i} + b) > 0 \ \  \forall \ \ i \in \{1,\dots,m\}$$
This is the 'constraint' for our optimization problem.
  
For an even better classifier, we also want the decision boundary to lie as far as possible from the data points. The distance from the boundary to the closest data point is called the geometric margin, and we want to maximize it over $W$ and $b$.
  
$$\max_{W, b}\{ \min_{i = 1\dots m} \frac{|W^{T}x_i + b|}{||W||} \}$$

Thus, this is the 'objective' of our optimization problem.
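
To make the objective and the constraint concrete, the following sketch evaluates the geometric margin of two candidate hyperplanes on a small toy dataset. The dataset, the candidate values of `W` and `b`, and the helper name `geometric_margin` are all assumptions made for illustration.

```python
import numpy as np

def geometric_margin(W, b, X, y):
    """Smallest distance from any row of X to the hyperplane W^T x + b = 0.

    Raises if (W, b) violates the perfect-classifier constraint.
    """
    scores = X @ W + b
    if np.any(y * scores <= 0):
        raise ValueError("(W, b) does not classify every point correctly")
    return np.min(np.abs(scores)) / np.linalg.norm(W)

# Toy, linearly separable data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Two candidate separating hyperplanes
margin_a = geometric_margin(np.array([1.0, 1.0]), -2.0, X, y)
margin_b = geometric_margin(np.array([1.0, 0.0]), -1.0, X, y)

# The objective prefers the candidate with the larger margin
print(margin_a, margin_b)
```

Both candidates satisfy the constraint, but the first one has the larger margin, so the objective prefers it.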


## A few assumptions:
Since the class labels are $1$ or $-1$, we have $|y_i| = 1$, which gives the following equality:   
  
$$\frac{|W^{T}x_i + b|}{||W||} = \frac{|y_i\cdot (W^{T}x_i + b)|}{||W||}$$
  
Now, let's assume a perfect classifier that classifies all the data points correctly.  
As a direct consequence of the constraint condition, we have   
   
$$|y_i\cdot (W^{T}x_i + b)| = y_i\cdot (W^{T}x_i + b)$$
   
Thus, the optimization condition now becomes   
  
$$\text{max}\{ \min_{i = 1\dots m} \frac{y_i\cdot (W^{T}x_i + b)}{||W||} \}$$
  
For a given $W$ and $b$, let the minimum distance be achieved at $i = i_0$. The optimization condition then becomes:  
  
$$\text{max}\{ \frac{y_{i_0}\cdot (W^{T}x_{i_0} + b)}{||W||} \}$$
  
The 'functional margin' is defined by:  $\text{geometric margin}\cdot ||W|| = y_{i_0}\cdot (W^{T}x_{i_0} + b)$  
  
Let   
$$\epsilon := y_{i_0}\cdot (W^{T}x_{i_0} + b)$$  
$$\tilde W := \frac{W}{\epsilon}$$  
$$\tilde b := \frac{b}{\epsilon}$$
  
Since $i_0$ attains the minimum, we have that $y_i\cdot (W^{T}x_i + b) \geq y_{i_0}\cdot (W^{T}x_{i_0} + b) = \epsilon$ for all $i$.  
  
Therefore, dividing both sides by $\epsilon > 0$, we get $y_i\cdot (\tilde{W}^{T}x_i + \tilde{b}) \geq 1$   
  
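Which in turn implies that $\frac{y_i\cdot (\tilde{W}^{T}x_i + \tilde{b})}{||\tilde W||} \geq \frac{1}{||\tilde W||}$, with equality at $i = i_0$. Rescaling by $\epsilon$ does not move the hyperplane, so the geometric margin is exactly $\frac{1}{||\tilde W||}$. 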
  
Maximizing $\frac{1}{||\tilde W||}$ is equivalent to minimizing $||\tilde W||$, and hence $||\tilde W||^2$. Thus the new, simpler objective and constraint are: 
$$\max{\frac{1}{||\tilde W||}} \iff \min{||\tilde W||} \iff \min{ ||\tilde W||^2}$$
  
$$ y_i\cdot (\tilde{W}^{T}x_i + \tilde{b}) \geq 1 $$
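
The rescaling step can be checked numerically. The sketch below divides an arbitrary separating $(W, b)$ by its functional margin $\epsilon$ and confirms that the smallest value of $y_i\cdot (\tilde{W}^{T}x_i + \tilde{b})$ becomes $1$ while the geometric margin is unchanged. The toy data and the name `canonical_form` are illustrative assumptions.

```python
import numpy as np

def canonical_form(W, b, X, y):
    """Rescale (W, b) so that min_i y_i * (W^T x_i + b) equals 1."""
    eps = np.min(y * (X @ W + b))  # functional margin
    if eps <= 0:
        raise ValueError("(W, b) does not classify every point correctly")
    return W / eps, b / eps

# Toy, linearly separable data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# An arbitrary, unnormalized separating hyperplane
W = np.array([2.0, 2.0])
b = -4.0

W_t, b_t = canonical_form(W, b, X, y)

# Geometric margin before rescaling, and 1 / ||W~|| after rescaling
margin_before = np.min(y * (X @ W + b)) / np.linalg.norm(W)
margin_after = 1.0 / np.linalg.norm(W_t)

# Smallest constraint value after rescaling
min_constraint = np.min(y * (X @ W_t + b_t))
print(min_constraint, margin_before, margin_after)
```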

## The C-SVM classifier:
  
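For linearly separable cases, where we assume a perfect classifier exists, we have (the factor $\frac{1}{2}$ is a convenience and does not change the minimizer):
$$\min_{\tilde W, \tilde b} \ \frac{1}{2}||\tilde W||^2 \ \  \text{s.t.} \ \  y_i\cdot (\tilde{W}^{T}x_i + \tilde{b}) \geq 1 \ \  \forall \ \ i$$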
   
If no linear classifier can perfectly classify the points, we instead penalize violations of the constraint and find a model using:  
$$\min_{\tilde W, \tilde b} \ \frac{1}{2}||\tilde W||^2 + C\sum_{i=1}^{m} L_{hinge}(\tilde W^{T}x_i + \tilde b, y_i)$$
  
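Where $L_{hinge}(z, y) = \max(0, 1 - y\cdot z)$ is the hinge cost function. It is zero exactly when the constraint $y\cdot z \geq 1$ holds, and grows linearly with the size of the violation otherwise.   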
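
A minimal sketch of the hinge cost, assuming the standard definition $L_{hinge}(z, y) = \max(0, 1 - y\cdot z)$. The example scores below are illustrative.

```python
import numpy as np

def hinge_loss(z, y):
    """Hinge loss max(0, 1 - y*z) for scores z and labels y in {1, -1}."""
    return np.maximum(0.0, 1.0 - y * z)

# Three points with label +1 and different scores z = W^T x + b
z = np.array([2.0, 0.5, -1.0])
y = np.array([1, 1, 1])

# z = 2.0: constraint satisfied, loss 0
# z = 0.5: correct side but inside the margin, loss 0.5
# z = -1.0: misclassified, loss 2
losses = hinge_loss(z, y)
print(losses)
```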
  
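This is very similar to the cost function of regularized logistic regression, with the hinge cost in place of the logistic cost. The difference is that here $C$ multiplies the data term instead of the regularization term, so $C$ acts as $\frac{1}{\lambda}$, where $\lambda$ is the regularization parameter. 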
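
To see the correspondence between $C$ and $\lambda$, divide the objective by $C$, which does not change the minimizer. This is a sketch that assumes the regularized cost is written as a data term plus $\frac{\lambda}{2}||\tilde W||^2$, ignoring any $\frac{1}{m}$ averaging factor:

$$\frac{1}{C}\left(\frac{1}{2}||\tilde W||^2 + C\sum_{i=1}^{m} L_{hinge}(\tilde W^{T}x_i + \tilde b, y_i)\right) = \sum_{i=1}^{m} L_{hinge}(\tilde W^{T}x_i + \tilde b, y_i) + \frac{\lambda}{2}||\tilde W||^2, \quad \lambda = \frac{1}{C}$$

A large $C$ therefore corresponds to weak regularization, and a small $C$ to strong regularization.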

Thus the final cost function to be minimized is:
$$\frac{1}{2}||\tilde W||^2 + C\sum_{i=1}^{m} L_{hinge}(\tilde W^{T}x_i + \tilde b, y_i)$$
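
One simple way to minimize this cost is subgradient descent, sketched below with NumPy. This is an illustrative choice rather than the method of any particular library. The toy data, the step size `lr`, the iteration count, and the function names are all assumptions, and the hinge cost is taken to be $\max(0, 1 - y\cdot z)$.

```python
import numpy as np

def svm_cost(W, b, X, y, C):
    """0.5 * ||W||^2 + C * sum of hinge losses."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ W + b))
    return 0.5 * (W @ W) + C * hinge.sum()

def svm_subgradient(W, b, X, y, C):
    """A subgradient of svm_cost with respect to W and b."""
    # Only points with y_i * (W^T x_i + b) < 1 contribute to the hinge term
    active = y * (X @ W + b) < 1
    grad_W = W - C * (X[active].T @ y[active])
    grad_b = -C * y[active].sum()
    return grad_W, grad_b

def fit(X, y, C=1.0, lr=0.01, n_iters=2000):
    """Fixed-step subgradient descent starting from W = 0, b = 0."""
    W = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        grad_W, grad_b = svm_subgradient(W, b, X, y, C)
        W = W - lr * grad_W
        b = b - lr * grad_b
    return W, b

# Toy data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# At W = 0, b = 0 every hinge loss is 1, so the cost is C * m
initial_cost = svm_cost(np.zeros(2), 0.0, X, y, C=1.0)

W_fit, b_fit = fit(X, y, C=1.0)
final_cost = svm_cost(W_fit, b_fit, X, y, C=1.0)
print(initial_cost, final_cost)
```

A fixed step size only brings the iterates close to the minimizer. A decaying step size or a dedicated quadratic programming solver would be the more careful choice.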