# Support Vector Machine

A more flexible, more "powerful" relative of the perceptron: both try to find a hyperplane that separates the classes in feature space. It is not really a probabilistic approach; it is better viewed as a computational optimization method.

When the data are not linearly separable, an SVM can:
- soften the idea of "separating"
- enrich or enlarge the feature space so that the separation becomes possible

A **hyperplane** in $p$ dimensions is a flat affine subspace of dimension $p-1$, i.e. the set of points where $f(x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p = 0$. In other words, we look for a function $f$ such that $f(x)>0$ for data points in one class and $f(x)<0$ for data points in the other class.
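As a minimal sketch of this definition, the snippet below evaluates $f(x)$ and classifies a point by its sign. The coefficient values and the helper names `f` and `classify` are made up for illustration; they are not fitted to any data.

```python
def f(x, beta0, beta):
    """Evaluate f(x) = beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def classify(x, beta0, beta):
    """Return +1 if x lies on the positive side of the hyperplane, else -1."""
    return 1 if f(x, beta0, beta) > 0 else -1


# Hypothetical hyperplane in p = 2 dimensions: the line x_1 + x_2 = 1.
beta0, beta = -1.0, [1.0, 1.0]

print(classify([2.0, 2.0], beta0, beta))  # f = 3, positive side
print(classify([0.0, 0.0], beta0, beta))  # f = -1, negative side
```

A point with $f(x) = 0$, such as $(0.5, 0.5)$ here, lies exactly on the hyperplane.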

At its core it is a maximal margin classifier: it finds the separating hyperplane with the biggest gap, or margin, between the two classes.
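To make "margin" concrete, here is a small sketch that measures the margin of a given separating hyperplane as the distance to its closest point. The toy points and the two candidate hyperplanes are assumptions chosen by hand; the code only compares candidates and does not solve the optimization itself.

```python
import math


def distance_to_hyperplane(x, beta0, beta):
    """Perpendicular distance from x to the hyperplane f(x) = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    value = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return abs(value) / norm


def margin(points, beta0, beta):
    """Margin of a separating hyperplane: distance to the closest point."""
    return min(distance_to_hyperplane(x, beta0, beta) for x in points)


# Hypothetical toy data: one class has x_1 < 0, the other has x_1 > 0.
points = [[-3.0, 0.0], [-1.0, 1.0], [1.0, -1.0], [3.0, 0.0]]

# Two candidate hyperplanes; both classify every point correctly.
centered = margin(points, 0.0, [1.0, 0.0])   # the line x_1 = 0
shifted = margin(points, -0.5, [1.0, 0.0])   # the line x_1 = 0.5

print(centered, shifted)
```

Both lines separate the classes, but the centered one leaves a larger gap, so a maximal margin classifier would prefer it.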

To work on data that are not linearly separable, allow some mistakes (slack), measured relative to the size of the margin and limited by a budget $C$, a tuning parameter that controls how soft the margin is. Tuning $C$ plays with the bias-variance trade-off: a larger budget tolerates more margin violations, which gives more bias but less variance.
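One common way to write this down is the budget formulation below. It assumes the labels are coded as $y_i \in \{-1, +1\}$ (different from the 0/1 coding used in the Derivation section) and introduces slack variables $\epsilon_i$ and margin width $M$ as notation not used elsewhere in these notes:

$$
\begin{aligned}
\max_{\beta_0,\ldots,\beta_p,\ \epsilon_1,\ldots,\epsilon_n,\ M}\quad & M \\
\text{subject to}\quad & \sum_{j=1}^{p}\beta_j^2 = 1, \\
& y_i\,(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1-\epsilon_i), \\
& \epsilon_i \ge 0, \quad \sum_{i=1}^{n}\epsilon_i \le C.
\end{aligned}
$$

Here $\epsilon_i = 0$ means observation $i$ is on the correct side of the margin, $\epsilon_i > 0$ means it violates the margin, and $\epsilon_i > 1$ means it is on the wrong side of the hyperplane. Note that some libraries use a parameter also called `C` as a penalty on violations rather than a budget, in which case its effect runs in the opposite direction.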

Sometimes a linear separator simply won't fit the data, even with a high $C$. We can then enlarge the feature space through transformations, going from $p$ dimensions to more than $p$ dimensions (e.g. from $x$ to $(x, x^2)$). A linear boundary in the enlarged space corresponds to a non-linear decision boundary in the original space.

One way to do this is through **kernels**. If we can compute inner products between observations, we can fit a support vector classifier, and certain kernel functions compute those inner products for the enlarged feature space directly, without constructing it explicitly.
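The $x \to (x, x^2)$ example can be sketched as follows. The data and the coefficients are hand-picked assumptions for illustration, not the result of fitting an SVM.

```python
def expand(x):
    """Map a 1-D feature x to the enlarged feature vector (x, x^2)."""
    return (x, x * x)


# Hypothetical 1-D data: class 1 sits on both sides of class 0,
# so no single threshold on x can separate the classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, 0, 0, 0, 1, 1]

# In the enlarged space a linear rule works: f(x, x^2) = -2.5 + 0*x + 1*x^2.
beta0, beta = -2.5, (0.0, 1.0)


def predict(x):
    """Classify x with the linear rule applied to the enlarged features."""
    z = expand(x)
    f = beta0 + beta[0] * z[0] + beta[1] * z[1]
    return 1 if f > 0 else 0


predictions = [predict(x) for x in xs]
print(predictions == ys)
```

The boundary $x^2 = 2.5$ is a straight line in the $(x, x^2)$ plane, but in the original space it is the pair of points $x = \pm\sqrt{2.5}$, which is not a linear boundary.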

Some common kernels are:
- Polynomial kernel: popular in image processing
- Gaussian kernel, also known as the radial basis function (RBF) kernel: general-purpose kernel, used when there is no prior knowledge about the data
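The kernels listed above can be sketched in a few lines. The forms used here, $K(x, z) = (1 + \langle x, z \rangle)^d$ and $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, are common textbook definitions; the parameter names `degree` and `gamma` and the helper `phi_degree2` are assumptions for this example. The sketch also checks that the degree-2 polynomial kernel equals an ordinary inner product in an enlarged feature space.

```python
import math


def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^degree."""
    return (1.0 + dot(x, z)) ** degree


def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def phi_degree2(x):
    """Explicit feature map whose inner product equals the degree-2
    polynomial kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]

# Same quantity two ways: via the kernel, and via the enlarged feature space.
k_direct = polynomial_kernel(x, z)
k_explicit = dot(phi_degree2(x), phi_degree2(z))
print(k_direct, k_explicit)

# The Gaussian kernel is 1 for identical points and decays with distance.
print(gaussian_kernel(x, x), gaussian_kernel(x, z))
```

The kernel needs only the 2-D inner product, while the explicit route builds 6-D vectors first; that gap grows quickly with the degree and the input dimension.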

## Derivation

**Loss Function**: Hinge-loss

<img src="./img/9.svm/figure1.png" alt="Figure 1" width="300"/>

<img src="./img/9.svm/figure2.png" alt="Figure 2" width="500"/>

**Cost Function**: $l(\hat{y}_i) =$
- $\max(0, 1-\beta^Tx_i)$ if $y_i=1$
- $\max(0, 1+\beta^Tx_i)$ if $y_i=0$

**Objective Function**:

$\min_{\beta}\ J(\beta) = \sum_{i=1}^{n} \left[ y_i \max(0, 1-\beta^Tx_i) + (1-y_i)\max(0, 1+\beta^Tx_i) \right]$, minimized with respect to the coefficients $\beta_j$
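A minimal sketch of this objective in code, using the 0/1 label coding from the cost function above. The toy data, the coefficient vector, and the function names are assumptions for illustration; the snippet only evaluates $J(\beta)$ and does not minimize it.

```python
def hinge_cost(score, y):
    """Per-observation cost with labels y in {0, 1}.

    score is the linear score beta^T x for one observation.
    """
    if y == 1:
        return max(0.0, 1.0 - score)
    return max(0.0, 1.0 + score)


def objective(beta, X, y):
    """J(beta): sum of hinge costs over all observations."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(b * xj for b, xj in zip(beta, x_i))
        total += hinge_cost(score, y_i)
    return total


# Hypothetical toy data; the leading 1 in each row acts as an intercept term.
X = [[1.0, 2.0], [1.0, 0.5], [1.0, -0.5], [1.0, -2.0]]
y = [1, 1, 0, 0]
beta = [0.0, 1.0]

print(objective(beta, X, y))
```

The two observations with scores $\pm 2$ are classified correctly with room to spare and cost nothing. The two with scores $\pm 0.5$ are also classified correctly but fall inside the margin, so each contributes a cost of 0.5.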

**SVM vs. Logistic Regression**

- When classes are (nearly) separable, SVM generally does better than LR
- When not, LR (with ridge penalty) and SVM are similar
- LR is probabilistic and SVM is not
- Kernel SVM is good for forming nonlinear boundaries (while we can use kernels with LR as well, the computations are more expensive)
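One way to see why the two methods often behave similarly is to compare their losses as a function of the margin $m = y \cdot f(x)$. The sketch below assumes labels coded as $\pm 1$ (not the 0/1 coding used in the Derivation section), and the function names are made up for this example.

```python
import math


def hinge_loss(margin):
    """SVM hinge loss as a function of the margin m = y * f(x)."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression loss (negative log-likelihood) for the same margin."""
    return math.log(1.0 + math.exp(-margin))


def logistic_probability(score):
    """LR turns a score f(x) into an estimate of P(y = +1 | x)."""
    return 1.0 / (1.0 + math.exp(-score))


for m in [-2.0, 0.0, 1.0, 3.0]:
    print(m, hinge_loss(m), round(logistic_loss(m), 4))

print(logistic_probability(0.0))
```

Both losses grow roughly linearly for badly misclassified points. The hinge loss is exactly zero once the margin reaches 1, while the logistic loss only approaches zero, and only LR maps the score to a probability.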