# Chapter 5 - Support Vector Machines
## Linear SVM Classification
SVM classifier: fitting the widest possible street between the classes. <br>
A decision boundary is fully determined by support vectors (instances located on the edge of the street). <br>
SVMs are sensitive to feature scales.
### Soft Margin Classification
Hard margin classification: strictly imposing that all instances be off the street and on the right side. <br>
Soft margin classification: finding a good balance between keeping the street as large as possible and limiting the margin violations. <br>
Margin violation: an instance that ends up in the middle of the street or even on the wrong side. <br>
`C` hyperparameter: a smaller `C` value leads to a wider street but more margin violations, while a higher `C` value leads to fewer margin violations but a narrower street. <br>


In [30]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 if Iris virginica, else 0.0

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge"))
])

svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=1, loss='hinge'))])

In [31]:
svm_clf.predict([[5.5, 1.7]])

array([1.])
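
To make the effect of `C` concrete, here is a minimal sketch that measures the street width ($2 / ||w||$) and counts margin violations for a few values of `C` on the same petal features. It swaps in `SVC(kernel="linear")` instead of the `LinearSVC` used above, and the helper name `street_stats` and the `0.01` tolerance are assumptions made for this illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                # petal length, petal width
t = np.where(iris["target"] == 2, 1, -1)   # +1 for Iris virginica, -1 otherwise


def street_stats(C):
    """Hypothetical helper: fit a linear soft margin SVM and return
    (street width in scaled units, number of margin violations)."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X, t)
    w = model[-1].coef_[0]
    width = 2 / np.linalg.norm(w)          # distance between the two edges of the street
    # An instance violates the margin when t * (w.x + b) < 1.
    # The small tolerance (an assumption) absorbs the solver's finite precision.
    violations = int(np.sum(t * model.decision_function(X) < 1 - 0.01))
    return width, violations


for C in (0.01, 1, 100):
    width, violations = street_stats(C)
    print(f"C={C}: street width={width:.2f}, margin violations={violations}")
```

On this dataset the smallest `C` should give the widest street and the most violations, which matches the trade-off described in the notes.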

## Nonlinear SVM Classification
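One way to handle a nonlinear dataset is to add more features, such as polynomial features; in some cases this makes the dataset linearly separable.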
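
A minimal sketch of this idea, assuming the moons toy dataset from `make_moons` and illustrative hyperparameter values (`degree=3`, `C=10`): the same linear SVM is trained once on the raw features and once on polynomial features.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Two interleaving half circles: not linearly separable in the original 2D space
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Baseline: linear SVM on the two original features
linear_clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=10, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)

# Same classifier, but on all polynomial combinations of the features up to degree 3
poly_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)

linear_acc = linear_clf.fit(X, y).score(X, y)
poly_acc = poly_clf.fit(X, y).score(X, y)
print(f"training accuracy, original features:   {linear_acc:.2f}")
print(f"training accuracy, polynomial features: {poly_acc:.2f}")
```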

### Polynomial Kernel
The kernel trick makes it possible to get the same result as if you added many polynomial features, even with very high degree polynomials, without actually having to add them. <br>
`coef0` controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
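
As a sketch of the kernel trick in practice, the following trains an `SVC` with a polynomial kernel on the moons dataset. The dataset and the values `degree=3`, `coef0=1`, `C=5` are assumptions chosen for illustration; note that the model still only sees the two original features.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# No PolynomialFeatures step: the kernel handles the polynomial mapping implicitly
poly_kernel_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5),
)
poly_kernel_clf.fit(X, y)

train_acc = poly_kernel_clf.score(X, y)
n_input_features = poly_kernel_clf[-1].support_vectors_.shape[1]
print(f"training accuracy: {train_acc:.2f}, input features seen by the SVC: {n_input_features}")
```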
### Adding Similarity Features
### Gaussian RBF Kernel
Increasing `gamma` makes the bell-shape curve narrower. <br>
=> Each instance's range of influence is smaller. <br>
=> The decision boundary ends up being more irregular, wiggling around individual instances. <br>
=> $\gamma$ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and vice versa.
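
The sketch below illustrates both points under stated assumptions: `gaussian_rbf` is a hypothetical helper implementing the usual form $\exp(-\gamma \, ||x - \ell||^{2})$, and the moons dataset with `gamma` values 0.1 and 100 is chosen only to make the contrast visible.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def gaussian_rbf(x, landmark, gamma):
    """Similarity between x and a landmark: exp(-gamma * ||x - landmark||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(landmark, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))


# Similarity at distance 1 from the landmark: it falls off faster as gamma grows,
# i.e. the bell-shaped curve gets narrower
for gamma in (0.1, 5):
    print(f"gamma={gamma}: similarity at distance 1 = {gaussian_rbf([1.0], [0.0], gamma):.4f}")

# A large gamma lets the boundary wiggle around individual training instances,
# which shows up as a higher training accuracy (and a risk of overfitting)
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
train_acc = {}
for gamma in (0.1, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1))
    train_acc[gamma] = clf.fit(X, y).score(X, y)
    print(f"gamma={gamma}: training accuracy = {train_acc[gamma]:.2f}")
```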
### Computational Complexity
Training time complexity of the `LinearSVC` class: $O(m \times n)$, where $m$ is the number of training instances and $n$ the number of features <br>
Training time complexity of the `SVC` class: usually between $O(m^{2} \times n)$ and $O(m^{3} \times n)$


## SVM Regression
SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street). <br>
The width of the street is controlled by a hyperparameter $\varepsilon$.
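
A small sketch of the role of $\varepsilon$, assuming synthetic linear data with Gaussian noise and `SVR(kernel="linear")` as the regressor (both are choices made for this illustration). The hypothetical helper `off_street_count` counts the instances whose residual exceeds $\varepsilon$, that is, the ones left off the street.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data (an assumption): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)


def off_street_count(epsilon):
    """Fit a linear SVM regressor and count the instances farther than epsilon
    from the prediction (i.e. off the street)."""
    svr = SVR(kernel="linear", C=1.0, epsilon=epsilon).fit(X, y)
    residuals = np.abs(y - svr.predict(X))
    return int(np.sum(residuals > epsilon))


for epsilon in (0.1, 1.5):
    print(f"epsilon={epsilon}: instances off the street = {off_street_count(epsilon)}")
```

A wider street (larger `epsilon`) should leave fewer instances outside it.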

## Under the Hood
### Decision Function and Predictions
The linear SVM classifier model predicts the class of a new instance $x$ by simply computing the decision function: $w^{T}x + b = w_{1}x_{1} + \cdots + w_{n}x_{n} + b$ <br>
$$\hat y = \begin{cases} 0 \text{ if $w^{T}x + b < 0,$} \\ 1 \text{ if $w^{T}x + b \geq 0$} \end{cases}$$
Training a linear SVM classifier means finding the values of $w$ and $b$ that make the margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).
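
The decision function can be reproduced by hand from a fitted model. This sketch assumes a linear `SVC` trained on the scaled petal features of the iris dataset; `coef_` and `intercept_` play the roles of $w$ and $b$.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])  # petal length, petal width
y = (iris["target"] == 2).astype(int)                        # 1 if Iris virginica, else 0

clf = SVC(kernel="linear", C=1).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

scores = X @ w + b                  # w^T x + b for every instance
y_hat = (scores >= 0).astype(int)   # class 1 if the score is non-negative, else class 0

print("matches decision_function:", np.allclose(scores, clf.decision_function(X)))
print("share of predictions matching predict():", np.mean(y_hat == clf.predict(X)))
```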


### Training Objective
The slope of the decision function is equal to the norm of the weight vector, $||w||$, so the smaller $||w||$ is, the larger the margin. <br>
Define $t^{(i)} = -1$ for negative instances ($y^{(i)}=0$) and $t^{(i)}=1$ for positive instances ($y^{(i)}=1$); avoiding all margin violations then means $t^{(i)}(w^{T}x^{(i)}+b) \geq 1$ for all instances. <br>
Hard margin linear SVM classifier objective: <br>
minimize $\frac{1}{2}w^{T}w$ ($=\frac{1}{2}||w||^{2}$), used instead of $||w||$ because $||w||$ is not differentiable at $w = 0$ <br>
subject to $t^{(i)}(w^{T}x^{(i)}+b) \geq 1$ for $i = 1, 2, \cdots, m$ <br>
Slack variable $\zeta^{(i)} \geq 0$: $\zeta^{(i)}$ measures how much the $i^{th}$ instance is allowed to violate the margin. <br>
The `C` hyperparameter defines the trade-off between two conflicting objectives: making the margin as large as possible and keeping the margin violations as small as possible.
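
With slack variables, the soft margin objective is commonly written as: minimize $\frac{1}{2}w^{T}w + C\sum_{i=1}^{m}\zeta^{(i)}$ subject to $t^{(i)}(w^{T}x^{(i)}+b) \geq 1 - \zeta^{(i)}$ and $\zeta^{(i)} \geq 0$. The sketch below evaluates that objective for a given $w$ and $b$, using the smallest slack that satisfies the constraints, $\zeta^{(i)} = \max(0, 1 - t^{(i)}(w^{T}x^{(i)}+b))$. The function name and the toy data are assumptions for illustration.

```python
import numpy as np


def soft_margin_objective(w, b, X, t, C):
    """Return (objective value, slack per instance) for a linear soft margin SVM.

    The slack of each instance is the smallest zeta >= 0 that satisfies
    t * (w.x + b) >= 1 - zeta.
    """
    margins = t * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * slack.sum(), slack


# Toy data: two instances off the street, one inside it
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
t = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
b = 0.0

objective, slack = soft_margin_objective(w, b, X, t, C=1.0)
print("slack per instance:", slack)   # only the third instance needs slack
print("objective:", objective)        # 0.5 * ||w||^2 + C * sum(slack)
```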

### Quadratic Programming
A Quadratic Programming (QP) problem is a convex quadratic optimization problem with linear constraints.
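
As a rough illustration, the hard margin problem on a two-point toy dataset can be handed to a general-purpose constrained optimizer. This sketch uses SciPy's `minimize` with the SLSQP method as a stand-in for a dedicated QP solver (an assumption made for convenience); the parameter vector `p` packs $w$ followed by $b$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable dataset: one positive and one negative instance
X = np.array([[2.0, 2.0], [0.0, 0.0]])
t = np.array([1.0, -1.0])


def objective(p):
    """Quadratic objective 0.5 * w^T w (the bias term p[-1] is not penalized)."""
    w = p[:-1]
    return 0.5 * w @ w


# Linear constraints t_i * (w.x_i + b) - 1 >= 0, one per training instance
constraints = [{"type": "ineq", "fun": lambda p: t * (X @ p[:-1] + p[-1]) - 1.0}]

result = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w_hat, b_hat = result.x[:-1], result.x[-1]
print("w =", w_hat, "b =", b_hat)
```

For this dataset the widest street lies halfway between the two points, so the solver should land close to $w = (0.5, 0.5)$ and $b = -1$.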
### The Dual Problem
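Given a constrained optimization problem (the primal problem), a different but closely related problem (its dual problem) can be expressed. <br>
Under some conditions, which the SVM problem meets, the primal problem and the dual problem have the same solution. <br>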
Computing $\hat{\alpha}$ for the dual problem $\Rightarrow$ computing $\hat{w}, \hat{b}$ for the primal problem. <br>
The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features. <br>
The dual problem makes the kernel trick possible, while the primal does not.
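
The link from the dual solution back to the primal weights can be checked on a fitted model. In this sketch (a linear `SVC` on the scaled petal features of iris, chosen as an assumption), `dual_coef_` holds $\alpha^{(i)} t^{(i)}$ for each support vector, so $\hat{w} = \sum_{i} \alpha^{(i)} t^{(i)} x^{(i)}$ is a single matrix product.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])  # petal length, petal width
y = (iris["target"] == 2).astype(int)                        # 1 if Iris virginica, else 0

clf = SVC(kernel="linear", C=1).fit(X, y)

# Only support vectors have a nonzero alpha, so the sum runs over them alone
w_from_dual = (clf.dual_coef_ @ clf.support_vectors_)[0]

print("support vectors:", len(clf.support_), "out of", len(X), "instances")
print("w rebuilt from the dual solution:", w_from_dual)
print("w reported by the model:         ", clf.coef_[0])
```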
### Kernelized SVM
In Machine Learning, a *kernel* $K(a, b)$ is a function capable of computing the dot product $\phi (a)^{T} \phi (b)$ based only on the original vectors **a** and **b**, without having to compute (or even know about) the transformation $\phi$.
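
A worked check for the 2nd-degree polynomial case, as a minimal sketch: for 2D vectors, the mapping $\phi(x) = (x_{1}^{2}, \sqrt{2}x_{1}x_{2}, x_{2}^{2})$ satisfies $\phi(a)^{T}\phi(b) = (a^{T}b)^{2}$. The function names `phi` and `poly2_kernel` are chosen for this illustration.

```python
import numpy as np


def phi(x):
    """Explicit 2nd-degree polynomial mapping of a 2D vector into 3D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def poly2_kernel(a, b):
    """Kernel K(a, b) = (a^T b)^2, computed in the original 2D space."""
    return np.dot(a, b) ** 2


a = np.array([2.0, 3.0])
b = np.array([4.0, 5.0])

explicit = phi(a) @ phi(b)      # dot product after transforming both vectors
via_kernel = poly2_kernel(a, b)  # same value without ever computing phi

print(explicit, via_kernel)     # both equal (2*4 + 3*5)^2 = 529, up to rounding
```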
### Online SVMs
Hinge loss function: $\max(0, 1 - t)$
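
A minimal sketch of the hinge function and of how it can drive online learning: for a linear SVM, one option is gradient descent on a cost built from the hinge loss. The helper `online_step` below takes one subgradient step on $\frac{1}{2}||w||^{2} + C\max(0, 1 - t(w^{T}x + b))$ for a single instance; its name, the per-instance form of the cost, and the default values of `C` and `eta` are assumptions for illustration.

```python
import numpy as np


def hinge(t):
    """Hinge function max(0, 1 - t), applied elementwise."""
    return np.maximum(0.0, 1.0 - np.asarray(t, dtype=float))


def hinge_subgradient(t):
    """A subgradient of the hinge function: -1 where t < 1, 0 elsewhere.
    At t = 1 any value in [-1, 0] is valid; 0 is used here."""
    return np.where(np.asarray(t, dtype=float) < 1.0, -1.0, 0.0)


def online_step(w, b, x, t, C=1.0, eta=0.1):
    """One subgradient step on 0.5 * ||w||^2 + C * hinge(t * (w.x + b))
    for a single new instance (x, t), with t in {-1, +1}."""
    if t * (w @ x + b) < 1:      # instance inside the street or on the wrong side
        grad_w, grad_b = w - C * t * x, -C * t
    else:                        # hinge term is flat, only the margin term remains
        grad_w, grad_b = w, 0.0
    return w - eta * grad_w, b - eta * grad_b


print(hinge([-1, 0, 1, 2]))               # loss is zero once t >= 1
print(hinge_subgradient([-1, 0, 1, 2]))

w, b = online_step(np.zeros(2), 0.0, np.array([1.0, 2.0]), t=1.0)
print(w, b)                               # moved toward classifying x as positive
```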
