# Linear SVM

- Goal: Find the hyperplane $w x + b = 0$ (normal vector $w$, bias $b$) that maximizes the margin
  - Hyperplane:
    - Positive $H_{positive}: w x + b = 1$
    - Negative $H_{negative}: w x + b = -1$
  - Margin: $\hat{\gamma}$
- Support Vectors: data points that are closest to the hyperplane and determine its position and orientation
- Loss function: Hinge Loss
$$\mathcal{L} = \sum_{i=1}^n \max(0, 1 - y_i (w x_i + b))$$
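
As a rough illustration of the hinge loss above, the sketch below evaluates it with NumPy. The function name `hinge_loss` and the toy data are assumptions made for this example; labels are taken to be in $\{-1, +1\}$.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Sum of hinge losses max(0, 1 - y_i (w . x_i + b)) over all samples."""
    margins = y * (X @ w + b)          # functional margin of each sample
    return np.sum(np.maximum(0.0, 1.0 - margins))

# Hypothetical toy data: labels must be in {-1, +1}.
X = np.array([[2.0, 2.0], [0.5, 0.5], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0

loss = hinge_loss(w, b, X, y)
```

Only the second sample lies inside the margin (its functional margin is 0.5), so it is the only one that contributes to the loss.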

<img src="https://miro.medium.com/max/921/1*06GSco3ItM3gwW2scY6Tmg.png" alt="SVM"  style="width:600px;"/>

$\textbf{Derivations}$:
- Primal:
$$\underset{w, b}{\mathrm{max}} \frac{\hat{\gamma}}{||w||} \quad \text{s.t. } y_i (w x_i + b) \geq \hat{\gamma}$$
- Rescale $\hat{\gamma}=1$, transform to convex optimization:
$$\underset{w, b}{\mathrm{min}} \frac{1}{2} ||w||^2 \quad \text{s.t. } 1 - y_i (w x_i + b) \leq 0$$
- Lagrange Multiplier:
$$L_p (w, b, a_i) = \frac{1}{2} ||w||^2 + \sum_i a_i [1 -  y_i (w x_i + b)]$$
$$\frac{\partial L}{\partial w} = w - \sum_i a_i y_i x_i = 0, \frac{\partial L}{\partial b} = -\sum_i a_i y_i = 0$$
- Plug back into $L_p$ to obtain the dual problem (maximizing the dual is written here as minimizing its negative):
$$\underset{a}{\mathrm{min}} L_D (a) = \underset{a}{\mathrm{min}} \left( \frac{1}{2} \sum_{i, j} a_i a_j y_i y_j \langle x_i, x_j \rangle - \sum_i a_i \right)$$
$$ \text{s.t. } \sum_i a_i y_i = 0, \quad a_i \geq 0$$
  - the dual depends on the data only through the inner products $\langle x_i, x_j \rangle$
  - problems that are not linearly separable can be handled by replacing the inner product with a kernel (kernel trick)
- Decision function:
$$f(x) = \mathrm{sign} \left(\sum_{i=1}^N a_i y_i \langle x_i, x \rangle + b \right)$$
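
A small worked example may help connect the dual variables to the decision function. The sketch below uses a hypothetical two-point dataset whose dual solution can be found by hand, then recovers $w$ and $b$ from the stationarity condition and the margin constraint. The dataset and the helper name `decision` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical two-point dataset, one sample per class.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Dual solution for this dataset, found by hand: the constraint
# sum_i a_i y_i = 0 forces a_1 = a_2 = a, the dual objective becomes
# 4a^2 - 2a, and it is minimized at a = 1/4.
a = np.array([0.25, 0.25])

# Stationarity condition: w = sum_i a_i y_i x_i
w = (a * y) @ X

# Any support vector (a_i > 0) lies on its margin: y_i (w . x_i + b) = 1
sv = np.argmax(a > 0)
b = y[sv] - X[sv] @ w

def decision(x):
    """f(x) = sign(sum_i a_i y_i <x_i, x> + b), using only inner products."""
    return np.sign(np.sum(a * y * (X @ x)) + b)
```

Here $w = (0.5, 0.5)$ and $b = 0$, so both training points sit exactly on their margin hyperplanes.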

$\textbf{Sequential Minimal Optimization (SMO)}$:
- choose some pair $a_i, a_j$, keep other $a_k$'s fixed
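- optimize $L_D$ w.r.t. $a_i, a_j$ (the constraint $\sum_i a_i y_i = 0$ ties the pair together, so each update is a one-dimensional problem with a closed-form solution)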
- repeat until convergence
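
The three steps above can be sketched as follows. This is a simplified, deterministic variant for a linear kernel, not the pair-selection heuristics of a production solver: the second index is chosen by a plain scan, and a box constraint $0 \leq a_i \leq C$ is assumed (a large `C` approximates the hard-margin case). The function name `smo_linear` and its parameters are hypothetical.

```python
import numpy as np

def smo_linear(X, y, C=10.0, tol=1e-6, max_sweeps=100):
    """Simplified SMO sketch for the dual with a linear kernel."""
    n = len(y)
    K = X @ X.T                      # Gram matrix of inner products
    a = np.zeros(n)
    b = 0.0
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            E_i = (a * y) @ K[:, i] + b - y[i]
            # Skip i if it already satisfies the KKT conditions within tol.
            violates = (y[i] * E_i < -tol and a[i] < C) or (y[i] * E_i > tol and a[i] > 0)
            if not violates:
                continue
            for j in range(n):
                if j == i:
                    continue
                E_j = (a * y) @ K[:, j] + b - y[j]
                # Box [lo, hi] keeps 0 <= a_j <= C and sum_k a_k y_k = 0.
                if y[i] != y[j]:
                    lo, hi = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
                else:
                    lo, hi = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]
                if lo >= hi or eta >= 0:
                    continue
                a_i_old, a_j_old = a[i], a[j]
                # Closed-form optimum along the constraint line, clipped to the box.
                a_j_new = float(np.clip(a_j_old - y[j] * (E_i - E_j) / eta, lo, hi))
                if abs(a_j_new - a_j_old) < 1e-12:
                    continue
                a[j] = a_j_new
                a[i] = a_i_old + y[i] * y[j] * (a_j_old - a_j_new)
                # Update b so that the KKT conditions hold for the changed pair.
                b1 = b - E_i - y[i] * (a[i] - a_i_old) * K[i, i] - y[j] * (a[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (a[i] - a_i_old) * K[i, j] - y[j] * (a[j] - a_j_old) * K[j, j]
                if 0 < a[i] < C:
                    b = b1
                elif 0 < a[j] < C:
                    b = b2
                else:
                    b = 0.5 * (b1 + b2)
                changed = True
                break
        if not changed:              # a full sweep without updates: stop
            break
    return a, b

# Hypothetical two-point dataset, one sample per class.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
a, b = smo_linear(X, y)
w = (a * y) @ X                      # recover the primal weights
```

On this dataset a single pair update already reaches the optimum, because the only two multipliers form the only possible pair.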

$\textbf{Equivalent Optimization Problem}$:
- Hard Margin:
$$\underset{w, b}{\mathrm{min}} \frac{1}{2} ||w||^2 \quad \text{s.t. } y_i (w x_i + b) \geq 1$$
- Soft Margin:
$$\underset{w, b, \xi}{\mathrm{min}} \left[\frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi^{(i)} \right] \quad \text{s.t. } y_i (w x_i + b) \geq 1 - \xi^{(i)}, \quad \xi^{(i)} \geq 0$$
- Equivalent unconstrained form (hinge loss plus L2 penalty):
$$\underset{w, b}{\mathrm{min}} \left[\frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i (w x_i + b)) + \lambda ||w||^2 \right] $$
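
Because the regularized hinge-loss form has no constraints, it can be minimized directly. The sketch below uses plain full-batch subgradient descent, which is one simple option rather than the method a dedicated SVM solver would use. The function name `train_linear_svm`, the step size, the number of steps, and the toy data are all assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_steps=200):
    """Full-batch subgradient descent on (1/n) sum hinge + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        margins = y * (X @ w + b)
        active = margins < 1.0                 # samples with non-zero hinge loss
        # Subgradient of the averaged hinge term plus gradient of the penalty.
        grad_w = 2.0 * lam * w - (y[active] @ X[active]) / n
        grad_b = -np.sum(y[active]) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
```

With a fixed step size the iterates hover around the optimum instead of converging exactly; a decaying step size would be the usual refinement.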

# Kernel SVM

$\textbf{Mercer's Theorem}$: If $K$ is symmetric and positive semi-definite, then there exists a feature map $\phi$ such that $K(a, b) = \phi (a)^T \phi (b)$

$\textbf{Kernel trick}$: 
- map the original input to a new feature space
- replace inner product $ x_i \cdot x_j $ by kernel $ K(x_i, x_j) = \phi (x_i) \cdot \phi (x_j) $, so $\phi$ never has to be computed explicitly
- train a linear SVM in the new feature space
$$L_D(a) = \frac{1}{2} \sum_{i, j} a_i a_j y_i y_j K(x_i, x_j) - \sum_i a_i$$
- Decision function:
$$f(x) = \mathrm{sign} \left(\sum_{i=1}^N a_i y_i K(x_i, x) + b \right)$$
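
To make the kernel trick concrete, the sketch below checks it for one case where the feature map is known in closed form: the homogeneous quadratic kernel $K(a, b) = (a^T b)^2$ in two dimensions, whose feature map is $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. The helper names and the sample points are assumptions for illustration.

```python
import math

def phi(x):
    """Explicit feature map for the kernel K(a, b) = (a . b)^2 in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(a, b):
    """Same quantity computed in the original space, without building phi."""
    return dot(a, b) ** 2

a = (1.0, 2.0)
b = (3.0, -1.0)
explicit = dot(phi(a), phi(b))   # inner product in the 3-d feature space
implicit = poly2_kernel(a, b)    # kernel evaluated on the 2-d inputs
```

Both routes give the same value up to floating-point rounding, but the kernel route never leaves the original two-dimensional space.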

$\textbf{Different Types of Kernel}$:
- linear: $K(a, b) = a^T b$
- polynomial: $K(a, b) = (\gamma a^T b + r)^d$
- Gaussian rbf: $K(a, b) = \exp \left( -\frac{||a - b||^2}{2 \sigma^2} \right)$
- sigmoid: $K(a, b) = \tanh(\gamma a^T b + r)$
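
The kernels listed above can be sketched in NumPy as functions that return a full Gram matrix. The Gaussian kernel is written with a negative exponent so that similarity decays with distance. Function names, default parameter values, and the sample points are assumptions; the last lines check numerically that the Gaussian Gram matrix is symmetric with non-negative eigenvalues, as Mercer's condition requires.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, gamma=1.0, r=1.0, d=2):
    return (gamma * (A @ B.T) + r) ** d

def rbf_kernel(A, B, sigma=1.0):
    # Squared Euclidean distance between every row of A and every row of B.
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * (A @ B.T)
    )
    return np.exp(-sq_dist / (2.0 * sigma**2))

def sigmoid_kernel(A, B, gamma=1.0, r=0.0):
    return np.tanh(gamma * (A @ B.T) + r)

# Hypothetical sample points, one per row.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5]])
K = rbf_kernel(X, X)
min_eig = np.linalg.eigvalsh(K).min()   # smallest eigenvalue of the Gram matrix
```

The same eigenvalue check can fail for the sigmoid kernel, which is not positive semi-definite for every choice of $\gamma$ and $r$.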

$\textbf{Hyperparameters}$:
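- gamma $\gamma$ (kernel coefficient; in the common parameterization $K(a, b) = \exp(-\gamma ||a - b||^2)$ of the Gaussian rbf, $\gamma = 1 / (2 \sigma^2)$): a larger gamma narrows the influence of each training point, so the decision boundary bends to fit the training data and tends to overfit
- C (inversely related to $\lambda$; matching the two soft-margin forms gives $C = 1 / (2 n \lambda)$): the penalty on margin violations
  - larger C: violations cost more, so the margin is narrower and the risk of overfitting is higher
  - smaller C: more violations are tolerated, so the margin is wider and regularization is stronger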
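
One way to see why a large gamma leads to overfitting is to look at the Gaussian Gram matrix directly. The sketch below assumes the parameterization $K(a, b) = \exp(-\gamma ||a - b||^2)$ and uses hypothetical one-dimensional samples; the name `rbf_gram` is made up for this example.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Gram matrix for K(a, b) = exp(-gamma * ||a - b||^2)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-gamma * np.sum(diff**2, axis=-1))

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # hypothetical 1-d samples
off_diag = ~np.eye(len(X), dtype=bool)       # mask that drops the diagonal

# Average similarity between distinct points for several gamma values.
mean_similarity = {
    gamma: rbf_gram(X, gamma)[off_diag].mean()
    for gamma in (0.01, 1.0, 100.0)
}
```

As gamma grows, the similarity between distinct points drops toward zero and the Gram matrix approaches the identity, so each training point is effectively fitted on its own.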

# Support Vector Regression (SVR)

$\textbf{Objective}$:
$$\underset{w, b}{\mathrm{min}} \left[\frac{1}{2} ||w||^2 + C \sum_{i=1}^n L_{\epsilon} (w x_i + b - y_i) \right]$$
$$L_{\epsilon} (z) = \max(|z| - \epsilon, 0)$$
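
The $\epsilon$-insensitive loss ignores residuals that fall inside a tube of half-width $\epsilon$ around the prediction. A minimal sketch that evaluates the objective for given $w$ and $b$ follows; the function names, default values of `C` and `eps`, and the toy data are assumptions, and no optimization is performed.

```python
def eps_insensitive(z, eps):
    """L_eps(z) = max(|z| - eps, 0): residuals inside the tube cost nothing."""
    return max(abs(z) - eps, 0.0)

def svr_objective(w, b, X, y, C=1.0, eps=0.5):
    """0.5 * ||w||^2 + C * sum_i L_eps(w . x_i + b - y_i) for list inputs."""
    reg = 0.5 * sum(wk * wk for wk in w)
    residuals = [
        sum(wk * xk for wk, xk in zip(w, x)) + b - yi
        for x, yi in zip(X, y)
    ]
    return reg + C * sum(eps_insensitive(z, eps) for z in residuals)

# Hypothetical 1-d regression data.
X = [[1.0], [2.0], [3.0]]
y = [1.2, 2.0, 4.0]
value = svr_objective(w=[1.0], b=0.0, X=X, y=y)
```

Only the third sample has a residual larger than $\epsilon$ (its residual is $-1$), so it alone adds to the loss term.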

# Support Vector Domain Description (SVDD): One-class SVM

$\textbf{Objective}$: find the smallest sphere with center $o$ and radius $R$ that encloses the positive samples, allowing slack $\xi_i$ for points that fall outside
$$\underset{R, o, \xi}{\mathrm{min}} L(R, o, \xi) = \underset{R, o, \xi}{\mathrm{min}} \left[ R^2 + C \sum_{i=1}^n \xi_i \right]$$
$$\text{s.t. } ||x_i - o||^2 \leq R^2 + \xi_i, \quad \xi_i \geq 0$$
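
The sketch below evaluates the SVDD objective for a given center and radius rather than solving for them; it only illustrates how the slack variables and the inlier test follow from the constraint. The function names, the value of `C`, and the sample points are assumptions.

```python
import numpy as np

def svdd_objective(center, radius, X, C=1.0):
    """R^2 + C * sum_i xi_i with the smallest feasible slack for each sample."""
    sq_dist = np.sum((X - center) ** 2, axis=1)
    xi = np.maximum(0.0, sq_dist - radius**2)   # slack of points outside the sphere
    return radius**2 + C * np.sum(xi), xi

def is_inlier(x, center, radius):
    """A point is accepted when it lies inside or on the sphere."""
    return bool(np.sum((x - center) ** 2) <= radius**2)

# Hypothetical samples: three near the origin and one far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
center = np.array([0.0, 0.0])
value, xi = svdd_objective(center, radius=1.0, X=X)
```

With unit radius only the distant point needs slack, which shows how `C` trades a larger sphere against the cost of leaving points outside.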

$\textbf{Advantages}$:
- Fast prediction
- Works well with high-dimensional data
- Kernel functions adapt to many types of data

$\textbf{Disadvantages}$:
- High training cost on large datasets
- Practical only for small to medium sample sizes (fewer than about 1 million samples)
- Performance depends strongly on the soft-margin parameter $C$