# Support Vector Machine

---

## References

[Scikit Learn - Support Vector Machines](https://scikit-learn.org/stable/modules/svm.html)

[IBM - What are support vector machines (SVMs)?](https://www.ibm.com/think/topics/support-vector-machine)

[Medium - Math behind SVM (Support Vector Machine)](https://ankitnitjsr13.medium.com/math-behind-support-vector-machine-svm-5e7376d0ee4d)

[Spot Intelligence - Support Vector Machines (SVM) In Machine Learning Made Simple & How To Tutorial](https://spotintelligence.com/2024/05/06/support-vector-machines-svm/)

---

## Notes

#### Characteristics
- Supervised learning
- usually used for classification tasks
- finds the best (maximum-margin) hyperplane that separates the classes
- margin is the distance between the hyperplane and the nearest data points (support vectors)
- a larger margin usually means better generalization

#### Input & Output
- **Input**: feature matrix $X$ with shape (n_samples, n_features)
- **Output**: label vector $y$ with shape (n_samples,)

#### Parameters
- $\vec{w}$: weight vector of hyperplane
- $b$: bias term of hyperplane

#### Hyperparameters
- $C$: regularization parameter (weight on the classification error)
- $\alpha$: learning rate
- number of epochs
- batch size
- kernel type & parameters

#### Runtime Complexity
For linear SVM,
- **Training**: $O(n\cdot d)$ per epoch
    - update the parameters once for each data point during optimization
- **Inference**: $O(d)$
    - take the sign of the signed distance from the hyperplane to the data point

where
- $n$: number of training data points
- $d$: number of features

#### Pros & Cons
Linear SVM:
- **Pros**:
    - fast training
    - scales well to large datasets
    - simple mathematics
    - efficient inference
- **Cons**:
    - works well only if the data is (approximately) linearly separable


Polynomial Kernel SVM:
- **Pros**:
    - captures nonlinear relationships
    - fast on small datasets
- **Cons**:
    - more parameters
    - computationally heavier than linear SVMs
    - might overfit

---

## Mathematics

### Terms
- $\vec{w}$: weight vector
- $\vec{x}$: data point
- $b$: bias term
- $y$: label ($1$ or $-1$)
- $\mu$: margin ($\frac{1}{\|\vec{w}\|}$) 
- $C$: regularization parameter (weight on the classification error)
- $N$: number of data points

### Linear SVM

#### Decision Boundary
$$P:\vec{w}\cdot\vec{x}+b=0$$

#### Signed Distance
The signed distance tells which side of the hyperplane $P$ the point $\vec{x}$ lies on and how far away it is:
$$d(\vec{x})=\frac{\vec{w}\cdot\vec{x}+b}{\|\vec{w}\|}$$

- $d(\vec{x})=0$ means $\vec{x}$ is on the hyperplane
- $0<|d(\vec{x})|\leq \mu$ means $\vec{x}$ is within the margin ($\mu=\frac{1}{\|\vec{w}\|}$) from the hyperplane
- $\mu< |d(\vec{x})|$ means $\vec{x}$ is more than the margin away from the hyperplane
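
The signed distance can be sketched in NumPy as follows. The function name `signed_distance` and the example hyperplane are assumptions chosen for illustration, not part of any library.

```python
import numpy as np

def signed_distance(X, w, b):
    """Signed distance of each row of X from the hyperplane w.x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)

# Hypothetical 2-D hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w = np.array([3.0, 4.0])
b = -5.0
X = np.array([
    [3.0, 4.0],   # on the positive side
    [0.0, 0.0],   # on the negative side
    [1.0, 0.5],   # exactly on the hyperplane
])

d = signed_distance(X, w, b)
margin = 1.0 / np.linalg.norm(w)  # mu = 1 / ||w||
print(d, margin)
```

Comparing `np.abs(d)` with `margin` tells whether a point lies within the margin of the hyperplane.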

#### Prediction
Predicted label is the sign ($+1$ or $-1$) of the signed distance from the decision boundary.
$$\hat{y}=\text{sign}(\vec{w}\cdot\vec{x}+b)$$

The prediction is correct if $\hat{y}$ and $y$ are both $+1$'s or both $-1$'s. \
Equivalently, the prediction is correct if $d(\vec{x})$ and $y$ have the same sign:
$$y\cdot d(\vec{x})>0 \quad \text{or} \quad y(\vec{w}\cdot\vec{x}+b)>0$$

- $y(\vec{w}\cdot\vec{x}+b)<0$ means $\vec{x}$ is incorrectly predicted
- $0\leq y(\vec{w}\cdot\vec{x}+b)< 1$ means $\vec{x}$ is not misclassified but lies inside the margin ($0$ means it sits exactly on the boundary)
- $1\leq y(\vec{w}\cdot\vec{x}+b)$ means $\vec{x}$ is correctly predicted and far enough from the boundary
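
A minimal sketch of the prediction rule and the correctness check $y(\vec{w}\cdot\vec{x}+b)>0$. The helper name `predict` is hypothetical, and mapping a score of exactly $0$ to $+1$ is an assumption made here so that every prediction is a valid label.

```python
import numpy as np

def predict(X, w, b):
    """Return +1 or -1 for each row of X, based on the sign of w.x + b."""
    scores = X @ w + b
    # Assumption: points exactly on the boundary (score 0) are labeled +1.
    return np.where(scores >= 0, 1, -1)

# Hypothetical hyperplane and data, for illustration only.
w = np.array([3.0, 4.0])
b = -5.0
X = np.array([[3.0, 4.0], [0.0, 0.0]])
y = np.array([1, 1])  # true labels

y_hat = predict(X, w, b)
# A prediction is correct when the label and the score share the same sign.
correct = y * (X @ w + b) > 0
print(y_hat, correct)
```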

#### Support Vectors
Support vectors are data points that are incorrectly classified or correctly classified but within margin distance from the decision boundary.
$$\text{SV}=\{\vec{x}_i \mid y_i(\vec{w}\cdot\vec{x}_i+b)\leq 1\}$$
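
The set definition above translates directly into a boolean mask. This is a sketch under the assumption that labels are encoded as $\pm 1$; `support_vector_mask` is a hypothetical helper name.

```python
import numpy as np

def support_vector_mask(X, y, w, b):
    """True for every point with y_i * (w.x_i + b) <= 1."""
    return y * (X @ w + b) <= 1.0

# Hypothetical hyperplane x1 = 0 with ||w|| = 1, so the margin is 1.
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([
    [2.0, 0.0],    # correct, outside the margin
    [0.5, 0.0],    # correct, inside the margin
    [-0.5, 0.0],   # misclassified
    [-3.0, 0.0],   # correct, outside the margin
])
y = np.array([1, 1, 1, -1])

mask = support_vector_mask(X, y, w, b)
print(X[mask])  # rows that satisfy the support vector condition
```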

#### Hinge Loss
$$\ell(\vec{x},y)=\max\big(0, 1-y(\vec{w}\cdot\vec{x}+b)\big)$$

- if $\vec{x}$ is incorrectly predicted, $\ell(\vec{x},y) > 1$
- if $\vec{x}$ is correctly predicted but within the margin, $1 \geq \ell(\vec{x},y) > 0$
- if $\vec{x}$ is correctly predicted and far enough from the boundary, $\ell(\vec{x},y) = 0$
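
The three cases can be checked numerically with a short sketch. The function name `hinge_loss` and the toy data are assumptions for illustration.

```python
import numpy as np

def hinge_loss(X, y, w, b):
    """Per-sample hinge loss max(0, 1 - y * (w.x + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Hypothetical hyperplane and data covering the three cases.
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([
    [-0.5, 0.0],   # misclassified: loss > 1
    [0.5, 0.0],    # correct but inside the margin: 0 < loss <= 1
    [2.0, 0.0],    # correct and far enough: loss = 0
])
y = np.array([1, 1, 1])

losses = hinge_loss(X, y, w, b)
print(losses)
```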

#### Risk
$$L(\vec{w},b)=\frac{1}{2}\|\vec{w}\|^2+\frac{C}{N}\sum_{i=1}^{N}\ell(\vec{x}_i,y_i)$$
$$L_{\text{SGD}}(\vec{w},b)=\frac{1}{2}\|\vec{w}\|^2+C\cdot\ell(\vec{x}_i,y_i)$$

- $\frac{1}{2}\|\vec{w}\|^2$: regularization term; minimizing it increases the margin size
- $C$: regularization parameter; manages the trade-off between margin size and classification error
- $\frac{1}{N}\sum_{i=1}^{N}\ell(\vec{x}_i,y_i)$: classification error; mean hinge loss over all data points
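
As a rough sketch, the full-batch risk $L(\vec{w},b)$ can be evaluated like this. The function name `risk` and the toy values are assumptions for illustration.

```python
import numpy as np

def risk(X, y, w, b, C):
    """0.5 * ||w||^2 + (C / N) * sum of hinge losses."""
    losses = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * losses.mean()

# Hypothetical parameters and data.
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, 1, -1])

# Hinge losses are [0, 0.5, 1.5, 0], so the mean loss is 0.5.
value = risk(X, y, w, b, C=2.0)
print(value)  # 0.5 * 1 + 2.0 * 0.5
```

A larger $C$ puts more weight on the mean hinge loss relative to the $\frac{1}{2}\|\vec{w}\|^2$ term.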

#### Gradients
$$\frac{\partial \ell}{\partial \vec{w}} = \begin{cases}0&\vec{x}\not\in\text{SV}\\-y\cdot\vec{x}&\vec{x}\in\text{SV}\end{cases}$$
$$\frac{\partial L}{\partial \vec{w}}=\vec{w}+\frac{C}{N}\sum_{i=1}^{N}\begin{cases}0&\vec{x}_i\not\in\text{SV}\\-y_i\cdot\vec{x}_i&\vec{x}_i\in\text{SV}\end{cases}$$
$$\frac{\partial L_{\text{SGD}}}{\partial \vec{w}}=\vec{w}+\begin{cases}0&\vec{x}_i\not\in\text{SV}\\-C\cdot y_i\cdot\vec{x}_i&\vec{x}_i\in\text{SV}\end{cases}$$


<br>

$$\frac{\partial \ell}{\partial b} = \begin{cases}0&\vec{x}\not\in\text{SV}\\-y&\vec{x}\in\text{SV}\end{cases}$$
$$\frac{\partial L}{\partial b}=\frac{C}{N}\sum_{i=1}^{N}\begin{cases}0&\vec{x}_i\not\in\text{SV}\\-y_i&\vec{x}_i\in\text{SV}\end{cases}$$
$$\frac{\partial L_{\text{SGD}}}{\partial b}=\begin{cases}0&\vec{x}_i\not\in\text{SV}\\-C\cdot y_i&\vec{x}_i\in\text{SV}\end{cases}$$
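
The full-batch gradients $\frac{\partial L}{\partial \vec{w}}$ and $\frac{\partial L}{\partial b}$ can be sketched as below. The function name `gradients` is hypothetical; points with $y_i(\vec{w}\cdot\vec{x}_i+b)=1$ are treated as support vectors, following the set definition of $\text{SV}$.

```python
import numpy as np

def gradients(X, y, w, b, C):
    """Gradients of the full-batch risk with respect to w and b."""
    n = X.shape[0]
    sv = y * (X @ w + b) <= 1.0          # support vector mask
    # Sum of -y_i * x_i over support vectors, scaled by C / N, plus w.
    grad_w = w - (C / n) * (X[sv].T @ y[sv])
    # Sum of -y_i over support vectors, scaled by C / N.
    grad_b = -(C / n) * y[sv].sum()
    return grad_w, grad_b

# Hypothetical parameters and data.
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 1.0], [-3.0, 0.0]])
y = np.array([1, 1, 1, -1])

grad_w, grad_b = gradients(X, y, w, b, C=2.0)
print(grad_w, grad_b)
```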

#### Update
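We will use Stochastic Gradient Descent (SGD): each step takes one data point $(\vec{x}_i, y_i)$ and moves the parameters against the gradient of $L_{\text{SGD}}$, scaled by the learning rate $\alpha$.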

If $\vec{x}_i$ is a support vector:
$$\vec{w} \gets \vec{w} - \alpha\cdot\vec{w}+\bigg(\alpha\cdot C\cdot y_i\cdot \vec{x}_i\bigg)$$
$$b \gets b + \bigg(\alpha\cdot C\cdot y_i\bigg)$$

If $\vec{x}_i$ is not a support vector:
$$\vec{w} \gets \vec{w} - \alpha\cdot\vec{w}$$
$$b \gets b$$
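
Putting the update rules together, a minimal SGD training loop might look like the following. It is a sketch, not a tuned implementation: the function name `train_linear_svm`, the default hyperparameters, the per-epoch shuffling, and the toy dataset are all assumptions, and the batch size is fixed at one sample.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=100, seed=0):
    """Train a linear SVM with per-sample SGD on the hinge-loss objective."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        # Assumption: visit the samples in a fresh random order each epoch.
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) <= 1.0:
                # x_i is a support vector: shrink w and step toward y_i * x_i.
                w = w - lr * w + lr * C * y[i] * X[i]
                b = b + lr * C * y[i]
            else:
                # Not a support vector: only the regularization term acts.
                w = w - lr * w
    return w, b

# Hypothetical linearly separable toy data with labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = train_linear_svm(X, y)
pred = np.where(X @ w + b >= 0, 1, -1)
print(w, b, pred)
```

Each epoch touches every sample once and each update costs $O(d)$, which matches the $O(n\cdot d)$ per-epoch training cost noted earlier.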

### Polynomial Kernel SVM

---

## Comments

Only Classification SVMs (linear and polynomial kernel) will be implemented.