# Support Vector Machine

This notebook contains a from-scratch implementation of a Support Vector Machine (SVM) binary
classifier. This implementation only uses basic Python libraries like numpy and pandas, aside from
some basic scikit-learn functionality for loading and managing test datasets.

Joseph Melby,
2023

### Basics

This supervised learning algorithm constructs a hyperplane in feature space that serves as the
decision boundary of a binary classification problem. It does this by initializing and then
optimizing a set of feature weights and an intercept (the parameters of the hyperplane). In
particular, we begin with a feature matrix $X$ with $n$ rows and a vector $y$ of length $n$ such that

- each row $X_i$ of $X$ contains the features of one sample in the data set, and
- each element $y_i$ of $y$ is $+1$ or $-1$, giving the class of the feature vector $X_i$.
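
Many datasets encode the two classes as 0 and 1 rather than with signed labels. As a minimal sketch, assuming such a 0/1 encoding (the toy arrays and the names `raw_labels` and `y` are illustrative and not taken from this notebook), the labels can be remapped to signed values like this:

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 2 features (columns).
X = np.array([[2.0, 2.0],
              [3.0, 1.0],
              [-2.0, -1.0],
              [-1.0, -3.0]])

# Assumed 0/1 class encoding; a real dataset may use a different one.
raw_labels = np.array([1, 1, 0, 0])

# Map class 0 to -1 and class 1 to +1 so each y_i carries the class as a sign.
y = np.where(raw_labels == 0, -1, 1)
```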

#### Hyperplane
The model is optimized to find a set of weights $W$ and an intercept $b$ defining a
hyperplane

$$W\cdot X + b = 0$$

such that the function

$$f(X_i) = \operatorname{sign}(W\cdot X_i + b)$$

gives the model's classification of feature vector $X_i$ for each row of $X$. The hyperplane
parameters are optimized by minimizing a cost function that trades off a wide margin of separation
against classification errors.
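
To make the decision rule concrete, here is a minimal sketch of $f$ in NumPy. The function name `predict` and the toy values are assumptions for illustration. Assigning points that lie exactly on the hyperplane to the positive class is also a choice made here, since the sign of zero is otherwise ambiguous.

```python
import numpy as np

def predict(X, W, b):
    """Classify each row of X as +1 or -1 by the side of the hyperplane it falls on."""
    scores = X @ W + b
    # Points exactly on the hyperplane (score == 0) are assigned to +1 by convention here.
    return np.where(scores >= 0, 1, -1)

# Hypothetical weights, intercept, and samples.
W = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.5]])

labels = predict(X, W, b)
```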

As a convenience for this implementation, we can fold the intercept $b$ into the weights $W$ by
appending it as an extra entry at the end of the weight vector and appending a constant $1$ to each
feature vector:

$$ f(X_i) = \operatorname{sign}\left(\tilde{W} \cdot \tilde{X_i} + W_0\right) = \operatorname{sign}\left(\mathbf{W} \cdot X_i\right) $$

where $W_0 = b$ and

$$ \mathbf{W} = \left(\tilde{W}, W_0 \right), \qquad X_i = \left( \tilde{X_i}, 1 \right). $$
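
The following sketch checks this identity numerically. The helper name `add_bias_column` and the toy values are assumptions for illustration. It appends a column of ones to the feature matrix and the intercept to the weights, then compares the two ways of computing the scores.

```python
import numpy as np

def add_bias_column(X):
    """Append a constant column of ones so the intercept can live inside the weights."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

# Hypothetical original features, weights, and intercept.
X_tilde = np.array([[2.0, 0.0],
                    [0.0, 2.0]])
W_tilde = np.array([1.0, -1.0])
W_0 = 0.5

X_aug = add_bias_column(X_tilde)   # shape (2, 3)
W_aug = np.append(W_tilde, W_0)    # shape (3,)

# Both formulations give the same scores.
scores_explicit = X_tilde @ W_tilde + W_0
scores_combined = X_aug @ W_aug
```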

#### Cost Function

We have two options for the cost function, which differ only in where the regularization parameter
appears. The first is

$$ J(W) = \frac{1}{2}\|W\|^2 + C \left[ \frac{1}{n} \sum_{i=1}^{n} \max \left(0, 1 - y_i \left(W \cdot X_i + b \right)\right)\right] $$

Here, $C$ is a hyperparameter fixed before training; a larger $C$ penalizes margin violations more
heavily and corresponds to a narrower margin between the hyperplane and the support vectors. The
second option is

$$ J(W) = \frac{\lambda}{2}\|W\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i(W \cdot X_i + b)) $$

Here a larger $\lambda$ gives a wider margin and a smaller $\lambda$ gives a narrower margin.
Dividing the first cost function by $C$ yields the second with $\lambda = 1/C$, so the two have the
same minimizer. These parameters encode the regularization strength, and they need to be tuned along
with the other model hyperparameters.

As a final note, once we write the cost function in the implementation below, the combined
weight-intercept vector $\mathbf{W}$ will be used instead of the explicit formula above. Note that
the regularization term then also includes the intercept $W_0$:

$$ J(\mathbf{W}) = \frac{1}{2}\|\mathbf{W}\|^2 + C \left[ \frac{1}{n} \sum_{i=1}^{n} \max \left(0, 1 - y_i \left(\mathbf{W} \cdot X_i\right)\right)\right] $$
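
A minimal sketch of this cost in NumPy is shown below. The function name `compute_cost` and the toy values are assumptions for illustration. `X` is assumed to already carry the trailing column of ones, and `y` to hold labels in $\{-1, +1\}$.

```python
import numpy as np

def compute_cost(W, X, y, C=1.0):
    """Regularized hinge loss: 0.5 * ||W||^2 + C * mean(max(0, 1 - y_i * (W . X_i)))."""
    margins = 1 - y * (X @ W)
    hinge = np.maximum(0, margins)
    return 0.5 * np.dot(W, W) + C * np.mean(hinge)

# Hypothetical augmented features (last column is the constant 1) and labels.
X = np.array([[1.0, 1.0],
              [-1.0, 1.0]])
y = np.array([1, -1])
W = np.array([0.5, 0.0])

cost_c1 = compute_cost(W, X, y, C=1.0)
cost_c2 = compute_cost(W, X, y, C=2.0)
```

Both samples here sit inside the margin, so raising `C` raises the cost, which is the sense in which a larger $C$ penalizes margin violations more heavily.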

#### Gradient Calculation

The hinge term is not differentiable where $1 - y_i \left(\mathbf{W} \cdot X_i\right) = 0$, so
strictly speaking we use a subgradient of the cost function. It is given by the following:

$$ \nabla_{\mathbf{W}} J(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^n 
   \begin{cases}
    \mathbf{W} & \text{if } \max \left(0, 1 - y_i \left(\mathbf{W} \cdot X_i\right)\right) = 0\\
    \mathbf{W} - C y_i X_i & \text{otherwise}.
   \end{cases} $$
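
Since $\mathbf{W}$ appears in every term of the average, the formula simplifies to $\mathbf{W}$ minus $C/n$ times the sum of $y_i X_i$ over the samples whose hinge term is positive. The sketch below uses that vectorized form. The function name `compute_gradient` and the toy values are assumptions for illustration.

```python
import numpy as np

def compute_gradient(W, X, y, C=1.0):
    """Subgradient of the regularized hinge loss, averaged over all samples."""
    margins = 1 - y * (X @ W)
    active = margins > 0  # samples with a nonzero hinge term
    # Sum of y_i * X_i over active samples; a zero vector if none are active.
    hinge_term = (y[active, None] * X[active]).sum(axis=0)
    return W - C * hinge_term / X.shape[0]

# Hypothetical augmented features (last column is the constant 1) and labels.
X = np.array([[1.0, 1.0],
              [-1.0, 1.0]])
y = np.array([1, -1])

grad_active = compute_gradient(np.array([0.5, 0.0]), X, y, C=1.0)
# With large enough weights no sample violates the margin, so only W remains.
grad_inactive = compute_gradient(np.array([2.0, 0.0]), X, y, C=1.0)
```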

The process for updating weights using gradient descent is the following:

1. Compute gradient: $\nabla_{\mathbf{W}} J(\mathbf{W})$
2. Update weights by moving opposite the gradient with learning rate $\alpha$: $\mathbf{W} =
   \mathbf{W} - \alpha \nabla_{\mathbf{W}} J(\mathbf{W})$ 
3. Repeat until, ideally, $J(\mathbf{W})$ is minimized.
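
The three steps above can be sketched as a plain batch gradient descent loop. This is an illustrative sketch under stated assumptions, not the class implemented later in this notebook. The function names, the zero initialization, the fixed iteration count, and the toy data are all hypothetical choices.

```python
import numpy as np

def hinge_cost(W, X, y, C):
    """Regularized hinge loss for augmented features X and labels y in {-1, +1}."""
    return 0.5 * np.dot(W, W) + C * np.mean(np.maximum(0, 1 - y * (X @ W)))

def hinge_gradient(W, X, y, C):
    """Subgradient of hinge_cost with respect to W."""
    active = (1 - y * (X @ W)) > 0
    return W - C * (y[active, None] * X[active]).sum(axis=0) / X.shape[0]

def fit(X, y, C=1.0, alpha=0.05, n_iters=500):
    """Batch gradient descent with a fixed learning rate and iteration count."""
    W = np.zeros(X.shape[1])  # assumed initialization
    history = []
    for _ in range(n_iters):
        grad = hinge_gradient(W, X, y, C)      # step 1: compute gradient
        W = W - alpha * grad                   # step 2: move opposite the gradient
        history.append(hinge_cost(W, X, y, C)) # track the cost after each update
    return W, history                          # step 3: stop after n_iters passes

# Hypothetical linearly separable toy data; last column is the constant 1.
X = np.array([[2.0, 2.0, 1.0],
              [3.0, 1.0, 1.0],
              [-2.0, -1.0, 1.0],
              [-1.0, -3.0, 1.0]])
y = np.array([1, 1, -1, -1])

W, history = fit(X, y)
predictions = np.where(X @ W >= 0, 1, -1)
```

A fixed iteration count is the simplest stopping rule. Stopping once the cost in `history` stops improving by more than a small tolerance is a common alternative.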


### SVM Class



In [None]:
# Importing necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# For splitting test data sets:
from sklearn.model_selection import train_test_split