### 1. SVM Hinge Loss Definition

The multiclass SVM hinge loss for a single training example $x_{i}$ with correct label $y_{i}$ is defined as follows:
<br><br>
\begin{equation}
    L_{i} = \sum_{j \neq y_{i}} \max(0, s_{j} - s_{y_{i}} + \Delta)
\end{equation}
<br><br>
$\Delta$ is the margin, a hyperparameter. $s_{j}$ and $s_{y_{i}}$ are the scores of the $j$th class and of the correct class $y_{i}$ for training example $x_{i}$. They are computed as the dot products $w_{j}^{T}x_{i}$ and $w_{y_{i}}^{T}x_{i}$, where $w_{j}$ is the weight vector of class $j$.
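
To make the per-example definition concrete, here is a minimal NumPy sketch that evaluates $L_{i}$ directly from a vector of class scores. The function name `hinge_loss_single` and the example scores are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def hinge_loss_single(scores, y, delta=1.0):
    """Multiclass hinge loss L_i for one sample, given its class scores.

    scores: (C,) array of class scores s_j (hypothetical input)
    y:      index of the correct class y_i
    delta:  margin
    """
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0  # the sum runs over j != y_i only
    return margins.sum()

# Illustrative scores for 3 classes; class 0 is the correct one.
scores = np.array([13.0, -7.0, 11.0])
print(hinge_loss_single(scores, y=0, delta=10.0))  # 0 + 8 = 8.0
```

Class 1 is already below the correct class by more than the margin, so it contributes nothing. Class 2 is only 2 below, so it contributes $11 - 13 + 10 = 8$.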
<br><br>
The total loss to be minimized can be written as:
<br><br>
\begin{equation}
    L = \frac{1}{N}\sum_{i=1}^{N}L_{i} + \frac{1}{2}\lambda\|W\|_{2}^{2}
\end{equation}
<br><br>
where $\lambda$ is the regularization strength (a hyperparameter) and the factor $\frac{1}{2}$ cancels the 2 that appears when differentiating the squared norm. Intuitively, the SVM wants the score $s_{y_{i}}=w_{y_{i}}^{T}x_{i}$ of the correct class $y_{i}$ to exceed the score of every other class by at least the margin $\Delta$.
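
The total loss can be sketched with explicit loops as below. This is a minimal illustration rather than a reference implementation. It assumes $W$ holds one column per class (shape `(D, C)`), $X$ holds one row per sample (shape `(N, D)`), and `y` is an integer label array. The name `svm_loss_naive` and the default `delta=1.0` are assumptions.

```python
import numpy as np

def svm_loss_naive(W, X, y, reg, delta=1.0):
    """Mean hinge loss over N samples plus 0.5 * reg * ||W||^2.

    W:   (D, C) weight matrix, column j is w_j
    X:   (N, D) input matrix, row i is x_i
    y:   (N,) integer labels in [0, C)
    reg: regularization strength lambda
    """
    N = X.shape[0]
    C = W.shape[1]
    loss = 0.0
    for i in range(N):
        scores = X[i].dot(W)  # (C,) scores s_j = w_j^T x_i
        for j in range(C):
            if j == y[i]:
                continue  # the sum skips the correct class
            loss += max(0.0, scores[j] - scores[y[i]] + delta)
    # Average the data loss, then add the L2 penalty.
    return loss / N + 0.5 * reg * np.sum(W * W)
```

With all-zero weights every score is 0, so each sample contributes $(C - 1)\Delta$, which is a handy sanity check for an untrained model.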

### 2. Computing the gradient of Hinge Loss (Unvectorized ver.)

In order to compute the gradient of the loss function w.r.t $W$, we start off with the loss for each individual training sample $x_{i}$:<br><br>
\begin{equation}
    L_{i} = \sum_{j \neq y_{i}} \max(0, w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta)
\end{equation}
<br><br>
First, note that the weight matrix $W$ has shape ($D$, $C$) and the input matrix $X$ has shape ($N$, $D$), where $C$, $D$, and $N$ are the number of classes, the feature dimension, and the number of training samples in $X$ respectively. Each $w_{j}$ is therefore the $j$th column of $W$.
<br><br>
From the equation for $L_{i}$, we can compute the gradient of the loss in two cases, depending on which column of $W$ we differentiate with respect to:
<br><br>
1. When taking the gradient of loss w.r.t $w_{y_{i}}$:
<br><br>
\begin{equation}
    \nabla_{w_{y_{i}}} L_{i} = \sum_{j \neq y_{i}} \nabla_{w_{y_{i}}}\max(0, w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta)
\end{equation}
<br><br>
Consider a single term of the sum. When $w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta \leq 0$, the $\max$ evaluates to 0, so that term contributes $\nabla_{w_{y_{i}}} 0 = 0$. When it is greater than 0, the term contributes:
<br><br>
\begin{equation}
    \nabla_{w_{y_{i}}}(w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta) = -x_{i}
\end{equation}
<br><br>
Now, combining the two cases, we have the following result:
<br><br>
\begin{equation}
    \nabla_{w_{y_{i}}} L_{i} = -\left(\sum_{j \neq y_{i}} \mathbb{1}(w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta > 0)\right)x_{i}
\end{equation}
<br><br>
where $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when its condition holds and 0 otherwise.
<br><br>
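2. When taking the gradient of the loss w.r.t $w_{j}$ for a fixed $j \neq y_{i}$, only the $j$th term of the sum depends on $w_{j}$, so the sum collapses to that single term:
<br><br>
\begin{equation}
    \nabla_{w_{j}} L_{i} = \nabla_{w_{j}}\max(0, w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta)
\end{equation}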
<br><br>
With similar computations, we can see that
<br><br>
\begin{equation}
    \nabla_{w_{j}} L_{i} = \mathbb{1}(w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta > 0)x_{i}
\end{equation}
<br><br>
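As a final note, don't forget that the hinge loss is not differentiable everywhere: each term $\max(0, \cdot)$ has a kink where $w_{j}^{T}x_{i} - w_{y_{i}}^{T}x_{i} + \Delta = 0$. The expressions above are therefore subgradients, and the strict inequality inside the indicator assigns a gradient of 0 at the kink.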
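
The two gradient cases derived in this section can be sketched as a loop-based NumPy function. This is one possible unvectorized implementation, assuming `W` is a float array of shape `(D, C)`, `X` has shape `(N, D)`, and `y` holds integer labels. The name `svm_loss_grad_naive` and the default `delta=1.0` are assumptions. The `reg * W` term is the gradient of the $\frac{1}{2}\lambda\|W\|_{2}^{2}$ penalty in the total loss.

```python
import numpy as np

def svm_loss_grad_naive(W, X, y, reg, delta=1.0):
    """Return (loss, dW) for the regularized multiclass hinge loss.

    W: (D, C) float weights, X: (N, D) inputs, y: (N,) integer labels.
    """
    N = X.shape[0]
    C = W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(N):
        scores = X[i].dot(W)  # (C,) scores w_j^T x_i
        for j in range(C):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + delta
            if margin > 0:  # indicator is 1; the kink (margin == 0) gets 0
                loss += margin
                dW[:, j] += X[i]      # case 2: gradient w.r.t. w_j is +x_i
                dW[:, y[i]] -= X[i]   # case 1: each violation adds -x_i
    # Average over samples and add the regularization contributions.
    loss = loss / N + 0.5 * reg * np.sum(W * W)
    dW = dW / N + reg * W
    return loss, dW
```

A common way to validate such a function is to compare `dW` against a centered finite-difference estimate of the loss. The two should agree closely as long as no margin sits exactly on a kink.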

### 3. Computing the vectorized