# Margin and Hinge Loss
- Definition  
The margin (or functional margin) for predicted score $\hat{y}$ and true class $y \in \{-1,1\}$ is $\hat{y} y$.  
  - The margin is a measure of how correct we are  
  - We want to **maximize the margin**  
- Hinge Loss  
$$\ell_{\text {Hinge }}=\max \{1-m, 0\}=(1-m)_{+}$$
<div align="center"><img src = "./hinge.jpg" width = '500' height = '100' align = center /></div>  
The hinge loss is a convex upper bound on the $0-1$ loss. It is not differentiable at $m = 1$. We have a “margin error” when $m < 1$.
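As a minimal sketch of the definitions above, the hinge loss can be computed from the margin and compared with the $0-1$ loss. NumPy is assumed, and the function names `margin`, `hinge_loss`, and `zero_one_loss` as well as the sample scores are illustrative, not part of the original notes.

```python
import numpy as np

def margin(y_hat, y):
    """Functional margin m = y_hat * y for labels y in {-1, +1}."""
    return np.asarray(y_hat, dtype=float) * np.asarray(y, dtype=float)

def hinge_loss(m):
    """Hinge loss max(1 - m, 0), applied elementwise to margins m."""
    return np.maximum(1.0 - np.asarray(m, dtype=float), 0.0)

def zero_one_loss(m):
    """0-1 loss: 1 when the margin is non-positive, else 0."""
    return (np.asarray(m, dtype=float) <= 0).astype(float)

scores = np.array([2.0, 0.5, -0.3, -1.5])  # illustrative predicted scores
labels = np.array([1, 1, 1, -1])
m = margin(scores, labels)   # margins: 2.0, 0.5, -0.3, 1.5
losses = hinge_loss(m)       # hinge:   0.0, 0.5,  1.3, 0.0
```

The second example is classified correctly ($m = 0.5 > 0$) but still incurs a hinge loss of $0.5$: it is a margin error.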

# SVM  
- Simplest Case:  
  - Hypothesis space $\mathcal{F}=\left\{f(x)=w^{T} x+b \mid w \in \mathbf{R}^{d}, b \in \mathbf{R}\right\}$  
  - $l_2$ regularization (Tikhonov style)  
  - Loss: hinge loss, $\ell(\hat{y}, y) = \max(0, 1 - y\hat{y}) = (1 - y\hat{y})_+$  
  - The SVM prediction function is the solution to   
$$\min _{w \in \mathbf{R}^{d}, b \in \mathbf{R}} \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n} \max \left(0,1-y_{i}\left[w^{T} x_{i}+b\right]\right)$$  
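The objective above can be evaluated directly for any candidate $(w, b)$. The following is a sketch under the assumption that `X` is an $n \times d$ NumPy array and `y` holds labels in $\{-1, 1\}$; the name `svm_objective` and the toy data are illustrative.

```python
import numpy as np

def svm_objective(w, b, X, y, c):
    """0.5 * ||w||^2 + (c / n) * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n = X.shape[0]
    margins = y * (X @ w + b)               # y_i * f(x_i) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * float(w @ w) + (c / n) * float(hinge.sum())

# Toy data: the third point is on the wrong side of the boundary.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
value = svm_objective(w, 0.0, X, y, c=3.0)  # 0.5 + (3/3) * 1.5 = 2.0
```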
## SVM Optimization Problem (Tikhonov Version)
- unconstrained optimization  
- not differentiable  
- What can we do?  

## SVM Equivalent Form  
- Idea: because $\max$ is not differentiable, we introduce slack variables $\xi_i$ that move it out of the objective and into the constraints.  
$$\begin{array}{ll}
\operatorname{minimize} & \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n} \xi_{i} \\
\text { subject to } & \xi_{i} \geqslant \max \left(0,1-y_{i}\left[w^{T} x_{i}+b\right]\right)
\end{array}$$  
which is also equivalent to:  
$$\begin{aligned}
&\operatorname{minimize} \quad \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n} \xi_{i}\\
&\text { subject to } \quad \xi_{i} \geqslant\left(1-y_{i}\left[w^{T} x_{i}+b\right]\right) \text { for } i=1, \ldots, n\\
&\xi_{i} \geqslant 0 \text { for } i=1, \ldots, n
\end{aligned}$$  
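- Why equivalent?  
For fixed $(w, b)$ the objective is increasing in each $\xi_i$, so at the minimum every $\xi_i$ is as small as the constraints allow, which is exactly $\xi_i = \max \left(0,1-y_{i}\left[w^{T} x_{i}+b\right]\right)$. Substituting this value back recovers the original objective.  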
## SVM as a quadratic program  
- Differentiable objective function  
- n + d + 1 unknowns and 2n affine constraints  
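To make the count of unknowns and constraints concrete, here is a sketch that assembles the problem in the common form "minimize $\frac{1}{2} z^T P z + q^T z$ subject to $G z \leqslant h$". The stacking order $z = (w, b, \xi)$, the names `P`, `q`, `G`, `h`, and the function `svm_primal_qp` are assumptions made for illustration; no solver is called.

```python
import numpy as np

def svm_primal_qp(X, y, c):
    """Matrices for: minimize 0.5 z^T P z + q^T z  subject to  G z <= h,
    where z = (w, b, xi) is stacked into one vector of length d + 1 + n."""
    n, d = X.shape
    P = np.zeros((d + 1 + n, d + 1 + n))
    P[:d, :d] = np.eye(d)  # quadratic term 0.5 * ||w||^2
    q = np.concatenate([np.zeros(d + 1), np.full(n, c / n)])  # (c/n) * sum_i xi_i
    # Margin constraints rewritten as: -y_i (w^T x_i + b) - xi_i <= -1
    G_margin = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    # Non-negativity constraints rewritten as: -xi_i <= 0
    G_nonneg = np.hstack([np.zeros((n, d + 1)), -np.eye(n)])
    G = np.vstack([G_margin, G_nonneg])
    h = np.concatenate([-np.ones(n), np.zeros(n)])
    return P, q, G, h

X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
P, q, G, h = svm_primal_qp(X, y, c=3.0)
# n = 3, d = 2: z has n + d + 1 = 6 entries and G has 2n = 6 rows.
```

These matrices could be handed to a generic QP solver; which solver to use is left open here.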
## SVM dual problem  
Let's write the Lagrangian of this constrained problem  
$$L(w, b, \xi, \alpha, \lambda)=\frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n} \xi_{i}+\sum_{i=1}^{n} \alpha_{i}\left(1-y_{i}\left[w^{T} x_{i}+b\right]-\xi_{i}\right)+\sum_{i=1}^{n} \lambda_{i}\left(-\xi_{i}\right)$$  
- Reformulate it  
$$\begin{array}{|c|c|}
\hline \text { Lagrange Multiplier } & \text { Constraint } \\
\hline \hline \lambda_{i} & -\xi_{i} \leqslant 0 \\
\hline \alpha_{i} & \left(1-y_{i} f\left(x_{i}\right)\right)-\xi_{i} \leqslant 0 \\
\hline
\end{array}$$
$$\begin{aligned}
& L(w, b, \xi, \alpha, \lambda) \\
=& \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n} \xi_{i}+\sum_{i=1}^{n} \alpha_{i}\left(1-y_{i}\left[w^{T} x_{i}+b\right]-\xi_{i}\right)-\sum_{i=1}^{n} \lambda_{i} \xi_{i} \\
=& \frac{1}{2} w^{T} w+\sum_{i=1}^{n} \xi_{i}\left(\frac{c}{n}-\alpha_{i}-\lambda_{i}\right)+\sum_{i=1}^{n} \alpha_{i}\left(1-y_{i}\left[w^{T} x_{i}+b\right]\right)
\end{aligned}$$  
- Then we can write the primal and dual form  
$$\begin{aligned}
p^{*} &=\inf _{w, \xi, b} \sup _{\alpha, \lambda \succeq 0} L(w, b, \xi, \alpha, \lambda) \\
& \geqslant \sup _{\alpha, \lambda \succeq 0} \inf _{w, b, \xi} L(w, b, \xi, \alpha, \lambda)=d^{*}
\end{aligned}$$  

## Strong Duality by Slater’s constraint qualification
- Convex problem + affine constraints $\to$ strong duality if and only if the problem is feasible  
- Is this problem feasible?  
  - Yes: take $w = 0$, $b = 0$ and $\xi_i = 1$ for all $i = 1, \ldots, n$
  - So strong duality holds  

## Lagrangian dual function
$$\begin{aligned}
& g(\alpha, \lambda)=\inf _{w, b, \xi} L(w, b, \xi, \alpha, \lambda) \\
=& \inf _{w, b, \xi}\left[\frac{1}{2} w^{T} w+\sum_{i=1}^{n} \xi_{i}\left(\frac{c}{n}-\alpha_{i}-\lambda_{i}\right)+\sum_{i=1}^{n} \alpha_{i}\left(1-y_{i}\left[w^{T} x_{i}+b\right]\right)\right]
\end{aligned}$$  
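- $L$ is convex and differentiable in $(w, b, \xi)$, so whenever the infimum is finite it is characterized by the first-order conditions $\partial_{w} L=0, \partial_{b} L=0, \partial_{\xi} L=0$  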

## SVM Dual Function: First Order Conditions

$$\begin{array}{l}
\partial_{w} L=0 \quad \Longleftrightarrow \quad w-\sum_{i=1}^{n} \alpha_{i} y_{i} x_{i}=0 \\
\partial_{b} L=0 \quad \Longleftrightarrow \quad-\sum_{i=1}^{n} \alpha_{i} y_{i}=0 \quad \\
\partial_{\xi_{i}} L=0 \quad \Longleftrightarrow \quad \frac{c}{n}-\alpha_{i}-\lambda_{i}=0
\end{array}$$  
Plug these equations into $g(\alpha, \lambda)$. The $\xi_i$ terms vanish because $\frac{c}{n}-\alpha_i-\lambda_i=0$, and the $b$ term vanishes because $\sum_{i=1}^n \alpha_i y_i = 0$:  
$$\begin{aligned}
g(\alpha, \lambda)&=\inf _{w, b, \xi} L(w, b, \xi, \alpha, \lambda) \\
&= \frac{1}{2}w^Tw + 0 + \sum_{i=1}^n\alpha_i - \sum_{i=1}^n\alpha_iy_i\left(w^Tx_i+b\right) \\
&= \frac{1}{2}w^Tw + \sum_{i=1}^n\alpha_i - \sum_{i=1}^n\alpha_iy_iw^Tx_i - b\sum_{i=1}^n\alpha_iy_i\\
&= \frac{1}{2}w^Tw + \sum_{i=1}^n\alpha_i - w^T\sum_{i=1}^n\alpha_iy_ix_i\\
&= \sum_{i=1}^n\alpha_i - \frac{1}{2}w^Tw\\
&= \sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_jy_iy_jx_j^Tx_i
\end{aligned}$$  
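Eliminating $\lambda_i = \frac{c}{n} - \alpha_i \geqslant 0$ gives the box constraint $\alpha_i \leqslant \frac{c}{n}$, so the dual problem is  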
$$\begin{array}{ll}
\sup _{\alpha} \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{j}^{T} x_{i} \\
\text { s.t. }  \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \\
 \alpha_{i} \in\left[0, \frac{c}{n}\right] i=1, \ldots, n
\end{array}$$  

- Given the solution $\alpha^*$ to dual, the primal solution $w^*$ is $w^* = \sum_{i=1}^n\alpha_i^*y_ix_i$  
- **$w^*$ is in the span of the data**  
- Note $\alpha_i \in [0, \frac{c}{n}]$: $c$ controls the maximum weight on each example
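The dual objective, its constraints, and the recovery of $w^*$ can be sketched as follows. The function names and the two-point data set are assumptions chosen for illustration. On this toy problem $\alpha = (0.5, 0.5)$ maximizes the dual by hand calculation, since with $\alpha_1 = \alpha_2 = a$ the objective is $2a - 2a^2$.

```python
import numpy as np

def primal_w(alpha, X, y):
    """w = sum_i alpha_i y_i x_i."""
    return (alpha * y) @ X

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j x_j^T x_i."""
    w = primal_w(alpha, X, y)  # the double sum equals w^T w
    return float(alpha.sum() - 0.5 * (w @ w))

def is_dual_feasible(alpha, y, c, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= c / n."""
    n = len(y)
    in_box = np.all(alpha >= -tol) and np.all(alpha <= c / n + tol)
    return bool(abs(alpha @ y) <= tol and in_box)

X = np.array([[1.0, 0.0], [-1.0, 0.0]])  # one point per class
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

feasible = is_dual_feasible(alpha, y, c=2.0)  # box is [0, 1] here
w_star = primal_w(alpha, X, y)                # [1, 0]
d_value = dual_objective(alpha, X, y)         # 1 - 0.5 * 1 = 0.5
```

With $w = (1, 0)$ and $b = 0$ both margins equal $1$, so the primal objective is also $0.5$, matching the dual value as strong duality predicts.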

# Insights from complementary slackness  
## Support Vectors and Margin   
Define $f^*(x) = x^Tw^* + b^*$
- Incorrect classification: $yf^*(x) \leq 0$  
- Margin error: $yf^*(x) < 1$  
- On the margin: $yf^*(x) = 1$  
- Good side of the margin: $yf^*(x) > 1$  

Recall **slack variable** $\xi_{i}^{*}=\max \left(0,1-y_{i} f^{*}\left(x_{i}\right)\right)$ is the hinge loss on each $(x_i, y_i)$  
- Suppose $\xi^*_i = 0$  
- then $y_{i} f^{*}\left(x_{i}\right) \geqslant 1$  

## Complementary Slackness  
- Recall  
$$\begin{array}{|c|c|}
\hline \text { Lagrange Multiplier } & \text { Constraint } \\
\hline \hline \lambda_{i} & -\xi_{i} \leqslant 0 \\
\hline \alpha_{i} & \left(1-y_{i} f\left(x_{i}\right)\right)-\xi_{i} \leqslant 0 \\
\hline
\end{array}$$
- By strong duality, we must have  
$$\begin{array}{l}
\alpha_{i}^{*}\left(1-y_{i} f^{*}\left(x_{i}\right)-\xi_{i}^{*}\right)=0 \\
\lambda_{i}^{*} \xi_{i}^{*}=\left(\frac{c}{n}-\alpha_{i}^{*}\right) \xi_{i}^{*}=0
\end{array}$$  
- Discussion  
  - If $y_if^*(x_i) > 1$, then $\xi^*_i = 0$, and $\alpha^*_i = 0$  
  - If $y_if^*(x_i) < 1$, then $\xi^*_i > 0$, and $\alpha^*_i = \frac{c}{n}$  
  - If $\alpha^*_i = 0$, then $\xi^*_i = 0$, and then $y_if^*(x_i) \geqslant 1$  
  - If $\alpha_{i}^{*} \in\left(0, \frac{c}{n}\right)$, then $\xi^*_i = 0$, which implies $1-y_{i} f^{*}\left(x_{i}\right)=0$  
- Summary  
$$\begin{aligned}
\alpha_{i}^{*}=0 & \Longrightarrow y_{i} f^{*}\left(x_{i}\right) \geqslant 1 \\
\alpha_{i}^{*} \in\left(0, \frac{c}{n}\right) & \Longrightarrow y_{i} f^{*}\left(x_{i}\right)=1 \\
\alpha_{i}^{*}=\frac{c}{n} & \Longrightarrow y_{i} f^{*}\left(x_{i}\right) \leqslant 1 \\
y_{i} f^{*}\left(x_{i}\right)<1 & \Longrightarrow \alpha_{i}^{*}=\frac{c}{n} \\
y_{i} f^{*}\left(x_{i}\right)=1 & \Longrightarrow \alpha_{i}^{*} \in\left[0, \frac{c}{n}\right] \\
y_{i} f^{*}\left(x_{i}\right)>1 & \Longrightarrow \alpha_{i}^{*}=0
\end{aligned}$$
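The first three rows of the summary can be read as a lookup from $\alpha_i^*$ to a condition on $y_i f^*(x_i)$. Below is a sketch of that lookup; the function name `margin_implication`, the tolerance `tol`, and the sample values are assumptions, with the tolerance standing in for the numerical error a real solver would introduce.

```python
import numpy as np

def margin_implication(alpha, c, tol=1e-8):
    """For each alpha_i, return the condition on y_i f*(x_i) implied by
    complementary slackness."""
    n = len(alpha)
    upper = c / n
    out = []
    for a in alpha:
        if a <= tol:
            out.append(">= 1")  # alpha_i = 0
        elif a >= upper - tol:
            out.append("<= 1")  # alpha_i = c/n
        else:
            out.append("== 1")  # 0 < alpha_i < c/n
    return out

alpha = np.array([0.0, 0.2, 0.5])  # illustrative values, with c/n = 0.5
implied = margin_implication(alpha, c=1.5)
# implied is [">= 1", "== 1", "<= 1"]
```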

# Support Vectors  
- If $\alpha^*$ is a solution to the dual problem, then the primal solution is  
$$w^{*}=\sum_{i=1}^{n} \alpha_{i}^{*} y_{i} x_{i}$$  
with $\alpha_{i}^{*} \in\left[0, \frac{c}{n}\right]$  
- The $x_i$'s corresponding to $\alpha^*_i > 0$ are called support vectors  
- Few margin errors or points on the margin $\to$ few support vectors, i.e. sparsity in the examples  
- As discussed previously, if $y_if^*(x_i) > 1$, meaning $(x_i, y_i)$ lies on the good side of the margin, then $\alpha^*_i = 0$ and the example contributes nothing to $w^*$
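Given a dual solution, the support vectors are simply the examples with nonzero $\alpha_i^*$. Here is a small sketch, where the function name `support_vector_indices`, the tolerance, and the sample vector are illustrative assumptions:

```python
import numpy as np

def support_vector_indices(alpha, tol=1e-8):
    """Indices i with alpha_i > 0, up to a numerical tolerance."""
    return np.flatnonzero(np.asarray(alpha, dtype=float) > tol)

alpha = np.array([0.0, 0.3, 0.0, 0.5, 1e-12])  # illustrative dual solution
sv = support_vector_indices(alpha)             # indices 1 and 3
```

Only the rows of the data indexed by `sv` enter the sum defining $w^*$.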

# Complementary Slackness to get $b^*$  
## The bias term: b 
- We can write the complementary slackness in this form  
$$\begin{aligned}
\alpha_{i}^{*}\left(1-y_{i}\left[x_{i}^{T} w^{*}+b^{*}\right]-\xi_{i}^{*}\right) &=0 \\
\lambda_{i}^{*} \xi_{i}^{*}=\left(\frac{c}{n}-\alpha_{i}^{*}\right) \xi_{i}^{*} &=0
\end{aligned}$$  
- Suppose there is an $i$ with $\alpha^*_i \in (0,\frac{c}{n})$
- Then $\xi_i^* = 0$ and $y_if^*(x_i) = 1$, so  
$$\begin{aligned}
& y_{i}\left[x_{i}^{T} w^{*}+b^{*}\right]=1 \\
\Longleftrightarrow & x_{i}^{T} w^{*}+b^{*}=y_{i}\left(\text { use } y_{i} \in\{-1,1\}\right) \\
\Longleftrightarrow & b^{*}=y_{i}-x_{i}^{T} w^{*}
\end{aligned}$$  
- The optimal b is  
$$b^{*}=y_{i}-x_{i}^{T} w^{*}$$  
- We get the same $b^*$ for any choice of $i$ with $\alpha^*_i \in (0,\frac{c}{n})$  
  - So $b^*$ can be computed from a single eligible $(x_i,y_i)$ after obtaining the optimal $w^*$
- With numerical error, it is more robust to average over all eligible $i$:  
$$b^{*}=\operatorname{mean}\left\{y_{i}-x_{i}^{T} w^{*} \mid \alpha_{i}^{*} \in\left(0, \frac{c}{n}\right)\right\}$$  
- What if there is no eligible $i$?  
  - Then we have a degenerate SVM training problem ($w^* = 0$)
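The averaging rule for $b^*$ can be sketched as below. The function name `bias_from_dual`, the tolerance used to decide whether $\alpha_i^*$ lies strictly inside $(0, \frac{c}{n})$, and the toy data are assumptions. For this data set $\alpha = (0.5, 0.5)$ is the dual optimum, verified by hand.

```python
import numpy as np

def bias_from_dual(alpha, X, y, c, tol=1e-8):
    """b* = mean of y_i - x_i^T w* over all i with 0 < alpha_i < c/n."""
    n = len(y)
    w = (alpha * y) @ X  # w* = sum_i alpha_i y_i x_i
    eligible = (alpha > tol) & (alpha < c / n - tol)
    if not np.any(eligible):
        raise ValueError("no alpha_i strictly inside (0, c/n)")
    return float(np.mean(y[eligible] - X[eligible] @ w))

# Two points on the first axis; the separating boundary sits at x_1 = 2.
X = np.array([[3.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])  # dual optimum for this toy problem

b_star = bias_from_dual(alpha, X, y, c=2.0)  # w* = [1, 0], so b* = -2
```

Both eligible examples give $y_i - x_i^T w^* = -2$, as expected from the claim that every eligible $i$ yields the same $b^*$.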

# Teaser for Kernelization  
- So far we have not discussed how to compute the optimal $\alpha_i^*$. We assumed an optimal solution is given and used complementary slackness to analyze the different cases.  
- SVM dual problem  
$$\begin{array}{ll}
\sup _{\alpha} \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{j}^{T} x_{i} \\
\text { s.t. }  \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \\
 \alpha_{i} \in\left[0, \frac{c}{n}\right] i=1, \ldots, n
\end{array}$$  
Note that all dependence on the inputs $x_i$ and $x_j$ is through their inner product.  
We can therefore replace $x^T_j x_i$ by any other inner product.
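Because the dual only touches the inputs through $x_j^T x_i$, it can be written in terms of a Gram matrix $K$ with $K_{ij} = \langle x_i, x_j \rangle$. The sketch below evaluates the dual objective from $K$ alone. The function names are illustrative, and the RBF kernel is included only as a standard example of an alternative inner product; the notes above do not name a specific kernel.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of ordinary inner products a_i^T b_j."""
    return A @ B.T

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix exp(-gamma * ||a_i - b_j||^2)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * (A @ B.T))
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from round-off

def kernel_dual_objective(alpha, y, K):
    """sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j K_ij."""
    ay = alpha * y
    return float(alpha.sum() - 0.5 * (ay @ K @ ay))

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

d_linear = kernel_dual_objective(alpha, y, linear_kernel(X, X))  # 0.5
d_rbf = kernel_dual_objective(alpha, y, rbf_kernel(X, X))
```

Swapping `linear_kernel` for `rbf_kernel` changes only the matrix `K`; the objective and constraints keep the same form.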