# SVM Optimization Problem (no intercept)  
- SVM objective function  
$$J(w)=\frac{1}{n} \sum_{i=1}^{n} \max \left(0,1-y_{i}\left[w^{T} x_{i}\right]\right)+\lambda\|w\|^{2}$$  
- Not differentiable, but can we work with subgradients instead?  
- Derivative of hinge loss, $\ell(m) = \max(0, 1 - m)$  
$$\ell^{\prime}(m)=\left\{\begin{array}{ll}
0 & m>1 \\
-1 & m<1 \\
\text { undefined } & m=1
\end{array}\right.$$  
- We need the gradient with respect to the parameter vector $w \in R^d$  
$$\begin{aligned}
\nabla_{w} \ell\left(y_{i} w^{T} x_{i}\right) &=\ell^{\prime}\left(y_{i} w^{T} x_{i}\right) y_{i} x_{i}(\text { chain rule }) \\
&=\left(\begin{array}{ll}
0 & y_{i} w^{T} x_{i}>1 \\
-1 & y_{i} w^{T} x_{i}<1 \\
\text { undefined } & y_{i} w^{T} x_{i}=1
\end{array}\right) y_{i} x_{i}\left(\text { expanded } m \text { in } \ell^{\prime}(m)\right) \\
&=\left\{\begin{array}{ll}
0 & y_{i} w^{T} x_{i}>1 \\
-y_{i} x_{i} & y_{i} w^{T} x_{i}<1 \\
\text { undefined } & y_{i} w^{T} x_{i}=1
\end{array}\right.
\end{aligned}$$  
thus  
$$\nabla_{w} \ell\left(y_{i} w^{T} x_{i}\right)=\left\{\begin{array}{ll}
0 & y_{i} w^{T} x_{i}>1 \\
-y_{i} x_{i} & y_{i} w^{T} x_{i}<1 \\
\text { undefined } & y_{i} w^{T} x_{i}=1
\end{array}\right.$$  
so  
$$\begin{aligned}
\nabla_{w} J(w) &=\nabla_{w}\left(\frac{1}{n} \sum_{i=1}^{n} \ell\left(y_{i} w^{T} x_{i}\right)+\lambda\|w\|^{2}\right) \\
&=\frac{1}{n} \sum_{i=1}^{n} \nabla_{w} \ell\left(y_{i} w^{T} x_{i}\right)+2 \lambda w \\
&=\left\{\begin{array}{ll}
\frac{1}{n} \sum_{i: y_{i} w^{T} x_{i}<1}\left(-y_{i} x_{i}\right)+2 \lambda w & \text { all } y_{i} w^{T} x_{i} \neq 1 \\
\text { undefined } & \text { otherwise }
\end{array}\right.
\end{aligned}$$  
- Where it exists, the gradient of the SVM objective is  
$$\nabla_{w} J(w)=\frac{1}{n} \sum_{i: y_{i} w^{T} x_{i}<1}\left(-y_{i} x_{i}\right)+2 \lambda w$$  
This holds when $y_iw^Tx_i \ne 1$ for all $i$; otherwise the gradient is undefined  
- Suppose we want to use gradient descent on $J(w)$:  
   - If we start at a random $w$, will we ever hit a point where $y_iw^Tx_i = 1$?  
   - If we did, could we perturb the step size by $\epsilon$ to miss such a point?  
   - If we blindly apply gradient descent from a random starting point, it seems unlikely that we’ll hit a point where the gradient is undefined  
   - Still, that doesn’t mean gradient descent will work if the objective is not differentiable  
   - The theory of subgradients and subgradient descent will clear up any uncertainty  
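
The piecewise formula above translates fairly directly into NumPy. The following is a minimal sketch, not a reference implementation: the names `svm_objective`, `svm_subgradient`, `X`, `y`, and `lam` are hypothetical, and at points with $y_i w^T x_i = 1$, where the gradient is undefined, the sketch assumes the hinge term contributes zero (which is one valid subgradient choice there).

```python
import numpy as np


def svm_objective(w, X, y, lam):
    """J(w) = mean hinge loss + lam * ||w||^2 (no intercept term)."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * np.dot(w, w)


def svm_subgradient(w, X, y, lam):
    """Gradient of J where it exists; one valid subgradient elsewhere.

    Assumption: examples with margin exactly 1 contribute zero.
    """
    margins = y * (X @ w)
    active = margins < 1.0  # examples with y_i w^T x_i < 1
    hinge_part = -(X[active].T @ y[active]) / len(y)
    return hinge_part + 2.0 * lam * w


# Tiny illustrative data set (made up for this example).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)
print(svm_objective(w, X, y, lam=0.5))
print(svm_subgradient(w, X, y, lam=0.5))
```

At $w = 0$ both examples have margin 0, so both fall in the sum over $y_i w^T x_i < 1$.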



# Convexity and sublevel sets

## Level Sets and Sublevel Sets
Let $f: R^d \to R$ be a function  
- Definition  
A **level set** or **contour line** for the value $c$ is the set of points $x \in R^d$ for which $f(x) = c$  
- Definition  
A **sublevel set** for the value $c$ is the set of points $x \in R^d$ for which $f(x) \leqslant c$  
- Theorem  
If $f: R^d \to R$ is convex, then the sublevel sets are convex  
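
A short argument for the theorem, using only the definition of convexity: take any $x, y$ in the sublevel set for the value $c$ and any $\theta \in [0,1]$. Then  
$$f(\theta x+(1-\theta) y) \leqslant \theta f(x)+(1-\theta) f(y) \leqslant \theta c+(1-\theta) c=c$$  
so $\theta x + (1-\theta) y$ also lies in the sublevel set.  
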
## Convex Optimization Problem: Implicit Form
- Convex Optimization Problem: Implicit Form  
$$\begin{array}{ll}
\text { minimize } & f(x) \\
\text { subject to } & x \in C
\end{array}$$  
where $f$ is a convex function and $C$ is a convex set  


# Convex and Differentiable Function  
First-Order Approximation  
- Suppose $f:R^d \to R$ is differentiable  
- Predict $f(y)$ given $f(x)$ and $\nabla f(x)$  
- Linear Approximation  
$$f(y) \approx f(x)+\nabla f(x)^{T}(y-x)$$  
<div align="center"><img src = "./linearApprox.jpg" width = '500' height = '100' align = center /></div>  
- Suppose $f:R^d \to R$ is convex and differentiable; then for any $x,y$  
$$f(y) \geq f(x)+\nabla f(x)^{T}(y-x)$$  
- That is, the linear approximation to $f$ at $x$ is a **global underestimator** of $f$  
- Corollary  
If $f$ is convex and differentiable and $\nabla f(x) = 0$, then $x$ is a global minimizer of $f$, since the inequality above reduces to $f(y) \geq f(x)$ for every $y$  


# Subgradients 
- Definition  
A vector $g \in R^d$ is a subgradient of $f:R^d \to R$ at $x$ if for all $z$,  
$$f(z) \geq f(x) + g^T(z - x)$$  
<div align="center"><img src = "./subgradients.jpg" width = '500' height = '100' align = center /></div>   
- Blue is the graph of $f(x)$, each red line is a global lower bound of $f(x)$  

## Subdifferential  
- Definition  
   - $f$ is subdifferentiable at $x$ if there exists at least one subgradient at $x$.  
   - The set of all subgradients at $x$ is called the subdifferential: $\partial f(x)$  
- Basic Facts  
   - $f$ is convex and differentiable $\Rightarrow$ $\partial f(x) = \{\nabla f(x) \}$  
   - At any point, there can be 0, 1, or infinitely many subgradients  
   - If $\partial f(x) = \emptyset$ then $f$ is not convex  

## Global Optimality Condition
- Corollary  
If $0 \in \partial f(x)$, then $x$ is a global minimizer of $f$  
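
This follows in one line from the definition of a subgradient. Taking $g = 0$ in $f(z) \geq f(x) + g^T(z - x)$ gives, for all $z$,  
$$f(z) \geq f(x)+0^{T}(z-x)=f(x)$$  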

## Subdifferential of Absolute Value
- Consider $f(x) = |x|$  
<div align="center"><img src = "./abs.jpg" width = '500' height = '100' align = center /></div>  
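
Written out case by case, the subdifferential of $f(x) = |x|$ is the standard result below (stated here for reference). The only point with more than one subgradient is the kink at $x = 0$.  
$$\partial f(x)=\left\{\begin{array}{ll}
\{-1\} & x<0 \\
{[-1,1]} & x=0 \\
\{1\} & x>0
\end{array}\right.$$  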

## $f(x_1, x_2) = |x_1| + 2|x_2|$  
<div align="center"><img src = "./f(x1,x2).jpg" width = '500' height = '100' align = center /></div>   

- Let’s find the subdifferential of $f(x_1, x_2) = |x_1| + 2|x_2|$ at $(3,0)$  
- The first coordinate of a subgradient must be 1, from the $|x_1|$ part, which is differentiable at $x_1 = 3$  
- The second coordinate of a subgradient can be anything in $[-2,2]$, because $2|x_2|$ has a kink at $x_2 = 0$  
- So the graph of $h(x_1, x_2) = f(3, 0) + g^T (x_1 - 3 , x_2)$ is a global underestimator of $f(x_1, x_2)$, for any $g = (g_1,g_2)$, where $g_1 = 1, g_2 \in [-2,2]$  
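
The claim can be checked numerically. The sketch below samples random points and tests the subgradient inequality $f(z) \geq f(x_0) + g^T(z - x_0)$ at $x_0 = (3, 0)$; the helper name `is_underestimator`, the sampling box, and the tolerance are assumptions made for illustration, and a sampled check can only refute a candidate $g$, not prove it valid.

```python
import numpy as np


def f(x):
    """f(x1, x2) = |x1| + 2|x2|, for a single point or an (N, 2) array."""
    return np.abs(x[..., 0]) + 2.0 * np.abs(x[..., 1])


def is_underestimator(g, x0, points, tol=1e-12):
    """Check f(z) >= f(x0) + g^T (z - x0) at every sampled point z."""
    lower_bound = f(x0) + (points - x0) @ g
    return bool(np.all(f(points) >= lower_bound - tol))


rng = np.random.default_rng(0)
points = rng.uniform(-10.0, 10.0, size=(10_000, 2))
x0 = np.array([3.0, 0.0])

# g1 = 1 and g2 in [-2, 2]: the inequality should hold everywhere.
valid = [is_underestimator(np.array([1.0, g2]), x0, points)
         for g2 in (-2.0, 0.0, 2.0)]
# g2 outside [-2, 2], or g1 != 1: some sampled point should violate it.
too_steep = is_underestimator(np.array([1.0, 2.5]), x0, points)
wrong_first = is_underestimator(np.array([0.5, 0.0]), x0, points)

print(valid, too_steep, wrong_first)
```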

### Underestimation Hyperplane  
<div align="center"><img src = "./hyperplane.jpg" width = '500' height = '100' align = center /></div>  

## Subdifferential on Contour Plot  
<div align="center"><img src = "./contour.jpg" width = '500' height = '100' align = center /></div>   

## Contour Lines and Gradients  
- For a function $f: R^d \to R$,  
  - The graph of the function lives in $R^{d + 1}$  
  - Gradients and subgradients of $f$ live in $R^d$  
  - Contours and level sets live in $R^d$  
- If $f: R^d \to R$ is continuously differentiable and $\nabla f(x_0) \ne 0$, then $\nabla f(x_0)$ is normal to the level set through $x_0$  

## Contour Lines and Subgradients
- Let $f : R^d \to R$ have a subgradient $g$ at $x_0$  
   - The hyperplane $H$ orthogonal to $g$ at $x_0$ must support the level set $S=\left\{x \in \mathbf{R}^{d} \mid f(x)=f\left(x_{0}\right)\right\}$  
   - i.e. $H$ contains $x_0$ and all of $S$ lies on one side of $H$  
<div align="center"><img src = "./subgradient_contour.jpg" width = '500' height = '100' align = center /></div>   

- Points on $g$ side of $H$ have larger f-values than $f(x_0)$  
- But points on $-g$ side may not have smaller f-values  
- So $-g$ may not be a descent direction

# Subgradient Descent  
- Suppose $f$ is convex, and we start optimizing at $x_0$  
- Repeat  
  - Step in a negative subgradient direction  
  - $x = x_0 - tg$, where $t > 0$ is the step size and $g \in \partial f(x_0)$  
- But $-g$ may not be a descent direction. Can this work?
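
The loop above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name `subgradient_descent`, the step-size schedule $0.5/k$, and the starting point are all choices made here. Because a step may increase $f$, the sketch keeps track of the best iterate seen so far.

```python
import numpy as np


def subgradient_descent(f, subgradient, x0, step_size, n_iters):
    """Repeat x <- x - t_k * g with g a subgradient at x; return the best iterate."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for k in range(1, n_iters + 1):
        g = subgradient(x)
        x = x - step_size(k) * g
        fx = f(x)
        if fx < best_f:  # a subgradient step is not guaranteed to decrease f
            best_x, best_f = x.copy(), fx
    return best_x, best_f


def f(x):
    return abs(x[0]) + 2.0 * abs(x[1])


def subgradient(x):
    # np.sign returns 0 at a kink, and 0 lies in the subdifferential there.
    return np.array([1.0, 2.0]) * np.sign(x)


x0 = np.array([3.0, 1.0])
best_x, best_f = subgradient_descent(f, subgradient, x0,
                                     step_size=lambda k: 0.5 / k,
                                     n_iters=1000)
print(best_x, best_f)
```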

## Subgradient Gets Us Closer To Minimizer
- Theorem  
Suppose $f$ is convex:  
  - Let $x = x_0 - tg$, where $g \in \partial f(x_0)$  
  - Let $z$ be any point for which $f(z) < f(x_0)$  
  - Then for small enough $t > 0$  
$$\|x-z\|_{2}<\left\|x_{0}-z\right\|_{2}$$
  - Apply this with $z=x^{*} \in \arg \min _{x} f(x)$  
  
Negative subgradient step gets us closer to minimizer  
**proof**  
$$\begin{aligned}
\|x-z\|_{2}^{2} &=\left\|x_{0}-t g-z\right\|_{2}^{2} \\
&=\left\|x_{0}-z\right\|_{2}^{2}-2 t g^{T}\left(x_{0}-z\right)+t^{2}\|g\|_{2}^{2} \\
& \leqslant\left\|x_{0}-z\right\|_{2}^{2}-2 t\left[f\left(x_{0}\right)-f(z)\right]+t^{2}\|g\|_{2}^{2}
\end{aligned}$$  
The inequality uses the definition of a subgradient: $f(z) \geq f(x_0) + g^T(z - x_0)$ implies $g^T(x_0 - z) \geq f(x_0) - f(z)$.  
Consider $-2 t\left[f\left(x_{0}\right)-f(z)\right]+t^{2}\|g\|_{2}^{2}$ as a function of $t$: it is a convex quadratic  
with zeros at $t = 0$ and $t=2\left(f\left(x_{0}\right)-f(z)\right) /\|g\|_{2}^{2}>0$.  
Therefore it is negative for any  
$$t \in\left(0, \frac{2\left(f\left(x_{0}\right)-f(z)\right)}{\|g\|_{2}^{2}}\right)$$  
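
A small numeric illustration of the theorem, reusing $f(x_1, x_2) = |x_1| + 2|x_2|$ at $x_0 = (3, 0)$. The subgradient $g = (1, 2)$ and the step sizes tried are choices made for this sketch. For each step size inside the interval, the step increases $f$ (so $-g$ is not a descent direction here) yet still moves closer to the minimizer $z = (0, 0)$.

```python
import numpy as np


def f(x):
    return abs(x[0]) + 2.0 * abs(x[1])


x0 = np.array([3.0, 0.0])
g = np.array([1.0, 2.0])   # one subgradient at (3, 0): g1 = 1, g2 in [-2, 2]
z = np.array([0.0, 0.0])   # global minimizer, so f(z) = 0 < f(x0) = 3

# Upper end of the step-size interval from the proof.
t_max = 2.0 * (f(x0) - f(z)) / np.dot(g, g)

results = []
for t in (0.1, 0.5, 1.0):  # all inside (0, t_max)
    x = x0 - t * g
    dist = np.linalg.norm(x - z)
    results.append((t, f(x), dist))
    print(f"t={t}: f(x)={f(x):.3f}, distance to z={dist:.3f}")
```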

## Convergence Theorem for Fixed Step Size
- Assume $f:R^d \to R$ is convex and 
  - $f$ is Lipschitz continuous with constant $G > 0$:  
  $$|f(x)-f(y)| \leqslant G\|x-y\| \text { for all } x, y$$  
- Theorem  
For a fixed step size $t$, the subgradient method satisfies  
$$\lim _{k \rightarrow \infty} f\left(x_{\text {best}}^{(k)}\right) \leqslant f\left(x^{*}\right)+G^{2} t / 2$$  
where $x_{\text {best}}^{(k)}$ is the iterate with the lowest objective value among the first $k$ steps  

## Convergence Theorems for Decreasing Step Sizes
- Assume $f:R^d \to R$ is convex and 
  - $f$ is Lipschitz continuous with constant $G > 0$:  
  $$|f(x)-f(y)| \leqslant G\|x-y\| \text { for all } x, y$$  
- Theorem  
For step sizes $t_k$ respecting the Robbins-Monro conditions ($\sum_k t_k = \infty$ and $\sum_k t_k^2 < \infty$)  
$$\lim _{k \rightarrow \infty} f\left(x_{\text {best}}^{(k)}\right)=f\left(x^{*}\right)$$
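
The difference between the two theorems can be seen on the simplest nonsmooth example, $f(x) = |x|$, which is Lipschitz with $G = 1$ and has $f(x^*) = 0$. The sketch below is illustrative only: the starting point, the fixed step $t = 0.3$, and the decreasing schedule $t_k = 0.5/k$ are assumptions chosen for the demo. With a fixed step the iterates end up bouncing around the minimizer, while the decreasing schedule keeps improving.

```python
def abs_subgradient(x):
    """Return one subgradient of f(x) = |x| (0 is chosen at the kink x = 0)."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def best_value(x0, step_size, n_iters):
    """Run subgradient descent on |x| and return the best objective value seen."""
    x = x0
    best = abs(x)
    for k in range(1, n_iters + 1):
        x = x - step_size(k) * abs_subgradient(x)
        best = min(best, abs(x))
    return best


G = 1.0   # Lipschitz constant of |x|
t = 0.3   # fixed step size
n_iters = 200

best_fixed = best_value(1.0, lambda k: t, n_iters)
best_decreasing = best_value(1.0, lambda k: 0.5 / k, n_iters)

print("fixed step:     ", best_fixed, "(asymptotic bound G^2 t / 2 =", G**2 * t / 2, ")")
print("decreasing step:", best_decreasing)
```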
