#### Subgradient method

The subgradient method minimizes a nondifferentiable convex function $f$ via the iteration

$$x^{k+1}=x^k-\alpha_k g^k, \,\, \alpha_k>0$$

* $g^k$ is `any` subgradient of $f$ at $x^k$
* it is `not` a descent method ($f(x^{k+1})$ can exceed $f(x^k)$), so we keep track of the best point found so far
$$f_{\text{best}}^k=\min_{i=1,\cdots, k}f(x^i)$$
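
A minimal Python sketch of the iteration and the best-point bookkeeping. The objective (the $\ell_1$ norm), the sign-based subgradient, and the names `subgradient_method` and `step` are illustrative assumptions, not something fixed by these notes.

```python
def f(x):
    """Illustrative objective: the l1 norm, convex but not differentiable
    wherever a coordinate is 0."""
    return sum(abs(v) for v in x)


def subgradient(x):
    """One valid subgradient of the l1 norm: sign(x_i), using 0 where x_i == 0."""
    return [float((v > 0) - (v < 0)) for v in x]


def subgradient_method(x1, step, num_iters):
    """Run x^{k+1} = x^k - alpha_k * g^k for k = 1..num_iters.

    `step(k)` returns alpha_k > 0. Returns the final iterate and a list whose
    entry j is the best objective value over x^1, ..., x^{j+1}.
    """
    x = list(x1)
    f_best = f(x)
    best_history = [f_best]
    for k in range(1, num_iters + 1):
        g = subgradient(x)
        alpha = step(k)
        x = [v - alpha * gv for v, gv in zip(x, g)]
        # f(x) may go up on this step, so only the running minimum is kept.
        f_best = min(f_best, f(x))
        best_history.append(f_best)
    return x, best_history


x_final, best_history = subgradient_method([1.0, -2.0], lambda k: 0.3, 20)
print(best_history[0], best_history[-1])
```

With a constant step the iterates end up oscillating around the minimizer, which is why the running minimum, rather than the last iterate, is reported.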

Step sizes are fixed ahead of time, rather than chosen by a line search as in gradient methods. Common choices are

* constant step size: $\alpha_k=\alpha$
* constant step length: $\alpha_k=\gamma/\|g^k\|_2$, so $\|x^{k+1}-x^k\|_2=\gamma$
* `square summable but not summable`
$$\sum_{k=1}^{\infty}\alpha_k^2<\infty,\,\sum_{k=1}^{\infty}\alpha_k=\infty$$
* `nonsummable diminishing` (e.g., $\alpha_k=c/\sqrt{k}$, which is not square summable)
$$\lim_{k\rightarrow \infty}\alpha_k=0,\,\sum_{k=1}^{\infty}\alpha_k=\infty$$
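
The four rules can be sketched as small Python factories. The function names and the specific sequences $a/k$ and $c/\sqrt{k}$ are assumptions chosen for illustration; any sequence satisfying the stated conditions would do. Each rule takes the iteration index `k` and the subgradient norm `g_norm`, although only the constant step length uses the norm.

```python
import math


def constant_step_size(alpha):
    """alpha_k = alpha."""
    return lambda k, g_norm: alpha


def constant_step_length(gamma):
    """alpha_k = gamma / ||g^k||_2, so each move has length gamma (needs g^k != 0)."""
    return lambda k, g_norm: gamma / g_norm


def square_summable(a):
    """alpha_k = a / k: the sum of squares is finite, the sum diverges."""
    return lambda k, g_norm: a / k


def nonsummable_diminishing(c):
    """alpha_k = c / sqrt(k): tends to 0, the sum diverges."""
    return lambda k, g_norm: c / math.sqrt(k)


def partial_sums(rule, num_terms, g_norm=1.0):
    """Return (sum of alpha_k, sum of alpha_k^2) over k = 1..num_terms."""
    steps = [rule(k, g_norm) for k in range(1, num_terms + 1)]
    return sum(steps), sum(s * s for s in steps)


# The first entry keeps growing with n, the second stays below pi^2 / 6.
for n in (10, 1000, 100000):
    print(n, partial_sums(square_summable(1.0), n))
```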

#### Convergence analysis

We want to bound

$$f_{\text{best}}^k-f(x^*)$$

for various choices of step size, where $x^*$ solves the unconstrained problem $\min_x f(x)$ and $f^*=f(x^*)$

Assumptions

* $f$ is convex, not necessarily smooth, not necessarily strongly convex
* subgradients are `uniformly bounded`: for some $G>0$,
$$\|g\|_2 \leq G \quad \forall x,\ \forall g\in \partial f(x)$$
* the initial point satisfies $\|x^1-x^*\|_2\leq R$

For subgradient steps

$$x^{k+1}=x^k-\alpha_k g^k, \, g^k\in \partial f(x^k)$$

we start by expanding the squared distance from the new iterate $x^{k+1}$ to $x^*$

$$\begin{align*}
\|x^{k+1}-x^*\|_2^2 &= \|x^k-\alpha_k g^k-x^*\|_2^2 \\
& = \|x^k-x^*\|_2^2 -2\alpha_k \langle g^k, x^k-x^*\rangle +\alpha_k^2\|g^k\|_2^2 \\
& \left(f^*\geq f(x^k) + \langle g^k , x^*-x^k \rangle \Longrightarrow -\langle g^k , x^k-x^* \rangle \leq f^*-f(x^k)\right) \\
& \leq \|x^k-x^*\|_2^2 -2\alpha_k \left(f(x^k)-f^*\right) +\alpha_k^2\|g^k\|_2^2 \\
& \text{this shows that for small enough } \alpha_k \text{ the distance to } x^* \text{ decreases, even if } f \text{ does not} \\
& \text{apply the inequality recursively} \\
&\leq \|x^1-x^*\|_2^2-2\sum_{i=1}^k\alpha_i\left(f(x^i)-f^*\right)+\sum_{i=1}^k\alpha_i^2\|g^i\|_2^2 \\
& \text{rearrange using }\|x^{k+1}-x^*\|_2^2\geq 0,\, \|x^1-x^*\|_2\leq R \\
2\sum_{i=1}^k\alpha_i\left(f(x^i)-f^*\right) &\leq R^2+\sum_{i=1}^k\alpha_i^2\|g^i\|_2^2 \\
& \text{since }\sum_{i=1}^k\alpha_i\left(f(x^i)-f^*\right)\geq\left(\sum_{i=1}^k \alpha_i\right)(f_{\text{best}}^k-f^*)\\
f_{\text{best}}^k-f^* &\leq \left(R^2+\sum_{i=1}^k\alpha_i^2\|g^i\|_2^2\right)/\left(2\sum_{i=1}^k \alpha_i\right)\\
& \text{use }\|g^i\|_2 \leq G \text{ for every } i \\
&\leq \boxed{\left(R^2+G^2\sum_{i=1}^k\alpha_i^2\right)/\left(2\sum_{i=1}^k \alpha_i\right)}
\end{align*}$$
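
The boxed bound can be checked numerically. The sketch below assumes the illustrative objective $f(x)=\|x\|_1$, for which $x^*=0$, $f^*=0$, the sign vector is a subgradient, and $G=\sqrt{n}$ is a valid bound on its norm; the function names are hypothetical. It records the gap $f_{\text{best}}^k-f^*$ next to the bound at every $k$.

```python
import math


def l1(x):
    return sum(abs(v) for v in x)


def sign_subgradient(x):
    """A subgradient of the l1 norm; its 2-norm is at most sqrt(len(x))."""
    return [float((v > 0) - (v < 0)) for v in x]


def check_bound(x1, step, num_iters):
    """Return rows (k, f_best^k - f*, bound_k) for k = 1..num_iters."""
    G = math.sqrt(len(x1))
    R = math.sqrt(sum(v * v for v in x1))  # x* = 0, so R = ||x^1||_2
    f_star = 0.0
    x = list(x1)
    f_best = float("inf")
    sum_alpha = 0.0
    sum_alpha_sq = 0.0
    rows = []
    for k in range(1, num_iters + 1):
        # Here x holds x^k, so f_best becomes min over x^1..x^k.
        f_best = min(f_best, l1(x))
        alpha = step(k)
        sum_alpha += alpha
        sum_alpha_sq += alpha ** 2
        bound = (R ** 2 + G ** 2 * sum_alpha_sq) / (2 * sum_alpha)
        rows.append((k, f_best - f_star, bound))
        g = sign_subgradient(x)
        x = [v - alpha * gv for v, gv in zip(x, g)]
    return rows


rows = check_bound([3.0, -1.5, 2.0], lambda k: 1.0 / k, 200)
for k, gap, bound in rows[::50]:
    print(k, gap, bound)
```

The bound is typically loose in practice; the point of the check is only that the gap never exceeds it.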

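Letting $k\to\infty$ in this bound gives the limiting accuracy of $f_{\text{best}}^k-f^*$ for each step-size rule

* constant step size $\alpha_k=\alpha$: the bound tends to $G^2\alpha/2$, so $f_{\text{best}}^k$ gets within $G^2\alpha/2$ of $f^*$ in the limit
* constant step length $\alpha_k=\gamma/\|g^k\|_2$: the bound tends to $G\gamma /2$
* square summable but not summable: the bound tends to zero, so $f_{\text{best}}^k\to f^*$
* nonsummable diminishing: the bound tends to zero, so $f_{\text{best}}^k\to f^*$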
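
A short sketch of why each limit holds. The constant step length case uses the bound from the step just before $\|g^i\|_2\leq G$ is applied, and assumes $g^i\neq 0$ (otherwise $x^i$ is already optimal).

Constant step size $\alpha_i=\alpha$:

$$\frac{R^2+G^2k\alpha^2}{2k\alpha}=\frac{R^2}{2k\alpha}+\frac{G^2\alpha}{2}\longrightarrow\frac{G^2\alpha}{2}$$

Constant step length, using $\alpha_i\|g^i\|_2=\gamma$ and $\alpha_i\geq\gamma/G$:

$$\frac{R^2+\sum_{i=1}^k\alpha_i^2\|g^i\|_2^2}{2\sum_{i=1}^k\alpha_i}\leq\frac{R^2+k\gamma^2}{2k\gamma/G}=\frac{GR^2}{2k\gamma}+\frac{G\gamma}{2}\longrightarrow\frac{G\gamma}{2}$$

Square summable but not summable: the numerator stays below $R^2+G^2\sum_{i=1}^{\infty}\alpha_i^2<\infty$ while the denominator grows without bound.

Nonsummable diminishing: fix $\epsilon>0$ and choose $N$ such that $\alpha_i\leq\epsilon/G^2$ for all $i>N$, so that $G^2\alpha_i^2\leq\epsilon\alpha_i$ for those $i$. Then for $k>N$

$$\frac{R^2+G^2\sum_{i=1}^k\alpha_i^2}{2\sum_{i=1}^k\alpha_i}\leq\frac{R^2+G^2\sum_{i=1}^N\alpha_i^2}{2\sum_{i=1}^k\alpha_i}+\frac{\epsilon}{2}$$

The first term vanishes as $k\to\infty$ because the step sizes are not summable, and $\epsilon$ was arbitrary.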

##### Bounded subgradient and Lipschitz continuous function

$\|g\|_2\leq G, G>0, \forall x,\forall g\in \partial f(x)$ implies $f$ is Lipschitz continuous with constant $G$

$$|f(x)-f(y)|\leq G\|x-y\|_2, \forall x, y$$

To see this, take any $g_x\in \partial f(x)$ and $g_y\in \partial f(y)$; the subgradient inequalities at $x$ and at $y$ give

$$g_x^T(x-y)\geq f(x)-f(y)\geq g_y^T(x-y)$$

Applying Cauchy-Schwarz ($\pm a^Tb\leq \|a\|_2\|b\|_2$) to both sides, together with $\|g_x\|_2\leq G$ and $\|g_y\|_2\leq G$, gives

$$G\|x-y\|_2\geq f(x)-f(y)\geq -G\|x-y\|_2$$
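
Both inequalities can be spot-checked on random points. This sketch again assumes the illustrative choice $f(x)=\|x\|_1$ with the sign vector as subgradient and $G=\sqrt{n}$; the helper names and the tolerance `tol` for floating-point rounding are assumptions.

```python
import math
import random


def l1(x):
    return sum(abs(v) for v in x)


def sign_subgradient(x):
    return [float((v > 0) - (v < 0)) for v in x]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def check_pair(x, y, tol=1e-9):
    """Check the subgradient sandwich and the Lipschitz bound for one pair."""
    G = math.sqrt(len(x))
    diff = [u - v for u, v in zip(x, y)]
    dist = math.sqrt(dot(diff, diff))
    gap = l1(x) - l1(y)
    upper = dot(sign_subgradient(x), diff)  # g_x^T (x - y)
    lower = dot(sign_subgradient(y), diff)  # g_y^T (x - y)
    sandwich_ok = upper + tol >= gap >= lower - tol
    lipschitz_ok = abs(gap) <= G * dist + tol
    return sandwich_ok, lipschitz_ok


rng = random.Random(0)
pairs = [
    ([rng.uniform(-1, 1) for _ in range(5)], [rng.uniform(-1, 1) for _ in range(5)])
    for _ in range(1000)
]
results = [check_pair(x, y) for x, y in pairs]
print(all(s and l for s, l in results))
```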

##### Optimal sequence

For a fixed number of iterations $k$, we can choose positive step sizes $\alpha_1, \cdots, \alpha_k$ that minimize the bound

$$\left(R^2+G^2\sum_{i=1}^k\alpha_i^2\right)/\left(2\sum_{i=1}^k \alpha_i\right)$$

First, write the bound as a function $h(\alpha_1,\cdots,\alpha_k)$ of the step sizes (using $h$ to avoid a clash with the objective $f$). It is convex for $\alpha>0$ (quadratic over linear) and symmetric: permuting the $\alpha_i$ does not change its value

Then, take an optimal sequence $\alpha^*$ with value $h^*$, let $\alpha$ be a uniformly random permutation of $\alpha^*$, and apply Jensen's inequality

$$h(\mathbb{E}[\alpha])\leq \mathbb{E}[h(\alpha)]$$

The right-hand side is just $h^*$, since permutation does not change the function value. On the left, $\mathbb{E}[\alpha]$ (over all possible permutations) is the sequence whose entries all equal the average of the $\alpha_i^*$. Hence $h(\mathbb{E}[\alpha])\leq h^*$, and since $h^*$ is the minimum, $h(\mathbb{E}[\alpha])=h^*$

Therefore, we can restrict attention to sequences in which all $\alpha_i$ are equal
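
The averaging argument can be illustrated numerically: replacing every step size by the average never increases the bound. In this sketch the values of `R` and `G`, the random test sequences, and the helper names are arbitrary assumptions.

```python
import random


def bound(alphas, R, G):
    """(R^2 + G^2 * sum(alpha_i^2)) / (2 * sum(alpha_i)) for positive alpha_i."""
    return (R ** 2 + G ** 2 * sum(a * a for a in alphas)) / (2 * sum(alphas))


def equalized(alphas):
    """Replace every step by the average, i.e. the mean over all permutations."""
    mean = sum(alphas) / len(alphas)
    return [mean] * len(alphas)


rng = random.Random(0)
R, G = 2.0, 3.0
trials = []
for _ in range(1000):
    alphas = [rng.uniform(0.01, 1.0) for _ in range(8)]
    trials.append((bound(equalized(alphas), R, G), bound(alphas, R, G)))

# Small tolerance for floating-point rounding.
print(all(eq <= orig + 1e-9 for eq, orig in trials))
```

Averaging keeps $\sum_i\alpha_i$ fixed while it can only shrink $\sum_i\alpha_i^2$, which is the same fact Jensen's inequality expresses here.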

With $\alpha_i=\alpha$ for all $i$, the bound becomes

$$\frac{R^2+G^2k\alpha^2}{2k\alpha}$$

Setting the derivative with respect to $\alpha$ to zero gives

$$\alpha=(R/G)/\sqrt{k}$$

Substituting back gives

$$f_{\text{best}}^k-f^*\leq RG/\sqrt{k}$$
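
A small arithmetic check of the optimal constant step, with arbitrary illustrative values for `R`, `G`, and `k` and hypothetical function names: the bound at $\alpha=(R/G)/\sqrt{k}$ matches $RG/\sqrt{k}$, and moving $\alpha$ in either direction makes the bound larger.

```python
import math


def bound(alpha, k, R, G):
    """Bound for k steps with a constant step size alpha."""
    return (R ** 2 + G ** 2 * k * alpha ** 2) / (2 * k * alpha)


def optimal_step(k, R, G):
    """Minimizer of the bound over alpha > 0 for a fixed horizon k."""
    return (R / G) / math.sqrt(k)


R, G, k = 2.0, 3.0, 100
alpha_star = optimal_step(k, R, G)
print(alpha_star, bound(alpha_star, k, R, G), R * G / math.sqrt(k))
print(bound(0.9 * alpha_star, k, R, G), bound(1.1 * alpha_star, k, R, G))
```

Note that this step size depends on $k$, so the total number of iterations has to be fixed before the method starts.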