#### Subgradient and nondifferentiable function

A nondifferentiable function $f(x)$ is convex if and only if

for any $x_0\in \text{dom}\,f$, there exists a vector $g(x_0)$ such that, for all $x\in \text{dom}\,f$,

$$f(x)\geq f(x_0) + g(x_0)^T(x-x_0)$$

i.e., the affine function on the right-hand side is a `global underestimator` of $f$

$g$ is referred to as a `subgradient` of $f$ at $x_0$

Consider $f(x)=|x|$. At $x=0$ there are infinitely many subgradients: any $g\in[-1,1]$ satisfies $|x|\geq gx$ for all $x$. The `set` of all subgradients of $f$ at a point $x$ is called the `subdifferential` of $f$ at $x$, denoted $\partial f(x)$

A nondifferentiable $f$ is convex if and only if its subdifferential is `monotone`, i.e.,

$$\langle g(x_1) - g(x_2) ,x_1-x_2 \rangle \geq 0, \forall g(x_1)\in \partial f(x_1), g(x_2)\in \partial f(x_2)$$
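
The two properties above can be checked numerically for $f(x)=|x|$. The following is a minimal sketch, not a proof: the helper `subdifferential_abs` and the test grids are illustrative choices, and only the endpoints of each subdifferential interval are tested for monotonicity (which suffices because the expression is linear in each subgradient).

```python
import numpy as np

def f(x):
    return np.abs(x)

def subdifferential_abs(x):
    """Endpoints (lo, hi) of the subdifferential of |x| at x (hypothetical helper)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

xs = np.linspace(-2.0, 2.0, 401)
x0 = 0.0

# Every g in [-1, 1] gives a global underestimator at x0 = 0.
subgradient_ok = all(
    np.all(f(xs) >= f(x0) + g * (xs - x0) - 1e-12)
    for g in np.linspace(-1.0, 1.0, 21)
)

# A slope outside [-1, 1] (here 1.5) overshoots f somewhere, so it is not a subgradient.
not_subgradient = bool(np.any(f(xs) < f(x0) + 1.5 * (xs - x0)))

# Monotonicity: (g1 - g2) * (x1 - x2) >= 0 for all pairs of points and subgradients.
points = [-1.0, 0.0, 0.5]
monotone_ok = all(
    (g1 - g2) * (x1 - x2) >= 0
    for x1 in points
    for x2 in points
    for g1 in subdifferential_abs(x1)
    for g2 in subdifferential_abs(x2)
)
```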

#### Some basic rules

$f$ is convex

* $\partial f(x)=\{\nabla f(x)\}$ if $f$ is differentiable at $x$
* **scaling**: $\partial(\alpha f)=\alpha \partial f, \alpha>0$
* **addition**: $\partial(f_1+f_2)=\partial f_1 +\partial f_2$
* **affine transformation**: if $g(x)=f(Ax+b)$, then $\partial g(x)=A^T \partial f(Ax+b)$
* **finite pointwise maximum**: if $f=\max_i f_i$, then the subdifferential is the `convex hull` of the union of the subdifferentials of the `active` functions at $x$
$$\partial f(x)= \text{conv} \bigcup\{\partial f_i(x)|f_i(x)=f(x)\}$$
* **pointwise supremum**: if $f=\sup_{\alpha \in A} f_{\alpha}$, then, roughly speaking, $\partial f(x)$ is the closure of the convex hull of the union of the subdifferentials of the active functions; in general only the following inclusion is guaranteed
$$\text{cl conv}\bigcup \{\partial f_{\beta}(x)|f_{\beta}(x)=f(x)\}\subseteq \partial f(x)$$
* **weak rule** for pointwise supremum: find one $g\in \partial f(x)$
    * find any $\beta$ for which $f_{\beta}(x)=f(x)$ (assume supremum is achieved)
    * choose any $g\in \partial f_{\beta}(x)$
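
As an illustration of the weak rule, the $\ell_1$ norm can be written as $\|x\|_1=\max_{s\in\{-1,+1\}^n}s^Tx$, so an active $s$ (one matching the signs of $x$) is itself a subgradient. The sketch below assumes this representation; the function name `l1_subgradient` and the tie-breaking choice at zero entries are illustrative.

```python
import numpy as np

def l1_subgradient(x):
    """One subgradient of ||x||_1 obtained via the weak rule.

    An active s has s_i = sign(x_i) where x_i != 0; where x_i == 0 either
    sign is active, and +1 is picked arbitrarily. The gradient of the
    active linear function x -> s^T x is s itself.
    """
    return np.where(x >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
x0 = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x0)

# Check f(x) >= f(x0) + g^T (x - x0) at random test points.
test_points = rng.normal(size=(1000, 3)) * 3.0
lhs = np.abs(test_points).sum(axis=1)
rhs = np.abs(x0).sum() + (test_points - x0) @ g
inequality_holds = bool(np.all(lhs >= rhs - 1e-12))
```

Note that the weak rule returns a single element of $\partial f(x_0)$, not the whole set; at $x_0$ above, the second coordinate of a subgradient could be any value in $[-1,1]$.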

#### Example

##### Maximum eigenvalue

$$f(x)=\lambda_{\max}(A(x))=\sup_{\|y\|_2=1} y^TA(x)y$$

where $A(x)=A_0+x_1A_1+\cdots + x_nA_n$

* $f$ is pointwise supremum of $g_y(x)=y^TA(x)y\,$ over $\|y\|_2=1$
* for any $x$, the supremum is achieved when $y$ satisfies $A(x)y=\lambda_{\max}(A(x))y, \|y\|_2=1$
* $\nabla g_y(x)=\begin{bmatrix}y^TA_1y & \cdots & y^TA_ny\end{bmatrix}^T$, since $g_y$ is affine in $x$

Therefore, to find `a` subgradient in $\partial f(x)$, we can choose any unit eigenvector $y$ associated with $\lambda_{\max} (A(x))$, and then

$$\begin{bmatrix}y^TA_1y & \cdots & y^TA_ny\end{bmatrix}^T\in \partial f(x)$$
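
The recipe above can be sketched with NumPy's symmetric eigensolver. The matrices are random symmetric test data and the function name `lambda_max_and_subgradient` is an illustrative choice; the loop only spot-checks the subgradient inequality at sampled points.

```python
import numpy as np

def lambda_max_and_subgradient(x, A0, As):
    """Return lambda_max(A(x)) and one subgradient, with A(x) = A0 + sum_i x_i A_i."""
    A = A0 + sum(xi * Ai for xi, Ai in zip(x, As))
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    y = eigvecs[:, -1]                     # unit eigenvector for the largest eigenvalue
    g = np.array([y @ Ai @ y for Ai in As])
    return eigvals[-1], g

rng = np.random.default_rng(0)

def random_symmetric(n):
    M = rng.normal(size=(n, n))
    return (M + M.T) / 2

A0 = random_symmetric(4)
As = [random_symmetric(4) for _ in range(3)]
x0 = rng.normal(size=3)
f0, g = lambda_max_and_subgradient(x0, A0, As)

# Spot-check f(x) >= f(x0) + g^T (x - x0) at random points.
violations = 0
for _ in range(200):
    x = x0 + rng.normal(size=3)
    fx, _ = lambda_max_and_subgradient(x, A0, As)
    if fx < f0 + g @ (x - x0) - 1e-9:
        violations += 1
```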

##### Expectation

$$f(x)=\mathbb{E}_uf(x, u)$$

When the expectation cannot be evaluated exactly, we can use a Monte Carlo method

* sample $u_1, \cdots, u_k$
* approximate $f(x) \approx (1/k)\sum_{i=1}^kf(x, u_i)$
* for each $i$, choose a $g(x, u_i)\in \partial_xf(x,  u_i)$
* obtain an approximate subgradient $g=(1/k)\sum_{i=1}^kg(x, u_i)$
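
A minimal sketch of this sampling scheme, assuming the illustrative integrand $f(x,u)=|x-u|$ with $u\sim\mathcal{N}(0,1)$ (neither is from the text above). For this choice the exact derivative of the expectation is $2\Phi(x)-1=\operatorname{erf}(x/\sqrt{2})$, which gives something to compare against.

```python
import math
import numpy as np

def mc_subgradient(x, samples):
    """Average of per-sample subgradients of |x - u| with respect to x.

    np.sign returns 0 when x == u, which lies in [-1, 1] and is therefore
    a valid subgradient of the absolute value at its kink.
    """
    return float(np.mean(np.sign(x - samples)))

rng = np.random.default_rng(0)
u = rng.standard_normal(100_000)   # samples u_1, ..., u_k
x = 0.5

f_hat = float(np.mean(np.abs(x - u)))      # Monte Carlo estimate of f(x)
g_hat = mc_subgradient(x, u)               # approximate subgradient
g_exact = math.erf(x / math.sqrt(2.0))     # exact derivative for this example
```

The estimate is noisy: its error shrinks roughly like $1/\sqrt{k}$, so `g_hat` only approximates an element of $\partial f(x)$.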

##### Minimization

Define $g(y)$ as the optimal value of

$$\min_x f_0(x), \,\, \text{s.t. } f_i(x)\leq y_i, \,\, i=1,\cdots,m$$

where $f_i$, $i=0,\cdots,m$, are convex

If `strong duality` holds with the dual

$$\max \inf_x \left(f_0(x)+\sum_{i=1}^m\lambda_i\left(f_i(x)-y_i\right)\right), \,\, \text{s.t. }\lambda\geq 0$$

and $\lambda^*$ is dual optimal,

then, for any $z$ such that $g(z)$ is finite, we have

$$\begin{align*}
g(z)&\geq\inf_x \left(f_0(x)+\sum_{i=1}^m\lambda^*_i\left(f_i(x)-z_i\right)\right) && \text{(weak duality at } z\text{)}\\
&=\inf_x \left(f_0(x)+\sum_{i=1}^m\lambda^*_i\left(f_i(x)-y_i\right)\right) - \sum_{i=1}^m\lambda^*_i(z_i-y_i)\\
&=g(y) - \sum_{i=1}^m\lambda^*_i(z_i-y_i) && \text{(strong duality at } y\text{)}
\end{align*}$$

That is, $-\lambda^*$ is a subgradient of $g$ at $y$
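
A small numerical check of this sensitivity result, using an illustrative scalar problem that is not from the text: $\min x^2$ s.t. $1-x\leq y$. Its optimal value is $g(y)=\max(1-y,0)^2$, and at $y=0$ the optimum is $x^*=1$ with multiplier $\lambda^*=2$ (from stationarity $2x^*-\lambda^*=0$).

```python
import numpy as np

def g(y):
    """Optimal value of  min x^2  s.t.  1 - x <= y  (closed form for this toy problem)."""
    return np.maximum(1.0 - y, 0.0) ** 2

y0 = 0.0
lam_star = 2.0   # dual optimal at y0, derived by hand from 2 x* - lambda* = 0 with x* = 1

# Subgradient inequality: g(z) >= g(y0) + (-lambda*) (z - y0) for all z.
zs = np.linspace(-3.0, 3.0, 601)
lower_bound = g(y0) - lam_star * (zs - y0)
bound_holds = bool(np.all(g(zs) >= lower_bound - 1e-12))

# g happens to be differentiable at y0 here, so a central difference
# should reproduce the slope -lambda*.
h = 1e-6
slope = (g(y0 + h) - g(y0 - h)) / (2 * h)
```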

#### Optimality condition

##### Unconstrained optimization

If $f$ is convex and nondifferentiable, then $x^*$ minimizes $f(x)$ if and only if

$$\boxed{0 \in \partial f(x^*)}$$

This follows directly from the definition of a subgradient: $0\in \partial f(x^*)$ holds exactly when

$$f(y)\geq f(x^*) =  f(x^*)+\mathbf{0}^T(y-x^*), \forall y$$

##### Example: piecewise linear minimization

$f(x)=\max_i (a_i^Tx+b_i)$

$x^*$ minimizes $f \Longleftrightarrow 0\in \partial f(x^*) =\text{conv}\{a_i | i\in I(x^*)\}$

where $I(x)=\{i| a_i^Tx+b_i=f(x)\}$

By the definition of the convex hull, the condition above is equivalent to

$$\exists \lambda \text{ s.t. } \lambda\geq 0, \mathbf{1}^T\lambda =1, \sum_{i=1}^m\lambda_i a_i=0$$

together with $\lambda_i=0$ for the `inactive` indices (i.e., $a_i^Tx^*+b_i<f(x^*)$), since the convex hull is taken only over the `active` $a_i$ (i.e., $a_i^Tx^*+b_i=f(x^*)$)

These are exactly the KKT conditions for the epigraph form

$$\min_{x, t} t, \,\, \text{s.t. } a_i^Tx+b_i\leq t$$
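
The optimality test can be sketched for a one-dimensional instance, where the convex hull of the active slopes is just the interval between the smallest and largest of them. The data `a`, `b`, the tolerance, and the helper names are illustrative assumptions; in higher dimensions, checking $0\in\text{conv}\{a_i\}$ would require solving a small feasibility problem instead.

```python
import numpy as np

# f(x) = max_i (a_i x + b_i) with scalar x (illustrative data)
a = np.array([-2.0, 0.5, 1.0])
b = np.array([-1.0, 0.0, -1.0])

def f(x):
    return float(np.max(a * x + b))

def active_set(x, tol=1e-9):
    """Indices i with a_i x + b_i = f(x), up to a tolerance."""
    vals = a * x + b
    return np.flatnonzero(vals >= vals.max() - tol)

def is_optimal(x):
    """0 in conv{a_i : i active}; in 1-D this is an interval test."""
    act = a[active_set(x)]
    return bool(act.min() <= 0.0 <= act.max())

x_star = -0.4                       # where the first two pieces intersect
lam = np.array([0.2, 0.8, 0.0])     # multipliers: nonnegative, sum to 1, zero on the inactive piece

grid = np.linspace(-3.0, 3.0, 601)
grid_min = min(f(x) for x in grid)
```

Here `lam @ a` equals zero, matching the condition $\sum_i\lambda_ia_i=0$ with $\lambda_3=0$ for the inactive third piece.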

##### Constrained optimization

For

$$\min f_0(x), \,\, \text{s.t. } f_i(x)\leq 0$$

If each $f_i$ is convex and subdifferentiable and the problem is strictly feasible, then

$x^*$ is primal optimal and $\lambda^*$ is dual optimal if and only if

$$\begin{align*}
& f_i(x^*)\leq 0, \,\, \lambda^*\geq 0 \\
& 0 \in \partial f_0(x^*)+\sum_{i=1}^m\lambda_i^*\partial f_i(x^*)\\
& \lambda_i^* f_i(x^*)=0
\end{align*}$$

This generalizes the KKT conditions to nondifferentiable convex $f_i$
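
As a quick check of these conditions, consider the following one-dimensional instance (an illustrative choice, not from the text above): minimize $f_0(x)=|x|-2x$ subject to $f_1(x)=x\leq 0$. The problem is strictly feasible (e.g., $x=-1$), and on the feasible set $f_0(x)=-3x\geq 0$, so $x^*=0$. Using the addition rule for $\partial f_0$:

$$\begin{align*}
& \partial f_0(0)=[-1,1]+\{-2\}=[-3,-1], \quad \partial f_1(0)=\{1\}\\
& 0\in \partial f_0(0)+\lambda^*\partial f_1(0)=[-3+\lambda^*,\,-1+\lambda^*] \Longleftrightarrow \lambda^*\in[1,3]\\
& f_1(0)=0\leq 0, \quad \lambda^*\geq 0, \quad \lambda^* f_1(0)=0
\end{align*}$$

So every $\lambda^*\in[1,3]$ certifies optimality of $x^*=0$; the multiplier need not be unique when $f_0$ has a kink at the optimum.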

#### Directional derivative

The `directional derivative` of $f$ at $x$ in the direction of $\delta x$ is defined as

$$\begin{align*}f'(x;\delta x)&=\lim_{a\rightarrow 0^+} \frac{f(x+a\delta x)-f(x)}{a}\\
&=\lim_{t\rightarrow \infty}\left(tf\left(x+\frac{1}{t}\delta x\right)-tf(x)\right)
\end{align*}$$

If $f$ is `differentiable` at $x$, then a first-order expansion gives

$$f(x+a\delta x)-f(x)=a\nabla f(x)^T\delta x + o(a)$$

Dividing by $a$ and letting $a\rightarrow 0^+$, we have

$$f'(x; \delta x)=\nabla f(x)^T \delta x$$

That is, the directional derivative is a linear function of $\delta x$

An equivalent definition of the directional derivative of a `convex` $f$ is

$$\begin{align*}f'(x;\delta x)&=\inf_{a> 0} \frac{f(x+a\delta x)-f(x)}{a}\\
&=\inf_{t> 0} \left(tf(x+\frac{1}{t}\delta x)-tf(x)\right)
\end{align*}$$

We can see this by defining $h(\delta x)=f(x+\delta x)-f(x)$, which is convex in $\delta x$

Since $h$ is convex with $h(0)=0$, its perspective

$$th(\delta x/t)=tf(x+\frac{1}{t}\delta x)-tf(x)$$

is `nonincreasing` in $t>0$

Therefore

$$\lim_{t\rightarrow \infty} th(\delta x/t)=\inf_{t>0} th(\delta x/t)$$

As a result, $f'(x;\delta x)$ defines a lower bound on $f$ in the direction of $\delta x$

$$f(x+a\delta x)\geq f(x) + a f'(x;\delta x), \forall a\geq 0$$
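
These properties can be observed numerically. The sketch below uses the illustrative convex function $f(x)=x^2+|x|$ at $x=0$ (not from the text), for which the difference quotient in direction $\delta x=\pm 1$ is $a+1$, so $f'(0;\pm 1)=1$.

```python
import numpy as np

def f(x):
    return x ** 2 + np.abs(x)

x = 0.0
a = np.array([1.0, 0.5, 0.1, 0.01, 0.001])   # decreasing steps, i.e. increasing t = 1/a

# Difference quotients (f(x + a*dx) - f(x)) / a in the directions +1 and -1.
q_plus = (f(x + a * 1.0) - f(x)) / a
q_minus = (f(x + a * -1.0) - f(x)) / a

# Quotients shrink as a -> 0+ (nonincreasing in t), so the limit equals the infimum.
nonincreasing_in_t = bool(np.all(np.diff(q_plus) <= 0))

dir_deriv = 1.0   # analytic value of f'(0; +1) for this f

# Lower bound along the ray: f(x + a*dx) >= f(x) + a * f'(x; dx) for a >= 0.
steps = np.linspace(0.0, 5.0, 501)
lower_bound_holds = bool(np.all(f(x + steps) >= f(x) + steps * dir_deriv - 1e-12))
```

Both `q_plus` and `q_minus` approach 1, so here $f'(0;-\delta x)\neq -f'(0;\delta x)$: at a kink the directional derivative is not linear in $\delta x$, in contrast to the differentiable case.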