#### Decomposable functions

Recall that the idea of `gradient descent` is, at iteration $k$, to minimize a quadratic approximation of $f$ around $x^k$

$$x^{k+1}=\arg \min_xf(x^{k})+\nabla f(x^{k})^T(x-x^{k})+\frac{1}{2t^k}\|x-x^{k}\|_2^2$$

Taking the derivative with respect to $x$ and setting it to zero gives

$$0+\nabla f(x^{k})+\frac{1}{t^k}(x-x^{k})=0$$

and we get the gradient descent equation

$$x^{k+1}=x^{k}-t^k \nabla f(x^{k})$$
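The update above can be sketched in a few lines of NumPy. This is a minimal illustration with a fixed step size; the function name `gradient_descent` and the test problem $f(x)=\frac{1}{2}\|x-c\|_2^2$ are assumptions made for the example, not part of the notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, t, num_iters):
    """Fixed-step gradient descent: repeat x <- x - t * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x

# Hypothetical test problem: f(x) = 0.5 * ||x - c||_2^2, with gradient x - c.
c = np.array([1.0, -2.0])
x_gd = gradient_descent(lambda x: x - c, x0=np.zeros(2), t=0.5, num_iters=50)
```

With $t=0.5$ the distance to the minimizer $c$ halves at every iteration, so `x_gd` ends up numerically equal to `c`.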

The main motivation for the proximal gradient method is that, when $f$ is not differentiable, we would rather not replace the whole approximation with a subgradient step, which we have seen is slow. Instead, we would like to keep the quadratic approximation for the part of $f$ that is differentiable

In particular, we consider functions of the form

$$f(x)=g(x)+h(x)$$

where $g$ is differentiable while $h$ need not be

We apply the quadratic approximation to $g$ only and leave $h$ as it is

$$\begin{align*}
x^{k+1} &  = \arg \min_z g(x^k)+\nabla g(x^k)^T(z-x^k)+\frac{1}{2t}\|z-x^k\|_2^2+h(z)\\
& \text{(constants in } z \text{ do not change the minimizer: drop } g(x^k) \text{, add } \tfrac{t}{2}\|\nabla g(x^k)\|_2^2 \text{)}\\
 & = \arg \min_z \frac{t}{2}\|\nabla g(x^k)\|_2^2+\nabla g(x^k)^T(z-x^k)+\frac{1}{2t}\|z-x^k\|_2^2+h(z)\\
& \text{(complete the square in } z \text{)}\\
&=\arg\min_z\frac{1}{2t}\left\|z-\left(x^k-t\nabla g(x^k)\right)\right\|_2^2+h(z)
\end{align*}$$

That is, we look for a $z$ that balances two goals

* it stays close to the gradient update for $g$, by keeping $\|z-\left(x^k-t\nabla g(x^k)\right)\|_2^2$ small
* it makes $h$ small, by keeping $h(z)$ small

#### Proximal operator

More formally, we define the proximal operator of $h$ with step size $t$ as

$$\text{prox}_{h, t}(x) = \arg \min_z \frac{1}{2t}\|z-x\|_2^2+h(z) $$

As an example, if $h$ is the indicator function $I_S$ of a set $S$ (zero on $S$ and $+\infty$ elsewhere), then

$$\begin{align*}
\text{prox}_{I_S, t}(x) &= \arg \min_z \frac{1}{2t}\|z-x\|_2^2+I_S(z) \\
&=\arg \min_z \frac{1}{2t}\|z-x\|_2^2, \text{ s.t. }z\in S
\end{align*}$$

which is the Euclidean projection of $x$ onto $S$
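To make the indicator example concrete, here is a small sketch for one particular set, the box $S=\{z: l\leq z\leq u\}$. The box and the name `prox_box_indicator` are assumptions chosen for illustration. For this $S$ the constrained problem separates across coordinates and is solved by clipping.

```python
import numpy as np

def prox_box_indicator(x, lower, upper, t=1.0):
    """Prox of the indicator of the box {z : lower <= z <= upper}.

    The step size t only rescales the quadratic term, so it does not
    change the minimizer over the box; it is kept for a uniform signature.
    """
    return np.clip(x, lower, upper)

x_in = np.array([-2.0, 0.3, 5.0])
x_proj = prox_box_indicator(x_in, lower=0.0, upper=1.0)
```

Coordinates already inside the box are left unchanged, and the others are moved to the nearest bound.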


Using the proximal operator, we can write the proximal gradient step for $f(x)=g(x)+h(x)$ as

$$x^{k+1} = \text{prox}_{h, t}(x^k-t\nabla g(x^k))$$

We can also write it in the form of a standard gradient step by defining

$$G_t(x)=\left(x-\text{prox}_{h, t}\left(x-t\nabla g(x)\right)\right)/t$$

which is often called the generalized gradient of $f$

With this, we can write the proximal gradient step as

$$x^{k+1}=x^k-tG_{t}(x^k)$$

The key points of the proximal method are

* $\text{prox}_{h, t}(\cdot)$ has a `closed-form` expression for many commonly used $h$, or can otherwise be computed very efficiently, and it depends only on $h$, not on $g$
* The $g$ part can be complicated, but we only need its gradient
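The two equivalent forms of the step can be checked numerically. The sketch below assumes the caller supplies `grad_g(x)` and `prox_h(v, t)` as plain functions; these names, and the toy choice of $g$ and $h$, are hypothetical and only serve the example.

```python
import numpy as np

def prox_grad_step(x, grad_g, prox_h, t):
    """One proximal gradient step: prox_{h,t}(x - t * grad_g(x))."""
    return prox_h(x - t * grad_g(x), t)

def generalized_gradient(x, grad_g, prox_h, t):
    """G_t(x) = (x - prox_{h,t}(x - t * grad_g(x))) / t."""
    return (x - prox_grad_step(x, grad_g, prox_h, t)) / t

# Toy problem (assumed): g(x) = 0.5 * ||x - c||^2, h = indicator of the box [0, 1]^2.
c = np.array([2.0, -3.0])
grad_g = lambda x: x - c
prox_h = lambda v, t: np.clip(v, 0.0, 1.0)

x, t = np.array([0.5, 0.5]), 0.5
x_next = prox_grad_step(x, grad_g, prox_h, t)
G = generalized_gradient(x, grad_g, prox_h, t)
```

Here `x - t * G` reproduces `x_next`, and if `prox_h` is replaced by the identity (that is, $h=0$), `G` reduces to the ordinary gradient of $g$.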

#### Backtracking line search

Backtracking line search for the proximal method works much as it does for gradient descent, except that the test is applied to $g$ only, not to $f$

We fix $\beta\in (0,1)$. At each iteration, we start with some $t$ (e.g., the value from the previous iteration) and let

$$z=\text{prox}_{h, t}(x^k-t\nabla g(x^k))$$

while

$$g(z)>g(x^k)+\nabla g(x^k)^T (z-x^k)+\frac{1}{2t}\|z-x^k\|_2^2$$

we shrink the step size, $t\leftarrow \beta t$, and recompute $z$ with the new $t$

Once the condition no longer holds, we accept $z$ and perform the proximal gradient update $x^{k+1}=z$
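A minimal sketch of one backtracked step follows. The signature, the default values, and the `max_shrinks` safeguard are assumptions added for the example; `g`, `grad_g` and `prox_h(v, t)` are supplied by the caller. Note that $z$ is recomputed every time $t$ shrinks.

```python
import numpy as np

def backtracking_prox_step(x, g, grad_g, prox_h, t=1.0, beta=0.5, max_shrinks=50):
    """Return (z, t), where z = prox_{h,t}(x - t * grad_g(x)) passes the test on g."""
    gx, grad = g(x), grad_g(x)
    for _ in range(max_shrinks):
        z = prox_h(x - t * grad, t)
        d = z - x
        # Accept once g(z) lies below the quadratic upper model of g at x.
        if g(z) <= gx + grad @ d + (d @ d) / (2.0 * t):
            return z, t
        t *= beta  # otherwise shrink the step and try again
    raise RuntimeError("backtracking did not find an acceptable step size")

# Toy problem (assumed): g(x) = 2 * ||x||^2, h = 0, so the prox is the identity.
g = lambda x: 2.0 * (x @ x)
grad_g = lambda x: 4.0 * x
prox_h = lambda v, t: v
z_bt, t_bt = backtracking_prox_step(np.array([1.0, -1.0]), g, grad_g, prox_h)
```

In this toy problem the gradient of $g$ is Lipschitz with constant $L=4$, and the search starting from $t=1$ stops at $t=1/4=1/L$.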

#### Example: LASSO

We now apply the proximal gradient method to the LASSO problem

$$\min_x \frac{1}{2}\|y-Ax\|_2^2+\lambda \|x\|_1, \lambda \geq 0$$

where $g(x)=\frac{1}{2}\|y-Ax\|_2^2$ and $h(x)=\lambda\|x\|_1$

To form the proximal gradient step

$$x^{k+1} = \text{prox}_{h, t}(x^k-t\nabla g(x^k))$$

we first compute the gradient of $g$

$$\nabla g(x)=-A^T(y-Ax)$$

Then, we compute the proximal mapping for $h$

$$\text{prox}_{\lambda\|\cdot\|_1, t}(x)=\arg \min_z \frac{1}{2t}\|x-z\|_2^2+\lambda \|z\|_1$$

This is exactly the soft-thresholding problem, whose solution we already know from the example on the subgradient optimality condition

Writing $S_{\lambda, t}(x)$ for the soft-thresholding operator, the solution is, coordinate by coordinate,

$$z_i = [S_{\lambda, t}(x)]_i = \left\{\begin{array}{ll} x_i-t\lambda & x_i>t\lambda \\ 0 & x_i \in [-t\lambda, t\lambda] \\ x_i+t\lambda & x_i<-t\lambda\end{array}\right.$$
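The three cases collapse into a single vectorized expression, $\text{sign}(x_i)\max(|x_i|-t\lambda, 0)$. A minimal NumPy sketch, with the function name `soft_threshold` chosen for this example:

```python
import numpy as np

def soft_threshold(x, lam, t):
    """Soft-thresholding S_{lam,t}(x): shrink every entry toward zero by t * lam."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

z = soft_threshold(np.array([3.0, -0.5, -2.0]), lam=1.0, t=1.0)
```

With threshold $t\lambda=1$, the entry $3$ shrinks to $2$, the entry $-0.5$ is set to zero, and $-2$ shrinks to $-1$.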

Combining these two, we obtain the proximal gradient step for LASSO

$$x^{k+1}=S_{\lambda, t}\left(x^k+tA^T(y-Ax^k)\right)$$

This algorithm is commonly known as the Iterative Shrinkage-Thresholding Algorithm (ISTA)
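Putting the pieces together, a minimal ISTA sketch might look as follows. The fixed step size $t=1/\|A\|_2^2$ (the reciprocal of the largest eigenvalue of $A^TA$), the zero initialization, and the iteration count are assumptions for this example; the backtracking rule from the previous section could be used instead.

```python
import numpy as np

def soft_threshold(x, lam, t):
    """Soft-thresholding S_{lam,t}(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

def lasso_objective(A, y, x, lam):
    """0.5 * ||y - A x||_2^2 + lam * ||x||_1."""
    r = y - A @ x
    return 0.5 * (r @ r) + lam * np.sum(np.abs(x))

def ista(A, y, lam, num_iters=500):
    """ISTA with fixed step t = 1 / ||A||_2^2 (an assumed choice)."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2  # spectral norm of A, squared
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x + t * A.T @ (y - A @ x), lam, t)
    return x

# Sanity check (assumed example): with A = I the LASSO solution is S_{lam,1}(y).
y_demo = np.array([3.0, -0.5, -2.0])
x_demo = ista(np.eye(3), y_demo, lam=1.0)
```

When $A$ is the identity, the step size is $t=1$ and the iteration reaches the fixed point $S_{\lambda,1}(y)$ after a single step, which makes this a convenient check of the implementation.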