---
title: 12.2 Gradient Descent
subject:  Optimization
subtitle: 
short_title: 12.2 Gradient Descent
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Apps.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 21 - An introduction to unconstrained optimization, gradient descent, and Newton’s method.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be

## Learning Objectives

By the end of this page, you should know:
- 

## Gradient Descent

Our intuition so far is that we should try to "walk downhill" and that the negative gradient $-\nabla f(x)$ points in the direction of steepest descent at point $x$. Can we turn this into an algorithm for minimizing (at least locally) a cost function $f(x)$?

This intuition is precisely the motivation behind the gradient descent algorithm, which, starting from an initial guess $x^{(0)}$, iteratively updates the current best guess $x^{(k)}$ of $x^*$ according to:

\[x^{(k+1)} = x^{(k)} - s\nabla f(x^{(k)}), \quad \text{for } k=0,1,2,\ldots \quad (GD)\]

where $s>0$ is called the step size. For any $s>0$, the update rule (GD) steps in a direction along which $f$ initially decreases, but be careful: if $s$ is too large, you can overshoot the minimum (we'll see more about this later).

Because we know that $\nabla f(x^*)=0$ if $x^*$ is a local minimum, we can use the norm of the gradient as a stopping criterion: if $||\nabla f(x^{(k)})||_2 \leq \varepsilon$ for some small $\varepsilon>0$, we stop updating our iterate because $x^{(k)}$ is "close enough" to $x^*$ (typical choices of $\varepsilon$ are $10^{-7}$ or $10^{-6}$, depending on how precise a solution is required).
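The update rule (GD) together with the gradient-norm stopping criterion can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the `max_iters` safeguard, and the quadratic test cost are all assumptions made for this example.

```python
import numpy as np

def gradient_descent(grad_f, x0, s, eps=1e-6, max_iters=10_000):
    """Iterate x <- x - s * grad_f(x) until ||grad_f(x)||_2 <= eps.

    max_iters is an added safeguard (not part of (GD)) so that the loop
    terminates even if s is too large and the iterates diverge.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion
            return x, k
        x = x - s * g                  # update rule (GD)
    return x, max_iters

# Hypothetical test cost: f(x) = (x1 - 3)^2 + 2 (x2 + 1)^2, minimized at (3, -1)
def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

x_hat, iters = gradient_descent(grad_f, x0=[0.0, 0.0], s=0.1)
print("approximate minimizer:", x_hat, "after", iters, "iterations")
```

The returned iteration count makes it easy to experiment with how the step size $s$ and the tolerance $\varepsilon$ affect the number of updates needed.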

Before looking at some examples of (GD) in action, let's try to get some intuition as to why this works. Suppose we are currently at $x^{(k)}$. Let's form a linear approximation of $f(x)$ near $x^{(k)}$ using its first-order Taylor series approximation:

\[f(x) \approx f(x^{(k)}) + \nabla f(x^{(k)})^T(x-x^{(k)}) \quad (TS)\]

As you can see from the figure, (TS) is a very good approximation of $f(x)$ when $x$ is not too far from $x^{(k)}$, but gets worse as we move further away.
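To see numerically how the quality of (TS) degrades with distance, the sketch below compares $f(x)$ with its first-order approximation at increasing distances from $x^{(k)}$. The cost $f(x) = x_1^2 + e^{x_2}$, the point $x^{(k)} = (1, 0)$, and the direction `d` are arbitrary choices made only for this illustration.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + np.exp(x[1])

def grad_f(x):
    return np.array([2.0 * x[0], np.exp(x[1])])

x_k = np.array([1.0, 0.0])
d = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit-length direction

errors = []
for t in [0.01, 0.1, 1.0]:                # distance from x_k
    x = x_k + t * d
    linear = f(x_k) + grad_f(x_k) @ (x - x_k)   # right-hand side of (TS)
    errors.append(abs(f(x) - linear))
    print(f"distance {t:5.2f}: |f(x) - (TS)| = {errors[-1]:.2e}")
```

For this smooth cost the error shrinks roughly quadratically as the distance decreases, which is why a small enough step keeps (TS) trustworthy.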

Let's use (TS) to define our next point $x^{(k+1)}$ so that $f(x^{(k+1)}) \leq f(x^{(k)})$. If we define $\Delta x^{(k)} = x^{(k+1)} - x^{(k)}$, then evaluating (TS) at the point $x^{(k+1)}$ gives

\[f(x^{(k+1)}) - f(x^{(k)}) \approx \nabla f(x^{(k)})^T \Delta x^{(k)} (= \langle \nabla f(x^{(k)}), \Delta x^{(k)} \rangle)\]

So if we want $f(x^{(k+1)}) \leq f(x^{(k)})$, then we should find a nearby $x^{(k+1)}$ such that

\[\nabla f(x^{(k)})^T \Delta x^{(k)} \leq 0. \quad (*)\]

Now, assuming that $\nabla f(x^{(k)}) \neq 0$ (so we're not at a local extremum), a clear choice for $\Delta x^{(k)}$ is $-s\nabla f(x^{(k)})$ for $s>0$ a step size chosen small enough so that (TS) is a good approximation. In that case, we have

\[\nabla f(x^{(k)})^T \Delta x^{(k)} = -s||\nabla f(x^{(k)})||^2 < 0.\]

In general, any choice of $\Delta x^{(k)}$ for which (*) holds with strict inequality is a valid descent direction, provided the step is small enough for (TS) to remain a good approximation. Geometrically, this is illustrated in the picture below:

[Image placeholder for descent direction illustration]
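The descent condition (*) can also be checked numerically. The sketch below takes a few candidate directions $d$, sets $\Delta x^{(k)} = s\,d$ for a small $s$, and compares the sign of $\nabla f(x^{(k)})^T d$ with the sign of the actual change in $f$. The cost $f(x) = x_1^2 + 3x_2^2$, the point $x^{(k)} = (1,1)$, and the candidate directions are assumptions chosen for illustration.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x_k = np.array([1.0, 1.0])
g = grad_f(x_k)
s = 0.01                                   # small step so (TS) is accurate

directions = {
    "negative gradient": -g,
    "(-1, 0)": np.array([-1.0, 0.0]),
    "(1, -1)": np.array([1.0, -1.0]),
    "(1, 0)": np.array([1.0, 0.0]),        # positive inner product: not a descent direction
}

results = {}
for name, d in directions.items():
    inner = g @ d                          # left-hand side of (*), up to the factor s
    change = f(x_k + s * d) - f(x_k)       # actual change in the cost
    results[name] = (inner, change)
    print(f"{name:>17}: grad^T d = {inner:6.1f}, change in f = {change:+.4f}")
```

In this example every direction with a negative inner product decreases $f$, and the negative gradient gives the largest decrease per unit of $s$ among the candidates listed.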

Example: Let's use gradient descent to minimize $f(x) = \frac{1}{2}||x||^2$. This is a silly example, but one for which we can easily compute the iterates by hand. Here $\nabla f(x) = x$, so our descent direction is $-\nabla f(x) = -x$. Let's use $x\in\mathbb{R}^2$ and an initial guess of $x^{(0)} = (1,1)$. We'll use a step size of $s = \frac{1}{2}$. Then

\[x^{(1)} = x^{(0)} - \frac{1}{2}\nabla f(x^{(0)}) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{2}x^{(0)}\]

\[x^{(2)} = x^{(1)} - \frac{1}{2}\nabla f(x^{(1)}) = \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} - \frac{1}{2}\left(\frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = \frac{1}{4}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \left(\frac{1}{2}\right)^2 x^{(0)}\]

Continuing this pattern, after $k$ steps we have

\[x^{(k)} = \left(\frac{1}{2}\right)^k \begin{bmatrix} 1 \\ 1 \end{bmatrix},\]

so we see that $x^{(k)} \to x^* = 0$ exponentially quickly, at rate $\frac{1}{2}$.
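The hand computation above is easy to reproduce in code. The short sketch below runs ten steps of (GD) on $f(x) = \frac{1}{2}||x||^2$ with $s = \frac{1}{2}$ from $x^{(0)} = (1,1)$; the number of iterations is an arbitrary choice.

```python
import numpy as np

x = np.array([1.0, 1.0])      # initial guess x^(0)
s = 0.5                       # step size

iterates = [x]
for k in range(10):
    x = x - s * x             # (GD) with grad f(x) = x
    iterates.append(x)

for k, x_k in enumerate(iterates):
    print(f"k = {k:2d}: x = {x_k}")
```

Each printed iterate is half the previous one, matching $x^{(k)} = \left(\frac{1}{2}\right)^k x^{(0)}$.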

Example (outline note, to be added): a more interesting, mostly computational example minimizing $||Ax-b||^2$, showing a good step size choice and one for which gradient descent diverges.
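One possible computational sketch of such a $||Ax-b||^2$ example follows. The matrix `A`, the vector `b`, and the two step sizes are made up for illustration. The sketch relies on a standard fact about quadratic costs: with gradient $2A^T(Ax-b)$ and $A$ of full column rank, fixed-step gradient descent converges when $s < 1/\lambda_{\max}(A^TA)$ and generally diverges when $s$ exceeds that threshold.

```python
import numpy as np

# Hypothetical data for f(x) = ||Ax - b||^2
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

def grad_f(x):
    return 2.0 * A.T @ (A @ x - b)

def run_gd(s, num_iters):
    x = np.zeros(2)                        # initial guess x^(0) = 0
    for _ in range(num_iters):
        x = x - s * grad_f(x)              # update rule (GD)
    return x

lam_max = np.linalg.eigvalsh(A.T @ A).max()
s_crit = 1.0 / lam_max                     # step size threshold for this cost

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]        # reference least-squares solution
x_good = run_gd(s=0.5 * s_crit, num_iters=200)     # below the threshold: converges
x_bad = run_gd(s=1.5 * s_crit, num_iters=50)       # above the threshold: diverges

print("least-squares solution:", x_ls)
print("GD with s = 0.5 * s_crit:", x_good)
print("GD with s = 1.5 * s_crit, norm of iterate:", np.linalg.norm(x_bad))
```

With the smaller step size the iterates agree with the least-squares solution, while the larger one makes the norm of the iterate grow at every step.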

## Zig-zags and what to do about them

Let's consider a very simple optimization problem over $(x,y)\in\mathbb{R}^2$ with cost function

\[f(x, y) = \frac{1}{2}(x^2 + by^2)\]

where we'll let $0<b\leq 1$ w.l.o.g. The optimal solution is obviously $(x^*, y^*) = (0,0)$, but we'll use this problem to illustrate how gradient descent can get you into tricky situations.

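Suppose we run gradient descent on $f$, and we further allow ourselves to pick the best possible step size $s_k$ at each iteration, i.e., we choose the step size $s_k>0$ that minimizes $f$ along the ray that starts at the current iterate and points in the direction of the negative gradient (an exact line search).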

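The zig-zag behavior can be observed numerically. The sketch below runs gradient descent on $f(x,y) = \frac{1}{2}(x^2 + by^2)$, choosing at each iteration the step size that exactly minimizes $f$ along the negative gradient. For a quadratic cost $\frac{1}{2}z^THz$ with gradient $g = Hz$, that step size has the closed form $s_k = g^Tg / (g^THg)$. The value $b = 0.1$ and the starting point $(b, 1)$ are assumptions chosen for illustration; for this start, one can check by hand that the iterates are $\left(b(-r)^k, r^k\right)$ with $r = \frac{1-b}{1+b}$.

```python
import numpy as np

b = 0.1
H = np.diag([1.0, b])             # f(x, y) = 0.5 * z^T H z with z = (x, y)
z = np.array([b, 1.0])            # starting point (b, 1), chosen for illustration
r = (1.0 - b) / (1.0 + b)         # per-step contraction factor for this start

iterates = [z]
for k in range(10):
    g = H @ z                     # gradient (x, b * y)
    s_k = (g @ g) / (g @ H @ g)   # exact minimizer of f(z - s * g) over s
    z = z - s_k * g
    iterates.append(z)

for k, z_k in enumerate(iterates):
    print(f"k = {k:2d}: x = {z_k[0]:+.5f}, y = {z_k[1]:.5f}")
```

The sign of the $x$ coordinate flips at every step, so the iterates bounce back and forth across the $y$-axis, and each step only shrinks the iterate by the factor $r$, which approaches 1 as $b \to 0$.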
