---
title: 13.2 Gradient Descent
subject:  Optimization
subtitle: 
short_title: 13.2 Gradient Descent
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/12_Ch_13_Optimization/142-Gradient_Descent.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 21 - An introduction to unconstrained optimization, gradient descent, and Newton’s method.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be found in Chapter 12 of the ROB101 textbook by Jesse Grizzle.

## Learning Objectives

By the end of this page, you should know:
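- how the gradient descent update rule works, and how to choose a stopping criterion,
- why the negative gradient is a descent direction,
- how poorly conditioned problems lead to zig-zagging and slow convergence,
- how Newton's method uses the Hessian to correct for this, and what it costs.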

## Gradient Descent

Our intuition so far is that we should try to "walk downhill" and that the negative gradient $-\nabla f(x)$ points in the direction of steepest descent at the point $x$. Can we turn this into an algorithm for minimizing (at least locally) a cost function $f(x)$?

This intuition is precisely the motivation behind the gradient descent algorithm, which, starting from an initial guess $x^{(0)}$, iteratively updates the current best guess $x^{(k)}$ of a minimizer $x^*$ according to:

\[x^{(k+1)} = x^{(k)} - s\nabla f(x^{(k)}), \quad \text{for } k=0,1,2,\ldots \quad (GD)\]

where $s>0$ is called the step size. For a sufficiently small $s>0$, the update rule (GD) moves you toward a local minimum, but be careful: if $s$ is too large, you can overshoot (we'll see more about this later).

Because we know that $\nabla f(x^*)=0$ if $x^*$ is a local minimum, we can use the norm of the gradient as a stopping criterion, e.g., if $||\nabla f(x^{(k)})||_2 \leq \varepsilon$ for some small $\varepsilon>0$, we stop updating our iterate because $x^{(k)}$ is "close enough" to $x^*$ (typical choices of $\varepsilon$ are $10^{-7}$ or $10^{-6}$, depending on how precise a solution is required).
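A minimal sketch of the update rule (GD) with the gradient-norm stopping criterion might look as follows. The function name `gradient_descent`, the `max_iters` safeguard, and the quadratic test cost are illustrative assumptions, not something prescribed by these notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, s, eps=1e-6, max_iters=10_000):
    """Iterate x <- x - s * grad_f(x) until ||grad_f(x)||_2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion on the gradient norm
            return x, k
        x = x - s * g                  # update rule (GD)
    return x, max_iters                # gave up: step size may be too large or too small

# Hypothetical test cost: f(x) = (x1 - 3)^2 + 2 (x2 + 1)^2, minimized at (3, -1)
grad_f = lambda x: np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

x_hat, iters = gradient_descent(grad_f, x0=[0.0, 0.0], s=0.1)
print("estimate:", x_hat, "iterations:", iters)
```

Capping the number of iterations is a design choice for this sketch: with a poorly chosen step size the stopping criterion may never be met, and the cap keeps the loop from running forever.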

Before looking at some examples of (GD) in action, let's try to get some intuition as to why this works. Suppose we are currently at $x^{(k)}$. Let's form a linear approximation of $f(x)$ near $x^{(k)}$ using its first-order Taylor series approximation:

\[f(x) \approx f(x^{(k)}) + \nabla f(x^{(k)})^T(x-x^{(k)}) \quad (TS)\]

As you can see from the figure, (TS) is a very good approximation of $f(x)$ when $x$ is not too far from $x^{(k)}$, but gets worse as we move further away.

Let's use (TS) to define our next point $x^{(k+1)}$ so that $f(x^{(k+1)}) \leq f(x^{(k)})$. If we define $\Delta x^{(k)} = x^{(k+1)} - x^{(k)}$, then evaluating (TS) at the point $x^{(k+1)}$ gives

\[f(x^{(k+1)}) - f(x^{(k)}) \approx \nabla f(x^{(k)})^T \Delta x^{(k)} (= \langle \nabla f(x^{(k)}), \Delta x^{(k)} \rangle)\]

So if we want $f(x^{(k+1)}) \leq f(x^{(k)})$, we should find a nearby $x^{(k+1)}$ such that

\[\nabla f(x^{(k)})^T \Delta x^{(k)} \leq 0. \quad (*)\]

Now, assuming that $\nabla f(x^{(k)}) \neq 0$ (so we're not at a stationary point), a natural choice for $\Delta x^{(k)}$ is $-s\nabla f(x^{(k)})$, where $s>0$ is a step size chosen small enough that (TS) is a good approximation. In that case, we have

\[\nabla f(x^{(k)})^T \Delta x^{(k)} = -s||\nabla f(x^{(k)})||^2 < 0.\]

In general, then, any choice of $\Delta x^{(k)}$ for which (*) holds with strict inequality is a valid descent direction. Geometrically, this is illustrated in the picture below:

[Image placeholder for descent direction illustration]
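The first-order argument can also be checked numerically. The sketch below compares the predicted change $\nabla f(x^{(k)})^T \Delta x^{(k)} = -s||\nabla f(x^{(k)})||^2$ against the actual change in $f$ for a few step sizes. The test function and the point $x^{(k)} = (1,-2)$ are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical smooth test function and its gradient
f = lambda x: np.log(1.0 + x[0] ** 2) + 0.5 * x[1] ** 2 + x[0] * x[1]
grad_f = lambda x: np.array([2.0 * x[0] / (1.0 + x[0] ** 2) + x[1], x[1] + x[0]])

xk = np.array([1.0, -2.0])
g = grad_f(xk)

results = []
for s in [1e-1, 1e-2, 1e-3]:
    dx = -s * g                      # step along the negative gradient
    predicted = g @ dx               # first-order prediction, equals -s * ||g||^2
    actual = f(xk + dx) - f(xk)      # true change in the cost
    results.append((s, predicted, actual))
    print(f"s={s:g}  predicted={predicted:.6f}  actual={actual:.6f}")
```

As $s$ shrinks, the gap between the predicted and actual change should shrink roughly like $s^2$, which is what we expect from a first-order approximation.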

Example: Let's use gradient descent to minimize $f(x) = \frac{1}{2}||x||^2$. This is a silly example, but one for which we can easily compute the iterates by hand. Here $\nabla f(x) = x$, so our descent direction is $-\nabla f(x) = -x$. Let's use $x\in\mathbb{R}^2$ and an initial guess of $x^{(0)} = (1,1)$. We'll use a step size of $s = \frac{1}{2}$. Then

\[x^{(1)} = x^{(0)} - \frac{1}{2}\nabla f(x^{(0)}) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{2}x^{(0)}\]

\[x^{(2)} = x^{(1)} - \frac{1}{2}\nabla f(x^{(1)}) = \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} - \frac{1}{2}\left(\frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = \frac{1}{4}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \left(\frac{1}{2}\right)^2 x^{(0)}\]

Continuing this pattern,

\[x^{(k)} = \left(\frac{1}{2}\right)^k \begin{bmatrix} 1 \\ 1 \end{bmatrix}.\]

So we see that $x^{(k)} \to x^* = 0$ exponentially quickly, at rate $\frac{1}{2}$.
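The hand computation above can be reproduced in a few lines. This is only a sanity check of the iterates, using the same starting point and step size as the example.

```python
import numpy as np

x = np.array([1.0, 1.0])    # initial guess x^(0)
s = 0.5                     # step size from the example

iterates = [x]
for k in range(5):
    x = x - s * x           # gradient of 0.5 * ||x||^2 is x itself
    iterates.append(x)

for k, xk in enumerate(iterates):
    print(k, xk)
```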

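As a more interesting, mostly computational example, consider gradient descent on the least squares cost $f(x) = \frac{1}{2}||Ax-b||^2$, whose gradient is $\nabla f(x) = A^T(Ax-b)$. The sketch below uses a small matrix $A$ and vector $b$ made up for illustration (the factor of $\frac{1}{2}$ is also a convenience chosen here). For a quadratic cost like this one with $A$ of full column rank, a constant step size gives convergence exactly when $0 < s < 2/\lambda_{\max}(A^TA)$, so we try one step size on each side of that threshold.

```python
import numpy as np

# Hypothetical data, chosen only for illustration
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

def run_gd(s, num_iters=50):
    """Gradient descent on 0.5 * ||Ax - b||^2 from x = 0 with constant step s."""
    x = np.zeros(2)
    for _ in range(num_iters):
        x = x - s * A.T @ (A @ x - b)    # gradient is A^T (A x - b)
    return x

x_star = np.linalg.lstsq(A, b, rcond=None)[0]       # reference least squares solution
lam_max = np.linalg.eigvalsh(A.T @ A).max()         # largest eigenvalue of A^T A

x_good = run_gd(s=0.2)   # below the threshold 2 / lam_max
x_bad = run_gd(s=0.5)    # above the threshold

print("threshold 2/lam_max:", 2.0 / lam_max)
print("s = 0.2, error:", np.linalg.norm(x_good - x_star))
print("s = 0.5, error:", np.linalg.norm(x_bad - x_star))
```

With the smaller step size the error shrinks at every iteration, while with the larger one each step overshoots by more than the last and the iterates blow up.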

## Zig-zags and What to Do About Them

Let's consider a very simple optimization problem over $(x,y)\in\mathbb{R}^2$ with cost function

\[F(x, y) = \frac{1}{2}(x^2 + by^2)\]

where we'll assume $0 < b \leq 1$ w.l.o.g. The optimal solution is obviously $(x^*, y^*) = (0,0)$, but we'll use this example to illustrate how gradient descent can get you into tricky situations.

Suppose we run gradient descent on $F$, and we further allow ourselves to pick the best possible step size $s_k$ at each iteration, i.e., we choose the step size

\[s_k = \operatorname*{argmin}_{s \geq 0} F\left((x^{(k)}, y^{(k)}) - s\nabla F(x^{(k)}, y^{(k)})\right)\]

and apply the update $(x^{(k+1)}, y^{(k+1)}) = (x^{(k)}, y^{(k)}) - s_k \nabla F(x^{(k)}, y^{(k)})$.

This is called exact line search for selecting the step size, and line search methods are widely used in practice.
If we use this choice of step size $s_k$, then it is possible to write an explicit formula
for our iterates $(x^{(k)}, y^{(k)})$ as we progress down the bowl. Starting at $(x^{(0)},y^{(0)})=(b,1)$, we
get
\begin{equation}
x^{(k)} = b \left(\frac{b-1}{b+1}\right)^k, \quad y^{(k)} = \left(\frac{1-b}{1+b}\right)^k, \quad F(x^{(k)},y^{(k)}) = \left(\frac{1-b}{1+b}\right)^{2k} F(x^{(0)},y^{(0)}). \quad (8)
\end{equation}
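Formula (8) can be checked numerically. For a quadratic $F(z) = \frac{1}{2}z^TKz$ with gradient $g = Kz$, minimizing $F(z - sg)$ over $s$ gives the exact line search step $s = \frac{g^Tg}{g^TKg}$. The sketch below uses that step, with the value $b = 0.1$ and the variable names chosen here for illustration, and compares the iterates against the closed form.

```python
import numpy as np

b = 0.1                          # try other values with 0 < b < 1
K = np.diag([1.0, b])            # F(x, y) = 0.5 * [x, y] K [x, y]^T
r = (1.0 - b) / (1.0 + b)

z = np.array([b, 1.0])           # starting point (x0, y0) = (b, 1)
iterates, closed_form = [], []
for k in range(6):
    iterates.append(z.copy())
    closed_form.append(np.array([b * (-r) ** k, r ** k]))   # equation (8)
    g = K @ z                    # gradient of F at the current iterate
    s = (g @ g) / (g @ K @ g)    # exact line search step for a quadratic
    z = z - s * g

for k in range(6):
    print(k, iterates[k], closed_form[k])
```

Note how the sign of the $x$ coordinate flips at every iteration: this is the zig-zag behavior in numerical form.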

If $b=1$, which corresponds to a function whose level sets are perfect circles,
we succeed immediately in one step ($x^{(1)} = y^{(1)} = 0$). This is because the negative gradient
always points directly at the optimal point $(0,0)$.

[Image placeholder: circular level sets in the $b=1$ case, where $-\nabla F(x,y)$ always points to the optimal point $x^* = (0,0)$]

The real purpose of this example is seen when $b$ is small. The crucial ratio in
equation (8) is
\[
r = \frac{1-b}{1+b}.
\]

If $r$ is small, $(x^{(k)},y^{(k)})$ converges to $(0,0)$ very quickly. However, if $r$ is close to
1, then this convergence is very slow. For example, if $b = \frac{1}{10}$, then $r = \frac{9}{11}$, and if $b = \frac{1}{100}$, then $r = \frac{99}{101}$.
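To get a feel for what these ratios mean, recall from equation (8) that the cost shrinks by a factor of $\left(\frac{1-b}{1+b}\right)^2$ per iteration. The sketch below computes the smallest number of iterations $k$ needed to reduce the cost by a given factor. The helper name `iterations_needed` and the tolerance $10^{-6}$ are assumptions made for illustration.

```python
import math

def iterations_needed(b, tol=1e-6):
    """Smallest k with ((1 - b) / (1 + b))**(2k) <= tol, for 0 < b < 1."""
    r = (1.0 - b) / (1.0 + b)
    return math.ceil(math.log(tol) / (2.0 * math.log(r)))

for b in [0.5, 0.1, 0.01, 0.001]:
    print(f"b = {b:g}: about {iterations_needed(b)} iterations")
```

The iteration count grows roughly in proportion to $\frac{1}{b}$ as $b$ shrinks.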

This means we need to take many more gradient steps to get close to $x^* = (0,0)$. The
picture below highlights what's going wrong: for small $b$, the level sets become elongated
ellipses, so that following gradients leads to us zig-zagging our way to the origin instead of
taking a straight path. It is this zig-zagging that causes slow convergence.

[Image placeholder: elongated elliptical level sets in the $(x_1, x_2)$ plane for small $b$, with the gradient descent iterates zig-zagging from $x^{(0)}$ toward the origin]


So what's going wrong here? If we write $F(x,y)$ as a quadratic form, it is:

\[
F(x,y) = \frac{1}{2}\begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & b \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{2}\begin{bmatrix} x & y \end{bmatrix} K \begin{bmatrix} x \\ y \end{bmatrix}, \quad \text{where } K = \begin{bmatrix} 1 & 0 \\ 0 & b \end{bmatrix}.
\]

Notice that the condition number of the matrix $K$ is precisely $\frac{1}{b}$, which is large for
small $b$. This means that one direction (in this case the x-axis) is penalized much more
than the other (the y-axis). This leads to stretched out level sets, which leads to
zig-zags and slow convergence. How can we fix this?
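The condition number claim is easy to confirm numerically. This short check uses NumPy's `np.linalg.cond`, which by default returns the 2-norm condition number (the ratio of the largest to the smallest singular value); the values of $b$ are arbitrary.

```python
import numpy as np

condition_numbers = {}
for b in [1.0, 0.1, 0.01]:
    K = np.diag([1.0, b])                        # matrix of the quadratic form
    condition_numbers[b] = np.linalg.cond(K)     # 2-norm condition number
    print(f"b = {b:g}: cond(K) = {condition_numbers[b]:.1f}")
```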

## Newton's Method

To derive the gradient descent method (GD), we used a first-order approximation of
$F(x)$ near $x^{(k)}$ to figure out what direction we should move in. But in this example
we saw that there are steeper and shallower directions, along which the function changes at very different rates (look at the level sets of the
function in the figure; they are all very thin ellipses). A first-order approximation alone isn't enough to capture this.

We should also account for how quickly the gradient changes: we need to compute the
"gradient of the gradient", a.k.a. the Hessian of $F$ at $x$. The Hessian of a twice continuously differentiable function $f: \mathbb{R}^n \to \mathbb{R}$
is an $n \times n$ symmetric matrix with entries given by the second-order partial derivatives of $f$:

\[
[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)
\]
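When second derivatives are tedious to compute by hand, the entries of the Hessian can be approximated with central finite differences. The sketch below is one simple way to do this; the helper name `numerical_hessian`, the spacing `h`, and the test point are assumptions for illustration, and in practice automatic differentiation is often preferred.

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the Hessian of f at x with central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    E = np.eye(n)                       # rows are the coordinate directions e_i
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h * E[i] + h * E[j]) - f(x + h * E[i] - h * E[j])
                       - f(x - h * E[i] + h * E[j]) + f(x - h * E[i] - h * E[j])) / (4.0 * h * h)
    return H

# Test on F(x, y) = 0.5 * (x^2 + b y^2) with the illustrative value b = 0.1
b = 0.1
F = lambda z: 0.5 * (z[0] ** 2 + b * z[1] ** 2)
H = numerical_hessian(F, [0.3, -0.7])
print(H)
```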

The Hessian tells us how quickly the gradient changes, in the same way that $f'(x)$ tells us
how quickly $f(x)$ changes for a scalar function $f(x)$. The Hessian lets us make
a second-order Taylor series approximation to our function $F(x)$ near our current
guess $x^{(k)}$:

\[
F(x) \approx F(x^{(k)}) + \nabla F(x^{(k)})^T(x-x^{(k)}) + \frac{1}{2}(x-x^{(k)})^T \nabla^2 F(x^{(k)})(x-x^{(k)}), \quad (6)
\]

which provides a local quadratic approximation to $F(x)$ near $x^{(k)}$:

[Image placeholder: $F(x)$ and its local quadratic approximation near the point $(x^{(k)}, F(x^{(k)}))$]

As before, if we let $\Delta x^{(k)} = x^{(k+1)} - x^{(k)}$, we can rewrite (6) as

\[
F(x^{(k+1)}) - F(x^{(k)}) \approx \nabla F(x^{(k)})^T \Delta x^{(k)} + \frac{1}{2} (\Delta x^{(k)})^T \nabla^2 F(x^{(k)}) \Delta x^{(k)}. \quad (7a)
\]

Since we want to make $F(x^{(k+1)}) - F(x^{(k)})$ as small as possible, it makes sense
to pick $\Delta x^{(k)}$ to minimize the RHS of (7a), which is another minimization problem!

We'll focus on the setting where $\nabla^2 F(x^{(k)})$ is positive definite, which corresponds to
settings where our function is locally strictly convex. In this case, the RHS of (7a) is a positive definite
quadratic function of $\Delta x^{(k)}$, which is minimized at:

\[
\Delta x^{(k)} = - \nabla^2 F(x^{(k)})^{-1} \nabla F(x^{(k)}).
\]
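One way to fill in this minimization step is sketched below, writing $g = \nabla F(x^{(k)})$ and $H = \nabla^2 F(x^{(k)})$ as shorthand (notation introduced only for this derivation). The quadratic model we want to minimize over $\Delta x$ is

\[
q(\Delta x) = g^T \Delta x + \frac{1}{2} \Delta x^T H \Delta x, \quad \text{with gradient} \quad \nabla q(\Delta x) = g + H \Delta x.
\]

Setting this gradient to zero gives the linear system

\[
H \Delta x = -g \quad \Rightarrow \quad \Delta x = -H^{-1} g,
\]

where the inverse exists because $H$ is positive definite. Since the Hessian of $q$ is $H$ itself, which is positive definite, this stationary point is the unique minimizer of $q$.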

Using this descent direction instead of $-\nabla F(x^{(k)})$ yields Newton's Method:

\[
x^{(k+1)} = x^{(k)} - \nabla^2 F(x^{(k)})^{-1} \nabla F(x^{(k)}).
\]

The idea behind Newton's Method is to "unstretch" the stretched out directions,
so that to our algorithm, the level sets of a function are locally circles. If
we look at our test example above, note that:

\[
\nabla^2 F(x) = \begin{bmatrix} 1 & \\ & b \end{bmatrix} \Rightarrow \nabla^2 F(x)^{-1} = \begin{bmatrix} 1 & \\ & \frac{1}{b} \end{bmatrix}
\]

so that for $x^{(0)} = (b,1)$, we have:

\begin{align*}
x^{(1)} &= x^{(0)} - \nabla^2 F(x^{(0)})^{-1} \nabla F(x^{(0)}) \\
&= \begin{bmatrix} b \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & \\ & \frac{1}{b} \end{bmatrix} \begin{bmatrix} b \\ b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\end{align*}

i.e., we converge in one step no matter what the choice of $b$ is in $F(x,y) = \frac{1}{2}(x^2+by^2)$!
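The one-step convergence can be confirmed numerically. The sketch below takes a single Newton step from $(b,1)$ for a few values of $b$; the helper name `newton_step` is an assumption for illustration. It computes the step by solving a linear system with `np.linalg.solve` rather than forming the inverse Hessian explicitly, which is the usual numerical practice.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton update: solve hess(x) dx = -grad(x), then move to x + dx."""
    dx = np.linalg.solve(hess(x), -grad(x))
    return x + dx

results = {}
for b in [1.0, 0.1, 0.01]:
    grad = lambda z, b=b: np.array([z[0], b * z[1]])   # gradient of 0.5 * (x^2 + b y^2)
    hess = lambda z, b=b: np.diag([1.0, b])            # constant Hessian
    results[b] = newton_step(grad, hess, np.array([b, 1.0]))
    print(f"b = {b:g}: x1 = {results[b]}")
```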

The cost of this fast convergence, though, is that at each update we need to solve
a linear system of equations of the form

\[
\nabla^2 F(x^{(k)}) \Delta x^{(k)} = -\nabla F(x^{(k)})
\]

which may be expensive if $x$ is a high-dimensional vector. It is for this reason that
gradient-descent-based methods are the predominant methods used in machine learning,
where the dimensionality of $x$ can often be on the order of millions or
billions.

