## Gradient Descent

The problem with the methods we've looked at thus far is that they all require inverting the Hessian matrix, which can be very slow when there are many parameters.

Here is a simpler approach that uses only the gradient, not the Hessian. Consider $\nabla E(\hat\theta_{i, 0}, \hat\theta_{i, 1})$, the vector of partial derivatives of $E$ at the current estimate. Let's consider a change $(\Delta \hat\theta_0, \Delta \hat\theta_1)$. Then, using the tangent plane approximation to the quadratic surface, we should have:

\\[
\Delta E
\approx
    \nabla E(\hat\theta_{i, 0}, \hat\theta_{i, 1}) \cdot (\Delta \hat\theta_0, \Delta \hat\theta_1)
= \frac{\partial E}{\partial \hat\theta_0} \Delta\hat\theta_0
  + \frac{\partial E}{\partial \hat\theta_1} \Delta\hat\theta_1
\\]

Now, this is only an approximation to $\Delta E$, because the error surface is quadratic, not linear. Therefore, as we change $\hat\theta_0, \hat\theta_1$, the partial derivatives will change. Still, just like $f'(x)$ is the slope of the line tangent to $f$ at $x$, $\nabla E(\hat\theta)$ is the gradient of the plane tangent to $E$ at $\hat\theta$. (Here $\hat\theta$ denotes the vector $(\hat\theta_0, \hat\theta_1)$. From here on I will drop the hat and simply write $\theta$.)

In other words, $\nabla E(\theta)$ gives us the best linear approximation to the quadratic surface. For small $\Delta\theta$ the approximation should stay pretty good.
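To make the quality of the linear approximation concrete, here is a minimal sketch that compares the true change in $E$ with the estimate $\nabla E(\theta) \cdot \Delta\theta$ as the step shrinks. The toy data set and the function names (`error`, `gradient`, `linear_estimate`, `actual_change`) are assumptions made up for illustration; $E$ is taken to be a sum of squared errors for a line $y = \theta_0 + \theta_1 x$.

```python
# Hypothetical toy data for the line y = theta0 + theta1 * x (illustration only).
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]

def error(theta):
    """Sum of squared errors E(theta) on the toy data."""
    t0, t1 = theta
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(XS, YS))

def gradient(theta):
    """Vector of partial derivatives of E at theta."""
    t0, t1 = theta
    residuals = [t0 + t1 * x - y for x, y in zip(XS, YS)]
    return (2 * sum(residuals), 2 * sum(r * x for r, x in zip(residuals, XS)))

def linear_estimate(theta, delta):
    """Tangent-plane estimate of the change in E: grad E . delta."""
    g = gradient(theta)
    return g[0] * delta[0] + g[1] * delta[1]

def actual_change(theta, delta):
    """True change in E after moving from theta to theta + delta."""
    moved = (theta[0] + delta[0], theta[1] + delta[1])
    return error(moved) - error(theta)

theta = (0.0, 0.0)
for scale in (1.0, 0.1, 0.01):
    delta = (0.5 * scale, -0.3 * scale)
    # Columns: step scale, true change in E, linear estimate of the change.
    print(scale, actual_change(theta, delta), linear_estimate(theta, delta))
```

Because $E$ is quadratic, the gap between the two columns is a second-order term, so it shrinks roughly a hundredfold each time the step shrinks tenfold.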

Now, remember what we want to do: we want to find a minimum of the error surface $E$. But instead of trying to make a big jump to try to zero the partial derivatives, why don't we just try to make a small step in the downhill direction of the quadratic surface?

For instance, any update $\Delta\theta$ where we have $\nabla E(\theta) \cdot \Delta\theta < 0$ should reduce the error, provided $\Delta\theta$ is small enough that the approximation to $\Delta E$ is still good.
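The caveat about step size can be checked numerically. The sketch below uses a made-up toy data set and hypothetical helper names (`error`, `gradient`, `dot`, `change_in_error`); it picks an arbitrary direction whose dot product with the gradient is negative, then tries a small and a large step along it.

```python
# Hypothetical toy data for the line y = theta0 + theta1 * x (illustration only).
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]

def error(theta):
    """Sum of squared errors E(theta) on the toy data."""
    t0, t1 = theta
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(XS, YS))

def gradient(theta):
    """Vector of partial derivatives of E at theta."""
    t0, t1 = theta
    residuals = [t0 + t1 * x - y for x, y in zip(XS, YS)]
    return (2 * sum(residuals), 2 * sum(r * x for r, x in zip(residuals, XS)))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def change_in_error(theta, delta):
    """True change in E after moving from theta to theta + delta."""
    moved = tuple(t + d for t, d in zip(theta, delta))
    return error(moved) - error(theta)

theta = (0.0, 0.0)
direction = (1.0, 1.0)  # arbitrary choice; its dot product with the gradient is negative

small_step = tuple(0.01 * d for d in direction)
large_step = tuple(10.0 * d for d in direction)

print(dot(gradient(theta), direction))      # negative: downhill to first order
print(change_in_error(theta, small_step))   # negative: the error goes down
print(change_in_error(theta, large_step))   # positive: the step overshoots
```

The same direction helps or hurts depending only on the step length, which is why direction and magnitude are worth treating as separate questions.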

Since we want to focus on small $\Delta\theta$, let's consider taking a step of length $\epsilon$. The length of a vector $v$ is given by $\sqrt{\sum_i v_i^2}$. This is sometimes called the *Euclidean length*, or more commonly the *norm* of $v$. It is a generalization of the Pythagorean theorem. We write the norm of a vector as $||v||$.

When we write $\alpha v$, where $\alpha \in \mathbb{R}$ (a real number) and $v \in \mathbb{R}^n$ (an $n$-dimensional vector of real numbers), we mean multiplying every component $v_i$ of $v$ by $\alpha$ to produce $\alpha v = (\alpha v_1, \alpha v_2, \ldots, \alpha v_n)$. A one-dimensional number like $\alpha$ is called a *scalar*; stretching/shrinking a vector by multiplying each coordinate by a scalar is called *scalar multiplication*.

You may verify that $||\alpha v|| = |\alpha| \, ||v||$ (the absolute value matters when $\alpha$ is negative). So considering all vectors of length $\epsilon$ means considering every vector which can be written as $\epsilon u$, where $||u|| = 1$. A vector with $||u|| = 1$ is called a *unit* vector; unit vectors are often used when we want to focus on *direction* rather than length.
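As a quick sanity check, here is a small sketch of the norm, scalar multiplication, and rescaling to a unit vector. The helper names `norm`, `scale`, and `unit` are made up for this example. Note that scaling by a negative $\alpha$ multiplies the length by $|\alpha|$, since a length cannot be negative.

```python
import math

def norm(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

def scale(alpha, v):
    """Scalar multiplication: multiply every component of v by alpha."""
    return [alpha * c for c in v]

def unit(v):
    """Rescale a nonzero vector v to length 1, keeping its direction."""
    return scale(1.0 / norm(v), v)

v = [3.0, 4.0]
print(norm(v))                # 5.0
print(norm(scale(-2.0, v)))   # 10.0, i.e. |-2| * 5

epsilon = 0.01
step = scale(epsilon, unit(v))  # a step of length epsilon in the direction of v
print(norm(step))
```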

Therefore, when picking an update $\Delta\theta$, I want to separate the questions of the *magnitude* of the update ($||\Delta\theta||$) and the *direction* of the update. Let's focus on the direction first.

So what direction $u$ should we move in? Well, first note that any direction where $\nabla E(\theta) \cdot u > 0$ is heading *uphill* on the error surface. To see why, expand the dot product:

\\[
\nabla E(\theta) \cdot u = \sum_i \frac{\partial E}{\partial \theta_i}(\theta) u_i
\\]

An update in the direction of $u$ is effectively a *linear combination* (a "weighted sum") of updates along each coordinate dimension $\theta_i$, each of which impacts $E(\theta)$ via the slope $\frac{\partial E}{\partial \theta_i}(\theta)$. If the weighted sum of those slopes is positive, the error increases.

On the other hand, $\nabla E(\theta) \cdot u = 0$ means you are traveling sideways on the error surface. A $u$ like this is *parallel to the contour* at $\theta$. That's because when you're on a contour line, if you move along that line, you don't change your height. Since moving in the direction $u$ doesn't change your height, it must be the direction of the contour line.

So we know we want $\nabla E(\theta) \cdot u < 0$. In fact, the best direction $u$ would be the one where $\nabla E(\theta) \cdot u$ is minimized and is as negative as possible. That would be the direction *most* downhill, and would give us the most "bang for our buck" with a small move $\epsilon$.

Let's continue to think about this. The direction most uphill at $\theta$ is perpendicular to the contour line at $\theta$. Likewise, the direction most downhill is *also* perpendicular to the contour line, but points opposite to the most uphill direction.

**TODO**: Image of contour line and vectors up and downhill.

I claim that the most uphill direction is in fact exactly the direction of $\nabla E(\theta)$. To show this, I will show that for the best $u$ and any pair of components $u_i, u_j$ (with $i\ne j$), we must have:

\\[
\frac{
    \left(\partial E / \partial \theta_i\right)(\theta)
}{
    \left(\partial E / \partial \theta_j\right)(\theta)
}
=
\frac{u_i}{u_j}
\\]

If this were not true, I say that you could modify $u$ to $u'$ such that (a) $\nabla E(\theta) \cdot u' > \nabla E(\theta) \cdot u$, and (b) $||u'|| = ||u||$. In other words, for the same size of step, you could improve $u$ to $u'$ so that you will travel more uphill.

So let's show it. We start by asking how much a change in $u_i$ forces us to change $u_j$ in order to keep $||u||$ constant. For that we need $\frac{\partial ||u||}{\partial u_i}$. Recall that:

\\[
||u|| = \sqrt{\sum_k u_k^2}
\\]

To take the derivative, we use the chain rule, along with the rule that $\frac{\partial \sqrt{x}}{\partial x} = \frac{1}{2\sqrt{x}}$:

\\[
\begin{align}
\frac{\partial ||u||}{\partial {u_i}} &= \frac{\partial}{\partial u_i} \sqrt{\sum_k u_k^2} \\
&= \frac{1}{2 \sqrt{\sum_k u_k^2}} \left( \frac{\partial}{\partial u_i} \sum_k u_k^2 \right) \\
&= \frac{1}{2 ||u||} \left( 2 u_i \right) \\
&= \frac{u_i}{||u||}
\end{align}
\\]

What this tells us is that a *marginal* (also called an *infinitesimal*) change $\Delta u_i$ to $u_i$ will cause a change in $||u||$ approximately equal to $\frac{u_i}{||u||} \Delta u_i$. Now, if we want to keep $||u||$ constant, then to compensate for a change to $u_i$, we'll have to make a corresponding change to $u_j$. What is the *proportion* of the magnitude of the change?

\\[
\frac{u_i}{||u||} \Delta u_i = -\frac{u_j}{||u||} \Delta u_j \\
\Rightarrow \Delta u_j = -\frac{u_i}{u_j} \Delta u_i
\\]

This shows you that a change to $u_i$ must be compensated for by a change to $u_j$ with a magnitude $\left| \frac{u_i}{u_j} \right|$ times as large as the $u_i$ change. When $u_i$ and $u_j$ have the same sign, the compensating change is in the opposite direction.
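The compensation rule $\Delta u_j = -\frac{u_i}{u_j} \Delta u_i$ is easy to check numerically. This is only a sketch: the unit vector $(0.6, 0.8)$ and the size of the nudge are arbitrary choices, and `norm` is a helper defined here for the example.

```python
import math

def norm(v):
    """Euclidean length of v."""
    return math.sqrt(sum(c * c for c in v))

u = [0.6, 0.8]    # a unit vector (arbitrary example)
i, j = 0, 1
delta_i = 1e-4    # small change applied to u_i

# Change u_i alone: the norm drifts at first order in delta_i.
only_i = list(u)
only_i[i] += delta_i
drift_uncompensated = norm(only_i) - norm(u)

# Also change u_j by -(u_i / u_j) * delta_i to cancel the first-order drift.
delta_j = -(u[i] / u[j]) * delta_i
both = list(u)
both[i] += delta_i
both[j] += delta_j
drift_compensated = norm(both) - norm(u)

print(drift_uncompensated)  # on the order of delta_i
print(drift_compensated)    # far smaller: only a second-order remainder is left
```

The compensated drift is not exactly zero, because the rule comes from a first-order (marginal) argument; the leftover shrinks like $\Delta u_i^2$.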

Now, let's ask how $\nabla E(\theta) \cdot u$ would change if we made a pair of changes to $u_i$ and $u_j$:

\\[
\Delta \left( \nabla E(\theta) \cdot u \right) = \nabla E(\theta) \cdot (\Delta u) =
    \frac{\partial E}{\partial \theta_i}(\theta) (\Delta u_i)
    +
    \frac{\partial E}{\partial \theta_j}(\theta) (\Delta u_j)
\\]

Let us then substitute in what we know about the relationship between $\Delta u_i$ and $\Delta u_j$:

\\[
\begin{align}
\Delta \left( \nabla E(\theta) \cdot u \right) &=
    \frac{\partial E}{\partial \theta_i}(\theta) (\Delta u_i)
    + 
    \frac{\partial E}{\partial \theta_j}(\theta) (-\frac{u_i}{u_j})(\Delta u_i)
\\
&=
    \left(
        \frac{\partial E}{\partial \theta_i}(\theta)
        + 
        \left( -\frac{u_i}{u_j} \right)
        \frac{\partial E}{\partial \theta_j}(\theta)
    \right)
    (\Delta u_i)
\end{align}
\\]

Now, if $u$ is truly the direction most uphill, then a small change in $u_i$ (along with the corresponding change to $u_j$) should cause no first-order change to $\nabla E(\theta) \cdot u$; otherwise we could increase it by nudging $u_i$ one way or the other. From the above equation, this implies that:

\\[
\begin{align}
\frac{\partial E}{\partial \theta_i}(\theta)
+ 
(-\frac{u_i}{u_j})
\frac{\partial E}{\partial \theta_j}(\theta)
&=
0
\\
\Rightarrow
\frac{\partial E}{\partial \theta_i}(\theta)
\Big/
\frac{\partial E}{\partial \theta_j}(\theta)
&=
\frac{u_i}{u_j}
\end{align}
\\]

This is the key to the direction $u$ which is most uphill: the ratio $u_i/u_j$ is equal to the ratio of the corresponding partial derivatives of $E$. What $u$ satisfies this? Setting aside the requirement that $||u|| = 1$ (we can always rescale afterwards), the simplest such $u$ is:

\\[
\begin{align}
u &= \left(
    \frac{\partial E}{\partial \theta_1}(\theta),
    \frac{\partial E}{\partial \theta_2}(\theta),
    \ldots,
    \frac{\partial E}{\partial \theta_k}(\theta) \right)
\\
&= \nabla E(\theta)
\end{align}
\\]

One last pedantic point: there is a second vector that satisfies the criterion that

\\[
\frac{\partial E}{\partial \theta_i}(\theta)
\Big/
\frac{\partial E}{\partial \theta_j}(\theta)
=
\frac{u_i}{u_j}
\\]

This vector is $-\nabla E(\theta)$. So the above condition is a *necessary* but not *sufficient* condition to determine the direction $u$ which is most uphill.

Any vector satisfying the above property is equal to $\alpha \nabla E(\theta)$, for some scalar $\alpha$. Since we only care about direction here, only the sign of $\alpha$ matters, so we really only need to consider $\alpha = 1$ and $\alpha = -1$. One of these is the direction most *uphill* while the other is the direction most *downhill*.

To see which is which, remember that:

\\[
\Delta E \approx \nabla E(\theta) \cdot u = \sum_i \frac{\partial E}{\partial \theta_i}(\theta) u_i
\\]

So if $u = \nabla E(\theta)$, then:

\\[
\nabla E(\theta) \cdot u
= \nabla E(\theta) \cdot \nabla E(\theta)
= \sum_i \left( \frac{\partial E}{\partial \theta_i}(\theta) \right)^2
\\]

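Since this is a sum of squares, it must be positive (as long as the gradient is not the zero vector, in which case every partial derivative is already zero). Likewise, if $u = -\nabla E(\theta)$: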

\\[
\nabla E(\theta) \cdot u
= \nabla E(\theta) \cdot \left( -\nabla E(\theta) \right)
= - \left( \nabla E(\theta) \cdot \nabla E(\theta) \right)
\\]

Since this is the negative of a positive number, it must be negative. Thus we have determined that $\nabla E(\theta)$ is the direction most uphill, and $-\nabla E(\theta)$ is the direction most downhill.
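This conclusion can be checked by brute force in two dimensions. The sketch below sweeps over many unit vectors, computes the slope $\nabla E(\theta) \cdot u$ along each, and compares the best and worst directions with $\pm \nabla E(\theta)$ rescaled to unit length. The toy data, the `gradient` helper, and the number of angles are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy data for the line y = theta0 + theta1 * x (illustration only).
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]

def gradient(theta):
    """Partial derivatives of the sum of squared errors at theta."""
    t0, t1 = theta
    residuals = [t0 + t1 * x - y for x, y in zip(XS, YS)]
    return (2 * sum(residuals), 2 * sum(r * x for r, x in zip(residuals, XS)))

theta = (0.0, 0.0)
g = gradient(theta)
g_norm = math.hypot(g[0], g[1])
g_unit = (g[0] / g_norm, g[1] / g_norm)

# Sweep unit vectors u = (cos a, sin a) and record the slope grad E . u for each.
n_angles = 3600
directions = [
    (math.cos(2 * math.pi * k / n_angles), math.sin(2 * math.pi * k / n_angles))
    for k in range(n_angles)
]
slopes = [g[0] * ux + g[1] * uy for ux, uy in directions]

most_uphill = directions[slopes.index(max(slopes))]
most_downhill = directions[slopes.index(min(slopes))]

print(most_uphill, g_unit)                          # nearly equal
print(most_downhill, (-g_unit[0], -g_unit[1]))      # nearly equal
```

The match is only as fine as the angular grid, so the swept directions agree with $\pm \nabla E(\theta) / ||\nabla E(\theta)||$ up to a small discretization error.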