## Gradient Descent

Now that we've seen the best direction of $\Delta \theta$ to use when updating $\theta$, let's use this for an algorithm called *gradient descent*:

1. Pick a constant $\eta$ called the *learning rate*.
2. Pick an $\epsilon$ which will determine when to stop the algorithm.
3. Pick a random starting point $\theta_0$. Set $i := 0$.
4. If $||\nabla E(\theta_i)|| < \epsilon$, stop and return $\theta_i$.
5. Else, set $\theta_{i + 1} := \theta_i - \eta (\nabla E(\theta_i))$.
6. Set $i := i + 1$.
7. Goto Step 4.

This algorithm first checks whether the gradient is so small that further changes to $\theta$ won't improve $E(\theta)$ enough to be worth it. If so, then we can stop and just return the current $\theta_i$.

Otherwise, we need to update $\theta_i$. We know that the best direction to move in is $-\nabla E(\theta_i)$. How much should we move, though? That is determined by the learning rate: the smaller $\eta$ is, the smaller a step we will take. Since $\nabla E(\theta)$ will change continuously as we vary $\theta$, $\nabla E(\theta) \cdot \Delta \theta$ is only an approximation of $\Delta E(\theta)$. This approximation is more accurate for smaller steps.
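The steps above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the quadratic `grad_E`, the fixed (rather than random) starting point, and the `max_steps` safety cap are all made up for illustration and are not part of the algorithm as stated.

```python
import math

def gradient_descent(grad, theta0, eta, epsilon, max_steps=10_000):
    """Run steps 3-7 of the algorithm; max_steps is an added safety cap."""
    theta = list(theta0)
    for _ in range(max_steps):
        g = grad(theta)
        # Step 4: stop once the gradient norm drops below epsilon.
        if math.sqrt(sum(gi * gi for gi in g)) < epsilon:
            break
        # Step 5: move against the gradient, scaled by the learning rate.
        theta = [ti - eta * gi for ti, gi in zip(theta, g)]
    return theta

# Hypothetical quadratic error E(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2,
# whose minimum is at theta = (3, -1).
def grad_E(theta):
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta_hat = gradient_descent(grad_E, theta0=[0.0, 0.0], eta=0.1, epsilon=1e-6)
print(theta_hat)
```

With these made-up settings the returned `theta_hat` should land very close to $(3, -1)$.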

Let's see some examples of how gradient descent proceeds with different learning rates!

In [1]:
from examples.gradient_descent_example import GradientDescentExample

GradientDescentExample.run()

Learning_rate = 0.01


Learning_rate = 0.3


Learning_rate = 0.39


In the first example, we see that if $\eta$ is chosen too small, the steps will be overly conservative. The step taken does not reach the point where $\nabla E(\theta_{i + 1}) \cdot \nabla E(\theta_i) = 0$. That means we could have taken a larger step in the direction of $-\nabla E(\theta_i)$ and gone even further down.

Since we *undershoot* the optimum step size by such a great factor, it will take a long time for the model parameters to approach the optimum. You can see that I ran this for 30 steps and it still doesn't get very close to the optimum parameters.

In the second example we see a pretty good learning rate. In just ten steps it has gotten very close to the optimum parameters. You can see that it *overshoots* and goes past the point where $\nabla E(\theta_{i + 1}) \cdot \nabla E(\theta_i) = 0$. That means we have gone all the way downhill in the direction of $-\nabla E(\theta_i)$, and now even gone a little ways *uphill* past the minimum along this axis of update.

Still, even though we overshoot the best step size, overall the error is improved each time, and the size of the overshoot is less for each successive step. We converge very fast despite the overshooting.

In the last example, we see a learning rate that is too large. Here the overshoot is so large that each step *increases* the error, so each step results in a worse $\theta_{i + 1}$ than the last. When the learning rate is too large like this, gradient descent will never converge to the optimum $\theta$. It won't converge to any point at all! We say that the algorithm *diverges*.
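To see the three regimes in numbers rather than pictures, here is a small sketch on a one-dimensional quadratic. The error surface $E(\theta) = \frac{1}{2} \cdot 5.5 \cdot \theta^2$ is an assumption: it is not the surface used in the plots, and the curvature of 5.5 was picked only so that the same three learning rates fall into the same three regimes.

```python
CURVATURE = 5.5  # assumed value, chosen for illustration only

def error(theta):
    return 0.5 * CURVATURE * theta ** 2

def run(eta, steps, theta=1.0):
    """Return the errors E(theta_0), ..., E(theta_steps) along the descent."""
    errors = [error(theta)]
    for _ in range(steps):
        # The gradient of this E is CURVATURE * theta.
        theta = theta - eta * CURVATURE * theta
        errors.append(error(theta))
    return errors

slow = run(eta=0.01, steps=30)   # undershoots: error shrinks slowly
good = run(eta=0.3, steps=10)    # overshoots a little, but still shrinks quickly
bad = run(eta=0.39, steps=10)    # overshoots so much that the error grows

for name, errs in [("0.01", slow), ("0.3", good), ("0.39", bad)]:
    print(f"eta={name}: E went from {errs[0]:.4f} to {errs[-1]:.4f}")
```

On this toy surface each update multiplies $\theta$ by $1 - 5.5 \eta$, so the error shrinks only when that factor has magnitude below one. A negative factor means $\theta$ changes sign each step, which is the overshoot.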

So what is the right learning rate to choose? The learning rate is called a *hyperparameter*. It isn't a parameter of the model you end up choosing (those are $\theta_0, \theta_1$), but it is like a *knob* or dial which you can use to parameterize the *learning algorithm*.

In machine learning, we often just have to try out different hyperparameter settings and see which ones work the best. As a rule of thumb, learning rates of $1.0, 0.1, 0.01, 0.001$ seem to be good initial guesses to explore. When searching for the right learning rate, we often try out rates which differ by a factor of ten (or sometimes a factor of two) so that we can quickly try out some different values and narrow down the range of $\eta$ under consideration.
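A search like this can be sketched as a loop over candidate rates. The toy error surface (a one-dimensional quadratic with an assumed curvature of 5.5), the step budget, and the helper name `final_error` are all made up for this example.

```python
def final_error(eta, steps=30, theta=1.0, curvature=5.5):
    """Run gradient descent on E(theta) = 0.5 * curvature * theta^2 and return the final error."""
    for _ in range(steps):
        theta -= eta * curvature * theta
    return 0.5 * curvature * theta ** 2

# Candidate rates differing by a factor of ten, as suggested above.
candidates = [1.0, 0.1, 0.01, 0.001]
results = {eta: final_error(eta) for eta in candidates}
best_eta = min(results, key=results.get)

for eta, err in results.items():
    print(f"eta={eta}: final error {err:.3g}")
print("best:", best_eta)
```

Once a coarse search like this picks out a promising rate, you could repeat it with rates a factor of two apart around that value.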

This is one of the annoying things about machine learning: we often need to try out a bunch of hyperparameters without much way to foresee which ones will work.

Note that the magnitude of the update is proportional not just to the learning rate $\eta$, but also to the magnitude of the gradient $\nabla E(\theta_i)$. That is: when the gradient is shallower, the step size will be smaller. This makes sense because the nature of a quadratic surface is that the minimum is closer when the gradient is shallower.
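A one-dimensional sketch makes that last claim concrete. The curvature $c > 0$ and the minimizer $\hat{\theta}$ are symbols introduced only for this illustration:

\\[
E(\theta) = \frac{c}{2} (\theta - \hat{\theta})^2
\quad \Rightarrow \quad
\nabla E(\theta) = c (\theta - \hat{\theta})
\quad \Rightarrow \quad
|\theta - \hat{\theta}| = \frac{|\nabla E(\theta)|}{c}
\\]

So on this surface, a gradient half as large means the minimum is half as far away. In more than one dimension the curvature can differ from direction to direction, so the relationship holds only roughly.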

Let's see what would happen if we took a *constant* step size. That is, if we always took a step of length $\eta$ in the direction of $-\nabla E(\theta_i)$, but didn't adjust the step size for the magnitude of the gradient. One way to do this would be:

\\[
\theta_{i + 1} := \theta_i - \eta \frac{\nabla E(\theta_i)}{||\nabla E(\theta_i)||}
\\]

This divides $\nabla E(\theta)$ by its norm. This means that:

\\[
\left|\left|
  \frac{\nabla E(\theta)}{||\nabla E(\theta)||}
\right|\right|
=
1
\\]

When you divide a vector $v$ by its norm like this, we call this *normalizing* $v$.

So what we have done is change our update rule to always take a step of exactly length $\eta$ in the direction of the negative gradient of $E$. Let's see how this plays out:

In [1]:
from examples.gradient_descent_constant_step_example import GradientDescentConstantStepExample

GradientDescentConstantStepExample.run()

Learning_rate = 0.5


What is now happening is that gradient descent succeeds in moving toward the global minimum, but once it is in the vicinity, it keeps bouncing around. What we really want is for the updates to "cool down" as we approach the optimum $\theta$. Without cooling down, our algorithm will never truly converge.
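The bouncing is easy to reproduce on a toy problem. The sketch below assumes a made-up quadratic $E(\theta) = \theta_0^2 + \theta_1^2$ with its minimum at the origin, and a made-up starting point. It records the distance to the minimum after every normalized update.

```python
import math

def grad_E(theta):
    # Hypothetical quadratic E(theta) = theta_0^2 + theta_1^2, minimized at the origin.
    return [2.0 * theta[0], 2.0 * theta[1]]

def constant_step_descent(theta, eta, steps):
    """Take updates of length exactly eta along the negative gradient."""
    distances = []
    for _ in range(steps):
        g = grad_E(theta)
        norm = math.hypot(g[0], g[1])
        if norm == 0.0:
            # Exactly at the minimum; normalizing would divide by zero.
            break
        theta = [t - eta * gi / norm for t, gi in zip(theta, g)]
        distances.append(math.hypot(theta[0], theta[1]))
    return distances

distances = constant_step_descent(theta=[2.2, 0.0], eta=0.5, steps=12)
print([round(d, 2) for d in distances])
```

With these settings the distance drops steadily at first, then gets stuck alternating between roughly 0.2 and 0.3. Every step has length 0.5, so once $\theta$ is within 0.5 of the minimum it has to jump over it.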

It turns out that for quadratic error surfaces, provided that $\eta$ is small enough, the gradient descent updates will be guaranteed to cool down. That's what we saw above when $\eta = 0.3$.

In the future we will deal with error surfaces that are not quadratic. This happens when we have models that are not linear. In that case, there may not be any choice of $\eta$ that is guaranteed to cause the updates to cool down, so we sometimes *decay* $\eta$ as we perform additional updates. One common way to do this is called *exponential decay*. After each update to $\theta_{i + 1}$, we also update:

\\[
\eta_{i + 1} = \eta_i * (1 - \text{decay_factor})
\\]

Here the decay factor is often something like $0.01$. Since we multiply $\eta$ by $0.99 < 1.0$ at each step, it will keep getting smaller.

Decaying the learning rate like this will cause a cool down in updates and ensure that we converge and don't keep bouncing around.
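Here is a minimal sketch of that cool down, reusing the constant-step update so that the decay is the only thing shrinking the steps. The one-dimensional error $E(\theta) = \theta^2$, the starting point, and the function name are assumptions made for illustration.

```python
def decayed_constant_step(theta, eta, decay_factor, steps):
    """1-D constant-length descent on the hypothetical E(theta) = theta^2 with a decaying eta."""
    for _ in range(steps):
        grad = 2.0 * theta
        if grad == 0.0:
            break
        # In one dimension, the normalized negative gradient is just a sign.
        direction = -1.0 if grad > 0 else 1.0
        theta += eta * direction
        # Exponential decay: shrink the learning rate after every update.
        eta *= 1.0 - decay_factor
    return theta, eta

theta, eta = decayed_constant_step(theta=2.2, eta=0.5, decay_factor=0.01, steps=1000)
print(theta, eta)
```

Because each jump is a little shorter than the last, the bouncing around the minimum dies out instead of continuing forever. One caveat with this sketch: the step lengths sum to at most $\eta_0 / \text{decay_factor}$, so a starting point farther away than that would never reach the minimum.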

For now I won't talk more about learning rate decay, though. As I mentioned, it isn't needed for quadratic optimization if you choose a good learning rate. You can just keep it in mind for a later lecture.