In [1]:
from IPython.display import HTML
css_file = './custom.css'
# Inject the custom stylesheet; the context manager closes the file handle
with open(css_file, "r") as f:
    css = f.read()
HTML(css)

# Gradient Descent

© 2018 Daniel Voigt Godoy

## 1. Definition

Gradient Descent is a generic optimization algorithm used to find the ***minimum*** of a function (its mirror image, gradient ascent, finds a maximum). For instance, it can be used to minimize the cost function of a Machine Learning algorithm.

It works by tweaking a set of parameters, performing incremental changes to them at every step, gradually converging to the solution (or not!).

The key is the ***incremental changes*** of the parameters. How does it know if it should ***increase*** or ***decrease*** a given parameter? How does it know ***how much*** to change?

This is what the ***partial derivative*** is used for. It tells us how much the ***cost function changes*** if ***one parameter changes a little bit*** while all other parameters are held fixed. Its sign tells us whether to increase or decrease the parameter, and its magnitude tells us how much.

If we want to know ***how much*** $J(w_1, w_2)$ ***changes*** when we ***modify*** the value of $w_1$ ***a bit***, we take the ***partial derivative of*** $J(w_1, w_2)$ ***with respect to*** $w_1$:

$$
\frac{\partial{J(w_1, w_2)}}{\partial{w_1}}
$$

The same holds for $w_2$:

$$
\frac{\partial{J(w_1, w_2)}}{\partial{w_2}}
$$
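To make "changes a little bit" concrete, here is a minimal sketch that approximates both partial derivatives numerically. The cost function `cost`, the helper `partial_derivative` and the step `eps` are assumptions made up for this illustration, not part of the notebook's code.

```python
def cost(w1, w2):
    # Hypothetical cost function, chosen only for illustration
    return w1 ** 2 + 3 * w2 ** 2

def partial_derivative(f, w1, w2, wrt, eps=1e-6):
    """Approximate a partial derivative of f with a central finite difference."""
    if wrt == 1:
        # Nudge w1 only, keep w2 fixed
        return (f(w1 + eps, w2) - f(w1 - eps, w2)) / (2 * eps)
    # Nudge w2 only, keep w1 fixed
    return (f(w1, w2 + eps) - f(w1, w2 - eps)) / (2 * eps)

dJ_dw1 = partial_derivative(cost, 1.0, 2.0, wrt=1)  # analytic value: 2 * w1 = 2
dJ_dw2 = partial_derivative(cost, 1.0, 2.0, wrt=2)  # analytic value: 6 * w2 = 12
print(dJ_dw1, dJ_dw2)
```

In practice, frameworks compute these derivatives analytically or by automatic differentiation; the finite difference is only meant to build intuition.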

So, ***gradient descent*** will compute ***partial derivatives with respect to every weight*** (and the ***bias*** too!).

Then it will ***update each weight*** using its corresponding ***partial derivative*** and a ***multiplying factor*** $\eta$, which is known as the ***learning rate***. For $w_1$, the update is:

$$
w_1 = w_1 - \eta \frac{\partial{J(w_1, w_2)}}{\partial{w_1}}
$$
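The update rule can be sketched in a few lines. This is only an illustration under stated assumptions: the cost $J = w_1^2 + 3 w_2^2$ and the function names `gradient` and `update` are hypothetical, and the same rule is applied to $w_2$ using its own partial derivative.

```python
def gradient(w1, w2):
    # Analytic partial derivatives of the hypothetical cost J = w1**2 + 3 * w2**2
    return 2 * w1, 6 * w2

def update(w1, w2, eta):
    """Apply one gradient descent update to both weights simultaneously."""
    g1, g2 = gradient(w1, w2)
    return w1 - eta * g1, w2 - eta * g2

# One step from (1.0, 2.0) with a learning rate of 0.1
w1, w2 = update(1.0, 2.0, eta=0.1)
print(w1, w2)
```

Note that both partial derivatives are evaluated at the old weights before any weight is changed.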

***IMPORTANT***:
   - The ***learning rate*** is often regarded as the ***single most important hyper-parameter*** to tune when you are using ***Deep Learning*** models!
   - If it is ***too small***, convergence to the solution will be ***extremely slow***, but if it is ***too big***, you may end up ***not converging at all***. You will understand these mechanisms in the ***interactive example*** and the ***experiment***.
   
![](http://cs231n.github.io/assets/nn3/learningrates.jpeg)
<center>Source: CS231n CNN for Visual Recognition</center>
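The effect of the learning rate can be reproduced with a tiny sketch. It assumes a hypothetical one-parameter cost $J(w) = w^2$ and three arbitrarily chosen learning rates; none of these values come from the notebook's widgets.

```python
def run(eta, w=1.0, steps=20):
    """Run gradient descent on the hypothetical cost J(w) = w**2."""
    for _ in range(steps):
        w = w - eta * 2 * w  # dJ/dw = 2 * w
    return w

# Too small, reasonable, and too big (for this particular cost)
for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta}: w after 20 steps = {run(eta):.4g}")
```

For this cost each step multiplies $w$ by $(1 - 2\eta)$, so the smallest rate barely moves towards the minimum at 0, the middle one gets there quickly, and the largest one overshoots more and more at every step.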

After ***updating all weights***, it restarts the process, ***re-evaluating the partial derivatives using the updated weights*** and ***updating all weights one more time***, and so on and so forth!
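Put together, the whole procedure is just a loop. The sketch below is one possible way to write it, assuming a hypothetical cost $J = w_1^2 + 3 w_2^2$; the stopping tolerance `tol` and the cap `max_steps` are also assumptions, since the text does not say when to stop.

```python
def gradient(w1, w2):
    # Partial derivatives of the hypothetical cost J = w1**2 + 3 * w2**2
    return 2 * w1, 6 * w2

def gradient_descent(w1, w2, eta=0.1, max_steps=1000, tol=1e-8):
    """Repeat the update until the gradient is tiny or max_steps is reached."""
    for step in range(max_steps):
        g1, g2 = gradient(w1, w2)  # re-evaluate using the updated weights
        if abs(g1) < tol and abs(g2) < tol:
            break
        w1, w2 = w1 - eta * g1, w2 - eta * g2  # update all weights together
    return w1, w2, step

w1, w2, steps = gradient_descent(1.0, 2.0)
print(w1, w2, steps)
```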

That is just it! No rocket science, it is quite simple, actually!

But ***partial derivatives*** can be a bit intimidating, so let's go through an interactive example!

In [2]:
from intuitiveml.optimizer.GradientDescent import *
vb = VBox(build_figure_deriv())
vb.layout.align_items = 'center'

In [3]:
vb


Click the ***Step*** button once. It will show you two vectors: one ***red*** and one ***gray***.

The ***red*** vector is our ***update*** to the weight.

The ***gray*** vector shows ***how much the cost changes*** given our update.

If you divide their ***lengths***, gray over red, it will give you the ***approximate partial derivative***.

The ***update*** itself equals the ***negative*** of the ***partial derivative*** times the ***learning rate***, so the weight always moves against the slope.

Change the ***learning rate*** to 0.25. If you click the ***Step*** button once again, you should see a much bigger update.
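The relationship between the two vectors can also be checked by hand. The sketch below assumes a hypothetical convex cost $J(w) = w^2$ and a starting point of $w = 1$; the widget's actual function and starting point may differ.

```python
def cost(w):
    # Hypothetical convex cost; the widget's actual function is not assumed here
    return w ** 2

def derivative(w):
    return 2 * w

def step_ratio(w, eta):
    """Return the update (red) and the ratio cost change / update (gray over red)."""
    update = -eta * derivative(w)
    cost_change = cost(w + update) - cost(w)
    return update, cost_change / update

for eta in (0.05, 0.25):
    update, ratio = step_ratio(1.0, eta)
    print(f"eta={eta}: update={update:.3f}, ratio={ratio:.3f}, derivative={derivative(1.0)}")
```

The ratio is only an approximation of the partial derivative: the larger the update, the further it drifts from the true slope at the starting point.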

#### Exercises:

1. Now, choose a different learning rate, reset the plot and follow some steps. Observe the path it traces and check if it hits the minimum. Try different learning rates and see what happens if you choose a really big value.


2. Then, change the function to a ***Non-convex*** one and set the learning rate to the minimum before following some steps. Where does it converge to? Try resetting and observing its path. Does it reach the global minimum? Try different learning rates and see what happens then.

### 1.2 Types of Gradient Descent

There are three types of Gradient Descent, depending on the number of data points used to compute the partial derivatives at each update.

#### 1.2.1 Batch

It uses ***all data points*** to compute the partial derivatives and, therefore, its path towards the solution is stable, yet it is going to be ***very slow*** on large datasets.
 
#### 1.2.2 Stochastic

It uses a ***single data point*** to compute the partial derivatives and, because of that, each update is ***very fast***, but its path towards the solution is going to be ***erratic*** and ***jumpy***.

#### 1.2.3 Mini-Batch

It uses ***some data points*** to compute the partial derivatives and it is a compromise between ***stability*** and ***speed***. The batch size is a ***hyper-parameter*** of its own, although a value of 32 (or another power of 2) is commonly used.
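The three types differ only in how many points feed each update, so a single loop with a `batch_size` argument can cover all of them. This is a minimal sketch under stated assumptions: the one-weight model $\hat{y} = w x$, the noiseless dataset, the learning rate and the function names are all hypothetical and unrelated to the `intuitiveml` package.

```python
import random

def gradient(w, batch):
    """Partial derivative of the mean squared error of y_hat = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, eta=0.05, epochs=50, seed=42):
    """Gradient descent where batch_size selects the type of gradient descent."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a new order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= eta * gradient(w, batch)  # one update per batch
    return w

# Hypothetical noiseless dataset: y = 3 * x for 50 points with x in [0, 1)
points = [(i / 50, 3 * i / 50) for i in range(50)]

for name, size in (("stochastic", 1), ("mini-batch", 16), ("batch", len(points))):
    print(f"{name:>10} (batch_size={size}): w = {train(points, size):.4f}")
```

With the same number of epochs, the stochastic version performs 50 updates per epoch while the batch version performs only one, which is why their results after a fixed number of epochs can differ.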

## 2. Experiment

Time to try it yourself!

There are two parameters, x1 and x2, and we're using Gradient Descent to try to reach the ***minimum*** indicated by the ***star***.

The dataset has only 50 data points.

The controls below allow you to:
- adjust the learning rate
- scale the features x1 and x2
- set the number of epochs (steps)
- set the batch size (since the dataset has 50 points, a size of 64 means using ***all*** points)
- choose the starting point for x1 and x2 (initialization)

Use the controls to play with different configurations and answer the questions below.

In [4]:
x1, x2, y = data()
mygd = plotGradientDescent(x1, x2, y)
vb = VBox(build_figure(mygd))
vb.layout.align_items = 'center'

In [5]:
vb


#### Questions

1. ***Without scaling features***, start with the ***learning rate at minimum***:
    - change the batch size - try ***stochastic***, ***batch*** and ***mini-batch*** sizes - what happens to the trajectory? Why?
    - keeping the ***maximum batch size***, increase the ***learning rate*** to 0.000562 (three notches) - what happens to the trajectory? Why?
    - now gradually reduce the ***batch size*** - what happens to the trajectory? Why?
    - go back to the ***maximum batch size*** and, this time, increase the ***learning rate*** a bit further - what happens to the trajectory? Why?
    - experiment with different settings (still with ***no scaling***), including the initial values of ***x1*** and ***x2***, and try to get as close as possible to the ***minimum*** - how hard is it?
    - what was the ***largest learning rate*** you managed to use successfully?


2. Check ***Scale Features*** - what happened to the surface (cost)? What about its level (look at the scale)?


3. ***Using scaled features***, answer the same items as in ***question 1***.


4. How do you compare the ***performance*** of gradient descent with and without ***scaling***? Why did this happen? (think about the partial derivatives with respect to each feature, especially without scaling)

## 3. More Resources

[Gradient descent, how neural networks learn](https://www.youtube.com/watch?v=IHZwWFHWa-w)

[Intro to optimization in deep learning: Gradient Descent](https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/)

[Stochastic Gradient Descent with momentum](https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d)

[An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)

[Why Momentum Really Works](https://distill.pub/2017/momentum/)

#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).

In [6]:
from IPython.display import HTML
HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')