# How to optimize your networks

## 1. Normalized inputs will speed up training

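- Subtract the mean of each input feature.
- Normalize by dividing by the standard deviation (the square root of the variance), so that every feature has unit variance.
- Learning can be slow when inputs are unnormalized: features on different scales make the cost surface elongated, so gradient descent needs a small learning rate and many steps.
- Example below: the cost surface for normalized and for unnormalized inputs.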

import chart_studio.plotly as py  # plotly.plotly moved to the chart_studio package in plotly 4
import plotly.graph_objs as go

import numpy as np

w1 = np.arange(-5, 5, 0.1)
w2 = np.arange(-5, 5, 0.1)
w1,w2 = np.meshgrid(w1, w2)
J = w1**2+ w2**2

surface = go.Surface(x=w1, y=w2, z=J, colorscale='Viridis')
data = [surface]
layout = go.Layout(
    title='Normalized inputs',
    scene=dict(
        xaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        yaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        zaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='inputs_normalized')

import chart_studio.plotly as py  # plotly.plotly moved to the chart_studio package in plotly 4
import plotly.graph_objs as go

import numpy as np

w1 = np.arange(-50, 50, 0.1)
w2 = np.arange(-5, 5, 0.1)
w1,w2 = np.meshgrid(w1, w2)
J = w1**2+ w2**2

surface = go.Surface(x=w1, y=w2, z=J, colorscale='Viridis')
data = [surface]
layout = go.Layout(
    title='Unnormalized inputs',
    scene=dict(
        xaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        yaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        zaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='inputs_unnormalized')
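
The normalization steps listed at the top of this section can be sketched in NumPy as follows. This is a minimal sketch: the function name `normalize_inputs`, the small `eps` guard and the random data are assumptions for illustration, and examples are assumed to be stored as rows.

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Standardize each feature using statistics from the training set only."""
    mu = X_train.mean(axis=0)      # per-feature mean
    sigma = X_train.std(axis=0)    # per-feature standard deviation
    X_train_norm = (X_train - mu) / (sigma + eps)
    # Reuse the training mean and standard deviation for the test set
    X_test_norm = (X_test - mu) / (sigma + eps)
    return X_train_norm, X_test_norm

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_train = np.column_stack([rng.normal(50, 10, 1000), rng.normal(0, 1, 1000)])
X_test = np.column_stack([rng.normal(50, 10, 200), rng.normal(0, 1, 200)])

X_train_norm, X_test_norm = normalize_inputs(X_train, X_test)
print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```

The test set is scaled with the training statistics so that both sets go through exactly the same transformation.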

## 2. Vanishing or exploding gradients

Example: consider a very deep neural network. Let's assume $g(z)=z$ (so no transformation, just a linear activation function) and all biases equal to 0. Then:

$\hat y = w^{[L]}w^{[L-1]}w^{[L-2]}... w^{[3]}w^{[2]}w^{[1]}x$

Recall that $z^{[1]} = w^{[1]}x$ and that $a^{[1]} = g(z^{[1]}) = z^{[1]}$.

Similarly, $a^{[2]} = g(z^{[2]}) = g(w^{[2]}a^{[1]}) = w^{[2]}w^{[1]}x$.

Imagine 2 nodes in each layer, and every weight matrix except the last one equal to $w^{[l]} = \begin{bmatrix} 1.3 & 0 \\ 0 & 1.3 \end{bmatrix}$. Then:

$\hat y = w^{[L]} \begin{bmatrix} 1.3 & 0 \\ 0 & 1.3 \end{bmatrix}^{L-1}   x$

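Even if the w's are only slightly larger than 1, the activations explode when there are many layers in the network. If they are slightly smaller than 1 (say 0.7 instead of 1.3), the activations shrink exponentially towards 0 instead. The same reasoning applies to the gradients, which is why they vanish or explode.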
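
To get a feel for the numbers, here is a small sketch that pushes an input through 50 identical linear layers. The helper `forward_linear`, the depth of 50 and the scale 0.7 are assumptions chosen for illustration; unlike the formula above, the last layer is not treated separately.

```python
import numpy as np

def forward_linear(scale, n_layers, x):
    """Apply n_layers identical layers W = scale * I with g(z) = z and b = 0."""
    W = scale * np.eye(2)
    a = x
    for _ in range(n_layers):
        a = W @ a          # a[l] = g(W a[l-1]) = W a[l-1]
    return a

x = np.ones(2)
for scale in (0.7, 1.0, 1.3):
    a = forward_linear(scale, 50, x)
    print(scale, np.linalg.norm(a))
```

With a scale of 1.3 the norm of the output grows by a factor of roughly $1.3^{50}$, and with 0.7 it shrinks by roughly $0.7^{50}$.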

https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs

### Solution

Choose your initialization wisely!

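The more input features feed into a unit of layer $l$, the smaller we want each $w_i$ to be. Common rule of thumb: $Var(w_i) = 1/n$ or $2/n$, where $n = n^{[l-1]}$ is the number of inputs to the layer.

Initialize:

`W_l = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)`

where `n_l` is the number of units in layer $l$ and `n_prev` is $n^{[l-1]}$. The factor $2/n^{[l-1]}$ is common for ReLU.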

Different initializations for different activation functions!
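
A minimal sketch of this rule of thumb, with the numerator (1 or 2) exposed as a parameter. The function name `init_weights` and the layer sizes are hypothetical.

```python
import numpy as np

def init_weights(n_prev, n_curr, rng, numerator=2.0):
    """Draw a (n_curr, n_prev) weight matrix with Var(w) = numerator / n_prev."""
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(numerator / n_prev)

rng = np.random.default_rng(0)
W_relu = init_weights(n_prev=500, n_curr=300, rng=rng)                  # Var = 2/n
W_other = init_weights(n_prev=500, n_curr=300, rng=rng, numerator=1.0)  # Var = 1/n

print(W_relu.var(), 2.0 / 500)
print(W_other.var(), 1.0 / 500)
```

The empirical variances should land close to the targets, and they shrink as the number of inputs `n_prev` grows.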

## 3. Optimization

![title](optimizer.png)

What often happens is that gradient descent oscillates strongly, because the derivative is larger in the vertical direction than in the horizontal one.

Some optimization algorithms that work faster than gradient descent:


### 3.1 Gradient descent with momentum
Compute an exponentially weighted average of the gradients and use that average for the update instead of the raw gradient. Because the axes are asymmetric, you want slower learning along one axis and faster learning along the other.

Momentum, on each iteration:

Compute $dW$ and $db$ on the current mini-batch.

Compute $V_{dw} = \beta V_{dw} + (1-\beta)dW$ and

$V_{db} = \beta V_{db} + (1-\beta)db$

--> moving average for the derivatives of W and b

$W := W - \alpha V_{dw}$

$b := b - \alpha V_{db}$

This averages out the gradient descent steps and "dampens" the oscillations.

Generally, $\beta=0.9$ is a good hyperparameter value.
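
The momentum update can be sketched for a single parameter array as below. The elongated quadratic cost, the starting point and the learning rate of 0.07 are assumptions made up for this example.

```python
import numpy as np

def momentum_step(W, dW, V_dW, alpha, beta=0.9):
    """One gradient-descent-with-momentum update for a parameter array W."""
    V_dW = beta * V_dW + (1 - beta) * dW   # moving average of the gradients
    W = W - alpha * V_dW                   # step along the averaged gradient
    return W, V_dW

# Hypothetical elongated cost J(w) = 0.5 * (w1^2 + 25 * w2^2)
curvature = np.array([1.0, 25.0])

def cost(W):
    return 0.5 * np.sum(curvature * W**2)

def grad(W):
    return curvature * W

W = np.array([10.0, 1.0])
V_dW = np.zeros_like(W)                    # V starts at 0
for _ in range(200):
    W, V_dW = momentum_step(W, grad(W), V_dW, alpha=0.07)

print(cost(W))
```

The same step would be applied to `b` with its own `V_db`.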


### 3.2 RMSprop

RMSprop stands for "root mean square prop".

Slow down learning in one direction and speed it up in another.

On each iteration, use an exponentially weighted average again, this time of the squares of the derivatives (squared element-wise):

$S_{dw} = \beta S_{dw} + (1-\beta)dW^2$

$S_{db} = \beta S_{db} + (1-\beta)db^2$

$W:= W- \alpha \dfrac{dW}{\sqrt{S_{dw}}}$ and

$b:= b- \alpha \dfrac{db}{\sqrt{S_{db}}}$

In the direction where we want to learn fast, the derivatives are small, so the corresponding S is small and we divide by a small number, which enlarges the update. In the direction where we want to learn slowly, the corresponding S is relatively large, so the updates become smaller.

Often, a small $\epsilon$ is added to the denominator to make sure that you don't end up dividing by 0.
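
A minimal sketch of one RMSprop step, assuming $\beta = 0.9$ and $\epsilon = 10^{-8}$ added after the square root (both defaults are assumptions here). The gradient values are made up to show two directions with very different scales.

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update for a parameter array W."""
    S_dW = beta * S_dW + (1 - beta) * dW**2       # moving average of squared gradients
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)    # per-coordinate rescaled step
    return W, S_dW

W0 = np.array([1.0, 1.0])
dW = np.array([100.0, 0.01])     # steep direction vs. flat direction
S0 = np.zeros_like(W0)

W1, S1 = rmsprop_step(W0, dW, S0)
step = W0 - W1
print(step)
```

Although the two gradient components differ by a factor of 10,000, both coordinates move by about the same amount, because each one is divided by its own running root mean square.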

### 3.3 Adam optimization algorithm

"Adaptive Moment Estimation", basically using the first and second moment estimations.

Works very well in many situations!

Taking momentum and RMSprop and putting it together!

Initialize:

$V_{dw}=0, S_{dw}=0, V_{db}=0, S_{db}=0$.

On each iteration $t$:

Compute $dW, db$ using the current mini-batch.

$V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dW$, $V_{db} = \beta_1 V_{db} + (1-\beta_1)db$ 

$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dW^2$, $S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$ 

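The first line is the momentum update and the second the RMSprop update. Because $V$ and $S$ start at 0, they are biased towards 0 during the first iterations, so we need to perform a bias correction! This is sometimes also done in RMSprop, but it is a standard part of Adam.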


$V^{corr}_{dw}= \dfrac{V_{dw}}{1-\beta_1^t}$, $V^{corr}_{db}= \dfrac{V_{db}}{1-\beta_1^t}$

$S^{corr}_{dw}= \dfrac{S_{dw}}{1-\beta_2^t}$, $S^{corr}_{db}= \dfrac{S_{db}}{1-\beta_2^t}$

$W:= W- \alpha \dfrac{V^{corr}_{dw}}{\sqrt{S^{corr}_{dw}}+\epsilon}$ and

$b:= b- \alpha \dfrac{V^{corr}_{db}}{\sqrt{S^{corr}_{db}}+\epsilon}$

Hyperparameters:
- $\alpha$: needs to be tuned
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$

Generally, only $\alpha$ gets tuned.
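
Putting the pieces together, one Adam step for a single parameter array might look like the sketch below. The default $\alpha = 0.001$ and the example values are assumptions, and $\epsilon$ is added after taking the square root, which is the common convention.

```python
import numpy as np

def adam_step(W, dW, V_dW, S_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array W at iteration t (t starts at 1)."""
    V_dW = beta1 * V_dW + (1 - beta1) * dW        # momentum-style first moment
    S_dW = beta2 * S_dW + (1 - beta2) * dW**2     # RMSprop-style second moment
    V_corr = V_dW / (1 - beta1**t)                # bias correction
    S_corr = S_dW / (1 - beta2**t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return W, V_dW, S_dW

W0 = np.array([1.0, -2.0])
dW = np.array([0.5, -3.0])
V0 = np.zeros_like(W0)
S0 = np.zeros_like(W0)

W1, V1, S1 = adam_step(W0, dW, V0, S0, t=1)
print(W1)
```

On the very first step the bias correction makes `V_corr` equal to `dW` and `S_corr` equal to `dW**2`, so each coordinate moves by almost exactly `alpha` in the direction opposite to its gradient.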

### 3.4 Learning rate decay

Learning rate decreases across epochs.

$\alpha = \dfrac{1}{1+\text{decay\_rate} \times \text{epoch\_nb}} \, \alpha_0$

Other methods:

$\alpha = 0.97^{\text{epoch\_nb}} \, \alpha_0$ (exponential decay)

or

$\alpha = \dfrac{k}{\sqrt{\text{epoch\_nb}}} \, \alpha_0$

or

Manual decay: lower $\alpha$ by hand while watching training!
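
The three formula-based schedules can be written as small functions. The function names and the default values for `decay_rate`, `base` and `k` are assumptions for illustration.

```python
import math

def inverse_decay(alpha0, epoch_nb, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_nb)"""
    return alpha0 / (1 + decay_rate * epoch_nb)

def exponential_decay(alpha0, epoch_nb, base=0.97):
    """alpha = base ** epoch_nb * alpha0"""
    return alpha0 * base ** epoch_nb

def sqrt_decay(alpha0, epoch_nb, k=1.0):
    """alpha = k / sqrt(epoch_nb) * alpha0, defined for epoch_nb >= 1"""
    return alpha0 * k / math.sqrt(epoch_nb)

alpha0 = 0.2
for epoch_nb in range(1, 5):
    print(epoch_nb,
          inverse_decay(alpha0, epoch_nb),
          exponential_decay(alpha0, epoch_nb),
          sqrt_decay(alpha0, epoch_nb))
```

With `alpha0 = 0.2` and `decay_rate = 1`, the first schedule halves the learning rate after one epoch, while the exponential schedule with base 0.97 decays much more gently.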


## 4. Hyperparameter tuning

Now that we've seen some optimization algorithms, let's have another look at all the hyperparameters that need tuning.

Most important:
- $\alpha$

Important next:
- $\beta$ (momentum)
- Number of hidden units
- Mini-batch size

Finally:
- Number of layers
- Learning rate decay

Almost never tuned:
- $\beta_1$, $\beta_2$, $\epsilon$ (Adam)

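Things to do:

- Don't use a grid. Sample hyperparameter combinations at random instead, because it is hard to say in advance which hyperparameters will be important, and random sampling tries more distinct values of each one.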
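
A minimal sketch of random sampling instead of a grid. The ranges, the function name `sample_hyperparameters` and the choice to sample $\alpha$ and $1-\beta$ on a log scale are assumptions for illustration, not something prescribed by these notes.

```python
import numpy as np

def sample_hyperparameters(rng):
    """Draw one random hyperparameter combination (hypothetical ranges)."""
    alpha = 10 ** rng.uniform(-4, 0)          # learning rate in [1e-4, 1], log scale
    beta = 1 - 10 ** rng.uniform(-3, -1)      # momentum between 0.9 and 0.999
    n_hidden = int(rng.integers(50, 101))     # hidden units in [50, 100]
    return {"alpha": alpha, "beta": beta, "n_hidden": n_hidden}

rng = np.random.default_rng(0)
trials = [sample_hyperparameters(rng) for _ in range(25)]
for trial in trials[:3]:
    print(trial)
```

With 25 random trials, each hyperparameter is tried at 25 different values, whereas a 5 x 5 grid over two hyperparameters would only try 5 values of each.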



https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum