# How to optimize your networks

## 1. Normalized inputs speed up training

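- Subtract the mean of each feature.
- Divide each feature by its standard deviation, so that every feature has unit variance.
- Learning can be slow when inputs are unnormalized: features on different scales stretch the cost surface into a long, narrow bowl, so gradient descent needs a small learning rate and many steps.
- Example below: the cost surface for normalized inputs, followed by the one for unnormalized inputs.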

In [49]:
import numpy as np
import plotly.graph_objects as go  # plotly.plotly moved to the chart_studio package in plotly 4

# Cost surface J(w1, w2) when both weights live on the same scale
w1 = np.arange(-5, 5, 0.1)
w2 = np.arange(-5, 5, 0.1)
w1, w2 = np.meshgrid(w1, w2)
J = w1**2 + w2**2

# Shared styling for all three axes
axis_style = dict(
    gridcolor='rgb(255, 255, 255)',
    zerolinecolor='rgb(255, 255, 255)',
    showbackground=True,
    backgroundcolor='rgb(230, 230, 230)'
)

surface = go.Surface(x=w1, y=w2, z=J, colorscale='Viridis')
layout = go.Layout(
    title='Normalized inputs',
    scene=dict(xaxis=axis_style, yaxis=axis_style, zaxis=axis_style)
)
fig = go.Figure(data=[surface], layout=layout)
fig.show()  # renders locally; no Chart Studio account needed

In [50]:
import numpy as np
import plotly.graph_objects as go  # plotly.plotly moved to the chart_studio package in plotly 4

# Same cost function, but w1 spans a 10x wider range than w2,
# mimicking a feature that was left unnormalized
w1 = np.arange(-50, 50, 0.1)
w2 = np.arange(-5, 5, 0.1)
w1, w2 = np.meshgrid(w1, w2)
J = w1**2 + w2**2

# Shared styling for all three axes
axis_style = dict(
    gridcolor='rgb(255, 255, 255)',
    zerolinecolor='rgb(255, 255, 255)',
    showbackground=True,
    backgroundcolor='rgb(230, 230, 230)'
)

surface = go.Surface(x=w1, y=w2, z=J, colorscale='Viridis')
layout = go.Layout(
    title='Unnormalized inputs',
    scene=dict(xaxis=axis_style, yaxis=axis_style, zaxis=axis_style)
)
fig = go.Figure(data=[surface], layout=layout)
fig.show()  # renders locally; no Chart Studio account needed
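
The two normalization steps listed at the start of this section can be sketched in NumPy as follows. This is a minimal sketch, not a definitive implementation: the function name `normalize_inputs`, the synthetic data, and the layout (one sample per row, one feature per column) are assumptions made for illustration. The statistics are computed on the training set and reused for the test set, so both pass through the same transformation.

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Standardize features using statistics from the training set only."""
    mu = X_train.mean(axis=0)           # per-feature mean
    sigma = X_train.std(axis=0) + eps   # per-feature standard deviation (eps avoids division by zero)
    return (X_train - mu) / sigma, (X_test - mu) / sigma, mu, sigma

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales
X_train = rng.normal(loc=[100.0, 1.0], scale=[50.0, 0.5], size=(1000, 2))
X_test = rng.normal(loc=[100.0, 1.0], scale=[50.0, 0.5], size=(200, 2))

X_train_n, X_test_n, mu, sigma = normalize_inputs(X_train, X_test)
print(X_train_n.mean(axis=0))  # close to 0 for both features
print(X_train_n.std(axis=0))   # close to 1 for both features
```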

## 2. Vanishing or exploding gradients

Example: a very deep neural network. Let's assume $g(z)=z$ (a linear activation function, so no transformation) and all biases equal to 0. Then:

$\hat y = w^{[L]}w^{[L-1]}w^{[L-2]}... w^{[3]}w^{[2]}w^{[1]}x$

Recall that $z^{[1]} = w^{[1]}x$ and that $a^{[1]}=g(z^{[1]})=z^{[1]}$.

Similarly, $a^{[2]}=g(z^{[2]})=g(w^{[2]}a^{[1]})=w^{[2]}w^{[1]}x$, and so on for every following layer.

Imagine 2 nodes in each hidden layer, and every weight matrix except the last one equal to $w^{[l]} = \begin{bmatrix} 1.3 & 0 \\ 0 & 1.3 \end{bmatrix}$. Then:

$\hat y = w^{[L]} \begin{bmatrix} 1.3 & 0 \\ 0 & 1.3 \end{bmatrix}^{L-1}   x$

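Even if the weights are only slightly larger than 1, the activations grow like $1.3^{L-1}$ and explode when the network has many layers. If the weights are slightly smaller than 1 (say 0.9), the activations shrink exponentially and vanish instead. The same argument applies to the gradients.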
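
The effect is easy to check numerically. The sketch below assumes the setup from the example (2 nodes per layer, $g(z)=z$, zero biases) and uses the same diagonal matrix in every layer; the function name `forward_linear` and the depth of 50 layers are illustrative choices.

```python
import numpy as np

def forward_linear(x, scale, num_layers):
    """Propagate x through num_layers identical linear layers W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a  # g(z) = z and b = 0, so a = W a_prev
    return a

x = np.ones(2)
for scale in (0.9, 1.0, 1.3):
    a = forward_linear(x, scale, num_layers=50)
    print(f"scale={scale}: |a| = {np.linalg.norm(a):.3e}")
```

With a scale of 1.3 the norm grows by several orders of magnitude, with 0.9 it shrinks towards zero, and only a scale of exactly 1.0 leaves it unchanged.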

https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs

### Solution

The more input features feed into layer $l$, the smaller we want each $w_i$ to be. Common rule of thumb: $Var(w_i) = 1/n$ or $2/n$, where $n = n^{[l-1]}$ is the number of inputs to the layer.

Initialize (with `n_l` units in layer $l$ and `n_prev` $= n^{[l-1]}$ inputs):

`W_l = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)`

--> the $2/n$ scaling is common for ReLU

Different initializations for different activation functions!
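
A minimal sketch of this rule of thumb, assuming a variance of $2/n$ for ReLU and $1/n$ otherwise (the $1/n$ variant is often paired with tanh). The helper `init_weights`, its `activation` argument, and the layer sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def init_weights(n_out, n_in, activation="relu", rng=None):
    """Draw an (n_out, n_in) weight matrix with variance 2/n_in (ReLU) or 1/n_in (otherwise)."""
    rng = np.random.default_rng() if rng is None else rng
    variance = 2.0 / n_in if activation == "relu" else 1.0 / n_in
    return rng.standard_normal((n_out, n_in)) * np.sqrt(variance)

rng = np.random.default_rng(0)
W_relu = init_weights(256, 512, "relu", rng)
W_tanh = init_weights(256, 512, "tanh", rng)
print(W_relu.var(), 2 / 512)  # empirical variance vs. target 2/n
print(W_tanh.var(), 1 / 512)  # empirical variance vs. target 1/n
```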

## 3. Optimization

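Exponentially weighted (moving) averages: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The parameter $\beta$ sets how much of yesterday's average carries over into today's value; the result averages over roughly the last $1/(1-\beta)$ measurements.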
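
An exponentially weighted average follows the recurrence $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ with $v_0 = 0$, where $\theta_t$ is the measurement at step $t$. The sketch below implements it directly; the function name `ewa` and the optional bias correction (dividing by $1-\beta^t$ to compensate for the zero start) are additions for illustration, not something these notes prescribe.

```python
import numpy as np

def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence of measurements."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta  # blend yesterday's average with today's value
        out.append(v / (1 - beta**t) if bias_correction else v)
    return np.array(out)

rng = np.random.default_rng(0)
# Hypothetical noisy measurements around a slow trend
noisy = np.sin(np.linspace(0, 3, 100)) + rng.normal(scale=0.3, size=100)
smooth = ewa(noisy, beta=0.9)
print(noisy[:5])
print(smooth[:5])
```

A larger `beta` gives a smoother curve that reacts more slowly to changes; a smaller `beta` tracks the raw measurements more closely.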

Some optimization algorithms and techniques that often converge faster than plain gradient descent:

- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Learning rate decay
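
The update rules behind this list can be sketched in a few lines of NumPy. This is one common formulation under stated assumptions, not a reference implementation: the function names, the `state` dictionary, the hyperparameter defaults, and the decay schedule $\alpha = \alpha_0 / (1 + \text{decay\_rate} \cdot \text{epoch})$ are illustrative choices, and the cost $J = w_1^2 + w_2^2$ is borrowed from the surface plots in section 1.

```python
import numpy as np

def grad(w):
    """Gradient of J(w) = w1**2 + w2**2."""
    return 2 * w

def momentum_step(w, g, state, lr=0.1, beta=0.9):
    # Exponentially weighted average of past gradients
    state["v"] = beta * state["v"] + (1 - beta) * g
    return w - lr * state["v"]

def rmsprop_step(w, g, state, lr=0.1, beta2=0.9, eps=1e-8):
    # Exponentially weighted average of squared gradients rescales each coordinate
    state["s"] = beta2 * state["s"] + (1 - beta2) * g**2
    return w - lr * g / (np.sqrt(state["s"]) + eps)

def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum and RMSprop combined, with bias correction for the zero start
    state["t"] += 1
    state["v"] = beta1 * state["v"] + (1 - beta1) * g
    state["s"] = beta2 * state["s"] + (1 - beta2) * g**2
    v_hat = state["v"] / (1 - beta1 ** state["t"])
    s_hat = state["s"] / (1 - beta2 ** state["t"])
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)

def decayed_lr(lr0, epoch, decay_rate=1.0):
    """Learning rate decay: shrink the step size as epochs go by."""
    return lr0 / (1 + decay_rate * epoch)

def new_state():
    return {"v": np.zeros(2), "s": np.zeros(2), "t": 0}

def run(step, steps=500):
    """Apply one of the step functions repeatedly from a fixed starting point."""
    w = np.array([4.0, -3.0])
    state = new_state()
    for _ in range(steps):
        w = step(w, grad(w), state)
    return w

for step in (momentum_step, rmsprop_step, adam_step):
    print(step.__name__, run(step))
print([decayed_lr(0.2, epoch) for epoch in range(4)])
```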

https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum