# Let's continue learning about Neural Nets

In part 1, we learned about orthogonalization and some regularization algorithms.

- Orthogonalization refers to the idea that we separate the process of eliminating bias from the process of eliminating variance. That is, we first find a network that performs very well on our training data, then we implement regularization (or other processes) to improve dev set performance.

- L2 regularization is the most common form. It refers to penalizing the cost function by the squared norm of the linear weights. 

- Drop-out regularization is another form. It refers to randomly selecting nodes in the network which will not be used during an iteration of training. The result is that no node relies heavily on any single other node, since that node's output might be zero on a given iteration. 

See that notebook for more information.

## Mini-batch Gradient Descent

This separates our training samples into groups and implements forward and backward propagation on the groups one at a time. 

```
for t = 1,...,5000: ##suppose you have 5,000,000 examples
    forward_prop X{t}: ##vectorize on a sub-set of those 5,000,000
        Z[1] = W[1]X{t} + b[1]
        A[1] = g[1](Z[1])
        ...
        A[l] = g[l](Z[l])
    
    
    compute_cost:
        J{t} = 1/1000 * sum(Cost_func(label, prediction)) + L2 regularization penalty
    ##the 1000 is m, the size of each mini-batch, since we broke up the 
    ##training data into 5,000 mini-batches of 1,000 examples each.
    
    complete back_prop, using X{t} and Y{t}
```

A single pass through the loop above, covering all 5,000 mini-batches, is one **epoch**. Normally, training will require more than one epoch to minimize our cost function. 
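
To make the pseudo-code concrete, here is a minimal runnable sketch, not Ng's implementation. It assumes a plain linear model with a squared-error cost (no hidden layers, no L2 penalty) so that forward and backward propagation fit in a few lines. The helper names `make_mini_batches` and `train_linear`, the synthetic data, and the batch size of 64 are all made up for illustration.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the examples (columns) and slice them into mini-batches."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, i:i + batch_size], Y[:, i:i + batch_size])
            for i in range(0, m, batch_size)]

def train_linear(X, Y, batch_size=64, epochs=20, alpha=0.1, seed=0):
    """Mini-batch gradient descent for the linear model A = W X + b."""
    rng = np.random.default_rng(seed)
    W = np.zeros((Y.shape[0], X.shape[0]))
    b = np.zeros((Y.shape[0], 1))
    for _ in range(epochs):                       # one epoch = one pass over all mini-batches
        for X_t, Y_t in make_mini_batches(X, Y, batch_size, rng):
            m_t = X_t.shape[1]                    # the last mini-batch may be smaller
            A = W @ X_t + b                       # forward prop on X{t}
            dZ = (A - Y_t) / m_t                  # gradient of the mean squared-error cost
            W -= alpha * (dZ @ X_t.T)             # back prop + update, using X{t} and Y{t}
            b -= alpha * dZ.sum(axis=1, keepdims=True)
    return W, b

# Synthetic, noiseless data (an assumption): y = 3*x1 - 2*x2 + 0.5
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((2, 1000))
Y = np.array([[3.0, -2.0]]) @ X + 0.5
W, b = train_linear(X, Y)
print(W, b)  # should end up close to [[3, -2]] and [[0.5]]
```

Each inner-loop step is vectorized over the 64 columns of `X_t`, which is the whole point of using mini-batches instead of looping over single examples.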

### Understanding Mini-batch Gradient Descent

This process speeds up learning by using enough data to approximate the gradient of the whole batch while still taking advantage of vectorization. If X{t} is small, we have a lot of looping and the gradient won't always be very close to the batch gradient. If X{t} is very large, we spend a lot of time computing a gradient for trivial increases in gradient precision compared with a medium-sized X{t}.

The optimal size (i.e., number of columns) for X{t} varies, but it is recommended that you choose a power of 2 between $2^5$ (=32) and $2^9$ (=512). 

## Exponentially weighted averages

This is a process for smoothing noisy or erratic data where instantaneous measurements tend to deviate from the mean value in the neighborhood. At least, that's sort of how I see it. Temperature is the example given in the video. How should we express temperature over the course of the year in a graph? Well, you don't want to take too many days of past data, since the temperature two months ago is probably colder or warmer than tomorrow will be. Let's suppose you only want to use recent data (not data from previous years). A function like this would be a decent first bet:
$$
v(t) = .8v(t-1) + .2\theta _t
$$
Which is just a weighted average. The general form for this is
$$
v(t) = \beta v(t-1) + (1 - \beta)\theta _t
$$
This function expresses, approximately, the rolling average temperature over a number of days:
$$
\textrm{no. days of the rolling average} = \frac{1}{1 - \beta}
$$

So if $\beta = .8$ we're approximating the five day average.

### Understanding a little more clearly

The weights in v(t) follow an exponentially decaying function. Notice that 
$$
v(t) = .2\theta _t +.8(.2\theta_{t-1} + .8v(t-2)) 
$$
So we're essentially multiplying the temperature observed $k$ days ago by a weight of $.2(.8)^k$, and that weight rapidly approaches 0 as $k$ grows.
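
Unrolling the recursion all the way, for general $\beta$ and assuming we start from $v(0) = 0$, gives each past observation its own weight:
$$
v(t) = (1-\beta)\sum_{k=0}^{t-1} \beta^k \theta_{t-k}
$$
With $\beta = .8$, the observation from five days back gets weight $.2 \times .8^5 \approx .066$, about a third of the $.2$ given to today's observation.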

### Code Implementation

Very memory efficient:
```
v = 0
update: 
    theta = get_next_theta
    v = beta*v + (1-beta)*theta
```
This is much more memory efficient than keeping the last ten values of theta and finding their average.
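
A runnable version of the update might look like the sketch below. The function name `ewa` and the temperatures are invented for illustration; only the update rule comes from the notes above.

```python
def ewa(thetas, beta=0.9):
    """Return the exponentially weighted average after each observation."""
    v = 0.0                      # the only state carried between observations
    history = []                 # kept only so we can inspect the smoothed series
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        history.append(v)
    return history

temps = [10.0, 12.0, 11.0, 15.0, 14.0]   # made-up daily temperatures
smoothed = ewa(temps, beta=0.8)
print(smoothed)  # the first value is roughly 2.0, far below the observed 10.0
```

The early values sit well below the data because `v` starts at 0 and needs a few observations to warm up.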

### Bias correction

Exponentially weighted averages don't produce good estimates in the initial section of the series unless $\beta$ is small, because $v_0 = 0$ drags the early values toward zero. To correct this, we take $v_t$ and divide by $(1-\beta^t)$, which converges to 1 rather quickly. So, for large $t$, the corrected value is very close to $v_t$, but for small $t$, $\frac{v_t}{1-\beta^t} \approx (\theta_1 + \ldots + \theta_t)/t$.
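
A small sketch of the correction, under the same made-up temperatures as an assumption (the function name `ewa_bias_corrected` is hypothetical):

```python
def ewa_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with the 1 - beta**t bias correction."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):   # t starts at 1
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))     # rescale so the weights sum to 1
    return corrected

temps = [10.0, 12.0, 11.0, 15.0, 14.0]   # made-up daily temperatures
corrected = ewa_bias_corrected(temps, beta=0.8)
print(corrected)  # the first value is now roughly 10.0, matching the first observation
```

Note that the uncorrected `v` is still what gets carried forward; the division is applied only when reading off the estimate.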


## Gradient Descent with Momentum

In one sentence:

> Compute an exponentially weighted average of your gradient and use that to update your weights.

This almost always converges faster than plain gradient descent.
During each iteration, compute dW, db on the current mini-batch and then compute ```VdW = beta*VdW + (1 - beta)*dW``` and the same for ```Vdb```. 

$$
v_{dW} = \beta v_{dW} + (1-\beta)dW \\
v_{db} = \beta v_{db} + (1-\beta)db \\
W = W - \alpha v_{dW}, \quad b = b - \alpha v_{db}
$$

In practice, one doesn't usually need to tune $\beta$; 0.9 is a strong value, but you can tune it if you want. Sometimes one sees the $(1 - \beta)$ term omitted, which is mathematically equivalent as long as you scale $\alpha$ correspondingly: $v_{dW}$ becomes larger by a factor of $\frac{1}{1 - \beta}$, so $\alpha$ must be multiplied by $(1 - \beta)$. 
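
Here is a minimal sketch of the update on a toy problem: minimizing $f(w) = w^2$, whose gradient is $2w$. The function name `momentum_step`, the starting point, and the step count are assumptions for illustration; in a real network `grad` would be dW computed on the current mini-batch.

```python
def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    """One gradient-descent-with-momentum update for a single parameter."""
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of the gradient
    w = w - alpha * v                  # step along the averaged gradient
    return w, v

w, v = 5.0, 0.0
for _ in range(300):
    grad = 2 * w                       # gradient of f(w) = w**2
    w, v = momentum_step(w, v, grad)
print(w)  # should be very close to the minimum at 0
```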

## RMS Prop

1. Compute dW and db on a mini-batch.
1. Find SdW and Sdb: SdW = beta * SdW + (1-beta) * dW^2 (squared element-wise), likewise for Sdb.
1. Update W and b as W = W - alpha * dW/(sqrt(SdW) + epsilon), likewise for b.
1. The tiny epsilon is there just in case sqrt(SdW) is very close to zero, to prevent W from "blowing up."
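
The steps above can be sketched for a single parameter as follows. This again uses the toy objective $f(w) = w^2$; the function name `rmsprop_step` and the values of `alpha`, `beta`, and `eps` are assumptions for illustration, not tuned recommendations.

```python
import math

def rmsprop_step(w, s, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMS Prop update for a single parameter."""
    s = beta * s + (1 - beta) * grad ** 2        # running average of the squared gradient
    w = w - alpha * grad / (math.sqrt(s) + eps)  # eps guards against dividing by ~0
    return w, s

w, s = 1.0, 0.0
for _ in range(500):
    grad = 2 * w                                 # gradient of f(w) = w**2
    w, s = rmsprop_step(w, s, grad)
print(w)  # ends up near 0, hovering within a few multiples of alpha
```

Because the gradient is divided by its own running magnitude, the step size stays on the order of `alpha` no matter how large or small the raw gradient is.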

## Adam: RMS Prop + Momentum

This is one of the few optimization algorithms that has been shown to work well consistently across a wide range of problems. 

1. Set VdW, SdW, Vdb, Sdb to 0.
1. On each iteration t, compute the derivatives dW, db for the current mini-batch.
    1. VdW = beta_1 * VdW + (1-beta_1) * dW, Vdb = (same)
    1. SdW = beta_2 * SdW + (1-beta_2) * dW^2, Sdb = (same)
    1. We do implement bias correction.
    1. VdW_corrected = VdW / (1 - beta_1^t), SdW_corrected = SdW / (1 - beta_2^t), and likewise for Vdb and Sdb.
    1. Update W = W - alpha * VdW_corrected/(sqrt(SdW_corrected) + epsilon), etc. 
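
Putting the steps together for a single parameter gives the sketch below, again on the toy objective $f(w) = w^2$. The function name `adam_step` and the value of `alpha` are assumptions for illustration; the `beta_1`, `beta_2`, and `eps` defaults are the commonly used ones.

```python
import math

def adam_step(w, v, s, grad, t, alpha=0.01, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the iteration count, starting at 1."""
    v = beta_1 * v + (1 - beta_1) * grad          # momentum term
    s = beta_2 * s + (1 - beta_2) * grad ** 2     # RMS Prop term
    v_hat = v / (1 - beta_1 ** t)                 # bias-corrected momentum
    s_hat = s / (1 - beta_2 ** t)                 # bias-corrected squared gradient
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = 1.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * w                                  # gradient of f(w) = w**2
    w, v, s = adam_step(w, v, s, grad, t)
print(w)
```

Thanks to bias correction, the very first step has size almost exactly `alpha`: the corrected momentum equals the gradient and the corrected root-mean-square equals its absolute value.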
    
### Hyperparameters with Adam

- $\alpha$ needs to be tuned.
- $\beta_1$ usually left at its default of 0.9
- $\beta_2$ usually left at its default of 0.999
- $\epsilon$ usually $10^{-8}$. 

The name Adam stands for Adaptive Moment Estimation.

## Learning Rate Decay

The idea is to reduce the learning rate with each epoch. 

$$
\alpha = \frac{1}{1 + \textrm{decay rate} \times \textrm{epoch number}}\alpha_0
$$

There are other functions that people use (exponential decay, etc.).

Ng ranks learning rate decay fairly low on his list of hyperparameters to tune.

In [7]:
decay_rate = .5
[.2/(1+decay_rate*epoch_number) for epoch_number in range(0,20)]

[0.2,
 0.13333333333333333,
 0.1,
 0.08,
 0.06666666666666667,
 0.05714285714285715,
 0.05,
 0.044444444444444446,
 0.04,
 0.03636363636363637,
 0.03333333333333333,
 0.03076923076923077,
 0.028571428571428574,
 0.02666666666666667,
 0.025,
 0.023529411764705882,
 0.022222222222222223,
 0.021052631578947368,
 0.02,
 0.01904761904761905]
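
For comparison, here is a sketch of the inverse decay schedule from the formula above next to exponential decay. The function names and the base of 0.95 are assumptions chosen for illustration, not recommended values.

```python
def inverse_decay(alpha_0, decay_rate, epoch_number):
    """alpha = alpha_0 / (1 + decay_rate * epoch_number), as in the formula above."""
    return alpha_0 / (1 + decay_rate * epoch_number)

def exponential_decay(alpha_0, base, epoch_number):
    """alpha = base**epoch_number * alpha_0, with 0 < base < 1 (base is an assumed knob)."""
    return alpha_0 * base ** epoch_number

for epoch_number in range(5):
    print(epoch_number,
          inverse_decay(0.2, 0.5, epoch_number),
          exponential_decay(0.2, 0.95, epoch_number))
```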

## Local minima, local optima, etc

Ng basically says "we used to worry about that, but in high dimensional spaces, it doesn't really happen. Almost all regions where there's a zero gradient are saddle points."

_Plateaus_ are a problem. A plateau is just a large region where the derivative is close to zero. They slow things down a fair amount; optimizers like Adam help a bunch.
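
A toy illustration of a saddle point (an illustrative example, not one from the lecture): $f(x, y) = x^2 - y^2$ has zero gradient at the origin, yet the origin is a minimum along x and a maximum along y.

```python
def f(x, y):
    return x ** 2 - y ** 2

def grad_f(x, y):
    """Gradient of f, computed by hand: (df/dx, df/dy)."""
    return (2 * x, -2 * y)

at_origin_is_flat = grad_f(0.0, 0.0) == (0.0, 0.0)   # zero gradient at the origin
rises_along_x = f(0.1, 0.0) > f(0.0, 0.0)            # curves up in the x direction
falls_along_y = f(0.0, 0.1) < f(0.0, 0.0)            # curves down in the y direction
print(at_origin_is_flat, rises_along_x, falls_along_y)
```

For a zero-gradient point to be a true local minimum, the function has to curve upward in every direction at once, which becomes less and less likely as the number of dimensions grows.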

In [24]:
import numpy as np
X = np.random.randn(1,10).reshape((1,10))
X[0,1:3]
.1**.1  # use ** for exponentiation; ^ is bitwise XOR and raises a TypeError on floats