# Let's continue learning about Neural Nets

In part 1, we learned about orthogonalization and some regularization algorithms.

- Orthogonalization refers to the idea that we separate the process of eliminating bias from the process of eliminating variance. That is, we first find a network that performs very well on our training data, then we apply regularization (or other techniques) to improve dev set performance.

- L2 regularization is the most common form. It adds a penalty to the cost function proportional to the squared norm of the weights.

- Drop-out regularization is another form. It refers to randomly selecting nodes in the network which will not be used during an iteration of training. The result is that no node relies heavily on any single upstream node, since that node's output might be zero on any given iteration.

See that notebook for more information.
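As a quick refresher on the two regularizers listed above, here is a minimal NumPy sketch. It assumes the common `lambd / (2 * m)` scaling for the L2 term and the "inverted" variant of drop-out; the function names and the toy values are made up for illustration and are not taken from part 1.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 term added to the cost: (lambd / (2 * m)) * sum of squared weights.
    The 1 / (2 * m) scaling is an assumed convention."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def dropout(A, keep_prob, rng):
    """Inverted drop-out: zero each activation with probability 1 - keep_prob,
    then rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(A.shape) < keep_prob
    return A * mask / keep_prob

rng = np.random.default_rng(0)

# Toy weight matrix: squared entries sum to 1 + 4 + 0.25 + 0 = 5.25.
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
penalty = l2_penalty([W1], lambd=0.1, m=10)

# Toy activations: after drop-out each entry is either 0 or 1 / keep_prob.
A1 = np.ones((4, 3))
A1_dropped = dropout(A1, keep_prob=0.8, rng=rng)
```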

## Mini-batch Gradient Descent

This separates our training samples into groups (mini-batches) and runs forward and backward propagation on one group at a time, updating the weights after each group.

```
for t = 1,...,5000: ##suppose you have 5,000,000 examples
    forward_prop X{t}: ##vectorize on a sub-set of those 5,000,000
        Z[1] = W[1]X{t} + b[1]
        A[1] = g[1](Z[1])
        ...
        A[l] = g[l](Z[l])
    
    
    compute_cost:
        J{t} = (1/1000) * sum(Cost_func(label, prediction)) + L2 regularization penalty
    ##the 1000 is the mini-batch size, since we broke up the
    ##training data into 5,000 mini-batches of 1,000 examples each.
    
    back_prop, using X{t} and Y{t}, to get the gradients of J{t}
    update W[l] and b[l] with those gradients
```

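A single pass through the loop above, which visits every mini-batch once and so touches all of the training data, is one **epoch**. Batch gradient descent takes one gradient step per epoch; here we take 5,000. Normally, training will require more than one epoch to minimize our cost function.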
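The loop above can be sketched in runnable form. This is only a minimal NumPy illustration: it assumes a single linear layer with a squared-error cost in place of the full network, and `mini_batches`, `train`, the learning rate, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def mini_batches(X, Y, batch_size, rng):
    """Shuffle the examples (columns) and yield (X{t}, Y{t}) pairs."""
    m = X.shape[1]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

def train(X, Y, batch_size=64, epochs=50, learning_rate=0.1, seed=0):
    """Mini-batch gradient descent on a linear model (assumed for brevity)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((1, X.shape[0]))
    b = 0.0
    for _ in range(epochs):                     # one epoch = every mini-batch once
        for X_t, Y_t in mini_batches(X, Y, batch_size, rng):
            m_t = X_t.shape[1]
            A = W @ X_t + b                     # forward prop, vectorized over the mini-batch
            dZ = (A - Y_t) / m_t                # gradient of J{t} = sum((A - Y)^2) / (2 * m_t)
            W -= learning_rate * (dZ @ X_t.T)   # back prop and weight update
            b -= learning_rate * dZ.sum()
    return W, b

# Synthetic, noise-free data: y = 3*x1 - 2*x2 + 1, with 1,000 examples as columns.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(2, 1000))
Y = np.array([[3.0, -2.0]]) @ X + 1.0
W, b = train(X, Y)
```

With 1,000 examples and a mini-batch size of 64, each epoch performs 16 weight updates (the last mini-batch holds the 40 leftover examples).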

### Understanding Mini-batch Gradient Descent

This process speeds up learning by using enough data to approximate the gradient over the whole training set while still taking advantage of vectorization. If X{t} is small, we have a lot of looping and the gradient won't always be very close to the batch gradient. If X{t} is very large, we spend a lot of time computing each gradient for only trivial gains in gradient precision compared with a medium-sized X{t}.

The optimal size (i.e., number of columns) for X{t} varies, but it is recommended that you choose a power of 2 between $2^5$ (=32) and $2^9$ (=512).
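To get a feel for the looping side of this trade-off, the short sketch below counts how many weight updates one epoch takes for a few mini-batch sizes. The helper name `updates_per_epoch` is made up for this example; the 5,000,000 figure is the one used in the pseudo-code earlier.

```python
import math

def updates_per_epoch(m, batch_size):
    """Number of mini-batches (and so weight updates) in one pass over m examples."""
    return math.ceil(m / batch_size)

m = 5_000_000
for batch_size in (1, 32, 64, 128, 256, 512, m):
    # batch_size = 1 is one update per example; batch_size = m is plain batch gradient descent.
    print(batch_size, updates_per_epoch(m, batch_size))
```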

## Exponentially weighted averages

This is a process for smoothing noisy or erratic data where instantaneous measurements tend to deviate from the mean value in the neighborhood. At least, that's sort of how I see it. Temperature is the example given in the video. How should we express temperature over the course of the year in a graph? Well, you don't want to take too many days of past data, since the temperature two months ago is probably colder or warmer than tomorrow will be. Let's suppose you only want to use recent data (not data from previous years). A function like this would be a decent first bet:
$$
v(t) = .8v(t-1) + .2\theta _t
$$
Which is just a weighted average. The general form for this is
$$
v(t) = \beta v(t-1) + (1 - \beta)\theta _t
$$
This function approximates a rolling average of the temperature over a number of days given by
$$
\textrm{no. days of the rolling average} = \frac{1}{1 - \beta}
$$

So if $\beta = .8$ we're approximating the five day average.
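One way to sanity-check the $\frac{1}{1 - \beta}$ rule of thumb is to look at how much an observation's weight has shrunk by the time it is that many days old. The sketch below does this for a few values of $\beta$; the helper names are made up for illustration.

```python
def effective_window(beta):
    """Approximate number of days the average covers: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)

def weight_decay_at_window(beta):
    """Factor by which an observation's weight has shrunk once it is
    effective_window(beta) days old, relative to today's observation."""
    return beta ** effective_window(beta)

for beta in (0.8, 0.9, 0.98):
    print(beta, round(effective_window(beta)), round(weight_decay_at_window(beta), 3))
```

For each of these values the weight has fallen to roughly a third of its starting size by the end of the window, which is why older observations are treated as no longer contributing much.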

### Understanding a little more clearly

The weights that v(t) places on past observations decay exponentially. Notice that 
$$
v(t) = .2\theta _t + .8(.2\theta _{t-1} + .8v(t-2)) 
$$
So we're essentially multiplying the temperature observed $k$ days ago by $.2(.8)^k$, and that weight rapidly approaches 0 as $k$ grows.
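
Unrolling the recursion all the way back to the start makes the decay explicit. Assuming the average is initialized at $v(0) = 0$, the general form works out to
$$
v(t) = (1 - \beta)\sum_{k=0}^{t-1} \beta^{k}\,\theta _{t-k}
$$
With $\beta = .8$ the weights on today, yesterday, the day before, and so on are $.2$, $.16$, $.128$, $.1024$, and so on, each one $.8$ times the previous.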

### Code Implementation

Very memory efficient:
```
v = 0
update: 
    theta = get_next_theta
    v = beta*v + (1-beta)*theta
```
This is much more memory efficient than keeping the last ten values of theta and finding their average, since only the single number v has to be stored between updates.
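
The update above can be turned into a small runnable sketch. The class name `ExponentialAverage` and the temperature readings are made up for illustration; the only state kept between updates is the single float `v`.

```python
class ExponentialAverage:
    """Exponentially weighted average that stores one float of state."""

    def __init__(self, beta):
        self.beta = beta
        self.v = 0.0  # same starting point as the pseudo-code: v = 0

    def update(self, theta):
        """Fold one new observation into the running average and return it."""
        self.v = self.beta * self.v + (1 - self.beta) * theta
        return self.v

# Hypothetical daily temperatures, for illustration only.
temperatures = [40.0, 49.0, 45.0, 44.0, 50.0, 53.0, 48.0]

avg = ExponentialAverage(beta=0.8)
smoothed = [avg.update(theta) for theta in temperatures]

# Cross-check against the unrolled form: sum over k of (1 - beta) * beta**k * theta_{t-k}.
unrolled = sum(0.2 * 0.8 ** k * theta
               for k, theta in enumerate(reversed(temperatures)))
```

Because `v` starts at 0, the first few outputs sit well below the data (the first reading of 40.0 gives 8.0), and the average only catches up after several updates.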