## Week 1

Applied machine learning is an iterative process: test an idea, see how it works, and improve it. In deep learning, the hyperparameters are at the center of this testing.

### Splitting the data into sets

1) A training set is used to train the model  
2) A development set is used to see if the trained model performs well  
3) A test set is used after the final model has been created through multiple iterations with 1 and 2. It is used for an unbiased evaluation of how good the final model is.

Back in the day, with small datasets (100, 1000 or 10000 examples) a split of 60/20/20 % might be seen. However, this is not necessary with big data.

With big data, like a million examples, the split into 1, 2, 3 might be something like 98/1/1 %, since 10000 examples will be enough for dev and test sets.
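The 98/1/1 split can be sketched with NumPy like this. It is only an illustration: `split_indices` and its parameters are made-up names, and it assumes all examples come from one source so they can be shuffled freely.

```python
import numpy as np

def split_indices(m, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the example indices 0..m-1 and split them into train/dev/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]  # everything that is left
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

The index arrays can then be used to slice the data, e.g. `X[:, train_idx]` if the examples are stored as columns.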

It's important that the dev and test sets come from the same distribution/source. Ideally the training set would also come from the same source, but that's not always realistic.

Sometimes, if you don't need an unbiased estimate of the model, the test set is not needed. In that case, you would only have 1 and 2. In cases like this, the dev set is sometimes confusingly called "test set", even though it really is just a dev set (the model is basically fit to the "test set", so it's not a real test set).

### Bias and variance

High bias = underfitting  
High variance = overfitting

Examining the training and dev set errors can tell us about bias and variance. If the training error is very large, there is likely high bias. If the dev error is much larger than the training error, there is likely high variance. There can also be both at the same time.

This simple analysis works if the training and dev sets are from the same distribution and if the Bayes error is close to zero ("how well would a human classify cats", images are not blurry).
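The rule of thumb above can be written down as a tiny helper. This is a sketch, not a standard function: the name `diagnose` and the threshold `tol` are assumptions, and what counts as a "large" error always depends on the task.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.05):
    """Return the likely problems given error rates in [0, 1].

    tol is a hypothetical threshold for what counts as a "large" gap.
    """
    problems = []
    if train_err - bayes_err > tol:   # training error far above the best possible
        problems.append("high bias")
    if dev_err - train_err > tol:     # dev error far above training error
        problems.append("high variance")
    return problems

print(diagnose(0.01, 0.11))
print(diagnose(0.15, 0.16))
print(diagnose(0.15, 0.30))
print(diagnose(0.005, 0.01))
```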

<img src="notes_images/bias_var.png" width="700">

For curiosity's sake, this is what a simultaneous high bias and high variance might look like (purple line). Some parts are underfit and others overfit:

<img src="notes_images/hi_bias_var.png" width="700">

### Basic recipe for training neural networks

Solutions to bias and variance problems always depend on the task at hand, but there are some general guidelines to make approaching the problems more systematic.

Start by reducing bias to an acceptable level. Then reduce variance. Do some iterations between these until you are happy. Here are some ideas for reducing bias and variance:

<img src="notes_images/recipe.png" width="700">

The "bias variance tradeoff" is mostly a historical problem, where decreasing one would increase the other. With modern deep learning and related methods, this is not such a big deal anymore, since different methods are used for decreasing bias and variance. Avoiding this tradeoff is one reason why DL has been so useful in supervised learning.

As long as you have a well-regularized network, training a bigger network almost never hurts; it just takes longer.

### Regularization

Regularization should be one of the first attempted solutions to a high-variance problem.

Regularization is done by adding a regularization term to the cost function. Lambda is a regularization (hyper)parameter.

In neural networks, the squared Frobenius norm of each weight matrix is typically used as the penalty (aka L2-regularization). It is the sum of the squared elements of a matrix. [NOTE: In the following image, the equation for Frob norm is incorrect]

We also have to take the regularization term into account while doing backprop and updating the parameters.

<img src="notes_images/regu.png" width="700">

### Why does regularization reduce overfitting?

The regularization term in the cost function penalizes weight matrices for being too large. 

Setting a large value for lambda means that many W-terms have to be close to zero. Consequently, many hidden units become insignificant, which creates a "simpler network" less prone to overfitting.

<img src="notes_images/y_reg.png" width="700">

(Another way to think about this for a tanh activation: setting a high lambda will push the W-parameters close to zero, which pushes the activation function outputs to the linear zone of tanh. A more linear network -> reduced overfitting.)

### Dropout regularization

While L2-regularization might be the most popular regularization method, dropout is also a powerful method.

Dropout regularization drops out (eliminates) random nodes in the neural network. This basically creates a smaller, diminished network.

The random dropout is repeated for each training example. So each example has its "own" network.

Inverted dropout is the most common way to implement this regularization. After the dropout, *a* is scaled back up by dividing it by the keep probability keep.prob (see green box). If we did not do the inversion step, *a*'s expected value would be lowered by the dropout rate (e.g. with keep.prob = 0.8, *a* would be overall 20 % smaller than it should be). This would cause scaling problems in testing.

<img src="notes_images/do_reg.png" width="700">

The dropout should not be used during test time (test data), since it would make the testing outputs noisy (random). Luckily, if we used inverted dropout to scale *a* back up while training, we actually do not need to take dropout into account while testing.
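A minimal NumPy sketch of inverted dropout for one layer's activations. The function name `dropout_forward` and the `training` flag are assumptions made for this example, not the course's code.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout on the activations a of one layer."""
    if not training:
        return a, None                      # test time: no dropout, no rescaling
    mask = rng.random(a.shape) < keep_prob  # True = keep the unit
    a_dropped = a * mask / keep_prob        # zero out units, then scale back up
    return a_dropped, mask

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
a_train, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
print(a_train.mean())  # stays close to the original mean of 1.0
```

In backprop the same `mask` would be applied to the gradient of `a`, again followed by the division by `keep_prob`.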

### Understanding dropout

Dropout makes the network "smaller". Using a smaller network has a regularizing effect reducing overfitting.

Dropout makes single units less able to rely on single features, because any feature might get dropped out. This means that they spread out the weights (instead of one very high and others low). This shrinks the squared norm of the weights. So just like L2-regularization, dropout basically shrinks the weights, albeit in a different way.

<img src="notes_images/y_do.png" width="700">

The value of keep.prob can also be varied for each layer. Typically you would have a lower keep.prob (so drop out more units) in bigger layers, and a higher keep.prob in smaller layers.

Most of the time you don't want to drop out data from the input layer.

In computer vision, dropout is commonly used since there is almost never enough data, which causes overfitting.

A downside of dropout is that the cost function J is no longer well-defined. By dropping out random nodes, you make J hard to calculate. Consequently, you lose a debugging tool: you can no longer graph the cost function over iterations and expect J to decrease. To combat this, you can momentarily turn off dropout, check that J monotonically decreases, and then turn dropout back on (and pray there aren't other bugs).

### Other regularization methods

**Data augmentation**. Increase the amount of training examples by augmenting the training set: e.g. flipping/cropping/rotating/zooming images and adding them as "new" examples. This is not as good as obtaining actually new examples, but it is free and it helps. However, use your brains: you might want to tell your network that a horizontally flipped cat is still a cat, but maybe not a vertically flipped one.

**Early stopping**. Basically you stop the optimization early while the W-values are in the midrange (they are initialized close to zero). The right time can be figured out with the dev set by stopping when the dev set error starts increasing (usually it first goes down and then increases). However, a big downside is that early stopping messes up orthogonalization by mixing up optimizing J and solving overfitting (making everything more complicated). Usually it's better to just use L2-reg. However, ES might sometimes be more handy than L2, since you don't need to try out values for the hyperparameter lambda.

Orthogonalization = the idea of working on one task at a time. E.g.
1) Optimize J.  
2) Solve overfitting.  

This concept will be talked about more later.

The standard (Ng) way to go is to just use L2-reg and try out different values of lambda.

### Normalizing inputs

Normalizing the inputs will speed up the training of your network.

Basically you normalize a feature by subtracting the mean $\mu$ and dividing by standard deviation $\sigma$. This gives every feature zero mean and unit variance, so the different features end up on a similar scale.

Use the same mean and sd calculated for your training set to normalize your dev set and test set.
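A short NumPy sketch of this, assuming the examples are stored as columns (shape `(n_features, m)`). The names `fit_normalizer` and `normalize` and the small `eps` guard against zero variance are choices made for this example.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps  # eps avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set statistics to any split (train, dev or test)."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))
X_dev = rng.normal(loc=5.0, scale=3.0, size=(2, 100))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_dev_n = normalize(X_dev, mu, sigma)  # same mu and sigma as for training
```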

Normalization works because it makes the cost function look more symmetric relative to the parameters. This allows the optimization algorithm (gradient descent) to find the J minimum faster.

<img src="notes_images/norm.png" width="700">

### Vanishing/exploding gradients

In very deep neural networks, the derivatives/slopes can get very small (vanish) or very big (explode), which makes training difficult and slow.

Carefully chosen random weight initialization can significantly reduce this problem.

If your weights are a bit smaller than 1, in a very deep NN the activations (and with them the derivatives) shrink exponentially with depth ($W^L$) and can vanish.

If your weights are a bit larger than 1, in a very deep NN the activations grow exponentially with depth and can explode.
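A toy demonstration of this effect, assuming the simplest possible setup: every layer uses the same weight matrix `scale * I`, a linear activation and no bias. The function name `deep_linear_output` is made up for this sketch.

```python
import numpy as np

def deep_linear_output(scale, depth, x):
    """Push x through `depth` identical linear layers with W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(depth):
        a = W @ a  # linear activation: a = z
    return a

x = np.ones((2, 1))
print(deep_linear_output(1.5, 50, x)[0, 0])  # grows like 1.5**50
print(deep_linear_output(0.5, 50, x)[0, 0])  # shrinks like 0.5**50
```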

In the image example, a linear activation is used to make this example simpler.

<img src="notes_images/vaex.png" width="700">

### Weight initialization for deep networks

Careful initialization of the weights can partially solve the vanishing/exploding gradient problem. However, this is not a total solution and the problem will still persist to some degree in very deep networks.

Basically we want to initialize the weights so that each of the W-matrices is not too much bigger or smaller than 1. This way z doesn't grow or decrease too quickly.

To achieve this, we can set the variance of the initialized weights to $\frac{2}{n}$ (ReLU) or $\frac{1}{n}$ (tanh), where n is the number of incoming features to a neuron.

The variance can also be treated as a hyperparameter, but usually this is not necessary/worth the trouble.
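A sketch of this initialization in NumPy. The function name `init_weights`, the `activation` switch and the parameter dictionary layout are assumptions for this example; `layer_dims[0]` is taken to be the number of input features.

```python
import numpy as np

def init_weights(layer_dims, activation="relu", seed=0):
    """Random init with variance 2/n (ReLU) or 1/n (tanh), n = incoming features."""
    rng = np.random.default_rng(seed)
    factor = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_in = layer_dims[l - 1]
        # standard normal has variance 1, so scaling by sqrt(factor / n_in)
        # gives variance factor / n_in
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], n_in)) * np.sqrt(factor / n_in)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = init_weights([1000, 500, 1])
print(params["W1"].var())  # should be close to 2 / 1000
```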

<img src="notes_images/we_in.png" width="700">

### Numerical approximation of gradients

We can use numerical derivatives to check if the analytical derivatives in our back propagation are correct, i.e. whether we have implemented back prop correctly.

The two-sided numerical derivative formula is more accurate than the one-sided one. We will use the two-sided formula for gradient checking.
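A quick check of that claim on $f(\theta) = \theta^3$ at $\theta = 1$, where the true derivative is 3. The helper names `one_sided` and `two_sided` are made up for this sketch.

```python
def one_sided(f, theta, eps=0.01):
    """(f(theta + eps) - f(theta)) / eps, error of order eps."""
    return (f(theta + eps) - f(theta)) / eps

def two_sided(f, theta, eps=0.01):
    """(f(theta + eps) - f(theta - eps)) / (2 * eps), error of order eps**2."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

f = lambda t: t ** 3
print(one_sided(f, 1.0))  # about 3.0301, error about 0.03
print(two_sided(f, 1.0))  # about 3.0001, error about 0.0001
```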

<img src="notes_images/che_de.png" width="700">

### Gradient checking

Gradient checking can be used to verify that back propagation is implemented correctly, and to find bugs if it is not.

Basically you put all the parameters (W, b) into a single vector $\theta$. Similarly, you put all the derivatives of the parameters (dW, db) into a single vector $d\theta$.

Then, you compute the numerical derivative $d\theta_{approx}$ for each element in the parameter vector $\theta$ and compare it to the "real" derivative in $d\theta$. If the values are close to one another, then you have implemented back prop correctly.

If the values are not close to one another, see which terms have larger differences and search for bugs in the code.
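A minimal sketch of the check, assuming the cost is available as a function `J(theta)` of the flattened parameter vector. The name `grad_check` and the use of a normalized Euclidean distance as the "closeness" measure are assumptions of this example; a tiny result (many orders of magnitude below 1) suggests the gradient is fine.

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Relative difference between dtheta and the two-sided numerical gradient."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps    # nudge only the i-th parameter
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(dtheta - dtheta_approx)
    den = np.linalg.norm(dtheta) + np.linalg.norm(dtheta_approx)
    return num / den

# Toy cost J = sum(theta^2), whose true gradient is 2 * theta.
J = lambda th: float(np.sum(th ** 2))
theta = np.array([1.0, -2.0, 3.0])
print(grad_check(J, theta, 2 * theta))  # correct gradient: tiny difference
print(grad_check(J, theta, 3 * theta))  # buggy gradient: large difference
```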

<img src="notes_images/gra_che.png" width="700">

### Gradient Checking Implementation Notes

Some practical tips for how to implement gradient checking:  
1) Do grad check only to debug - turn it off for standard training. If you always keep it on, it makes training very slow.  
2) If grad check fails, then check which terms have a large difference. Check calculation of those terms for bugs.  
3) If you are using regularization, then remember to include the J reg term in the calculations while doing grad check.  
4) Gradient check does not work with dropout. Dropout makes J definition blurry, so grad check becomes hard to use. If you have to do this, do the grad check with dropout off.  
5) Run the grad check at initialization and then later after training the network for a while. It is possible that the grad check passes at initialization since w, b are close to zero but fails with higher values. This is usually not the case, but good to keep in mind.

## Week 1 - Programming exercises

### Exercise 1 - Initialization

#### Initializing the weights to zeros:

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with $n^{[l]}=1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression. 

**What you should remember**:
- The weights $W^{[l]}$ should be initialized randomly to break symmetry. 
- It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly. 


#### Initializing the weights to very large random numbers:

**Observations**:
- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm. 
- If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.

**In summary**:
- Initializing weights to very large random values does not work well. 
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part! 

#### Initializing the weights with He initialization:

This is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar except Xavier initialization uses a scaling factor for the weights $W^{[l]}$ of `sqrt(1./layers_dims[l-1])` where He initialization would use `sqrt(2./layers_dims[l-1])`.)

Instead of multiplying `np.random.randn(..,..)` by 10, you will multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialization recommends for layers with a ReLU activation. 

**Observations**:
- The model with He initialization separates the blue and the red dots very well in a small number of iterations.


#### Exercise conclusions:

You have seen three different types of initializations. For the same number of iterations and same hyperparameters the comparison is:

<table> 
    <tr>
        <td>
        **Model**
        </td>
        <td>
        **Train accuracy**
        </td>
        <td>
        **Problem/Comment**
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with zeros initialization
        </td>
        <td>
        50%
        </td>
        <td>
        fails to break symmetry
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with large random initialization
        </td>
        <td>
        83%
        </td>
        <td>
        too large weights 
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with He initialization
        </td>
        <td>
        99%
        </td>
        <td>
        recommended method
        </td>
    </tr>
</table> 

**What you should remember from this notebook**:
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don't initialize to values that are too large
- He initialization works well for networks with ReLU activations. 

### Exercise 2 - Regularization

#### L2-regularization:

The standard way to avoid overfitting is called **L2 regularization**. It consists of appropriately modifying your cost function, from:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
To:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$

**Observations**:
- The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias.

**What is L2-regularization actually doing?**:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes. 

**What you should remember** -- the implications of L2-regularization on:
- The cost computation:
    - A regularization term is added to the cost
- The backpropagation function:
    - There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller ("weight decay"): 
    - Weights are pushed to smaller values.
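The two L2 pieces (extra cost term and extra gradient term) could look like this in NumPy. This is a sketch: `l2_cost_term` and `l2_grad_term` are hypothetical names, not the notebook's functions, and the weight matrices are assumed to be kept in a plain Python list.

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """(1/m) * (lambda/2) * sum of all squared weights, as in equation (2)."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad_term(W, lambd, m):
    """Extra term added to dW in backprop: the derivative of the L2 cost w.r.t. W."""
    return (lambd / m) * W

W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
print(l2_cost_term([W1], lambd=0.5, m=5))  # 0.5 / 10 * (1 + 4 + 9 + 16)
print(l2_grad_term(W1, lambd=0.5, m=5))    # 0.1 * W1
```

Because the extra gradient term is proportional to `W`, every update shrinks the weights a little, which is where the name "weight decay" comes from.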

#### Dropout regularization:

**Dropout** is a widely used regularization technique that is specific to deep learning. 
**It randomly shuts down some neurons in each iteration.**

At each iteration, you shut down (= set to zero) each neuron of a layer with probability $1 - keep\_prob$ or keep it with probability $keep\_prob$. The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration.

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time. 

**Note**:
- A **common mistake** when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training. 
- Deep learning frameworks like [tensorflow](https://www.tensorflow.org/api_docs/python/tf/nn/dropout), [PaddlePaddle](http://doc.paddlepaddle.org/release_doc/0.9.0/doc/ui/api/trainer_config_helpers/attrs.html), [keras](https://keras.io/layers/core/#dropout) or [caffe](http://caffe.berkeleyvision.org/tutorial/layers/dropout.html) come with a dropout layer implementation. Don't stress - you will soon learn some of these frameworks.

**What you should remember about dropout:**
- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.  

#### Exercise conclusions:

**Here are the results of our three models**: 

<table> 
    <tr>
        <td>
        **model**
        </td>
        <td>
        **train accuracy**
        </td>
        <td>
        **test accuracy**
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN without regularization
        </td>
        <td>
        95%
        </td>
        <td>
        91.5%
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with L2-regularization
        </td>
        <td>
        94%
        </td>
        <td>
        93%
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with dropout
        </td>
        <td>
        93%
        </td>
        <td>
        95%
        </td>
    </tr>
</table> 

Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system. 

**What we want you to remember from this notebook**:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.

### Exercise 3 - Gradient Checking

#### How does gradient checking work?

Backpropagation computes the gradients $\frac{\partial J}{\partial \theta}$, where $\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function.

Let's look back at the definition of a derivative (or gradient):
$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$

We know the following:

- $\frac{\partial J}{\partial \theta}$ is what you want to make sure you're computing correctly. 
- You can compute $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$ (in the case that $\theta$ is a real number), since you're confident your implementation for $J$ is correct. 

#### Exercise conclusions:

**What you should remember from this notebook**:
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, because approximating the gradient with $\frac{\partial J}{\partial \theta} \approx  \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally costly, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process. 
- Gradient Checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout.

## Week 2

Deep learning works best on big data, but training on large datasets is slow. Learning to use efficient optimization algorithms is important for speeding up the training.

### Mini-batch gradient descent

With the "standard" (batch) gradient descent we have to go through all the training examples to take a single gradient descent step. If we have millions of examples, this can be very slow.

We can make faster progress, if we allow GD to start taking steps even before processing all the examples.

Split your training set (5,000,000 examples) to baby training sets (1,000 examples each). These baby training sets are called **mini-batches**.

<img src="notes_images/mb_gd.png" width="700">

In mini-batch gradient descent, we take a step of gradient descent for each mini-batch. So essentially we do everything (forward, cost, backward, update) just as before, just one minibatch at a time.

A single pass through the entire training set is called an **epoch**. While in batch gradient descent during a single pass we take a single step, in mini-batch GD during an epoch we take many (here 5000) steps.

For a large dataset, mini-batch is way faster than batch. Mini-batch is the standard GD approach for large datasets.

<img src="notes_images/mb_gd_alg.png" width="700">

### Understanding mini-batch gradient descent

In batch GD, the cost decreases with each step of GD. However, in mini-batch GD, this does not necessarily happen, since each step is taken based on a different mini-batch. The general trend is still downwards for mini-batch, but the progress is "noisier".

The mini-batch size affects how GD converges:  
 - If the size is *m*, we basically have the familiar batch GD. Each step is slow.  
 - If the size is 1, each example is a new "training set" and we have **stochastic gradient descent**. Each step is quick, but we lose all the speedup from vectorization, which is inefficient (main problem). It is also very noisy and never fully converges: it keeps oscillating around the minimum.  
 - In practice the size is chosen between m (too large) and 1 (too small). This will give us fastest learning, since we get quick progression (steps) AND the speedup from vectorization.

<img src="notes_images/mb_gd_und.png" width="700">

How to choose the mini-batch size?  
 - For a small training set (m < 2000), use standard batch GD  
 - For a bigger training set, typical sizes would be between 64 and 512.
     - These are usually powers of two (64, 128, 256, 512), because on certain systems this can run faster.
     - Find an efficient value through trial and error.
 - Make sure that the mini-batch fits in the CPU/GPU memory.

There are even more efficient optimization algorithms than batch/mini-batch gradient descent.

### Exponentially weighted averages

In order to understand the algorithms that are faster than GD, we need to understand exponentially weighted averages (sometimes called running averages). EWAs are a key component in many advanced optimization algorithms.

See the equation in the image below.

If $\beta$ is high (0.98), the exponentially weighted average changes more slowly (green curve), since it puts more emphasis on the previous days' temperatures.

If $\beta$ is low (0.50), the changes are much quicker and noisier (yellow curve), since we are averaging over a shorter time window.

<img src="notes_images/exp_wa.png" width="700">

### Understanding exponentially weighted averages

Mathematically this is quite simple. Basically, $\beta$ determines how many previous terms actually matter, i.e. how far back does the average go. If beta is higher, more previous terms are added before the function becomes practically zero. So at each time step there is an exponentially decaying function that depends on the previous time steps.

<img src="notes_images/ewa_und.png" width="700">

Implementation is light on memory, since it requires just keeping a single real number in memory. This is also the main reason why we use this - there are more accurate ways to compute a running average, but most of them require more memory. 

E.g. you could always sum over 50 days and divide by 50, but this requires keeping more values in memory and is computationally more expensive.

EWA is a simple, computationally and memory efficient way to calculate averages over a lot of variables.
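A sketch of the single-number implementation, using the usual update $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$ starting from $v_0 = 0$. The function name `ewa` is made up for this example.

```python
def ewa(values, beta=0.9):
    """Exponentially weighted average; only the latest v is kept between steps."""
    v = 0.0
    out = []
    for x in values:
        v = beta * v + (1 - beta) * x  # overwrite the single running value
        out.append(v)
    return out

temps = [10.0, 10.0, 10.0]
print(ewa(temps))  # starts far below 10 because v starts at 0
```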

<img src="notes_images/ewa_imp.png" width="700">

### Bias correction in exponentially weighted averages

Bias correction helps us get a better estimate of the quantity being averaged (e.g. temperature) in the early stages of EWA. Without bias correction, the values are much lower (purple) than they should be (green, with bias correction).

The bias correction basically divides $v_t$ by $1 - \beta^t$, a denominator that goes to one as *t* grows. So it helps early on, then fades away.
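A sketch of the corrected average, dividing each $v_t$ by $1 - \beta^t$. The function name `ewa_bias_corrected` is an assumption for this example.

```python
def ewa_bias_corrected(values, beta=0.9):
    """Exponentially weighted average with bias correction v_t / (1 - beta**t)."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))  # the correction matters only for small t
    return out

print(ewa_bias_corrected([10.0, 10.0, 10.0]))  # stays at about 10 from the first step
```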

<img src="notes_images/ewa_bias.png" width="700">

### Gradient descent with momentum

Almost always faster than the standard (batch/mini-batch) gradient descent algorithm.

Basic idea: compute an exponentially weighted average over the gradients and use it to update the weights.

Momentum basically smooths out the steps of gradient descent, which allows GD to find the optimum faster when the cost function has a highly non-symmetric shape. The default GD would just dance around (blue line in image) in cases like this, but momentum goes straight through (red line).

Basically makes the current step of gradient descent depend on the previous steps. *NOTE: Reminds me of MCMC*

An analogy is a ball rolling into a bowl (see image). There is an acceleration term, the current velocity and the friction.

<img src="notes_images/gd_mom.png" width="700">

Implementation:
 - A robust default value is $\beta = 0.90$. 
 - Bias correction is usually not used, since after 10 steps the bias is already gone (with default $\beta$).
 - Some people leave out the $(1-\beta)$ term but we will use it. The different versions just require different $\alpha$.
 - $v_{dW}$ has the same dimensions as dW and W
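One momentum step for a single weight matrix, written with the $(1-\beta)$ term. A sketch only: `momentum_step` is a made-up name, and `v_dW` is assumed to start as zeros with the same shape as `W`.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """Update the velocity with the new gradient, then step along the velocity."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW

W = np.array([[1.0]])
v_dW = np.zeros_like(W)   # same shape as W and dW
dW = np.array([[2.0]])
W, v_dW = momentum_step(W, dW, v_dW)
print(W, v_dW)
```

The bias `b` would be updated the same way with its own `v_db`.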

<img src="notes_images/gd_momi.png" width="700">

### RMSprop

Root mean square prop has a basic idea similar to momentum, but the equations are slightly different. RMSprop damps down the oscillations in GD allowing us to use a larger learning rate $\alpha$. All of this speeds up the training.

In the image, the updates in the vertical direction are divided by large numbers, which helps damp down the up-down oscillations. In horizontal the division is done by small numbers, so the steps are bigger.

A small epsilon term is added to the denominator in the update to avoid division by zero.
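One RMSprop step for a single weight matrix, as a sketch. The name `rmsprop_step` and the default values in the signature are placeholders chosen for this example; `s_dW` is assumed to start as zeros.

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.9, eps=1e-8):
    """Keep a running average of squared gradients and divide the step by its root."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2        # element-wise square
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)       # eps avoids division by zero
    return W, s_dW

W = np.array([[1.0]])
s_dW = np.zeros_like(W)
dW = np.array([[2.0]])
W, s_dW = rmsprop_step(W, dW, s_dW)
print(W, s_dW)
```

Directions with consistently large gradients get a large `s_dW` and therefore smaller steps, which is the damping described above.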

<img src="notes_images/rmsprop.png" width="700">

### Adam optimization algorithm

"Adaptive moment estimation"

Adam = Momentum + RMSprop

Adam is commonly used and very effective. It generalizes well for different types of problems. 

Historically in DL research there have been a bunch of optimization algorithms that work for very specific tasks making them not very useful generally. But Adam is one of the good guys.

In Adam, you typically do implement bias correction, both for the momentum and the RMSprop terms.

<img src="notes_images/adam.png" width="700">

While using Adam, the hyperparameters are usually chosen as follows:
 - Tune learning rate $\alpha$.
 - Use the default value $\beta_1 = 0.90$
 - Use the default value $\beta_2 = 0.999$
 - Use the default value (can differ, not super crucial) $\epsilon = 10^{-8}$

The betas can naturally also be tuned, but this is rarely done.
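One Adam step for a single weight matrix, combining the momentum and RMSprop terms with bias correction. This is a sketch with a made-up function name `adam_step`; `v` and `s` are assumed to start as zeros and the step counter `t` starts at 1.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum average (v) + RMSprop average (s), both bias-corrected."""
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    v_hat = v / (1 - beta1 ** t)   # bias correction for the momentum term
    s_hat = s / (1 - beta2 ** t)   # bias correction for the RMSprop term
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W = np.array([[1.0]])
v, s = np.zeros_like(W), np.zeros_like(W)
dW = np.array([[2.0]])
W, v, s = adam_step(W, dW, v, s, t=1)
print(W)  # the first step has size of roughly alpha
```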

### Learning rate decay

Slowly reducing the learning rate over time could help speed up the training. This is called learning rate decay.

During the initial stages of learning we take bigger steps. As learning approaches convergence, having a smaller learning rate allows us to take smaller steps and get closer to the optimum (especially with something noisy like mini-batch GD with small mini-batches).

While implementing decay, the learning rate is usually tied to the epoch number. Alpha_zero, the initial learning rate, and the decay rate are hyperparameters that usually need to be tuned. 
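One common form of epoch-based decay is $\alpha = \frac{\alpha_0}{1 + \text{decay rate} \cdot \text{epoch number}}$. That formula is an assumption here, since the notes only show the equation in the image; the function name `decayed_lr` is made up too.

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    """Learning rate after `epoch_num` epochs: alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(1, 5):
    print(epoch, decayed_lr(alpha0=0.2, decay_rate=1.0, epoch_num=epoch))
```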

<img src="notes_images/lr_dec.png" width="700">

There are also other ways to implement learning rate decay, e.g. exponential decay, discrete staircase, manual decay.

Learning rate decay is not among the first options to speed up training. Usually, a well chosen static alpha is already good enough. However, in some cases decay can help.

### The problem of local optima

Historically, getting stuck in local optima has been thought of as a big problem. This is largely based on thinking about optimization as a low-dimensional problem (like 2-D or 3-D). However, this thinking does not really transfer to high-dimensional spaces.

Nowadays, neural networks might have thousands of parameters (or far more). In a parameter space like this, it's highly unlikely that a point with zero gradient is an actual local optimum, i.e. a point where the cost curves upwards in all of those thousands of directions. Usually such a point is a saddle point, with some downhills and some uphills around it.

However, while the optimization is not likely to get _stuck_, it can be _slow_. *Plateaus* are regions where the gradient is close to zero for a long time, causing the optimization to progress very slowly.

<img src="notes_images/plateaus.png" width="700">

Modern optimization algorithms like Momentum, RMSprop and Adam deal reasonably well with plateaus.

## Week 2 - Programming exercises

### Exercise 1 - Gradient descent

**What you should remember**:
- The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
- You have to tune a learning rate hyperparameter $\alpha$.
- With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms both gradient descent and stochastic gradient descent (particularly when the training set is large).

### Exercise 2 - Mini-Batch Gradient descent

There are two steps to build mini-batches from the training set (X, Y):
- **Shuffle**: Create a shuffled version of the training set (X, Y). Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the $i^{th}$ column of X is the example corresponding to the $i^{th}$ label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.

- **Partition**: Partition the shuffled (X, Y) into mini-batches of size `mini_batch_size` (here 64). Note that the number of training examples is not always divisible by `mini_batch_size`. The last mini batch might be smaller, but you don't need to worry about this.
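The two steps can be sketched in NumPy roughly like this. This is not the exercise's reference solution; the function name, the `seed` argument and the toy data are assumptions made for illustration.

```python
import numpy as np


def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then split them into mini-batches.

    X has shape (n_features, m) and Y has shape (n_outputs, m): one column per example.
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)

    # Shuffle: the same permutation is applied to X and Y so pairs stay aligned.
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]

    # Partition: slicing past the end is safe, so the last batch may be smaller.
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size
        mini_batches.append((shuffled_X[:, start:end], shuffled_Y[:, start:end]))
    return mini_batches


X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = X[:1, :] % 2  # toy labels derived from X, so alignment can be checked
batches = random_mini_batches(X, Y, mini_batch_size=64)
print([bx.shape[1] for bx, _ in batches])  # [64, 64, 22]
```

With 150 examples and a batch size of 64, the last mini-batch simply ends up with the remaining 22 examples.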

**What you should remember**:
- Shuffling and Partitioning are the two steps required to build mini-batches
- Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.

### Exercise 3 - Momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations. 

Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.

The momentum update rule is, for $l = 1, ..., L$: 

$$ \begin{cases}
v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\
W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}
\end{cases}\tag{3}$$

$$\begin{cases}
v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\
b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} 
\end{cases}\tag{4}$$

where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate.

**Note** that:
- The velocity is initialized with zeros. So the algorithm will take a few iterations to "build up" velocity and start to take bigger steps.
- If $\beta = 0$, then this just becomes standard gradient descent without momentum. 
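Equations (3) and (4) could be implemented along these lines. It's a sketch, not the exercise's reference code: the dictionary layout (`"W1"`, `"b1"` for parameters, `"dW1"`, `"db1"` for gradients and velocities) and the function names are assumptions.

```python
import numpy as np


def initialize_velocity(parameters):
    """Return zero-filled velocities with the same shapes as the parameters."""
    return {"d" + name: np.zeros_like(value) for name, value in parameters.items()}


def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """Apply equations (3) and (4) to every parameter, updating the dicts in place."""
    for name in parameters:
        key = "d" + name
        # Exponentially weighted average of the gradients.
        v[key] = beta * v[key] + (1 - beta) * grads[key]
        # Step along the smoothed direction instead of the raw gradient.
        parameters[name] = parameters[name] - learning_rate * v[key]
    return parameters, v


parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, 0.2]]), "db1": np.array([[1.0]])}
v = initialize_velocity(parameters)
parameters, v = update_parameters_with_momentum(
    parameters, grads, v, beta=0.9, learning_rate=0.1)
print(v["dW1"], parameters["W1"])
```

Since the velocity starts at zero, the first step only uses a fraction $(1 - \beta)$ of the gradient, which is the "build up" effect mentioned in the note.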

**How do you choose $\beta$?**

- The larger the momentum $\beta$ is, the smoother the update because the more we take the past gradients into account. But if $\beta$ is too big, it could also smooth out the updates too much. 
- Common values for $\beta$ range from 0.8 to 0.999. If you don't feel inclined to tune this, $\beta = 0.9$ is often a reasonable default. 
- Finding the optimal $\beta$ for your model might require trying several values to see what works best in terms of reducing the value of the cost function $J$.

**What you should remember**:
- Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
- You have to tune a momentum hyperparameter $\beta$ and a learning rate $\alpha$.

### Exercise 4 - Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum.

**How does Adam work?**
1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction). 
2. It calculates an exponentially weighted average of the squares of the past gradients, and  stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction). 
3. It updates parameters in a direction based on combining information from "1" and "2".

The update rule is, for $l = 1, ..., L$: 

$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\
v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\
s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\
W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{cases}$$
where:
- $t$ counts the number of Adam steps taken so far
- L is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages. 
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid dividing by zero
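The update rule translates to NumPy fairly directly. The sketch below assumes the same hypothetical dictionary layout as before (`"W1"` for a parameter, `"dW1"` for its gradient and moment estimates); the default values for `beta1`, `beta2` and `epsilon` are the ones suggested in the Adam paper.

```python
import numpy as np


def initialize_adam(parameters):
    """Zero-initialize the first (v) and second (s) moment estimates."""
    v = {"d" + k: np.zeros_like(p) for k, p in parameters.items()}
    s = {"d" + k: np.zeros_like(p) for k, p in parameters.items()}
    return v, s


def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for every parameter; t is the 1-based step counter."""
    for name in parameters:
        key = "d" + name
        g = grads[key]
        v[key] = beta1 * v[key] + (1 - beta1) * g        # average of gradients
        s[key] = beta2 * s[key] + (1 - beta2) * g ** 2   # average of squared gradients
        v_corrected = v[key] / (1 - beta1 ** t)          # bias correction
        s_corrected = s[key] / (1 - beta2 ** t)
        parameters[name] = parameters[name] - learning_rate * v_corrected / (
            np.sqrt(s_corrected) + epsilon)
    return parameters, v, s


parameters = {"W1": np.array([[1.0, -2.0]])}
grads = {"dW1": np.array([[0.1, -0.2]])}
v, s = initialize_adam(parameters)
parameters, v, s = update_parameters_with_adam(
    parameters, grads, v, s, t=1, learning_rate=0.01)
print(parameters["W1"])
```

On the very first step the bias correction makes each parameter move by roughly `learning_rate` against the sign of its gradient, regardless of the gradient's magnitude.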

### Exercise 5 - Model with different optimization algorithms

#### Summary

<table> 
    <tr>
        <td>
        **optimization method**
        </td>
        <td>
        **accuracy**
        </td>
        <td>
        **cost shape**
        </td>
    </tr>
    <tr>
        <td>
        Gradient descent
        </td>
        <td>
        79.7%
        </td>
        <td>
        oscillations
        </td>
    </tr>
    <tr>
        <td>
        Momentum
        </td>
        <td>
        79.7%
        </td>
        <td>
        oscillations
        </td>
    </tr>
    <tr>
        <td>
        Adam
        </td>
        <td>
        94%
        </td>
        <td>
        smoother
        </td>
    </tr>
</table> 

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.

Adam, on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you've seen that Adam converges a lot faster.

Some advantages of Adam include:
- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum) 
- Usually works well even with little tuning of hyperparameters (except $\alpha$)

**References**:

- Adam paper: https://arxiv.org/pdf/1412.6980.pdf

## Week 3

### Hyperparameter tuning: tuning process

The sheer number of hyperparameters can make training networks tricky. The following might be present:

| Hyperparameter | Tuning importance |
|:---|:---|
|$\alpha$|1st|
|$\beta$|2nd|
|\# of hidden units|2nd|
|mini-batch size|2nd|
|\# of layers|3rd|
|learning rate decay|3rd|
|$\beta_1, \beta_2, \epsilon$|don't tune, use defaults (Adam)|

However, the order of importance is somewhat subjective.

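When sampling hyperparameter values, prefer random sampling over a regular grid. With the same number of trials, random sampling tries more distinct values of each hyperparameter, which matters because you rarely know in advance which hyperparameters are the important ones.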

<img src="notes_images/hp_tune.png" width="700">

You can also follow the "coarse to fine" approach. First, randomly sample the entire space. Then, sample again focusing on the region of the space that yielded the best values.
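The point about distinct values can be checked with a toy comparison. The two hyperparameters and their ranges below are made up for illustration.

```python
import random

rng = random.Random(0)

# 25 trials on a 5 x 5 grid: each hyperparameter only ever takes 5 values.
grid_trials = [(a, b) for a in (0.1, 0.2, 0.3, 0.4, 0.5) for b in (1, 2, 3, 4, 5)]

# 25 random trials: each hyperparameter takes a fresh value in every trial.
random_trials = [(rng.uniform(0.1, 0.5), rng.uniform(1, 5)) for _ in range(25)]

distinct_grid = len({a for a, _ in grid_trials})
distinct_random = len({a for a, _ in random_trials})
print(distinct_grid, distinct_random)
```

For the same budget of 25 trained models, the grid explores 5 values of the first hyperparameter and random sampling explores 25.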

### Hyperparameter tuning: Using an appropriate scale to pick hyperparameters

Hyperparameters should be sampled randomly. But the randomness should not always be uniform over the entire range of valid values. Instead, we should pick an appropriate scale for exploring the hyperparameters.

For hyperparameters like *# of hidden units* and *# of layers*, sampling uniformly on a linear scale over some range can be reasonable. However, this is not the case for all of the hpars.

When searching for $\alpha$, a logarithmic scale should be used. On a linear scale, random sampling spends most of its budget in the wrong place. E.g. when sampling uniformly on a linear scale in [0.0001 ... 1.0], about 90% of the sampled values will be between 0.1 and 1.0, and only about 10% land in the three lower decades, where we also want to search.

<img src="notes_images/hp_scale.png" width="700">

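When searching for $\beta$ (hpar for exponentially weighted average), a logarithmic scale should also be used, but applied to $1-\beta$. We want to sample more densely in the region where $\beta$ is close to one, since that's where the function $\frac{1}{1-\beta}$ changes most rapidly. For example, going from $\beta = 0.999$ to $0.9995$ doubles the averaging window from about 1000 to about 2000 steps, while going from $0.9$ to $0.9005$ barely changes it.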
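A small sketch of sampling on a log scale. The exponent ranges ($\alpha$ in [0.0001, 1] and $\beta$ in [0.9, 0.999]) and the function names are only example choices.

```python
import random


def sample_learning_rate(rng, low_exp=-4.0, high_exp=0.0):
    """Draw alpha = 10**r with r uniform in [low_exp, high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)


def sample_beta(rng, low_exp=-3.0, high_exp=-1.0):
    """Draw 1 - beta = 10**r, so beta lies in [0.9, 0.999] by default."""
    return 1 - 10 ** rng.uniform(low_exp, high_exp)


rng = random.Random(0)
alphas = [sample_learning_rate(rng) for _ in range(10_000)]
betas = [sample_beta(rng) for _ in range(10_000)]

# Each decade of alpha now receives roughly a quarter of the samples.
share_below_0_001 = sum(a < 1e-3 for a in alphas) / len(alphas)
print(round(share_below_0_001, 2))
```

The uniform draw happens on the exponent, so every decade of the range gets about the same share of the samples.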

### Hyperparameter tuning in practice: Pandas vs. Caviar

There are at least two common approaches for organizing the hyperparameter search process:

- Babysit a single model, checking progress and nudging the hpar values every day. Use this when computational resources are scarce. Ng: "Panda".
- Train multiple models in parallel with different hyperparameter settings. Use this when enough computational resources are available. Ng: "Caviar".

Hyperparameter values from one application domain (e.g. vision) in DL do not automatically work for another (e.g. speech).

### Batch Normalization: Normalizing activations in a network

Batch normalization makes the hyperparameter search problem much easier and makes our neural network more robust. It will also enable us to much more easily and efficiently train even deeper networks.

Normalizing the input features can speed up learning, as we learned before.

Turns out we can extend the normalization to the NN activations to make training even more efficient. This is called batch normalization.

We will normalize the values before activations, i.e. normalize $z$. This is the standard. The normalization can also be done after activation (to $a$), but this is not as common.

To implement batch normalization, we calculate the $z$ mean and sd for a layer and use those to normalize the $z$ values within the layer. This is done separately for each layer.

However, this will force each unit's $z$ to have mean zero and variance one, which we might not want. In other words, we might not want to only give the non-linear activation functions values distributed around zero. Instead, we want to give them a variety of inputs. Consequently, instead of using the $z_{norm}$ directly we will use $\widetilde{z}$, which can have any mean and sd dictated by the learnable parameters $\gamma$ and $\beta$.

This $\beta$ is NOT the same one as used as a hyperparameter for momentum. They just happen to have the same symbol (in their original papers).
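A rough NumPy sketch of the forward computation for one layer. It assumes the usual formulas $z_{norm} = \frac{z - \mu}{\sqrt{\sigma^2 + \varepsilon}}$ and $\widetilde{z} = \gamma z_{norm} + \beta$; the small `eps` term for numerical stability and the function name are assumptions, not taken from the text above.

```python
import numpy as np


def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit (row of Z) over the examples (columns), then rescale.

    Z: (n_units, m); gamma, beta: (n_units, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1 per unit
    Z_tilde = gamma * Z_norm + beta          # learnable scale and shift
    return Z_tilde, mu, var


rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 256))
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), -1.0)
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```

Whatever the mean and sd of the incoming `Z`, each unit's output has mean $\beta$ and sd (almost exactly) $\gamma$.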

<img src="notes_images/batch_norm.png" width="700">

### Batch Normalization: Fitting Batch Norm into a neural network

In practice, batch normalization is just added as an additional step between calculating $z$ and $a$ for each unit. Basically, instead of using the unnormalized $z$, we are using the normalized $\widetilde{z}$ for calculating the activation. Everything else is business as usual.

BN adds two new trainable parameters for each layer: $\beta$ and $\gamma$. These are optimized as usual along with the others using GD, Adam etc.

However, the parameter $b$ can be omitted in BN, since the normalization step (mean to zero) cancels out any constant added to $z$ and so makes it redundant. The parameter $\beta$ sort of takes its place in BN.

In practice, batch norm is usually applied with mini-batches, not the entire training set. So basically whenever you calculate the BN, you just use the data from the current mini-batch for the mean and sd. The $\beta$ and $\gamma$ are still "global" and optimized as usual.

Usually implementing batch normalization is just a single line of code in a DL framework.

<img src="notes_images/bn_imp.png" width="700">

### Batch Normalization: Why does Batch Norm work?

There are three major reasons for why batch norm is useful.

*First*, it gives a similar range of values to all the hidden units, so that some units don't have $z$ ranging from 0 to 1 while others range from 0 to 10,000. This speeds up training.

*Second*, it makes weights deep in the network (e.g. layer 10) more robust to changes to weights in the earlier layers (e.g. layer 1), since the earlier layers cannot shift around as much. This makes learning easier for the later layers.

From the perspective of a later layer, the values coming (as inputs) from the previous layer are changing all the time during training (bc the parameters are changing). Usually, this makes the layer suffer from the problem of *covariate shift*. However, using BN stabilises the distribution of the values coming in from the previous layer - they get a somewhat stable mean and sd. So, BN reduces the problem of the input values of the later layer changing, in essence allowing the later layer to learn more independently of the previous layers.

*Third*, batch norm has a slight regularization effect when used with mini-batches. This is basically a side-effect.

Each $z$ value is scaled by the mean and sd of the mini-batch instead of the whole set. The mean and sd are a bit noisy, since the whole set is not used. This ends up adding slight noise to the value $\widetilde{z}$, which in turn adds noise to each layer's activations. This causes a regularization effect, since the later hidden units cannot rely heavily on the value of a single earlier unit.

A larger mini-batch size reduces the regularization effect, since the mini-batch mean and sd become less noisy.

However, BN should not be used as "true" regularization, since the amount of noise is small and it causes only a slight reg effect. For a stronger regularization, use dropout or another reg method alongside BN.

### Batch Normalization: Batch Norm at test time

Batch norm processes the data one mini-batch at a time, but at test time we might need to process one example at a time. This requires some adaptation at test time, since the mean and sd are required for getting results but it does not make sense to calculate them based on a single example.

To solve this, we estimate the mean and sd from the training set. This is usually done with an exponentially weighted average over mini-batches, which keeps track of the mean and sd values during training. This yields a rough estimate of the mean and sd, which we can use at test time.
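The idea can be sketched as follows. The random batches stand in for a real training loop, and the `momentum` argument (the weight of the exponentially weighted average), the `eps` term and the function names are assumptions for illustration.

```python
import numpy as np


def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.9):
    """Exponentially weighted average of the per-mini-batch statistics."""
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mean, running_var


def batch_norm_inference(z, running_mean, running_var, gamma, beta, eps=1e-8):
    """Normalize with the stored estimates instead of mini-batch statistics."""
    z_norm = (z - running_mean) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta


rng = np.random.default_rng(0)
running_mean = np.zeros((3, 1))
running_var = np.ones((3, 1))
for _ in range(200):  # stand-in for the mini-batches seen during training
    Z = rng.normal(loc=5.0, scale=3.0, size=(3, 64))
    running_mean, running_var = update_running_stats(
        running_mean, running_var,
        Z.mean(axis=1, keepdims=True), Z.var(axis=1, keepdims=True))

z_single = np.full((3, 1), 5.0)  # a single test example
out = batch_norm_inference(z_single, running_mean, running_var, gamma=1.0, beta=0.0)
print(running_mean.ravel(), running_var.ravel(), out.ravel())
```

The running estimates end up near the true mean 5 and variance 9 of the toy data, so a single example can be normalized without needing a mini-batch.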

<img src="notes_images/bn_test.png" width="700">

Naturally, frameworks like TensorFlow usually take care of implementing this, so we rarely have to do it ourselves.

### Multi-class classification: Softmax Regression

So far we have been doing binary classifications: "cat" or "no cat". Softmax regression is a generalization of logistic regression that allows us to recognize multiple classes: "other", "cat", "dog" or "chicken". Here we would have four classes or C = 4.

In softmax regression, the output layer is a *softmax layer* that tells us the probability of each of the classes. P(other), P(cat), ... These sum to one.

The softmax activation basically exponentiates $z$ and "normalizes" with the sum of all the exponentiated values ($e^z$). This yields the probabilities.
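In NumPy the softmax activation could look like this. Subtracting the column maximum before exponentiating is an extra numerical-stability trick added here; it does not change the result. The example values of $z$ are arbitrary.

```python
import numpy as np


def softmax(Z):
    """Column-wise softmax. Z has shape (C, m): one column per example."""
    # Subtracting the column max leaves the result unchanged but avoids overflow in exp.
    shifted = Z - Z.max(axis=0, keepdims=True)
    exp_Z = np.exp(shifted)
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)


Z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # C = 4 classes, one example
probs = softmax(Z)
print(probs.ravel(), probs.sum())
```

The outputs are roughly 0.842, 0.042, 0.002 and 0.114, which sum to one; the largest $z$ gets the largest probability.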

<img src="notes_images/smax.png" width="700">

In the simplest possible case (no hidden layers), softmax just creates linear decision boundaries between the classes.

### Multi-class classification: Training a softmax classifier

In multi-class classification problems, the label vector is one-hot: it consists of 0s with a single 1 at the index of the true class. During training, we basically try to maximize the predicted probability of the true class (the one that has a 1 in the label vector).
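A sketch of the loss, assuming the standard cross-entropy form $\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$ averaged over the examples for the cost. The function names and example numbers are made up, and the sketch assumes no predicted probability is exactly zero.

```python
import numpy as np


def cross_entropy_loss(Y_hat, Y):
    """Per-example loss -sum_j y_j * log(y_hat_j); columns are examples."""
    return -np.sum(Y * np.log(Y_hat), axis=0)


def cost(Y_hat, Y):
    """Average of the per-example losses."""
    return float(np.mean(cross_entropy_loss(Y_hat, Y)))


Y = np.array([[0.0], [1.0], [0.0], [0.0]])       # true class is index 1
Y_poor = np.array([[0.3], [0.2], [0.1], [0.4]])  # only 0.2 on the true class
Y_good = np.array([[0.05], [0.9], [0.02], [0.03]])

print(cost(Y_poor, Y), cost(Y_good, Y))
```

With a one-hot label only the term of the true class survives, so the loss is $-\log$ of the probability assigned to it: pushing that probability up pushes the loss down.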

<img src="notes_images/smax_t.png" width="700">

At this level, you usually just focus on forward prop and rely on some DL framework to implement backprop.

### Deep learning frameworks

So far, we implemented everything manually to understand what's actually going on behind the scenes. But realistically, this is not always the way to go, especially with bigger and more complex neural nets. Instead, it is usually much more practical to build the work on existing deep learning frameworks.

<img src="notes_images/framew.png" width="700">

### TensorFlow

Get coding.

## Week 3 - Programming exercises

**What you should remember**:
- TensorFlow is a programming framework used in deep learning
- The two main object classes in TensorFlow are Tensors and Operators. 
- When you code in TensorFlow (the 1.x graph-and-session style used in the exercise) you have to take the following steps:
    - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
    - Create a session
    - Initialize the session
    - Run the session to execute the graph
- You can execute the graph multiple times as you've seen in model()
- Backpropagation and optimization are done automatically when running the session on the "optimizer" object.