# C2_Notes_W1 - Practical Aspects of Deep Learning

This is the second course of the deep learning specialization at [Coursera](https://www.coursera.org/specializations/deep-learning) which is moderated by [DeepLearning.ai](http://deeplearning.ai/). The course is taught by Andrew Ng.

## Table of contents

* [1. Course Summary](#course-summary)
* [2. Train / Dev / Test sets](#train--dev--test-sets)
* [3. Bias / Variance](#bias--variance)
* [4. Basic Recipe for Machine Learning](#basic-recipe-for-machine-learning)
* [5. Regularization](#regularization)
* [6. Why regularization reduces overfitting?](#why-regularization-reduces-overfitting)
* [7. Dropout Regularization](#dropout-regularization)
* [8. Understanding Dropout](#understanding-dropout)
* [9. Other regularization methods](#other-regularization-methods)
* [10. Normalizing inputs](#normalizing-inputs)
* [11. Vanishing / Exploding gradients](#vanishing--exploding-gradients)
* [12. Weight Initialization for Deep Networks](#weight-initialization-for-deep-networks)
* [13. Numerical approximation of gradients](#numerical-approximation-of-gradients)
* [14. Gradient checking implementation notes](#gradient-checking-implementation-notes)
* [15. Initialization summary](#initialization-summary)
* [16. Regularization summary](#regularization-summary)

# 1. Course summary

Here is the course summary as given on the [course page](https://www.coursera.org/learn/deep-neural-network):

> This course will teach you the "magic" of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow. 
>
> After 3 weeks, you will: 
> - Understand industry best-practices for building deep learning applications. 
> - Be able to effectively use the common neural network "tricks", including initialization, L2 and dropout regularization, Batch normalization, gradient checking, 
> - Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence. 
> - Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
> - Be able to implement a neural network in TensorFlow. 
>
> This is the second course of the Deep Learning Specialization.

# 2. Train / Dev / Test sets

- It's impossible to get all your hyperparameters right on the first try for a new application.
- So the idea is to go through the loop: `Idea ==> Code ==> Experiment`.
- You have to go through the loop many times to figure out your hyperparameters.
- Your data will be split into three parts:
  - Training set. (Has to be the largest set.)
  - Hold-out cross-validation set / development or "dev" set.
  - Test set.
- You build a model on the training set, then tune the hyperparameters on the dev set as much as possible. Once the model is ready, you evaluate it on the test set.
- The trend for the split ratios:
  - If the size of the dataset is 100 to 1,000,000 ==> 60/20/20
  - If the size of the dataset is 1,000,000 to INF ==> 98/1/1 or 99.5/0.25/0.25
      - The main idea is that if you have 10 million observations, you may only need about 0.1% (10 thousand) for evaluation (dev or test set).
- The trend now is to give the training set the largest share of the data.
- Make sure the dev and test sets come from the same distribution.
  - For example, if the cat training pictures come from the web and the dev/test pictures come from users' cell phones, the distributions will not match. At the very least, the dev and test sets should come from the same distribution as each other.
- The role of the dev set is to compare the candidate models you've created and choose among them.
- It's OK to only have a dev set without a test set. A lot of people in this case call the dev set the "test set", but it is better terminology to call it a dev set, since it's used during development.
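
The split described above can be sketched in NumPy. This is a minimal illustration, not code from the course: the function name `split_train_dev_test`, its parameters, and the rows-as-examples layout are assumptions made for this example.

```python
import numpy as np

def split_train_dev_test(X, y, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the examples (rows of X) and split them into train/dev/test."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)                 # shuffle so all sets share one distribution
    n_test = int(round(m * test_frac))
    n_dev = int(round(m * dev_frac))
    test_idx = idx[:n_test]
    dev_idx = idx[n_test:n_test + n_dev]
    train_idx = idx[n_test + n_dev:]         # everything else goes to training
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])

# Hypothetical toy data: 1000 examples with 2 features each
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)
train, dev, test = split_train_dev_test(X, y)
print(len(train[1]), len(dev[1]), len(test[1]))  # 600 200 200
```

For a very large dataset you would pass much smaller fractions, for example `dev_frac=0.01, test_frac=0.01` for a 98/1/1 split.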

# 3. Bias / Variance

- Bias / variance techniques are easy to learn, but difficult to master.
- Here is the explanation of bias / variance:
  - If your model is underfitting (e.g. logistic regression on non-linear data), it has "high bias".
  - If your model is overfitting, it has "high variance".
  - Your model will be alright if you balance bias and variance.
  - For more:
    - ![](Images/01-_Bias_-_Variance.png)
- Another way to diagnose bias / variance when you can't plot the decision boundary in 2D is to compare the training and dev errors:
  - High variance (overfitting), for example:
    - Training error: 1%
    - Dev error: 11%
  - High bias (underfitting), for example:
    - Training error: 15%
    - Dev error: 14%
  - High bias (underfitting) && high variance (overfitting), for example:
    - Training error: 15%
    - Dev error: 30%
  - Best (low bias and low variance):
    - Training error: 0.5%
    - Dev error: 1%
  - These conclusions assume that human-level error is about 0%. If that isn't true for your problem, use human-level error as the baseline.
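
The error-based diagnosis above can be written as a small helper. This is only a sketch: the function `diagnose`, the `baseline_err` parameter (standing in for human-level error), and the 2% tolerance are assumptions chosen for illustration, not values from the course.

```python
def diagnose(train_err, dev_err, baseline_err=0.0, tol=0.02):
    """Label a model from its training and dev error rates (fractions in [0, 1])."""
    high_bias = (train_err - baseline_err) > tol    # gap between baseline and training error
    high_variance = (dev_err - train_err) > tol     # gap between training and dev error
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

# The four examples from the list above
for train_err, dev_err in [(0.01, 0.11), (0.15, 0.14), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, "->", diagnose(train_err, dev_err))
```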

# 4. Basic Recipe for Machine Learning

#### Does your model have high bias (underfitting)?
  - Try to make your NN bigger (size of hidden units, number of layers)
  - Try a different model that is suitable for your data.
  - Try to run it longer.
  - Different (advanced) optimization algorithms.
  - Generally you try these until you can fit at least the training data pretty well... "until you reduce bias to an acceptable level"

#### Does your model have high variance (overfitting or poor generalization)?
  - More data.
  - Try regularization.
  - Try a different model that is suitable for your data.
  - **Keep iterating on the two checks above (bias first, then variance) until you have low bias and low variance.**
  
In the older days before deep learning, there was a "bias/variance tradeoff". But now you have more options/tools for solving the bias and variance problems, which is one reason deep learning is so helpful: deep learning (particularly using bigger networks and adding more data) can allow us to reduce bias and variance at the same time.
- NOTE: Training a bigger neural network almost never hurts, as long as it is regularized appropriately. The main cost is just longer training time.

# 5. Regularization

- Adding regularization to NN will help it reduce variance (overfitting)
- L1 matrix norm:
  - `||W|| = Sum(|w[i,j]|)  # sum of absolute values of all w`
- The L2 matrix norm is, for arcane technical math reasons, called the Frobenius norm:
  - `||W||^2 = Sum(|w[i,j]|^2)  # sum of all w squared`
  - For a vector `w` this can also be calculated as `||w||^2 = w.T * w`; for a matrix it equals `trace(W.T * W)`.
- Regularization for logistic regression:
  - The normal cost function that we want to minimize is: `J(w,b) = (1/m) * Sum(L(y(i),y'(i)))`
  - The L2 regularization version: `J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(|w[i]|^2)`
  - The L1 regularization version: `J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(|w[i]|)`
  - The L1 regularization version makes a lot of w values become zeros, which makes the model size smaller.
  - L2 regularization is being used much more often.
  - `lambda` here is the regularization parameter (hyperparameter)
- Regularization for NN:
  - The normal cost function that we want to minimize is:   
    `J(W1,b1...,WL,bL) = (1/m) * Sum(L(y(i),y'(i)))`

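  - The L2 regularization version:   
    `J(W1,b1...,WL,bL) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(||W[l]||^2)`

  - To compute `||W[l]||^2` we can stack the matrix into one vector of shape `(n[l]*n[l-1], 1)` and sum the squares `w1^2 + w2^2 + ...` (the Frobenius norm itself is the square root of this sum).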

  - To do back propagation (old way):   
    `dw[l] = (from back propagation)`

  - The new way:   
    `dw[l] = (from back propagation) + lambda/m * w[l]`

  - So plugging it in weight update step:

    - ```
      w[l] = w[l] - learning_rate * dw[l]
           = w[l] - learning_rate * ((from back propagation) + lambda/m * w[l])
           = w[l] - (learning_rate*lambda/m) * w[l] - learning_rate * (from back propagation) 
           = (1 - (learning_rate*lambda)/m) * w[l] - learning_rate * (from back propagation)
      ```

  - **In practice this penalizes large weights and effectively limits the freedom in your model.**

  - The new term `(1 - (learning_rate*lambda)/m) * w[l]`  causes the **weight to decay** in proportion to its size.
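
The derivation above can be checked numerically. The sketch below is illustrative only: `l2_penalty` is a hypothetical helper name, and `dW_backprop` is a random stand-in for the gradient that backpropagation would produce. It shows that adding `lambda/m * W` to the gradient gives exactly the same update as shrinking the weights first ("weight decay").

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))   # stand-in for the unregularized gradient
lambd, m, learning_rate = 0.7, 50, 0.1

# Form 1: add the regularization gradient, then take a gradient step
dW = dW_backprop + (lambd / m) * W
W_step = W - learning_rate * dW

# Form 2: shrink the weights first ("weight decay"), then apply the plain gradient
W_decay = (1 - learning_rate * lambd / m) * W - learning_rate * dW_backprop

print(np.allclose(W_step, W_decay))  # True
```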




```python
import numpy as np

# The Frobenius norm is the square root of the sum of the squared entries of the matrix.
# It is basically the L2 norm, but for matrices.
# For 2-D arrays, NumPy's default norm (ord=None) is the Frobenius norm.
W = np.random.rand(5, 5)
print('Norm using numpy:', np.linalg.norm(W, 'fro'))

squared = W.reshape(1, -1) * W.reshape(1, -1)
print('Norm calcd manually:', np.sqrt(squared.sum()))
```

Sample output (the exact value varies with the random `W`):

```
Norm using numpy: 2.86153762764
Norm calcd manually: 2.86153762764
```


# 6. Why regularization reduces overfitting?

So why is it that if we reduce the L2 norm (or L1 norm) of the weight vectors, we reduce overfitting? The basic intuition is that if we restrict the weight sizes, we inherently reduce the freedom of the model. We control how much we want to regularize with the `lambda` hyperparameter:
  - Intuition 1:
     - If `lambda` is too large, a lot of w's will be close to zero, which makes the NN simpler (you can think of it as behaving closer to logistic regression).
     - You don't actually get weights that are literally zeroed out, but many become so close to zero that they effectively have no impact on the model, resulting in a much simpler model.
     - If `lambda` is well chosen, it will just reduce the weights that make the neural network overfit.
  - Intuition 2 (with _tanh_ activation function):
     - If `lambda` is too large, the w's will be small (close to zero), so `z` stays in the linear part of the _tanh_ activation function. We go from a non-linear activation to a _roughly_ linear one, which makes the NN a _roughly_ linear classifier, because if every layer is linear then you have a linear model!
     - If `lambda` is well chosen, it will make just SOME of the _tanh_ activations _roughly_ linear, which will prevent overfitting.
     
![](images/c2w1_regs.png)
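
Intuition 2 above can be checked numerically: for small inputs, _tanh_ is almost indistinguishable from the straight line `z`. The input ranges below are arbitrary values chosen for illustration.

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)   # what z looks like when the weights are small
z_large = np.linspace(-3.0, 3.0, 5)   # what z looks like when the weights are large

# Largest absolute gap between tanh(z) and the straight line z
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))
print(err_small)   # well below 1e-3: tanh is roughly linear here
print(err_large)   # larger than 1: tanh is clearly non-linear here
```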
     
_**Implementation tip**_: if you implement gradient descent, one of the steps to debug it is to plot the cost function J as a function of the number of iterations and check that J decreases **monotonically** after every iteration. With regularization, make sure you plot the new definition of J (including the regularization term). If you plot the old definition of J (no regularization), you might not see it decrease monotonically.

# 7. Dropout Regularization

- Andrew Ng says that in most cases he uses L2 regularization.
- Dropout regularization eliminates some neurons (units) on each iteration, based on a probability.
- The most common technique to implement dropout is called "inverted dropout".
- Code for Inverted dropout:

  ```python
  import numpy as np

  keep_prob = 0.8   # probability of keeping a unit, 0 < keep_prob <= 1
  l = 3             # this code is only for layer 3
  a3 = np.random.rand(4, 5)   # placeholder activations of layer 3 (illustrative only)

  # Entries of the uniform random matrix that are less than keep_prob are KEPT:
  # about 80% of the units stay, about 20% are dropped.
  d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

  a3 = np.multiply(a3, d3)   # element-wise multiplication that zeroes out the dropped units

  # Scale a3 back up so that the EXPECTED VALUE of the output is unchanged.
  # This solves the scaling problem and is what makes the technique "inverted" dropout.
  a3 = a3 / keep_prob
  ```
- The mask `d[l]` is used for both forward and back propagation and is the same for the two within one iteration, but a new random mask is drawn on each iteration of gradient descent.
- At test time we DON'T USE DROPOUT. If you apply dropout at test time, it just adds noise to the predictions and makes your model worse.
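
The points above (same mask in forward and backward passes, no dropout at test time, unchanged expected value) can be sketched as follows. The function names `dropout_forward` and `dropout_backward` and the `training` flag are assumptions made for this example, not the course's API.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout on activations `a`; returns (output, mask)."""
    if not training or keep_prob >= 1.0:
        return a, np.ones_like(a, dtype=bool)    # test time: no dropout, no scaling
    mask = rng.random(a.shape) < keep_prob        # True = keep the unit
    return a * mask / keep_prob, mask

def dropout_backward(da, mask, keep_prob):
    """Reuse the forward mask so dropped units receive zero gradient."""
    return da * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
print(mask.mean())   # close to 0.8
print(out.mean())    # close to 1.0, the mean of `a`

out_test, _ = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
print(np.array_equal(out_test, a))  # True
```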

# 8. Understanding Dropout

- In the previous video, the intuition was that dropout randomly knocks out units in your network. So it's as if on every iteration you're working with a smaller NN, and so using a smaller NN seems like it should have a regularizing effect.
- Another intuition: **individual units cannot rely on any one feature, so have to spread out weights.**
    - Because the weights are spread out, this has the effect of shrinking the **squared norm** of the weights. This is similar to the effect of L2 regularization!
    - It is possible to show that dropout has a similar effect to L2 regularization.
- It is possible to have a different `keep_prob` value for each layer of your network.
    - So for larger layers (with more neurons and a higher probability of overfitting) you can have a lower keep_prob (eg. 0.5), and for layers that you are not worried about overfitting you can set keep_prob very close to 1.
- The input layer dropout has to be near 1 (or 1 - no dropout) because you don't want to eliminate a lot of features.
- If you're more worried about some layers overfitting than others, you can set a lower `keep_prob` for some layers than others. The downside is, this gives you even more hyperparameters to search for using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't apply dropout and then just have one hyperparameter, which is a `keep_prob` for the layers for which you do apply dropouts.
- A lot of researchers are using dropout with Computer Vision (CV) because they have a very big input size and almost never have enough data, so overfitting is the usual problem. And dropout is a regularization technique to prevent overfitting.
- A downside of dropout is that the cost function J is not well defined and it will be hard to debug (plot J by iteration).
  - To solve that you'll need to turn off dropout, set all the `keep_prob`s to 1, and then run the code and check that it monotonically decreases J and then turn on the dropouts again.
  
**KEY: Dropout is a regularization technique. So if your model is not really overfitting, then you probably don't need to use dropout!**

# 9. Other regularization methods

- **Data augmentation**:
  - For example in a computer vision data:
    - You can flip all your pictures horizontally; this will give you m more data instances.
    - You could also apply a random position and rotation to an image to get more data.
  - For example in OCR, you can impose random rotations and distortions to digits/letters.
  - New data obtained using this technique isn't as good as the real independent data, but still can be used as a regularization technique.
- **Early stopping**:
  - In this technique we plot the training set and the dev set cost together for each iteration. At some iteration the dev set cost will stop decreasing and will start increasing.
  - We pick the iteration at which the dev set cost is lowest (just before it starts increasing while the training cost keeps decreasing).
  - We take the parameters from that iteration as the best parameters.
      - When we start training (iteration 1), the W params will be very close to zero. As we run more iterations W will grow, and at a certain point W will grow too large and start to overfit the training data (this is why regularization is important, as it limits the size of the weights).
      - Early stopping essentially means we are picking a W with a smaller squared Frobenius norm `||W||_F^2`.

![](Images/02-_Early_stopping.png)
- Andrew prefers L2 regularization over early stopping, because early stopping simultaneously tries to minimize the cost function and to avoid overfitting, which contradicts the orthogonalization approach (discussed later in the course).
  - But the advantage of early stopping is that you don't need to search over a hyperparameter as in other regularization approaches (like `lambda` in L2 regularization).
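
The early stopping rule above can be sketched as a small function over a recorded list of dev costs. The name `early_stopping_index`, the `patience` parameter, and the cost values are assumptions made for this example; in real training you would also save the parameters at the best iteration.

```python
def early_stopping_index(dev_costs, patience=3):
    """Return the index of the best dev cost, stopping after `patience`
    consecutive evaluations without improvement."""
    best_idx, best_cost, waited = 0, float("inf"), 0
    for i, cost in enumerate(dev_costs):
        if cost < best_cost:
            best_idx, best_cost, waited = i, cost, 0
        else:
            waited += 1
            if waited >= patience:
                break                     # dev cost has stopped improving
    return best_idx

# Hypothetical dev costs: improving, then rising as the model starts to overfit
dev_costs = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(early_stopping_index(dev_costs))  # 3
```

Note that the final value `0.40` is never reached: training stops after three evaluations without improvement, which is the price paid for stopping early.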

- **Model Ensembles**:
  - Algorithm:
    - Train multiple independent models.
    - At test time average their results.
  - It can get you extra 2% performance.
  - It reduces the generalization error.
  - You can also ensemble several snapshots of your NN taken during training and average their results.
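
A minimal sketch of the averaging step for a binary classifier is shown below. The prediction values are made up for illustration; in practice each row would come from a separately trained model (or training snapshot).

```python
import numpy as np

# Hypothetical predicted probabilities from three independently trained models
# for the same four test examples (binary classification).
preds = np.array([
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.7, 0.3, 0.7],
])

ensemble_prob = preds.mean(axis=0)                   # average over models
ensemble_label = (ensemble_prob > 0.5).astype(int)   # threshold the averaged probability
print(ensemble_prob)
print(ensemble_label)
```

On the second and fourth examples the individual models disagree, and the averaged probability settles the vote.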


# 10. Normalizing inputs

#### If you normalize your inputs this will speed up the training process a lot.
- Normalization has two main steps:
  1. **Zero out the mean**:
      - Get the mean of the training set: `mean = (1/m) * sum(x(i))`
      - Subtract the mean from each input: `X = X - mean`
      - This makes your inputs centered around 0.
  2. **Normalize the variance:**
      - Get the variance of the (mean-subtracted) training set, element-wise per feature: `variance = (1/m) * sum(x(i)^2)`
      - Divide by the standard deviation: `X /= sqrt(variance)`
      - This makes the variances equal. Notice in the graph below that at step 2, x2 has a much larger variance, and in step 3 the variances appear roughly equal!

![](images/c2w1_normalize.png)
- These steps should be applied to training, dev, and testing sets (**but always use the mean and variance of the TRAIN SET ONLY).**
    - **REPEAT: USE THE SAME VARIANCE AND MEAN WHEN SCALING (NORMALIZING) THE DEV AND TEST SETS!!!**
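
The two steps, together with the rule about reusing the training statistics, can be sketched as follows. The data is synthetic and the variable names are assumptions; rows are features and columns are examples.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales (rows are features, columns are examples).
scale = np.array([[10.0], [0.5]])
shift = np.array([[5.0], [-2.0]])
X_train = rng.standard_normal((2, 1000)) * scale + shift
X_dev = rng.standard_normal((2, 200)) * scale + shift

# Statistics are computed on the TRAINING set only
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True)

X_train_norm = (X_train - mu) / sigma
X_dev_norm = (X_dev - mu) / sigma      # reuse the train mean and standard deviation

print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))  # about [0, 0] and [1, 1]
```

The normalized dev set will have a mean close to, but not exactly, zero. That is expected, since it is scaled with the training statistics.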

#### Why normalize?
  - If we don't normalize the inputs (and our features are on very different scales), the cost function will have an elongated, inconsistent shape, and optimizing it will take a long time.
  - If we normalize, the opposite occurs. The shape of the cost function will be consistent (more symmetric, like a circle in the 2D example) and **we can use a larger learning rate alpha, so the optimization will be faster.**
  - This is particularly important when features come from wildly different scales. However, it pretty much never hurts to normalize, so it makes sense to just do it either way.
![](images/c2w1_normalize2.png)


# 11. Vanishing / Exploding gradients

- Vanishing / exploding gradients occur when your derivatives become very small or very big.
- To understand the problem, suppose that we have a deep neural network with L layers, all the activation functions are **linear**, and each `b = 0`
  - Then:   
    ```
    Y' = W[L]W[L-1].....W[2]W[1]X
    ```
  - Then, if we have 2 neurons per layer and x1 = x2 = 1, we get the following:

    ```
    if W[l] = [1.5   0] 
              [0   1.5] (l != L because of different dimensions in the output layer)
    Y' = W[L] [1.5  0]^(L-1) X = 1.5^L 	# which will be very large
              [0  1.5]
    ```
    ```
    if W[l] = [0.5  0]
              [0  0.5]
    Y' = W[L] [0.5  0]^(L-1) X = 0.5^L 	# which will be very small
              [0  0.5]
    ```

#### The last example shows that the activation outputs at each layer (and similarly the derivatives) will progressively decrease/increase exponentially AS A FUNCTION OF THE NUMBER OF LAYERS (L).

- So if **W[l] is a bit GREATER than I (identity matrix)**, the activations and gradients will explode.
- And if **W[l] is a bit LESS than I (identity matrix)**, the activations and gradients will vanish.

#### The following example is replicated (somewhat) in the code below!
![](images/c2w1_explodgrads.png)

```python
import numpy as np
identity = np.array(([1,0,0],[0,1,0], [0,0,1]))
X = np.array(([1],[1],[1]))
Wl_explode = np.array(([1.5,0,0],[0,1.5,0],[0,0,1.5]))
Wl_vanish = np.array(([0.5,0,0],[0,0.5,0],[0,0,0.5]))

print("X shape: ",X.shape)
print("Wl shape: ",Wl_explode.shape)
```

```
X shape:  (3, 1)
Wl shape:  (3, 3)
```


```python
# EXPLODING ACTIVATIONS (the gradients behave the same way)
layer = np.dot(Wl_explode,X)
for l in range(1,10):
    layer = np.dot(Wl_explode,layer)
    print(layer)
```

```
[[ 2.25]
 [ 2.25]
 [ 2.25]]
[[ 3.375]
 [ 3.375]
 [ 3.375]]
[[ 5.0625]
 [ 5.0625]
 [ 5.0625]]
[[ 7.59375]
 [ 7.59375]
 [ 7.59375]]
[[ 11.390625]
 [ 11.390625]
 [ 11.390625]]
[[ 17.0859375]
 [ 17.0859375]
 [ 17.0859375]]
[[ 25.62890625]
 [ 25.62890625]
 [ 25.62890625]]
[[ 38.44335938]
 [ 38.44335938]
 [ 38.44335938]]
[[ 57.66503906]
 [ 57.66503906]
 [ 57.66503906]]
```


```python
# VANISHING ACTIVATIONS (the gradients behave the same way)
layer = np.dot(Wl_vanish,X)
for l in range(1,10):
    layer = np.dot(Wl_vanish,layer)
    print(layer)
```

```
[[ 0.25]
 [ 0.25]
 [ 0.25]]
[[ 0.125]
 [ 0.125]
 [ 0.125]]
[[ 0.0625]
 [ 0.0625]
 [ 0.0625]]
[[ 0.03125]
 [ 0.03125]
 [ 0.03125]]
[[ 0.015625]
 [ 0.015625]
 [ 0.015625]]
[[ 0.0078125]
 [ 0.0078125]
 [ 0.0078125]]
[[ 0.00390625]
 [ 0.00390625]
 [ 0.00390625]]
[[ 0.00195312]
 [ 0.00195312]
 [ 0.00195312]]
[[ 0.00097656]
 [ 0.00097656]
 [ 0.00097656]]
```



- Recently Microsoft trained a 152-layer network (ResNet), which is a really big number of layers. With such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values could get really big or really small. This makes training difficult, especially if your gradients are exponentially small in L: **gradient descent will then take tiny little steps, and it will take a long time for it to learn anything.**
- There is a partial solution that doesn't completely solve this problem but it helps a lot - careful choice of how you initialize the weights (next video).

# 12. Weight Initialization for Deep Networks

- A partial solution to the Vanishing / Exploding gradients in NN is better or more careful choice of the random initialization of weights
- In a single neuron (Perceptron model): `Z = w1x1 + w2x2 + ... + wnxn`
  - So if `n_x` is large, we want the `W`'s to be smaller so that `Z` does not blow up.

#### So it turns out that we need the variance of `W` to be `1/n_x` 
- A few notes: 
    - Multiplying a zero-mean, unit-variance random variable 'x' by the square root of 'v' gives it variance v. This is what we are doing below. 
    - Also, we use n[l-1] for the inputs, because the number of inputs (columns of the weight matrix) is equal to the number of neurons in the previous layer: the shape of each weight matrix is (n[l], n[l-1]).


- So let's say we initialize the `W`'s like this (better to use with the `tanh` activation):   
  ```
  np.random.randn(shape) * np.sqrt(1/n[l-1])
  ```
  or a variation of this (Bengio et al.):   
  ```
  np.random.randn(shape) * np.sqrt(2/(n[l-1] + n[l]))
  ```
- For `ReLU`, setting the term inside the sqrt to `2/n[l-1]` works better:   
  ```
  np.random.randn(shape) * np.sqrt(2/n[l-1])
  ```


- The number 1 or 2 in the numerator can also be a hyperparameter to tune (but not the first one to start with).
- This is one of the best partial solutions to vanishing / exploding gradients (ReLU + weight initialization with a scaled variance); it helps the gradients not to vanish/explode too quickly.
- The `1/n[l-1]` version is called "Xavier initialization"; the `2/n[l-1]` version for ReLU is called "He initialization" and was published in a 2015 paper (He et al.).

#### So in summation, weight initialization helps prevent your weights from exploding or from vanishing too quickly so you can train a reasonably deep network. 

```python
import numpy as np

nl_prev = 5
W1_neurons = 5
DESIRED_VARIANCE = 2/nl_prev

# SAMPLE A GAUSSIAN RANDOM VARIABLE (zero mean, unit variance)
W_pre = np.random.randn(W1_neurons, nl_prev)

# SET VARIANCE TO DESIRED VARIANCE BY USING SQUARE ROOT RULE
W_post = W_pre * np.sqrt(DESIRED_VARIANCE)

print(DESIRED_VARIANCE)
print(np.var(W_post))
```

Sample output (the second value varies from run to run):

```
0.4
0.410106237724
```
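
The same rule can be applied to every layer of a network. The sketch below is an illustration under stated assumptions: the function name `initialize_parameters`, the `layer_dims` list, and the dictionary keys are hypothetical conventions, and `scale_numerator` plays the role of the 1 or 2 in the numerator.

```python
import numpy as np

def initialize_parameters(layer_dims, scale_numerator=2.0, seed=0):
    """layer_dims = [n_x, n_1, ..., n_L]; returns a dict with W1, b1, ..., WL, bL."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Gaussian weights scaled so that Var(W) = scale_numerator / n_prev
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(scale_numerator / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))   # biases can start at zero
    return params

params = initialize_parameters([500, 400, 1])
print(params["W1"].shape, np.var(params["W1"]))  # variance close to 2/500 = 0.004
```

Passing `scale_numerator=1.0` gives the `1/n[l-1]` variant suggested above for `tanh`.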


# 13. Numerical approximation of gradients

- There is a technique called gradient checking which tells you if your implementation of backpropagation is correct.
- There's a numerical way to calculate the derivative:   
  ![](Images/03-_Numerical_approximation_of_gradients.png)
- Gradient checking approximates the gradients numerically and is very helpful for finding errors in your backpropagation implementation, but it's much slower than backpropagation (so use it only for debugging).
- Implementation of this is very simple.
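
As a small worked example, take `f(theta) = theta^3` at `theta = 1` with `eps = 0.01`, where the exact derivative is 3. The one-sided difference is included here only for comparison; it is an addition for illustration.

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
one_sided = (f(theta + eps) - f(theta)) / eps
exact = 3 * theta ** 2

print(two_sided - exact)  # about 1e-4, error of order eps**2
print(one_sided - exact)  # about 3e-2, error of order eps
```

The two-sided difference costs two evaluations of `f` but is far more accurate, which is why it is the one used for gradient checking.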

# 14. Gradient checking implementation notes

#### Gradient checking Implementation
- Our goal is simply to figure out if `d_theta` is indeed the gradient of `J(theta)`
- NOTE: W1 and dW1 should have same dimensions. So theta and d_theta will have same dimensions.

  - First take `W[1],b[1],...,W[L],b[L]` and reshape them into one big vector (`theta`)
  - The cost function will be `J(theta)` instead of `J(W1,b1,...,WL,bL)`
  - Then take `dW[1],db[1],...,dW[L],db[L]` and reshape them into one big vector (`d_theta`)
      
 
- **Algorithm**:   
```
eps = 1e-7   # small number
for i in range(len(theta)):
      # perturb only theta[i]; all other components stay fixed
      d_theta_approx[i] = (J(theta[1],...,theta[i] + eps,...) - J(theta[1],...,theta[i] - eps,...)) / (2*eps)
```
- Finally we evaluate this formula `(||d_theta_approx - d_theta||) / (||d_theta_approx||+||d_theta||)` (`||` - Euclidean vector norm) and check (with eps = 10^-7):


- if it is < 10^-7  - great, very likely the backpropagation implementation is correct
- if around 10^-5   - can be OK, but need to inspect if there are no particularly big values in `d_theta_approx - d_theta` vector
- if it is >= 10^-3 - bad, probably there is a bug in backpropagation implementation

- Don't use the gradient checking algorithm at training time because it's very slow.
- Use gradient checking only for debugging.
- If algorithm fails grad check, look at components (a specific W or b value) to try to identify the bug.
- Don't forget to include the regularization term in `J` (and the matching term in the gradients) if you are using L1 or L2 regularization, e.g. `lambda/(2m) * Sum(||W[l]||^2)` for L2.
- Gradient checking doesn't work with dropout because J is not consistent. 
  - You can first turn off dropout (set `keep_prob = 1.0`), run gradient checking and then turn on dropout again.
- Run gradient checking at random initialization, then train the network for a while and run it again: there may be a bug that only shows up when the w's and b's become larger (further from 0) and that can't be seen on the first iteration (when the w's and b's are very small).
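
The whole procedure can be sketched on a toy cost function whose gradient is known in closed form. Everything here is illustrative: `gradient_check`, the toy `J`, and the deliberately wrong `buggy_grad` are assumptions made for this example rather than course code.

```python
import numpy as np

def gradient_check(J, grad_J, theta, eps=1e-7):
    """Compare an analytic gradient with a two-sided numerical estimate."""
    d_theta = grad_J(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps             # perturb only component i
        theta_minus[i] -= eps
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    num = np.linalg.norm(d_theta_approx - d_theta)
    den = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / den

# Toy cost: J(theta) = sum(theta**2) + sum(sin(theta))
J = lambda t: np.sum(t ** 2) + np.sum(np.sin(t))
good_grad = lambda t: 2 * t + np.cos(t)
buggy_grad = lambda t: 2 * t + np.sin(t)      # deliberate bug: sin instead of cos

theta = np.array([0.3, -1.2, 2.0, 0.7])
print(gradient_check(J, good_grad, theta))    # tiny, well below 1e-5
print(gradient_check(J, buggy_grad, theta))   # large, above 1e-3
```

For a real network, `theta` would be the flattened vector of all `W[l]` and `b[l]`, and `grad_J` would wrap the backpropagation code being tested.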

# 15. Initialization summary

- The weights $W^{[l]}$ should be initialized randomly to break symmetry

- It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly

- Different initializations lead to different results

- Random initialization is used to break symmetry and make sure different hidden units can learn different things

- Don't initialize to values that are too large

- He initialization works well for networks with ReLU activations. 
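
Why symmetry has to be broken can be seen in a tiny experiment. The sketch below assumes a one-hidden-layer `tanh` network without biases and a squared-error loss; the helper `hidden_gradients` and all sizes are hypothetical. With constant initialization every hidden unit receives exactly the same gradient, so the units can never become different from each other.

```python
import numpy as np

def hidden_gradients(W1, W2, X, Y):
    """Gradient of the squared error w.r.t. W1 for a 1-hidden-layer tanh network (no biases)."""
    A1 = np.tanh(W1 @ X)
    Y_hat = W2 @ A1
    dZ2 = (Y_hat - Y) / X.shape[1]
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (1 - A1 ** 2)        # derivative of tanh
    return dZ1 @ X.T

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 20))     # 3 features, 20 examples
Y = rng.standard_normal((1, 20))

# Constant initialization: every hidden unit starts out identical
dW1_const = hidden_gradients(np.full((4, 3), 0.5), np.full((1, 4), 0.5), X, Y)
# Random initialization
dW1_rand = hidden_gradients(rng.standard_normal((4, 3)) * 0.5,
                            rng.standard_normal((1, 4)) * 0.5, X, Y)

print(np.allclose(dW1_const, dW1_const[0]))  # True: all rows (units) get the same update
print(np.allclose(dW1_rand, dW1_rand[0]))    # False: units receive different updates
```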

# 16. Regularization summary

#### 1. L2 Regularization

**Observations**:

- The value of λ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting in a model with high bias.

**What is L2-regularization actually doing?**:

**L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights.** Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

**What you should remember:** the implications of L2-regularization on:

- The cost computation:
  - A regularization term is added to the cost
- The backpropagation function:
  - There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller ("weight decay"):
  - Weights are pushed to smaller values.

####  2. Dropout

**What you should remember about dropout:**

- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.