# IMPROVING MACHINE LEARNING MODELS - 3: Reducing Variance in Neural Networks
<hr style="height:5px;border-width:2;color:gray">

In [None]:
import numpy as np
np.random.seed(42)

## 1. Dropout regularization

You have already seen L1 and L2 regularization, which penalise large values in the weight matrix and effectively compress the model.

Dropout takes a completely different approach. As the name suggests, at each training iteration you randomly "drop out", or remove, nodes from each layer, so you are effectively training a much smaller model each time.

<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/regularization-dropout.PNG">

> image from https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

During testing, the entire trained model is used and no nodes are dropped, so every node contributes to the prediction.

We will be using a method called Inverted Dropout to achieve dropout Regularization.

### 1.1 Implementation of Dropout: Inverted Dropout

Consider the $l^{th}$ layer $L^{[l]}$ in the training sequence. The task is to zero out random features, or activations, in this layer. For this, we will use a **Dropout** vector, denoted by $d^{[l]}$.

The vector $d^{[l]}$ has the same shape as $a^{[l]}$ and consists of 0s and 1s; each element is 1 with probability $p_{keep}^{[l]}$.

> Exercise: Create the required dropout vector, with the same shape as $a^{[l]}$, where each element is 1 with probability 0.7.
Hint: Use np.random.rand and the astype('float') method

In [None]:
a_l = np.random.randn(7, 6) * 0.01
keep_prob = 0.7

# Each entry is 1 with probability keep_prob, otherwise 0
d_l = (np.random.rand(*a_l.shape) < keep_prob).astype('float')

The vector $d^{[l]}$ is multiplied with $a^{[l]}$ elementwise. Because roughly a fraction $1 - p_{keep}^{[l]}$ of the entries are now zero, we scale $a^{[l]}$ up by dividing by $p_{keep}^{[l]}$, which keeps its expected value unchanged.

> Exercise: Perform the dropout on the feature vector a_l, and scale it up.

In [None]:
a_l *= d_l         # zero out the dropped activations
a_l /= keep_prob   # scale up so the expected value is unchanged

**Note: Only use dropout when training your model. Using dropout during testing/validation/production phase adds noise into the Results.**
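
The two steps above, together with the train/test distinction in the note, can be sketched as a single helper. This is a minimal sketch, not a reference implementation: the function name `inverted_dropout` and its `training` flag are assumptions made for illustration and do not come from the exercise.

```python
import numpy as np

def inverted_dropout(a, keep_prob, training=True):
    """Hypothetical helper: apply inverted dropout to the activations `a`."""
    if not training or keep_prob >= 1.0:
        # Testing/validation/production: keep every node, no scaling.
        return a, np.ones_like(a)
    # Each mask entry is 1 with probability keep_prob, otherwise 0.
    d = (np.random.rand(*a.shape) < keep_prob).astype(float)
    # Zero out the dropped nodes, then divide by keep_prob so that the
    # expected value of the activations stays the same.
    return a * d / keep_prob, d

np.random.seed(0)
a = np.ones((1000, 100))
a_drop, d = inverted_dropout(a, keep_prob=0.7)
print(d.mean())       # fraction of kept nodes, close to keep_prob
print(a_drop.mean())  # close to the original mean of a
a_test, _ = inverted_dropout(a, keep_prob=0.7, training=False)
```

With `training=False` the activations pass through untouched, which is the behaviour you want outside of training.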

### 1.2 Using Dropout to Effectively reduce Overfitting

A rule of thumb is that, to reduce overfitting, layers with more nodes get a lower $p_{keep}^{[l]}$. Consider a model with the architecture (3, 8, 9, 7, 2, 1).
Then you can have:
1. $p_{keep}^{[1]} = 1$
1. $p_{keep}^{[2]} = 0.7$
1. $p_{keep}^{[3]} = 0.5$
1. $p_{keep}^{[4]} = 0.7$
1. $p_{keep}^{[5]} = 1$
1. $p_{keep}^{[6]} = 1$

The intuition is that layers with many nodes have large weight matrices, which are the most prone to overfitting. Lowering the probability of keeping a neuron in those layers regularizes them more strongly.
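
To make the per-layer setting concrete, here is a sketch of a forward pass through the (3, 8, 9, 7, 2, 1) architecture with the keep probabilities listed above. The random weights, the ReLU hidden activations, the sigmoid output, and the name `forward_with_dropout` are all assumptions chosen only for illustration. Biases are omitted for brevity.

```python
import numpy as np

np.random.seed(0)

layer_sizes = [3, 8, 9, 7, 2, 1]              # architecture from the text
keep_probs = [1.0, 0.7, 0.5, 0.7, 1.0, 1.0]   # p_keep for layers 1..6

# Hypothetical small random weights, one matrix per pair of adjacent layers.
weights = [np.random.randn(n_out, n_in) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward_with_dropout(X, weights, keep_probs, training=True):
    """Forward pass with inverted dropout applied layer by layer."""
    a, masks = X, []
    last = len(keep_probs) - 1
    for idx, p in enumerate(keep_probs):
        if idx > 0:
            z = weights[idx - 1] @ a
            # Assumed activations: sigmoid on the output, ReLU elsewhere.
            a = 1.0 / (1.0 + np.exp(-z)) if idx == last else np.maximum(z, 0.0)
        if training and p < 1.0:
            d = (np.random.rand(*a.shape) < p).astype(float)
            a = a * d / p
        else:
            d = np.ones_like(a)   # nothing dropped in this layer
        masks.append(d)
    return a, masks

X = np.random.randn(3, 5)   # 5 samples with 3 features each
y_train, masks = forward_with_dropout(X, weights, keep_probs, training=True)
y_test, _ = forward_with_dropout(X, weights, keep_probs, training=False)
print([m.shape for m in masks])
print(y_test.shape)
```

Only the three wide hidden layers receive a mask containing zeros; the input layer, the narrow 2-node layer, and the output layer are left intact.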

### 1.3 Disadvantages
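1. Dropout makes the cost function $J$ noisy, so it no longer decreases monotonically from one iteration to the next. (Why?) To monitor training, calculate the cost with $p_{keep}^{[l]} = 1 \ \forall l \in [1, L]$.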

## 2. Batch Normalization

Batch Normalization normalizes each feature across the batch to have mean 0 and variance 1, and then optionally rescales and shifts it to some other mean and variance.
Consider the layer $l$:

1. Calculating the Mean of the m training samples of each feature:

$$ \mu^{[l]} = \frac{1}{m} \sum_{i = 1}^m Z^{[l](i)} $$

2. Calculating the variance of the m training samples of each feature:

$$ (\sigma^{[l]})^2 = \frac{1}{m} \sum_{i = 1}^m (Z^{[l](i)} - \mu^{[l]})^2 $$

3. Normalize each feature of the training sample, where $\epsilon$ is a small constant that avoids division by zero:

$$ Z^{[l](i)}_{norm} = \frac{Z^{[l](i)} - \mu^{[l]}}{\sqrt{(\sigma^{[l]})^2 + \epsilon}} $$

4. Modify the mean and Variance of each feature (Optional, can sometimes lead to better results)

$$ \tilde Z^{[l](i)} = \gamma^{[l]} * Z^{[l](i)}_{norm} + \beta^{[l]} $$

5. Proceed with the next equations just as you would, but replace $ Z^{[l](i)} $ with $ \tilde Z^{[l](i)} $, e.g.:

$$ A^{[l]} = g^{[l]}(\tilde Z^{[l]}) $$

6. $\gamma^{[l]}$ and $\beta^{[l]} $ are trainable parameters, similar to $W^{[l]}$ and $b^{[l]}$. So optimize them as well. For example if you use Gradient Descent:

$$ \beta ^{[l]} = \beta^{[l]} - \alpha \frac{\partial J}{\partial \beta^{[l]}} $$
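
Steps 1 to 4 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: `Z` holds one column per training sample (shape $(n^{[l]}, m)$), the normalization divides by $\sqrt{\sigma^2 + \epsilon}$, and the function name `batchnorm_forward` and the value of `eps` are chosen here for illustration. The gradient step for $\gamma^{[l]}$ and $\beta^{[l]}$ is not shown.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Hypothetical helper. Z: (n_l, m); gamma, beta: (n_l, 1)."""
    mu = Z.mean(axis=1, keepdims=True)                  # step 1: mean per feature
    var = ((Z - mu) ** 2).mean(axis=1, keepdims=True)   # step 2: variance per feature
    Z_norm = (Z - mu) / np.sqrt(var + eps)              # step 3: normalize
    Z_tilde = gamma * Z_norm + beta                     # step 4: rescale and shift
    return Z_tilde, mu, var

np.random.seed(0)
Z = np.random.randn(4, 32) * 3.0 + 5.0   # 4 features, 32 samples
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))

Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1))   # each feature mean is close to 0
print(Z_tilde.var(axis=1))    # each feature variance is close to 1

# Adding a per-feature constant to Z leaves the result unchanged,
# because the mean subtraction in step 3 removes it again.
b = np.random.randn(4, 1)
Z_tilde_b, _, _ = batchnorm_forward(Z + b, gamma, beta)
print(np.allclose(Z_tilde, Z_tilde_b))
```

With `gamma` set to ones and `beta` to zeros, step 4 is the identity; other values move each feature to standard deviation `gamma` and mean `beta`.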



### Observations about Batch Normalization

1. Consider the primary forward pass equation,

$$ Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]} $$

Subtracting the mean always cancels out the bias vector, since $b^{[l]}$ adds the same constant to every sample of a feature. Its presence is therefore not required and it can be removed; $\beta^{[l]}$ takes over its role.

2. $ Z^{[l](i)} $, $ b^{[l]} $, $ \beta^{[l]} $ and $\gamma^{[l]}$ have the same shape, $(n^{[l]}, 1)$, because the products and additions are done elementwise.

3. Batch Normalization has a slight regularization effect, but it should not be relied on as a regularizer. The source of the regularizing effect is the noise in $\mu$ and $\sigma$, which are estimated from each batch.

4. Batch Normalization makes a model more robust to covariate shift, that is, a change in the distribution of the inputs that a layer receives.
> A quick explanation can be found in Andrew Ng's Improving Deep Learning Models Course on Coursera.

5. During training, keep an exponentially weighted moving average of the $\mu^{[l]}$ and $\sigma^{[l]}$ of the training batches that you encounter, denoted $\mu^{[l]}_{avg}$ and $\sigma^{[l]}_{avg}$. These averages are used in place of the batch statistics during testing/validation/production.

> For Batch Gradient Descent, $\mu^{[l]}_{avg} = \mu^{[l]}$ and $\sigma^{[l]}_{avg} = \sigma^{[l]}$, because the same dataset is passed each epoch.

> The exponentially weighted moving average can be calculated using the following formula:
$$ V_t = \beta V_{t-1} + (1 - \beta)\theta_t $$ where
1. $V_t$ is the moving average at the $ t^{th} $ iteration
2. $V_0 = 0$
3. $ \theta_t $ is the next value to be added
4. $ \beta \in [0, 1) $ controls how much weight the previous average keeps: the larger $\beta$ is, the smaller the effect of the current value $\theta_t$ on the moving average $V_t$. This $\beta$ is unrelated to the Batch Normalization parameter $\beta^{[l]}$.
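
The moving-average bookkeeping can be sketched as below. The mini-batches are synthetic (mean 3, standard deviation 2), and the names `update_running` and `beta_ewma`, the value 0.9, and the $\sqrt{\sigma^2 + \epsilon}$ form of the normalization are assumptions for illustration only. `beta_ewma` is named that way to keep it apart from the Batch Normalization parameter $\beta^{[l]}$.

```python
import numpy as np

def update_running(avg, value, beta_ewma=0.9):
    """One step of V_t = beta * V_{t-1} + (1 - beta) * theta_t."""
    return beta_ewma * avg + (1.0 - beta_ewma) * value

np.random.seed(0)
n_l, batch_size = 4, 64
mu_avg = np.zeros((n_l, 1))      # V_0 = 0
sigma_avg = np.zeros((n_l, 1))   # V_0 = 0

for t in range(200):
    # Hypothetical mini-batch of pre-activations for layer l.
    Z = np.random.randn(n_l, batch_size) * 2.0 + 3.0
    mu = Z.mean(axis=1, keepdims=True)
    sigma = Z.std(axis=1, keepdims=True)
    mu_avg = update_running(mu_avg, mu)
    sigma_avg = update_running(sigma_avg, sigma)

print(mu_avg.ravel())      # settles near the batch mean
print(sigma_avg.ravel())   # settles near the batch standard deviation

# At test time a single sample is normalized with the stored averages,
# since no batch statistics are available.
eps = 1e-8
z_test = np.random.randn(n_l, 1) * 2.0 + 3.0
z_test_norm = (z_test - mu_avg) / np.sqrt(sigma_avg ** 2 + eps)
```

Because $V_0 = 0$, the averages start out too small and need a number of iterations before they reflect the batch statistics.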
