# Deep Learning: Batch Normalization


In [2]:
import math
import numpy as np

%matplotlib inline
np.random.seed(1)

import IPython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


## Introduction


Batch normalization is a technique used in deep learning to improve the training of neural networks. The motivation behind batch normalization is to address the problem of **internal covariate shift**, which refers to the phenomenon where the distribution of the inputs to a layer of a neural network changes as the parameters of the previous layers are updated during training. By normalizing the inputs of the layer, batch normalization reduces the amount of internal covariate shift and improves the stability of the network. This can result in faster training, better performance, and more robust models.

Batch normalization works by normalizing the inputs of each layer (the activations coming from the previous layer) to have **zero mean and unit variance** over the mini-batch, and then applying a learnable scale and shift to the normalized values.

Batch normalization has become a standard technique in deep learning and is used in many state-of-the-art models.


Motivations:


* Reduces the amount of internal covariate shift and improves the stability of the network. This leads to better performance, and more robust models.

* Allows the use of larger learning rates, which results in faster training.

<img src="images/bn-v0.png" style="transform: scale(0.85);" alt="bn-comparison">

    * Legend:
        * BN-Baseline: reference learning rate.
        * BN-x5: initial learning rate of 0.0075 (5 times Inception's learning rate).
        * BN-x30: initial learning rate of 0.045 (30 times Inception's learning rate).
        * BN-x5-Sigmoid: uses the sigmoid activation function (non-linearity) instead of ReLU.

    The graph above shows that, with batch normalization, the Inception network can be trained with a much larger learning rate and reaches the same performance in fewer epochs.

* Provides some level of regularization

The forward equations of the batch normalization layer, computed over a mini-batch of $m$ samples $x_1, \dots, x_m$, are:

$$
\begin{aligned}
\mu &= \frac{1}{m} \sum_{i=1}^m x_i \\
\sigma^2 &= \frac{1}{m} \sum_{i=1}^m (x_i - \mu)^2 \\
\hat{z}_i &= \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \\
y_i &= \gamma \hat{z}_i + \beta
\end{aligned}
$$

where $\gamma$ and $\beta$ are the scale and shift parameters learned during training, and $\epsilon$ is a small constant that prevents division by zero.
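
As a small worked example of the forward equations, take a mini-batch of three scalar values $x = (1, 2, 3)$ with $\gamma = 2$ and $\beta = 1$. These numbers are illustrative assumptions, and $\epsilon$ is neglected because it is tiny compared with $\sigma^2$:

$$
\begin{aligned}
\mu &= \frac{1 + 2 + 3}{3} = 2 \\
\sigma^2 &= \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{2}{3} \\
\hat{z} &= \frac{(1, 2, 3) - 2}{\sqrt{2/3}} \approx (-1.2247,\ 0,\ 1.2247) \\
y &= 2\hat{z} + 1 \approx (-1.4495,\ 1,\ 3.4495)
\end{aligned}
$$

The normalized values $\hat{z}$ have zero mean and unit variance; the output $y$ has mean $\beta = 1$ and standard deviation $\gamma = 2$.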

Backward equations: 

$$
\begin{aligned}
\frac{\partial L}{\partial \hat{z}_i} &= \frac{\partial L}{\partial y_i} \cdot \gamma \\
\frac{\partial L}{\partial \sigma^2} &= \sum_{i=1}^m \frac{\partial L}{\partial \hat{z}_i} \cdot (x_i - \mu) \cdot \frac{-1}{2} \cdot (\sigma^2 + \epsilon)^{-3/2} \\
\frac{\partial L}{\partial \mu} &= \sum_{i=1}^m \frac{\partial L}{\partial \hat{z}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{-2}{m} \cdot \sum_{i=1}^m (x_i - \mu) \\
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial \hat{z}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2}{m} \cdot (x_i - \mu) + \frac{\partial L}{\partial \mu} \cdot \frac{1}{m} \\
\frac{\partial L}{\partial \gamma} &= \sum_{i=1}^m \frac{\partial L}{\partial y_i} \cdot \hat{z}_i \\
\frac{\partial L}{\partial \beta} &= \sum_{i=1}^m \frac{\partial L}{\partial y_i}
\end{aligned}
$$

where $L = L(\hat{z}_i, \mu, \sigma^2,x_i,\gamma, \beta )$ is the loss function.
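
The backward equations above can be sketched in NumPy and checked against finite differences. This is a minimal sketch, not a reference implementation: the names `batchnorm_forward`, `batchnorm_backward`, `loss` and the weights `w` are hypothetical helpers introduced only for this check, and the toy loss $L = \sum_i w_i y_i$ is an assumption chosen so that $\partial L / \partial y_i = w_i$.

```python
import numpy as np

EPSILON = 1e-8  # small constant for numerical stability (assumed value)


def batchnorm_forward(x, gamma, beta):
    """Forward pass; also returns the values the backward pass needs."""
    mu = np.mean(x, axis=0)
    sigma2 = np.var(x, axis=0)
    z_hat = (x - mu) / np.sqrt(sigma2 + EPSILON)
    y = gamma * z_hat + beta
    return y, (x, mu, sigma2, z_hat, gamma)


def batchnorm_backward(dy, cache):
    """Backward pass, one line per backward equation."""
    x, mu, sigma2, z_hat, gamma = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(sigma2 + EPSILON)

    dz_hat = dy * gamma
    dsigma2 = np.sum(dz_hat * (x - mu), axis=0) * (-0.5) * (sigma2 + EPSILON) ** (-1.5)
    dmu = np.sum(-dz_hat * inv_std, axis=0) + dsigma2 * (-2.0 / m) * np.sum(x - mu, axis=0)
    dx = dz_hat * inv_std + dsigma2 * (2.0 / m) * (x - mu) + dmu / m
    dgamma = np.sum(dy * z_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta


# Toy data: 4 samples, 3 features (illustrative values only)
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=(4, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
w = rng.normal(size=(4, 3))  # toy loss L = sum(w * y), so dL/dy = w

y, cache = batchnorm_forward(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(w, cache)


def loss(x_in):
    """Toy scalar loss used only for the gradient check."""
    return np.sum(w * batchnorm_forward(x_in, gamma, beta)[0])


# Central finite differences for dL/dx, one entry at a time
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    step = np.zeros_like(x)
    step[idx] = h
    dx_num[idx] = (loss(x + step) - loss(x - step)) / (2 * h)

print("max |analytic - numeric| =", np.max(np.abs(dx - dx_num)))
```

The printed difference should be close to zero if the analytic gradient matches the numerical one. A useful sanity property is that the gradient with respect to $x$ sums to zero over the batch, because adding the same constant to every sample leaves the output unchanged.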


The batch normalization layer usage:

1. Place it between the layer and the layer's activation function.

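```python
from tensorflow.keras.layers import Conv2D, BatchNormalization, LeakyReLU

# `x` is assumed to be the output tensor of the previous layer.

# NOTE: ok (batch normalization before the activation)
x = Conv2D(64, kernel_size=3, strides=1, padding='same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

# NOTE: not recommended (batch normalization after the activation)
x = Conv2D(64, kernel_size=3, strides=1, padding='same')(x)
x = LeakyReLU()(x)
x = BatchNormalization()(x)
```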

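1. Not useful with pure stochastic gradient descent (batch size of 1): the mean of a single sample is the sample itself and the variance is zero, so every normalized value is zero.
1. Should not be combined with dropout inside the same layer block; apply dropout after the block instead.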

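```python
from tensorflow.keras.layers import (
    Conv2D, BatchNormalization, LeakyReLU, Dropout, Flatten)

# `x` is assumed to be the output tensor of the previous layer.

# NOTE: ok (dropout applied after the Conv -> BN -> activation block)
x = Conv2D(64, kernel_size=3, strides=2, padding='same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Dropout(0.30)(x)
x = Flatten()(x)


# NOTE: not recommended (dropout placed before batch normalization)
x = Conv2D(64, kernel_size=3, strides=2, padding='same')(x)
x = Dropout(0.30)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
```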
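
The remark about stochastic gradient descent in the list above can be checked numerically. The sketch below is a minimal illustration, assuming per-feature statistics computed over the batch axis as in the forward equations; the helper name `batchnorm_forward` and the sample values are hypothetical.

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, epsilon=1e-8):
    """Batch normalization over the batch axis (axis 0)."""
    mu = np.mean(x, axis=0)
    sigma2 = np.var(x, axis=0)
    z_hat = (x - mu) / np.sqrt(sigma2 + epsilon)
    return gamma * z_hat + beta


# A "batch" holding a single sample with three features
single = np.array([[3.0, -1.0, 7.5]])
out = batchnorm_forward(single, gamma=2.0, beta=0.5)

# x - mu is exactly zero for every feature, so the output collapses to beta
print(out)  # [[0.5 0.5 0.5]]
```

Whatever the input values are, the output equals $\beta$ for every feature, so the layer passes no information about the sample to the next layer.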


## Implementation of the forward equations

In [7]:
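def batchnormalization(x, gamma, beta, epsilon=1e-8):
    """Batch-normalize x (shape: batch x features) along the batch axis."""
    # Per-feature mean and (biased) variance over the batch
    mu = np.mean(x, axis=0)
    sigma2 = np.var(x, axis=0)

    # Normalize to zero mean and unit variance; epsilon avoids division by zero
    z = (x - mu) / np.sqrt(sigma2 + epsilon)

    # Scale and shift
    bn = gamma * z + beta

    return bn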

In [13]:
x = np.random.normal(5.0,2.0,size=(100,2))

x.shape

gamma = 1.0
beta = 0.00
bn = batchnormalization(x, gamma , beta)

bn.shape

bn.mean(axis=0)
bn.var(axis=0)

(100, 2)

(100, 2)

array([4.00790512e-16, 1.01862963e-15])

array([1., 1.])

## Reference

* https://towardsdatascience.com/understanding-batch-normalization-with-examples-in-numpy-and-tensorflow-with-interactive-code-7f59bb126642 (very good and simple)
* https://kevinzakka.github.io/2016/09/14/batch_normalization/
* Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (paper)
    * Authors: Sergey Ioffe (Google; also the author of PLDA) and Christian Szegedy (Google)
    * https://arxiv.org/pdf/1502.03167.pdf **TODO**: read the paper. It is simple and easy to understand, and it is good practice for reading papers.
* https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization (TensorFlow documentation)