# Batch Norm

Accelerates the convergence of deep neural networks

In [1]:
import torch
import torch.nn as nn

Standardizing the input data helps models converge by bringing the *a priori* scale of the parameters to a similar range  
But what about the latent features of the intermediate layers? Their magnitude can vary widely from one layer to another

To fix this, we normalize the input of a layer using the **mean** and **standard deviation** computed over the batch. We then scale the result by a learned factor $\boldsymbol{\gamma}$ and shift it by a learned offset $\boldsymbol{\beta}$

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}$$

These coefficients let the model move away from zero mean and unit variance when that helps
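As a minimal sketch of the formula above, the cell below recomputes the output of `nn.BatchNorm1d` by hand on a random batch. The tensor shapes and the variable names (`y_manual`, `max_diff`) are illustrative assumptions. PyTorch also adds a small constant `eps` under the square root for numerical stability, which the formula above leaves out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4) * 5.0 + 3.0  # assumed toy batch: 8 samples, 4 features

bn = nn.BatchNorm1d(4)  # gamma (weight) starts at 1, beta (bias) at 0
bn.train()              # training mode: normalize with the batch statistics
y_layer = bn(x)

# Same computation by hand, feature by feature
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)  # biased variance is used for normalization
gamma, beta = bn.weight, bn.bias
y_manual = gamma * (x - mu) / torch.sqrt(var + bn.eps) + beta

max_diff = (y_layer - y_manual).abs().max().item()
print(max_diff)  # expected to be close to zero
```

With the default initialization ($\boldsymbol{\gamma} = 1$, $\boldsymbol{\beta} = 0$), each feature of the output has roughly zero mean and unit variance over the batch.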

Computing the mean and standard deviation on each batch adds some noise during training, which acts as a regularizer. This is why we generally don't put **Dropout** after **BatchNorm**, or we use a very low dropout probability ($\approx 0.1$)  
During training, **BatchNorm** layers keep a running estimate of the mean and standard deviation, which replaces the batch statistics at evaluation time
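The sketch below illustrates the running statistics on synthetic data (the mean of 5 and standard deviation of 2 are arbitrary assumptions). In PyTorch, the layer stores them in the `running_mean` and `running_var` buffers, so it tracks the variance rather than the standard deviation directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
initial_mean = bn.running_mean.clone()  # buffers start at mean 0, variance 1

# Training mode: each forward pass updates the running estimates
bn.train()
for _ in range(100):
    batch = torch.randn(32, 3) * 2.0 + 5.0  # assumed data: mean 5, std 2
    bn(batch)

# Evaluation mode: the stored estimates are used and no longer updated,
# so even a single sample can be normalized
bn.eval()
mean_before_eval = bn.running_mean.clone()
single = torch.full((1, 3), 5.0)
out_eval = bn(single)

print(bn.running_mean)  # should be close to 5
print(bn.running_var)   # should be close to 4 (= 2 ** 2)
print(out_eval)         # a sample at the data mean maps close to 0
```

Forgetting to call `model.eval()` before inference is a common source of inconsistent predictions, since the layer would otherwise keep normalizing with the statistics of whatever batch it receives.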

In the original paper, **BatchNorm** layers are placed before the activation function  
However, placing **BatchNorm** after the activation function seems to give better performance (and arguably makes more sense), even though some SOTA CNN architectures place **BatchNorm** layers before the activation

<center>
    <img src='images/bn_activation.png' width=55% style="margin-left:auto; margin-right:auto"/>
    <p style="font-size:14px;">Source: <a href='https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md'>CaffeNet Benchmark</a></p>
</center>
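The two placements can be sketched as follows. `conv_block` is a hypothetical helper written for this illustration, and the $28 \times 28$ single-channel input is an assumption. The sketch only shows how the orderings differ structurally, it says nothing about which one performs better.

```python
import torch
import torch.nn as nn

def conv_block(bn_before_activation: bool) -> nn.Sequential:
    """Hypothetical helper: one conv block with either BatchNorm placement."""
    conv = nn.Conv2d(1, 6, kernel_size=3)
    if bn_before_activation:
        layers = [conv, nn.BatchNorm2d(6), nn.ReLU()]  # original paper ordering
    else:
        layers = [conv, nn.ReLU(), nn.BatchNorm2d(6)]  # BatchNorm after activation
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(4, 1, 28, 28)  # assumed input: 4 grayscale 28x28 images

out_before = conv_block(True)(x)   # ends with ReLU: output is non-negative
out_after = conv_block(False)(x)   # ends with BatchNorm: output is zero-centered

print(out_before.shape, out_after.shape)
print(out_before.min().item(), out_after.min().item())
```

One visible difference: with **BatchNorm** last, the next layer receives zero-centered features, whereas with the activation last it receives non-negative ones.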

For convolutional layers, the normalization is applied *per channel*: each channel gets its own statistics, computed over the batch and spatial dimensions
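A quick way to check the per-channel behavior, assuming a random feature map of shape `(batch, channels, height, width)`: after `nn.BatchNorm2d` in training mode, each channel should have roughly zero mean and unit standard deviation when reduced over the batch and spatial dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 6, 10, 10) * 3.0 + 2.0  # assumed shape: (N, C, H, W)

bn2d = nn.BatchNorm2d(6)  # one (gamma, beta) pair per channel
bn2d.train()
y = bn2d(x)

# Reduce over batch and spatial dimensions, keeping the channel dimension
channel_mean = y.mean(dim=(0, 2, 3))
channel_std = y.std(dim=(0, 2, 3), unbiased=False)

print(bn2d.weight.shape)  # one learned scale per channel
print(channel_mean)       # close to 0 for every channel
print(channel_std)        # close to 1 for every channel
```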

In [3]:
# For BatchNorm2d (i.e., images), we need to give the number of channels
batch_norm_example = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=3),
    nn.ReLU(),
    nn.BatchNorm2d(6),  # 6 = number of output channels of the convolution
    nn.MaxPool2d(kernel_size=2, stride=2),
)