# Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
By Sergey Ioffe and Christian Szegedy

> Ioffe, S., & Szegedy, C. (2015, June).
> Batch normalization: Accelerating deep network training by reducing internal covariate shift.
> In International conference on machine learning (pp. 448-456). PMLR.

These are my summary notes on the paper.

---

## 1. Introduction

Stochastic gradient descent (SGD) is an effective way to train deep neural nets.
SGD optimizes the parameters $\Theta$ of the network in order to minimize the loss:
\begin{equation}
    \Theta = \textrm{argmin}_{\Theta} \frac{1}{N} \sum_{i=1}^N \ell \left( x_i, \Theta \right),
\end{equation}
where $x_1, \dots, x_N$ is the training set.
In SGD, the training takes place in steps, with each step considering a *mini-batch* $x_1,\dots,x_m$ of size $m$.

Using mini-batches instead of one training example at a time has two benefits.
First, the gradient of the loss over a mini-batch is an estimate of the gradient over the whole training set,
and the quality of this estimate improves as the batch size increases.
Second, computation over a mini-batch can be much more efficient than $m$ separate computations for individual examples, because the batch can be processed in parallel.
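
As a rough illustration of the first point, the sketch below compares mini-batch gradient estimates against the gradient over the full training set and measures how much the estimate fluctuates for different batch sizes. The one-parameter loss $\ell(x_i, \Theta) = \frac{1}{2}(\Theta - x_i)^2$, the synthetic data, and all function names are assumptions made for this example; none of them come from the paper.

```python
import random
import statistics

def full_gradient(theta, data):
    """Gradient of the mean loss 0.5 * (theta - x)^2 over the whole training set."""
    return sum(theta - x for x in data) / len(data)

def minibatch_gradient(theta, data, m, rng):
    """The same gradient, estimated from m examples drawn without replacement."""
    batch = rng.sample(data, m)
    return sum(theta - x for x in batch) / m

def estimate_spread(theta, data, m, rng, trials=2000):
    """Standard deviation of the mini-batch estimate across repeated draws."""
    estimates = [minibatch_gradient(theta, data, m, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)

rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(1000)]  # toy training set (assumed)
theta = 0.0

exact = full_gradient(theta, data)
spreads = {m: estimate_spread(theta, data, m, rng) for m in (1, 10, 100)}

print(f"full gradient: {exact:.4f}")
for m, spread in spreads.items():
    print(f"m = {m:3d}   spread of the estimate: {spread:.4f}")
```

The spread should shrink as $m$ grows, which is the sense in which a larger batch gives a better estimate. With $m = N$ the estimate coincides with the full gradient up to floating-point rounding.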

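SGD is simple and effective, but it requires careful tuning of the model hyper-parameters, in particular the learning rate and the initial values of the parameters.
Training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers,
so small changes to the network parameters amplify as the network becomes deeper.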

As a result, the distribution of each layer's inputs changes during training, and the layers must continuously adapt to the new distribution.
When the input distribution to a learning system changes, the system is said to experience *covariate shift*.
The notion extends beyond the learning system as a whole: parts of a network (e.g., a sub-network or a layer) can also experience covariate shift.

Properties of the input distribution that make training more efficient, such as having the same distribution for training and test data, apply to training a sub-network as well.
It is therefore advantageous for the distribution of the sub-network's input $x$ to remain fixed over time,
so that the parameters of the sub-network do not have to readjust to compensate for changes in the distribution of $x$.
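
The paper makes this concrete with a two-stage network. The symbols below follow the paper: $u$ is the network input, $F_1$ and $F_2$ are arbitrary transformations with parameters $\Theta_1$ and $\Theta_2$, and $\alpha$ is the learning rate.
\begin{equation}
    \ell = F_2 \left( F_1 \left( u, \Theta_1 \right), \Theta_2 \right)
\end{equation}
Writing $x = F_1(u, \Theta_1)$, learning $\Theta_2$ can be viewed as training the stand-alone network $\ell = F_2(x, \Theta_2)$, whose gradient step on a mini-batch of size $m$ is
\begin{equation}
    \Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^m \frac{\partial F_2 \left( x_i, \Theta_2 \right)}{\partial \Theta_2}.
\end{equation}
From the point of view of $F_2$, a change in $\Theta_1$ is therefore a change in its input distribution.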

A fixed input distribution also benefits layers outside of the sub-network.
Consider a layer with a sigmoid activation function $g(x) = \frac{1}{1 + \exp(-x)}$, where $x$ is the input to the nonlinearity. As $|x|$ increases,
$g'(x)$ tends towards zero.
Thus, for all dimensions of $x$ except those with small absolute values,
the gradient will vanish and the model will train slowly.
Because $x$ depends on the weights and biases of this layer and of every layer below it, changes to those parameters during training are likely to push many dimensions of $x$ into the saturated regime.
This effect is amplified as the network depth increases.
However, if the distribution of nonlinearity inputs remains more stable during training,
the optimizer is less likely to get stuck in the saturated regime, and training accelerates.
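
The saturation argument can be checked numerically. The short sketch below evaluates the sigmoid derivative $g'(x) = g(x)\,(1 - g(x))$ at a few inputs; the function names and the sample points are chosen here purely for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, g'(x) = g(x) * (1 - g(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}   g'(x) = {sigmoid_grad(x):.6f}")
```

The derivative peaks at $0.25$ for $x = 0$ and has already dropped below $10^{-4}$ by $x = 10$, so a unit whose input drifts that far passes almost no gradient to the layers below it.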

### Internal Covariate Shift

Internal covariate shift is the change in the distributions of the internal nodes (activations) of a deep network over the course of training.
Reducing or eliminating internal covariate shift promises faster training.
**Batch normalization** is a mechanism that reduces internal covariate shift.

Batch normalization works through a normalization step that fixes the means and variances of layer inputs.
It also reduces the dependence of gradients on the scale of the parameters or of their initial values,
which allows for higher learning rates without the risk of divergence.
In addition, batch normalization regularizes the model and reduces the need for Dropout.
Finally, it makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
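
To make "fixing the means and variances of layer inputs" concrete, here is a minimal sketch that standardizes each feature over a mini-batch to zero mean and unit variance. It covers only this normalization step: the paper's full transform also includes a learned scale and shift, which the sketch omits. The function name `normalize_batch`, the `eps` value, and the toy inputs are assumptions made for illustration.

```python
import math

def normalize_batch(batch, eps=1e-5):
    """Standardize each feature (column) over the mini-batch (rows).

    eps is a small constant added to the variance for numerical stability;
    its value here is an assumption.
    """
    m = len(batch)
    dims = len(batch[0])
    means = [sum(row[d] for row in batch) / m for d in range(dims)]
    variances = [
        sum((row[d] - means[d]) ** 2 for row in batch) / m for d in range(dims)
    ]
    return [
        [(row[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dims)]
        for row in batch
    ]

# Two features on very different scales (toy values).
batch = [[1.0, 200.0], [2.0, 220.0], [3.0, 240.0], [4.0, 260.0]]
normalized = normalize_batch(batch)

for row in normalized:
    print([round(value, 3) for value in row])
```

After normalization both features have mean approximately zero and variance approximately one, regardless of their original scale, which is what keeps the inputs to the next layer stable as earlier parameters change.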

## 2. Towards Reducing Internal Covariate Shift