# Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
By Sergey Ioffe and Christian Szegedy

> Ioffe, S., & Szegedy, C. (2015, June).
> Batch normalization: Accelerating deep network training by reducing internal covariate shift.
> In International conference on machine learning (pp. 448-456). PMLR.

These are my summarization notes from the paper.

---

## 1. Introduction

Stochastic gradient descent (SGD) is an effective way to train deep neural nets.
SGD optimizes the parameters $\Theta$ of the network in order to minimize the loss:
\begin{equation}
    \Theta = \textrm{argmin}_{\Theta} \frac{1}{N} \sum_{i=1}^N \ell \left( x_i, \Theta \right),
\end{equation}
where $x_1, \dots, x_N$ is the training set.
In SGD, the training takes place in steps, with each step considering a *mini-batch* $x_1,\dots,x_m$ of size $m$.

Using mini-batches instead of one training sample at a time has several benefits.
First, the gradient of the loss for a mini-batch is an estimate of the gradient over the whole training set.
The quality of the estimation improves as the batch size increases.
Second, computation over a mini-batch can be more efficient than $m$ computations for individual samples.
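
The first point can be illustrated with a small sketch. The setup is my own and not from the paper: a scalar parameter `theta`, a toy per-example loss $\frac{1}{2}(\theta - x_i)^2$, and synthetic Gaussian data. It measures how far the mini-batch gradient falls from the full-data gradient for several batch sizes.

```python
import random

def full_gradient(theta, data):
    """Gradient of (1/N) * sum_i 0.5 * (theta - x_i)^2 with respect to theta."""
    return sum(theta - x for x in data) / len(data)

def minibatch_gradient(theta, batch):
    """The same gradient, estimated from a mini-batch only."""
    return sum(theta - x for x in batch) / len(batch)

def mean_squared_error_of_estimate(theta, data, m, trials, rng):
    """Average squared gap between mini-batch and full-data gradients."""
    exact = full_gradient(theta, data)
    total = 0.0
    for _ in range(trials):
        batch = rng.sample(data, m)  # draw m examples without replacement
        total += (minibatch_gradient(theta, batch) - exact) ** 2
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic training set (assumption)
theta = 0.5  # arbitrary illustrative parameter value

for m in (1, 4, 16, 64, 256):
    err = mean_squared_error_of_estimate(theta, data, m, trials=500, rng=rng)
    print(f"batch size {m:4d}: mean squared error {err:.5f}")
```

On this toy problem the error should shrink roughly in proportion to $1/m$, which is the sense in which larger batches give a better gradient estimate.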

SGD works well, but requires careful tuning of the model hyper-parameters, in particular the learning rate and the initial parameter values.
Training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers,
so small changes to the network parameters amplify as the network becomes deeper.

This is a problem because layers must continuously adapt to the changing distribution of their inputs.
When the input distribution to a learning system changes, it is said to experience *covariate shift*.
Parts of a network (e.g., a sub-network or a layer) can also experience covariate shift.

Properties of the input distribution that make training more efficient for a whole network also apply to a sub-network, so it helps to normalize the inputs of a sub-network as well.
If the distribution of the sub-network input $x$ remains stable over time, then the parameters of the sub-network
do not have to readjust to compensate for changes in the distribution of $x$.

A stable input distribution for a sub-network also helps layers outside of the sub-network.
If a layer has a sigmoid activation function $g(x) = \frac{1}{1 + \exp(-x)}$, then as $|x|$ increases,
$g'(x)$ tends towards zero.
Thus, for all dimensions of $x$ except those with small absolute values,
the gradient will vanish and the model will train slowly.
Over time, changes to the weights and biases are likely to push many dimensions of $x$ into the saturated regime.
This effect is amplified as the network depth increases.
However, if the distribution of nonlinearity inputs remains more stable during training,
the optimizer is less likely to get stuck in the saturated regime, which would accelerate training.
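
To make the saturation argument concrete, here is a minimal sketch that evaluates the sigmoid derivative, using the identity $g'(x) = g(x)(1 - g(x))$, at a few input magnitudes. The sample points are arbitrary choices of mine.

```python
import math

def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """g'(x) = g(x) * (1 - g(x))."""
    g = sigmoid(x)
    return g * (1.0 - g)

# The derivative peaks at x = 0 and decays quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  g'(x) = {sigmoid_grad(x):.6f}")
```

The derivative is at most $0.25$ (at $x = 0$) and is already below $10^{-4}$ at $|x| = 10$, so a unit whose input drifts that far passes almost no gradient back.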

### Internal Covariate Shift

Internal covariate shift is the change in the distributions of internal nodes of a deep network in the course of training.
Reducing or eliminating internal covariate shift should speed up training.
**Batch normalization** is a mechanism that reduces internal covariate shift.

It works by adding a normalization step that fixes the means and variances of layer inputs.
It also reduces the dependence of gradients on the scale or initial values of the parameters,
which allows for higher learning rates without risk of divergence.
In addition, batch normalization regularizes the model and reduces the need for Dropout.
Finally, it makes it possible to use saturating nonlinearities, since it prevents the network from getting stuck in the saturated modes.

## 2. Towards Reducing Internal Covariate Shift

*Internal Covariate Shift* is the change in the distribution of network activations due to the change in network parameters during training. The intuition for reducing it comes from the knowledge that networks converge faster if the inputs are standardized, that is, adjusted to have zero mean and unit variance.

Batch normalization aims to ensure that, for any parameter values, the network *always* produces activations with the desired distribution, since this would allow the gradient of the loss with respect to the model parameters to account for the normalization and for its dependence on the model parameters $\Theta$. If $x$ is a layer input and $X$ is the set of inputs over the training data set, then the normalization can be written as a transformation
\begin{equation}
    \hat{x} = \textrm{Norm}(x,X).
\end{equation}
This depends not only on the given training example $x$, but on all examples $X$, each of which depends on $\Theta$ (the set of parameters) if $x$ is generated by another layer. For backpropagation, you would need to compute the Jacobians
$\frac{\partial \textrm{Norm}(x,X)}{\partial x}$ and
$\frac{\partial \textrm{Norm}(x,X)}{\partial X}$. However, this is computationally expensive, so the authors looked for another method.

## 3. Normalization via Mini-Batch Statistics

It is computationally costly to fully standardize each layer's inputs, so batch normalization makes two simplifications. First, it standardizes each feature independently (each feature is normalized to have zero mean and unit variance):

\begin{equation}
    \hat{x}^{(k)} = \frac{x^{(k)} - E(x^{(k)})} {\sqrt{ \textrm{Var}(x^{(k)}) }}
\end{equation}

where the expectation and variance are computed over the training data set.

The issue with this, however, is that normalizing the inputs of a layer may change what the layer can represent. For example, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, the authors introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ that scale and shift the normalized value:

\begin{equation}
    y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.
\end{equation}

The $\gamma^{(k)}$ and $\beta^{(k)}$ parameters are learned along with the original model parameters. In fact, if $\gamma^{(k)} = \sqrt{\textrm{Var}(x^{(k)})}$ and $\beta^{(k)} = E(x^{(k)})$, then you would recover the original activations.
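
The identity-recovery claim can be checked numerically. The sketch below is illustrative only: the activation values are made up, and it omits the small $\epsilon$ used for numerical stability in practice, so the recovery is exact up to floating-point error.

```python
import math

def standardize(xs):
    """Return (x_hat, mean, variance), using the population variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    x_hat = [(x - mean) / math.sqrt(var) for x in xs]
    return x_hat, mean, var

def scale_and_shift(x_hat, gamma, beta):
    """y = gamma * x_hat + beta, applied elementwise."""
    return [gamma * v + beta for v in x_hat]

activations = [0.2, -1.5, 3.0, 0.7, 4.1]  # arbitrary illustrative values
x_hat, mean, var = standardize(activations)

# Choosing gamma = sqrt(Var[x]) and beta = E[x] undoes the normalization.
recovered = scale_and_shift(x_hat, gamma=math.sqrt(var), beta=mean)
print(max(abs(a - b) for a, b in zip(activations, recovered)))
```

So the scale-and-shift step does not remove any representational capacity: the network can learn to undo the normalization if that turns out to be useful.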

The second simplification is to use the training batch (used for stochastic gradient descent) for normalization. Since we are already using each mini-batch to approximate the training set, we can use each mini-batch to estimate the mean and variance of each activation. This also allows normalization to fully participate in backpropagation.

The use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances. If joint covariances were being used, regularization would be required since the mini-batch size is likely to be smaller than the number of activations, resulting in singular covariance matrices.
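
A short derivation shows why the covariance would be singular. The notation is mine rather than the paper's: $x_i \in \mathbb{R}^d$ is the full activation vector of example $i$, $d$ is the number of activations, and $\hat{\Sigma}_B$ is the mini-batch covariance.

\begin{equation}
    \hat{\Sigma}_B = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)(x_i - \mu_B)^T,
    \qquad
    \sum_{i=1}^m (x_i - \mu_B) = 0
    \;\Rightarrow\;
    \textrm{rank}(\hat{\Sigma}_B) \le m - 1.
\end{equation}

The matrix is a sum of $m$ rank-one terms whose generating vectors are linearly dependent, so whenever $m \le d$ the $d \times d$ matrix $\hat{\Sigma}_B$ cannot have full rank and cannot be inverted without regularization. Per-dimension variances involve no matrix inverse and avoid the problem.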

**Algorithm:**

Input: Values of $x$ over a mini-batch: $B = \left\{ x_1, \dots, x_m \right\}$; parameters to be learned: $\gamma,\beta$.

Output: $\{ y_i = \textrm{BN}_{\gamma,\beta} (x_i) \}$

Mini-batch mean: $\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^m x_i$

Mini-batch variance: $\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2$

Normalize: $\hat{x}_i \leftarrow \frac{ x_i - \mu_B }{ \sqrt{\sigma_B^2 + \epsilon} }$

Scale and shift: $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \textrm{BN}_{\gamma,\beta}(x_i)$
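
The four steps above can be sketched for a single activation as follows. This is a minimal illustration and not a reference implementation: the function name `batch_norm_forward`, the default `eps=1e-5`, and the sample batch are my own assumptions.

```python
import math

def batch_norm_forward(xs, gamma, beta, eps=1e-5):
    """Apply the BN transform to one activation over a mini-batch.

    xs holds the values of a single activation across the m examples
    of the mini-batch. eps=1e-5 is an assumed default, not a value
    taken from the notes. Returns the outputs y_i.
    """
    m = len(xs)
    mu = sum(xs) / m                              # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m      # mini-batch variance
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(x - mu) * inv_std for x in xs]      # normalize
    return [gamma * v + beta for v in x_hat]      # scale and shift

batch = [1.0, 2.0, 3.0, 4.0]  # arbitrary illustrative mini-batch
ys = batch_norm_forward(batch, gamma=2.0, beta=0.5)
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
print(mean_y, var_y)
```

The outputs have mean $\beta$ and variance close to $\gamma^2$; the variance falls slightly short of $\gamma^2$ because of the $\epsilon$ in the denominator.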

During training, it will be necessary to backpropagate the gradient of the loss, $\ell$, through the batch normalization transform and compute the gradients with respect to the parameters of the BN transform. Using the chain rule, we get:

\begin{equation}
\begin{aligned}
    \frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i} \cdot \gamma\\
    \frac{\partial\ell}{\partial \sigma_B^2} &= \sum_{i=1}^m \frac{\partial\ell}{\partial \hat{x}_i} \cdot
    \left( x_i - \mu_B \right) \cdot \frac{-1}{2} \left(\sigma_B^2 + \epsilon\right)^{-3/2}\\
    \frac{\partial\ell}{\partial\mu_B} &= \left( \sum_{i=1}^m \frac{\partial\ell}{\partial\hat{x}_i} \cdot
    \frac{-1}{\sqrt{ \sigma_B^2 + \epsilon }} \right) + \frac{\partial\ell}{\partial\sigma_B^2} \cdot
    \frac{\sum_{i=1}^m -2 (x_i - \mu_B)}{m}\\
    \frac{\partial\ell}{\partial x_i} &= \frac{\partial\ell}{\partial \hat{x}_i} \cdot
    \frac{1}{\sqrt{ \sigma_B^2 + \epsilon }} + \frac{\partial\ell}{\partial\sigma_B^2} \cdot
    \frac{2 (x_i - \mu_B)}{m} + \frac{\partial\ell}{\partial\mu_B} \cdot \frac{1}{m}\\
    \frac{\partial\ell}{\partial\gamma} &= \sum_{i=1}^m \frac{\partial\ell}{\partial y_i} \cdot \hat{x}_i\\
    \frac{\partial\ell}{\partial\beta} &= \sum_{i=1}^m \frac{\partial\ell}{\partial y_i}
\end{aligned}
\end{equation}
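
As a sanity check on the chain-rule expressions, the sketch below implements them for a single activation and compares the result against central finite differences. Everything in it is an illustrative assumption of mine: the function names, the toy loss (a fixed weighted sum of the BN outputs), and the numbers. The parameter gradients for $\gamma$ and $\beta$ follow directly from $y_i = \gamma \hat{x}_i + \beta$.

```python
import math

def bn_forward(xs, gamma, beta, eps=1e-5):
    """BN transform for one activation; also returns values needed for backprop."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    ys = [gamma * v + beta for v in x_hat]
    return ys, (xs, x_hat, mu, var, gamma, eps)

def bn_backward(dys, cache):
    """Chain rule through BN; dys[i] holds d(loss)/d(y_i)."""
    xs, x_hat, mu, var, gamma, eps = cache
    m = len(xs)
    dx_hat = [dy * gamma for dy in dys]
    dvar = sum(d * (x - mu) for d, x in zip(dx_hat, xs)) * -0.5 * (var + eps) ** -1.5
    # The second term is analytically zero (deviations from the mean sum to
    # zero); it is kept here to mirror the formula in the paper.
    dmu = (sum(-d / math.sqrt(var + eps) for d in dx_hat)
           + dvar * sum(-2.0 * (x - mu) for x in xs) / m)
    dxs = [d / math.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
           for d, x in zip(dx_hat, xs)]
    dgamma = sum(dy * v for dy, v in zip(dys, x_hat))
    dbeta = sum(dys)
    return dxs, dgamma, dbeta

def loss(xs, gamma, beta, weights):
    """Hypothetical scalar loss: a fixed weighted sum of the BN outputs."""
    ys, _ = bn_forward(xs, gamma, beta)
    return sum(w * y for w, y in zip(weights, ys))

xs = [0.3, -1.2, 2.5, 0.9]        # arbitrary illustrative mini-batch
weights = [1.0, -2.0, 0.5, 3.0]   # so d(loss)/d(y_i) = weights[i]
gamma, beta = 1.7, -0.4

_, cache = bn_forward(xs, gamma, beta)
dxs, dgamma, dbeta = bn_backward(weights, cache)

# Central finite differences as an independent check.
h = 1e-6
numeric_dxs = []
for i in range(len(xs)):
    up = xs[:i] + [xs[i] + h] + xs[i + 1:]
    down = xs[:i] + [xs[i] - h] + xs[i + 1:]
    numeric_dxs.append((loss(up, gamma, beta, weights)
                        - loss(down, gamma, beta, weights)) / (2 * h))
numeric_dgamma = (loss(xs, gamma + h, beta, weights)
                  - loss(xs, gamma - h, beta, weights)) / (2 * h)
numeric_dbeta = (loss(xs, gamma, beta + h, weights)
                 - loss(xs, gamma, beta - h, weights)) / (2 * h)

print(max(abs(a - b) for a, b in zip(dxs, numeric_dxs)))
```

One property worth noticing is that the input gradients sum to zero: adding the same constant to every $x_i$ leaves the BN output unchanged, so the loss cannot depend on a uniform shift of the mini-batch.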
