## Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe S., Szegedy C., 2015)

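Just as standard scaling speeds up learning by normalizing each feature of the training dataset, batch normalization normalizes the input to every unit of a hidden layer. The mean and variance of each unit are estimated over the samples in the current mini-batch.
Typically, batch normalization is applied before the activation function, but this is not a hard rule. In the formulas below, $S$ is the batch size and $\textbf{z}^{[l]}_{s, :}$ is the pre-activation vector of layer $l$ for sample $s$.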

$$ \textbf{u}^{[l]} = \frac{1}{S} \sum_{s=1}^{S}{\textbf{z}^{[l]}_{s, :}} $$

$$ \textbf{v}^{[l]} = \frac{1}{S} \sum_{s=1}^{S}{(\textbf{z}^{[l]}_{s, :} - \textbf{u}^{[l]})^2} $$

$$
\textbf{z}^{[l]} \leftarrow
\frac
    {\textbf{z}^{[l]} - \textbf{u}^{[l]}}
    {\sqrt{\textbf{v}^{[l]} + \epsilon}}
$$

$$
\textbf{z}^{[l]} \leftarrow
\textbf{z}^{[l]} \odot \tilde{\textbf{v}}^{[l]} + \tilde{\textbf{u}}^{[l]}
$$

Parameters $\tilde{\textbf{v}}^{[l]}$ (scale) and $\tilde{\textbf{u}}^{[l]}$ (shift) are learned with backpropagation.
Batch normalization of the layer $l$ pre-activation makes the input to layer $l+1$ vary less between iterations: each unit's mean and variance over the batch are set by $\tilde{\textbf{u}}^{[l]}$ and $\tilde{\textbf{v}}^{[l]}$ alone, rather than by the changing weights of all earlier layers. This speeds up learning.
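
The effect can be checked numerically. The NumPy sketch below is only an illustration: the helper name `batch_norm`, the parameter values, and the two synthetic input distributions are assumptions, not part of the paper. It applies the four formulas above to two batches with very different raw statistics and prints the per-unit mean and standard deviation of the outputs, which should match $\tilde{\textbf{u}}$ and $\tilde{\textbf{v}}$ in both cases.

```python
import numpy as np

def batch_norm(z, tilde_v, tilde_u, eps=1e-10):
    """Normalize each column of z over the batch, then scale and shift."""
    u = z.mean(axis=0)                   # per-unit batch mean
    v = z.var(axis=0)                    # per-unit (biased) batch variance
    z_hat = (z - u) / np.sqrt(v + eps)   # zero mean, unit variance per unit
    return z_hat * tilde_v + tilde_u

rng = np.random.default_rng(0)
tilde_v = np.array([2.0, 0.5, 1.0])      # illustrative scale parameters
tilde_u = np.array([-1.0, 0.0, 3.0])     # illustrative shift parameters

# Two batches whose raw statistics differ a lot
batch_a = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
batch_b = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

out_a = batch_norm(batch_a, tilde_v, tilde_u)
out_b = batch_norm(batch_b, tilde_v, tilde_u)

# Both pairs should be close to tilde_u and tilde_v respectively
print(out_a.mean(axis=0), out_a.std(axis=0))
print(out_b.mean(axis=0), out_b.std(axis=0))
```

Whatever the earlier layers produce, the next layer sees inputs whose first two moments depend only on the two learned vectors.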

Since the mean and variance estimates differ from batch to batch, $\textbf{z}^{[l]}$ becomes noisy, which also acts as a mild regularizer. The amount of noise decreases as the batch size grows.
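
To make the dependence on batch size concrete, here is a small simulation. It assumes synthetic, independent standard-normal pre-activations, and the helper `batch_mean_spread` is a hypothetical name. It measures how much the batch-mean estimate fluctuates from batch to batch for several batch sizes. For independent samples the spread is expected to shrink roughly as $1/\sqrt{S}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_spread(batch_size, n_batches=2000):
    """Standard deviation of the batch-mean estimate across many batches."""
    batches = rng.normal(loc=0.0, scale=1.0, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

# Larger batches give a steadier estimate of the mean, hence less noise
spread = {size: batch_mean_spread(size) for size in (8, 64, 512)}
for size, value in spread.items():
    print(f"batch size {size:4d}: spread of mean estimate = {value:.4f}")
```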

At validation and test time, estimate $\textbf{u}^{[l]}$ and $\textbf{v}^{[l]}$ as exponentially weighted averages over the training mini-batches. Use these estimates instead of statistics computed from the validation data.
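
A minimal sketch of this bookkeeping follows. The decay rate `momentum = 0.99`, the zero/one initial values, the batch size, and the synthetic data are all assumptions chosen for illustration. The running estimates are updated from every training mini-batch and then reused unchanged to normalize validation samples.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.99   # assumed decay rate of the exponentially weighted average
eps = 1e-3
n_units = 4

running_mean = np.zeros(n_units)
running_var = np.ones(n_units)

# Training: fold each mini-batch's statistics into the running estimates
for _ in range(2000):
    z = rng.normal(loc=5.0, scale=2.0, size=(32, n_units))
    running_mean = momentum * running_mean + (1 - momentum) * z.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * z.var(axis=0)

# Validation: normalize with the stored estimates, not the batch's own statistics
z_val = rng.normal(loc=5.0, scale=2.0, size=(3, n_units))
z_val_hat = (z_val - running_mean) / np.sqrt(running_var + eps)

print(running_mean)  # should approach the true mean of 5
print(running_var)   # should approach the true variance of about 4
```

Because the stored statistics do not depend on the validation batch, the output for a given sample is the same regardless of which other samples it is evaluated with.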

### References
* https://arxiv.org/pdf/1502.03167.pdf
* https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
* https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

In [5]:
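import numpy as np

m = 10   # units in the layer
s = 100  # samples in the batch
z = np.random.rand(s, m)

# Per-unit mean and variance, estimated over the batch (axis 0)
u = np.sum(z, axis=0) / s
v = np.sum((z - u) ** 2, axis=0) / s

# Normalize; the parentheses ensure the mean is subtracted before dividing
z = (z - u) / np.sqrt(v + 1e-10)

# Learnable shift and scale, randomly initialized here for illustration
tilde_u = np.random.rand(*u.shape)
tilde_v = np.random.rand(*v.shape)

z = z * tilde_v + tilde_u

print(z.shape)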

(100, 10)


In [1]:
import tensorflow.keras.layers
?tensorflow.keras.layers.BatchNormalization

Init signature:
tensorflow.keras.layers.BatchNormalization(
    axis=-1,
    momentum=0.99,
    epsilon=0.001,
    center=True,
    scale=True,
    beta_initializer='zeros',
    gamma_initializer='ones',
    moving_mean_initializer='zeros',
    moving_variance_initializer='ones',
    beta_regularizer