## Batch Normalization

Just as standard scaling speeds up learning by normalizing each feature in the input layer, batch normalization normalizes every unit of a hidden layer.

The mean and variance of each unit are estimated over the $m$ observations in the mini-batch.
Typically the pre-activation value $z$ is normalized, but this is not a hard rule.

$$ \mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)} $$

$$ \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z^{(i)} - \mu)^2 $$

$$ z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

$$ z^{(i)}_{final} = \gamma z^{(i)}_{norm} + \beta $$
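The four equations above can be sketched in NumPy as follows. This is a minimal illustration, not the Keras implementation: the function name `batch_norm_forward` and the toy data are assumptions made for the example, and `eps=1e-3` simply mirrors the `epsilon` default shown in the Keras signature in this notebook.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-3):
    """Hypothetical helper: normalize each unit (column) of z over the mini-batch (rows)."""
    mu = z.mean(axis=0)                      # per-unit mean over the m observations
    var = z.var(axis=0)                      # per-unit variance, dividing by m as in the formula
    z_norm = (z - mu) / np.sqrt(var + eps)   # zero mean, approximately unit variance
    z_final = gamma * z_norm + beta          # learnable scale and shift
    return z_final, mu, var

# Toy pre-activations: m = 64 observations, 4 units, far from zero mean / unit variance
rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

out, mu, var = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # close to 0 for every unit
print(out.std(axis=0))   # close to 1 for every unit
```

With $\gamma = 1$ and $\beta = 0$ the output is just the normalized value; other values of $\gamma$ and $\beta$ rescale and shift it.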

The parameters $\gamma$ and $\beta$ are learned by backpropagation, together with the weights.
Batch normalization of layer $l$'s pre-activations makes the input to layer $l+1$ vary less between iterations: for each unit, the mean and variance of the normalized output are set by $\beta$ and $\gamma$ rather than by the changing weights of the earlier layers. This speeds up learning.
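To see why the next layer's input varies less, the sketch below normalizes two mini-batches whose raw pre-activations have very different distributions, standing in for an early and a late training iteration. The helper `batch_norm` and the chosen values of `gamma` and `beta` are assumptions made for illustration only.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-3):
    """Hypothetical helper: batch-normalize each column of z, then scale and shift."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(1)
gamma, beta = 2.0, -1.0  # illustrative values; in practice these are learned

# Same unit, two iterations: the raw distribution has drifted a lot
early = rng.normal(loc=0.0, scale=1.0, size=(128, 1))
late = rng.normal(loc=10.0, scale=5.0, size=(128, 1))

out_early = batch_norm(early, gamma, beta)
out_late = batch_norm(late, gamma, beta)

# Both outputs have mean close to beta and standard deviation close to gamma
print(out_early.mean(), out_early.std())
print(out_late.mean(), out_late.std())
```

Whatever the earlier layers do to the raw values, the distribution passed on has a mean and spread controlled only by $\beta$ and $\gamma$.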

Because the estimates of the mean and variance differ from batch to batch, $z^{[l]}$ becomes noisy, which also acts as a mild regularizer. The amount of noise decreases as the batch size grows.
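The dependence of the noise on batch size can be checked with a small simulation. The sketch below draws many mini-batches of different sizes from the same synthetic population and measures how much the batch mean fluctuates; the population, the batch sizes, and the helper name `spread_of_batch_means` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for one unit's pre-activations over the whole training set
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def spread_of_batch_means(m, n_batches=2000):
    """Hypothetical helper: std of the mini-batch mean across many batches of size m."""
    batches = rng.choice(population, size=(n_batches, m))  # sampled with replacement
    return batches.mean(axis=1).std()

spreads = {m: spread_of_batch_means(m) for m in (8, 64, 512)}
for m, s in spreads.items():
    print(m, s)  # the spread shrinks as m grows, roughly like 1 / sqrt(m)
```

Larger batches therefore give steadier estimates of $\mu$ and $\sigma^2$, and with them a weaker regularizing effect.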

At validation time, estimate $\mu$ and $\sigma^2$ as exponentially weighted averages across the training mini-batches. Use these estimates instead of statistics computed from the validation dataset.
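A minimal sketch of this bookkeeping, assuming the common update `running = momentum * running + (1 - momentum) * batch_value`. The value `momentum = 0.99` mirrors the default in the Keras signature shown in this notebook; the synthetic data and the helper `batch_norm_inference` are assumptions made for the example.

```python
import numpy as np

momentum, eps = 0.99, 1e-3
running_mean = np.zeros(3)  # same starting points as the 'zeros' / 'ones' initializers
running_var = np.ones(3)

rng = np.random.default_rng(3)
for _ in range(2000):
    # One training mini-batch: 32 observations of 3 units
    z = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.5], size=(32, 3))
    # Exponentially weighted averages of the per-batch statistics
    running_mean = momentum * running_mean + (1 - momentum) * z.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * z.var(axis=0)

def batch_norm_inference(z, gamma, beta):
    """Hypothetical helper: normalize with the stored estimates, not the batch's own statistics."""
    return gamma * (z - running_mean) / np.sqrt(running_var + eps) + beta

# Works even for a single observation, where batch statistics would be meaningless
single = np.array([[1.0, -2.0, 0.5]])
result = batch_norm_inference(single, gamma=np.ones(3), beta=np.zeros(3))
print(running_mean, running_var)
print(result)
```

Because the stored estimates do not depend on the data being scored, the same input always produces the same output at validation time.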

In [1]:
import tensorflow.keras.layers
?tensorflow.keras.layers.BatchNormalization

Init signature:
tensorflow.keras.layers.BatchNormalization(
    axis=-1,
    momentum=0.99,
    epsilon=0.001,
    center=True,
    scale=True,
    beta_initializer='zeros',
    gamma_initializer='ones',
    moving_mean_initializer='zeros',
    moving_variance_initializer='ones',
[0;34m[0m    [0mbeta_regularizer[0m[0;