Normalization is a useful technique not only for preprocessing data but also for improving training and inference. This tutorial walks through several normalization methods used in training.

Reference: https://arxiv.org/abs/1803.08494

![](https://github.com/shaohua0116/Group-Normalization-Tensorflow/raw/master/figure/gn.png)

Tensorflow.org (2020)

In the following, we use `x` as the batch of data to normalize.

In [0]:
!pip install -q tf-nightly tensorflow-addons

In [6]:
import tensorflow as tf
import tensorflow_addons as tfa

print("Tensorflow Version: {}".format(tf.__version__))
print("GPU {} available.".format("is" if tf.config.experimental.list_physical_devices("GPU") else "not"))

Tensorflow Version: 2.2.0-dev20200303
GPU not available.


In [0]:
x = tf.random.normal(shape=(3, 2, 3, 2))  # (batch_size, H, W, Channel)

# Batch Normalization

## The Original Batch Normalization using `tf.nn.batch_normalization`

$$\mu_{j} = \frac{1}{m}\sum^{m}_{i=1} x_{ij}$$
$$\sigma^{2}_{j}=\frac{1}{m}\sum^{m}_{i=1}(x_{ij}-\mu_{j})^2$$
$$\hat{x_{ij}}=\frac{x_{ij}-\mu_{j}}{\sqrt{\sigma^2_j + \epsilon}}$$

In [24]:
means, vars = tf.nn.moments(x, axes=[0, 1, 2], keepdims=True)
bn_scratch = (x - means) / tf.sqrt(vars + 1e-3)

# through tf.nn.batch_normalization
b_ori = tf.nn.batch_normalization(x, means, vars, offset=None, scale=None, variance_epsilon=1e-3)

print("mean shape: {}, vars shape: {}".format(means.shape, vars.shape))

mean shape: (1, 1, 1, 2), vars shape: (1, 1, 1, 2)


In [19]:
bn_scratch[0, ...], b_ori[0, ...]

(<tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 1.0182704 ,  0.5101875 ],
         [ 0.6017975 , -1.3783205 ],
         [-0.3881208 ,  1.2821097 ]],
 
        [[ 1.1666497 ,  1.3473433 ],
         [ 0.53064954,  0.19922358],
         [-1.4957787 ,  0.26144832]]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 1.0182704 ,  0.5101875 ],
         [ 0.6017976 , -1.3783205 ],
         [-0.38812083,  1.2821096 ]],
 
        [[ 1.1666497 ,  1.3473433 ],
         [ 0.53064954,  0.19922358],
         [-1.4957788 ,  0.26144832]]], dtype=float32)>)
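
As a cross-check that does not depend on TensorFlow, the same per-channel normalization can be sketched in NumPy. This is only an illustration and not part of the original notebook: the array `x_np`, the random seed, and the `loc`/`scale` values are assumptions made for the example. After normalization, each channel should have a mean near 0 and a variance slightly below 1 (because of `epsilon`).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Hypothetical input with the same layout as x: (batch_size, H, W, Channel)
x_np = rng.normal(loc=2.0, scale=3.0, size=(3, 2, 3, 2))

eps = 1e-3
# Per-channel statistics: reduce over batch, height and width.
mean = x_np.mean(axis=(0, 1, 2), keepdims=True)
var = x_np.var(axis=(0, 1, 2), keepdims=True)  # population variance (divides by m)
x_hat = (x_np - mean) / np.sqrt(var + eps)

print(mean.shape, var.shape)          # (1, 1, 1, 2) (1, 1, 1, 2)
print(x_hat.mean(axis=(0, 1, 2)))     # close to 0 for each channel
print(x_hat.var(axis=(0, 1, 2)))      # slightly below 1 for each channel
```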

## The Advanced BN using `tf.keras.layers.BatchNormalization`

`tf.keras.layers.BatchNormalization` is a higher-level API for batch normalization. However, it behaves slightly differently from the lower-level API `tf.nn.batch_normalization`.

Batch normalization consists of two main calculations: the standard normalization `x_hat = (x - mean) / sqrt(variance + epsilon)`, and the **de-normalization** `y = gamma * x_hat + beta`. In the example above, the lower-level API was called with `offset=None` and `scale=None`, so it performed only the first calculation. The higher-level API performs the whole calculation and also manages `gamma`, `beta`, and the running statistics itself.

The higher-level API mainly involves the following quantities.
* `mean`, `variance`: the mean and variance of the current batch
* `gamma`, `beta`: the scale and offset, which are **trainable** and updated from the gradients
* `running_mean`, `running_variance`: moving averages of the mean and variance collected from each batch (Keras stores them as `moving_mean` and `moving_variance`)

In training,
* The first step is to compute the batch mean and variance and normalize the data with them: `x_hat = (x - mean) / sqrt(variance + epsilon)`.
* The second step is to multiply `x_hat` by gamma and add beta: `out = x_hat * gamma + beta`. (We expect gamma and beta to be learned from the data.)
* The third step is to fold the current batch mean and variance into the running statistics: `running_mean = momentum * running_mean + (1 - momentum) * mean_batch` and `running_variance = momentum * running_variance + (1 - momentum) * variance_batch`.

From the update rule above, the more batches the layer has seen, the more representative `running_mean` and `running_variance` become of the data as a whole.
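
The three training steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not how Keras implements the layer: the function name `bn_train_step`, the synthetic data (mean 5, standard deviation 2), and `momentum=0.9` (chosen only so the example converges quickly) are all hypothetical. It shows the running statistics drifting from their initial values `0` and `1` toward the statistics of the data as more batches are processed.

```python
import numpy as np

def bn_train_step(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-3):
    """One training-mode step on a (batch, H, W, C) array (hypothetical helper)."""
    # Step 1: normalize with the statistics of the current batch.
    mean = x.mean(axis=(0, 1, 2))
    var = x.var(axis=(0, 1, 2))
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 2: scale by gamma and shift by beta.
    out = gamma * x_hat + beta
    # Step 3: fold the batch statistics into the running statistics.
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var

rng = np.random.default_rng(0)
channels = 2
gamma, beta = np.ones(channels), np.zeros(channels)
running_mean, running_var = np.zeros(channels), np.ones(channels)  # initial values

for _ in range(500):
    # Synthetic batch with true mean 5 and true variance 4 (assumed for the demo).
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 4, 4, channels))
    out, running_mean, running_var = bn_train_step(
        batch, gamma, beta, running_mean, running_var)

print(running_mean)  # close to 5 for each channel
print(running_var)   # close to 4 for each channel
```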

In inference,
the data fed into the model often does not arrive in batches, so the batch mean and variance cannot be computed reliably. Instead, the layer uses `running_mean` and `running_variance`.
The normalization becomes `x_hat = (x - running_mean) / sqrt(running_variance + epsilon)`, and the output is `out = x_hat * gamma + beta`.
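
A short NumPy sketch of the inference-mode calculation is given below. The helper `bn_inference` and all numeric values are assumptions made up for the illustration. Note that a single sample is enough, because no statistics are computed from the input itself.

```python
import numpy as np

def bn_inference(x, gamma, beta, running_mean, running_var, eps=1e-3):
    """Inference-mode batch normalization using stored statistics (hypothetical helper)."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# One sample with two channels, shape (1, 1, 1, 2); values are illustrative only.
sample = np.array([[[[5.0, -1.0]]]])
running_mean = np.array([5.0, 1.0])
running_var = np.array([4.0, 1.0])
gamma = np.array([1.0, 2.0])
beta = np.array([0.0, 0.5])

out = bn_inference(sample, gamma, beta, running_mean, running_var)
print(out.shape)  # (1, 1, 1, 2)
print(out)        # channel 0 is exactly 0 because the input equals the running mean
```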

However, if you use the higher-level API without any training, the batch mean and variance are not computed, so the layer falls back on `running_mean` and `running_variance`. Because no batches have been seen yet, there is no history either, and both are still at their initial values, `0` and `1`.

In [0]:
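# scale=False disables gamma, center=False disables beta
# momentum: only used in training mode, to update the running statistics
# axis=-1: the channel axis of the (batch_size, H, W, Channel) input
b = tf.keras.layers.BatchNormalization(
    axis=-1, trainable=False, scale=False, center=False, epsilon=0.001)(x)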

running_mean, running_vars = 0., 1.
bnorm_scratch = (x - running_mean) / tf.sqrt(running_vars + 1e-3)

In [17]:
bnorm_scratch[0, ...], b[0, ...]

(<tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 1.3049238 ,  0.6310653 ],
         [ 0.92743635, -1.4323952 ],
         [ 0.0301825 ,  1.4744987 ]],
 
        [[ 1.4394137 ,  1.5457758 ],
         [ 0.86294836,  0.2912935 ],
         [-0.9737895 ,  0.3592828 ]]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 1.3049238 ,  0.6310653 ],
         [ 0.9274363 , -1.4323952 ],
         [ 0.0301825 ,  1.4744987 ]],
 
        [[ 1.4394135 ,  1.5457757 ],
         [ 0.86294836,  0.2912935 ],
         [-0.9737895 ,  0.3592828 ]]], dtype=float32)>)

# Layer Normalization

In TensorFlow, `tf.keras.layers.LayerNormalization` by default normalizes only over the last dimension of the tensor (the channels), that is, `axis=-1`. However, layer normalization for image data normalizes each sample over all of its channels and spatial positions, which corresponds to setting the axis to `[1, 2, 3]`.

$$\mu_{i} = \frac{1}{m}\sum^{m}_{j=1} x_{ij}$$
$$\sigma^{2}_{i}=\frac{1}{m}\sum^{m}_{j=1}(x_{ij}-\mu_{i})^2$$
$$\hat{x_{ij}}=\frac{x_{ij}-\mu_{i}}{\sqrt{\sigma^2_i + \epsilon}}$$



In [76]:
l = tf.keras.layers.LayerNormalization(axis=[1,2,3])(x)

means, vars = tf.nn.moments(x, axes=[1, 2, 3], keepdims=True)
lnorm_scratch = (x - means) / tf.sqrt(vars + 1e-3)

print("mean shape: {}, vars shape: {}".format(means.shape, vars.shape))

mean shape: (3, 1, 1, 1), vars shape: (3, 1, 1, 1)


In [77]:
lnorm_scratch[1, ...], l[1, ...]

(<tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 0.16816942, -0.91264147],
         [ 2.6653469 ,  0.03617269],
         [ 0.57133687, -1.2314578 ]],
 
        [[-0.6139934 , -0.82229173],
         [ 0.0084926 , -0.7341101 ],
         [ 0.11144427,  0.75353146]]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 0.16816941, -0.91264135],
         [ 2.6653466 ,  0.0361727 ],
         [ 0.57133687, -1.2314576 ]],
 
        [[-0.6139933 , -0.8222917 ],
         [ 0.00849261, -0.73411   ],
         [ 0.11144428,  0.75353146]]], dtype=float32)>)
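
To make the effect of the `axis` choice concrete, the following NumPy sketch compares the two settings discussed above. The helper `layer_norm_np` and the random input are assumptions for illustration only. It shows that `axis=-1` yields one set of statistics per spatial position, while axes `(1, 2, 3)` yield one set per sample.

```python
import numpy as np

def layer_norm_np(x, axes, eps=1e-3):
    """Normalize x over the given axes; returns the output and the means (hypothetical helper)."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps), mean

rng = np.random.default_rng(0)
x_np = rng.normal(size=(3, 2, 3, 2))  # (batch_size, H, W, Channel)

out_last, mean_last = layer_norm_np(x_np, axes=(-1,))      # default-style: channels only
out_all, mean_all = layer_norm_np(x_np, axes=(1, 2, 3))    # whole sample

print(mean_last.shape)  # (3, 2, 3, 1)
print(mean_all.shape)   # (3, 1, 1, 1)
```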

# Instance Normalization

In [21]:
inorm = tfa.layers.InstanceNormalization(center=False, scale=False, epsilon=1e-6)(x)

means, vars = tf.nn.moments(x, axes=[1, 2], keepdims=True)
inorm_scratch = (x - means) / tf.sqrt(vars + 1e-6)

print("mean shape: {}, vars shape: {}".format(means.shape, vars.shape))

mean shape: (3, 1, 1, 2), vars shape: (3, 1, 1, 2)


In [22]:
inorm[1, ...], inorm_scratch[1, :]

(<tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[-0.30603546, -0.6347712 ],
         [ 2.1050472 ,  0.77404225],
         [ 0.08323207, -1.1081547 ]],
 
        [[-1.0612316 , -0.5006187 ],
         [-0.4602071 , -0.36968517],
         [-0.3608049 ,  1.8391874 ]]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[-0.30603546, -0.6347713 ],
         [ 2.1050472 ,  0.7740423 ],
         [ 0.08323209, -1.1081547 ]],
 
        [[-1.0612317 , -0.5006187 ],
         [-0.4602071 , -0.36968526],
         [-0.3608049 ,  1.8391874 ]]], dtype=float32)>)
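
Instance normalization computes its statistics over the spatial axes (H and W) separately for every sample and every channel, as the `axes=[1, 2]` call above shows. One consequence is that the result for a sample does not depend on the other samples in the batch. The NumPy sketch below illustrates this; the helper `instance_norm_np` and the random input are assumptions for the example.

```python
import numpy as np

def instance_norm_np(x, eps=1e-6):
    """Normalize each (sample, channel) slice over H and W (hypothetical helper)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x_np = rng.normal(size=(3, 2, 3, 2))  # (batch_size, H, W, Channel)

full = instance_norm_np(x_np)          # normalize the whole batch
alone = instance_norm_np(x_np[1:2])    # normalize the second sample on its own

# The second sample is normalized identically in both cases.
print(np.allclose(full[1:2], alone))   # True
```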

# Group Normalization

Group Normalization divides the channels into groups and computes within each group the mean and the variance for normalization. 

If the parameter `groups` is set to 1, all channels form a single group and the result is identical to layer normalization over `axis=[1, 2, 3]`.

If the parameter `groups` is set to the number of channels (for image data of shape `[batch_size, H, W, C]` that is `C`, which is 2 for the `x` used here), each channel forms its own group and the result is identical to instance normalization.
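
These two special cases can be checked with a from-scratch NumPy sketch. The function `group_norm_np` below is a hypothetical helper written for this illustration, not the `tfa` implementation; it assumes channels-last input and contiguous channel groups. A 4-channel input is used (an assumption, different from `x`) so that an intermediate setting, `groups=2`, can be shown as well.

```python
import numpy as np

def group_norm_np(x, groups, eps=1e-6):
    """Group normalization for channels-last (N, H, W, C) input (hypothetical helper)."""
    n, h, w, c = x.shape
    if c % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    # Split the channel axis into (groups, channels_per_group).
    grouped = x.reshape(n, h, w, groups, c // groups)
    # Statistics per sample and per group: reduce over H, W and channels in the group.
    mean = grouped.mean(axis=(1, 2, 4), keepdims=True)
    var = grouped.var(axis=(1, 2, 4), keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(x.shape)

def normalize_over(x, axes, eps=1e-6):
    """Plain normalization over the given axes, used as the reference."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x4 = rng.normal(size=(3, 2, 3, 4))  # 4 channels, illustrative input

gn1 = group_norm_np(x4, groups=1)   # one group: same as layer normalization
gn4 = group_norm_np(x4, groups=4)   # one group per channel: same as instance normalization
gn2 = group_norm_np(x4, groups=2)   # two groups of two channels each

print(np.allclose(gn1, normalize_over(x4, axes=(1, 2, 3))))  # True
print(np.allclose(gn4, normalize_over(x4, axes=(1, 2))))     # True
```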

## Statistics over Two Input Dimensions, H and W (`groups=2`)

In [69]:
gnorm = tfa.layers.GroupNormalization(groups=2, epsilon=1e-6, center=False, scale=False)(x)

means, vars = tf.nn.moments(x, axes=[1, 2], keepdims=True)
gnorm_scratch = (x - means) / tf.sqrt(vars + 1e-6)

print("mean shape: {}, vars shape: {}".format(means.shape, vars.shape))

mean shape: (3, 1, 1, 2), vars shape: (3, 1, 1, 2)


In [70]:
gnorm[1, ...], gnorm_scratch[1, ...]

(<tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[-0.30603546, -0.6347712 ],
         [ 2.1050472 ,  0.77404225],
         [ 0.08323207, -1.1081547 ]],
 
        [[-1.0612316 , -0.5006187 ],
         [-0.4602071 , -0.36968517],
         [-0.3608049 ,  1.8391874 ]]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[-0.30603546, -0.6347713 ],
         [ 2.1050472 ,  0.7740423 ],
         [ 0.08323209, -1.1081547 ]],
 
        [[-1.0612317 , -0.5006187 ],
         [-0.4602071 , -0.36968526],
         [-0.3608049 ,  1.8391874 ]]], dtype=float32)>)

## Statistics over Three Input Dimensions, H, W and C (`groups=1`)

In [71]:
gnorm = tfa.layers.GroupNormalization(groups=1, epsilon=1e-6, center=False, scale=False)(x)

means, vars = tf.nn.moments(x, axes=[1, 2, 3], keepdims=True)
gnorm_scratch = (x - means) / tf.sqrt(vars + 1e-6)

print("mean shape: {}, vars shape: {}".format(means.shape, vars.shape))

mean shape: (3, 1, 1, 1), vars shape: (3, 1, 1, 1)


In [72]:
gnorm[1, ...], gnorm_scratch[1, ...]

(<tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 0.16829652, -0.9133313 ],
         [ 2.6673615 ,  0.03620002],
         [ 0.57176876, -1.2323886 ]],
 
        [[-0.6144575 , -0.8229133 ],
         [ 0.008499  , -0.73466504],
         [ 0.1115285 ,  0.75410104]]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3, 2), dtype=float32, numpy=
 array([[[ 0.16829653, -0.9133313 ],
         [ 2.6673615 ,  0.03620003],
         [ 0.57176876, -1.2323886 ]],
 
        [[-0.6144575 , -0.8229133 ],
         [ 0.00849902, -0.734665  ],
         [ 0.11152851,  0.75410104]]], dtype=float32)>)