### Why Normalization Is Needed

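1. Some activation functions, such as sigmoid, saturate when the input falls below or rises above a certain range, and the gradient vanishes there. Standardizing the data keeps inputs in the range where gradients are useful.
2. When different variables have very different ranges, the optimizer struggles to find a good path toward the minimum.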
   
   ![batch_norm](./images/batch_norm.png)

In general, we want the inputs used to train a neural network to be distributed symmetrically around 0.

In [None]:
def normalize(x, mean, std):
    x = x - mean
    x = x / std
    return x
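
As a rough usage sketch, the mean and standard deviation can be estimated per feature from the data itself. The `data` array and its values are made up for illustration, NumPy is assumed to be available, and `normalize` simply mirrors the helper above.

```python
import numpy as np


def normalize(x, mean, std):
    # Shift to zero mean, then scale to unit standard deviation.
    return (x - mean) / std


# Hypothetical data: two features whose ranges differ by three orders of magnitude.
data = np.array([[1.0, 1000.0],
                 [2.0, 2000.0],
                 [3.0, 3000.0]])

mean = data.mean(axis=0)  # per-feature mean
std = data.std(axis=0)    # per-feature standard deviation
normalized = normalize(data, mean, std)

print(normalized.mean(axis=0))  # close to 0 for each feature
print(normalized.std(axis=0))   # close to 1 for each feature
```

After this step both features live on the same scale, which is the situation point 2 above asks for.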

### Batch Norm

The key feature of Batch Norm is that it is updated dynamically: on every backward pass, $\gamma$ and $\beta$ are updated.

![batch norm](./images/batch_norm_all.png)

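Focus on $\gamma$ and $\beta$ in the formula. In the figure below, $\mu$ and $\sigma$ are statistics computed from the data and are not updated by gradients. $\gamma$ and $\beta$ are parameters: they receive gradients during backpropagation, so training learns the scale and shift that best suit the normalized data.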

![pipeline](./images/bm_pipeline.png)
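
The training-mode forward pass described above can be sketched in plain NumPy. This is a minimal illustration, not the Keras implementation: the function name `batch_norm_train` is hypothetical, and `eps=1e-3` is chosen to match the default `epsilon` of `layers.BatchNormalization`.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-3):
    # Compute statistics over every axis except the last (channel) axis.
    axes = tuple(range(x.ndim - 1))
    mu = x.mean(axis=axes)   # per-channel batch mean (a statistic, not trained)
    var = x.var(axis=axes)   # per-channel batch variance (a statistic, not trained)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta are the trainable scale and shift.
    return gamma * x_hat + beta, mu, var


rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.5, size=(2, 4, 4, 3))  # 2 images, 4x4, 3 channels
gamma = np.ones(3)   # initial scale, same as the Keras default
beta = np.zeros(3)   # initial shift, same as the Keras default

out, mu, var = batch_norm_train(x, gamma, beta)
print(out.mean(axis=(0, 1, 2)))  # close to 0 per channel
print(out.std(axis=(0, 1, 2)))   # close to 1 per channel
```

With $\gamma = 1$ and $\beta = 0$ the output is just the standardized input; training then moves $\gamma$ and $\beta$ away from those values if a different scale or shift lowers the loss.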

In [1]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers, optimizers


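# 2 images of size 4x4 with 3 channels,
# sampled from a normal distribution with mean 1 and stddev 0.5
x = tf.random.normal([2,4,4,3], mean=1.,stddev=0.5)
# center controls the offset (beta); scale controls the scaling (gamma).
# beta and gamma are not changed by the forward pass; they are updated only
# when gradients from backpropagation are applied.
# trainable marks gamma and beta as trainable weights. Training vs. test mode is
# selected by the training argument of the call; in training mode the
# moving_mean and moving_variance statistics are updated.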
net = layers.BatchNormalization(axis=-1, center=True, scale=True,
                                trainable=True)

out = net(x)
# variables contains gamma, beta, moving_mean and moving_variance;
# the last two are the running global mean and variance
# trainable_variables contains only gamma and beta
print('forward in test mode:', net.variables)


out = net(x, training=True)
print('forward in train mode(1 step):', net.variables)

for i in range(100):
    out = net(x, training=True)
print('forward in train mode(100 steps):', net.variables)


# learning_rate replaces the deprecated lr argument
optimizer = optimizers.SGD(learning_rate=1e-2)
for i in range(10):
    with tf.GradientTape() as tape:
        out = net(x, training=True)
        loss = tf.reduce_mean(tf.pow(out,2)) - 1

    grads = tape.gradient(loss, net.trainable_variables)
    optimizer.apply_gradients(zip(grads, net.trainable_variables))
print('backward(10 steps):', net.variables)

forward in test mode: [<tf.Variable 'batch_normalization/gamma:0' shape=(3,) dtype=float32, numpy=array([1., 1., 1.], dtype=float32)>, <tf.Variable 'batch_normalization/beta:0' shape=(3,) dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>, <tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>, <tf.Variable 'batch_normalization/moving_variance:0' shape=(3,) dtype=float32, numpy=array([1., 1., 1.], dtype=float32)>]
forward in train mode(1 step): [<tf.Variable 'batch_normalization/gamma:0' shape=(3,) dtype=float32, numpy=array([1., 1., 1.], dtype=float32)>, <tf.Variable 'batch_normalization/beta:0' shape=(3,) dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>, <tf.Variable 'batch_normalization/moving_mean:0' shape=(3,) dtype=float32, numpy=array([0.01052996, 0.00967027, 0.01049235], dtype=float32)>, <tf.Variable 'batch_normalization/moving_variance:0' shape=(3,) dtype=float32, numpy=array([0.9926309 , 0.99292   , 0
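
The small `moving_mean` values after one training step follow from the exponential moving average that the layer keeps. The sketch below reproduces the arithmetic under two assumptions: `momentum=0.99`, which is the default of `layers.BatchNormalization`, and an idealized batch mean of exactly 1.0 (the real batch mean of the sampled `x` is only close to 1). The helper `update_moving` is hypothetical.

```python
def update_moving(moving, batch_stat, momentum=0.99):
    # Exponential moving average: keep most of the old value,
    # blend in a small fraction of the current batch statistic.
    return momentum * moving + (1.0 - momentum) * batch_stat


batch_mean = 1.0  # idealized; the sampled batch mean is only approximately 1

# One training step starting from the initial moving_mean of 0.
one_step = update_moving(0.0, batch_mean)
print(one_step)  # about 0.01

# One hundred training steps on the same batch.
moving_mean = 0.0
for _ in range(100):
    moving_mean = update_moving(moving_mean, batch_mean)
print(moving_mean)  # equals 1 - 0.99**100, roughly 0.63
```

This is consistent with the logged `moving_mean` of roughly 0.0105 after one step, and it shows why the running statistics need many steps before they approach the true mean.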

