In [21]:
import numpy as np

First, let's look at the definition of batch normalization (BN).
Recommended video: https://www.youtube.com/watch?v=BZh1ltr5Rkg
$
\begin{array}{l}
\textbf{Input:} \text{ Values of } x \text{ over a mini-batch: } \mathcal{B} = \{x_{1..m}\}; \\
\quad \quad \quad \text{ Parameters to be learned: } \gamma, \beta \\
\textbf{Output:} \{y_i = \text{BN}_{\gamma,\beta}(x_i)\} \\
\\
\mu_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i=1}^{m}x_i \quad \quad \quad \quad \quad \quad \quad \quad  \quad \quad \text{// mini-batch mean} \\
\sigma_\mathcal{B}^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_\mathcal{B})^2 \quad \quad \quad \quad \quad \quad \text{// mini-batch variance} \\
\hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \text{// normalize} \\
y_i \leftarrow \gamma\hat{x}_i + \beta \equiv \text{BN}_{\gamma,\beta}(x_i) \quad \quad \quad \quad \quad \quad \text{// scale and shift}
\end{array}
$

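BN helps to:
1. improve gradient flow
Normalizing the inputs of each layer reduces internal covariate shift (changes to the parameters of lower layers can cause large shifts in the input distribution of later layers).
2. allow higher learning rates
3. reduce strong dependence on initialization
4. regularization
Points to keep in mind:
1. At test time, the mean and variance are no longer computed from the current batch; they are empirical estimates accumulated during training.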
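The train/test distinction above can be sketched in NumPy. This is only an illustration under stated assumptions, not the notebook's own code: the class name `RunningBatchNorm`, the `momentum` value of 0.1, and the exponential-moving-average update are all hypothetical choices (deep learning frameworks use a similar scheme, but the details vary).

```python
import numpy as np

class RunningBatchNorm:
    """Hypothetical helper: BN that keeps running statistics for test time."""

    def __init__(self, features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(features)         # learnable scale (fixed here)
        self.beta = np.zeros(features)         # learnable shift (fixed here)
        self.running_mean = np.zeros(features)
        self.running_var = np.ones(features)
        self.momentum = momentum               # assumed value, for illustration
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Training: use the statistics of the current mini-batch
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # ...and fold them into the running estimates (moving average)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Test: reuse the estimates accumulated during training
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = RunningBatchNorm(features=4)
for _ in range(200):
    bn(rng.normal(loc=3.0, scale=2.0, size=(64, 4)), training=True)

# A single test sample can still be normalized, because no batch statistics are needed
single = np.array([[3.0, 3.0, 3.0, 3.0]])
out_eval = bn(single, training=False)
print(bn.running_mean, bn.running_var, out_eval)
```

With batch statistics, a batch of one sample would have zero variance and the output would carry no information. The running estimates avoid that problem.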

In [22]:
from torch.nn import BatchNorm2d
np.random.seed(2025)
# Mini-batch of inputs: one row per sample, one column per feature
batch_size = 100
features = 20
x = np.random.randn(batch_size, features)

# gamma and beta are the learnable scale and shift parameters
def batch_normalization(x, gamma, beta, eps=1e-9):
    batch_mean = np.mean(x, axis=0)  # per-feature mean over the batch, shape (features,)
    batch_var = np.var(x, axis=0)    # per-feature (biased) variance, shape (features,)
    print(batch_mean.shape, batch_var.shape)
    print(batch_mean[0], batch_var[0])
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    return gamma * x_hat + beta

y = batch_normalization(x, 1, 0)
print(y.shape, np.mean(y), np.var(y))

(20,) (20,)
-0.18335136892609022 0.9809683438883487
(100, 20) 3.552713678800501e-18 0.9999999989560583


What about layer normalization (LN)?
$
\begin{array}{l}
\textbf{Input:} \text{ Feature vector of a single sample: } \mathcal{H} = \{x_{1..H}\}; \\
\quad \quad \quad \text{ Parameters to be learned: } \gamma, \beta \\
\textbf{Output:} \{y_i = \text{LN}_{\gamma,\beta}(x_i)\} \\
\\
\mu_\mathcal{H} \leftarrow \frac{1}{H}\sum_{i=1}^{H}x_i \quad \quad \quad \quad \quad \quad \quad \quad  \quad \quad \text{// mean over the feature dimension} \\
\sigma_\mathcal{H}^2 \leftarrow \frac{1}{H}\sum_{i=1}^{H}(x_i - \mu_\mathcal{H})^2 \quad \quad \quad \quad \quad \quad \text{// variance over the feature dimension} \\
\hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{H}}{\sqrt{\sigma_\mathcal{H}^2 + \epsilon}} \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \text{// normalize} \\
y_i \leftarrow \gamma\hat{x}_i + \beta \equiv \text{LN}_{\gamma,\beta}(x_i) \quad \quad \quad \quad \quad \quad \text{// scale and shift}
\end{array}
$

In [27]:
x = np.random.randn(batch_size, features)

def layer_normalization(x, gamma, beta, eps=1e-9):
    # Statistics are taken over the feature axis: one mean/variance pair per sample
    layer_mean = np.mean(x, axis=1, keepdims=True)  # shape (batch_size, 1)
    layer_var = np.var(x, axis=1, keepdims=True)    # shape (batch_size, 1)
    print(layer_mean.shape, layer_var.shape)
    print(layer_mean[0], layer_var[0])
    x_hat = (x - layer_mean) / np.sqrt(layer_var + eps)
    return gamma * x_hat + beta

# gamma and beta hold one scale/shift value per feature
y = layer_normalization(x, np.ones((1, features)), np.zeros((1, features)))
print(y.shape, np.mean(y), np.var(y))

(100, 1) (100, 1)
[0.15872828] [1.11482916]
(100, 20) -7.327471962526034e-18 0.9999999988190323


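In code, the only visible difference is that axis changes from 0 to 1, but look closely at what each one does:
axis = 0 normalizes each feature using that feature's values across all samples in the batch.
axis = 1 normalizes each sample using the mean and variance across all of that sample's features.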
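To make the axis difference concrete, the sketch below normalizes the same array along each axis and checks which statistics end up standardized. The helper name `normalize` and the input distribution are introduced here purely for illustration and are not part of the notebook.

```python
import numpy as np

def normalize(x, axis, eps=1e-9):
    # Hypothetical helper: standardize x along the given axis
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(100, 20))  # (batch_size, features)

bn_out = normalize(x, axis=0)  # BN-style: statistics per feature column
ln_out = normalize(x, axis=1)  # LN-style: statistics per sample row

# BN-style output: every feature column has (near) zero mean
print(np.allclose(bn_out.mean(axis=0), 0))
# LN-style output: every sample row has (near) zero mean
print(np.allclose(ln_out.mean(axis=1), 0))
# The reverse does not hold in general: rows of the BN-style output are not zero-mean
print(np.allclose(bn_out.mean(axis=1), 0))
```

The global mean and variance printed in the cells above look the same for both methods, so checking per-column and per-row statistics is what actually distinguishes them.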

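BN vs LN
BN suits CNNs, feed-forward networks (FNN), and models with a fixed input size trained with fairly large batches -> larger batches give reliable estimates of the mean and variance.
LN suits RNNs, Transformers, and models that process variable-length sequences with small batches -> it uses statistics from within a single sample, so it is unaffected by batch size.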
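The batch-size independence of LN can be checked with a small experiment. This is a sketch under assumptions: the helpers `batch_norm` and `layer_norm` are simplified versions without gamma and beta, and the two batches are made up for illustration. The same sample is placed into two different batches and its normalized output is compared.

```python
import numpy as np

def batch_norm(x, eps=1e-9):
    # Statistics over the batch axis (simplified: no gamma/beta)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-9):
    # Statistics over the feature axis (simplified: no gamma/beta)
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 20))

# The same sample, surrounded by two very different sets of batch mates
batch_a = np.vstack([sample, rng.normal(size=(3, 20))])
batch_b = np.vstack([sample, rng.normal(loc=10.0, size=(3, 20))])

ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
print("LN output unchanged across batches:", ln_same)
print("BN output unchanged across batches:", bn_same)
```

LN gives the sample the same output in both batches because it only reads that sample's own features. The BN output shifts with the other samples in the batch, which is why small or unrepresentative batches make its statistics unreliable.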
