## 1 Background

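Before training a machine learning model, the data is usually preprocessed (e.g., by subtracting the mean and dividing by the standard deviation). This reduces differences between data distributions and improves the model's ability to generalize.  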
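
As a minimal sketch of the preprocessing described above, the snippet below standardizes each feature of a toy dataset with NumPy. The array `X`, its shape, and the per-feature scales are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 100 samples, 3 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(100, 3))

# Standardize each feature: subtract its mean, divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0), X_std.std(axis=0))
```

After this step every feature has roughly zero mean and unit standard deviation, regardless of its original scale.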

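When training a deep network, however, the model can still be hard to train even after the input data has been preprocessed. The network parameters change continually during training, so the distribution of the data entering each hidden layer changes with them. At every iteration the model has to adapt to a new distribution, which slows convergence. This change in the distribution of hidden-layer inputs, caused by parameter updates in the preceding layers, is called Internal Covariate Shift. Batch Normalization (BN) was proposed to address this problem and speed up convergence.  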
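
The effect can be illustrated with a small sketch. The inputs below are fixed and already standardized, yet the distribution of the hidden-layer pre-activations changes as soon as the first-layer weights change. The weight matrix `W` and the "update" (a simple doubling) are assumptions chosen only to make the shift easy to see. They are not a real training step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))   # fixed, standardized inputs
W = rng.standard_normal((4, 3))      # hypothetical first-layer weights

h_before = X @ W                     # hidden-layer input before the update

# Crude stand-in for the cumulative effect of parameter updates.
W_updated = 2.0 * W
h_after = X @ W_updated              # hidden-layer input after the update

print("std before:", h_before.std(axis=0))
print("std after: ", h_after.std(axis=0))
```

The next layer sees inputs whose spread has doubled even though `X` never changed, which is the kind of drift BN is meant to remove.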

## 2 Method  

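Before BN was proposed, preprocessing was generally applied only to the data at the input layer (e.g., in image classification, subtracting the per-channel means of the three color channels computed on ImageNet).

BN instead normalizes the input of every layer during training.  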


For a single neuron in a layer, BN performs the following steps on each mini-batch of size $m$:  

1 Compute the mean and variance of the inputs $x_i$ over the mini-batch: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$ , $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$  

2 Standardize the data: $\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  

3 Scale and shift (reconstruct) the data: $y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$  

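```python
import numpy as np

# Example values for illustration: x is an (N, D) mini-batch, gamma and beta are (D,).
x = np.random.randn(4, 3)
gamma, beta, eps = np.ones(3), np.zeros(3), 1e-5

sample_mean = np.mean(x, axis=0)  # per-feature mean of the mini-batch x
sample_var = np.var(x, axis=0)    # per-feature variance of the mini-batch x
x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)  # standardize
out = gamma * x_hat + beta  # scale and shift (reconstruction)
```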

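This raises a question. The standardization step first brings the data to zero mean and unit variance, and the reconstruction step then shifts and scales it again, pinning it to some other mean and variance. If standardization already yields a standard distribution, why does the next step change it again?  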

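The reason is to preserve the feature distribution learned by the preceding layers, so that normalization does not reduce the model's expressive power.  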

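The standardization step transforms the input to zero mean and unit variance. This interferes with the features learned by the preceding layers: whatever they have learned, it is simply forced into a prescribed range at this point.  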

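Therefore, to preserve the data distribution learned by the preceding layers, the algorithm shifts the standardized data by $\beta$ and scales it by $\gamma$, so that the input range of each neuron is tailored to that neuron. Both $\beta$ and $\gamma$ are learnable, which lets the normalization layer learn how to respect what the lower layers have learned.  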
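
One way to see that the scale and shift can preserve what earlier layers produced: if $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, the reconstruction exactly undoes the standardization. The sketch below checks this numerically. The mini-batch `x` is random toy data, and in practice $\gamma$ and $\beta$ are learned rather than set by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # hypothetical mini-batch (N=8, D=4)
eps = 1e-5

mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)            # standardized data

# With this particular choice, the scale and shift recover the original input.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))
```

The identity mapping is therefore within reach of the layer, so adding BN does not take away anything the network could represent without it.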

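Together, the two transformations make full use of what the preceding layers have learned while preserving the neuron's nonlinear expressive power. A neuron can apply a nonlinear transformation to arbitrary data because its activation function has both saturated and non-saturated regions. Standardization alone maps almost all of the data into the non-saturated (approximately linear) region of the activation function, so only its linear behavior is used, which reduces the expressive power of the network. The subsequent scale and shift can move the data from the linear region back into the nonlinear region and restore the model's expressive power.  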
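
A rough numerical illustration of this argument, using a sigmoid activation: the sketch counts how many pre-activations land where the sigmoid is nearly flat. The saturation threshold (gradient below 0.05) and the values $\gamma = 4$, $\beta = 1$ are arbitrary assumptions picked for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saturated_fraction(z, threshold=0.05):
    # Fraction of inputs whose sigmoid gradient s * (1 - s) falls below the threshold.
    s = sigmoid(z)
    return np.mean(s * (1.0 - s) < threshold)

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(10000)                      # standardized pre-activations

frac_standardized = saturated_fraction(x_hat)           # gamma = 1, beta = 0
frac_rescaled = saturated_fraction(4.0 * x_hat + 1.0)   # hypothetical gamma = 4, beta = 1

print(frac_standardized, frac_rescaled)
```

With purely standardized inputs almost nothing reaches the saturated region, while a learned scale and shift can push a substantial share of the inputs there when that is useful.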


```python
import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
  """
  Forward pass for batch normalization.

  During training, the sample mean and variance are computed from the
  mini-batch and used to standardize the input x. A moving average of these
  statistics is also maintained as an estimate of the mean and variance of
  the whole dataset, and is used to standardize data at test time:

  running_mean = momentum * running_mean + (1 - momentum) * sample_mean
  running_var = momentum * running_var + (1 - momentum) * sample_var

  Note: the original paper does not use a moving average to obtain the
  test-time mean and variance.

  Input:
  - x: Data of shape (N, D)
  - gamma: Scale parameter of shape (D,)
  - beta: Shift parameter of shape (D,)
  - bn_param: Dictionary with the following keys:
    - mode: 'train' or 'test'; required
    - eps: Constant for numeric stability
    - momentum: Constant for running mean / variance.
    - running_mean: Array of shape (D,) giving running mean of features
    - running_var: Array of shape (D,) giving running variance of features

  Returns a tuple of:
  - out: of shape (N, D)
  - cache: Values needed for the backward pass (None in test mode)
  """
  mode = bn_param['mode']
  eps = bn_param.get('eps', 1e-5)
  momentum = bn_param.get('momentum', 0.9)

  N, D = x.shape
  running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
  running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

  out, cache = None, None
  if mode == 'train':
    # Standardize with mini-batch statistics, then scale and shift.
    sample_mean = np.mean(x, axis=0)
    sample_var = np.var(x, axis=0)
    x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
    out = gamma * x_hat + beta
    cache = (gamma, x, sample_mean, sample_var, eps, x_hat)
    # Update the moving averages used at test time.
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var
  elif mode == 'test':
    # Standardize with the running statistics, folded into one scale and shift.
    scale = gamma / np.sqrt(running_var + eps)
    out = x * scale + (beta - running_mean * scale)
  else:
    raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

  # Store the updated running statistics back into bn_param
  bn_param['running_mean'] = running_mean
  bn_param['running_var'] = running_var

  return out, cache
```
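
To make the train/test distinction concrete, here is a self-contained sketch of the moving-average update followed by test-time normalization. It does not call the function above. The data distribution, batch size, and number of steps are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, momentum, eps = 3, 0.9, 1e-5
running_mean = np.zeros(D)
running_var = np.zeros(D)

# Simulated training: every step sees a mini-batch from the same distribution.
for _ in range(200):
    batch = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.5], size=(64, D))
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * batch.var(axis=0)

# Test time: a single sample is normalized with the accumulated statistics,
# so the result does not depend on whatever else is in the batch.
gamma, beta = np.ones(D), np.zeros(D)
x_test = np.array([[1.0, -2.0, 0.5]])
scale = gamma / np.sqrt(running_var + eps)
out = x_test * scale + (beta - running_mean * scale)

print(running_mean, running_var, out)
```

Because the running statistics start at zero, the first few updates are biased toward zero. After enough steps the estimates settle close to the true per-feature mean and variance.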

```python
def batchnorm_backward(dout, cache):
  """
  Backward pass for batch normalization.
  
  Inputs:
  - dout: Upstream derivatives, of shape (N, D)
  - cache: Values cached by the forward pass
  
  Returns a tuple of:
  - dx: Gradient with respect to inputs x, of shape (N, D)
  - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
  - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
  """
  dx, dgamma, dbeta = None, None, None
  gamma, x, u_b, sigma_squared_b, eps, x_hat = cache
  N = x.shape[0]

  # Walk the computational graph backwards, one node at a time.
  dx_1 = gamma * dout                                   # gradient w.r.t. x_hat
  dx_2_b = np.sum((x - u_b) * dx_1, axis=0)             # w.r.t. 1 / sqrt(var + eps)
  dx_2_a = ((sigma_squared_b + eps) ** -0.5) * dx_1     # w.r.t. (x - mean), direct path
  dx_3_b = (-0.5) * ((sigma_squared_b + eps) ** -1.5) * dx_2_b  # w.r.t. var
  dx_4_b = dx_3_b * 1
  dx_5_b = np.ones_like(x) / N * dx_4_b                 # w.r.t. (x - mean) ** 2
  dx_6_b = 2 * (x - u_b) * dx_5_b                       # w.r.t. (x - mean), variance path
  dx_7_a = dx_6_b * 1 + dx_2_a * 1                      # total gradient w.r.t. (x - mean)
  dx_7_b = dx_6_b * 1 + dx_2_a * 1
  dx_8_b = -1 * np.sum(dx_7_b, axis=0)                  # w.r.t. mean
  dx_9_b = np.ones_like(x) / N * dx_8_b                 # w.r.t. x, through the mean
  dx_10 = dx_9_b + dx_7_a

  dgamma = np.sum(x_hat * dout, axis=0)
  dbeta = np.sum(dout, axis=0)
  dx = dx_10
  return dx, dgamma, dbeta
```


```python
def batchnorm_backward_alt(dout, cache):
  """
  Alternative backward pass for batch normalization.
  
  Note: This implementation should expect to receive the same cache variable
  as batchnorm_backward, but might not use all of the values in the cache.
  
  Inputs / outputs: Same as batchnorm_backward
  """
  dx, dgamma, dbeta = None, None, None
  gamma, x, sample_mean, sample_var, eps, x_hat = cache
  N = x.shape[0]
  dx_hat = dout * gamma
  dvar = np.sum(dx_hat * (x - sample_mean) * -0.5 * np.power(sample_var + eps, -1.5), axis=0)
  dmean = np.sum(dx_hat * -1 / np.sqrt(sample_var + eps), axis=0) + dvar * np.mean(-2 * (x - sample_mean), axis=0)
  dx = 1 / np.sqrt(sample_var + eps) * dx_hat + dvar * 2.0 / N * (x - sample_mean) + 1.0 / N * dmean
  dgamma = np.sum(x_hat * dout, axis=0)
  dbeta = np.sum(dout, axis=0)
  return dx, dgamma, dbeta
```
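
A common way to validate a hand-written backward pass is to compare it against a numerical gradient. The sketch below is self-contained: `bn_forward`, `bn_backward`, and `numeric_grad` are compact helpers written for this example (they are not the functions above), and `bn_backward` uses a simplified closed form of the same chain rule. Step size and tolerance are assumptions.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x_hat, gamma, var, eps)

def bn_backward(dout, cache):
    x_hat, gamma, var, eps = cache
    dx_hat = dout * gamma
    # Closed form obtained by simplifying the chain rule through mean and variance.
    dx = (dx_hat - dx_hat.mean(axis=0)
          - x_hat * (dx_hat * x_hat).mean(axis=0)) / np.sqrt(var + eps)
    dgamma = (dout * x_hat).sum(axis=0)
    dbeta = dout.sum(axis=0)
    return dx, dgamma, dbeta

def numeric_grad(f, x, dout, h=1e-5):
    # Central differences of sum(f(x) * dout) with respect to each entry of x.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5)) * 5 + 12
gamma = rng.standard_normal(5)
beta = rng.standard_normal(5)
dout = rng.standard_normal((4, 5))

out, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)
dx_num = numeric_grad(lambda v: bn_forward(v, gamma, beta)[0], x, dout)
max_err = np.max(np.abs(dx - dx_num))
print("max abs difference:", max_err)
```

If the analytic and numerical gradients disagree by more than a small tolerance, the backward pass most likely contains a bug.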


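In a convolutional neural network, a convolution produces a set of feature maps rather than a layer of individual neurons. If the mini-batch size is m, the input to a given layer can be represented as a 4-D array of shape (m, f, p, q), where m is the mini-batch size, f is the number of feature maps, and p and q are the width and height of each feature map. In a CNN each feature map can be treated as a single feature (one neuron), so when Batch Normalization is applied the effective mini-batch size becomes m\*p\*q, and each feature map has only one pair of learnable parameters: $\gamma , \beta$. This amounts to computing the mean and variance over all neurons of one feature map across all samples in the mini-batch, and then normalizing the neurons of that feature map with them.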
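
The per-feature-map statistics can be sketched in two equivalent ways: reducing directly over the batch and spatial axes, or flattening to a 2-D array with one column per feature map. The shapes below are toy values, and the `(N, C, H, W)` layout is an assumption matching the code in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 2, 3, 4, 5
x = rng.standard_normal((N, C, H, W))

# One mean and variance per channel, each computed over N * H * W values.
mean_direct = x.mean(axis=(0, 2, 3))
var_direct = x.var(axis=(0, 2, 3))

# Same statistics after moving channels last and flattening to (N*H*W, C).
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
mean_flat = x_flat.mean(axis=0)
var_flat = x_flat.var(axis=0)

# Normalize with broadcasting; gamma and beta hold one entry per channel.
eps = 1e-5
gamma, beta = np.ones(C), np.zeros(C)
x_hat = (x - mean_direct[None, :, None, None]) / np.sqrt(var_direct[None, :, None, None] + eps)
out = gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

print(x_flat.shape, np.allclose(mean_direct, mean_flat), np.allclose(var_direct, var_flat))
```

The flattening route has the advantage that an existing 2-D batch normalization routine can be reused unchanged.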

```python
def spatial_batchnorm_forward(x, gamma, beta, bn_param):
  """
  Computes the forward pass for spatial batch normalization.
  
  Inputs:
  - x: Input data of shape (N, C, H, W)
  - gamma: Scale parameter, of shape (C,)
  - beta: Shift parameter, of shape (C,)
  - bn_param: Dictionary with the following keys:
    - mode: 'train' or 'test'; required
    - eps: Constant for numeric stability
    - momentum: Constant for running mean / variance. momentum=0 means that
      old information is discarded completely at every time step, while
      momentum=1 means that new information is never incorporated. The
      default of momentum=0.9 should work well in most situations.
    - running_mean: Array of shape (C,) giving running mean of features
    - running_var: Array of shape (C,) giving running variance of features
    
  Returns a tuple of:
  - out: Output data, of shape (N, C, H, W)
  - cache: Values needed for the backward pass
  """
  N, C, H, W = x.shape
  # Move channels last and flatten so that each column holds one feature map.
  x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
  out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
  # Restore the original (N, C, H, W) layout.
  out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
  return out, cache
```


```python
def spatial_batchnorm_backward(dout, cache):
  """
  Computes the backward pass for spatial batch normalization.
  
  Inputs:
  - dout: Upstream derivatives, of shape (N, C, H, W)
  - cache: Values from the forward pass
  
  Returns a tuple of:
  - dx: Gradient with respect to inputs, of shape (N, C, H, W)
  - dgamma: Gradient with respect to scale parameter, of shape (C,)
  - dbeta: Gradient with respect to shift parameter, of shape (C,)
  """
  N, C, H, W = dout.shape
  # Same flattening as in the forward pass: (N, C, H, W) -> (N*H*W, C).
  dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
  dx_flat, dgamma, dbeta = batchnorm_backward(dout_flat, cache)
  dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
  return dx, dgamma, dbeta
```

## 3 References  

1 [Normalization in Deep Learning Explained: More Than Just BN (详解深度学习中的Normalization，不只是BN)](https://zhuanlan.zhihu.com/p/33173246)  
2 [Batch Normalization Study Notes (Batch Normalization 学习笔记)](http://blog.csdn.net/hjimce/article/details/50866313)  
3 [CS231n](http://cs231n.stanford.edu/)