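Welcome to CS7353, [Designing and Understanding Deep Neural Networks](https://cs7353.netlify.app/), at Shanghai Jiao Tong University!

This is the first course assignment; see the [course website](https://cs7353.netlify.app/) for the schedule. Submit the assignment on Canvas and mind the deadline. Upload a single ipynb file only, and be sure to keep the output of every cell.

If you have any questions, please contact the [teaching assistants](https://cs7353.netlify.app/staff/).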

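# 1 Introduction

In this assignment you will practice writing forward-propagation and backpropagation code. This assignment does not require a GPU.

The goals of this assignment are:

- Understand and be able to implement (vectorized) **backpropagation**
- Implement **max pooling** for deep networks
- Implement **batch normalization** for deep networks
- Implement **convolution** for deep networks

Note: strictly follow the rules below. Any violation results in a score of zero for this assignment:
- Write code only at the marked TODO locations and do not change any other code;
- Do not import any other Python packages (additional note: implement everything by hand instead of calling ready-made packaged functions to save effort);
- Be sure to keep the output of every cell.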

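# 2 Preparation: Helper Functions

Below are the helper functions used to check your results. **You do not need to change any code here**; just run the corresponding cell.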

In [None]:
import numpy as np
import random
seed = 7
random.seed(seed)
np.random.seed(seed)

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2


def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))


def eval_numerical_gradient_array(f, x, df, h=1e-5):
    """
    Evaluate a numeric gradient for a function that accepts a numpy
    array and returns a numpy array.
    """
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        oldval = x[ix]
        x[ix] = oldval + h
        pos = f(x).copy()
        x[ix] = oldval - h
        neg = f(x).copy()
        x[ix] = oldval

        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad
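
The helper above estimates each gradient entry with a centered difference. As a minimal sketch of that idea on a scalar function (the function `centered_difference` and the test function `x**3` are illustrative assumptions, not part of the assignment code):

```python
def centered_difference(f, x, h=1e-5):
    """Centered-difference estimate of df/dx for a scalar function f."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Hypothetical test function f(x) = x**3, whose exact derivative is 3 * x**2.
x0 = 2.0
numeric = centered_difference(lambda x: x ** 3, x0)
analytic = 3 * x0 ** 2
print(abs(numeric - analytic))
```

The centered form is used because its truncation error shrinks with `h**2`, whereas a one-sided difference only shrinks with `h`.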

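# 3 Convolution

Convolution is a key operation in deep learning and is used most heavily in convolutional neural networks, where it extracts features from the input data.

The basic idea is to slide a small window, called a convolution kernel (or filter), over the input and compute a weighted sum of each local region. The kernel slides across the whole input and produces one output value at each position. By adjusting the kernel parameters, the network can learn different features, such as edges, textures, or higher-level abstract features.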
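The sliding-window computation described above can be sketched on a single channel as follows. The 4x4 input and the 2x2 kernel are hypothetical values chosen for illustration, with stride 1 and no padding:

```python
import numpy as np

# Hypothetical single-channel input and kernel, for illustration only.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

out_h = image.shape[0] - kernel.shape[0] + 1  # stride 1, no padding
out_w = image.shape[1] - kernel.shape[1] + 1
response = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        # Weighted sum of the local region currently under the kernel.
        window = image[i:i + kernel.shape[0], j:j + kernel.shape[1]]
        response[i, j] = np.sum(window * kernel)
print(response)
```

Because this input increases linearly, the diagonal-difference kernel gives the same response at every position, which shows that one set of weights is reused across the whole input.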

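The strength of convolution is that it captures local features of the input effectively, and weight sharing lets the network learn similar features at different positions with the same parameters. Convolution can also reduce dimensionality, and because one small kernel is reused at every position, it needs far fewer parameters than a fully connected layer on the same input, which improves computational efficiency.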
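To make the effect of weight sharing on parameter count concrete, here is a small arithmetic sketch. All layer sizes are hypothetical, and the dense layer is assumed to map the same input to an output of the same size as the convolution (stride 1, padding 1):

```python
# Hypothetical layer sizes, chosen only for illustration.
C, H, W = 3, 32, 32        # input channels, height, width
F, HH, WW = 16, 3, 3       # number of filters, filter height, filter width

# Convolution: each filter has C * HH * WW weights plus one bias,
# and the same filter is reused at every spatial position.
conv_params = F * (C * HH * WW) + F

# Fully connected layer producing an output of shape (F, H, W):
# one weight per (input element, output element) pair plus one bias per output.
dense_params = (C * H * W) * (F * H * W) + F * H * W

print(conv_params, dense_params)
```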

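In a convolutional neural network, several convolutional layers can be stacked to extract progressively higher-level features. The resulting feature hierarchy lets the network adapt to different tasks, such as image classification and object detection.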

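## 3.1 Convolution: Naive Forward Pass
The core of a convolutional neural network is the convolution operation. Implement the forward pass of the convolutional layer in the function `conv_forward_naive`.

At this stage you do not need to worry much about efficiency; just write the code in whatever way you find clearest.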
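The docstring of `conv_forward_naive` gives the output size along each axis as `1 + (size + 2 * pad - kernel) / stride`. A minimal sketch of that formula, applied to the two configurations used by the checks in this notebook (the helper name `conv_output_size` and the divisibility check are assumptions made for this example):

```python
def conv_output_size(size, kernel, pad, stride):
    """Output size of a convolution along one spatial axis (hypothetical helper)."""
    span = size + 2 * pad - kernel
    if span % stride != 0:
        # Assumption of this sketch: reject settings where the stride does not
        # tile the padded input evenly, instead of silently truncating.
        raise ValueError("stride does not evenly tile the padded input")
    return 1 + span // stride


forward_test = conv_output_size(4, 4, pad=1, stride=2)   # H=4, HH=4 in the forward check
gradient_test = conv_output_size(5, 3, pad=1, stride=1)  # H=5, HH=3 in the gradient check
print(forward_test, gradient_test)
```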

In [None]:
def conv_forward_naive(x, w, b, conv_param):
    """
    A naive implementation of the forward pass for a convolutional layer.

    The input consists of N data points, each with C channels, height H and width
    W. We convolve each input with F different filters, where each filter spans
    all C channels and has height HH and width WW.

    Input:
    - x: Input data of shape (N, C, H, W)
    - w: Filter weights of shape (F, C, HH, WW)
    - b: Biases, of shape (F,)
    - conv_param: A dictionary with the following keys:
      - 'stride': The number of pixels between adjacent receptive fields in the
        horizontal and vertical directions.
      - 'pad': The number of pixels that will be used to zero-pad the input.

    Returns a tuple of:
    - out: Output data, of shape (N, F, H', W') where H' and W' are given by
      H' = 1 + (H + 2 * pad - HH) / stride
      W' = 1 + (W + 2 * pad - WW) / stride
    - cache: (x, w, b, conv_param)
    """
    out = None
    #############################################################################
    # TODO: Implement the convolutional forward pass.                           #
    # Hint: you can use the function np.pad for padding.                        #
    #############################################################################
    stride = conv_param['stride']
    pad = conv_param['pad']

    N, C, H, W = x.shape
    F, _, HH, WW = w.shape

    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride

    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=0)

    out = np.zeros((N, F, H_out, W_out))

    for n in range(N):
      for f in range(F):
        for i in range(H_out):
          for j in range(W_out):
            h_start = i*stride
            h_end = h_start + HH
            w_start = j*stride
            w_end = w_start + WW

            region = x_padded[n, :, h_start:h_end, w_start:w_end]
            out[n,f,i,j] = np.sum (region*w[f]) + b[f]
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    cache = (x, w, b, conv_param)
    return out, cache

You can test your implementation by running the cell below; the error should be less than ``1e-7``.

In [None]:
x_shape = (2, 3, 4, 4)
w_shape = (3, 3, 4, 4)
x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)
w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)
b = np.linspace(-0.1, 0.2, num=3)

conv_param = {'stride': 2, 'pad': 1}
out, _ = conv_forward_naive(x, w, b, conv_param)
correct_out = np.array([[[[[-0.08759809, -0.10987781],
                           [-0.18387192, -0.2109216 ]],
                          [[ 0.21027089,  0.21661097],
                           [ 0.22847626,  0.23004637]],
                          [[ 0.50813986,  0.54309974],
                           [ 0.64082444,  0.67101435]]],
                         [[[-0.98053589, -1.03143541],
                           [-1.19128892, -1.24695841]],
                          [[ 0.69108355,  0.66880383],
                           [ 0.59480972,  0.56776003]],
                          [[ 2.36270298,  2.36904306],
                           [ 2.38090835,  2.38247847]]]]])

# Compare your output to ours; difference should be around 1e-8
print ('Testing conv_forward_naive')
print ('conv forward error: ', rel_error(out, correct_out))

Testing conv_forward_naive
conv forward error:  2.2121476417505994e-08


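## 3.2 Convolution: Naive Backward Pass
Implement the backward pass of the convolution operation in the function `conv_backward_naive`. Again, you do not need to worry much about computational efficiency.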
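As a reference for the backward pass, the gradients can be written out index by index. This is a sketch under the notation assumed here: $s$ is the stride, $\tilde{x}$ is the zero-padded input, $\text{dout}$ is the upstream gradient $\partial L / \partial \text{out}$, and $p, q$ index positions inside the filter.

$$
\begin{aligned}
\text{out}_{n,f,i,j} &= b_f + \sum_{c,p,q} \tilde{x}_{n,c,\,is+p,\,js+q}\; w_{f,c,p,q} \\
\frac{\partial L}{\partial b_f} &= \sum_{n,i,j} \text{dout}_{n,f,i,j} \\
\frac{\partial L}{\partial w_{f,c,p,q}} &= \sum_{n,i,j} \tilde{x}_{n,c,\,is+p,\,js+q}\; \text{dout}_{n,f,i,j} \\
\frac{\partial L}{\partial \tilde{x}_{n,c,u,v}} &= \sum_{f} \sum_{\substack{i,p:\ is+p=u \\ j,q:\ js+q=v}} w_{f,c,p,q}\; \text{dout}_{n,f,i,j}
\end{aligned}
$$

The gradient with respect to the original input $x$ is the gradient with respect to $\tilde{x}$ with the padded border removed.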

In [None]:
def conv_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a convolutional layer.

    Inputs:
    - dout: Upstream derivatives.
    - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive

    Returns a tuple of:
    - dx: Gradient with respect to x
    - dw: Gradient with respect to w
    - db: Gradient with respect to b
    """
    dx, dw, db = None, None, None
    #############################################################################
    # TODO: Implement the convolutional backward pass.                          #
    #############################################################################

    x, w, b, conv_param = cache
    stride = conv_param['stride']
    pad = conv_param['pad']

    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    _, _, H_out, W_out = dout.shape

    # Initialize gradients
    dx = np.zeros_like(x)
    dw = np.zeros_like(w)
    db = np.zeros_like(b)

    # Pad x and dx
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=0)
    dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=0)

    # Compute db (sum over dout)
    db = np.sum(dout, axis=(0, 2, 3))

    for n in range(N):
      for f in range(F):
        for i in range(H_out):
          for j in range(W_out):
            h_start = i*stride
            h_end = h_start + HH
            w_start = j*stride
            w_end = w_start + WW

            region = x_padded[n, :, h_start:h_end, w_start:w_end]

            dw[f] += region * dout[n, f, i, j]

            dx_padded[n, :, h_start:h_end, w_start:w_end] += w[f] * dout[n, f, i, j]

    dx = dx_padded[:, :, pad:-pad, pad:-pad] if pad > 0 else dx_padded
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    return dx, dw, db


When you are done, run the cell below to verify your backward pass with a numerical gradient check; the error should be less than ``1e-8``.

In [None]:
x = np.random.randn(4, 3, 5, 5)
w = np.random.randn(2, 3, 3, 3)
b = np.random.randn(2,)
dout = np.random.randn(4, 2, 5, 5)
conv_param = {'stride': 1, 'pad': 1}

dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)
db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)

out, cache = conv_forward_naive(x, w, b, conv_param)
dx, dw, db = conv_backward_naive(dout, cache)

# Your errors should be around 1e-9
print ('Testing conv_backward_naive function')
print ('dx error: ', rel_error(dx, dx_num))
print ('dw error: ', rel_error(dw, dw_num))
print ('db error: ', rel_error(db, db_num))

Testing conv_backward_naive function
dx error:  1.6877671244138885e-09
dw error:  4.5472807201787456e-10
db error:  8.919734826253235e-12


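# 4 Max Pooling

Max pooling is a pooling operation commonly used in deep learning to reduce the spatial dimensions of the input. In convolutional neural networks it is typically used to shrink the feature maps, which lowers the computational cost of the model and helps both efficiency and generalization.

The main advantage of max pooling is that it keeps the most salient features of the image while reducing computation. By discarding the non-maximum values, the model can concentrate on the features that matter most for the task. It also makes the network more robust to changes in position, because the maximum is unchanged by small translations or deformations.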
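The robustness to small shifts can be illustrated with a minimal sketch. The helper `max_pool_2x2` and the three toy inputs are assumptions made for this example; the helper handles only 2D arrays with even side lengths:

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even side lengths."""
    h, w = x.shape
    # Split each axis into (block index, offset inside block), then reduce offsets.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


a = np.zeros((4, 4))
a[0, 0] = 1.0            # a single active feature
b = np.zeros((4, 4))
b[1, 1] = 1.0            # same feature shifted, still inside the same window
c = np.zeros((4, 4))
c[2, 2] = 1.0            # shifted far enough to land in a different window

print(max_pool_2x2(a))
print(max_pool_2x2(b))
print(max_pool_2x2(c))
```

The outputs for `a` and `b` are identical, while `c` differs: the invariance holds only for shifts that keep the maximum inside the same pooling window.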

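## 4.1 Max Pooling: Naive Forward Pass
Implement the forward pass of the max pooling operation in the function `max_pool_forward_naive`. Again, you do not need to worry much about computational efficiency.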

In [None]:
def max_pool_forward_naive(x, pool_param):
    """
    A naive implementation of the forward pass for a max pooling layer.

    Inputs:
    - x: Input data, of shape (N, C, H, W)
    - pool_param: dictionary with the following keys:
      - 'pool_height': The height of each pooling region
      - 'pool_width': The width of each pooling region
      - 'stride': The distance between adjacent pooling regions

    Returns a tuple of:
    - out: Output data
    - cache: (x, pool_param)
    """
    out = None
    #############################################################################
    # TODO: Implement the max pooling forward pass                              #
    #############################################################################
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']
    N, C, H, W = x.shape

    H_out = 1 + (H - pool_height) // stride
    W_out = 1 + (W - pool_width) // stride

    out = np.zeros((N, C, H_out, W_out))
    for n in range(N):
      for c in range(C):
        for i in range(H_out):
          for j in range(W_out):
            h_start = i*stride
            h_end = h_start + pool_height
            w_start = j*stride
            w_end = w_start + pool_width
            out[n,c,i,j] = np.max(x[n, c, h_start:h_end, w_start:w_end])
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    cache = (x, pool_param)
    return out, cache

You can test your implementation by running the cell below; the error should be less than ``1e-7``.

In [None]:
x_shape = (2, 3, 4, 4)
x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)
pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}

out, _ = max_pool_forward_naive(x, pool_param)

correct_out = np.array([[[[-0.26315789, -0.24842105],
                          [-0.20421053, -0.18947368]],
                         [[-0.14526316, -0.13052632],
                          [-0.08631579, -0.07157895]],
                         [[-0.02736842, -0.01263158],
                          [ 0.03157895,  0.04631579]]],
                        [[[ 0.09052632,  0.10526316],
                          [ 0.14947368,  0.16421053]],
                         [[ 0.20842105,  0.22315789],
                          [ 0.26736842,  0.28210526]],
                         [[ 0.32631579,  0.34105263],
                          [ 0.38526316,  0.4       ]]]])

# Compare your output with ours. Difference should be around 1e-8.
print ('Testing max_pool_forward_naive function:')
print ('max pool forward error: ', rel_error(out, correct_out))

Testing max_pool_forward_naive function:
max pool forward error:  4.1666665157267834e-08


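## 4.2 Max Pooling: Naive Backward Pass
Implement the backward pass of the max pooling operation in the function `max_pool_backward_naive`. You do not need to worry about computational efficiency.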
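In the backward pass, each pooling window routes its upstream gradient to the position that held the maximum, and every other position in the window receives zero. A minimal sketch for one window, with hypothetical values:

```python
import numpy as np

window = np.array([[1.0, 3.0],
                   [2.0, 0.5]])   # hypothetical 2x2 pooling window
upstream = 0.7                     # hypothetical upstream gradient for this window

# Locate the maximum; np.argmax returns the first maximum if there are ties.
idx = np.unravel_index(np.argmax(window), window.shape)
grad = np.zeros_like(window)
grad[idx] = upstream               # only the argmax position receives gradient
print(grad)
```

When pooling windows overlap (stride smaller than the window size), one input position can be the maximum of several windows, so the contributions should be accumulated with `+=` rather than assigned.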



In [None]:

def max_pool_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a max pooling layer.

    Inputs:
    - dout: Upstream derivatives
    - cache: A tuple of (x, pool_param) as in the forward pass.

    Returns:
    - dx: Gradient with respect to x
    """
    dx = None
    #############################################################################
    # TODO: Implement the max pooling backward pass                             #
    #############################################################################
    x, pool_param = cache  # unpack the values saved by the forward pass
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']
    N, C, H, W = x.shape

    H_out = 1 + (H - pool_height) // stride
    W_out = 1 + (W - pool_width) // stride

    dx = np.zeros_like(x)
    max_indices = np.zeros((N, C, H_out, W_out))

    out = np.zeros((N, C, H_out, W_out))
    for n in range(N):
      for c in range(C):
        for i in range(H_out):
          for j in range(W_out):
            h_start = i*stride
            h_end = h_start + pool_height
            w_start = j*stride
            w_end = w_start + pool_width

            window = x[n, c, h_start:h_end, w_start:w_end]
            out[n,c,i,j] = np.max(window)

            max_index = np.unravel_index(np.argmax(window), window.shape)
            h_max = h_start + max_index[0]
            w_max = w_start + max_index[1]

            dx[n, c, h_max, w_max] += dout[n, c, i, j]

    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    return dx


When you are done, run the cell below to verify your backward pass with a numerical gradient check; the error should be less than ``1e-10``.

In [None]:
x = np.random.randn(3, 2, 8, 8)
dout = np.random.randn(3, 2, 4, 4)
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)

out, cache = max_pool_forward_naive(x, pool_param)
dx = max_pool_backward_naive(dout, cache)

# Your error should be around 1e-12
print ('Testing max_pool_backward_naive function:')
print ('dx error: ', rel_error(dx, dx_num))

Testing max_pool_backward_naive function:
dx error:  3.2756244991678825e-12


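# 5 Batch Normalization

One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization, proposed in [1].

The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this ensures that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance, since they are outputs of earlier layers in the network. Even worse, during training the distribution of features at each layer shifts as the weights of each layer are updated.

The authors of [1] hypothesize that this shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes inserting batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimates are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize the features.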
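The difference between training-time and test-time behavior can be sketched as follows. The data distribution, batch size, number of steps, and random seed are hypothetical; the update rule is the exponential running average with a `momentum` parameter that this assignment uses:

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen arbitrarily for this sketch
momentum, eps = 0.9, 1e-5
running_mean = np.zeros(3)
running_var = np.zeros(3)

# Training: every minibatch contributes its own statistics to the running averages.
for _ in range(200):
    batch = 4 * rng.standard_normal((64, 3)) + 10
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * batch.var(axis=0)

# Test: normalize with the stored averages, not with the test batch statistics.
test_batch = 4 * rng.standard_normal((5, 3)) + 10
normalized = (test_batch - running_mean) / np.sqrt(running_var + eps)
print(running_mean, running_var)
```

Using the stored averages at test time makes the output for one example independent of which other examples happen to be in the same batch.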

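This normalization strategy may reduce the representational power of the network, since for some layers it can be better to have features whose mean is not zero or whose variance is not one. For this reason, the batch normalization layer includes a learnable shift and scale parameter for each feature dimension.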
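A short sketch of why the learnable scale and shift restore the lost representational power: with suitable values they can undo the normalization entirely. The input data and seed below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen arbitrarily for this sketch
x = 4 * rng.standard_normal((8, 3)) + 10
eps = 1e-5

mean, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)    # normalized features

# If the layer learns gamma = sqrt(var + eps) and beta = mean,
# scaling and shifting reproduce the original features.
gamma, beta = np.sqrt(var + eps), mean
restored = gamma * x_hat + beta
print(np.max(np.abs(restored - x)))
```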


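Batch normalization is a very useful technique for training deep fully connected networks. It can also be used in convolutional networks, but it needs a small adjustment; the modified version is called "spatial batch normalization".

Normally, batch normalization accepts inputs of shape `(N, D)` and produces outputs of shape `(N, D)`, where we normalize across the minibatch dimension `N`. For data coming from convolutional layers, batch normalization needs to accept inputs of shape `(N, C, H, W)` and produce outputs of shape `(N, C, H, W)`, where the `N` dimension is the minibatch size and the `(H, W)` dimensions give the spatial size of the feature map.

If the feature map was produced by convolution, we expect the statistics of each feature channel to be relatively consistent both across different images and across different locations within the same image. Spatial batch normalization therefore computes a mean and variance for each of the `C` feature channels by computing statistics over the minibatch dimension `N` and the spatial dimensions `H` and `W`.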
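The per-channel statistics can be computed directly over the `N`, `H`, and `W` axes, or by moving the channel axis last and flattening to shape `(N * H * W, C)` so that ordinary batch normalization applies. A minimal sketch showing that the two views agree (the input values and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen arbitrarily for this sketch
N, C, H, W = 2, 3, 4, 5
x = rng.standard_normal((N, C, H, W))

# Direct view: reduce over batch and spatial axes, leaving one value per channel.
mean_direct = x.mean(axis=(0, 2, 3))
var_direct = x.var(axis=(0, 2, 3))

# Flattened view: every (n, h, w) position becomes one row with C features.
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
mean_flat = x_flat.mean(axis=0)
var_flat = x_flat.var(axis=0)

print(x_flat.shape, np.allclose(mean_direct, mean_flat), np.allclose(var_direct, var_flat))
```

The flattened output has to be mapped back with `reshape(N, H, W, C).transpose(0, 3, 1, 2)`, which inverts the transpose applied before flattening.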

[1] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.


## 5.1 Batch Normalization: Forward Pass

Implement the forward pass of spatial batch normalization in the function `batchnorm_forward`.

In [None]:
import math
def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the mean
    and variance of each feature, and these averages are used to normalize data
    at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7 implementation
    of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        #############################################################################
        # TODO: Implement the training-time forward pass for batch normalization.   #
        # Use minibatch statistics to compute the mean and variance, use these      #
        # statistics to normalize the incoming data, and scale and shift the        #
        # normalized data using gamma and beta.                                     #
        #                                                                           #
        # You should store the output in the variable out. Any intermediates that   #
        # you need for the backward pass should be stored in the cache variable.    #
        #                                                                           #
        # You should also use your computed sample mean and variance together with  #
        # the momentum variable to update the running mean and running variance,    #
        # storing your result in the running_mean and running_var variables.        #
        #############################################################################

        sample_mean = x.mean(axis=(0))
        sample_var = x.var(axis=(0))

        # eps keeps the division stable and matches the backward pass, which uses sample_var + eps
        x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma*x_normalized + beta

        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        cache = (x, x_normalized, sample_mean, sample_var, gamma, beta, eps)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    elif mode == 'test':
        #############################################################################
        # TODO: Implement the test-time forward pass for batch normalization. Use   #
        # the running mean and variance to normalize the incoming data, then scale  #
        # and shift the normalized data using gamma and beta. Store the result in   #
        # the out variable.                                                         #
        #############################################################################
        # At test time, normalize with the stored running statistics only;
        # they must not be updated from the test batch.
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normalized + beta
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache


def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    """
    Computes the forward pass for spatial batch normalization.

    Inputs:
    - x: Input data of shape (N, C, H, W)
    - gamma: Scale parameter, of shape (C,)
    - beta: Shift parameter, of shape (C,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance. momentum=0 means that
        old information is discarded completely at every time step, while
        momentum=1 means that new information is never incorporated. The
        default of momentum=0.9 should work well in most situations.
      - running_mean: Array of shape (C,) giving running mean of features
      - running_var: Array of shape (C,) giving running variance of features

    Returns a tuple of:
    - out: Output data, of shape (N, C, H, W)
    - cache: Values needed for the backward pass
    """
    N, C, H, W = x.shape
    x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
    out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
    out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return out, cache

Check your implementation by running the cell below; the error should be less than ``1e-5``.

In [None]:
# Check the training-time forward pass by checking means and variances
# of features both before and after spatial batch normalization
import numpy as np

N, C, H, W = 2, 3, 4, 5
x = 4 * np.random.randn(N, C, H, W) + 10

print ('Before spatial batch normalization:')
print ('  Shape: ', x.shape)
print ('  Means: ', x.mean(axis=(0, 2, 3)))
print ('  Stds: ', x.std(axis=(0, 2, 3)))

# Means should be close to zero and stds close to one. Shape should be unchanged.
gamma, beta = np.ones(C), np.zeros(C)
bn_param = {'mode': 'train'}
out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)
print ('After spatial batch normalization:')
print ('  Shape: ', out.shape)
print ('  Means: ', out.mean(axis=(0, 2, 3)))
print ('  Stds: ', out.std(axis=(0, 2, 3)))
print ('  Means error: ', out.mean(axis=(0, 2, 3)).mean())
print ('  Stds error: ', (1 - out.std(axis=(0, 2, 3))).mean())


# Means should be close to beta and stds close to gamma. Shape should be unchanged.
gamma, beta = np.asarray([3, 4, 5]), np.asarray([6, 7, 8])
out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)
print ('After spatial batch normalization (nontrivial gamma, beta):')
print ('  Shape: ', out.shape)
print ('  Means: ', out.mean(axis=(0, 2, 3)))
print ('  Stds: ', out.std(axis=(0, 2, 3)))
print ('  Means error: ', (beta - out.mean(axis=(0, 2, 3))).mean())
print ('  Stds error: ', (gamma - out.std(axis=(0, 2, 3))).mean())

Before spatial batch normalization:
  Shape:  (2, 3, 4, 5)
  Means:  [9.70660725 8.94972907 9.43287164]
  Stds:  [3.56684828 3.91649695 3.50278471]
After spatial batch normalization:
  Shape:  (2, 3, 4, 5)
  Means:  [-2.67841305e-16 -8.32667268e-17  3.16413562e-16]
  Stds:  [1. 1. 1.]
  Means error:  -1.1564823173178714e-17
  Stds error:  -7.401486830834377e-17
After spatial batch normalization (nontrivial gamma, beta):
  Shape:  (2, 3, 4, 5)
  Means:  [6. 7. 8.]
  Stds:  [3. 4. 5.]
  Means error:  2.960594732333751e-15
  Stds error:  -4.440892098500626e-16


## 5.2 Batch Normalization: Backward Pass

Implement the backward pass of spatial batch normalization in the function `spatial_batchnorm_backward`.
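As a reference, the backward pass can be derived by walking the computation graph of batch normalization in reverse. This is a sketch under the notation assumed here: $x_i$ is row $i$ of the flattened `(N, D)` input, $\mu$ and $\sigma^2$ are the minibatch mean and (uncorrected) variance, $\hat{x}_i = (x_i - \mu)/\sqrt{\sigma^2 + \epsilon}$, and $\text{dout}_i$ is the upstream gradient. All operations are elementwise over the feature dimension.

$$
\begin{aligned}
\frac{\partial L}{\partial \beta} &= \sum_{i=1}^{N} \text{dout}_i,
\qquad
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{N} \text{dout}_i \, \hat{x}_i,
\qquad
\frac{\partial L}{\partial \hat{x}_i} = \text{dout}_i \, \gamma \\
\frac{\partial L}{\partial \sigma^2} &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i} \, (x_i - \mu) \left(-\tfrac{1}{2}\right) (\sigma^2 + \epsilon)^{-3/2} \\
\frac{\partial L}{\partial \mu} &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}
 + \frac{\partial L}{\partial \sigma^2} \cdot \frac{1}{N} \sum_{i=1}^{N} -2\,(x_i - \mu) \\
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}}
 + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2\,(x_i - \mu)}{N}
 + \frac{\partial L}{\partial \mu} \cdot \frac{1}{N}
\end{aligned}
$$

The second term of $\partial L / \partial \mu$ is zero in exact arithmetic, because $\sum_i (x_i - \mu) = 0$; it is kept here so that each line corresponds to one edge of the computation graph.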

In [None]:
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None

    #############################################################################
    # TODO: Implement the backward pass for batch normalization. Store the      #
    # results in the dx, dgamma, and dbeta variables.                           #
    #############################################################################
    x, x_normalized, sample_mean, sample_var, gamma, beta, eps = cache
    N, D = dout.shape
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(dout * x_normalized, axis=0)
    dx_normalized = dout * gamma
    dvar = np.sum(dx_normalized * (x - sample_mean) * (-0.5) * (sample_var + eps) ** (-1.5), axis=0)
    dmean = np.sum(dx_normalized * (-1 / np.sqrt(sample_var + eps)), axis=0) + \
            dvar * np.sum(-2 * (x - sample_mean), axis=0) / N

    dx = dx_normalized / np.sqrt(sample_var + eps) + \
         dvar * 2 * (x - sample_mean) / N + \
         dmean / N
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################

    return dx, dgamma, dbeta


def spatial_batchnorm_backward(dout, cache):
  """
  Computes the backward pass for spatial batch normalization.

  Inputs:
  - dout: Upstream derivatives, of shape (N, C, H, W)
  - cache: Values from the forward pass

  Returns a tuple of:
  - dx: Gradient with respect to inputs, of shape (N, C, H, W)
  - dgamma: Gradient with respect to scale parameter, of shape (C,)
  - dbeta: Gradient with respect to shift parameter, of shape (C,)
  """
  N, C, H, W = dout.shape
  dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
  dx_flat, dgamma, dbeta = batchnorm_backward(dout_flat, cache)
  dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
  return dx, dgamma, dbeta

Check your implementation with a numerical gradient check by running the cell below; the error should be less than 1e-7.

In [None]:
N, C, H, W = 2, 3, 4, 5
x = 5 * np.random.randn(N, C, H, W) + 12
gamma = np.random.randn(C)
beta = np.random.randn(C)
dout = np.random.randn(N, C, H, W)

bn_param = {'mode': 'train'}
fx = lambda x: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]
fg = lambda a: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]
fb = lambda b: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
da_num = eval_numerical_gradient_array(fg, gamma, dout)
db_num = eval_numerical_gradient_array(fb, beta, dout)

_, cache = spatial_batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = spatial_batchnorm_backward(dout, cache)
print ('dx error: ', rel_error(dx_num, dx))
print ('dgamma error: ', rel_error(da_num, dgamma))
print ('dbeta error: ', rel_error(db_num, dbeta))

dx error:  5.397506808940468e-07
dgamma error:  1.719315643026913e-11
dbeta error:  4.333467383368165e-12


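# 6 Closing Remarks

Congratulations! You have completed the first assignment. It was hard work, but you now have a deeper understanding of convolution, pooling, and batch normalization.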



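>Person in charge of this assignment: 郜今 (teaching assistant), gaojin@sjtu.edu.cn.
One final reminder: submit the assignment on Canvas, upload a single ipynb file only, keep the output of every cell, and mind the deadline. If you have any questions, please contact the [teaching assistants](https://cs7353.netlify.app/staff/).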