In [1]:
#hide
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

# A Neural Net from the Foundations

This chapter begins a journey where we will dig deep into the internals of the models we used in the previous chapters. We will be covering many of the same things we've seen before, but this time around we'll be looking much more closely at the implementation details, and much less closely at the practical issues of how and why things are as they are.

We will build everything from scratch, only using basic indexing into a tensor. We'll write a neural net from the ground up, then implement backpropagation manually, so we know exactly what's happening in PyTorch when we call `loss.backward`. We'll also see how to extend PyTorch with custom *autograd* functions that allow us to specify our own forward and backward computations.

## Building a Neural Net Layer from Scratch

Let's start by refreshing our understanding of how matrix multiplication is used in a basic neural network. Since we're building everything up from scratch, we'll use nothing but plain Python initially (except for indexing into PyTorch tensors), and then replace the plain Python with PyTorch functionality once we've seen how to create it.

### Modeling a Neuron

A neuron receives a given number of inputs and has an internal weight for each of them. It sums those weighted inputs to produce an output and adds an inner bias. In math, this can be written as:

$$ out = \sum_{i=1}^{n} x_{i} w_{i} + b$$

if we name our inputs $(x_{1},\dots,x_{n})$, our weights $(w_{1},\dots,w_{n})$, and our bias $b$. In code this translates into:

```python
output = sum([x*w for x,w in zip(inputs,weights)]) + bias
```

This output is then fed into a nonlinear function called an *activation function* before being sent to another neuron. In deep learning the most common of these is the *rectified linear unit*, or *ReLU*, which, as we've seen, is a fancy way of saying:
```python
def relu(x): return x if x >= 0 else 0
```
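
As a small worked example, the sketch below puts the two snippets together into a single neuron. The `neuron` helper and the sample numbers are made up for illustration; they are not part of any library.

```python
def relu(x): return x if x >= 0 else 0

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, passed through the activation
    return relu(sum(x*w for x,w in zip(inputs,weights)) + bias)

inputs = [1.0, 2.0, 3.0]
out_pos = neuron(inputs, [0.5, 1.0, 0.25], bias=0.5)   # weighted sum is positive, so it passes through
out_neg = neuron(inputs, [0.5, -1.0, 0.25], bias=0.5)  # weighted sum is negative, so ReLU returns 0
print(out_pos, out_neg)
```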

A deep learning model is then built by stacking a lot of those neurons in successive layers. We create a first layer with a certain number of neurons (known as *hidden size*) and link all the inputs to each of those neurons. Such a layer is often called a *fully connected layer* or a *dense layer* (for densely connected), or a *linear layer*. 

This requires computing, for each `input` in our batch and each neuron with a given `weight`, the dot product:

```python
sum([x*w for x,w in zip(input,weight)])
```

If you have done a little bit of linear algebra, you may remember that having a lot of those dot products happens when you do a *matrix multiplication*. More precisely, if our inputs are in a matrix `x` with a size of `batch_size` by `n_inputs`, and if we have grouped the weights of our neurons in a matrix `w` of size `n_neurons` by `n_inputs` (each neuron must have the same number of weights as it has inputs) and all the biases in a vector `b` of size `n_neurons`, then the output of this fully connected layer is:

```python
y = x @ w.t() + b
```

where `@` represents the matrix product and `w.t()` is the transpose matrix of `w`. The output `y` is then of size `batch_size` by `n_neurons`, and in position `(i,j)` we have (for the mathy folks out there):

$$y_{i,j} = \sum_{k=1}^{n} x_{i,k} w_{j,k} + b_{j}$$

Or in code:

```python
y[i,j] = sum([xv * wv for xv,wv in zip(x[i,:],w[j,:])]) + b[j]
```

The transpose is necessary because in the mathematical definition of the matrix product `m @ n`, the coefficient `(i,j)` is:

```python
sum([a * b for a,b in zip(m[i,:],n[:,j])])
```

So the very basic operation we need is a matrix multiplication, as it's what is hidden in the core of a neural net.
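
To make the link between the per-neuron dot products and the matrix form concrete, here is a minimal sketch that computes the layer both ways and compares them. The sizes and the random values are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
batch_size, n_inputs, n_neurons = 4, 3, 2
x = torch.randn(batch_size, n_inputs)
w = torch.randn(n_neurons, n_inputs)   # one row of weights per neuron
b = torch.randn(n_neurons)

y = x @ w.t() + b                      # shape: batch_size x n_neurons

# Same computation, one dot product at a time
y_loop = torch.zeros(batch_size, n_neurons)
for i in range(batch_size):
    for j in range(n_neurons):
        y_loop[i,j] = sum(xv*wv for xv,wv in zip(x[i],w[j])) + b[j]

print(y.shape, torch.allclose(y, y_loop, atol=1e-5))
```

Row `j` of `w` holds the weights of neuron `j`, which is why the matrix form needs `w.t()`.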

### Matrix Multiplication from Scratch

Let's write a function that computes the matrix product of two tensors, before we allow ourselves to use the PyTorch version of it. We will only use the indexing in PyTorch tensors:

In [2]:
import torch
from torch import tensor

We'll need three nested `for` loops: one for the row indices, one for the column indices, and one for the inner sum. `ac` and `ar` stand for number of columns of `a` and number of rows of `a`, respectively (the same convention is followed for `b`), and we make sure calculating the matrix product is possible by checking that `a` has as many columns as `b` has rows:

In [3]:
def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c

To test this out, we'll pretend (using random matrices) that we're working with a small batch of 5 MNIST images, flattened into vectors of size 28×28, with a linear model to turn them into 10 activations:

In [4]:
m1 = torch.randn(5,28*28)
m2 = torch.randn(784,10)

Let's time our function, using the Jupyter "magic" command `%time`:

In [5]:
%time t1=matmul(m1, m2)

CPU times: total: 1.22 s
Wall time: 1.34 s


And see how that compares to PyTorch's built-in `@`:

In [6]:
%timeit -n 20 t2=m1@m2

The slowest run took 1160.49 times longer than the fastest. This could mean that an intermediate result is being cached.
920 µs ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)


As we can see, in Python three nested loops is a very bad idea! Python is a slow language, and this isn't going to be very efficient. In the run above, PyTorch is well over 1,000 times faster than Python (1.34 s versus roughly 920 µs, although that `%timeit` measurement is noisy), and that's before we even start using the GPU!

Where does this difference come from? PyTorch didn't write its matrix multiplication in Python, but rather in C++ to make it fast. In general, whenever we do computations on tensors we will need to *vectorize* them so that we can take advantage of the speed of PyTorch, usually by using two techniques: elementwise arithmetic and broadcasting.

### Elementwise Arithmetic

All the basic operators (`+`, `-`, `*`, `/`, `>`, `<`, `==`) can be applied elementwise. That means if we write `a+b` for two tensors `a` and `b` that have the same shape, we will get a tensor composed of the sums of the elements of `a` and `b`:

In [7]:
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a + b

tensor([12., 14.,  3.])

The Boolean operators will return an array of Booleans:

In [8]:
a < b

tensor([False,  True,  True])

If we want to know if every element of `a` is less than the corresponding element in `b`, or if two tensors are equal, we need to combine those elementwise operations with `torch.all`:

In [9]:
(a < b).all(), (a==b).all()

(tensor(False), tensor(False))

Reduction operations like `all()`, `sum()` and `mean()` return tensors with only one element, called rank-0 tensors. If you want to convert this to a plain Python Boolean or number, you need to call `.item()`:

In [10]:
(a + b).mean().item()

9.666666984558105

The elementwise operations work on tensors of any rank, as long as they have the same shape:

In [11]:
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m*m

tensor([[ 1.,  4.,  9.],
        [16., 25., 36.],
        [49., 64., 81.]])

However, you can't perform elementwise operations on tensors that don't have the same shape (unless they are broadcastable, as discussed in the next section):

In [12]:
n = tensor([[1., 2, 3], [4,5,6]])
m*n

RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

With elementwise arithmetic, we can remove one of our three nested loops: we can multiply the tensors that correspond to the `i`-th row of `a` and the `j`-th column of `b` before summing all the elements, which will speed things up because the inner loop will now be executed by PyTorch at C speed. 

To access one column or row, we can simply write `a[i,:]` or `b[:,j]`. The `:` means take everything in that dimension. We could restrict this and take only a slice of that particular dimension by passing a range, like `1:5`, instead of just `:`. In that case, we would take the elements in columns or rows 1 to 4 (the second number is noninclusive). 
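
A minimal sketch of these indexing forms on a small tensor (the tensor `t` and its values are chosen only for illustration):

```python
import torch

t = torch.arange(12).reshape(3, 4)  # rows: [0,1,2,3], [4,5,6,7], [8,9,10,11]
row  = t[1,:]     # everything in row 1
col  = t[:,2]     # everything in column 2
part = t[0,1:3]   # columns 1 and 2 of row 0 (the end index 3 is excluded)
print(row, col, part)
```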

One simplification is that we can always omit a trailing colon, so `a[i,:]` can be abbreviated to `a[i]`. With all of that in mind, we can write a new version of our matrix multiplication:

In [13]:
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc): c[i,j] = (a[i] * b[:,j]).sum()
    return c

In [14]:
%timeit -n 20 t3 = matmul(m1,m2)

2 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


We're already ~700 times faster, just by removing that inner `for` loop! And that's just the beginning: with broadcasting we can remove another loop and get an even bigger speedup.

### Broadcasting

As we discussed in <<chapter_mnist_basics>>, broadcasting is a term introduced by the [NumPy library](https://docs.scipy.org/doc/) that describes how tensors of different ranks are treated during arithmetic operations. For instance, it's obvious there is no way to add a 3×3 matrix with a 4×5 matrix, but what if we want to add one scalar (which can be represented as a 1×1 tensor) with a matrix? Or a vector of size 3 with a 3×4 matrix? In both cases, we can find a way to make sense of this operation.

Broadcasting gives specific rules to codify when shapes are compatible when trying to do an elementwise operation, and how the tensor of the smaller shape is expanded to match the tensor of the bigger shape. It's essential to master those rules if you want to be able to write code that executes quickly. In this section, we'll expand our previous treatment of broadcasting to understand these rules.

#### Broadcasting with a scalar

Broadcasting with a scalar is the easiest type of broadcasting. When we have a tensor `a` and a scalar, we just imagine a tensor of the same shape as `a` filled with that scalar and perform the operation:

In [15]:
a = tensor([10., 6, -4])
a > 0

tensor([ True,  True, False])

How are we able to do this comparison? `0` is being *broadcast* to have the same dimensions as `a`. Note that this is done without creating a tensor full of zeros in memory (that would be very inefficient). 

This is very useful if you want to normalize your dataset by subtracting the mean (a scalar) from the entire data set (a matrix) and dividing by the standard deviation (another scalar):

In [16]:
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
(m - 5) / 2.73

tensor([[-1.4652, -1.0989, -0.7326],
        [-0.3663,  0.0000,  0.3663],
        [ 0.7326,  1.0989,  1.4652]])
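
The constants `5` and `2.73` in the cell above are the mean and (roughly) the standard deviation of `m`. As a sketch, the same normalization can be written with the statistics computed from the data; the rank-0 tensors returned by `mean()` and `std()` broadcast just like Python scalars. The variable names here are illustrative.

```python
import torch

m = torch.tensor([[1., 2, 3], [4,5,6], [7,8,9]])
mean, std = m.mean(), m.std()      # rank-0 tensors, broadcast like Python scalars
normed = (m - mean) / std
print(mean.item(), std.item())
print(normed.mean().item(), normed.std().item())
```

After this step the normalized tensor has a mean close to 0 and a standard deviation close to 1, up to floating-point error.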

What if we have a different mean for each row of the matrix? In that case you will need to broadcast a vector to a matrix.

#### Broadcasting a vector to a matrix

We can broadcast a vector to a matrix as follows:

In [17]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m.shape,c.shape

(torch.Size([3, 3]), torch.Size([3]))

In [18]:
m + c

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

Here the elements of `c` are expanded to make three rows that match, making the operation possible. Again, PyTorch doesn't actually create three copies of `c` in memory. This is done by the `expand_as` method behind the scenes:

In [19]:
c.expand_as(m)

tensor([[10., 20., 30.],
        [10., 20., 30.],
        [10., 20., 30.]])

If we look at the corresponding tensor, we can ask for its `storage` property (which shows the actual contents of the memory used for the tensor) to check there is no useless data stored:

In [20]:
t = c.expand_as(m)
t.storage()

 10.0
 20.0
 30.0
[torch.FloatStorage of size 3]

Even though the tensor officially has nine elements, only three scalars are stored in memory. This is possible thanks to the clever trick of giving that dimension a *stride* of 0 (which means that when PyTorch looks for the next row by adding the stride, it doesn't move):

In [21]:
t.stride(), t.shape

((0, 1), torch.Size([3, 3]))
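
One way to convince yourself that the expanded tensor is only a view on the original three numbers is to modify the vector in place and look at the view again. This is a small sketch; it uses `expand(3, 3)`, which for a 3×3 target should behave the same as `expand_as(m)`.

```python
import torch

c = torch.tensor([10., 20, 30])
t = c.expand(3, 3)        # same effect as c.expand_as(m) for a 3x3 m
c[0] = 100.               # modify the original vector in place
print(t[:,0])             # every row of the view sees the new value
same_memory = t.data_ptr() == c.data_ptr()
print(t.stride(), same_memory)
```

Because the row stride is 0, every row of `t` reads from the same three memory locations as `c`.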

Since `m` is of size 3×3, there are two ways to do broadcasting. The fact it was done on the last dimension is a convention that comes from the rules of broadcasting and has nothing to do with the way we ordered our tensors. If instead we do this, we get the same result:

In [22]:
c + m

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

In fact, it's only possible to broadcast a vector of size `n` with a matrix of size `m` by `n`:

In [23]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6]])
c+m

tensor([[11., 22., 33.],
        [14., 25., 36.]])

This won't work:

In [None]:
c = tensor([10.,20])
m = tensor([[1., 2, 3], [4,5,6]])
c+m

RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

If we want to broadcast in the other dimension, we have to change the shape of our vector to make it a 3×1 matrix. This is done with the `unsqueeze` method in PyTorch:

In [24]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
c = c.unsqueeze(1)
m.shape,c.shape

(torch.Size([3, 3]), torch.Size([3, 1]))

This time, `c` is expanded on the column side:

In [25]:
c+m

tensor([[11., 12., 13.],
        [24., 25., 26.],
        [37., 38., 39.]])
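
This column-wise broadcast is what answers the earlier question about having a different mean for each row. A minimal sketch, with illustrative variable names:

```python
import torch

m = torch.tensor([[1., 2, 3], [4,5,6], [7,8,9]])
row_means = m.mean(dim=1)                 # shape [3]: one mean per row
centered = m - row_means.unsqueeze(1)     # the [3, 1] tensor is broadcast across the columns
# keepdim=True keeps the reduced dimension as size 1, so no unsqueeze is needed
centered_alt = m - m.mean(dim=1, keepdim=True)
print(centered)
```

Without the `unsqueeze(1)`, the vector of row means would be lined up with the columns instead, and each column (not each row) would have a value subtracted from it.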

Like before, only three scalars are stored in memory:

In [None]:
t = c.expand_as(m)
t.storage()

 10.0
 20.0
 30.0
[torch.FloatStorage of size 3]

And the expanded tensor has the right shape because the column dimension has a stride of 0:

In [None]:
t.stride(), t.shape

((1, 0), torch.Size([3, 3]))

With broadcasting, by default if we need to add dimensions, they are added at the beginning. When we were broadcasting before, PyTorch was doing `c.unsqueeze(0)` behind the scenes:

In [None]:
c = tensor([10.,20,30])
c.shape, c.unsqueeze(0).shape,c.unsqueeze(1).shape

(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

The `unsqueeze` command can be replaced by `None` indexing:

In [None]:
c.shape, c[None,:].shape,c[:,None].shape

(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

You can always omit trailing colons, and `...` means all preceding dimensions:

In [None]:
c[None].shape,c[...,None].shape

(torch.Size([1, 3]), torch.Size([3, 1]))

With this, we can remove another `for` loop in our matrix multiplication function. Now, instead of multiplying `a[i]` with `b[:,j]`, we can multiply `a[i]` with the whole matrix `b` using broadcasting, then sum the results:

In [None]:
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
#       c[i,j] = (a[i,:]          * b[:,j]).sum() # previous
        c[i]   = (a[i  ].unsqueeze(-1) * b).sum(dim=0)
    return c

In [None]:
%timeit -n 20 t4 = matmul(m1,m2)

357 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
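
To see why the broadcast version computes the right thing, it can help to follow the shapes for a single row. The sketch below does this on small random matrices (sizes chosen only for illustration) and compares the result with the first row of `a @ b`.

```python
import torch

torch.manual_seed(0)
a = torch.randn(5, 4)
b = torch.randn(4, 3)

row  = a[0]                   # shape [4]
col  = row.unsqueeze(-1)      # shape [4, 1]
prod = col * b                # broadcast to shape [4, 3]: prod[k,j] = a[0,k] * b[k,j]
out  = prod.sum(dim=0)        # sum over k, leaving shape [3]

print(col.shape, prod.shape, out.shape)
print(torch.allclose(out, (a @ b)[0], atol=1e-5))
```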


We're now 3,700 times faster than our first implementation! Before we move on, let's discuss the rules of broadcasting in a little more detail.

#### Broadcasting rules

When operating on two tensors, PyTorch compares their shapes elementwise. It starts with the *trailing dimensions* and works its way backward, adding 1 when it meets empty dimensions. Two dimensions are *compatible* when one of the following is true:

- They are equal.
- One of them is 1, in which case that dimension is broadcast to make it the same as the other.

Arrays do not need to have the same number of dimensions. For example, if you have a 256×256×3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with three values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible:

```
Image  (3d tensor): 256 x 256 x 3
Scale  (1d tensor):  (1)   (1)  3
Result (3d tensor): 256 x 256 x 3
```
    
However, a 2D tensor of size 256×256 isn't compatible with our image:

```
Image  (3d tensor): 256 x 256 x   3
Scale  (2d tensor):  (1)  256 x 256
Error
```

In our earlier examples with a 3×3 matrix and a vector of size 3, broadcasting was done on the rows:

```
Matrix (2d tensor):   3 x 3
Vector (1d tensor): (1)   3
Result (2d tensor):   3 x 3
```
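
The rules above can be sketched as a few lines of plain Python. `broadcast_shape` is a hypothetical helper written for this illustration, not a PyTorch function; it only reasons about shapes and does not touch any tensor data.

```python
def broadcast_shape(shape_a, shape_b):
    "Return the broadcast result shape, or raise ValueError if the shapes are incompatible."
    n = max(len(shape_a), len(shape_b))
    # Pad the shorter shape with 1s on the left so the trailing dimensions line up
    a = (1,)*(n-len(shape_a)) + tuple(shape_a)
    b = (1,)*(n-len(shape_b)) + tuple(shape_b)
    result = []
    for da,db in zip(a,b):
        if da == db or db == 1: result.append(da)
        elif da == 1: result.append(db)
        else: raise ValueError(f"incompatible dimensions: {da} and {db}")
    return tuple(result)

print(broadcast_shape((256,256,3), (3,)))   # image and per-channel scale
print(broadcast_shape((3,3), (3,)))         # matrix and vector
try: broadcast_shape((256,256,3), (256,256))
except ValueError as e: print("Error:", e)
```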

As an exercise, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size `64 x 3 x 256 x 256` with vectors of three elements (one for the mean and one for the standard deviation).

Another useful way of simplifying tensor manipulations is the use of the Einstein summation convention.

### Einstein Summation

Before using the PyTorch operation `@` or `torch.matmul`, there is one last way we can implement matrix multiplication: Einstein summation (`einsum`). This is a compact representation for combining products and sums in a general way. We write an equation like this:

```
ik,kj -> ij
```

The lefthand side represents the dimensions of the operands, separated by commas. Here we have two tensors that each have two dimensions (`i,k` and `k,j`). The righthand side represents the result dimensions, so here we have a tensor with two dimensions `i,j`.

The rules of Einstein summation notation are as follows:

1. Repeated indices on the left side are implicitly summed over if they are not on the right side.
2. Each index can appear at most twice on the left side.
3. The unrepeated indices on the left side must appear on the right side.

So in our example, since `k` is repeated, we sum over that index. In the end the formula represents the matrix obtained when we put in `(i,j)` the sum of all the coefficients `(i,k)` in the first tensor multiplied by the coefficients `(k,j)` in the second tensor... which is the matrix product! Here is how we can code this in PyTorch:

In [None]:
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)

Einstein summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have just one member on the lefthand side. For instance, this:

```python
torch.einsum('ij->ji', a)
```

returns the transpose of the matrix `a`. You can also have three or more members. This:

```python
torch.einsum('bi,ij,bj->b', a, b, c)
```

will return a vector of size `b` where the `k`-th coordinate is the sum of `a[k,i] b[i,j] c[k,j]` over all `i` and `j`. This notation is particularly convenient when you have more dimensions because of batches. For example, if you have two batches of matrices and want to compute the matrix product per batch, you could do this:

```python
torch.einsum('bik,bkj->bij', a, b)
```
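
As a sanity check, each of these three `einsum` expressions can be compared with an equivalent written using ordinary PyTorch operations. The tensor names and sizes below are arbitrary choices for this sketch (the middle operand is called `m` here to avoid reusing `b` for both a tensor and an index).

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 3)
m = torch.randn(3, 4)
c = torch.randn(2, 4)

# 'ij->ji' swaps the two indices: a transpose
transposed = torch.einsum('ij->ji', a)

# 'bi,ij,bj->b' sums a[b,i] * m[i,j] * c[b,j] over i and j for each b
triple = torch.einsum('bi,ij,bj->b', a, m, c)
triple_ref = ((a @ m) * c).sum(dim=1)

# 'bik,bkj->bij' is a matrix product for each item in the batch
p = torch.randn(5, 2, 3)
q = torch.randn(5, 3, 4)
batched = torch.einsum('bik,bkj->bij', p, q)
batched_ref = torch.bmm(p, q)

print(torch.equal(transposed, a.t()))
print(torch.allclose(triple, triple_ref, atol=1e-5))
print(torch.allclose(batched, batched_ref, atol=1e-5))
```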

Let's go back to our new `matmul` implementation using `einsum` and look at its speed:

In [None]:
%timeit -n 20 t5 = matmul(m1,m2)

68.7 µs ± 4.06 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


As you can see, not only is it practical, but it's *very* fast. `einsum` is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA. (But it's generally not as fast as carefully optimized CUDA code, as you see from the results in "Matrix Multiplication from Scratch".)

Now that we know how to implement a matrix multiplication from scratch, we are ready to build our neural net—specifically its forward and backward passes—using just matrix multiplications.

## The Forward and Backward Passes

As we saw in <<chapter_mnist_basics>>, to train a model, we will need to compute all the gradients of a given loss with respect to its parameters, which is known as the *backward pass*. The *forward pass* is where we compute the output of the model on a given input, based on the matrix products. As we define our first neural net, we will also delve into the problem of properly initializing the weights, which is crucial for making training start properly.

### Defining and Initializing a Layer

We will take the example of a two-layer neural net first. As we've seen, one layer can be expressed as `y = x @ w + b`, with `x` our inputs, `y` our outputs, `w` the weights of the layer (of size number of inputs by number of neurons, since we don't transpose as we did before), and `b` the bias vector:

In [None]:
def lin(x, w, b): return x @ w + b

We can stack the second layer on top of the first, but since mathematically the composition of two linear operations is another linear operation, this only makes sense if we put something nonlinear in the middle, called an activation function. As mentioned at the beginning of the chapter, in deep learning applications the activation function most commonly used is a ReLU, which returns the maximum of `x` and `0`. 

We won't actually train our model in this chapter, so we'll use random tensors for our inputs and targets. Let's say our inputs are 200 vectors of size 100, which we group into one batch, and our targets are 200 random floats:

In [None]:
x = torch.randn(200, 100)
y = torch.randn(200)

For our two-layer model we will need two weight matrices and two bias vectors. Let's say we have a hidden size of 50 and the output size is 1 (for one of our inputs, the corresponding output is one float in this toy example). We initialize the weights randomly and the bias at zero:

In [None]:
w1 = torch.randn(100,50)
b1 = torch.zeros(50)
w2 = torch.randn(50,1)
b2 = torch.zeros(1)

Then the result of our first layer is simply:

In [None]:
l1 = lin(x, w1, b1)
l1.shape

torch.Size([200, 50])

Note that this formula works with our batch of inputs, and returns a batch of hidden state: `l1` is a matrix of size 200 (our batch size) by 50 (our hidden size).

There is a problem with the way our model was initialized, however. To understand it, we need to look at the mean and standard deviation (std) of `l1`:

In [None]:
l1.mean(), l1.std()

(tensor(0.0019), tensor(10.1058))

The mean is close to zero, which is understandable since both our input and weight matrices have means close to zero. But the standard deviation, which represents how far away our activations go from the mean, went from 1 to 10. This is a really big problem because that's with just one layer. Modern neural nets can have hundreds of layers, so if each of them multiplies the scale of our activations by 10, by the end of the last layer we won't have numbers representable by a computer.

Indeed, if we make just 50 multiplications between `x` and random matrices of size 100×100, we'll have:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = x @ torch.randn(100,100)
x[0:5,0:5]

tensor([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])

The result is `nan`s everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights? But if we use weights that are too small, we will have the opposite problem: the scale of our activations will go from 1 to 0.1, and after 50 layers we'll be left with zeros everywhere:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.01)
x[0:5,0:5]

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

So we have to scale our weight matrices exactly right so that the standard deviation of our activations stays at 1. We can compute the exact value to use mathematically, as illustrated by Xavier Glorot and Yoshua Bengio in ["Understanding the Difficulty of Training Deep Feedforward Neural Networks"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). The right scale for a given layer is $1/\sqrt{n_{in}}$, where $n_{in}$ represents the number of inputs.
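
Here is a short sketch of where this scale comes from, under simplifying assumptions made for illustration: the inputs $x_{i}$ and weights $w_{i}$ are independent with mean zero, each $x_{i}$ has variance 1, and each $w_{i}$ has variance $\sigma^{2}$. Then the variance of one pre-activation is:

$$\operatorname{Var}\left(\sum_{i=1}^{n_{in}} x_{i} w_{i}\right) = \sum_{i=1}^{n_{in}} \operatorname{Var}(x_{i})\operatorname{Var}(w_{i}) = n_{in}\,\sigma^{2}$$

Requiring $n_{in}\,\sigma^{2} = 1$ gives $\sigma = 1/\sqrt{n_{in}}$. With $n_{in} = 100$ and $\sigma = 1$, the standard deviation is $\sqrt{100} = 10$, which is consistent with the value we observed for `l1`; with $\sigma = 0.01$ it is $0.1$ per layer, which is why the activations shrank toward zero.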

In our case, if we have 100 inputs, we should scale our weight matrices by 0.1:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.1)
x[0:5,0:5]

tensor([[ 0.7554,  0.6167, -0.1757, -1.5662,  0.5644],
        [-0.1987,  0.6292,  0.3283, -1.1538,  0.5416],
        [ 0.6106,  0.2556, -0.0618, -0.9463,  0.4445],
        [ 0.4484,  0.7144,  0.1164, -0.8626,  0.4413],
        [ 0.3463,  0.5930,  0.3375, -0.9486,  0.5643]])

Finally some numbers that are neither zeros nor `nan`s! Notice how stable the scale of our activations is, even after those 50 fake layers:


In [None]:
x.std()

tensor(0.7042)
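
A short variance calculation suggests where the $1/\sqrt{n_{in}}$ factor comes from. This is only a sketch: it assumes the inputs $x_{i}$ and weights $w_{i}$ are independent with zero mean, and it introduces $\sigma_{x}^{2}$ and $\sigma_{w}^{2}$ (symbols not used elsewhere in this chapter) for their variances:

$$\operatorname{Var}\left(\sum_{i=1}^{n_{in}} x_{i} w_{i}\right) = \sum_{i=1}^{n_{in}} \operatorname{Var}(x_{i} w_{i}) = n_{in} \, \sigma_{x}^{2} \, \sigma_{w}^{2}$$

Asking for the output variance to equal $\sigma_{x}^{2}$ gives $\sigma_{w}^{2} = 1/n_{in}$, that is $\sigma_{w} = 1/\sqrt{n_{in}}$, which is 0.1 for 100 inputs. A single random run will still drift a little from 1, as the value above shows.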

If you play a little bit with the value for scale you'll notice that even a slight variation from 0.1 will get you either to very small or very large numbers, so initializing the weights properly is extremely important. 
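
The sensitivity to the scale can be estimated without running the loop. The sketch below assumes, as a simplification, that each layer multiplies the standard deviation of the activations by roughly `scale * sqrt(n_in)`; the helper name `expected_std_after` is made up for this illustration:

```python
from math import sqrt

n_in, n_layers = 100, 50

def expected_std_after(scale, n_in=n_in, n_layers=n_layers):
    # Assumption: every layer rescales the activation std by scale * sqrt(n_in),
    # so the effect compounds over n_layers layers.
    return (scale * sqrt(n_in)) ** n_layers

for scale in (0.09, 0.1, 0.11):
    print(scale, expected_std_after(scale))
```

Under this assumption a scale of 0.09 shrinks the activations by a factor of a few hundred after 50 layers, while 0.11 grows them by more than a hundred; only 0.1 keeps the estimate at 1.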

Let's go back to our neural net. Since we messed a bit with our inputs, we need to redefine them:


In [None]:
x = torch.randn(200, 100)
y = torch.randn(200)

And for our weights, we'll use the right scale, which is known as *Xavier initialization* (or *Glorot initialization*):


In [None]:
from math import sqrt
w1 = torch.randn(100,50) / sqrt(100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) / sqrt(50)
b2 = torch.zeros(1)

Now if we compute the result of the first layer, we can check that the mean and standard deviation are under control:


In [None]:
l1 = lin(x, w1, b1)
l1.mean(),l1.std()

(tensor(-0.0050), tensor(1.0000))

Very good. Now we need to go through a ReLU, so let's define one. A ReLU removes the negatives and replaces them with zeros, which is another way of saying it clamps our tensor at zero:


In [None]:
def relu(x): return x.clamp_min(0.)

We pass our activations through this:


In [None]:
l2 = relu(l1)
l2.mean(),l2.std()

(tensor(0.3961), tensor(0.5783))
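
These two numbers are close to what a short calculation predicts. The sketch below assumes the pre-activations are approximately standard normal, which is only an approximation of `l1`; the variable names are chosen here for illustration:

```python
from math import pi, sqrt

# Assumption: z ~ N(0, 1) before the ReLU.
mean_relu = 1 / sqrt(2 * pi)      # E[max(z, 0)]
second_moment = 0.5               # E[max(z, 0)**2]: half of E[z**2] = 1
std_relu = sqrt(second_moment - mean_relu ** 2)

print(round(mean_relu, 4), round(std_relu, 4))  # 0.3989 0.5838
```

So a ReLU applied to unit-variance activations is expected to shift the mean to about 0.4 and cut the standard deviation to about 0.58, independent of the particular random draw.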

And we're back to square one: the mean of our activations has gone to 0.4 (which is understandable since we removed the negatives) and the std went down to 0.58. So like before, after a few layers we will probably wind up with zeros:


In [None]:
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * 0.1))
x[0:5,0:5]

tensor([[0.0000e+00, 1.9689e-08, 4.2820e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.6701e-08, 4.3501e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.0976e-08, 3.0411e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.8457e-08, 4.9469e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.9949e-08, 4.1643e-08, 0.0000e+00, 0.0000e+00]])

This means our initialization wasn't right. Why? At the time Glorot and Bengio wrote their article, the popular activation in a neural net was the hyperbolic tangent (tanh, which is the one they used), and that initialization doesn't account for our ReLU. Fortunately, someone else has done the math for us and computed the right scale for us to use. In ["Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"](https://arxiv.org/abs/1502.01852) (written by the same authors who went on to introduce the ResNet), Kaiming He et al. show that we should use the following scale instead: $\sqrt{2 / n_{in}}$, where $n_{in}$ is the number of inputs of our layer. Let's see what this gives us:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * sqrt(2/100)))
x[0:5,0:5]

tensor([[0.2871, 0.0000, 0.0000, 0.0000, 0.0026],
        [0.4546, 0.0000, 0.0000, 0.0000, 0.0015],
        [0.6178, 0.0000, 0.0000, 0.0180, 0.0079],
        [0.3333, 0.0000, 0.0000, 0.0545, 0.0000],
        [0.1940, 0.0000, 0.0000, 0.0000, 0.0096]])

That's better: our numbers aren't all zeroed this time. So let's go back to the definition of our neural net and use this initialization (which is named *Kaiming initialization* or *He initialization*):


In [None]:
x = torch.randn(200, 100)
y = torch.randn(200)

In [None]:
w1 = torch.randn(100,50) * sqrt(2 / 100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) * sqrt(2 / 50)
b2 = torch.zeros(1)

Let's look at the scale of our activations after going through the first linear layer and ReLU:


In [None]:
l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()

(tensor(0.5661), tensor(0.8339))
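
The values above can be compared with a rough prediction. The sketch below assumes the output of the first linear layer is approximately normal with zero mean; the names `pre_std`, `mean_l2` and `std_l2` are introduced here for illustration only:

```python
from math import pi, sqrt

n_in = 100
w_std = sqrt(2 / n_in)             # Kaiming scale for the weights
pre_std = w_std * sqrt(n_in)       # std of the linear output, about sqrt(2)

# Assumption: the linear output is roughly N(0, pre_std**2).
mean_l2 = pre_std / sqrt(2 * pi)   # mean after the ReLU
second_moment = pre_std ** 2 / 2   # E[relu(z)**2], the quantity kept near 1
std_l2 = sqrt(second_moment - mean_l2 ** 2)

print(mean_l2, std_l2, second_moment)  # roughly 0.564, 0.826 and 1.0
```

The factor of 2 compensates for the half of the second moment that the ReLU throws away, which is why the scale no longer collapses from layer to layer.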

Much better! Now that our weights are properly initialized, we can define our whole model:


In [None]:
def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

This is the forward pass. Now all that's left to do is to compare our output to the labels we have (random numbers, in this example) with a loss function. In this case, we will use the mean squared error. (It's a toy problem, and this is the easiest loss function to use for what is next, computing the gradients.)

The only subtlety is that our outputs and targets don't have exactly the same shape. After going through the model, we get an output like this:

In [None]:
out = model(x)
out.shape

torch.Size([200, 1])

To get rid of this trailing 1 dimension, we use the `squeeze` function:


In [None]:
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
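
To see why the `squeeze` matters, here is a small sketch of what happens without it. The tensors are random stand-ins with the same shapes as our outputs and targets:

```python
import torch

out = torch.randn(200, 1)   # stand-in for the model output
targ = torch.randn(200)     # stand-in for the targets

wrong = out - targ               # broadcasting pairs every output with every target
right = out.squeeze(-1) - targ   # elementwise difference, one value per row

print(wrong.shape, right.shape)  # torch.Size([200, 200]) torch.Size([200])
```

Without the `squeeze`, broadcasting silently produces a 200×200 matrix, and the mean of its squares would not be the loss we intended.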

And now we are ready to compute our loss:


In [None]:
loss = mse(out, y)

That's all for the forward pass—let's now look at the gradients.


### Gradients and the Backward Pass


We've seen that PyTorch computes all the gradients we need with a magic call to `loss.backward`, but let's explore what's happening behind the scenes.

Now comes the part where we need to compute the gradients of the loss with respect to all the weights of our model, so all the floats in `w1`, `b1`, `w2`, and `b2`. For this, we will need a bit of math—specifically the *chain rule*. This is the rule of calculus that guides how we can compute the derivative of a composed function:

$$(g \circ f)'(x) = g'(f(x)) f'(x)$$


> j: I find this notation very hard to wrap my head around, so instead I like to think of it as: if `y = g(u)` and `u=f(x)`; then `dy/dx = dy/du * du/dx`. The two notations mean the same thing, so use whatever works for you.
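
The chain rule can be checked numerically on a toy composition. The functions `f` and `g` below are arbitrary choices made for this sketch, not part of the model:

```python
def f(x): return 3 * x + 1   # u = f(x)
def g(u): return u ** 2      # y = g(u)

x = 2.0
u = f(x)
dy_du = 2 * u                # derivative of u**2 with respect to u
du_dx = 3.0                  # derivative of 3x + 1 with respect to x
analytic = dy_du * du_dx     # chain rule: dy/dx = dy/du * du/dx

# Central finite difference as an independent check.
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

print(analytic, numeric)
```

Both values should agree up to floating-point error, which is the same kind of check we can use later to validate a hand-written backward pass.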


Our loss is a big composition of different functions: mean squared error (which is in turn the composition of a mean and a power of two), the second linear layer, a ReLU and the first linear layer. For instance, if we want the gradients of the loss with respect to `b2` and our loss is defined by:

```
loss = mse(out,y) = mse(lin(l2, w2, b2), y)
```

The chain rule tells us that we have:
$$\frac{\text{d} loss}{\text{d} b_{2}} = \frac{\text{d} loss}{\text{d} out} \times \frac{\text{d} out}{\text{d} b_{2}} = \frac{\text{d}}{\text{d} out} mse(out, y) \times \frac{\text{d}}{\text{d} b_{2}} lin(l_{2}, w_{2}, b_{2})$$

To compute the gradients of the loss with respect to $b_{2}$, we first need the gradients of the loss with respect to our output $out$. It's the same if we want the gradients of the loss with respect to $w_{2}$. Then, to get the gradients of the loss with respect to $b_{1}$ or $w_{1}$, we will need the gradients of the loss with respect to $l_{1}$, which in turn requires the gradients of the loss with respect to $l_{2}$, which will need the gradients of the loss with respect to $out$.

So to compute all the gradients we need for the update, we need to begin from the output of the model and work our way *backward*, one layer after the other—which is why this step is known as *backpropagation*. We can automate it by having each function we implemented (`relu`, `mse`, `lin`) provide its backward step: that is, how to derive the gradients of the loss with respect to the input(s) from the gradients of the loss with respect to the output.

Here we populate those gradients in an attribute of each tensor, a bit like PyTorch does with `.grad`. 

The first are the gradients of the loss with respect to the output of our model (which is the input of the loss function). We undo the `squeeze` we did in `mse`, then we use the formula that gives us the derivative of $x^{2}$: $2x$. The derivative of the mean is just $1/n$ where $n$ is the number of elements in our input:


In [None]:
def mse_grad(inp, targ): 
    # grad of loss with respect to output of previous layer
    # squeeze(-1) mirrors `mse`, so only the trailing dimension is removed
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]
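
Written out, with $n$ the number of rows in the output and $out_{i}$, $y_{i}$ the individual predictions and targets (index notation introduced here for illustration), the formula implemented above is:

$$loss = \frac{1}{n} \sum_{i=1}^{n} (out_{i} - y_{i})^{2} \quad \Rightarrow \quad \frac{\text{d} loss}{\text{d} out_{i}} = \frac{2}{n} (out_{i} - y_{i})$$

The `unsqueeze(-1)` only restores the trailing dimension so that the gradient has the same shape as the output.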

For the gradients of the ReLU and our linear layer, we use the gradients of the loss with respect to the output (in `out.g`) and apply the chain rule to compute the gradients of the loss with respect to the input (in `inp.g`). The chain rule tells us that `inp.g = relu'(inp) * out.g`. The derivative of `relu` is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:


In [None]:
def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g

The scheme is the same to compute the gradients of the loss with respect to the inputs, weights, and bias in the linear layer:


In [None]:
def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

We won't linger on the mathematical formulas that define them since they're not important for our purposes, but do check out Khan Academy's excellent calculus lessons if you're interested in this topic.


### Sidebar: SymPy


SymPy is a library for symbolic computation that is extremely useful when working with calculus. Per the [documentation](https://docs.sympy.org/latest/tutorial/intro.html):

> Symbolic computation deals with the computation of mathematical objects symbolically. This means that the mathematical objects are represented exactly, not approximately, and mathematical expressions with unevaluated variables are left in symbolic form.

To do symbolic computation, we first define a *symbol*, and then do a computation, like so:


In [None]:
from sympy import symbols,diff
sx,sy = symbols('sx sy')
diff(sx**2, sx)

2*sx

Here, SymPy has taken the derivative of `sx**2` for us! It can take the derivative of complicated compound expressions, simplify and factor equations, and much more. There's really not much reason for anyone to do calculus manually nowadays: for calculating gradients, PyTorch does it for us, and for showing the equations, SymPy does it for us!

### End sidebar

Once we have defined those functions, we can use them to write the backward pass. Since each gradient is automatically populated in the right tensor, we don't need to store the results of those `_grad` functions anywhere. We just need to execute them in the reverse order of the forward pass, to make sure that in each function `out.g` exists:

In [None]:
def forward_and_backward(inp, targ):
    # forward pass:
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    # we don't actually need the loss in backward!
    loss = mse(out, targ)
    
    # backward pass:
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)

And now we can access the gradients of our model parameters in `w1.g`, `b1.g`, `w2.g`, and `b2.g`.
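
One way to gain confidence in a hand-written backward pass is to compare it with autograd. The following sketch repeats the same formulas inline on smaller, made-up tensors (20 rows, 10 inputs, 5 hidden units) and checks them against `loss.backward()`; the variable names ending in `_g` and the `a`-prefixed copies are introduced only for this comparison:

```python
import torch
from math import sqrt

torch.manual_seed(0)
x, y = torch.randn(20, 10), torch.randn(20)
w1, b1 = torch.randn(10, 5) * sqrt(2 / 10), torch.zeros(5)
w2, b2 = torch.randn(5, 1) * sqrt(2 / 5), torch.zeros(1)

# Manual forward and backward pass, using the formulas from this section.
l1 = x @ w1 + b1
l2 = l1.clamp_min(0.)
out = l2 @ w2 + b2
out_g = 2. * (out.squeeze(-1) - y).unsqueeze(-1) / out.shape[0]
w2_g, b2_g = l2.t() @ out_g, out_g.sum(0)
l2_g = out_g @ w2.t()
l1_g = (l1 > 0).float() * l2_g
w1_g, b1_g = x.t() @ l1_g, l1_g.sum(0)

# Same computation on copies of the parameters that autograd tracks.
params = [p.clone().requires_grad_(True) for p in (w1, b1, w2, b2)]
aw1, ab1, aw2, ab2 = params
a_out = (x @ aw1 + ab1).clamp_min(0.) @ aw2 + ab2
loss = (a_out.squeeze(-1) - y).pow(2).mean()
loss.backward()

# Largest absolute difference between manual and autograd gradients.
max_diffs = [(m - p.grad).abs().max().item()
             for m, p in zip((w1_g, b1_g, w2_g, b2_g), params)]
print(max_diffs)
```

The differences should be at the level of floating-point rounding, which suggests the manual gradients match what PyTorch computes.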


We have successfully defined our model—now let's make it a bit more like a PyTorch module.


### Refactoring the Model


Each of the three functions we used has two associated pieces: a forward pass and a backward pass. Instead of writing them separately, we can create a class to wrap them together. That class can also store the inputs and outputs for the backward pass. This way, we will just have to call `backward`:

In [None]:
class Relu():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out
    
    def backward(self): self.inp.g = (self.inp>0).float() * self.out.g

`__call__` is a magic name in Python that will make our class callable. This is what will be executed when we type `y = Relu()(x)`. We can do the same for our linear layer and the MSE loss:


In [None]:
class Lin():
    def __init__(self, w, b): self.w,self.b = w,b
        
    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out
    
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

In [None]:
class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out
    
    def backward(self):
        x = (self.inp.squeeze()-self.targ).unsqueeze(-1)
        self.inp.g = 2.*x/self.targ.shape[0]
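
The `__call__` mechanism these classes rely on can be seen in isolation with a toy class. `Doubler` is a made-up example, unrelated to the model:

```python
class Doubler:
    def __call__(self, x):
        # Remember the input, just as our layers do for the backward pass.
        self.inp = x
        return 2 * x

d = Doubler()
result = d(3)        # calling the instance runs __call__
print(result, d.inp)  # 6 3
```

Storing the input on the instance during the call is what later lets `backward` run without any arguments.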

Then we can put everything in a model that we initiate with our tensors `w1`, `b1`, `w2`, `b2`:


In [None]:
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
        
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

What is really nice about this refactoring and registering things as layers of our model is that the forward and backward passes are now really easy to write. If we want to instantiate our model, we just need to write:


In [None]:
model = Model(w1, b1, w2, b2)

The forward pass can then be executed with:


In [None]:
loss = model(x, y)

And the backward pass with:


In [None]:
model.backward()

### Going to PyTorch


The `Lin`, `Mse` and `Relu` classes we wrote have a lot in common, so we could make them all inherit from the same base class:

In [None]:
class LayerFunction():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    
    def forward(self, *args):  raise NotImplementedError
    def bwd(self, *args):      raise NotImplementedError
    def backward(self): self.bwd(self.out, *self.args)

Then we just need to implement `forward` and `bwd` in each of our subclasses:


In [None]:
class Relu(LayerFunction):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp>0).float() * out.g

In [None]:
class Lin(LayerFunction):
    def __init__(self, w, b): self.w,self.b = w,b
        
    def forward(self, inp): return inp@self.w + self.b
    
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        self.w.g = inp.t() @ out.g
        self.b.g = out.g.sum(0)

In [None]:
class Mse(LayerFunction):
    def forward(self, inp, targ): return (inp.squeeze() - targ).pow(2).mean()
    def bwd(self, out, inp, targ): 
        inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]

The rest of our model can be the same as before. This is getting closer and closer to what PyTorch does. Each basic function we need to differentiate is written as a `torch.autograd.Function` object that has a `forward` and a `backward` method. PyTorch will then keep track of any computation we do to be able to properly run the backward pass, unless we set the `requires_grad` attribute of our tensors to `False`.

Writing one of these is (almost) as easy as writing our original classes. The difference is that we choose what to save and what to put in a context variable (so that we make sure we don't save anything we don't need), and we return the gradients in the `backward` pass. It's very rare to have to write your own `Function` but if you ever need something exotic or want to mess with the gradients of a regular function, here is how to write one:


In [None]:
from torch.autograd import Function

class MyRelu(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.clamp_min(0.)
        ctx.save_for_backward(i)
        return result
    
    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i>0).float()
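
A custom `Function` is not instantiated; it is invoked through its `apply` class method. The sketch below repeats the class so that it runs on its own and checks the gradient on a tiny hand-picked tensor:

```python
import torch
from torch.autograd import Function

class MyRelu(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.clamp_min(0.)
        ctx.save_for_backward(i)   # keep the input for the backward pass
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i > 0).float()

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
y = MyRelu.apply(x)    # forward pass through the custom function
y.sum().backward()     # autograd calls MyRelu.backward for us

print(y, x.grad)
```

The gradient is 0 where the input was negative and 1 where it was positive, matching the `relu_grad` function written earlier.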

The structure used to build a more complex model that takes advantage of those `Function`s is a `torch.nn.Module`. This is the base structure for all models, and all the neural nets you have seen up until now inherited from that class. It mostly helps to register all the trainable parameters, which as we've seen can be used in the training loop.

To implement an `nn.Module` you just need to:

- Make sure the superclass `__init__` is called first when you initialize it.
- Define any parameters of the model as attributes with `nn.Parameter`.
- Define a `forward` function that returns the output of your model.

As an example, here is the linear layer from scratch:


In [None]:
import torch.nn as nn

class LinearLayer(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * sqrt(2/n_in))
        self.bias = nn.Parameter(torch.zeros(n_out))
    
    def forward(self, x): return x @ self.weight.t() + self.bias

As you see, this class automatically keeps track of what parameters have been defined:


In [None]:
lin = LinearLayer(10,2)
p1,p2 = lin.parameters()
p1.shape,p2.shape

(torch.Size([2, 10]), torch.Size([2]))

It is thanks to this feature of `nn.Module` that we can just say `opt.step()` and have an optimizer loop through the parameters and update each one.

Note that in PyTorch, the weights are stored as an `n_out x n_in` matrix, which is why we have the transpose in the forward pass.
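
To make the two points above concrete, here is a sketch of what an optimizer step amounts to: looping over `parameters()` and updating each tensor in place. The class is repeated so the block runs on its own; the learning rate `lr` and the toy loss are arbitrary choices for this illustration:

```python
import torch
import torch.nn as nn
from math import sqrt

class LinearLayer(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * sqrt(2 / n_in))
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, x): return x @ self.weight.t() + self.bias

torch.manual_seed(0)
lin = LinearLayer(10, 2)
names = [name for name, _ in lin.named_parameters()]   # registered automatically

loss = lin(torch.randn(4, 10)).pow(2).mean()   # toy loss, for illustration only
loss.backward()

lr = 0.1                                # hypothetical learning rate
before = lin.weight.detach().clone()
with torch.no_grad():                   # the update itself must not be tracked
    for p in lin.parameters():
        p -= lr * p.grad
        p.grad.zero_()

# PyTorch's own linear layer uses the same n_out x n_in weight layout.
same_layout = nn.Linear(10, 2).weight.shape == lin.weight.shape
print(names, same_layout)
```

This is a plain SGD step; real optimizers add more bookkeeping, but they reach the parameters through the same registration mechanism.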

By using the linear layer from PyTorch (which uses the Kaiming initialization as well), the model we have been building up during this chapter can be written like this:


In [None]:
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
        
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)
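
As a usage sketch, the model can be instantiated with the sizes used throughout this chapter (100 inputs, 50 hidden units, 1 output) and run on random data. `mse` and `Model` are repeated so the block runs on its own:

```python
import torch
import torch.nn as nn

def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out))
        self.loss = mse

    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)

torch.manual_seed(0)
x, y = torch.randn(200, 100), torch.randn(200)

model = Model(100, 50, 1)
loss = model(x, y)   # forward pass returns the loss directly
loss.backward()      # autograd fills in .grad for every parameter

grad_shapes = [tuple(p.grad.shape) for p in model.parameters()]
print(loss.item(), grad_shapes)
```

Each gradient has the same shape as its parameter, so the two weight matrices and two biases are ready for an optimizer step.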

fastai provides its own variant of `Module` that is identical to `nn.Module`, but doesn't require you to call `super().__init__()` (it does that for you automatically):


In [None]:
class Model(Module):
    def __init__(self, n_in, nh, n_out):
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
        
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)

In the last chapter, we will start from such a model and see how to build a training loop from scratch and refactor it to what we've been using in previous chapters.


## Conclusion


In this chapter we explored the foundations of deep learning, beginning with matrix multiplication and moving on to implementing the forward and backward passes of a neural net from scratch. We then refactored our code to show how PyTorch works beneath the hood.

Here are a few things to remember:

- A neural net is basically a bunch of matrix multiplications with nonlinearities in between.
- Python is slow, so to write fast code we have to vectorize it and take advantage of techniques such as elementwise arithmetic and broadcasting.
- Two tensors are broadcastable if the dimensions starting from the end and going backward match (if they are the same, or one of them is 1). To make tensors broadcastable, we may need to add dimensions of size 1 with `unsqueeze` or a `None` index.
- Properly initializing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU nonlinearities.
- The backward pass is the chain rule applied multiple times, computing the gradients from the output of our model and going back, one layer at a time.
- When subclassing `nn.Module` (if not using fastai's `Module`) we have to call the superclass `__init__` method in our `__init__` method and we have to define a `forward` function that takes an input and returns the desired result.


## Questionnaire


1. Write the Python code to implement a single neuron.
1. Write the Python code to implement ReLU.
1. Write the Python code for a dense layer in terms of matrix multiplication.
1. Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python).
1. What is the "hidden size" of a layer?
1. What does the `t` method do in PyTorch?
1. Why is matrix multiplication written in plain Python very slow?
1. In `matmul`, why is `ac==br`?
1. In Jupyter Notebook, how do you measure the time taken for a single cell to execute?
1. What is "elementwise arithmetic"?
1. Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`.
1. What is a rank-0 tensor? How do you convert it to a plain Python data type?
1. What does this return, and why? `tensor([1,2]) + tensor([1])`
1. What does this return, and why? `tensor([1,2]) + tensor([1,2,3])`
1. How does elementwise arithmetic help us speed up `matmul`?
1. What are the broadcasting rules?
1. What is `expand_as`? Show an example of how it can be used to match the results of broadcasting.
1. How does `unsqueeze` help us to solve certain broadcasting problems?
1. How can we use indexing to do the same operation as `unsqueeze`?
1. How do we show the actual contents of the memory used for a tensor?
1. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)
1. Do broadcasting and `expand_as` result in increased memory use? Why or why not?
1. Implement `matmul` using Einstein summation.
1. What does a repeated index letter represent on the left-hand side of einsum?
1. What are the three rules of Einstein summation notation? Why?
1. What are the forward pass and backward pass of a neural network?
1. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?
1. What is the downside of having activations with a standard deviation too far away from 1?
1. How can weight initialization help avoid this problem?
1. What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU?
1. Why do we sometimes have to use the `squeeze` method in loss functions?
1. What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it?
1. What is the "chain rule"? Show the equation in either of the two forms presented in this chapter.
1. Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule.
1. What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory; try to figure it out using your knowledge of the shape of the function.)
1. In what order do we need to call the `*_grad` functions in the backward pass? Why?
1. What is `__call__`?
1. What methods must we implement when writing a `torch.autograd.Function`?
1. Write `nn.Linear` from scratch, and test that it works.
1. What is the difference between `nn.Module` and fastai's `Module`?


### Further Research


1. Implement ReLU as a `torch.autograd.Function` and train a model with it.
1. If you are mathematically inclined, find out what the gradients of a linear layer are in mathematical notation. Map that to the implementation we saw in this chapter.
1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2D convolution function. Then train a CNN that uses it.
1. Implement everything in this chapter using NumPy instead of PyTorch. 
