In [None]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#hide
from fastai.gen_doc.nbdoc import *

# A Neural Net from the Foundations

This chapter begins a journey where we will dig deep into the internals of the models we used in the previous chapters. We will be covering many of the same things we've seen before, but this time around we'll be looking much more closely at the implementation details, and much less closely at the practical issues of how and why things are as they are.


We will build everything from scratch, using only basic indexing into a tensor. We'll write a neural net from the ground up, then implement backpropagation manually, so we know exactly what's happening in PyTorch when we call `loss.backward`. We'll also see how to extend PyTorch with custom *autograd* functions that allow us to specify our own forward and backward computations.

## Building a Neural Net Layer from Scratch

Let's start by refreshing our understanding of how matrix multiplication is used in a basic neural network. Since we're building everything up from scratch, we'll use nothing but plain Python initially (except for indexing into PyTorch tensors), and then replace the plain Python with PyTorch functionality once we've seen how to create it.

### Modeling a Neuron

A neuron receives a given number of inputs and has an internal weight for each of them. It sums those weighted inputs and adds an inner bias to produce an output. In math, this can be written as:

$$ out = \sum_{i=1}^{n} x_{i} w_{i} + b$$


if we name our inputs $(x_{1},\dots,x_{n})$, our weights $(w_{1},\dots,w_{n})$, and our bias $b$. In code this translates into:


```python
output = sum([x*w for x,w in zip(inputs,weights)]) + bias
```


This output is then fed into a nonlinear function called an *activation function* before being sent to another neuron. In deep learning the most common of these is the *rectified linear unit*, or *ReLU*, which, as we've seen, is a fancy way of saying:
```python
def relu(x): return x if x >= 0 else 0
```
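
As a quick check of the two definitions above, here is a minimal sketch in plain Python that runs one neuron on made-up numbers. The helper `neuron` and the values of `inputs`, `weights`, and `bias` are assumptions chosen only for illustration:

```python
def relu(x): return x if x >= 0 else 0

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, followed by the activation
    output = sum([x*w for x,w in zip(inputs,weights)]) + bias
    return relu(output)

# Hypothetical values, chosen so the arithmetic is easy to follow
inputs  = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

pos = neuron(inputs, weights, bias=0.5)    # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
neg = neuron(inputs, weights, bias=-10.0)  # 4.5 - 10.0 = -5.5, clipped to 0 by the ReLU
```

The second call shows the effect of the activation: the weighted sum is negative, so the neuron outputs zero.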

A deep learning model is then built by stacking a lot of those neurons in successive layers. We create a first layer with a certain number of neurons (known as *hidden size*) and link all the inputs to each of those neurons. Such a layer is often called a *fully connected layer* or a *dense layer* (for densely connected), or a *linear layer*. 


It requires us to compute, for each `input` in our batch and each neuron with a given `weight`, the dot product:

```python
sum([x*w for x,w in zip(input,weight)])
```


If you have done a little bit of linear algebra, you may remember that having a lot of those dot products happens when you do a *matrix multiplication*. More precisely, if our inputs are in a matrix `x` with a size of `batch_size` by `n_inputs`, and if we have grouped the weights of our neurons in a matrix `w` of size `n_neurons` by `n_inputs` (each neuron must have the same number of weights as it has inputs) and all the biases in a vector `b` of size `n_neurons`, then the output of this fully connected layer is:


```python
y = x @ w.t() + b
```


where `@` represents the matrix product and `w.t()` is the transpose matrix of `w`. The output `y` is then of size `batch_size` by `n_neurons`, and in position `(i,j)` we have (for the mathy folks out there):


$$y_{i,j} = \sum_{k=1}^{n} x_{i,k} w_{j,k} + b_{j}$$

Or in code:


```python
y[i,j] = sum([xk * wk for xk,wk in zip(x[i,:],w[j,:])]) + b[j]
```


The transpose is necessary because in the mathematical definition of the matrix product `m @ n`, the coefficient `(i,j)` is:


```python
sum([a * b for a,b in zip(m[i,:],n[:,j])])
```
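
To see that the transposed product and the per-element formula agree, the following sketch compares them on small random tensors. The sizes (a batch of 2, 4 inputs, 3 neurons) are arbitrary assumptions made for the example, and `torch.allclose` is used only to compare the two results up to rounding error:

```python
import torch

batch_size, n_inputs, n_neurons = 2, 4, 3   # illustrative sizes only
x = torch.randn(batch_size, n_inputs)
w = torch.randn(n_neurons, n_inputs)        # one row of weights per neuron
b = torch.randn(n_neurons)

y = x @ w.t() + b                           # shape: batch_size x n_neurons

# Rebuild every entry from the per-element formula
y_loop = torch.zeros(batch_size, n_neurons)
for i in range(batch_size):
    for j in range(n_neurons):
        y_loop[i,j] = sum([xk * wk for xk,wk in zip(x[i,:], w[j,:])]) + b[j]

same = torch.allclose(y, y_loop, atol=1e-5)
```

Because each neuron's weights are stored as a row of `w`, the dot product pairs row `i` of `x` with row `j` of `w`, which is exactly column `j` of `w.t()`.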


So the very basic operation we need is a matrix multiplication, as it's what is hidden in the core of a neural net.

### Matrix Multiplication from Scratch

Let's write a function that computes the matrix product of two tensors, before we allow ourselves to use the PyTorch version of it. We will only use the indexing in PyTorch tensors:

In [None]:
import torch
from torch import tensor

We'll need three nested `for` loops: one for the row indices, one for the column indices, and one for the inner sum. `ac` and `ar` stand for number of columns of `a` and number of rows of `a`, respectively (the same convention is followed for `b`), and we make sure calculating the matrix product is possible by checking that `a` has as many columns as `b` has rows:

In [None]:
def matmul(a,b):
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c
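
Before relying on this function, it is worth checking that it gives the right answer. The sketch below repeats the definition so that it runs on its own, then compares the result with PyTorch's built-in `@` (used here only as a reference) on small random inputs whose sizes are chosen for illustration:

```python
import torch

def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c

# Small illustrative inputs so the loops finish instantly
a = torch.randn(3, 4)
b = torch.randn(4, 2)
ok = torch.allclose(matmul(a, b), a @ b, atol=1e-5)

# Mismatched inner dimensions trip the assert
try:
    matmul(a, torch.randn(3, 2))
    raised = False
except AssertionError:
    raised = True
```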

To test this out, we'll pretend (using random matrices) that we're working with a small batch of 5 MNIST images, each flattened into a vector of 28×28 = 784 values, with a linear model to turn them into 10 activations:

In [None]:
m1 = torch.randn(5,28*28)
m2 = torch.randn(784,10)

Let's time our function, using the Jupyter "magic" command `%time`:

In [None]:
%time t1=matmul(m1, m2)

CPU times: user 1.15 s, sys: 4.09 ms, total: 1.15 s
Wall time: 1.15 s


And see how that compares to PyTorch's built-in `@`:

In [None]:
%timeit -n 20 t2=m1@m2

14 µs ± 8.95 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


As we can see, in Python three nested loops is a very bad idea! Python is a slow language, and this isn't going to be very efficient. We see here that PyTorch is around 100,000 times faster than Python—and that's before we even start using the GPU!


Where does this difference come from? PyTorch didn't write its matrix multiplication in Python, but rather in C++ to make it fast. In general, whenever we do computations on tensors we will need to *vectorize* them so that we can take advantage of the speed of PyTorch, usually by using two techniques: elementwise arithmetic and broadcasting.

### Elementwise Arithmetic

All the basic operators (`+`, `-`, `*`, `/`, `>`, `<`, `==`) can be applied elementwise. That means if we write `a+b` for two tensors `a` and `b` that have the same shape, we will get a tensor composed of the sums of the elements of `a` and `b`:

In [None]:
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a + b

tensor([12., 14.,  3.])

The Boolean operators return a tensor of Booleans:

In [None]:
a < b

tensor([False,  True,  True])

If we want to know if every element of `a` is less than the corresponding element in `b`, or if two tensors are equal, we need to combine those elementwise operations with `torch.all`:

In [None]:
(a < b).all(), (a==b).all()

(tensor(False), tensor(False))

Reduction operations like `all()`, `sum()` and `mean()` return tensors with only one element, called rank-0 tensors. If you want to convert this to a plain Python Boolean or number, you need to call `.item()`:

In [None]:
(a + b).mean().item()

9.666666984558105

The elementwise operations work on tensors of any rank, as long as they have the same shape:

In [None]:
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m*m

tensor([[ 1.,  4.,  9.],
        [16., 25., 36.],
        [49., 64., 81.]])

However, you can't perform elementwise operations on tensors that don't have the same shape (unless they are broadcastable, as discussed in the next section):

In [None]:
n = tensor([[1., 2, 3], [4,5,6]])
m*n

RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

With elementwise arithmetic, we can remove one of our three nested loops: we can multiply the tensors that correspond to the `i`-th row of `a` and the `j`-th column of `b` before summing all the elements, which will speed things up because the inner loop will now be executed by PyTorch at C speed. 


To access one column or row, we can simply write `a[i,:]` or `b[:,j]`. The `:` means take everything in that dimension. We could restrict this and take only a slice of that particular dimension by passing a range, like `1:5`, instead of just `:`. In that case, we would take the elements in columns or rows 1 to 4 (the second number is noninclusive). 
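
A short sketch of these indexing forms, using a small 4×6 tensor built only for illustration (the variable names are assumptions, not part of the chapter's code):

```python
import torch

a = torch.arange(24.).reshape(4, 6)   # rows 0..3, columns 0..5

row   = a[1,:]      # second row, all six columns
col   = a[:,2]      # third column, all four rows
part  = a[1,1:5]    # second row, columns 1 to 4 (5 is excluded)
block = a[1:3,1:5]  # rows 1 to 2, columns 1 to 4
```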


One simplification is that we can always omit a trailing colon, so `a[i,:]` can be abbreviated to `a[i]`. With all of that in mind, we can write a new version of our matrix multiplication:

In [None]:
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc): c[i,j] = (a[i] * b[:,j]).sum()
    return c

In [None]:
%timeit -n 20 t3 = matmul(m1,m2)

1.7 ms ± 88.1 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


We're already ~700 times faster, just by removing that inner `for` loop! And that's just the beginning: with broadcasting we can remove another loop and get an even bigger speedup.

### Broadcasting

As we discussed in <<chapter_mnist_basics>>, broadcasting is a term introduced by the [NumPy library](https://docs.scipy.org/doc/) that describes how tensors of different ranks are treated during arithmetic operations. For instance, it's obvious there is no way to add a 3×3 matrix with a 4×5 matrix, but what if we want to add one scalar (which can be represented as a 1×1 tensor) with a matrix? Or a vector of size 3 with a 3×4 matrix? In both cases, we can find a way to make sense of this operation.


Broadcasting gives specific rules to codify when shapes are compatible when trying to do an elementwise operation, and how the tensor of the smaller shape is expanded to match the tensor of the bigger shape. It's essential to master those rules if you want to be able to write code that executes quickly. In this section, we'll expand our previous treatment of broadcasting to understand these rules.

#### Broadcasting with a scalar

Broadcasting with a scalar is the easiest type of broadcasting. When we have a tensor `a` and a scalar, we just imagine a tensor of the same shape as `a` filled with that scalar and perform the operation:

In [None]:
a = tensor([10., 6, -4])
a > 0

tensor([ True,  True, False])

How are we able to do this comparison? `0` is being *broadcast* to have the same dimensions as `a`. Note that this is done without creating a tensor full of zeros in memory (that would be very inefficient). 


This is very useful if you want to normalize your dataset by subtracting the mean (a scalar) from the entire data set (a matrix) and dividing by the standard deviation (another scalar):

In [None]:
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
(m - 5) / 2.73

tensor([[-1.4652, -1.0989, -0.7326],
        [-0.3663,  0.0000,  0.3663],
        [ 0.7326,  1.0989,  1.4652]])
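
The constants 5 and 2.73 used above are (approximately) the mean and standard deviation of `m`. As a minimal sketch, the same normalization can be written by computing both statistics from the tensor itself. Each one is a rank-0 tensor that is broadcast over the whole matrix:

```python
import torch
from torch import tensor

m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
mean, std = m.mean(), m.std()    # two rank-0 tensors, about 5.0 and 2.74
normed = (m - mean) / std        # both scalars are broadcast over all of m
```

After this step, `normed` has a mean of about 0 and a standard deviation of about 1.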

What if you have a different mean for each row of the matrix? In that case you will need to broadcast a vector to a matrix.

#### Broadcasting a vector to a matrix

We can broadcast a vector to a matrix as follows:

In [None]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m.shape,c.shape

(torch.Size([3, 3]), torch.Size([3]))

In [None]:
m + c

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

Here the elements of `c` are expanded to make three rows that match, making the operation possible. Again, PyTorch doesn't actually create three copies of `c` in memory. This is done by the `expand_as` method behind the scenes:

In [None]:
c.expand_as(m)

tensor([[10., 20., 30.],
        [10., 20., 30.],
        [10., 20., 30.]])

If we look at the corresponding tensor, we can ask for its `storage` property (which shows the actual contents of the memory used for the tensor) to check there is no useless data stored:

In [None]:
t = c.expand_as(m)
t.storage()

 10.0
 20.0
 30.0
[torch.FloatStorage of size 3]

Even though the tensor officially has nine elements, only three scalars are stored in memory. This is possible thanks to the clever trick of giving that dimension a *stride* of 0 (which means that when PyTorch looks for the next row by adding the stride, it doesn't move):

In [None]:
t.stride(), t.shape

((0, 1), torch.Size([3, 3]))
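
The stride of a dimension says how many stored elements to skip when the index along that dimension increases by one. As a sketch of what that means here, and assuming the data starts at the beginning of the storage (which holds for a freshly created `c`), the element at position `(i,j)` of `t` should be the stored scalar at offset `i*stride[0] + j*stride[1]`:

```python
import torch
from torch import tensor

c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
t = c.expand_as(m)

s0, s1 = t.stride()             # (0, 1): moving down a row skips nothing
matches = all(
    t[i,j] == c[i*s0 + j*s1]    # c holds the three stored scalars
    for i in range(3) for j in range(3)
)
```

With a row stride of 0, every row of `t` reads the same three numbers, which is why no copy is needed.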

Since `m` is of size 3×3, there are two ways to do broadcasting. The fact it was done on the last dimension is a convention that comes from the rules of broadcasting and has nothing to do with the way we ordered our tensors. If instead we do this, we get the same result:

In [None]:
c + m

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])

In fact, it's only possible to broadcast a vector of size `n` with a matrix of size `m` by `n`:

In [None]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6]])
c+m

tensor([[11., 22., 33.],
        [14., 25., 36.]])

This won't work:

In [None]:
c = tensor([10.,20])
m = tensor([[1., 2, 3], [4,5,6]])
c+m

RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

If we want to broadcast in the other dimension, we have to change the shape of our vector to make it a 3×1 matrix. This is done with the `unsqueeze` method in PyTorch:

In [None]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
c = c.unsqueeze(1)
m.shape,c.shape

(torch.Size([3, 3]), torch.Size([3, 1]))

This time, `c` is expanded on the column side:

In [None]:
c+m

tensor([[11., 12., 13.],
        [24., 25., 26.],
        [37., 38., 39.]])

Like before, only three scalars are stored in memory:

In [None]:
t = c.expand_as(m)
t.storage()

 10.0
 20.0
 30.0
[torch.FloatStorage of size 3]

And the expanded tensor has the right shape because the column dimension has a stride of 0:

In [None]:
t.stride(), t.shape

((1, 0), torch.Size([3, 3]))
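
This column-wise broadcast is what the earlier question about a different mean for each row calls for. A minimal sketch, with variable names chosen for illustration, subtracts each row's own mean. The second form uses the `keepdim=True` argument of `mean`, which keeps the reduced dimension with size 1 so that no `unsqueeze` is needed:

```python
import torch
from torch import tensor

m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
row_means = m.mean(dim=1)                 # shape [3]: one mean per row
centered  = m - row_means.unsqueeze(1)    # [3, 3] - [3, 1] broadcasts across columns

# keepdim=True keeps the reduced dimension as size 1
centered2 = m - m.mean(dim=1, keepdim=True)
```

Without the `unsqueeze(1)`, the vector of means would be lined up with the columns instead, and each column would lose the mean of a different row.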

With broadcasting, if we need to add dimensions, they are added at the beginning by default. When we were broadcasting before, PyTorch was doing `c.unsqueeze(0)` behind the scenes:

In [None]:
c = tensor([10.,20,30])
c.shape, c.unsqueeze(0).shape,c.unsqueeze(1).shape

(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

The `unsqueeze` command can be replaced by `None` indexing:

In [None]:
c.shape, c[None,:].shape,c[:,None].shape

(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))

You can always omit trailing colons, and `...` means all preceding dimensions:

In [None]:
c[None].shape,c[...,None].shape

(torch.Size([1, 3]), torch.Size([3, 1]))

With this, we can remove another `for` loop in our matrix multiplication function. Now, instead of multiplying `a[i]` with `b[:,j]`, we can multiply `a[i]` with the whole matrix `b` using broadcasting, then sum the results:

In [None]:
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
#       c[i,j] = (a[i,:]          * b[:,j]).sum() # previous
        c[i]   = (a[i  ].unsqueeze(-1) * b).sum(dim=0)
    return c

In [None]:
%timeit -n 20 t4 = matmul(m1,m2)

357 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
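
The one-line loop body packs several shape changes together. The sketch below unrolls it for a single row, using random tensors with the same sizes as `m1` and `m2`. The intermediate names are introduced here only for illustration:

```python
import torch

a = torch.randn(5, 784)
b = torch.randn(784, 10)
i = 0

row      = a[i]                  # shape [784]
col_vec  = row.unsqueeze(-1)     # shape [784, 1]
products = col_vec * b           # [784, 1] * [784, 10] -> [784, 10]
out_row  = products.sum(dim=0)   # sum over k -> shape [10]

same = torch.allclose(out_row, (a @ b)[i], atol=1e-3)
```

Entry `(k,j)` of `products` is `a[i,k] * b[k,j]`, so summing over the first dimension gives the whole `i`-th row of the result at once.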


We're now roughly 3,200 times faster than our first implementation (1.15 s versus 357 µs)! Before we move on, let's discuss the rules of broadcasting in a little more detail.

#### Broadcasting rules

When operating on two tensors, PyTorch compares their shapes elementwise. It starts with the *trailing dimensions* and works its way backward, adding 1 when it meets empty dimensions. Two dimensions are *compatible* when one of the following is true:


- They are equal.
- One of them is 1, in which case that dimension is broadcast to make it the same as the other.


Arrays do not need to have the same number of dimensions. For example, if you have a 256×256×3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with three values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible:

```
Image  (3d tensor): 256 x 256 x 3
Scale  (1d tensor):  (1)   (1)  3
Result (3d tensor): 256 x 256 x 3
```
    
However, a 2D tensor of size 256×256 isn't compatible with our image:


```
Image  (3d tensor): 256 x 256 x   3
Scale  (2d tensor):  (1)  256 x 256
Error
```


In our earlier examples with a 3×3 matrix and a vector of size 3, broadcasting was done on the rows:

```
Matrix (2d tensor):   3 x 3
Vector (1d tensor): (1)   3
Result (2d tensor):   3 x 3
```
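
The rules above can be sketched in plain Python. The helper `broadcast_shape` below is hypothetical: it mirrors the description in this section (pad with leading 1s, then compare dimension by dimension) and is not PyTorch's actual implementation:

```python
def broadcast_shape(shape_a, shape_b):
    "Return the broadcast result shape, or raise ValueError if incompatible."
    # Pad the shorter shape with leading 1s so both have the same length
    n = max(len(shape_a), len(shape_b))
    a = (1,) * (n - len(shape_a)) + tuple(shape_a)
    b = (1,) * (n - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1: result.append(da)
        elif da == 1:           result.append(db)
        else: raise ValueError(f"incompatible sizes {da} and {db}")
    return tuple(result)

image_scale = broadcast_shape((256, 256, 3), (3,))   # (256, 256, 3)
matrix_vec  = broadcast_shape((3, 3), (3,))          # (3, 3)
try:
    broadcast_shape((256, 256, 3), (256, 256))
    failed = False
except ValueError:
    failed = True                                    # 3 vs 256 in the last dimension
```

The three calls reproduce the three alignment tables shown in this section: two compatible cases and one error.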


As an exercise, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size `64 x 3 x 256 x 256` with vectors of three elements (one for the mean and one for the standard deviation).

Another useful way of simplifying tensor manipulations is the use of the Einstein summation convention.

### Einstein Summation

Before using the PyTorch operation `@` or `torch.matmul`, there is one last way we can implement matrix multiplication: Einstein summation (`einsum`). This is a compact representation for combining products and sums in a general way. We write an equation like this:


```
ik,kj -> ij
```


The lefthand side represents the operands' dimensions, separated by commas. Here we have two tensors that each have two dimensions (`i,k` and `k,j`). The righthand side represents the result dimensions, so here we have a tensor with two dimensions `i,j`.

The rules of Einstein summation notation are as follows:


1. Repeated indices are implicitly summed over.
1. Each index can appear at most twice in any term.
1. Each term must contain identical nonrepeated indices.


So in our example, since `k` is repeated, we sum over that index. In the end the formula represents the matrix obtained when we put in `(i,j)` the sum of all the coefficients `(i,k)` in the first tensor multiplied by the coefficients `(k,j)` in the second tensor... which is the matrix product! Here is how we can code this in PyTorch:

In [None]:
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
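
Spelled out with indices, the string `ik,kj->ij` stands for the following sum, where $a$ and $b$ are the two operands and $c$ is the result (the symbol $c$ is introduced here only to name the output):

$$c_{i,j} = \sum_{k} a_{i,k} b_{k,j}$$

The index $k$ appears in both operands but not in the result, so it is the one being summed over.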

Einstein summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have just one member on the lefthand side. For instance, this:


```python
torch.einsum('ij->ji', a)
```


returns the transpose of the matrix `a`. You can also have three or more members. This:


```python
torch.einsum('bi,ij,bj->b', a, b, c)
```


will return a vector of size `b` where the `k`-th coordinate is the sum of `a[k,i] b[i,j] c[k,j]` over all `i` and `j`. This notation is particularly convenient when you have more dimensions because of batches. For example, if you have two batches of matrices and want to compute the matrix product per batch, you could do this:

```python
torch.einsum('bik,bkj->bij', a, b)
```
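
The three `einsum` expressions in this section can each be checked against an equivalent written with ordinary tensor operations. This is a sketch on small random tensors. The shapes and the names `x`, `w`, and `y` are assumptions chosen for the example:

```python
import torch

a = torch.randn(2, 3, 4)     # a batch of two 3x4 matrices
b = torch.randn(2, 4, 5)     # a batch of two 4x5 matrices

# Batched matrix product: b is kept, k is summed
batched = torch.einsum('bik,bkj->bij', a, b)
same_batched = torch.allclose(batched, a @ b, atol=1e-5)

# Transpose: no repeated index, so nothing is summed
m = torch.randn(3, 4)
same_transpose = torch.equal(torch.einsum('ij->ji', m), m.t())

# Three operands: for each b, sum over i and j of x[b,i] * w[i,j] * y[b,j]
x, w, y = torch.randn(2, 3), torch.randn(3, 4), torch.randn(2, 4)
three = torch.einsum('bi,ij,bj->b', x, w, y)
same_three = torch.allclose(three, ((x @ w) * y).sum(dim=1), atol=1e-4)
```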


Let's go back to our new `matmul` implementation using `einsum` and look at its speed:

In [None]:
%timeit -n 20 t5 = matmul(m1,m2)

68.7 µs ± 4.06 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)


As you can see, not only is it practical, but it's *very* fast. `einsum` is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA. (But it's generally not as fast as carefully optimized CUDA code, as you see from the results in "Matrix Multiplication from Scratch".)

Now that we know how to implement a matrix multiplication from scratch, we are ready to build our neural net—specifically its forward and backward passes—using just matrix multiplications.

## The Forward and Backward Passes

As we saw in <<chapter_mnist_basics>>, to train a model, we will need to compute all the gradients of a given loss with respect to its parameters, which is known as the *backward pass*. The *forward pass* is where we compute the output of the model on a given input, based on the matrix products. As we define our first neural net, we will also delve into the problem of properly initializing the weights, which is crucial for making training start properly.

### Defining and Initializing a Layer

We will take the example of a two-layer neural net first. As we've seen, one layer can be expressed as `y = x @ w + b`, with `x` our inputs, `y` our outputs, `w` the weights of the layer (of size number of inputs by number of neurons, since we no longer transpose as we did before), and `b` the bias vector:

In [None]:
def lin(x, w, b): return x @ w + b

We can stack the second layer on top of the first, but since mathematically the composition of two linear operations is another linear operation, this only makes sense if we put something nonlinear in the middle, called an activation function. As mentioned at the beginning of the chapter, in deep learning applications the activation function most commonly used is a ReLU, which returns the maximum of `x` and `0`. 
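
The claim that two stacked linear layers collapse into one can be checked numerically. The sketch below uses small random tensors in `float64` (sizes chosen only for illustration) and compares the stacked layers with a single layer whose weight is `w1 @ w2` and whose bias is `b1 @ w2 + b2`:

```python
import torch

def lin(x, w, b): return x @ w + b

# float64 keeps rounding error small for the comparison
x  = torch.randn(8, 5, dtype=torch.float64)
w1 = torch.randn(5, 4, dtype=torch.float64); b1 = torch.randn(4, dtype=torch.float64)
w2 = torch.randn(4, 2, dtype=torch.float64); b2 = torch.randn(2, dtype=torch.float64)

stacked   = lin(lin(x, w1, b1), w2, b2)
# (x @ w1 + b1) @ w2 + b2  ==  x @ (w1 @ w2) + (b1 @ w2 + b2)
collapsed = lin(x, w1 @ w2, b1 @ w2 + b2)
same = torch.allclose(stacked, collapsed)

# With a ReLU in between, the two no longer agree
with_relu = lin(lin(x, w1, b1).clamp_min(0.), w2, b2)
differs = not torch.allclose(with_relu, collapsed)
```

The nonlinearity is what prevents the collapse, so it is what lets the second layer add something the first could not express.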


We won't actually train our model in this chapter, so we'll use random tensors for our inputs and targets. Let's say our inputs are 200 vectors of size 100, which we group into one batch, and our targets are 200 random floats:

In [None]:
x = torch.randn(200, 100)
y = torch.randn(200)

For our two-layer model we will need two weight matrices and two bias vectors. Let's say we have a hidden size of 50 and the output size is 1 (for one of our inputs, the corresponding output is one float in this toy example). We initialize the weights randomly and the bias at zero:

In [None]:
w1 = torch.randn(100,50)
b1 = torch.zeros(50)
w2 = torch.randn(50,1)
b2 = torch.zeros(1)

Then the result of our first layer is simply:

In [None]:
l1 = lin(x, w1, b1)
l1.shape

torch.Size([200, 50])

Note that this formula works with our batch of inputs, and returns a batch of hidden state: `l1` is a matrix of size 200 (our batch size) by 50 (our hidden size).


There is a problem with the way our model was initialized, however. To understand it, we need to look at the mean and standard deviation (std) of `l1`:

In [None]:
l1.mean(), l1.std()

(tensor(0.0019), tensor(10.1058))

The mean is close to zero, which is understandable since both our input and weight matrices have means close to zero. But the standard deviation, which represents how far away our activations go from the mean, went from 1 to 10. This is a really big problem because that's with just one layer. Modern neural nets can have hundreds of layers, so if each of them multiplies the scale of our activations by 10, by the end of the last layer we won't have numbers representable by a computer.

Indeed, if we make just 50 multiplications between `x` and random matrices of size 100×100, we'll have:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = x @ torch.randn(100,100)
x[0:5,0:5]

tensor([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])

The result is `nan`s everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights? But if we use weights that are too small, we will have the opposite problem: the scale of our activations will go from 1 to 0.1 at each layer, and after 50 layers we'll be left with zeros everywhere:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.01)
x[0:5,0:5]

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

So we have to scale our weight matrices exactly right so that the standard deviation of our activations stays at 1. We can compute the exact value to use mathematically, as illustrated by Xavier Glorot and Yoshua Bengio in ["Understanding the Difficulty of Training Deep Feedforward Neural Networks"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). The right scale for a given layer is $1/\sqrt{n_{in}}$, where $n_{in}$ represents the number of inputs.
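
One informal way to see where this scale comes from is a simplified variance calculation (an intuition under stated assumptions, not the paper's full argument). Assume the inputs $x_{k}$ and weights $w_{k}$ are independent with mean 0, the inputs have variance 1, and the weights have variance $\sigma_w^{2}$ (a symbol introduced here for the example). Each activation is a sum of $n_{in}$ products, so its variance is:

$$\mathrm{Var}\left(\sum_{k=1}^{n_{in}} x_{k} w_{k}\right) = \sum_{k=1}^{n_{in}} \mathrm{Var}(x_{k})\,\mathrm{Var}(w_{k}) = n_{in}\,\sigma_w^{2}$$

Setting this equal to 1 gives $\sigma_w = 1/\sqrt{n_{in}}$. With $n_{in} = 100$ and unscaled weights ($\sigma_w = 1$), the standard deviation is $\sqrt{100} = 10$, which matches the value measured for `l1` earlier.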


In our case, if we have 100 inputs, we should scale our weight matrices by 0.1:

In [None]:
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.1)
x[0:5,0:5]

tensor([[ 0.7554,  0.6167, -0.1757, -1.5662,  0.5644],
        [-0.1987,  0.6292,  0.3283, -1.1538,  0.5416],
        [ 0.6106,  0.2556, -0.0618, -0.9463,  0.4445],
        [ 0.4484,  0.7144,  0.1164, -0.8626,  0.4413],
        [ 0.3463,  0.5930,  0.3375, -0.9486,  0.5643]])

Finally some numbers that are neither zeros nor `nan`s! Notice how stable the scale of our activations is, even after those 50 fake layers:


In [None]:
x.std()

tensor(0.7042)

If you play a little bit with the value for scale you'll notice that even a slight variation from 0.1 will get you either to very small or very large numbers, so initializing the weights properly is extremely important. 
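
A rough back-of-the-envelope calculation shows why the scale is so sensitive. Assuming independent, zero-mean weights and inputs, each layer multiplies the standard deviation of the activations by about `scale * sqrt(n_in)` on average, so after 50 layers the overall factor is that number raised to the 50th power. The helper name `expected_growth` below is ours, purely for illustration:

```python
from math import sqrt

def expected_growth(scale, n_in=100, n_layers=50):
    "Approximate factor by which the activation std changes after `n_layers` linear layers."
    per_layer = scale * sqrt(n_in)   # average std multiplier contributed by one layer
    return per_layer ** n_layers

for scale in (0.09, 0.10, 0.11):
    print(f"scale={scale:.2f} -> factor ~ {expected_growth(scale):.3g}")
```

Under these assumptions, a 10% change in the scale moves the result by about two orders of magnitude in either direction. A real run fluctuates around these values because the weights are random.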



Let's go back to our neural net. Since we messed a bit with our inputs, we need to redefine them:


In [None]:
x = torch.randn(200, 100)
y = torch.randn(200)

And for our weights, we'll use the right scale, which is known as *Xavier initialization* (or *Glorot initialization*):


In [None]:
from math import sqrt
w1 = torch.randn(100,50) / sqrt(100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) / sqrt(50)
b2 = torch.zeros(1)

Now if we compute the result of the first layer, we can check that the mean and standard deviation are under control:


In [None]:
l1 = lin(x, w1, b1)
l1.mean(),l1.std()

(tensor(-0.0050), tensor(1.0000))

Very good. Now we need to go through a ReLU, so let's define one. A ReLU removes the negatives and replaces them with zeros, which is another way of saying it clamps our tensor at zero:


In [None]:
def relu(x): return x.clamp_min(0.)

We pass our activations through this:


In [None]:
l2 = relu(l1)
l2.mean(),l2.std()

(tensor(0.3961), tensor(0.5783))
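
These two numbers are close to what theory predicts. If the pre-activations were exactly standard normal (an idealization of `l1`), the ReLU output would have mean $1/\sqrt{2\pi}$ and second moment $1/2$. A quick check, using nothing but the `math` module:

```python
from math import pi, sqrt

# For z ~ N(0, 1): E[relu(z)] = 1/sqrt(2*pi) and E[relu(z)^2] = 1/2
mean = 1 / sqrt(2 * pi)
std = sqrt(0.5 - mean ** 2)   # Var = E[relu(z)^2] - E[relu(z)]^2
print(round(mean, 4), round(std, 4))   # 0.3989 0.5838
```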

And we're back to square one: the mean of our activations has gone to 0.4 (which is understandable since we removed the negatives) and the std went down to 0.58. So like before, after a few layers we will probably wind up with zeros:


In [None]:
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * 0.1))
x[0:5,0:5]

tensor([[0.0000e+00, 1.9689e-08, 4.2820e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.6701e-08, 4.3501e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.0976e-08, 3.0411e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.8457e-08, 4.9469e-08, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 1.9949e-08, 4.1643e-08, 0.0000e+00, 0.0000e+00]])

This means our initialization wasn't right. Why? At the time Glorot and Bengio wrote their article, the popular activation in a neural net was the hyperbolic tangent (tanh, which is the one they used), and that initialization doesn't account for our ReLU. Fortunately, someone else has done the math for us and computed the right scale to use. In ["Delving Deep into Rectifiers: Surpassing Human-Level Performance"](https://arxiv.org/abs/1502.01852), Kaiming He et al. (the same authors who went on to introduce the ResNet) show that we should use the following scale instead: $\sqrt{2 / n_{in}}$, where $n_{in}$ is the number of inputs of the layer. Let's see what this gives us:


In [None]:
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * sqrt(2/100)))
x[0:5,0:5]

tensor([[0.2871, 0.0000, 0.0000, 0.0000, 0.0026],
        [0.4546, 0.0000, 0.0000, 0.0000, 0.0015],
        [0.6178, 0.0000, 0.0000, 0.0180, 0.0079],
        [0.3333, 0.0000, 0.0000, 0.0545, 0.0000],
        [0.1940, 0.0000, 0.0000, 0.0000, 0.0096]])
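
To see the two scales side by side, here is a small sketch that pushes the same kind of random input through 50 ReLU layers under each initialization and reports the root-mean-square of the final activations. The helper `rms_after` and the fixed seed are our own additions for illustration:

```python
from math import sqrt
import torch

def rms_after(scale, n=100, n_layers=50):
    "Root-mean-square of the activations after `n_layers` random linear+ReLU layers."
    x = torch.randn(200, n)
    for _ in range(n_layers):
        x = (x @ (torch.randn(n, n) * scale)).clamp_min(0.)
    return x.pow(2).mean().sqrt().item()

torch.manual_seed(0)
xavier  = rms_after(1 / sqrt(100))   # Xavier scale, derived without ReLU in mind
kaiming = rms_after(sqrt(2 / 100))   # extra factor sqrt(2) compensates for ReLU
print(f"Xavier: {xavier:.2e}, Kaiming: {kaiming:.2e}")
```

Since a ReLU zeroes roughly half of its inputs, each layer halves the mean of the squared activations under the Xavier scale. The factor of 2 in the Kaiming scale cancels that loss.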

That's better: our numbers aren't all zeroed this time. So let's go back to the definition of our neural net and use this initialization (which is named *Kaiming initialization* or *He initialization*):


In [None]:
x = torch.randn(200, 100)
y = torch.randn(200)

In [None]:
w1 = torch.randn(100,50) * sqrt(2 / 100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) * sqrt(2 / 50)
b2 = torch.zeros(1)

Let's look at the scale of our activations after going through the first linear layer and ReLU:


In [None]:
l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()

(tensor(0.5661), tensor(0.8339))

Much better! Now that our weights are properly initialized, we can define our whole model:


In [None]:
def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

This is the forward pass. Now all that's left to do is to compare our output to the labels we have (random numbers, in this example) with a loss function. In this case, we will use the mean squared error. (It's a toy problem, and this is the easiest loss function to use for what is next, computing the gradients.)



The only subtlety is that our outputs and targets don't have exactly the same shape. After going through the model, we get an output like this:


In [None]:
out = model(x)
out.shape

torch.Size([200, 1])

To get rid of this trailing 1 dimension, we use the `squeeze` function:


In [None]:
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
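
The `squeeze` is not cosmetic. Without it, subtracting a tensor of shape `(200,)` from one of shape `(200, 1)` broadcasts to a `(200, 200)` matrix instead of raising an error, and the mean would then be taken over all 40,000 pairwise differences. A minimal illustration with stand-in tensors (random values, not the model above):

```python
import torch

out  = torch.randn(200, 1)   # stand-in for a model output
targ = torch.randn(200)      # stand-in for the targets

print((out - targ).shape)              # torch.Size([200, 200]): silent broadcasting
print((out.squeeze(-1) - targ).shape)  # torch.Size([200]): one difference per sample
```

Passing `-1` restricts the squeeze to the last axis, so a batch of size 1 keeps its batch dimension.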

And now we are ready to compute our loss:


In [None]:
loss = mse(out, y)

That's all for the forward pass—let's now look at the gradients.


### Gradients and the Backward Pass


We've seen that PyTorch computes all the gradients we need with a magic call to `loss.backward`, but let's explore what's happening behind the scenes.



Now comes the part where we need to compute the gradients of the loss with respect to all the weights of our model, so all the floats in `w1`, `b1`, `w2`, and `b2`. For this, we will need a bit of math—specifically the *chain rule*. This is the rule of calculus that guides how we can compute the derivative of a composed function:



$$(g \circ f)'(x) = g'(f(x)) f'(x)$$


> j: I find this notation very hard to wrap my head around, so instead I like to think of it as: if `y = g(u)` and `u=f(x)`; then `dy/dx = dy/du * du/dx`. The two notations mean the same thing, so use whatever works for you.
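
As a quick sanity check of the rule, we can compare it against a finite-difference estimate. The functions `f` and `g` below are arbitrary choices made for this illustration:

```python
from math import sin, cos

f = lambda x: x ** 2      # inner function u = f(x)
g = lambda u: sin(u)      # outer function y = g(u)

x = 1.5
analytic = cos(f(x)) * 2 * x                      # dy/du * du/dx
h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)   # central finite difference
print(analytic, numeric)   # the two values agree to several decimal places
```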


Our loss is a big composition of different functions: mean squared error (which is in turn the composition of a mean and a power of two), the second linear layer, a ReLU and the first linear layer. For instance, if we want the gradients of the loss with respect to `b2` and our loss is defined by:



```
loss = mse(out,y) = mse(lin(l2, w2, b2), y)
```



The chain rule tells us that we have:
$$\frac{\text{d} loss}{\text{d} b_{2}} = \frac{\text{d} loss}{\text{d} out} \times \frac{\text{d} out}{\text{d} b_{2}} = \frac{\text{d}}{\text{d} out} mse(out, y) \times \frac{\text{d}}{\text{d} b_{2}} lin(l_{2}, w_{2}, b_{2})$$



To compute the gradients of the loss with respect to $b_{2}$, we first need the gradients of the loss with respect to our output $out$. It's the same if we want the gradients of the loss with respect to $w_{2}$. Then, to get the gradients of the loss with respect to $b_{1}$ or $w_{1}$, we will need the gradients of the loss with respect to $l_{1}$, which in turn requires the gradients of the loss with respect to $l_{2}$, which will need the gradients of the loss with respect to $out$.



So to compute all the gradients we need for the update, we need to begin from the output of the model and work our way *backward*, one layer after the other—which is why this step is known as *backpropagation*. We can automate it by having each function we implemented (`relu`, `mse`, `lin`) provide its backward step: that is, how to derive the gradients of the loss with respect to the input(s) from the gradients of the loss with respect to the output.



Here we populate those gradients in an attribute of each tensor, a bit like PyTorch does with `.grad`. 



The first are the gradients of the loss with respect to the output of our model (which is the input of the loss function). We undo the `squeeze` we did in `mse`, then we use the formula that gives us the derivative of $x^{2}$: $2x$. The derivative of the mean is just $1/n$ where $n$ is the number of elements in our input:


In [None]:
def mse_grad(inp, targ): 
    # grad of loss with respect to output of previous layer
    # (squeeze(-1), as in `mse`, so only the trailing dimension is removed)
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

For the gradients of the ReLU and our linear layer, we use the gradients of the loss with respect to the output (in `out.g`) and apply the chain rule to compute the gradients of the loss with respect to the input (in `inp.g`). The chain rule tells us that `inp.g = relu'(inp) * out.g`. The derivative of `relu` is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:


In [None]:
def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g

The scheme is the same to compute the gradients of the loss with respect to the inputs, weights, and bias in the linear layer:


In [None]:
def lin_grad(inp, out, w, b):
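    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    # grad of matmul with respect to weights
    w.g = inp.t() @ out.g
    # grad with respect to bias: summed over the batch dimension
    b.g = out.g.sum(0)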

We won't linger on the mathematical formulas that define them since they're not important for our purposes, but do check out Khan Academy's excellent calculus lessons if you're interested in this topic.
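
For reference, here is a sketch of the formulas behind `lin_grad` in matrix notation. Write $X$ for the input (one row per sample), $W$ for the weights, $b$ for the bias, $Y = XW + b$ for the output, and $L$ for the loss. These symbols are introduced here for illustration only and are not used elsewhere in the chapter. Applying the chain rule element by element gives:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^{\top}, \qquad \frac{\partial L}{\partial W} = X^{\top} \frac{\partial L}{\partial Y}, \qquad \frac{\partial L}{\partial b_{j}} = \sum_{i} \frac{\partial L}{\partial Y_{ij}}$$

These correspond, in order, to `out.g @ w.t()`, `inp.t() @ out.g`, and `out.g.sum(0)`.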


### Sidebar: SymPy


SymPy is a library for symbolic computation that is extremely useful when working with calculus. Per the [documentation](https://docs.sympy.org/latest/tutorial/intro.html):


> Symbolic computation deals with the computation of mathematical objects symbolically. This means that the mathematical objects are represented exactly, not approximately, and mathematical expressions with unevaluated variables are left in symbolic form.


To do symbolic computation, we first define a *symbol*, and then do a computation, like so:


In [None]:
from sympy import symbols,diff
sx,sy = symbols('sx sy')
diff(sx**2, sx)

2*sx

Here, SymPy has taken the derivative of `sx**2` for us! It can take the derivative of complicated compound expressions, simplify and factor equations, and much more. There's really not much reason for anyone to do calculus manually nowadays: for calculating gradients, PyTorch does it for us, and for showing the equations, SymPy does it for us!
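
For instance, here is a sketch that asks SymPy for the derivative of a squared error with respect to a prediction, the same pattern that shows up in the gradient of `mse`. The symbol names `pred` and `targ` are ours:

```python
from sympy import symbols, diff, simplify

pred, targ = symbols('pred targ')
d = diff((pred - targ)**2, pred)   # derivative of the squared error w.r.t. pred
print(d)
# The result is equivalent to 2*(pred - targ):
print(simplify(d - 2*(pred - targ)) == 0)   # True
```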


### End sidebar


Once we have defined those functions, we can use them to write the backward pass. Since each gradient is automatically populated in the right tensor, we don't need to store the results of those `_grad` functions anywhere. We just need to execute them in the reverse order of the forward pass, to make sure that `out.g` exists by the time each function needs it:


In [None]:
def forward_and_backward(inp, targ):
    # forward pass:
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    # we don't actually need the loss in backward!
    loss = mse(out, targ)
    
    # backward pass:
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)

And now we can access the gradients of our model parameters in `w1.g`, `b1.g`, `w2.g`, and `b2.g`.
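
One way to gain confidence in a hand-written backward pass is to compare it with PyTorch's autograd. The sketch below repeats the definitions from this section so that it runs on its own, then checks each `.g` against the corresponding `.grad`. The seed, the tolerance, and the `a`-prefixed variable names are our own choices:

```python
from math import sqrt
import torch

def relu(x): return x.clamp_min(0.)
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

def mse_grad(inp, targ):
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

torch.manual_seed(0)
x, y = torch.randn(200, 100), torch.randn(200)
w1, b1 = torch.randn(100, 50) * sqrt(2 / 100), torch.zeros(50)
w2, b2 = torch.randn(50, 1) * sqrt(2 / 50), torch.zeros(1)

# Manual forward and backward pass, in the same order as forward_and_backward
l1 = x @ w1 + b1
l2 = relu(l1)
out = l2 @ w2 + b2
mse_grad(out, y)
lin_grad(l2, out, w2, b2)
relu_grad(l1, l2)
lin_grad(x, l1, w1, b1)

# Same computation on copies of the parameters that autograd tracks
params = [p.clone().requires_grad_(True) for p in (w1, b1, w2, b2)]
aw1, ab1, aw2, ab2 = params
loss = mse(relu(x @ aw1 + ab1) @ aw2 + ab2, y)
loss.backward()

for manual, auto in zip((w1, b1, w2, b2), params):
    print(torch.allclose(manual.g, auto.grad, atol=1e-5))
```

If the hand-written backward pass is right, every comparison prints `True`.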


We have successfully defined our model. Now let's make it a bit more like a PyTorch module.


### Refactoring the Model


The three functions we used have two associated functions: a forward pass and a backward pass. Instead of writing them separately, we can create a class to wrap them together. That class can also store the inputs and outputs for the backward pass. This way, we will just have to call `backward`:


In [None]:
class Relu():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out
    
    def backward(self): self.inp.g = (self.inp>0).float() * self.out.g

`__call__` is a magic name in Python that will make our class callable. This is what will be executed when we type `y = Relu()(x)`. We can do the same for our linear layer and the MSE loss:


In [None]:
class Lin():
    def __init__(self, w, b): self.w,self.b = w,b
        
    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out
    
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

In [None]:
class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze(-1) - targ).pow(2).mean()
        return self.out
    
    def backward(self):
        x = (self.inp.squeeze(-1)-self.targ).unsqueeze(-1)
        self.inp.g = 2.*x/self.targ.shape[0]

Then we can put everything in a model that we initialize with our tensors `w1`, `b1`, `w2`, `b2`:


In [None]:
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
        
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

What is really nice about this refactoring and registering things as layers of our model is that the forward and backward passes are now really easy to write. If we want to instantiate our model, we just need to write:


In [None]:
model = Model(w1, b1, w2, b2)

The forward pass can then be executed with:


In [None]:
loss = model(x, y)

And the backward pass with:


In [None]:
model.backward()

### Going to PyTorch


The `Lin`, `Mse` and `Relu` classes we wrote have a lot in common, so we could make them all inherit from the same base class:


In [None]:
class LayerFunction():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    
    def forward(self):  raise Exception('not implemented')
    def bwd(self):      raise Exception('not implemented')
    def backward(self): self.bwd(self.out, *self.args)

Then we just need to implement `forward` and `bwd` in each of our subclasses:


In [None]:
class Relu(LayerFunction):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp>0).float() * out.g

In [None]:
class Lin(LayerFunction):
    def __init__(self, w, b): self.w,self.b = w,b
        
    def forward(self, inp): return inp@self.w + self.b
    
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        # `inp` and `out` are passed in by LayerFunction.backward; there is no self.inp here
        self.w.g = inp.t() @ out.g
        self.b.g = out.g.sum(0)

In [None]:
class Mse(LayerFunction):
    def forward(self, inp, targ): return (inp.squeeze(-1) - targ).pow(2).mean()
    def bwd(self, out, inp, targ): 
        inp.g = 2*(inp.squeeze(-1)-targ).unsqueeze(-1) / targ.shape[0]

The rest of our model can be the same as before. This is getting closer and closer to what PyTorch does. Each basic function we need to differentiate is written as a `torch.autograd.Function` object that has a `forward` and a `backward` method. PyTorch will then keep track of any computation we do so that it can properly run the backward pass, unless we set the `requires_grad` attribute of our tensors to `False`.



Writing one of these is (almost) as easy as writing our original classes. The difference is that we choose what to save and what to put in a context variable (so that we make sure we don't save anything we don't need), and we return the gradients in the `backward` pass. It's very rare to have to write your own `Function` but if you ever need something exotic or want to mess with the gradients of a regular function, here is how to write one:


In [None]:
from torch.autograd import Function

class MyRelu(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.clamp_min(0.)
        ctx.save_for_backward(i)
        return result
    
    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i>0).float()
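
A custom `Function` is not called directly; it is invoked through its `apply` method. The sketch below repeats the class so that it runs on its own, checks the gradient on a few hand-picked values, and then uses `torch.autograd.gradcheck` to compare `backward` against finite differences. The input values are arbitrary, chosen to stay away from zero, where ReLU is not differentiable:

```python
import torch
from torch.autograd import Function, gradcheck

class MyRelu(Function):
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)       # keep the input for the backward pass
        return i.clamp_min(0.)

    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i > 0).float()

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
y = MyRelu.apply(x)          # custom Functions are invoked through .apply
y.sum().backward()
print(x.grad)                # tensor([0., 0., 1., 1.])

# gradcheck compares backward against finite differences (it expects float64 inputs)
xd = torch.tensor([-2.0, -0.5, 0.5, 3.0], dtype=torch.double, requires_grad=True)
print(gradcheck(MyRelu.apply, (xd,)))
```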

The structure used to build a more complex model that takes advantage of those `Function`s is a `torch.nn.Module`. This is the base structure for all models, and all the neural nets you have seen up until now were from that class. It mostly helps to register all the trainable parameters, which as we've seen can be used in the training loop.



To implement an `nn.Module` you just need to:



- Make sure the superclass `__init__` is called first when you initialize it.
- Define any parameters of the model as attributes with `nn.Parameter`.
- Define a `forward` function that returns the output of your model.



As an example, here is the linear layer from scratch:


In [None]:
import torch.nn as nn

class LinearLayer(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * sqrt(2/n_in))
        self.bias = nn.Parameter(torch.zeros(n_out))
    
    def forward(self, x): return x @ self.weight.t() + self.bias

As you see, this class automatically keeps track of what parameters have been defined:


In [None]:
lin = LinearLayer(10,2)
p1,p2 = lin.parameters()
p1.shape,p2.shape

(torch.Size([2, 10]), torch.Size([2]))

It is thanks to this feature of `nn.Module` that we can just say `opt.step()` and have an optimizer loop through the parameters and update each one.
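
Conceptually, a plain SGD step is little more than a loop over `parameters()`. The sketch below is a simplified stand-in, not PyTorch's actual optimizer code; it uses `nn.Linear` so that it runs on its own, and the learning rate and toy data are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 2)
x, y = torch.randn(8, 10), torch.randn(8, 2)

loss = (layer(x) - y).pow(2).mean()
loss.backward()                      # fills p.grad for every registered parameter

lr = 0.1
with torch.no_grad():                # the update itself must not be tracked
    for p in layer.parameters():
        p -= lr * p.grad             # plain gradient-descent update
        p.grad.zero_()               # reset for the next iteration

new_loss = (layer(x) - y).pow(2).mean()
print(loss.item(), new_loss.item())  # the loss should decrease for a small enough lr
```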



Note that in PyTorch, the weights are stored as an `n_out x n_in` matrix, which is why we have the transpose in the forward pass.
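
We can check this storage convention directly on `nn.Linear`. The sizes and the random input below are arbitrary:

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 2)             # n_in=10, n_out=2
print(layer.weight.shape)            # torch.Size([2, 10])

x = torch.randn(4, 10)
manual = x @ layer.weight.t() + layer.bias   # same computation as our LinearLayer.forward
print(torch.allclose(layer(x), manual, atol=1e-6))
```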



By using the linear layer from PyTorch (whose default initialization is a uniform variant of Kaiming initialization), the model we have been building up during this chapter can be written like this:


In [None]:
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
        
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)

fastai provides its own variant of `Module` that is identical to `nn.Module`, but doesn't require you to call `super().__init__()` (it does that for you automatically):


In [None]:
class Model(Module):
    def __init__(self, n_in, nh, n_out):
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
        
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)

In the last chapter, we will start from such a model and see how to build a training loop from scratch and refactor it to what we've been using in previous chapters.


## Conclusion


In this chapter we explored the foundations of deep learning, beginning with matrix multiplication and moving on to implementing the forward and backward passes of a neural net from scratch. We then refactored our code to show how PyTorch works beneath the hood.



Here are a few things to remember:



- A neural net is basically a bunch of matrix multiplications with nonlinearities in between.
- Python is slow, so to write fast code we have to vectorize it and take advantage of techniques such as elementwise arithmetic and broadcasting.
- Two tensors are broadcastable if the dimensions starting from the end and going backward match (if they are the same, or one of them is 1). To make tensors broadcastable, we may need to add dimensions of size 1 with `unsqueeze` or a `None` index.
- Properly initializing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU nonlinearities.
- The backward pass is the chain rule applied multiple times, computing the gradients from the output of our model and going back, one layer at a time.
- When subclassing `nn.Module` (if not using fastai's `Module`) we have to call the superclass `__init__` method in our `__init__` method and we have to define a `forward` function that takes an input and returns the desired result.
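
As a small reminder of the broadcasting rule in the list above, here is a sketch with throwaway tensors (the shapes are arbitrary):

```python
import torch

a = torch.zeros(3, 1)
b = torch.zeros(4)
print((a + b).shape)          # torch.Size([3, 4]): trailing dims 1 and 4 are compatible

v = torch.tensor([1., 2., 3.])
print(v[:, None].shape)       # torch.Size([3, 1]), same as v.unsqueeze(1)
print(v[None, :].shape)       # torch.Size([1, 3]), same as v.unsqueeze(0)
```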


## Questionnaire


1. Write the Python code to implement a single neuron.
1. Write the Python code to implement ReLU.
1. Write the Python code for a dense layer in terms of matrix multiplication.
1. Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python).
1. What is the "hidden size" of a layer?
1. What does the `t` method do in PyTorch?
1. Why is matrix multiplication written in plain Python very slow?
1. In `matmul`, why is `ac==br`?
1. In Jupyter Notebook, how do you measure the time taken for a single cell to execute?
1. What is "elementwise arithmetic"?
1. Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`.
1. What is a rank-0 tensor? How do you convert it to a plain Python data type?
1. What does this return, and why? `tensor([1,2]) + tensor([1])`
1. What does this return, and why? `tensor([1,2]) + tensor([1,2,3])`
1. How does elementwise arithmetic help us speed up `matmul`?
1. What are the broadcasting rules?
1. What is `expand_as`? Show an example of how it can be used to match the results of broadcasting.
1. How does `unsqueeze` help us to solve certain broadcasting problems?
1. How can we use indexing to do the same operation as `unsqueeze`?
1. How do we show the actual contents of the memory used for a tensor?
1. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)
1. Do broadcasting and `expand_as` result in increased memory use? Why or why not?
1. Implement `matmul` using Einstein summation.
1. What does a repeated index letter represent on the left-hand side of einsum?
1. What are the three rules of Einstein summation notation? Why?
1. What are the forward pass and backward pass of a neural network?
1. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?
1. What is the downside of having activations with a standard deviation too far away from 1?
1. How can weight initialization help avoid this problem?
1. What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU?
1. Why do we sometimes have to use the `squeeze` method in loss functions?
1. What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it?
1. What is the "chain rule"? Show the equation in either of the two forms presented in this chapter.
1. Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule.
1. What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.)
1. In what order do we need to call the `*_grad` functions in the backward pass? Why?
1. What is `__call__`?
1. What methods must we implement when writing a `torch.autograd.Function`?
1. Write `nn.Linear` from scratch, and test it works.
1. What is the difference between `nn.Module` and fastai's `Module`?


### Further Research


1. Implement ReLU as a `torch.autograd.Function` and train a model with it.
1. If you are mathematically inclined, find out what the gradients of a linear layer are in mathematical notation. Map that to the implementation we saw in this chapter.
1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2D convolution function. Then train a CNN that uses it.
1. Implement everything in this chapter using NumPy instead of PyTorch. 
