In [None]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#hide
from fastbook import *

# A Language Model from Scratch


We're now ready to go deep... deep into deep learning! You already learned how to train a basic neural network, but how do you go from there to creating state-of-the-art models? In this part of the book we're going to uncover all of the mysteries, starting with language models.


You saw in <<chapter_nlp>> how to fine-tune a pretrained language model to build a text classifier. In this chapter, we will explain to you what exactly is inside that model, and what an RNN is. First, let's gather some data that will allow us to quickly prototype our various models. 


## The Data


Whenever we start working on a new problem, we always first try to think of the simplest dataset we can that will allow us to try out methods quickly and easily, and interpret the results. When we started working on language modeling a few years ago we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *Human Numbers*, and it simply contains the first 10,000 numbers written out in English.

> J: One of the most common practical mistakes I see even amongst highly experienced practitioners is failing to use appropriate datasets at appropriate times during the analysis process. In particular, most people tend to start with datasets that are too big and too complicated.

We can download, extract, and take a look at our dataset in the usual way:


In [1]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [2]:
#hide
Path.BASE_PATH = path

In [3]:
path.ls()

(#2) [Path('train.txt'),Path('valid.txt')]

Let's open those two files and see what's inside. At first we'll join all of the texts together and ignore the train/valid split given by the dataset (we'll come back to that later):


In [9]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines, len(lines)

((#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...],
 9998)

We take all those lines and concatenate them in one big stream. To mark when we go from one number to the next, we use a `.` as a separator:


In [5]:
text = ' . '.join([l.strip() for l in lines])
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

We can tokenize this dataset by splitting on spaces:


In [6]:
tokens = text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

To numericalize, we have to create a list of all the unique tokens (our *vocab*):


In [7]:
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

Then we can convert our tokens into numbers by looking up the index of each in the vocab:


In [35]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]
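
As a quick sanity check on the lookup, here is a minimal sketch that repeats the same steps in plain Python on a toy string and then maps the indices back to tokens. The names `toy_text`, `toy_vocab`, `tok2idx`, and `decoded` are illustrative and not part of the notebook, and `dict.fromkeys` is used here as a stand-in for `L(...).unique()` because it also keeps first-seen order:

```python
# Toy stand-in for the Human Numbers stream (not the real dataset).
toy_text = 'one . two . three . one'

toy_tokens = toy_text.split(' ')
# dict.fromkeys keeps first-seen order, so each token gets a stable index.
toy_vocab = list(dict.fromkeys(toy_tokens))
tok2idx = {w: i for i, w in enumerate(toy_vocab)}

# Numericalize, then decode again to confirm nothing was lost.
toy_nums = [tok2idx[t] for t in toy_tokens]
decoded = [toy_vocab[i] for i in toy_nums]

print(toy_vocab)
print(toy_nums)
print(decoded == toy_tokens)
```

If the round trip returns the original tokens, the vocab and the index mapping are consistent with each other.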

Now that we have a small dataset on which language modeling should be an easy task, we can build our first model.


## Our First Language Model from Scratch


One simple way to turn this into a neural network would be to specify that we are going to predict each word based on the previous three words. We could create a list of every sequence of three words as our independent variables, and the next word after each sequence as the dependent variable. 



We can do that with plain Python. Let's do it first with tokens just to confirm what it looks like:


In [33]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Now we will do it with tensors of the numericalized values, which is what the model will actually use:


In [36]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
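
To make the windowing explicit, the sketch below wraps the same idea in a small helper and runs it on a short toy list. The helper name `make_windows` and its `context` and `stride` parameters are assumptions made for illustration. Its loop bound is `len(items) - context`, which differs slightly from the `len(nums)-4` used in the cell above, so it may keep one extra window at the very end:

```python
def make_windows(items, context=3, stride=3):
    """Return (context_items, next_item) pairs, stepping `stride` items at a time."""
    return [(items[i:i + context], items[i + context])
            for i in range(0, len(items) - context, stride)]

toy = ['one', '.', 'two', '.', 'three', '.', 'four', '.']

# Non-overlapping windows, as in the notebook (stride equals context size).
pairs = make_windows(toy)
for ctx, target in pairs:
    print(ctx, '->', target)

# Overlapping windows (stride 1) would give more training samples from the same text.
dense_pairs = make_windows(toy, stride=1)
print(len(pairs), len(dense_pairs))
```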

We can batch those easily using the `DataLoader` class. For now we will simply split the sequences at the 80% mark, keeping them in their original order rather than shuffling:

In [40]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

We can now create a neural network architecture that takes three words as input, and returns a prediction of the probability of each possible next word in the vocab. We will use three standard linear layers, but with two tweaks.



The first tweak is that the first linear layer will use only the first word's embedding as activations, the second layer will use the second word's embedding plus the first layer's output activations, and the third layer will use the third word's embedding plus the second layer's output activations. The key effect of this is that every word is interpreted in the information context of any words preceding it. 



The second tweak is that each of these three layers will use the same weight matrix. The way that one word impacts the activations from previous words should not change depending on the position of a word. In other words, activation values will change as data moves through the layers, but the layer weights themselves will not change from layer to layer. So, a layer does not learn one sequence position; it must learn to handle all positions.



Since layer weights do not change, you might think of the sequential layers as "the same layer" repeated. In fact, PyTorch makes this concrete; we can just create one layer, and use it multiple times.


### Our Language Model in PyTorch


We can now create the language model module that we described earlier:


In [None]:
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

As you see, we have created three layers:



- The embedding layer (`i_h`, for *input* to *hidden*)
- The linear layer to create the activations for the next word (`h_h`, for *hidden* to *hidden*)
- A final linear layer to predict the fourth word (`h_o`, for *hidden* to *output*)



This might be easier to represent in pictorial form, so let's define a simple pictorial representation of basic neural networks. <<img_simple_nn>> shows how we're going to represent a neural net with one hidden layer.

<img alt="Pictorial representation of simple neural network" width="400" src="images/att_00020.png" caption="Pictorial representation of a simple neural network" id="img_simple_nn">

Each shape represents activations: rectangle for input, circle for hidden (inner) layer activations, and triangle for output activations. We will use those shapes (summarized in <<img_shapes>>) in all the diagrams in this chapter.

<img alt="Shapes used in our pictorial representations" width="200" src="images/att_00021.png" id="img_shapes" caption="Shapes used in our pictorial representations">

An arrow represents the actual layer computation—i.e., the linear layer followed by the activation function. Using this notation, <<lm_rep>> shows what our simple language model looks like.

<img alt="Representation of our basic language model" width="500" caption="Representation of our basic language model" id="lm_rep" src="images/att_00022.png">

To simplify things, we've removed the details of the layer computation from each arrow. We've also color-coded the arrows, such that all arrows with the same color have the same weight matrix. For instance, all the input layers use the same embedding matrix, so they all have the same color (green).



Let's try training this model and see how it goes:


In [None]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.824297,1.970941,0.467554,00:02
1,1.386973,1.823242,0.467554,00:02
2,1.417556,1.654497,0.494414,00:02
3,1.37644,1.650849,0.494414,00:02


To see if this is any good, let's check what a very simple model would give us. In this case we could always predict the most common token, so let's find out which token is most often the target in our validation set:


In [None]:
n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

(tensor(29), 'thousand', 0.15165200855716662)
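
The same baseline can be sketched without tensors. The example below is a minimal illustration on a made-up list of targets; `majority_baseline` and `toy_targets` are hypothetical names, and the numbers it prints say nothing about the real validation set:

```python
from collections import Counter

def majority_baseline(targets):
    """Return the most frequent target and the accuracy of always predicting it."""
    token, count = Counter(targets).most_common(1)[0]
    return token, count / len(targets)

# Made-up targets, only to show the mechanics.
toy_targets = ['thousand', 'one', 'thousand', '.', 'thousand', 'two', '.', 'thousand']
tok, acc = majority_baseline(toy_targets)
print(tok, acc)
```

Any model worth keeping should beat this number on the same validation targets.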

The most common token has the index 29, which corresponds to the token `thousand`. Always predicting this token would give us an accuracy of roughly 15%, so we are faring way better!

> A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at `tokens` reminded me that large numbers are written with many words, so on the way to 10,000 you write "thousand" a lot: five thousand, five thousand and one, five thousand and two, etc. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones.

This is a nice first baseline. Let's see how we can refactor it with a loop.


### Our First Recurrent Neural Network


Looking at the code for our module, we could simplify it by replacing the duplicated code that calls the layers with a `for` loop. As well as making our code simpler, this will also have the benefit that we will be able to apply our module equally well to token sequences of different lengths—we won't be restricted to token lists of length three:


In [None]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

Let's check that we get the same results using this refactoring:


In [None]:
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.816274,1.964143,0.460185,00:02
1,1.423805,1.739964,0.473259,00:02
2,1.430327,1.685172,0.485382,00:02
3,1.38839,1.657033,0.470406,00:02
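
Training runs differ because of random initialization, so the two accuracy tables are not a strict proof that the refactoring is equivalent. A more direct check is to share the weights and compare outputs. The sketch below does this in plain Python (no PyTorch) with tiny random weights. All names (`emb`, `W_hh`, `forward_unrolled`, `forward_loop`) are illustrative, only the hidden state is computed (the `h_o` output layer is left out), and the loop iterates over the tokens it is given rather than a fixed `range(3)`:

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, b, v):
    # One output per row of W: dot(row, v) + bias.
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

random.seed(0)
n_hidden, vocab_sz = 4, 5
emb = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(vocab_sz)]
W_hh = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_hidden)]
b_hh = [0.0] * n_hidden

def forward_unrolled(x):
    # Mirrors LMModel1: three explicit steps that share the same weights.
    h = relu(linear(W_hh, b_hh, emb[x[0]]))
    h = relu(linear(W_hh, b_hh, add(h, emb[x[1]])))
    h = relu(linear(W_hh, b_hh, add(h, emb[x[2]])))
    return h

def forward_loop(x):
    # Mirrors LMModel2, but loops over however many tokens are passed in.
    h = [0.0] * n_hidden
    for tok in x:
        h = relu(linear(W_hh, b_hh, add(h, emb[tok])))
    return h

print(forward_unrolled([0, 1, 2]) == forward_loop([0, 1, 2]))
print(len(forward_loop([0, 1, 2, 3, 4])))  # the loop version accepts longer inputs
```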


We can also refactor our pictorial representation in exactly the same way, as shown in <<basic_rnn>> (we're also removing the details of activation sizes here, and using the same arrow colors as in <<lm_rep>>).

<img alt="Basic recurrent neural network" width="400" caption="Basic recurrent neural network" id="basic_rnn" src="images/att_00070.png">

You will see that there is a set of activations that are being updated each time through the loop, stored in the variable `h`—this is called the *hidden state*.

> Jargon: hidden state: The activations that are updated at each step of a recurrent neural network.

A neural network that is defined using a loop like this is called a *recurrent neural network* (RNN). It is important to realize that an RNN is not a complicated new architecture, but simply a refactoring of a multilayer neural network using a `for` loop.


> A: My true opinion: if they were called "looping neural networks," or LNNs, they would seem 50% less daunting!

Now that we know what an RNN is, let's try to make it a little bit better.


## Improving the RNN


Looking at the code for our RNN, one thing that seems problematic is that we are initializing our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order the samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. 



Another thing we can look at is having more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? 



Let's see how we can implement those changes, starting with adding some state.


### Maintaining the State of an RNN


Because we initialize the model's hidden state to zero for each new sample, we are throwing away all the information we have about the sentences we have seen so far, which means that our model doesn't actually know where we are up to in the overall counting sequence. This is easily fixed; we can simply move the initialization of the hidden state to `__init__`.



But this fix will create its own subtle, but important, problem. It effectively makes our neural network as deep as the entire number of tokens in our document. For instance, if there were 10,000 tokens in our dataset, we would be creating a 10,000-layer neural network.



To see why this is the case, consider the original pictorial representation of our recurrent neural network in <<lm_rep>>, before refactoring it with a `for` loop. You can see each layer corresponds with one token input. When we talk about the representation of a recurrent neural network before refactoring with the `for` loop, we call this the *unrolled representation*. It is often helpful to consider the unrolled representation when trying to understand an RNN.



The problem with a 10,000-layer neural network is that if and when you get to the 10,000th word of the dataset, you will still need to calculate the derivatives all the way back to the first layer. This is going to be very slow indeed, and very memory-intensive. It is unlikely that you'll be able to store even one mini-batch on your GPU.



The solution to this problem is to tell PyTorch that we do not want to backpropagate the derivatives through the entire implicit neural network. Instead, we will just keep the last three layers of gradients. To remove all of the gradient history in PyTorch, we use the `detach` method.

Here is the new version of our RNN. It is now stateful, because it remembers its activations between different calls to `forward`, which represent its use for different samples in the batch:


In [None]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self): self.h = 0

This model will have the same activations whatever sequence length we pick, because the hidden state will remember the last activation from the previous batch. The only thing that will be different is the gradients computed at each step: they will be calculated only over the most recent sequence-length tokens, instead of the whole stream. This approach is called *backpropagation through time* (BPTT).

> Jargon: Backpropagation through time (BPTT): Treating a neural net with effectively one layer per time step (usually refactored using a loop) as one big model, and calculating gradients on it in the usual way. To avoid running out of memory and time, we usually use _truncated_ BPTT, which "detaches" the history of computation steps in the hidden state every few time steps.

To use `LMModel3`, we need to make sure the samples are going to be seen in a certain order. As we saw in <<chapter_nlp>>, if the first line of the first batch is our `dset[0]` then the second batch should have `dset[1]` as the first line, so that the model sees the text flowing.



`LMDataLoader` was doing this for us in <<chapter_nlp>>. This time we're going to do it ourselves.



To do this, we are going to rearrange our dataset. First we divide the samples into `bs` groups of `m = len(dset) // bs` samples each (this is the equivalent of splitting the whole concatenated dataset into, for example, 64 equally sized pieces, since we're using `bs=64` here). `m` is the length of each of these pieces. For instance, if we're using our whole dataset (although we'll actually split it into train versus valid in a moment), that will be:

In [None]:
m = len(seqs)//bs
m,bs,len(seqs)

(328, 64, 21031)

The first batch will be composed of the samples:



    (0, m, 2*m, ..., (bs-1)*m)



the second batch of the samples: 



    (1, m+1, 2*m+1, ..., (bs-1)*m+1)



and so forth. This way, at each epoch, the model will see a chunk of contiguous text of size `3*m` (since each text is of size 3) on each line of the batch.



The following function does that reindexing:


In [None]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
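
To see what this reindexing does, it helps to run it on a small list of integers where each value is its own position. The sketch below is a plain-list version under that assumption; `group_chunks_plain`, `toy_ds`, and `bs_toy` are hypothetical names used only for this illustration:

```python
def group_chunks_plain(ds, bs):
    """Plain-list version of the reindexing: batch i holds items i, i+m, i+2m, ..."""
    m = len(ds) // bs
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

toy_ds = list(range(13))   # 13 samples, so one is left over with bs=3
bs_toy = 3
ordered = group_chunks_plain(toy_ds, bs_toy)

# Cut the reordered list into consecutive batches of size bs_toy.
batches = [ordered[k:k + bs_toy] for k in range(0, len(ordered), bs_toy)]
for b in batches:
    print(b)

# Row 0 of each successive batch walks through the data in order.
print([b[0] for b in batches])
```

Reading down any fixed row position across batches gives a contiguous run of the original samples, which is what lets the hidden state carry over meaningfully from one batch to the next.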

Then we just pass `drop_last=True` when building our `DataLoaders` to drop the last batch that does not have a shape of `bs`. We also pass `shuffle=False` to make sure the texts are read in order:


In [None]:
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

The last thing we add is a little tweak of the training loop via a `Callback`. We will talk more about callbacks in <<chapter_accel_sgd>>; this one will call the `reset` method of our model at the beginning of each epoch and before each validation phase. Since we implemented that method to zero the hidden state of the model, this will make sure we start with a clean state before reading those continuous chunks of text. We can also train for a bit longer:

In [None]:
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.677074,1.827367,0.467548,00:02
1,1.282722,1.870913,0.388942,00:02
2,1.090705,1.651793,0.4625,00:02
3,1.005092,1.613794,0.516587,00:02
4,0.965975,1.560775,0.551202,00:02
5,0.916182,1.595857,0.560577,00:02
6,0.897657,1.539733,0.574279,00:02
7,0.836274,1.585141,0.583173,00:02
8,0.805877,1.629808,0.586779,00:02
9,0.795096,1.651267,0.588942,00:02


This is already better! The next step is to use more targets and compare them to the intermediate predictions.


### Creating More Signal


Another problem with our current approach is that we only predict one output word for each three input words. That means that the amount of signal that we are feeding back to update weights with is not as large as it could be. It would be better if we predicted the next word after every single word, rather than every three words, as shown in <<stateful_rep>>.

<img alt="RNN predicting after every token" width="400" caption="RNN predicting after every token" id="stateful_rep" src="images/att_00024.png">

This is easy enough to add. We first need to change our data so that the dependent variable contains the next word after each of our input words, rather than only the word that follows the last one. Instead of `3`, we use a variable, `sl` (for sequence length), and make it a bit bigger:

In [None]:
sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

Looking at the first element of `seqs`, we can see that it contains two lists of the same size. The second list is the same as the first, but offset by one element:


In [None]:
[L(vocab[o] for o in s) for s in seqs[0]]

[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]
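
The offset-by-one structure is easier to see on a short run of integers. Here is a minimal sketch using plain lists instead of tensors; `make_lm_pairs` is a hypothetical helper that uses the same index arithmetic as the cell that builds `seqs`:

```python
def make_lm_pairs(nums, sl):
    """Split `nums` into (input, target) pairs where target is input shifted by one."""
    return [(nums[i:i + sl], nums[i + 1:i + sl + 1])
            for i in range(0, len(nums) - sl - 1, sl)]

toy_nums = list(range(10))
pairs = make_lm_pairs(toy_nums, sl=4)
for inp, targ in pairs:
    print(inp, '->', targ)
```

At every position, the target is simply the token that comes next in the stream, so each sequence of `sl` inputs now provides `sl` training signals instead of one.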

Now we need to modify our model so that it outputs a prediction after every word, rather than just at the end of a three-word sequence:


In [None]:
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        outs = []
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)
    
    def reset(self): self.h = 0

This model will return outputs of shape `bs x sl x vocab_sz` (since we stacked on `dim=1`). Our targets are of shape `bs x sl`, so we need to flatten those before using them in `F.cross_entropy`:


In [None]:
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
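
To show what the flattening achieves, the sketch below computes the same kind of loss with nested Python lists instead of tensors. It is an illustration only: `cross_entropy_flat` is a hypothetical name, and it skips the numerical-stability tricks a real implementation would use:

```python
import math

def cross_entropy_flat(logits, targets):
    """Mean cross-entropy over every (batch, position) pair.

    logits:  nested list of shape bs x sl x vocab_sz
    targets: nested list of shape bs x sl (integer class indices)
    """
    flat_logits = [pos for seq in logits for pos in seq]    # (bs*sl) x vocab_sz
    flat_targets = [t for seq in targets for t in seq]      # (bs*sl)
    total = 0.0
    for row, t in zip(flat_logits, flat_targets):
        log_z = math.log(sum(math.exp(v) for v in row))     # log of the softmax denominator
        total += log_z - row[t]                              # equals -log softmax(row)[t]
    return total / len(flat_targets)

bs_toy, sl_toy, vocab_toy = 2, 3, 4
# All-zero logits mean a uniform prediction over the vocab at every position.
uniform_logits = [[[0.0] * vocab_toy for _ in range(sl_toy)] for _ in range(bs_toy)]
toy_targets = [[0, 1, 2], [3, 0, 1]]
loss = cross_entropy_flat(uniform_logits, toy_targets)
print(loss, math.log(vocab_toy))
```

With a uniform prediction the loss equals the log of the vocab size, which is a handy reference point: a model whose loss starts near that value is guessing at random.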

We can now use this loss function to train the model:


In [None]:
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.103298,2.874341,0.212565,00:01
1,2.231964,1.97128,0.462158,00:01
2,1.711358,1.813547,0.461182,00:01
3,1.448516,1.828176,0.483236,00:01
4,1.28863,1.659564,0.520671,00:01
5,1.16147,1.714023,0.554932,00:01
6,1.055568,1.660916,0.575033,00:01
7,0.960765,1.719624,0.591064,00:01
8,0.870153,1.83956,0.614665,00:01
9,0.808545,1.770278,0.624349,00:01


We need to train for longer, since the task has changed a bit and is more complicated now. But we end up with a good result... At least, sometimes. If you run it a few times, you'll see that you can get quite different results on different runs. That's because effectively we have a very deep network here, which can result in very large or very small gradients. We'll see in the next part of this chapter how to deal with this.



Now, the obvious way to get a better model is to go deeper: we only have one linear layer between the hidden state and the output activations in our basic RNN, so maybe we'll get better results with more.


## Multilayer RNNs


In a multilayer RNN, we pass the activations from our recurrent neural network into a second recurrent neural network, like in <<stacked_rnn_rep>>.

<img alt="2-layer RNN" width="550" caption="2-layer RNN" id="stacked_rnn_rep" src="images/att_00025.png">

The unrolled representation is shown in <<unrolled_stack_rep>> (similar to <<lm_rep>>).

<img alt="2-layer unrolled RNN" width="500" caption="Two-layer unrolled RNN" id="unrolled_stack_rep" src="images/att_00026.png">

Let's see how to implement this in practice.


### The Model


We can save some time by using PyTorch's `RNN` class, which implements exactly what we created earlier, but also gives us the option to stack multiple RNNs, as we have discussed:


In [None]:
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(res)
    
    def reset(self): self.h.zero_()

In [None]:
learn = Learner(dls, LMModel5(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.055853,2.59164,0.437907,00:01
1,2.162359,1.78731,0.471598,00:01
2,1.710663,1.941807,0.321777,00:01
3,1.520783,1.999726,0.312012,00:01
4,1.330846,2.012902,0.413249,00:01
5,1.163297,1.896192,0.450684,00:01
6,1.033813,2.005209,0.434814,00:01
7,0.91909,2.047083,0.456706,00:01
8,0.822939,2.068031,0.468831,00:01
9,0.75018,2.136064,0.475098,00:01


Now that's disappointing... our previous single-layer RNN performed better. Why? The reason is that we have a deeper model, leading to exploding or vanishing activations.


### Exploding or Disappearing Activations


In practice, creating accurate models from this kind of RNN is difficult. We will get better results if we call `detach` less often, and have more layers—this gives our RNN a longer time horizon to learn from, and richer features to create. But it also means we have a deeper model to train. The key challenge in the development of deep learning has been figuring out how to train these kinds of models.



This is challenging because of what happens when you multiply by a matrix many times. Think about what happens when you multiply by a number many times. For example, if you multiply by 2, starting at 1, you get the sequence 1, 2, 4, 8, ... and after 32 steps you are already at 4,294,967,296. A similar issue happens if you multiply by 0.5: you get 0.5, 0.25, 0.125, ... and after 32 steps it's 0.00000000023. As you can see, multiplying by a number higher or lower than 1 results in an explosion or disappearance of our starting number after a modest number of repeated multiplications. The closer the number is to 1, the more steps it takes, but the outcome is the same.



Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And that's all a deep neural network is: each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large or extremely small numbers.
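
The effect is easy to reproduce without any deep learning library. The following sketch uses plain Python lists; the two 2x2 matrices are made-up examples (not taken from any model), chosen so that repeated multiplication by one grows and by the other shrinks.

```python
def matmul(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def power(m, steps):
    """Multiply the identity matrix by m, `steps` times."""
    out = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        out = matmul(out, m)
    return out

print(2 ** 32)     # 4294967296
print(0.5 ** 32)   # about 2.3e-10

# Made-up matrices with unremarkable-looking entries
slightly_big   = [[1.1, 0.2], [0.2, 1.1]]
slightly_small = [[0.6, 0.2], [0.2, 0.6]]

grow   = power(slightly_big, 100)
shrink = power(slightly_small, 100)
print(grow[0][0])    # a huge number
print(shrink[0][0])  # a number very close to zero
```

Nothing about either matrix looks extreme, yet after 100 multiplications the entries differ by more than twenty orders of magnitude.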



This is a problem, because the way computers store numbers (known as "floating point") means that they become less and less accurate the further away the numbers get from zero. The diagram in <<float_prec>>, from the excellent article ["What You Never Wanted to Know About Floating Point but Will Be Forced to Find Out"](http://www.volkerschatz.com/science/float.html), shows how the precision of floating-point numbers varies over the number line.


<img alt="Precision of floating point numbers" width="1000" caption="Precision of floating-point numbers" id="float_prec" src="images/fltscale.svg">
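
The loss of precision away from zero can be measured directly. This is a minimal sketch using `math.ulp` from the standard library (it assumes Python 3.9 or later), which reports the gap between a float and the next representable float above it.

```python
import math

# math.ulp(x) is the gap between x and the next representable float above it
for x in [1.0, 1e3, 1e8, 1e16]:
    print(f"{x:>8.0e}  gap to next float: {math.ulp(x):.3e}")

# At 1e16 the gap is 2.0, so adding 1.0 changes nothing
print(1e16 + 1.0 == 1e16)  # True
```

Python floats are 64-bit. Neural networks are usually trained with 32-bit or 16-bit floats, where the gaps at the same magnitude are larger still.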


This inaccuracy means that often the gradients calculated for updating the weights end up as zero or infinity for deep networks. This is commonly referred to as the *vanishing gradients* or *exploding gradients* problem. It means that in SGD, the weights are either not updated at all or jump to infinity. Either way, they won't improve with training.



Researchers have developed a number of ways to tackle this problem, which we will be discussing later in the book. One option is to change the definition of a layer in a way that makes it less likely to have exploding activations. We'll look at the details of how this is done in <<chapter_convolutions>>, when we discuss batch normalization, and <<chapter_resnet>>, when we discuss ResNets, although these details don't generally matter in practice (unless you are a researcher who is creating new approaches to solving this problem). Another strategy for dealing with this is to be careful about initialization, which is a topic we'll investigate in <<chapter_foundations>>.



For RNNs, there are two types of layers that are frequently used to avoid exploding activations: *gated recurrent units* (GRUs) and *long short-term memory* (LSTM) layers. Both of these are available in PyTorch, and are drop-in replacements for the RNN layer. We will only cover LSTMs in this book; there are plenty of good tutorials online explaining GRUs, which are a minor variant on the LSTM design.


## LSTM


LSTM is an architecture that was introduced back in 1997 by Jürgen Schmidhuber and Sepp Hochreiter. In this architecture, there are not one but two hidden states. In our base RNN, the hidden state is the output of the RNN at the previous time step. That hidden state is then responsible for two things:



- Having the right information for the output layer to predict the correct next token
- Retaining memory of everything that happened in the sentence



Consider, for example, the sentences "Henry has a dog and he likes his dog very much" and "Sophie has a dog and she likes her dog very much." It's very clear that the RNN needs to remember the name at the beginning of the sentence to be able to predict *he/she* or *his/her*. 



In practice, RNNs are really bad at retaining memory of what happened much earlier in the sentence, which is the motivation to have another hidden state (called *cell state*) in the LSTM. The cell state will be responsible for keeping *long short-term memory*, while the hidden state will focus on the next token to predict. Let's take a closer look at how this is achieved and build an LSTM from scratch.


### Building an LSTM from Scratch


In order to build an LSTM, we first have to understand its architecture. <<lstm>> shows its inner structure.
    
<img src="images/LSTM.png" id="lstm" caption="Architecture of an LSTM" alt="A graph showing the inner architecture of an LSTM" width="700">


In this picture, our input $x_{t}$ enters on the left with the previous hidden state ($h_{t-1}$) and cell state ($c_{t-1}$). The four orange boxes represent four layers (our neural nets) with the activation being either sigmoid ($\sigma$) or tanh. tanh is just a sigmoid function rescaled to the range -1 to 1. Its mathematical expression can be written like this:



$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x}+e^{-x}} = 2 \sigma(2x) - 1$$



where $\sigma$ is the sigmoid function. The green circles are elementwise operations. What goes out on the right is the new hidden state ($h_{t}$) and new cell state ($c_{t}$), ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.
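
The relationship $\tanh(x) = 2\sigma(2x) - 1$ can be checked numerically. The sketch below uses only the standard library; `sigmoid` is a helper defined here for the purpose, not a function from the chapter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    # The two columns agree up to floating-point rounding
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.6f}  2*sigmoid(2x)-1={rescaled:+.6f}")
```

Every printed value lies strictly between -1 and 1, which is the range the cell gate relies on.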



Let's go over the four neural nets (called *gates*) one by one and explain the diagram—but before this, notice how very little the cell state (at the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.



First, the arrows for input and old hidden state are joined together. In the RNN we wrote earlier in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`; thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.



The first gate (looking from left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell state to determine which information to keep and which to throw away: values closer to 0 are discarded and values closer to 1 are kept. This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect it to (have learned to) reset its cell state.



The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance, we may see a new gender pronoun, in which case we'll need to replace the information about gender that the forget gate removed. Similar to the forget gate, the input gate decides which elements of the cell state to update (values close to 1) or not (values close to 0). The third gate determines what those updated values are, in the range of –1 to 1 (thanks to the tanh function). The result is then added to the cell state.



The last gate is the *output gate*. It determines which information from the cell state to use to generate the output. The cell state goes through a tanh before being combined with the sigmoid output from the output gate, and the result is the new hidden state.



In terms of code, we can write the same steps like this:


In [None]:
class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

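    def forward(self, input, state):
        h,c = state
        # Join hidden state and input along the feature axis: bs x (nh + ni)
        h = torch.cat([h, input], dim=1)
        # Forget gate: decide how much of the old cell state to keep
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget
        # Input and cell gates: decide what to write into the cell state
        inp = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell
        # Output gate: decide which part of the cell state becomes the new hidden state
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)
        return h, (h,c)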

In practice, we can then refactor the code. Also, in terms of performance, it's better to do one big matrix multiplication than four smaller ones (that's because we only launch the special fast kernel on the GPU once, and it gives the GPU more work to do in parallel). The stacking takes a bit of time (since we have to move one of the tensors around on the GPU to have it all in a contiguous array), so we use two separate layers for the input and the hidden state. The optimized and refactored code then looks like this:


In [None]:
class LSTMCell(Module):
    def __init__(self, ni, nh):
        self.ih = nn.Linear(ni,4*nh)
        self.hh = nn.Linear(nh,4*nh)

    def forward(self, input, state):
        h,c = state
        # One big multiplication for all the gates is better than 4 smaller ones
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1)
        ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()

        c = (forgetgate*c) + (ingate*cellgate)
        h = outgate * c.tanh()
        return h, (h,c)
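
To see how a cell like this is driven over a whole sequence, here is a self-contained sketch. `TinyLSTMCell` is a hypothetical name; it repeats the refactored computation on top of plain `nn.Module` (hence the explicit `super().__init__()`) so that it runs without fastai, and the sizes are made up for illustration.

```python
import torch
from torch import nn

class TinyLSTMCell(nn.Module):
    """Same computation as the refactored cell, on plain nn.Module."""
    def __init__(self, ni, nh):
        super().__init__()
        self.ih = nn.Linear(ni, 4 * nh)
        self.hh = nn.Linear(nh, 4 * nh)

    def forward(self, input, state):
        h, c = state
        gates = (self.ih(input) + self.hh(h)).chunk(4, 1)
        ingate, forgetgate, outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()
        c = forgetgate * c + ingate * cellgate
        h = outgate * c.tanh()
        return h, (h, c)

bs, ni, nh, sl = 4, 6, 8, 5             # illustrative sizes; note ni != nh
cell = TinyLSTMCell(ni, nh)
x = torch.randn(bs, sl, ni)             # a batch of embedded sequences
state = (torch.zeros(bs, nh), torch.zeros(bs, nh))

outputs = []
for t in range(sl):                     # feed one time step at a time
    out, state = cell(x[:, t], state)
    outputs.append(out)
res = torch.stack(outputs, dim=1)       # bs x sl x nh
print(res.shape)
```

The loop over time steps is the part that `nn.LSTM` handles internally, which is one reason to prefer the built-in layer for actual training.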

Here we use the PyTorch `chunk` method to split our tensor into four pieces. It works like this:


In [None]:
t = torch.arange(0,10); t

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
t.chunk(2)

(tensor([0, 1, 2, 3, 4]), tensor([5, 6, 7, 8, 9]))
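
In the cell, `chunk(4, 1)` is applied to a two-dimensional tensor and splits along dimension 1, the feature axis. A small sketch with made-up sizes shows what that looks like:

```python
import torch

bs, nh = 2, 3                                   # illustrative sizes
gates = torch.arange(bs * 4 * nh).view(bs, 4 * nh)
pieces = gates.chunk(4, 1)                      # 4 pieces, split along dim 1

print(len(pieces))       # 4
print(pieces[0].shape)   # torch.Size([2, 3])
print(pieces[0])         # the first three columns of each row
```

Each piece keeps the batch dimension and takes `nh` consecutive columns, so each gate gets its own `bs x nh` slice of the single large matrix product.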

Let's now use this architecture to train a language model!


### Training a Language Model Using LSTMs


Here is the same network as `LMModel5`, using a two-layer LSTM. We can train it at a higher learning rate, for a shorter time, and get better accuracy:


In [None]:
class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)
    
    def reset(self): 
        for h in self.h: h.zero_()

In [None]:
learn = Learner(dls, LMModel6(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.000821,2.663942,0.438314,00:02
1,2.139642,2.18478,0.240479,00:02
2,1.607275,1.812682,0.439779,00:02
3,1.347711,1.830982,0.497477,00:02
4,1.123113,1.937766,0.594401,00:02
5,0.852042,2.012127,0.631592,00:02
6,0.565494,1.312742,0.725749,00:02
7,0.347445,1.297934,0.711263,00:02
8,0.208191,1.441269,0.731201,00:02
9,0.126335,1.569952,0.737305,00:02


Now that's better than a multilayer RNN! We can still see there is a bit of overfitting, however, which is a sign that a bit of regularization might help.


## Regularizing an LSTM


Recurrent neural networks, in general, are hard to train, because of the problem of vanishing activations and gradients we saw before. Using LSTM (or GRU) cells makes training easier than with vanilla RNNs, but they are still very prone to overfitting. Data augmentation, while a possibility, is less often used for text data than for images because in most cases it requires another model to generate random augmentations (e.g., by translating the text into another language and then back into the original language). Overall, data augmentation for text data is currently not a well-explored space.



However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper ["Regularizing and Optimizing LSTM Language Models"](https://arxiv.org/abs/1708.02182) by Stephen Merity, Nitish Shirish Keskar, and Richard Socher. This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state-of-the-art results that previously required much more complicated models. The authors called an LSTM using these techniques an *AWD-LSTM*. We'll look at each of these techniques in turn.


### Dropout


Dropout is a regularization technique that was introduced by Geoffrey Hinton et al. in [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). The basic idea is to randomly change some activations to zero at training time. This makes sure all neurons actively work toward the output, as seen in <<img_dropout>> (from "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" by Nitish Srivastava et al.).



<img src="images/Dropout1.png" alt="A figure from the article showing how neurons go off with dropout" width="800" id="img_dropout" caption="Applying dropout in a neural network (courtesy of Nitish Srivastava et al.)">



Hinton used a nice metaphor when he explained, in an interview, the inspiration for dropout:



> : I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.



In the same interview, he also explained that neuroscience provided additional inspiration:



> : We don't really know why neurons spike. One theory is that they want to be noisy so as to regularize, because we have many more parameters than we have data points. The idea of dropout is that if you have noisy activations, you can afford to use a much bigger model.


Together, these two quotes explain why dropout helps a model generalize: first, it helps the neurons cooperate better; second, it makes the activations noisier, which makes the model more robust.


We can see, however, that if we were to just zero those activations without doing anything else, our model would have problems training: if we go from the sum of five activations (that are all positive numbers since we apply a ReLU) to just two, this won't have the same scale. Therefore, if we apply dropout with a probability `p`, we rescale all activations by dividing them by `1-p` (on average `p` will be zeroed, so it leaves `1-p`), as shown in <<img_dropout1>>.



<img src="images/Dropout.png" alt="A figure from the article introducing dropout showing how a neuron is on/off" width="600" id="img_dropout1" caption="Why scale the activations when applying dropout (courtesy of Nitish Srivastava et al.)">
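
The rescaling argument can be checked with a quick simulation. This is a sketch in plain Python, with five made-up positive activations and `p = 0.5`; it averages the sum of the activations over many random dropout masks, with and without dividing by `1-p`.

```python
import random

random.seed(0)
p = 0.5                              # probability of zeroing an activation
acts = [1.0, 2.0, 0.5, 1.5, 1.0]     # five made-up positive activations
trials = 20_000

plain, rescaled = 0.0, 0.0
for _ in range(trials):
    mask = [0.0 if random.random() < p else 1.0 for _ in acts]
    dropped = [a * m for a, m in zip(acts, mask)]
    plain    += sum(dropped)
    rescaled += sum(d / (1 - p) for d in dropped)

print(sum(acts))            # 6.0, the sum without dropout
print(plain / trials)       # close to 3.0: the scale has halved
print(rescaled / trials)    # close to 6.0: dividing by 1-p restores it
```

Without the division, the next layer would see inputs of a different scale at training time than at inference time, when nothing is dropped.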



This is a full implementation of the dropout layer in PyTorch (although PyTorch's native layer is actually written in C, not Python):


In [None]:
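class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        # At inference time dropout does nothing
        if not self.training: return x
        # Mask of zeros (probability p) and ones (probability 1-p), same shape as x
        mask = x.new(*x.shape).bernoulli_(1-self.p)
        # Rescale the kept activations so their expected sum is unchanged
        return x * mask.div_(1-self.p)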

The `bernoulli_` method is creating a tensor of random zeros (with probability `p`) and ones (with probability `1-p`), which is then multiplied with our input before dividing by `1-p`. Note the use of the `training` attribute, which is available in any PyTorch `nn.Module`, and tells us if we are doing training or inference.



> note: Do Your Own Experiments: In previous chapters of the book we'd be adding a code example for `bernoulli_` here, so you can see exactly how it works. But now that you know enough to do this yourself, we're going to be doing fewer and fewer examples for you, and instead expecting you to do your own experiments to see how things work. In this case, you'll see in the end-of-chapter questionnaire that we're asking you to experiment with `bernoulli_`—but don't wait for us to ask you to experiment to develop your understanding of the code we're studying; go ahead and do it anyway!



Using dropout before passing the output of our LSTM to the final layer will help reduce overfitting. Dropout is also used in many other models, including the default CNN head used in `fastai.vision`, and is available in `fastai.tabular` by passing the `ps` parameter (where each "p" is passed to each added `Dropout` layer), as we'll see in <<chapter_arch_details>>.


Dropout has different behavior in training and validation mode, which we specified using the `training` attribute in `Dropout`. Calling the `train` method on a `Module` sets `training` to `True` (both for the module you call the method on and for every module it recursively contains), and `eval` sets it to `False`. This is done automatically when calling the methods of `Learner`, but if you are not using that class, remember to switch from one to the other as needed.
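
If you are working outside of `Learner`, the switch looks like this. The sketch uses a throwaway two-layer `nn.Sequential` (not one of the chapter's models) to show that `train` and `eval` reach nested modules and that dropout becomes a no-op in evaluation mode.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
x = torch.ones(2, 4)

model.train()                       # sets training=True on every submodule
print(model[1].training)            # True

model.eval()                        # sets training=False on every submodule
print(model[1].training)            # False
print(torch.equal(model(x), model[0](x)))  # True: dropout is a no-op in eval mode
```

Forgetting to call `eval` before inference is a common source of noisy, irreproducible predictions from models that contain dropout.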


### Activation Regularization and Temporal Activation Regularization


*Activation regularization* (AR) and *temporal activation regularization* (TAR) are two regularization methods very similar to weight decay, discussed in <<chapter_collab>>. When applying weight decay, we add a small penalty to the loss that aims at making the weights as small as possible. For activation regularization, it's the final activations produced by the LSTM that we will try to make as small as possible, instead of the weights.



To regularize the final activations, we have to store those somewhere, then add the mean of their squares to the loss (along with a multiplier `alpha`, which is just like `wd` for weight decay):



``` python
loss += alpha * activations.pow(2).mean()
```


Temporal activation regularization is linked to the fact we are predicting tokens in a sentence. That means it's likely that the outputs of our LSTMs should somewhat make sense when we read them in order. TAR is there to encourage that behavior by adding a penalty to the loss to make the difference between two consecutive activations as small as possible: our activations tensor has a shape `bs x sl x n_hid`, and we read consecutive activations on the sequence length axis (the dimension in the middle). With this, TAR can be expressed as:



``` python
loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
```
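
To get a feel for what each penalty reacts to, here is a sketch on two hand-made activation tensors of shape `bs x sl x n_hid`. The sizes, the values of `alpha` and `beta`, and the helper names `ar` and `tar` are illustrative assumptions, not part of fastai.

```python
import torch

bs, sl, n_hid = 2, 4, 3                  # illustrative sizes
alpha, beta = 2.0, 1.0                   # illustrative multipliers

# Activations that are large but identical at every time step
steady = torch.full((bs, sl, n_hid), 3.0)
# Activations of the same magnitude that flip sign at every time step
signs = torch.tensor([1.0, -1.0, 1.0, -1.0]).view(1, sl, 1)
jumpy = steady * signs

def ar(acts):  return alpha * acts.pow(2).mean()
def tar(acts): return beta * (acts[:, 1:] - acts[:, :-1]).pow(2).mean()

print(ar(steady), tar(steady))  # AR penalizes the size; TAR is zero
print(ar(jumpy),  tar(jumpy))   # same AR; TAR is now large
```

AR only looks at magnitude, so it cannot tell the two tensors apart, while TAR only looks at change over the sequence axis and ignores activations that stay constant.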



`alpha` and `beta` are then two hyperparameters to tune. To make this work, we need our model with dropout to return three things: the proper output, the activations of the LSTM pre-dropout, and the activations of the LSTM post-dropout. AR is often applied on the dropped-out activations (to not penalize the activations we turned into zeros afterward) while TAR is applied on the non-dropped-out activations (because those zeros create big differences between two consecutive time steps). There is then a callback called `RNNRegularizer` that will apply this regularization for us.


### Training a Weight-Tied Regularized LSTM


We can combine dropout (applied before we go into our output layer) with AR and TAR to train our previous LSTM. We just need to return three things instead of one: the normal output of our model, the activations from our LSTM before dropout, and the dropped-out activations. The last two will be picked up by the callback `RNNRegularizer` for the contributions it has to make to the loss.



Another useful trick we can add from [the AWD LSTM paper](https://arxiv.org/abs/1708.02182) is *weight tying*. In a language model, the input embeddings represent a mapping from English words to activations, and the output hidden layer represents a mapping from activations to English words. We might expect, intuitively, that these mappings could be the same. We can represent this in PyTorch by assigning the same weight matrix to each of these layers:



    self.h_o.weight = self.i_h.weight
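
The assignment works because the two weight matrices have the same shape. Here is a standalone sketch with made-up sizes (the variable names mirror the model's attributes, but nothing here comes from a trained model):

```python
import torch
from torch import nn

vocab_sz, n_hidden = 30, 64           # illustrative sizes
i_h = nn.Embedding(vocab_sz, n_hidden)
h_o = nn.Linear(n_hidden, vocab_sz)

# nn.Linear stores its weight as (out_features, in_features), so both
# matrices are vocab_sz x n_hidden and one can stand in for the other.
print(i_h.weight.shape, h_o.weight.shape)

h_o.weight = i_h.weight               # tie: both layers now hold the same Parameter
print(h_o.weight is i_h.weight)       # True

with torch.no_grad():
    i_h.weight[0, 0] = 42.0           # a change made through one layer...
print(h_o.weight[0, 0].item())        # 42.0 ...is visible through the other
```

Only the weight is shared; the linear layer keeps its own bias. Tying also requires the embedding size to equal the hidden size, which is the case in the models of this chapter.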



In `LMModel7`, we include these final tweaks:


In [None]:
class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
        
    def forward(self, x):
        raw,h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out),raw,out
    
    def reset(self): 
        for h in self.h: h.zero_()

We can create a regularized `Learner` using the `RNNRegularizer` callback:


In [None]:
learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])

A `TextLearner` automatically adds those two callbacks for us (with those values for `alpha` and `beta` as defaults), so we can simplify the preceding line to:


In [None]:
learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

We can then train the model, and add additional regularization by increasing the weight decay to `0.1`:


In [None]:
learn.fit_one_cycle(15, 1e-2, wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,2.693885,2.013484,0.466634,00:02
1,1.685549,1.18731,0.629313,00:02
2,0.973307,0.791398,0.745605,00:02
3,0.555823,0.640412,0.794108,00:02
4,0.351802,0.557247,0.8361,00:02
5,0.244986,0.594977,0.807292,00:02
6,0.192231,0.51169,0.846761,00:02
7,0.162456,0.52037,0.858073,00:02
8,0.142664,0.525918,0.842285,00:02
9,0.128493,0.495029,0.858073,00:02


Now this is far better than our previous model!


## Conclusion


You have now seen everything that is inside the AWD-LSTM architecture we used in text classification in <<chapter_nlp>>. It uses dropout in a lot more places:



- Embedding dropout (inside the embedding layer, dropping random rows of the embedding matrix)
- Input dropout (after the embedding layer)
- Weight dropout (applied to the weights of the LSTM at each training step)
- Hidden dropout (applied to the hidden state between two layers)



This makes it even more regularized. Since fine-tuning those five dropout values (including the dropout before the output layer) is complicated, we have determined good defaults and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw in that chapter (which is multiplied by each dropout).



Another architecture that is very powerful, especially in "sequence-to-sequence" problems (that is, problems where the dependent variable is itself a variable-length sequence, such as language translation), is the Transformers architecture. You can find it in a bonus chapter on the [book's website](https://book.fast.ai/).


## Questionnaire


1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?
1. Why do we concatenate the documents in our dataset before creating a language model?
1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make to our model?
1. How can we share a weight matrix across multiple layers in PyTorch?
1. Write a module that predicts the third word given the previous two words of a sentence, without peeking.
1. What is a recurrent neural network?
1. What is "hidden state"?
1. What is the equivalent of hidden state in `LMModel1`?
1. To maintain the state in an RNN, why is it important to pass the text to the model in order?
1. What is an "unrolled" representation of an RNN?
1. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?
1. What is "BPTT"?
1. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <<chapter_nlp>>.
1. What does the `ModelResetter` callback do? Why do we need it?
1. What are the downsides of predicting just one output word for each three input words?
1. Why do we need a custom loss function for `LMModel4`?
1. Why is the training of `LMModel4` unstable?
1. In the unrolled representation, we can see that a recurrent neural network actually has many layers. So why do we need to stack RNNs to get better results?
1. Draw a representation of a stacked (multilayer) RNN.
1. Why should we get better results in an RNN if we call `detach` less often? Why might this not happen in practice with a simple RNN?
1. Why can a deep network result in very large or very small activations? Why does this matter?
1. In a computer's floating-point representation of numbers, which numbers are the most precise?
1. Why do vanishing gradients prevent training?
1. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?
1. What are these two states called in an LSTM?
1. What is tanh, and how is it related to sigmoid?
1. What is the purpose of this code in `LSTMCell`: `h = torch.cat([h, input], dim=1)`?
1. What does `chunk` do in PyTorch?
1. Study the refactored version of `LSTMCell` carefully to ensure you understand how and why it does the same thing as the non-refactored version.
1. Why can we use a higher learning rate for `LMModel6`?
1. What are the three regularization techniques used in an AWD-LSTM model?
1. What is "dropout"?
1. Why do we scale the activations with dropout? Is this applied during training, inference, or both?
1. What is the purpose of this line from `Dropout`: `if not self.training: return x`?
1. Experiment with `bernoulli_` to understand how it works.
1. How do you set your model in training mode in PyTorch? In evaluation mode?
1. Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?
1. Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?
1. What is "weight tying" in a language model?
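
As a starting point for the questions on very large or very small activations and on floating-point precision, here is a minimal sketch in plain Python. It is a toy illustration, not the chapter's code: the helper `repeated_scale`, the factors 1.5 and 0.5, and the depth of 100 are all assumptions chosen for the example, and `math.ulp` requires Python 3.9 or later.

```python
import math

def repeated_scale(x, factor, n_layers):
    """Multiply x by the same factor n_layers times (a toy stand-in for
    passing an activation through many layers with identical scalar weights)."""
    for _ in range(n_layers):
        x *= factor
    return x

# Hypothetical settings: 100 "layers", one factor above 1 and one below 1.
grown = repeated_scale(1.0, 1.5, 100)
shrunk = repeated_scale(1.0, 0.5, 100)

# math.ulp(v) is the gap between v and the next representable float,
# so a smaller gap means finer resolution at that magnitude.
for value in (shrunk, 1.0, grown):
    print(f"value={value:.3e}  gap to next float={math.ulp(value):.3e}")
```

Try changing the factor and the number of layers, and compare how the gap between neighboring floats changes with the size of the value.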

### Further Research

1. In `LMModel2`, why can `forward` start with `h=0`? Why don't we need to say `h=torch.zeros(...)`?
1. Write the code for an LSTM from scratch (you may refer to <<lstm>>).
1. Search the internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get results similar to those we saw in this chapter. Compare your results to the results of PyTorch's built-in `GRU` module.
1. Take a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter.
