In [None]:
#hide
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#hide
from fastbook import *
from IPython.display import display,HTML

# NLP Deep Dive: RNNs

In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify reviews. That example highlighted a difference between transfer learning in NLP and computer vision: in general in NLP the pretrained model is trained on a different task.

What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to develop an understanding of the English (or other) language. Self-supervised learning can also be used in other domains; for instance, see ["Self-Supervised Learning and Computer Vision"](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pretraining a model used for transfer learning.

> jargon: Self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For instance, training a model to predict the next word in a text.

The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better. The Wikipedia English is slightly different from the IMDb English, so instead of jumping directly to the classifier, we could fine-tune our pretrained language model to the IMDb corpus and then use *that* as the base for our classifier.

Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targeting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of the IMDb dataset, there will be lots of names of movie directors and actors, and often a less formal style of language than that seen in Wikipedia.

We already saw that with fastai, we can download a pretrained English language model and use it to get state-of-the-art results for NLP classification. (We expect pretrained models in many more languages to be available soon—they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?

One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine-tune the (sequence-based) language model prior to fine-tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. Since there are 25,000 labeled reviews in the training set and 25,000 in the validation set, that makes 100,000 movie reviews altogether. We can use all of these reviews to fine-tune the pretrained language model, which was trained only on Wikipedia articles; this will result in a language model that is particularly good at predicting the next word of a movie review.

This is known as the Universal Language Model Fine-tuning (ULMFiT) approach. The [paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of fine-tuning of the language model, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarized in <<ulmfit_process>>.

<img alt="Diagram of the ULMFiT process" width="700" caption="The ULMFiT process" id="ulmfit_process" src="https://github.com/fastai/fastbook/blob/master/images/att_00027.png?raw=1">

We'll now explore how to apply a neural network to this language modeling problem, using the concepts introduced in the last two chapters. But before reading further, pause and think about how *you* would approach this.

## Text Preprocessing

It's not at all obvious how we're going to use what we've learned so far to build a language model. Sentences can be different lengths, and documents can be very long. So, how can we predict the next word of a sentence using a neural network? Let's find out!

We've already seen how categorical variables can be used as independent variables for a neural network. The approach we took for a single categorical variable was to:

1. Make a list of all possible levels of that categorical variable (we'll call this list the *vocab*).
1. Replace each level with its index in the vocab.
1. Create an embedding matrix for this containing a row for each level (i.e., for each item of the vocab).
1. Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step 2; this is equivalent to but faster and more efficient than a matrix that takes as input one-hot-encoded vectors representing the indexes.)
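
To make the last step concrete, here is a minimal sketch in plain Python (no PyTorch) showing that indexing a row of an embedding matrix gives the same result as multiplying a one-hot vector by that matrix. The vocab, the matrix values, and the helper names `one_hot` and `matvec` are made up for illustration.

```python
# Toy embedding matrix: one row per vocab item, three values per row.
vocab = ["xxunk", "the", "movie"]          # hypothetical vocab
emb = [
    [0.1, 0.2, 0.3],   # row for "xxunk"
    [0.4, 0.5, 0.6],   # row for "the"
    [0.7, 0.8, 0.9],   # row for "movie"
]

def one_hot(idx, n):
    """Return a length-n vector with a 1 at position idx."""
    return [1.0 if i == idx else 0.0 for i in range(n)]

def matvec(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

idx = vocab.index("movie")
via_lookup = emb[idx]                                 # direct row indexing
via_one_hot = matvec(one_hot(idx, len(vocab)), emb)   # one-hot vector times matrix
print(via_lookup == via_one_hot)
```

The lookup touches one row, while the one-hot product multiplies and sums over every row, which is why a dedicated embedding layer is the cheaper of the two.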

We can do nearly the same thing with text! What is new is the idea of a sequence. First we concatenate all of the documents in our dataset into one big long string and split it into words, giving us a very long list of words (or "tokens"). Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second to last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word. 
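
The offset between the independent and dependent variables can be sketched with two list slices. The token list here is a made-up stand-in for the very long list built from the whole corpus.

```python
# Hypothetical, already-tokenized stream standing in for the concatenated corpus.
tokens = ["xxbos", "this", "movie", "is", "great", "."]

x = tokens[:-1]   # first token through the second to last
y = tokens[1:]    # second token through the last

pairs = list(zip(x, y))  # each input token paired with the token that follows it
for inp, target in pairs:
    print(f"{inp!r} -> {target!r}")
```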

Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus (cinematographic terms or actors' names, for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model; but for new words we won't have anything, so we will just initialize the corresponding row with a random vector.
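
Here is a rough sketch of that idea in plain Python. The function name `build_embeddings`, the tiny vocabularies, and the embedding values are all hypothetical, and this is not how fastai implements the step internally; it only illustrates "copy the row if the word is known, otherwise start from a random vector."

```python
import random

def build_embeddings(new_vocab, pretrained_vocab, pretrained_emb, emb_dim, seed=0):
    """Copy rows for known words; use a random vector for words new to the corpus."""
    rng = random.Random(seed)
    row_of = {w: i for i, w in enumerate(pretrained_vocab)}
    emb = []
    for word in new_vocab:
        if word in row_of:
            emb.append(list(pretrained_emb[row_of[word]]))          # reuse pretrained row
        else:
            emb.append([rng.gauss(0, 1) for _ in range(emb_dim)])   # random init
    return emb

pretrained_vocab = ["the", "movie", "good"]
pretrained_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_vocab = ["the", "movie", "tarantino"]   # "tarantino" is new to the corpus
emb = build_embeddings(new_vocab, pretrained_vocab, pretrained_emb, emb_dim=2)
```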

Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:

- Tokenization:: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)
- Numericalization:: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab
- Language model data loader creation:: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- Language model creation:: We need a special kind of model that does something we haven't seen before: it handles input lists that could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network* (RNN). We will get to the details of these RNNs in <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.

Let's take a look at how each step works in detail.

### Tokenization

When we said "convert the text into a list of words," we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like "don't"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Polish where we can create really long words from many, many pieces? What about languages like Japanese and Chinese that don't use spaces at all, and don't really have a well-defined idea of *word*?

Because there is no one correct answer to these questions, there is no one approach to tokenization. There are three main approaches:

- Word-based:: Split a sentence on spaces, as well as applying language-specific rules to try to separate parts of meaning even when there are no spaces (such as turning "don't" into "do n't"). Generally, punctuation marks are also split into separate tokens.
- Subword based:: Split words into smaller parts, based on the most commonly occurring substrings. For instance, "occasion" might be tokenized as "o c ca sion."
- Character-based:: Split a sentence into its individual characters.

We'll be looking at word and subword tokenization here, and we'll leave character-based tokenization for you to implement in the questionnaire at the end of this chapter.
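
To get a feel for the two simplest approaches before using a real tokenizer, here is a deliberately crude sketch using only Python's `re` module. The functions `word_tokenize` and `char_tokenize` are hypothetical helpers, and the single contraction rule is nowhere near the rule set of a real word tokenizer.

```python
import re

def word_tokenize(text):
    """Very rough word tokenizer: split off "n't", then split words and punctuation."""
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)      # "don't" -> "do n't"
    return re.findall(r"n't|\w+|[^\w\s]", text)

def char_tokenize(text):
    """Character tokenizer: every character, including spaces, is a token."""
    return list(text)

sentence = "I don't like it."
word_toks = word_tokenize(sentence)
char_toks = char_tokenize(sentence)
print(word_toks)        # ['I', 'do', "n't", 'like', 'it', '.']
print(len(char_toks))   # 16
```

The same sentence is 6 tokens at the word level and 16 at the character level, which previews the trade-off between vocab size and sequence length discussed later.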

> jargon: token: One element of a list created by the tokenization process. It could be a word, part of a word (a _subword_), or a single character.

### Word Tokenization with fastai

Rather than providing its own tokenizers, fastai instead provides a consistent interface to a range of tokenizers in external libraries. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time, so the defaults that fastai uses change too. However, the API and options shouldn't change too much, since fastai tries to maintain a consistent API even as the underlying technology changes.

Let's try it out with the IMDb dataset that we used in <<chapter_intro>>:

In [None]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

We'll need to grab the text files in order to try out a tokenizer. Just as `get_image_files`, which we've used many times already, gets all the image files in a path, `get_text_files` gets all the text files in a path. We can also optionally pass `folders` to restrict the search to a particular list of subfolders:

In [None]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

Here's a review that we'll tokenize (we'll just print the start of it here to save space):

In [None]:
txt = files[0].open().read(); txt[:75]

'This movie, which I just discovered at the video store, has apparently sit '

As we write this book, the default English word tokenizer for fastai uses a library called *spaCy*. It has a sophisticated rules engine with special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not necessarily be spaCy, depending on when you're reading this).

Let's try it out. We'll use fastai's `coll_repr(collection, n)` function to display the results. This displays the first *`n`* items of *`collection`*, along with the full size—it's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:

In [None]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#201) ['This','movie',',','which','I','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','It',"'s",'easy','to','see'...]


As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split "it's" into "it" and "'s". That makes intuitive sense; these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. Fortunately, spaCy handles these pretty well for us—for instance, here we see that "." is separated when it terminates a sentence, but not in an acronym or number:

In [None]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

fastai then adds some additional functionality to the tokenization process with the `Tokenizer` class:

In [None]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','xxmaj','it',"'s",'easy'...]


Notice that there are now some tokens that start with the characters "xx", which is not a common word prefix in English. These are *special tokens*.

For example, the first item in the list, `xxbos`, is a special token that indicates the start of a new text ("BOS" is a standard NLP acronym that means "beginning of stream"). By recognizing this start token, the model will be able to learn it needs to "forget" what was said previously and focus on upcoming words.

These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized language—a language that is designed to be easy for a model to learn.

For instance, the rules will replace a sequence of four exclamation points with a special *repeated character* token, followed by the number four, and then a single exclamation point. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.

Here are some of the main special tokens you'll see:

- `xxbos`:: Indicates the beginning of a text (here, a review)
- `xxmaj`:: Indicates the next word begins with a capital (since we lowercased everything)
- `xxunk`:: Indicates the word is unknown

To see the rules that were used, you can check the default rules:

In [None]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

As always, you can look at the source code of each of them in a notebook by typing:

```
??replace_rep
```

Here is a brief summary of what each does:

- `fix_html`:: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
- `replace_rep`:: Replaces any character repeated three times or more with a special token for repetition (`xxrep`), the number of times it's repeated, then the character
- `replace_wrep`:: Replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word
- `spec_add_spaces`:: Adds spaces around / and #
- `rm_useless_spaces`:: Removes all repetitions of the space character
- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxup`) in front of it
- `replace_maj`:: Lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it
- `lowercase`:: Lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)
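
To show the flavor of these rules, here is a simplified re-creation of two of them using only the standard library. The names `replace_rep_sketch` and `replace_maj_sketch` are hypothetical, and the code is an approximation written for this explanation, not fastai's source (use `??replace_rep` to read the real thing).

```python
import re

def replace_rep_sketch(text):
    """Replace a character repeated 3+ times with: xxrep <count> <char>."""
    def _sub(m):
        char, run = m.group(1), m.group(0)
        return f" xxrep {len(run)} {char} "
    return re.sub(r"(\S)\1{2,}", _sub, text)

def replace_maj_sketch(tokens):
    """Lowercase capitalized tokens and put xxmaj in front of each one."""
    out = []
    for tok in tokens:
        if tok[:1].isupper() and tok[1:].islower():
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok)
    return out

text = replace_rep_sketch("Wow!!!!")
tokens = ["xxbos"] + replace_maj_sketch(text.split())
print(tokens)   # ['xxbos', 'xxmaj', 'wow', 'xxrep', '4', '!']
```

Four exclamation points become the repetition token, the number four, and a single exclamation point, and the capitalized word becomes `xxmaj` followed by its lowercase form.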

Let's take a look at a few of them in action:

In [None]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index'...]"

Now let's take a look at how subword tokenization would work.

### Subword Tokenization

In addition to the *word tokenization* approach seen in the last section, another popular tokenization method is *subword tokenization*. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 ("My name is Jeremy Howard" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a "word." There are also languages, like Turkish and Hungarian, that can add many subwords together without spaces, creating very long words that include a lot of separate pieces of information.

To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:

1. Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
2. Tokenize the corpus using this vocab of *subword units*.
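
The two steps can be sketched with a toy implementation. This is not the algorithm a production subword tokenizer uses; it simply counts frequent substrings and then splits greedily by longest match. The function names, the `max_len` limit, and the tiny corpus are all assumptions made for illustration.

```python
from collections import Counter

def learn_subwords(corpus, vocab_sz, max_len=4):
    """Step 1: keep all single characters plus the most frequent longer substrings."""
    counts = Counter()
    for word in corpus.split():
        for n in range(2, max_len + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    chars = sorted(set(corpus.replace(" ", "")))
    n_extra = max(vocab_sz - len(chars), 0)
    return set(chars) | {s for s, _ in counts.most_common(n_extra)}

def tokenize(word, vocab):
    """Step 2: split a word by greedily taking the longest substring in the vocab."""
    toks, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Fall back to a single character so the loop always advances.
            if word[i:j] in vocab or j == i + 1:
                toks.append(word[i:j])
                i = j
                break
    return toks

corpus = "the movie the moving mover moved the movies"
small = learn_subwords(corpus, vocab_sz=12)   # only room for single characters
large = learn_subwords(corpus, vocab_sz=40)   # room for frequent substrings too
print(tokenize("movies", small))
print(tokenize("movies", large))
```

With the small vocab the word falls apart into single characters; with the larger one, frequent pieces such as "movi" survive as single tokens.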

Let's look at an example. For our corpus, we'll use the first 2,000 movie reviews:

In [None]:
txts = L(o.open().read() for o in files[:2000])

We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to "train" it. That is, we need to have it read our documents and find the common sequences of characters to create the vocab. This is done with `setup`. As we'll see shortly, `setup` is a special fastai method that is called automatically in our usual data processing pipelines. Since we're doing everything manually at the moment, however, we have to call it ourselves. Here's a function that does these steps for a given vocab size, and shows an example output:

In [None]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

Let's try it out:

In [None]:
subword(1000)

'▁This ▁movie , ▁which ▁I ▁just ▁dis c over ed ▁at ▁the ▁video ▁st or e , ▁has ▁a p par ent ly ▁s it ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁dis t ri but or . ▁It'

When using fastai's subword tokenizer, the special character `▁` represents a space character in the original text.
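
Because the marker records where the spaces were, the original text can be recovered from the pieces. Here is a small sketch using the first few tokens of the output above, copied in by hand:

```python
# Subword tokens copied from the start of the `subword(1000)` output.
pieces = "▁This ▁movie , ▁which ▁I ▁just ▁dis c over ed ▁at ▁the ▁video ▁st or e".split(" ")

# Glue the pieces together, then turn each marker back into a space.
text = "".join(pieces).replace("▁", " ").strip()
print(text)   # This movie, which I just discovered at the video store
```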

If we use a smaller vocab, then each token will represent fewer characters, and it will take more tokens to represent a sentence:

In [None]:
subword(200)

'▁ T h i s ▁movie , ▁w h i ch ▁I ▁ j us t ▁ d i s c o ver ed ▁a t ▁the ▁ v id e o ▁ st or e , ▁h a s'

On the other hand, if we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence:

In [None]:
subword(10000)

"▁This ▁movie , ▁which ▁I ▁just ▁discover ed ▁at ▁the ▁video ▁store , ▁has ▁apparently ▁sit ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁distributor . ▁It ' s ▁easy ▁to ▁see ▁why . ▁The ▁story ▁of ▁two ▁friends ▁living"

Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.

Overall, subword tokenization provides a way to easily scale between character tokenization (i.e., using a small subword vocab) and word tokenization (i.e., using a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other "languages" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!).

Once our texts have been split into tokens, we need to convert them to numbers. We'll look at that next.

### Numericalization with fastai

*Numericalization* is the process of mapping tokens to integers. The steps are basically identical to those necessary to create a `Category` variable, such as the dependent variable of digits in MNIST:

1. Make a list of all possible levels of that categorical variable (the vocab).
1. Replace each level with its index in the vocab.

Let's take a look at this in action on the word-tokenized text we saw earlier:

In [None]:
toks = tkn(txt)
print(coll_repr(toks, 31))

(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','xxmaj','it',"'s",'easy'...]


Just like with `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the vocab. That means we'll need our tokenized corpus first. Since tokenization takes a while, it's done in parallel by fastai; but for this manual walkthrough, we'll use a small subset:

In [None]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...]

We can pass this to `setup` to create our vocab:

In [None]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#2000) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','i','it'...]"

Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults for `Numericalize` are `min_freq=3,max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the most common 60,000 with a special *unknown word* token, `xxunk`. This is useful to avoid having an overly large embedding matrix, since that can slow down training and use up too much memory, and can also mean that there isn't enough data to train useful representations for rare words. However, this last issue is better handled by setting `min_freq`; the default `min_freq=3` means that any word appearing fewer than three times is replaced with `xxunk`.

fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the `vocab` parameter.
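
To see how `min_freq` and `max_vocab` shape the vocab, here is a simplified stand-in written with `collections.Counter`. The functions `make_vocab` and `numericalize` are hypothetical, only a subset of the special tokens is included, and the details will differ from what `Numericalize` does internally.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos"]  # a subset of the special tokens

def make_vocab(docs, min_freq=3, max_vocab=60000):
    """Special tokens first, then words seen at least min_freq times, most frequent first."""
    counts = Counter(tok for doc in docs for tok in doc)
    words = [w for w, c in counts.most_common(max_vocab)
             if c >= min_freq and w not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Map each token to its vocab index; anything missing maps to xxunk (index 0)."""
    o2i = {w: i for i, w in enumerate(vocab)}
    return [o2i.get(tok, 0) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "movie", "was", "bad"],
        ["xxbos", "the", "film", "was", "good"]]
vocab = make_vocab(docs, min_freq=2)
ids = numericalize(["xxbos", "the", "sequel", "was", "good"], vocab)
print(vocab)   # ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'the', 'was', 'movie', 'good']
print(ids)     # [2, 4, 0, 5, 7]
```

"bad" and "film" appear only once, so they fall below `min_freq=2` and never enter the vocab; the unseen word "sequel" maps to index 0, the position of `xxunk`.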

Once we've created our `Numericalize` object, we can use it as if it were a function:

In [None]:
nums = num(toks)[:20]; nums

tensor([  2,   8,  21,  28,  11,  90,  18,  59,   0,  45,   9, 351, 499,  11,  72, 533, 584, 146,  29,  12])

This time, our tokens have been converted to a tensor of integers that our model can receive. We can check that they map back to the original text:

In [None]:
' '.join(num.vocab[o] for o in nums)

'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'

Now that we have numbers, we need to put them in batches for our model.

### Putting Our Texts into Batches for a Language Model

When dealing with images, we needed to resize them all to the same height and width before grouping them together in a mini-batch so they could stack together efficiently in a single tensor. Here it's going to be a little different, because one cannot simply resize text to a desired length. Also, we want our language model to read text in order, so that it can efficiently predict what the next word is. This means that each new batch should begin precisely where the previous one left off.

Suppose we have the following text:

> : In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while.

The tokenization process will add special tokens and deal with punctuation to return this text:

> : xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \n xxmaj then we will study how we build a language model and train it for a while .

We now have 90 tokens, separated by spaces. Let's say we want a batch size of 6. We need to break this text into 6 contiguous parts of length 15:

In [None]:
#hide_input
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In a perfect world, we could then give this one batch to our model. But that approach doesn't scale, because outside of this toy example it's unlikely that a single batch containing all the texts would fit in our GPU memory (here we have 90 tokens, but all the IMDb reviews together add up to several million).

So, we need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains a state so that it remembers what it read previously when predicting what comes next. 

Going back to our previous example with 6 rows of length 15, if we chose a sequence length of 5, that would mean we first feed the following array:

In [None]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


Then this one:

In [None]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train


And finally:

In [None]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


Going back to our movie reviews dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order of the inputs, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside them, or the texts would not make sense anymore!).

We then cut this stream into a certain number of contiguous mini-streams (this number is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...), because we want the model to read continuous rows of text (as in the preceding example). An `xxbos` token is added at the start of each text during preprocessing, so that as the model reads the stream it knows when a new entry is beginning.

So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length we picked.
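
To make the recap concrete, here is a minimal pure-Python sketch of that batching scheme. It is not fastai's `LMDataLoader` code: `lm_batches` is a hypothetical helper, and a list of integers stands in for the numericalized stream. Each batch pairs inputs with targets shifted by one token, and each row of a batch picks up where the same row of the previous batch stopped.

```python
def lm_batches(tokens, bs, seq_len):
    """Yield (x, y) batches for next-token prediction from one token stream."""
    # Each of the bs rows is a contiguous mini-stream; leftover tokens are dropped.
    row_len = (len(tokens) - 1) // bs
    # Keep one extra token per row so the last input still has a target.
    rows = [tokens[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    # Walk along the rows in steps of seq_len; targets are inputs shifted by one.
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(91))                 # stand-in for 91 numericalized tokens
batches = list(lm_batches(stream, bs=6, seq_len=5))
print(len(batches))                      # number of batches, each 6 rows by 5 tokens
print(batches[0][0][0], batches[0][1][0])  # first input row and its shifted target
print(batches[1][0][0])                  # continues where the first input row stopped
```

Because row *i* of every batch continues row *i* of the batch before it, a model that carries its hidden state from one batch to the next sees each mini-stream as one unbroken text.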

This is all done behind the scenes by the fastai library when we create an `LMDataLoader`. We do this by first applying our `Numericalize` object to the tokenized texts:

In [None]:
nums200 = toks200.map(num)

and then passing that to `LMDataLoader`:

In [None]:
dl = LMDataLoader(nums200)

Let's confirm that this gives the expected results, by grabbing the first batch:

In [None]:
x,y = first(dl)
x.shape,y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

and then looking at the first row of the independent variable, which should be the start of the first text:

In [None]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'

The dependent variable is the same thing offset by one token:

In [None]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a couple'

This concludes all the preprocessing steps we need to apply to our data. We are now ready to train our text classifier.

## Training a Text Classifier

As we saw at the beginning of this chapter, there are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.

As usual, let's start with assembling our data.

### Language Model Using DataBlock

fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenize` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debugging—but you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.

Here's how we use `TextBlock` to create a language model, using fastai's defaults:

In [None]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

One thing that's different to previous types we've used in `DataBlock` is that we're not just using the class directly (i.e., `TextBlock(...)`), but instead are calling a *class method*. A class method is a Python method that, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible it performs a few optimizations:

- It saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once
- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs

We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing—that's what `from_folder` does.
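
If class methods are new to you, the following toy example shows the pattern. It is unrelated to fastai's internals, and `TextSource` and `from_lines` are made-up names: a class method is called on the class itself and typically serves as an alternative constructor that does some setup before building the object.

```python
class TextSource:
    """Toy stand-in (hypothetical, not a fastai class) for an object built from texts."""

    def __init__(self, texts):
        self.texts = texts
        # Setup that needs to see all the data: build the vocab once.
        self.vocab = sorted({tok for t in texts for tok in t.split()})

    @classmethod
    def from_lines(cls, blob):
        """Alternative constructor: called on the class, which arrives as `cls`."""
        return cls([line for line in blob.splitlines() if line.strip()])

# No instance exists yet; we call the method on the class to create one.
src = TextSource.from_lines("a good movie\n\na bad movie")
print(src.texts)
print(src.vocab)
```

Here `from_lines` plays a role loosely analogous to `from_folder`: it tells the class where the raw data comes from, so the object can do its up-front work before it is used.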

`show_batch` then works in the usual way:

In [None]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard","xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard xxunk"
1,"what xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \n\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this","xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \n\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this is"


Now that our data is ready, we can fine-tune the pretrained language model.

### Fine-Tuning the Language Model

To convert the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modeling. Then we'll feed those embeddings into a *recurrent neural network* (RNN), using an architecture called *AWD-LSTM* (we will show you how to write such a model from scratch in <<chapter_nlp_dive>>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()
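
As a rough illustration of the embedding merge described above, the sketch below copies pretrained rows for known tokens and creates fresh random rows for new ones. The function name `merge_embeddings`, the tiny vocabularies, and the small Gaussian initialization are all assumptions for illustration; the actual logic inside `language_model_learner` may differ.

```python
import random

def merge_embeddings(pre_vocab, pre_emb, new_vocab, rng):
    """Build an embedding table for new_vocab, reusing pretrained rows where possible."""
    dim = len(pre_emb[0])
    row_of = {tok: i for i, tok in enumerate(pre_vocab)}
    merged = []
    for tok in new_vocab:
        if tok in row_of:
            merged.append(list(pre_emb[row_of[tok]]))               # keep pretrained vector
        else:
            merged.append([rng.gauss(0, 0.1) for _ in range(dim)])  # new word: random init (assumed)
    return merged

pre_vocab = ["xxunk", "the", "movie"]
pre_emb = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
imdb_vocab = ["xxunk", "movie", "the", "alucard"]   # "alucard" is missing from pre_vocab
emb = merge_embeddings(pre_vocab, pre_emb, imdb_vocab, random.Random(0))
print(emb[1], emb[2])   # pretrained rows for "movie" and "the"
print(emb[3])           # freshly initialized row for "alucard"
```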

The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The *perplexity* metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.
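
To see the relationship between the two numbers, here is a small worked example using only Python's `math` module. The probabilities in `p_true` are invented for illustration; the last line applies the same formula to a validation loss of 3.912788 to show how a loss value maps to a perplexity.

```python
import math

def cross_entropy(probs_of_correct):
    """Mean negative log-probability the model assigned to the correct next tokens."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)

# Hypothetical probabilities given to the true next word at four positions.
p_true = [0.5, 0.25, 0.125, 0.125]
loss = cross_entropy(p_true)
perplexity = math.exp(loss)
print(round(loss, 4), round(perplexity, 4))

# Perplexity is exp(loss), so a loss of 3.912788 corresponds to roughly 50.04.
print(round(math.exp(3.912788), 2))
```

One way to read perplexity: a value of about 50 means the model is, on average, as uncertain as if it were choosing uniformly among about 50 words at each step.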

Let's go back to the process diagram from the beginning of this chapter. The first arrow has been completed for us and made available as a pretrained model in fastai, and we've just built the `DataLoaders` and `Learner` for the second stage. Now we're ready to fine-tune our language model!

<img alt="Diagram of the ULMFiT process" width="450" src="https://github.com/fastai/fastbook/blob/master/images/att_00027.png?raw=1">

It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll use `fit_one_cycle`. Just like `vision_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights—i.e., embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.120048,3.912788,0.299565,50.038246,11:39


This model takes a while to train, so it's a good opportunity to talk about saving intermediary results. 

### Saving and Loading Models

You can easily save the state of your model like so:

In [None]:
learn.save('1epoch')

This will create a file in `learn.path/models/` named *1epoch.pth*. If you want to load your model on another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:

In [None]:
learn = learn.load('1epoch')

Once the initial training has completed, we can continue fine-tuning the model after unfreezing:

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.893486,3.77282,0.317104,43.502548,12:37
1,3.820479,3.717197,0.32379,41.14888,12:30
2,3.735622,3.65976,0.330321,38.851997,12:09
3,3.677086,3.624794,0.33396,37.516987,12:12
4,3.636646,3.6013,0.337017,36.645859,12:05
5,3.553636,3.584241,0.339355,36.026001,12:04
6,3.507634,3.571892,0.341353,35.583862,12:08
7,3.444101,3.565988,0.342194,35.374371,12:08
8,3.398597,3.566283,0.342647,35.384815,12:11
9,3.375563,3.568166,0.342528,35.4515,12:05


Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the *encoder*. We can save it with `save_encoder`:

In [None]:
learn.save_encoder('finetuned')

> jargon: Encoder: The model not including the task-specific final layer(s). This term means much the same thing as _body_ when applied to vision CNNs, but "encoder" tends to be more used for NLP and generative models.

This completes the second stage of the text classification process: fine-tuning the language model. We can now use it to fine-tune a classifier using the IMDb sentiment labels.

### Text Generation

Before we move on to fine-tuning the classifier, let's quickly try something different: using our model to generate random reviews. Since it's trained to guess what the next word of the sentence is, we can use the model to write new reviews:

In [None]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story
i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the " evil " machine has to be used to protect


As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so we don't get exactly the same review twice. Our model doesn't have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalizes properly (*I* is just transformed to *i* because our rules require two characters or more to consider a word as capitalized, so it's normal to see it lowercased) and is using consistent tense. The general review makes sense at first glance, and it's only if you read carefully that you can notice something is a bit off. Not bad for a model trained in a couple of hours! 
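
The randomness mentioned above can be sketched in a few lines of plain Python. This is a common formulation of temperature sampling, not a copy of what `learn.predict` does internally; the function names and the three made-up scores in `logits` are assumptions for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, rng):
    """Pick the index of the next token at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                       # hypothetical scores for a 3-word vocab
rng = random.Random(42)
picks = [sample_next(logits, 0.75, rng) for _ in range(1000)]
p_cool = softmax_with_temperature(logits, 0.75)
p_warm = softmax_with_temperature(logits, 1.5)
print(p_cool[0] > p_warm[0])                   # True: lower temperature favors the top word more
```

A temperature below 1 (such as the 0.75 used above) makes the most likely words even more likely, which tends to give more coherent but less varied text; a temperature above 1 does the opposite.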

But our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that.

### Creating the Classifier DataLoaders

We're now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label—in the case of IMDb, it's the sentiment of a document.

This means that the structure of our `DataBlock` for NLP classification will look very familiar. It's actually nearly the same as we've seen for the many image classification datasets we've worked with:

In [None]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

Just like with image classification, `show_batch` shows the dependent variable (sentiment, in this case) with each independent variable (movie review text):

In [None]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos i rate this movie with 3 skulls , only coz the girls knew how to scream , this could 've been a better movie , if actors were better , the twins were xxup ok , i believed they were evil , but the eldest and youngest brother , they sucked really bad , it seemed like they were reading the scripts instead of acting them … . spoiler : if they 're vampire 's why do they freeze the blood ? vampires ca n't drink frozen blood , the sister in the movie says let 's drink her while she is alive … .but then when they 're moving to another house , they take on a cooler they 're frozen blood . end of spoiler \n\n it was a huge waste of time , and that made me mad coz i read all the reviews of how",neg
1,"xxbos i have read all of the xxmaj love xxmaj come xxmaj softly books . xxmaj knowing full well that movies can not use all aspects of the book , but generally they at least have the main point of the book . i was highly disappointed in this movie . xxmaj the only thing that they have in this movie that is in the book is that xxmaj missy 's father comes to xxunk in the book both parents come ) . xxmaj that is all . xxmaj the story line was so twisted and far fetch and yes , sad , from the book , that i just could n't enjoy it . xxmaj even if i did n't read the book it was too sad . i do know that xxmaj pioneer life was rough , but the whole movie was a downer . xxmaj the rating",neg
2,"xxbos xxmaj this , for lack of a better term , movie is lousy . xxmaj where do i start … … \n\n xxmaj cinemaphotography - xxmaj this was , perhaps , the worst xxmaj i 've seen this year . xxmaj it looked like the camera was being tossed from camera man to camera man . xxmaj maybe they only had one camera . xxmaj it gives you the sensation of being a volleyball . \n\n xxmaj there are a bunch of scenes , haphazardly , thrown in with no continuity at all . xxmaj when they did the ' split screen ' , it was absurd . xxmaj everything was squished flat , it looked ridiculous . \n\n xxmaj the color tones were way off . xxmaj these people need to learn how to balance a camera . xxmaj this ' movie ' is poorly made , and",neg


Looking at the `DataBlock` definition, every piece is familiar from previous data blocks we've built, with two important exceptions:

- `TextBlock.from_folder` no longer has the `is_lm=True` parameter.
- We pass the `vocab` we created for the language model fine-tuning.

The reason that we pass the `vocab` of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.
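
A tiny example shows what goes wrong when the two vocabs differ. The vocab lists and the `numericalize` helper below are hypothetical; the point is only that an embedding row learned for index *i* belongs to whatever token the language model's vocab had at position *i*.

```python
lm_vocab = ["xxunk", "xxpad", "xxbos", "the", "movie", "great"]
# A vocab rebuilt from the classifier data alone could order tokens differently.
other_vocab = ["xxunk", "xxpad", "xxbos", "great", "the", "movie"]

def numericalize(tokens, vocab):
    """Map tokens to indices; unknown tokens fall back to index 0 (xxunk here)."""
    o2i = {tok: i for i, tok in enumerate(vocab)}
    return [o2i.get(tok, 0) for tok in tokens]

tokens = ["xxbos", "the", "movie", "great"]
ids_shared = numericalize(tokens, lm_vocab)
ids_other = numericalize(tokens, other_vocab)

# The encoder's embedding rows follow lm_vocab, so decode both through it.
print([lm_vocab[i] for i in ids_shared])   # the encoder sees the intended words
print([lm_vocab[i] for i in ids_other])    # same text, but the encoder sees scrambled words
```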

By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a mini-batch. Let's see this with an example, by trying to create a mini-batch containing the first 10 documents. First we'll numericalize them:

In [None]:
nums_samp = toks200[:10].map(num)

Let's now look at how many tokens each of these 10 movie reviews has:

In [None]:
nums_samp.map(len)

(#10) [228,238,121,290,196,194,533,124,581,155]

Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape (i.e., it has some particular length on every axis, and all items must be consistent). This should sound familiar: we had the same issue with images. In that case, we used cropping, padding, and/or squishing to make all the inputs the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!). You can't really "squish" a document. So that leaves padding!

We will expand the shorter texts to make them all the same size. To do this, we use a special padding token that will be ignored by our model. Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same length (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend to be of similar lengths. We won't pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, but at the time of writing no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon, however, so keep an eye on the book's website; we'll add information about this as soon as we have it working well.)

The sorting and padding are automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)
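
To get a feel for why sorting helps, here is a minimal sketch of sort-then-pad using the ten document lengths shown earlier. It is an illustration only, not fastai's collation code: `padded_batches` is a hypothetical helper, the padding index `PAD = 1` is an assumption, padding is appended at the end purely for simplicity, and the shuffling used for the training set is left out.

```python
PAD = 1  # hypothetical index of the padding token

def padded_batches(docs, bs):
    """Group documents of similar length and pad each batch to its own longest document."""
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]), reverse=True)
    for k in range(0, len(order), bs):
        batch = [docs[i] for i in order[k:k + bs]]
        width = len(batch[0])                  # longest document in this batch
        yield [d + [PAD] * (width - len(d)) for d in batch]

lengths = [228, 238, 121, 290, 196, 194, 533, 124, 581, 155]
docs = [[7] * n for n in lengths]              # dummy token ids, one list per review
batches = list(padded_batches(docs, bs=5))

widths = [len(b[0]) for b in batches]
pad_sorted = sum(row.count(PAD) for b in batches for row in b)
pad_global = len(docs) * max(lengths) - sum(lengths)   # if every review were padded to the longest
print(widths, pad_sorted, pad_global)
```

With these lengths, padding per batch after sorting needs far fewer padding tokens than padding every review to the longest one, which is where the memory and speed savings come from.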

We can now create a model to classify our texts:

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded:

In [None]:
learn = learn.load_encoder('finetuned')

### Fine-Tuning the Classifier

The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference:

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.347427,0.18448,0.92932,00:33


In just one epoch we get the same result as our training in <<chapter_intro>>: not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.247763,0.171683,0.93464,00:37
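
The `slice(1e-2/(2.6**4),1e-2)` argument is how we ask for discriminative learning rates: the lowest rate goes to the earliest parameter group and the highest to the last. The sketch below assumes the model is split into five parameter groups and that the rates in between are spaced geometrically; `spread_lrs` is a hypothetical helper written for this illustration, not a fastai function.

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically spaced learning rates from lo (first group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

lrs = spread_lrs(1e-2 / 2.6**4, 1e-2, n_groups=5)
ratios = [later / earlier for earlier, later in zip(lrs, lrs[1:])]
print([f"{lr:.2e}" for lr in lrs])
print([round(r, 3) for r in ratios])   # each group trains 2.6x faster than the one before it
```

Under these assumptions, dividing by `2.6**4` means that each of the four steps between five groups multiplies the learning rate by 2.6.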


Then we can unfreeze a bit more, and continue training:

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.193377,0.156696,0.9412,00:45
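
The gradual unfreezing schedule can be summarized with a small sketch. It assumes five parameter groups and the convention that `freeze_to(n)` leaves the groups from index `n` onward trainable (so `freeze` behaves like `freeze_to(-1)` and `unfreeze` like `freeze_to(0)`); the helper `trainable_after_freeze_to` is hypothetical and only tracks which groups would be trained.

```python
def trainable_after_freeze_to(n_groups, n):
    """Per-group flags: True if the group is trained after freeze_to(n)."""
    first_trainable = n if n >= 0 else n_groups + n   # negative n counts from the end
    return [i >= first_trainable for i in range(n_groups)]

stages = [("freeze", -1), ("freeze_to(-2)", -2), ("freeze_to(-3)", -3), ("unfreeze", 0)]
schedule = {name: trainable_after_freeze_to(5, n) for name, n in stages}
for name, flags in schedule.items():
    # "T" marks a trainable group, "." a frozen one, earliest group first.
    print(f"{name:14s}", "".join("T" if f else "." for f in flags))
```

Each stage trains one more group than the stage before, working backward from the classifier head toward the embeddings.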


And finally, the whole model!

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.172888,0.15377,0.94312,01:01
1,0.161492,0.155567,0.94264,00:57


We reached 94.3% accuracy, which was state-of-the-art performance just three years ago. By training another model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, by fine-tuning a much bigger model and using expensive data augmentation techniques (translating sentences into another language and back, using another model for translation).
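
Averaging the predictions of a forward and a backward model is simple to express. The probabilities below are invented for illustration (two reviews, columns for `neg` and `pos`), and `average_predictions` is a hypothetical helper; the sketch only shows the averaging step, not how the backward model is trained.

```python
def average_predictions(p_fwd, p_bwd):
    """Element-wise mean of two models' class probabilities."""
    return [[(a + b) / 2 for a, b in zip(row_f, row_b)]
            for row_f, row_b in zip(p_fwd, p_bwd)]

def predicted_class(row):
    """Index of the highest probability in one row."""
    return max(range(len(row)), key=row.__getitem__)

fwd = [[0.60, 0.40], [0.45, 0.55]]   # hypothetical forward-model probabilities
bwd = [[0.70, 0.30], [0.65, 0.35]]   # hypothetical backward-model probabilities
avg = average_predictions(fwd, bwd)

fwd_labels = [predicted_class(r) for r in fwd]
ens_labels = [predicted_class(r) for r in avg]
print(fwd_labels, ens_labels)        # the ensemble can overturn a weakly held prediction
```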

Using a pretrained model let us build a fine-tuned language model that was pretty powerful, to either generate fake reviews or help classify them. This is exciting stuff, but it's good to remember that this technology can also be used for malign purposes.

## Disinformation and Language Models

Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analyzed the comments that were sent to the US Federal Communications Commission (FCC) regarding a 2017 proposal to repeal net neutrality. In his article ["More than a Million Pro-Repeal Net Neutrality Comments Were Likely Faked"](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6), he reports how he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Mad Libs-style mail merge. In <<disinformation>>, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature.

<img src="https://github.com/fastai/fastbook/blob/master/images/ethics/image16.png?raw=1" width="700" id="disinformation" caption="Comments received by the FCC during the net neutrality debate">

Kao estimated that "less than 800,000 of the 22M+ comments… could be considered truly unique" and that "more than 99% of the truly unique comments were in favor of keeping net neutrality."

Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now.  You now have all the necessary tools at your disposal to create a compelling language model—that is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dialogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending.

<img src="https://github.com/fastai/fastbook/blob/master/images/ethics/image14.png?raw=1" id="ethics_reddit" caption="An algorithm talking to itself on Reddit" alt="An algorithm talking to itself on Reddit" width="600">

In this case, it was explicitly stated that an algorithm was used, but imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithm to gradually develop followers and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine getting to a point where the vast majority of discourse online came from bots, and nobody would have any idea that it was happening.

We are already starting to see examples of machine learning being used to generate identities. For example, <<katie_jones>> shows a LinkedIn profile for Katie Jones.

<img src="https://github.com/fastai/fastbook/blob/master/images/ethics/image15.jpeg?raw=1" width="400" id="katie_jones" caption="Katie Jones's LinkedIn profile">

Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see was auto-generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Center for Strategic and International Studies.

Many people assume or hope that algorithms will come to our defense here: that we will develop classification algorithms that can automatically recognize autogenerated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms.

## Conclusion

In this chapter we explored the last application covered out of the box by the fastai library: text. We saw two types of models: language models that can generate texts, and a classifier that determines if a review is positive or negative. To build a state-of-the-art classifier, we used a pretrained language model, fine-tuned it to the corpus of our task, then used its body (the encoder) with a new head to do the classification.
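The three steps recapped above can be sketched with fastai's high-level text API. This is a minimal sketch rather than the exact code used earlier in the chapter: it assumes the IMDb dataset, one epoch per stage by default, illustrative learning rates, and a hypothetical encoder file name `'finetuned'`. The function name `train_sentiment_classifier` is also invented here. The fastai imports sit inside the function so that defining it triggers no download or training.

```python
def train_sentiment_classifier(lm_epochs=1, clas_epochs=1):
    """Sketch of the pretrain -> fine-tune -> classify recipe on IMDb.

    Assumes fastai is installed; downloads IMDb on the first call.
    Epoch counts and learning rates are illustrative, not tuned.
    """
    from fastai.text.all import (
        AWD_LSTM, Perplexity, TextDataLoaders, URLs, accuracy,
        language_model_learner, text_classifier_learner, untar_data,
    )

    path = untar_data(URLs.IMDB)

    # Step 1: start from a pretrained language model (AWD_LSTM weights).
    # is_lm=True makes the next token the label; valid_pct holds out 10% of the texts.
    dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
    lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
                                metrics=[accuracy, Perplexity()])

    # Step 2: fine-tune it on the review corpus, then save only the encoder (the body).
    lm.fit_one_cycle(lm_epochs, 2e-2)
    lm.save_encoder('finetuned')  # hypothetical file name

    # Step 3: build a classifier that shares the language model's vocabulary,
    # load the fine-tuned encoder under a new head, and train on the labels.
    dls_clas = TextDataLoaders.from_folder(path, valid='test',
                                           text_vocab=dls_lm.vocab)
    clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                   metrics=accuracy)
    clas = clas.load_encoder('finetuned')
    clas.fit_one_cycle(clas_epochs, 2e-2)
    return clas
```

Calling `train_sentiment_classifier()` runs all three stages end to end. In practice the classifier stage is likely to benefit from gradual unfreezing rather than a single `fit_one_cycle` call.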

Before we end this section, we'll take a look at how the fastai library can help you assemble your data for your specific problems.

## Questionnaire

1. What is "self-supervised learning"?
1. What is a "language model"?
1. Why is a language model considered self-supervised?
1. What are self-supervised models usually used for?
1. Why do we fine-tune language models?
1. What are the three steps to create a state-of-the-art text classifier?
1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
1. What are the three steps to prepare your data for a language model?
1. What is "tokenization"? Why do we need it?
1. Name three different approaches to tokenization.
1. What is `xxbos`?
1. List four rules that fastai applies to text during tokenization.
1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
1. What is "numericalization"?
1. Why might there be words that are replaced with the "unknown word" token?
1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
1. Why do we need padding for text classification? Why don't we need it for language modeling?
1. What does an embedding matrix for NLP contain? What is its shape?
1. What is "perplexity"?
1. Why do we have to pass the vocabulary of the language model to the classifier data block?
1. What is "gradual unfreezing"?
1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

### Further Research

1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?
