In [1]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

NLP From Scratch: Translation with a Sequence to Sequence Network and Attention
===============================================================================

**Author**: [Sean Robertson](https://github.com/spro)

This is the third and final tutorial on doing "NLP From Scratch",
where we write our own classes and functions to preprocess the data to
do our NLP modeling tasks. We hope after you complete this tutorial that
you'll proceed to learn how `torchtext` can handle much of
this preprocessing for you in the three tutorials immediately following
this one.

In this project we will be teaching a neural network to translate from
French to English.

``` {.sourceCode .sh}
[KEY: > input, = target, < output]

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .

> vous etes trop maigre .
= you re too skinny .
< you re all alone .
```

... to varying degrees of success.

This is made possible by the simple but powerful idea of the [sequence
to sequence network](https://arxiv.org/abs/1409.3215), in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/seq2seq.png)

To improve upon this model we'll use an [attention
mechanism](https://arxiv.org/abs/1409.0473), which lets the decoder
learn to focus over a specific range of the input sequence.

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-   <https://pytorch.org/> for installation instructions
-   [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)
    to get started with PyTorch in general
-   [Learning PyTorch with Examples](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html) for
    a wide and deep overview
-   [PyTorch for Former Torch Users](https://pytorch.org/tutorials/beginner/former_torchies_tutorial.html)
    if you are a former Lua Torch user

It would also be useful to know about Sequence to Sequence networks and
how they work:

-   [Learning Phrase Representations using RNN Encoder-Decoder for
    Statistical Machine Translation](https://arxiv.org/abs/1406.1078)
-   [Sequence to Sequence Learning with Neural
    Networks](https://arxiv.org/abs/1409.3215)
-   [Neural Machine Translation by Jointly Learning to Align and
    Translate](https://arxiv.org/abs/1409.0473)
-   [A Neural Conversational Model](https://arxiv.org/abs/1506.05869)

You will also find the previous tutorials on [NLP From Scratch: Classifying Names with a Character-Level RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html) and [NLP From Scratch: Generating Names with a Character-Level RNN](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html) helpful as those concepts are very similar to the Encoder and Decoder models, respectively.

**Requirements**


In [2]:
from __future__ import unicode_literals, print_function, division
# The __future__ imports enable Python 3 behavior when running under Python 2
# (they have no effect on Python 3):
#   unicode_literals: string literals are Unicode strings rather than byte strings.
#   print_function: print is a function, print(), rather than a statement.
#   division: / always performs true (floating-point) division; // is integer division.
from io import open
import unicodedata
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Loading data files
==================

The data for this project is a set of many thousands of English to
French translation pairs.

[This question on Open Data Stack
Exchange](https://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages)
pointed me to the open translation site <https://tatoeba.org/> which has
downloads available at <https://tatoeba.org/eng/downloads> - and better
yet, someone did the extra work of splitting language pairs into
individual text files here: <https://www.manythings.org/anki/>

The English to French pairs are too big to include in the repository, so
download to `data/eng-fra.txt` before continuing. The file is a tab
separated list of translation pairs:

``` {.sourceCode .sh}
I am cold.    J'ai froid.
```
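
A line in this format splits into its two sentences on the tab character. The sketch below uses a small in-memory string (`raw`, a stand-in for the file contents, not the real data) to show the split and the optional reversal of each pair:

```python
# In-memory stand-in for the contents of a tab-separated pairs file.
raw = "I am cold.\tJ'ai froid.\nI am happy.\tJe suis content.\n"

lines = raw.strip().split('\n')                        # one string per line
pairs = [line.split('\t') for line in lines]           # [english, french]
reversed_pairs = [list(reversed(p)) for p in pairs]    # [french, english]

print(pairs[0])
print(reversed_pairs[0])
```

Reversing each pair is all it takes to switch the translation direction, since the file itself always lists English first.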

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<p>Download the data from <a href="https://download.pytorch.org/tutorial/data.zip">here</a> and extract it to the current directory.</p>
</div>


Similar to the character encoding used in the character-level RNN
tutorials, we will be representing each word in a language as a one-hot
vector, or giant vector of zeros except for a single one (at the index
of the word). Compared to the dozens of characters that might exist in a
language, there are many many more words, so the encoding vector is much
larger. We will however cheat a bit and trim the data to only use a few
thousand words per language.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/word-encoding.png)
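
As a minimal sketch of the encoding described above, assuming a made-up five-word vocabulary (`vocab` is illustrative, not the tutorial's data), a one-hot vector can be built with `torch.nn.functional.one_hot`:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary; the real vocabularies hold thousands of words.
vocab = ["SOS", "EOS", "je", "suis", "froid"]
word_index = vocab.index("suis")  # position of the word in the vocabulary

# One-hot vector: all zeros except a single one at the word's index.
one_hot = F.one_hot(torch.tensor(word_index), num_classes=len(vocab))
print(one_hot)
```

The vector length equals the vocabulary size, which is why trimming the vocabulary keeps the encoding manageable.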


We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called `Lang` which has word → index (`word2index`) and index → word
(`index2word`) dictionaries, as well as a count of each word
`word2count` which will be used to replace rare words later.
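
To make the bookkeeping concrete, here is a plain-dictionary sketch of the three mappings on a toy sentence. The variable names mirror the description above, while the sample text and the `count < 2` rarity threshold are assumptions chosen only for illustration:

```python
# Plain-dict sketch of the word <-> index bookkeeping (illustrative only).
word2index, word2count = {}, {}
index2word = {0: "SOS", 1: "EOS"}
n_words = 2  # indices 0 and 1 are reserved for SOS and EOS

for word in "je suis froid je suis".split(' '):
    if word not in word2index:
        word2index[word] = n_words
        index2word[n_words] = word
        word2count[word] = 1
        n_words += 1
    else:
        word2count[word] += 1

# Turn a sentence into indices and append the EOS index (1).
indexes = [word2index[w] for w in "je suis froid".split(' ')] + [1]
# Words seen fewer than twice could be treated as rare (assumed threshold).
rare = [w for w, c in word2count.items() if c < 2]
print(indexes, rare)
```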

In [3]:
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

The files are all in Unicode; to simplify, we will turn Unicode
characters to ASCII, make everything lowercase, and trim most
punctuation.

In [4]:
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        # 'Mn' is the Unicode category "Mark, Nonspacing" (accents and other combining marks)
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # Lowercase the string, strip surrounding whitespace, then drop accents
    # with unicodeToAscii (defined above).
    s = re.sub(r"([.!?])", r" \1", s)
    # Insert a space before each '.', '!' or '?',
    # e.g. "Hello!" -> "Hello !" and "What?" -> "What ?".
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
    # Replace every run of characters other than letters, '!' and '?' with a
    # single space. Note that '.' is not in the kept set, so periods are removed:
    # "Hello, world !" -> "Hello world !", "123 Go !" -> " Go !", "i am cold ." -> "i am cold ".
    return s.strip()

In [5]:
print(re.sub(r"[^a-zA-Z!?]+", r" ", "Café"))
print(normalizeString("Café"))
s = "Héllo, wörld! Café"
print([c for c in unicodedata.normalize('NFD', s)])
unicodeToAscii(s)

Caf 
cafe
['H', 'e', '́', 'l', 'l', 'o', ',', ' ', 'w', 'o', '̈', 'r', 'l', 'd', '!', ' ', 'C', 'a', 'f', 'e', '́']


'Hello, world! Cafe'

To read the data file we will split the file into lines, and then split
lines into pairs. The files are all English → Other Language, so to
translate from Other Language → English I added the `reverse`
flag to reverse the pairs.

In [6]:
def readLangs(lang1, lang2, reverse=False):
    # Read a parallel-corpus file for two languages, split each line into a
    # normalized sentence pair, optionally reverse the pairs, and create the
    # Lang instances.
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')
    # The path is built from lang1 and lang2; utf-8 decoding handles the
    # accented characters, strip() drops leading/trailing whitespace, and
    # split('\n') yields one string per line.

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    # Each line holds two tab-separated sentences, for example:
    # lines = ["I am a student.\tJe suis un étudiant.",
    #          "How are you?\tComment ça va?"]
    # pairs = [['i am a student', 'je suis un etudiant'],
    #          ['how are you ?', 'comment ca va ?']]
    # (normalizeString removes periods but keeps '!' and '?')

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)
    # With lang1 = 'eng', lang2 = 'fra':
    #   reverse=True:  pairs become [french, english], e.g.
    #                  ['je suis un etudiant', 'i am a student'];
    #                  input_lang = Lang('fra'), output_lang = Lang('eng')
    #   reverse=False: pairs stay [english, french];
    #                  input_lang = Lang('eng'), output_lang = Lang('fra')

    return input_lang, output_lang, pairs

Since there are a *lot* of example sentences and we want to train
something quickly, we'll trim the data set to only relatively short and
simple sentences. Here the maximum length is 10 words (that includes
ending punctuation) and we're filtering to sentences that translate to
the form "I am" or "He is" etc. (accounting for apostrophes replaced
earlier).

In [7]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)
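    # p[1] is the second sentence of the pair (the English one here, because
    # prepareData is called with 'eng', 'fra' and reverse=True).
    # len(p[i].split(' ')) counts the words in a sentence.
    # p[1].startswith(eng_prefixes) checks whether p[1] begins with any of the
    # prefixes in the eng_prefixes tuple.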
    

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

In [8]:
pair1 = ["je suis un etudiant", "i am a student"]
pair2 = ["c est une longue phrase qui dépasse la longueur maximale", "this is a long sentence that exceeds the maximum length"]
pair3 = ["il est ici", "he is here"]
pair4 = ["nous sommes heureux", "we are happy"]
pair5 = ["i am happy", "i am"]

print(filterPair(pair1))
print(filterPair(pair2))
print(filterPair(pair3))
print(filterPair(pair4))
print(filterPair(pair5))

True
False
True
True
False


The full process for preparing the data is:

-   Read text file and split into lines, split lines into pairs
-   Normalize text, filter by length and content
-   Make word lists from sentences in pairs

In [9]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))

Reading lines...
Read 135842 sentence pairs
Trimmed to 11445 sentence pairs
Counting words...
Counted words:
fra 4601
eng 2991
['elles ne sont pas seules', 'they aren t alone']


The Seq2Seq Model
=================

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.
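
The recurrence can be sketched with `nn.GRUCell`, which processes one time step at a time. The sizes and the random input are arbitrary placeholders; the point is only that the state returned at one step is passed back in at the next:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.GRUCell(input_size=4, hidden_size=8)  # sizes are arbitrary
sequence = torch.randn(5, 1, 4)                 # 5 time steps, batch of 1
hidden = torch.zeros(1, 8)                      # initial hidden state

for x_t in sequence:
    # The state produced at one step is fed back in at the next step.
    hidden = cell(x_t, hidden)

print(hidden.shape)
```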

A [Sequence to Sequence network](https://arxiv.org/abs/1409.3215), or
seq2seq network, or [Encoder Decoder
network](https://arxiv.org/pdf/1406.1078v3.pdf), is a model consisting
of two RNNs called the encoder and decoder. The encoder reads an input
sequence and outputs a single vector, and the decoder reads that vector
to produce an output sequence.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/seq2seq.png)

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.

Consider the sentence `Je ne suis pas le chat noir` →
`I am not the black cat`. Most of the words in the input sentence have a
direct translation in the output sentence, but are in slightly different
orders, e.g. `chat noir` and `black cat`. Because of the `ne/pas`
construction there is also one more word in the input sentence. It would
be difficult to produce a correct translation directly from the sequence
of input words.

With a seq2seq model the encoder creates a single vector which, in the
ideal case, encodes the "meaning" of the input sequence: a single point
in some N-dimensional space of sentences.

The Encoder
===========

The encoder of a seq2seq network is an RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/encoder-network.png)
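
The following sketch, with a toy vocabulary of 20 words and a hidden size of 8 (both arbitrary), shows the two things a GRU returns: one output vector per input word and the final hidden state. For a single-layer, unidirectional GRU such as this one, the last output vector and the final hidden state coincide:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(20, 8)          # toy vocabulary of 20 words
gru = nn.GRU(8, 8, batch_first=True)     # single layer, unidirectional

tokens = torch.tensor([[5, 2, 9, 1]])    # one sentence of 4 word indices
output, hidden = gru(embedding(tokens))

print(output.shape)  # one vector per input word
print(hidden.shape)  # final hidden state only
same = torch.allclose(output[:, -1], hidden[0])
print(same)
```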


In [10]:
# hidden_size = 128
# batch_size = 32
# input_size = input_lang.n_words = 4601
# output_size = output_lang.n_words = 2991
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        # nn.Embedding(input_size, hidden_size) maps each integer word index
        # (vocabulary size input_size) to a learned vector of size hidden_size.

        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        # input: (batch_size, seq_len) = (32, 10), each entry a word index
        embedded = self.dropout(self.embedding(input))
        # The embedding looks up row i of its (input_size, hidden_size) weight
        # matrix for every index i in input.
        # embedded: (batch_size, seq_len, hidden_size) = (32, 10, 128)
        output, hidden = self.gru(embedded)
        # output: (batch_size, seq_len, hidden_size) = (32, 10, 128)
        # hidden: (num_layers, batch_size, hidden_size) = (1, 32, 128)
        return output, hidden

In [11]:
print(input_lang.n_words)
print(output_lang.n_words)

4601
2991


In [12]:

# An embedding table holding 10 vectors of size 3
embedding = nn.Embedding(10, 3)
print(embedding.weight)
# A batch of 2 samples with 4 indices each
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
print(embedding(input))

embedding = nn.Embedding(10, 3, padding_idx=0)
# padding_idx=0 initializes the vector at index 0 to zeros and excludes it from gradient updates.
# That suits the common convention of padding input sequences with index 0.
input = torch.LongTensor([[0, 2, 0, 5]])
print(embedding.weight)
embedding(input)

Parameter containing:
tensor([[ 1.0809, -0.2764, -0.6606],
        [-0.9478, -0.9776,  1.4272],
        [-0.3416,  0.4340,  1.1769],
        [-1.6055,  0.8642, -0.8805],
        [ 1.4900, -1.0527, -0.2781],
        [-0.7973,  0.4130, -0.0401],
        [ 2.2093,  1.1120,  0.7554],
        [ 0.0854, -0.9074, -0.6337],
        [ 0.0219, -1.4945,  0.6451],
        [ 0.0122,  0.9897, -0.1523]], requires_grad=True)
tensor([[[-0.9478, -0.9776,  1.4272],
         [-0.3416,  0.4340,  1.1769],
         [ 1.4900, -1.0527, -0.2781],
         [-0.7973,  0.4130, -0.0401]],

        [[ 1.4900, -1.0527, -0.2781],
         [-1.6055,  0.8642, -0.8805],
         [-0.3416,  0.4340,  1.1769],
         [ 0.0122,  0.9897, -0.1523]]], grad_fn=<EmbeddingBackward0>)
Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000],
        [ 1.8191, -1.3784, -0.3600],
        [-0.5796, -0.8333,  0.0368],
        [-0.4686,  0.7916,  0.9450],
        [ 0.2114,  1.1717,  1.6877],
        [ 1.6693,  0.4422, -1.9951],
     

tensor([[[ 0.0000,  0.0000,  0.0000],
         [-0.5796, -0.8333,  0.0368],
         [ 0.0000,  0.0000,  0.0000],
         [ 1.6693,  0.4422, -1.9951]]], grad_fn=<EmbeddingBackward0>)
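
An embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix, which ties it back to the one-hot word encoding described earlier. A small check under arbitrary sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding = nn.Embedding(10, 3)
indices = torch.tensor([1, 4, 9])

looked_up = embedding(indices)                         # (3, 3), direct row lookup
one_hot = F.one_hot(indices, num_classes=10).float()   # (3, 10)
multiplied = one_hot @ embedding.weight                # (3, 3), same rows via matmul
equal = torch.allclose(looked_up, multiplied)
print(equal)
```

The lookup is simply the cheaper way to select those rows, since it never materializes the large one-hot vectors.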

The Decoder
===========

The decoder is another RNN that takes the encoder output vector(s) and
outputs a sequence of words to create the translation.

Simple Decoder
==============

In the simplest seq2seq decoder we use only the last output of the encoder.
This last output is sometimes called the *context vector* as it encodes
context from the entire sequence. This context vector is used as the
initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and
hidden state. The initial input token is the start-of-string `<SOS>`
token, and the first hidden state is the context vector (the encoder's
last hidden state).

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/decoder-network.png)
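
A minimal sketch of the two ends of a decoding step, using made-up sizes and random scores in place of a trained decoder: the batch starts from the `<SOS>` index, and the highest-scoring word of a step can serve as the next input token.

```python
import torch

SOS_token = 0
batch_size, vocab_size = 3, 7  # made-up sizes

# Every sequence in the batch starts decoding from the SOS token.
decoder_input = torch.full((batch_size, 1), SOS_token, dtype=torch.long)

# Stand-in for the scores a decoder step would produce over the vocabulary.
torch.manual_seed(0)
scores = torch.randn(batch_size, 1, vocab_size)

# Greedy choice: the highest-scoring word becomes the next input token.
_, topi = scores.topk(1)                  # (batch_size, 1, 1)
next_input = topi.squeeze(-1).detach()    # (batch_size, 1)
print(decoder_input.shape, next_input.shape)
```

During training, the ground-truth target token can be fed in instead of the greedy choice, which is what teacher forcing refers to.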


In [13]:
# hidden_size = 128
# batch_size = 32
# input_size = input_lang.n_words = 4601
# output_size = output_lang.n_words = 2991
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        # encoder_outputs: (batch_size, seq_len, hidden_size) = (32, 10, 128)
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        # decoder_input: (batch_size, 1) = (32, 1), every entry set to SOS_token (0)
        decoder_hidden = encoder_hidden
        # decoder_hidden: (1, batch_size, hidden_size) = (1, 32, 128)
        decoder_outputs = []

        for i in range(MAX_LENGTH):
            # seq_len = MAX_LENGTH = 10
            decoder_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            # decoder_output: (batch_size, 1, output_size) = (32, 1, 2991)
            # decoder_hidden: (num_layers, batch_size, hidden_size) = (1, 32, 128)
            decoder_outputs.append(decoder_output)
            # After the loop, decoder_outputs is a list of 10 tensors,
            # each of shape (32, 1, 2991).

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
                # target_tensor: (batch_size, seq_len) = (32, 10)
                # target_tensor[:, i]: (batch_size,) = (32,)
                # unsqueeze(1) inserts a new dimension at position 1: (32,) -> (32, 1)
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                # topk(1) returns the largest score along the last dimension and its
                # index (the predicted word index); both have shape (32, 1, 1).
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input
                # squeeze(-1) drops the last dimension: (32, 1, 1) -> (32, 1)
                # detach() returns a tensor cut off from the computation graph, so the
                # chosen token is treated as a constant input at the next step.

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        # Concatenate along dim 1: 10 x (32, 1, 2991) -> (batch_size, MAX_LENGTH, output_size) = (32, 10, 2991)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        # decoder_outputs: (batch_size, seq_len, output_size) = (32, 10, 2991), now log-probabilities
        return decoder_outputs, decoder_hidden, None # We return `None` for consistency in the training loop
        # decoder_hidden: (num_layers, batch_size, hidden_size) = (1, 32, 128)

    def forward_step(self, input, hidden):
        # input: (batch_size, 1) = (32, 1)
        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        output = self.embedding(input)
        # output: (batch_size, 1, hidden_size)= (32, 1, 128)
        output = F.relu(output)
        # output: (batch_size, 1, hidden_size)= (32, 1, 128)
        output, hidden = self.gru(output, hidden)
        # output: (batch_size, 1, hidden_size)= (32, 1, 128)
        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        output = self.out(output)
        # output: (batch_size, 1, output_size)= (32, 1, 2991)
        return output, hidden

I encourage you to train and observe the results of this model, but to
save space we'll be going straight for the gold and introducing the
Attention Mechanism.

Attention Decoder
=================

If only the context vector is passed between the encoder and decoder,
that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to "focus" on a different part of
the encoder's outputs for every step of the decoder's own outputs.
First we calculate a set of *attention weights*. These will be
multiplied by the encoder output vectors to create a weighted
combination. The result (called `context` in the code) should
contain information about that specific part of the input sequence, and
thus help the decoder choose the right output words.

![](https://i.imgur.com/1152PYf.png)
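
A small worked example of the weighted combination, with hand-picked weights rather than learned ones (all numbers are illustrative):

```python
import torch

# Three encoder output vectors for one sentence: (batch=1, seq_len=3, hidden=2).
encoder_outputs = torch.tensor([[[1.0, 0.0],
                                 [0.0, 1.0],
                                 [4.0, 4.0]]])

# Hand-picked attention weights that sum to 1: (batch=1, 1, seq_len=3).
weights = torch.tensor([[[0.5, 0.5, 0.0]]])

# Batched matrix product = weighted sum of the encoder outputs.
context = torch.bmm(weights, encoder_outputs)  # (1, 1, 2)
print(context)  # tensor([[[0.5000, 0.5000]]])
```

The third encoder output receives zero weight, so it contributes nothing to the result, no matter how large its values are.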

Calculating the attention weights is done with a small feed-forward
network (the `BahdanauAttention` module in the code below), which takes a
query from the decoder's state together with the encoder outputs as inputs.
Because there are sentences of all sizes in the training data, to
actually create and train this layer we have to choose a maximum
sentence length (input length, for encoder outputs) that it can apply
to. Sentences of the maximum length will use all the attention weights,
while shorter sentences will only use the first few.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/attention-decoder-network.png)

Bahdanau attention, also known as additive attention, is a commonly used
attention mechanism in sequence-to-sequence models, particularly in
neural machine translation tasks. It was introduced by Bahdanau et al.
in their paper titled [Neural Machine Translation by Jointly Learning to
Align and Translate](https://arxiv.org/pdf/1409.0473.pdf). This
attention mechanism employs a learned alignment model to compute
attention scores between the encoder and decoder hidden states. It
utilizes a feed-forward neural network to calculate alignment scores.
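
In the notation of the Bahdanau et al. paper, with $s_{t-1}$ the previous decoder hidden state and $h_j$ the $j$-th encoder output, additive attention can be written as follows; $W_a$, $U_a$ and $v_a$ are the learned parameters of the feed-forward alignment model:

$$
e_{tj} = v_a^\top \tanh\left(W_a s_{t-1} + U_a h_j\right), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k} \exp(e_{tk})}, \qquad
c_t = \sum_{j} \alpha_{tj} h_j
$$

Here $e_{tj}$ is the alignment score, $\alpha_{tj}$ the attention weight obtained by a softmax over the input positions, and $c_t$ the resulting context vector.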

However, there are alternative attention mechanisms available, such as
Luong attention, which computes attention scores by taking the dot
product between the decoder hidden state and the encoder hidden states.
It does not involve the non-linear transformation used in Bahdanau
attention.

In this tutorial, we will be using Bahdanau attention. However, it would
be a valuable exercise to explore modifying the attention mechanism to
use Luong attention.
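
As a starting point for that exercise, here is a hedged sketch of the simplest (dot-product) Luong-style scoring. The class name `DotProductAttention` is hypothetical; it takes the same `(query, keys)` inputs as a Bahdanau-style module and has no learned parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductAttention(nn.Module):
    """Hypothetical sketch of dot-product attention; not part of the tutorial code."""
    def forward(self, query, keys):
        # query: (batch, 1, hidden), keys: (batch, seq_len, hidden)
        scores = torch.bmm(query, keys.transpose(1, 2))  # (batch, 1, seq_len)
        weights = F.softmax(scores, dim=-1)              # normalize over seq_len
        context = torch.bmm(weights, keys)               # (batch, 1, hidden)
        return context, weights

torch.manual_seed(0)
attention = DotProductAttention()
context, weights = attention(torch.randn(2, 1, 8), torch.randn(2, 5, 8))
print(context.shape, weights.shape)
```

Luong et al. also describe scoring variants with learned weights; only the parameter-free dot product is sketched here.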

In [14]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        # Each nn.Linear holds a weight of shape (out_features, in_features)
        # and a bias of shape (out_features,).
        self.Wa = nn.Linear(hidden_size, hidden_size)
        # self.Wa.weight: (hidden_size, hidden_size) = (128, 128)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        # self.Ua.weight: (hidden_size, hidden_size) = (128, 128)
        self.Va = nn.Linear(hidden_size, 1)
        # self.Va.weight: (1, hidden_size) = (1, 128)

    def forward(self, query, keys):
        # query: (batch_size, 1, hidden_size) = (32, 1, 128)
        # keys: (batch_size, seq_len, hidden_size) = (32, 10, 128)
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        # self.Wa(query): (32, 1, 128); self.Ua(keys): (32, 10, 128)
        # Their sum broadcasts the query over seq_len: (32, 10, 128)
        # tanh keeps that shape; self.Va maps hidden_size -> 1, so
        # scores: (batch_size, seq_len, 1) = (32, 10, 1)

        scores = scores.squeeze(2).unsqueeze(1)
        # (32, 10, 1) -> squeeze(2) -> (32, 10) -> unsqueeze(1) -> (32, 1, 10)

        weights = F.softmax(scores, dim=-1)
        # weights: (batch_size, 1, seq_len) = (32, 1, 10), summing to 1 over seq_len

        context = torch.bmm(weights, keys)
        # bmm: (batch_size, n, m) x (batch_size, m, p) -> (batch_size, n, p)
        # context: (32, 1, 10) x (32, 10, 128) -> (batch_size, 1, hidden_size) = (32, 1, 128)

        return context, weights

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        # hidden_size = 128
        # output_size = 2991
        super(AttnDecoderRNN, self).__init__()
        # Call the initializer of the parent class nn.Module
        self.embedding = nn.Embedding(output_size, hidden_size)
        # nn.Embedding(output_size, hidden_size) creates a lookup table with output_size rows (one per target-vocabulary word), each an embedding vector of size hidden_size
        self.attention = BahdanauAttention(hidden_size)
        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        # The GRU input size is 2 * hidden_size because input_gru concatenates the embedded token with the context vector
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        # encoder_outputs: (batch_size, seq_len, hidden_size)= (32, 10, 128)
        # encoder_hidden: (1, batch_size, hidden_size)= (1, 32, 128)
        # target_tensor: (batch_size, seq_len)= (32, 10)
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        # decoder_input: (batch_size, 1) = (32, 1)
        decoder_hidden = encoder_hidden
        # decoder_hidden: (1, batch_size, hidden_size)= (1, 32, 128)
        decoder_outputs = []
        # decoder_outputs: [(batch_size, 1, output_size), ...]= [(32, 1, 2991), ...]
        attentions = []
        # attentions: [(batch_size, 1, seq_len), ...]= [(32, 1, 10), ...]

        for i in range(MAX_LENGTH):
            # seq_len = MAX_LENGTH = 10
            # decoder_input: (batch_size, 1) = (32, 1)
            # decoder_hidden: (1, batch_size, hidden_size) = (1, 32, 128)
            # encoder_outputs: (batch_size, seq_len, hidden_size) = (32, 10, 128)
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # decoder_output: (batch_size, 1, output_size)= (32, 1, 2991)
            # decoder_hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
            # attn_weights: (batch_size, 1, seq_len)= (32, 1, 10)
            decoder_outputs.append(decoder_output)
            # decoder_outputs: [(batch_size, 1, output_size), ...]= [(32, 1, 2991), ...]
            attentions.append(attn_weights)
            # attentions: [(batch_size, 1, seq_len), ...]= [(32, 1, 10), ...]

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
                # target_tensor: (batch_size, seq_len)= (32, 10)
                # target_tensor[:, i]: (batch_size,)= (32,)
                # target_tensor[:, i].unsqueeze(1): (batch_size, 1) = (32, 1)
                # decoder_input: (batch_size, 1)= (32, 1)
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                # decoder_output.topk(1) returns two tensors: the largest score (logit) along the last dimension and its index, i.e. the predicted word index
                # _: (batch_size, 1, 1) = (32, 1, 1)
                # topi: (batch_size, 1, 1) = (32, 1, 1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input
                # squeeze(-1) drops the last dimension, so topi goes from (32, 1, 1) to (32, 1)
                # topi.squeeze(-1): (batch_size, 1) = (32, 1)
                # detach() returns a tensor cut off from the current computation graph, so the fed-back prediction is treated as a plain input and carries no gradient history into later steps
                # decoder_input: (batch_size, 1) = (32, 1)
                # decoder_input: (batch_size, 1) = (32, 1)

        # seq_len = MAX_LENGTH = 10
        # decoder_outputs: seq_len x (batch_size, 1, output_size)
        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        # torch.cat(decoder_outputs, dim=1): concatenate along the second dimension (dim=1)
        # decoder_outputs: (batch_size, seq_len, output_size)= (32, 10, 2991)

        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        # decoder_outputs: (batch_size, seq_len, output_size)= (32, 10, 2991)
        # F.log_softmax(decoder_outputs, dim=-1): log-softmax over the last (vocabulary) dimension
        
        # attentions: seq_len x (batch_size, 1, seq_len)
        attentions = torch.cat(attentions, dim=1)
        # attentions: (batch_size, seq_len, seq_len)= (32, 10, 10)

        return decoder_outputs, decoder_hidden, attentions
        # decoder_outputs: (batch_size, seq_len, output_size)= (32, 10, 2991)
        # decoder_hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        # attentions: (batch_size, seq_len, seq_len)= (32, 10, 10)


    def forward_step(self, input, hidden, encoder_outputs):
        # input: (batch_size, 1) = (32, 1)
        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        # encoder_outputs: (batch_size, seq_len, hidden_size)= (32, 10, 128)
        
        embedded =  self.dropout(self.embedding(input))
        # Each element of input is an integer word index into the target vocabulary; embedding(input) looks up the corresponding embedding vector
        # For an index i, the lookup returns row i of the (output_size, hidden_size) embedding matrix, a vector of size hidden_size
        # self.embedding(input): (batch_size, 1, hidden_size)= (32, 1, 128)
        
        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        query = hidden.permute(1, 0, 2)
        # query = hidden.permute(1, 0, 2): (batch_size, num_layers, hidden_size)= (32, 1, 128)
        
        # encoder_outputs: (batch_size, seq_len, hidden_size) = (32, 10, 128)
        context, attn_weights = self.attention(query, encoder_outputs)
        # context: (batch_size, 1, hidden_size)= (32, 1, 128)
        # attn_weights: (batch_size, 1, seq_len)= (32, 1, 10)
        
        # embedded: (batch_size, 1, hidden_size)= (32, 1, 128)
        input_gru = torch.cat((embedded, context), dim=2)
        # torch.cat((embedded, context), dim=2): concatenate along the third dimension (dim=2)
        # input_gru: (batch_size, 1, 2 * hidden_size)= (32, 1, 256)

        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        output, hidden = self.gru(input_gru, hidden)
        # output: (batch_size, 1, hidden_size)= (32, 1, 128)
        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        
        output = self.out(output)
        # output: (batch_size, 1, output_size)= (32, 1, 2991)

        return output, hidden, attn_weights
        # output: (batch_size, 1, output_size)= (32, 1, 2991)
        # hidden: (num_layers, batch_size, hidden_size)= (1, 32, 128)
        # attn_weights: (batch_size, 1, seq_len)= (32, 1, 10)
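
For reference, the shape bookkeeping in `BahdanauAttention.forward` corresponds to the usual additive attention equations. The notation below is an assumption chosen to mirror the code: $s_{t-1}$ stands for the decoder hidden state passed in as `query`, $h_i$ for the $i$-th encoder output in `keys`, and the bias terms of the three linear layers are omitted for brevity.

```latex
\begin{aligned}
e_{t,i}      &= V_a \tanh\left(W_a s_{t-1} + U_a h_i\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})} \\
c_t          &= \sum_{i} \alpha_{t,i}\, h_i
\end{aligned}
```

Here $e_{t,i}$ corresponds to `scores`, $\alpha_{t,i}$ to `weights`, and $c_t$ to `context`.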

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<p>There are other forms of attention that work around the length limitation by using a relative position approach. Read about "local attention" in <a href="https://arxiv.org/abs/1508.04025">Effective Approaches to Attention-based Neural Machine Translation</a>.</p>
</div>

Training
========

Preparing Training Data
-----------------------

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will append the
EOS token to both sequences.
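
As a rough illustration of what one row of the training arrays looks like, the sketch below encodes a sentence, appends the EOS index, and zero-pads to a fixed width, mirroring what `get_dataloader` does with `np.zeros`. The toy vocabulary and the helper name `encode_padded` are hypothetical, and `EOS_token = 1` and `MAX_LENGTH = 10` are assumed to match the values defined earlier in the notebook.

```python
EOS_token = 1      # assumed to match the notebook's EOS index
MAX_LENGTH = 10    # assumed fixed row width, as in get_dataloader

# Hypothetical toy vocabulary; indices 0 and 1 are reserved for SOS and EOS.
word2index = {"je": 2, "suis": 3, "fatigue": 4}

def encode_padded(sentence, word2index, max_length=MAX_LENGTH):
    """Map words to indices, append EOS, then right-pad with zeros."""
    ids = [word2index[word] for word in sentence.split(' ')]
    ids.append(EOS_token)
    return ids + [0] * (max_length - len(ids))

row = encode_padded("je suis fatigue", word2index)
print(row)  # [2, 3, 4, 1, 0, 0, 0, 0, 0, 0]
```

Every row has the same length, which is what allows the sentences to be stacked into a single batch tensor.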

In [15]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

def get_dataloader(batch_size):
    input_lang, output_lang, pairs = prepareData('eng', 'fra', True)

    n = len(pairs)
    input_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)
    target_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)

    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromSentence(input_lang, inp)
        tgt_ids = indexesFromSentence(output_lang, tgt)
        inp_ids.append(EOS_token)
        tgt_ids.append(EOS_token)
        input_ids[idx, :len(inp_ids)] = inp_ids
        target_ids[idx, :len(tgt_ids)] = tgt_ids

    train_data = TensorDataset(torch.LongTensor(input_ids).to(device),
                               torch.LongTensor(target_ids).to(device))

    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
    return input_lang, output_lang, train_dataloader

Training the Model
==================

To train we run the input sentence through the encoder, and keep track
of every output and the latest hidden state. Then the decoder is given
the `<SOS>` token as its first input, and the last hidden state of the
encoder as its first hidden state.

\"Teacher forcing\" is the concept of using the real target outputs as
each next input, instead of using the decoder\'s guess as the next
input. Using teacher forcing causes it to converge faster but [when the
trained network is exploited, it may exhibit
instability](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf).

You can observe outputs of teacher-forced networks that read with
coherent grammar but wander far from the correct translation.
Intuitively, the network has learned to represent the output grammar and can
\"pick up\" the meaning once the teacher tells it the first few words,
but it has not properly learned how to create the sentence from the
translation in the first place.

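Because of the freedom PyTorch\'s autograd gives us, we can randomly
choose to use teacher forcing or not with a simple if statement. Note
that the decoder in this notebook takes the simpler route: it applies
teacher forcing at every step whenever `target_tensor` is passed, so
there is no `teacher_forcing_ratio` parameter to tune here.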
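
A minimal sketch of the random choice, written in plain Python so that it runs without a model. The helper `next_decoder_input` and the token ids are hypothetical; in the decoder's loop the same test would decide between `target_tensor[:, i]` and the `topk` prediction.

```python
import random

def next_decoder_input(target_id, predicted_id, teacher_forcing_ratio, rng=random):
    """Pick the next decoder input for one step.

    With probability teacher_forcing_ratio the ground-truth token is fed;
    otherwise the decoder's own prediction is fed back.
    """
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    return target_id if use_teacher_forcing else predicted_id

rng = random.Random(0)
targets = [12, 7, 45, 1]       # made-up ground-truth token ids
predictions = [12, 9, 45, 1]   # made-up decoder guesses

always = [next_decoder_input(t, p, 1.0, rng) for t, p in zip(targets, predictions)]
never = [next_decoder_input(t, p, 0.0, rng) for t, p in zip(targets, predictions)]
print(always)  # [12, 7, 45, 1]
print(never)   # [12, 9, 45, 1]
```

The sketch draws a fresh random number at every step. Drawing once per sentence or once per batch is an equally valid design and keeps each sequence internally consistent.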

In [16]:
def train_epoch(dataloader, encoder, decoder, encoder_optimizer,
          decoder_optimizer, criterion):

    total_loss = 0
    for data in dataloader:
        input_tensor, target_tensor = data
        # print(input_tensor.size())
        # print(target_tensor.size())

        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        encoder_outputs, encoder_hidden = encoder(input_tensor)
        # print(encoder_outputs.size())
        # print(encoder_hidden.size())
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )
        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)
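
The `criterion` call above flattens the decoder output to one row of log-probabilities per target token before comparing it with the flattened targets. The pure-Python sketch below shows the arithmetic that a mean-reduced negative log-likelihood performs on such rows. The helper names and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs_per_step, targets):
    """Mean of the negative log-probability assigned to each target index."""
    picked = [-row[t] for row, t in zip(log_probs_per_step, targets)]
    return sum(picked) / len(picked)

# Two decoding steps over a 4-word vocabulary with uniform scores.
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [2, 3]
loss = nll_loss([log_softmax(row) for row in logits], targets)
print(round(loss, 4))  # 1.3863, i.e. log(4) for a uniform guess over 4 words
```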

This is a helper function to print time elapsed and estimated time
remaining given the current time and progress %.
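
To make the estimate concrete, here is a small standalone sketch of the same arithmetic with the clock reading passed in explicitly, so the result is reproducible. The names `as_minutes` and `progress_line` are hypothetical stand-ins for the helpers in the next cell.

```python
import math

def as_minutes(seconds):
    """Format a duration in seconds as 'Xm Ys'."""
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)

def progress_line(elapsed, fraction_done):
    """Extrapolate total time linearly from the fraction completed."""
    estimated_total = elapsed / fraction_done
    remaining = estimated_total - elapsed
    return '%s (- %s)' % (as_minutes(elapsed), as_minutes(remaining))

# 28 seconds elapsed after 5 of 80 epochs: 28 / (5 / 80) = 448 s in total.
print(progress_line(28, 5 / 80))  # 0m 28s (- 7m 0s)
```

The estimate assumes every epoch takes about the same time, so it drifts whenever the speed changes mid-run.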

In [17]:
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

The whole training process looks like this:

-   Start a timer
-   Initialize optimizers and criterion
-   Create set of training pairs
-   Start empty losses array for plotting

Then we call `train_epoch` once per epoch and occasionally print the
progress (% of epochs completed, time so far, estimated time remaining)
and the average loss.

In [18]:
def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
               print_every=100, plot_every=100):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    for epoch in range(1, n_epochs + 1):
        loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
                                        epoch, epoch / n_epochs * 100, print_loss_avg))

        if epoch % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)
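
The print and plot bookkeeping in `train` amounts to averaging the per-epoch losses over fixed-size windows. A minimal standalone sketch, with a hypothetical helper name and made-up loss values:

```python
def windowed_averages(epoch_losses, every):
    """Average losses over consecutive windows of `every` epochs."""
    averages = []
    running_total = 0.0
    for epoch, loss in enumerate(epoch_losses, start=1):
        running_total += loss
        if epoch % every == 0:
            averages.append(running_total / every)
            running_total = 0.0  # reset, as train() does after each report
    return averages

print(windowed_averages([4.0, 2.0, 1.0, 1.0, 0.5], every=2))  # [3.0, 1.0]
```

As in `train`, a trailing partial window (the final `0.5` here) is never reported, so choosing `n_epochs` as a multiple of `print_every` and `plot_every` avoids losing the last few epochs from the log.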

Plotting results
================

Plotting is done with matplotlib, using the array of loss values
`plot_losses` saved while training.

In [19]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np

def showPlot(points):
    # plt.subplots() already creates the figure, so no separate plt.figure() call is needed
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)

Evaluation
==========

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder\'s predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there. We also store the decoder\'s
attention outputs for display later.
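
The stopping rule can be sketched without a model: walk through the predicted indices and cut the sentence at the first EOS. The vocabulary, the helper name `ids_to_words`, and `EOS_token = 1` are assumptions made for illustration.

```python
EOS_token = 1  # assumed to match the notebook's EOS index

# Hypothetical toy output vocabulary.
index2word = {0: "SOS", 1: "EOS", 2: "i", 3: "m", 4: "tired"}

def ids_to_words(decoded_ids, index2word, eos_token=EOS_token):
    """Convert predicted indices to words, stopping at the first EOS."""
    words = []
    for idx in decoded_ids:
        if idx == eos_token:
            words.append('<EOS>')
            break
        words.append(index2word[idx])
    return words

# The decoder always runs for a fixed number of steps; anything after EOS is ignored.
print(ids_to_words([2, 3, 4, 1, 0, 0], index2word))  # ['i', 'm', 'tired', '<EOS>']
```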

In [20]:
def evaluate(encoder, decoder, sentence, input_lang, output_lang):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)

        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden)

        _, topi = decoder_outputs.topk(1)
        decoded_ids = topi.squeeze()

        decoded_words = []
        for idx in decoded_ids:
            if idx.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            decoded_words.append(output_lang.index2word[idx.item()])
    return decoded_words, decoder_attn

We can evaluate random sentences from the training set and print out the
input, target, and output to make some subjective quality judgements:


In [21]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, _ = evaluate(encoder, decoder, pair[0], input_lang, output_lang)
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

Training and Evaluating
=======================

With all these helper functions in place (it looks like extra work, but
it makes it easier to run multiple experiments) we can actually
initialize a network and start training.

Remember that the input sentences were heavily filtered. For this small
dataset we can use a relatively small network with 128 hidden units
(`hidden_size = 128` below) and a single GRU layer. After about 40
minutes on a MacBook CPU we\'ll get some reasonable results.

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<p>If you run this notebook you can train, interrupt the kernel, evaluate, and continue training later. Comment out the lines where the encoder and decoder are initialized and run <code>train</code> again.</p>
</div>

In [22]:
hidden_size = 128
batch_size = 32

input_lang, output_lang, train_dataloader = get_dataloader(batch_size)
# input_lang.n_words= 4601
# output_lang.n_words= 2991
print(output_lang.n_words)
encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words).to(device)

train(train_dataloader, encoder, decoder, 80, print_every=5, plot_every=5)

Reading lines...


Read 135842 sentence pairs
Trimmed to 11445 sentence pairs
Counting words...
Counted words:
fra 4601
eng 2991
2991
0m 28s (- 7m 14s) (5 6%) 1.5316
0m 56s (- 6m 37s) (10 12%) 0.7015
1m 24s (- 6m 6s) (15 18%) 0.3684
1m 52s (- 5m 37s) (20 25%) 0.2049
2m 20s (- 5m 10s) (25 31%) 0.1257
2m 51s (- 4m 45s) (30 37%) 0.0866
3m 19s (- 4m 16s) (35 43%) 0.0658
3m 48s (- 3m 48s) (40 50%) 0.0539
4m 17s (- 3m 20s) (45 56%) 0.0459
5m 16s (- 3m 9s) (50 62%) 0.0409
6m 15s (- 2m 50s) (55 68%) 0.0378
7m 15s (- 2m 25s) (60 75%) 0.0348
8m 14s (- 1m 54s) (65 81%) 0.0331
9m 14s (- 1m 19s) (70 87%) 0.0312
10m 13s (- 0m 40s) (75 93%) 0.0303
11m 11s (- 0m 0s) (80 100%) 0.0289


Set dropout layers to `eval` mode

In [23]:
encoder.eval()
decoder.eval()
evaluateRandomly(encoder, decoder)

> je ne vais pas prendre le moindre risque
= i m not taking any chances
< i m not taking any chances <EOS>

> j essaie d economiser de l argent
= i m trying to save money
< i m trying to save money <EOS>

> nous ne sommes que des enfants
= we re just children
< we are sorry we children <EOS>

> tu es tres attirant
= you re very attractive
< you re very attractive too much friend <EOS>

> je suis aussi perplexe que tu l es
= i m just as confused as you are
< i m just as confused as you are <EOS>

> nous sommes toujours prudents
= we re always careful
< we re always careful <EOS>

> tu es merveilleuse
= you re wonderful
< you re wonderful <EOS>

> je porte mon maillot de bain sous mes vetements
= i m wearing my swimsuit under my clothes
< i m wearing my swimsuit under my clothes <EOS>

> tu es une petite menteuse
= you re a little liar
< you re a little liar <EOS>

> nous sommes en retard
= we re late
< we re late <EOS>



Visualizing Attention
=====================

A useful property of the attention mechanism is its highly interpretable
outputs. Because the attention weights are used to weight specific
encoder outputs of the input sequence, we can see where the network
focuses most at each time step.

You could simply run `plt.matshow(attentions)` to see attention output
displayed as a matrix. For a better viewing experience we will do the
extra work of adding axes and labels:

In [24]:
def showAttention(input_sentence, output_words, attentions):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.cpu().numpy(), cmap='bone')
    fig.colorbar(cax)

    # Fix the tick positions first, then label them; setting labels
    # without fixed positions triggers a matplotlib UserWarning
    x_labels = input_sentence.split(' ') + ['<EOS>']
    ax.set_xticks(range(len(x_labels)))
    ax.set_xticklabels(x_labels, rotation=90)
    ax.set_yticks(range(len(output_words)))
    ax.set_yticklabels(output_words)

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(encoder, decoder, input_sentence, input_lang, output_lang)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions[0, :len(output_words), :])


evaluateAndShowAttention('il n est pas aussi grand que son pere')

evaluateAndShowAttention('je suis trop fatigue pour conduire')

evaluateAndShowAttention('je suis desole si c est une question idiote')

evaluateAndShowAttention('je suis reellement fiere de vous')

input = il n est pas aussi grand que son pere
output = he is not as tall as his father <EOS>
input = je suis trop fatigue pour conduire
output = i m too tired to drive <EOS>
input = je suis desole si c est une question idiote
output = i m sorry if this is a stupid question <EOS>


input = je suis reellement fiere de vous
output = i m really proud of you yet <EOS>
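
The attention matrix can also be read numerically instead of as a picture: for each output word, find the input position with the largest weight. The sketch below uses made-up weights purely for illustration (they are not outputs of the trained model), and the helper `most_attended` is hypothetical.

```python
def most_attended(input_words, output_words, attention_rows):
    """Pair each output word with the input word that received the most attention."""
    alignment = []
    for out_word, row in zip(output_words, attention_rows):
        best = max(range(len(row)), key=row.__getitem__)  # argmax over input positions
        alignment.append((out_word, input_words[best]))
    return alignment

input_words = ["je", "suis", "fatigue", "<EOS>"]
output_words = ["i", "m", "tired", "<EOS>"]
# One row per output step, one column per input position; each row sums to 1.
attention_rows = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]

alignment = most_attended(input_words, output_words, attention_rows)
print(alignment)
# [('i', 'je'), ('m', 'suis'), ('tired', 'fatigue'), ('<EOS>', '<EOS>')]
```

With the real model, the rows would come from `attentions[0, :len(output_words), :]` converted to a list, for example with `.tolist()`.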


Exercises
=========

-   Try with a different dataset
    -   Another language pair
    -   Human → Machine (e.g. IoT commands)
    -   Chat → Response
    -   Question → Answer
-   Replace the embeddings with pretrained word embeddings such as
    `word2vec` or `GloVe`
-   Try with more layers, more hidden units, and more sentences. Compare
    the training time and results.
-   If you use a translation file where pairs have two of the same
    phrase (`I am test \t I am test`), you can use this as an
    autoencoder. Try this:
    -   Train as an autoencoder
    -   Save only the Encoder network
    -   Train a new Decoder for translation from there
