# 1 - Sequence to Sequence Learning with Neural Networks

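This notebook implements a seq2seq model with PyTorch and TorchText for German-to-English translation. The same architecture also applies to tasks such as information extraction and text summarization.

It is based on the paper at https://arxiv.org/abs/1409.3215.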

## Introduction

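### Encoder stage
A seq2seq model has an encoder-decoder structure. The encoder is an RNN that receives the input text as a sequence of vectors.
$c$ is the context vector, i.e. the encoder's output after reading the whole input. The decoder is also an RNN.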

![](assets/seq2seq1.png)
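The figure above shows the German input "guten morgen" being translated into English.
`<SOS>` marks the start of a sentence and `<EOS>` marks its end (other tokens could be used instead). $x_t$ is the vector of the current word, $h_{t-1}$ is the hidden state output at the previous time step, and $h_t$ is the hidden state output by the RNN at the current step.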

$$h_t = \text{EncoderRNN}(x_t, h_{t-1})$$

The RNN cell can be an LSTM or a GRU; the paper uses an LSTM.
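The input can be written as $X = \{x_1, x_2, ..., x_T\}$, where $x_1$ is `<SOS>` and $x_2$ is the vector of the first word. The encoder's initial hidden state $h_0$ is a vector of zeros.

The $z$ in the figure should really be $c$, with $c = \tanh(V h_T)$, where $h_T$ is the encoder's final hidden state.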

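### Decoder stage
The figure is slightly inaccurate here: $y_t$ is the output produced at prediction time, and the decoder's initial state should be generated from the context as $s_0 = \tanh(V' c)$.
$s_t$ is computed from $y_t$, $s_{t-1}$ and $c$; the formula below omits $c$.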
$$s_t = \text{DecoderRNN}(y_t, s_{t-1})$$

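The resulting $s_t$ is passed through a maxout layer and a softmax to obtain a probability distribution over words, from which the most probable word is predicted.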

$$\hat{y}_t = f(s_t)$$
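The first input to the decoder is normally the `<sos>` token.
During training, the loss is computed from the difference between the decoder's predictions and the ground-truth words. At test time, the decoder keeps generating words until it produces `<eos>`.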
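
The test-time behaviour can be sketched with a toy decoder. The `toy_step` function and its lookup table are hypothetical stand-ins for a trained decoder followed by an argmax; they are not part of this notebook. The point is only the loop, which starts from `<sos>` and stops at `<eos>` or at a length cap.

```python
# A toy "decoder step": maps the previous token to the next token.
# This lookup table is a hypothetical stand-in for DecoderRNN + argmax.
TOY_NEXT_TOKEN = {
    "<sos>": "good",
    "good": "morning",
    "morning": "<eos>",
}

def toy_step(prev_token):
    """Return the most likely next token given the previous one (toy version)."""
    return TOY_NEXT_TOKEN.get(prev_token, "<eos>")

def greedy_decode(step_fn, max_len=10):
    """Generate tokens one at a time until <eos> or max_len is reached."""
    tokens = []
    prev = "<sos>"               # the first decoder input is always <sos>
    for _ in range(max_len):     # length cap in case <eos> is never produced
        nxt = step_fn(prev)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
        prev = nxt               # feed the prediction back in as the next input
    return tokens

translation = greedy_decode(toy_step)
print(translation)
```

The length cap matters in practice: an untrained or poorly trained model may never emit `<eos>`.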

## Preparing Data


In [66]:
import torch
import torch.nn as nn
import torch.optim as optim    # optimizers

from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

# import spacy
import numpy as np

import random
import math
import time

In [67]:
import spacy

In [68]:
# fix the random seeds for reproducibility
SEED = 1234  
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

A tokenizer splits a sentence into a list of tokens,
e.g. "good morning!" becomes ["good", "morning", "!"].
We use spaCy for tokenization; the language models are downloaded with:
```
python -m spacy download en
python -m spacy download de
```

In [69]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

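`spacy_de` and `spacy_en` hold the loaded spaCy models. Passing a sentence to their tokenizer returns a list of tokens.
The source tokens are reversed because the paper reports that reversing the input word order improves results.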

In [70]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens) and reverses it
    The trailing [::-1] reverses the order of the German tokens
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [71]:
tokenize_de("guten morgen")
tokenize_en('how are you !')

['how', 'are', 'you', '!']

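TorchText's `Field` defines how the data is processed. German is the source (`SRC`) and English is the target (`TRG`). All words are lowercased, and `<sos>` and `<eos>` tokens are added to the start and end of each sentence.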

In [72]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
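
As a rough illustration of what these `Field` settings do to one tokenized sentence, here is a plain-Python sketch. The helper `apply_field_settings` is hypothetical and only mimics the visible effect; TorchText itself adds the special tokens later, when batches are built, which is why the raw examples printed further down do not contain them.

```python
def apply_field_settings(tokens, init_token="<sos>", eos_token="<eos>", lower=True):
    """Mimic the effect of lower=True plus init/eos tokens on one tokenized sentence."""
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]

# Source tokens arrive already reversed by tokenize_de.
src_tokens = ["Morgen", "Guten"]
trg_tokens = ["Good", "morning"]

processed_src = apply_field_settings(src_tokens)
processed_trg = apply_field_settings(trg_tokens)
print(processed_src)
print(processed_trg)
```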

Load the training, validation and test sets.
The dataset we'll be using is the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. 

`exts` specifies which languages to use as the source and target (source goes first) and `fields` specifies which field to use for the source and target.

In [73]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

In [74]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


Print one example; the `src` sentence is already reversed.

In [75]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


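Next we build a vocabulary for the source and target languages. It maps each token to an index and each index back to a token; an index is equivalent to a one-hot vector.
With `min_freq = 2`, only tokens that appear at least twice are added to the vocabulary; all other tokens are replaced by `<unk>`.
The vocabulary is built from the training set only, never from the validation or test sets, to avoid leaking information into the model.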

In [76]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
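
To make the effect of `min_freq` concrete, here is a minimal plain-Python sketch of a frequency-thresholded vocabulary. It is not TorchText's implementation: the helper names are hypothetical, and the order of the special tokens is an assumption chosen so that `<pad>` lands on index 1, as it does later in this notebook.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Return (itos, stoi) built from the given (training) sentences."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties broken alphabetically for a stable order.
    kept = sorted((tok for tok, c in counts.items() if c >= min_freq),
                  key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending unknown tokens to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

train_sentences = [["a", "man", "walks"], ["a", "man", "runs"], ["a", "dog"]]
itos, stoi = build_vocab(train_sentences, min_freq=2)
print(itos)
print(numericalize(["a", "dog", "walks"], stoi))
```

Here "dog" and "walks" each occur only once, so both map to the `<unk>` index.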

In [77]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")
# SRC.vocab.stoi

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893


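The final data preparation step is to create iterators that return one batch of data at a time.
We define a `torch.device` so the model runs on the GPU when `torch.cuda.is_available()` reports one, and on the CPU otherwise.
Within a batch, all source sentences must have the same length, and so must all target sentences, so shorter sentences are padded.
`BucketIterator` groups sentences of similar length into the same batch, which keeps the amount of padding small.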

In [78]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [79]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)
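
The benefit of bucketing can be shown with a simplified sketch. This is not how `BucketIterator` is implemented (it also shuffles, among other things); the helpers below are hypothetical and only compare the padding needed with and without grouping by length.

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Pad every sentence in a batch to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

def bucket_batches(sentences, batch_size):
    """Sort by length, then slice into batches so each batch needs little padding."""
    ordered = sorted(sentences, key=len)
    return [pad_batch(ordered[i:i + batch_size])
            for i in range(0, len(ordered), batch_size)]

def count_pads(batches, pad_token="<pad>"):
    """Total number of padding tokens across all batches."""
    return sum(tok == pad_token for batch in batches for sent in batch for tok in sent)

# Six toy sentences with lengths 2, 9, 3, 10, 2 and 8.
sentences = [["a"] * n for n in (2, 9, 3, 10, 2, 8)]

# Batches taken in the original order versus batches grouped by length.
naive = [pad_batch(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
bucketed = bucket_batches(sentences, batch_size=2)

print(count_pads(naive), count_pads(bucketed))
```

Less padding means fewer wasted computations per batch, since padded positions do not contribute to the loss.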

## Building the Seq2Seq Model
We now create the encoder and the decoder.

### Encoder
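The encoder is a 2-layer LSTM. The paper uses 4 layers, but 2 layers keep training time down.
$h_t^1$ is the hidden state of the first layer and $h_t^2$ that of the second: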
$$h_t^1 = \text{EncoderRNN}^1(x_t, h_{t-1}^1)$$
$$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$$
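Each layer needs its own initial hidden state $h_0$.

For comparison, a standard RNN carries only a hidden state between time steps, whereas an LSTM also carries a cell state: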
$$\begin{align*}
h_t &= \text{RNN}(x_t, h_{t-1})\\
(h_t, c_t) &= \text{LSTM}(x_t, (h_{t-1}, c_{t-1}))
\end{align*}$$

For both layers, $c_0$ and $h_0$ are tensors of zeros.
Both layers are LSTMs, with inputs and outputs as follows:
$$\begin{align*}
(h_t^1, c_t^1) &= \text{EncoderLSTM}^1(x_t, (h_{t-1}^1, c_{t-1}^1))\\
(h_t^2, c_t^2) &= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))
\end{align*}$$

The encoder structure is shown below:
![](assets/seq2seq2.png)

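We define an `Encoder` class by subclassing `torch.nn.Module`. It takes the following arguments:
- `input_dim`: the size of the input, equal to the source vocabulary size, since each token index is equivalent to a one-hot vector
- `emb_dim`: the dimensionality of the embedding layer
- `hid_dim`: the dimensionality of the hidden and cell states
- `n_layers`: the number of RNN layers
- `dropout`: the dropout probability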

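In the `forward` method, the German sentence is passed in as $X$ and converted to dense vectors by the embedding layer. Dropout is applied to the embeddings, which are then passed to the RNN. If no initial hidden or cell state is given, PyTorch initializes $h_0$ and $c_0$ to zeros, so we don't have to create them ourselves.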

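The RNN returns `outputs`, the hidden states of the top layer at every time step; `hidden`, the final hidden state of each layer; and `cell`, the final cell state of each layer.
Only the hidden and cell states are needed here.
`n directions` is the number of directions of the RNN. It is 1 here and becomes 2 for a bidirectional RNN.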

The LSTM equations as implemented in PyTorch, where $\odot$ is the element-wise product:
$$\begin{align*}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}$$
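
As a sanity check on the LSTM equations, the sketch below computes one LSTM step by hand and compares it with `nn.LSTMCell`. It relies on PyTorch stacking the four gate weight blocks in the order input, forget, cell candidate, output; the sizes and variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch_size = 3, 4, 2
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)
c_prev = torch.randn(batch_size, hidden_size)

with torch.no_grad():
    # All four gates are computed in one matrix product, then split.
    # The weight rows are stacked in the order i, f, g, o.
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)

    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)

    c_manual = f * c_prev + i * g            # new cell state
    h_manual = o * torch.tanh(c_manual)      # new hidden state

    # Reference result from the built-in cell.
    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5),
      torch.allclose(c_manual, c_ref, atol=1e-5))
```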

In [80]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        # note: this dropout acts between the LSTM layers, in addition to self.dropout on the embeddings
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell
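
The tensor shapes noted in the comments can be checked on a small example. The sizes below are arbitrary and much smaller than the ones used later in the notebook; the sketch only exercises `nn.Embedding` and `nn.LSTM` the same way the `Encoder` does.

```python
import torch
import torch.nn as nn

src_len, batch_size = 7, 3
vocab_size, emb_dim, hid_dim, n_layers = 50, 8, 16, 2

embedding = nn.Embedding(vocab_size, emb_dim)
rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=0.5)

# A random batch of token indices: [src len, batch size]
src = torch.randint(0, vocab_size, (src_len, batch_size))

# No initial state is passed, so h_0 and c_0 default to zeros.
outputs, (hidden, cell) = rnn(embedding(src))

print(outputs.shape)  # top-layer hidden state at every time step
print(hidden.shape)   # final hidden state of every layer
print(cell.shape)     # final cell state of every layer

# The last time step of `outputs` is the top layer's final hidden state.
same_top_state = torch.allclose(outputs[-1], hidden[-1])
print(same_top_state)
```

This is why the encoder can discard `outputs`: everything the decoder needs for its initial state is in `hidden` and `cell`.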

### Decoder

The decoder is also a 2-layer LSTM.
![](assets/seq2seq3.png)

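During decoding, the second LSTM layer takes the hidden state of the first layer as its input. In the equations below the decoder hidden states are written $s_t$, with previous hidden and cell states $s_{t-1}$ and $c_{t-1}$: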
$$\begin{align*}
(s_t^1, c_t^1) = \text{DecoderLSTM}^1(y_t, (s_{t-1}^1, c_{t-1}^1))\\
(s_t^2, c_t^2) = \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
\end{align*}$$

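The initial states $s_0$ and $c_0$ are taken directly from the encoder's final hidden and cell states, which does not match the formulas in the introduction: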
$(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.

Here a linear layer is applied directly to obtain $\hat{y}$, which again differs from the seq2seq formulas in the introduction:
$$\hat{y}_{t+1} = f(s_t^L)$$

The `output` of the RNN is passed through the linear layer to make the prediction.

**Note**: as we always have a sequence length of 1, we could use `nn.LSTMCell`, instead of `nn.LSTM`, as it is designed to handle a batch of inputs that aren't necessarily in a sequence. `nn.LSTMCell` is just a single cell and `nn.LSTM` is a wrapper around potentially multiple cells. Using the `nn.LSTMCell` in this case would mean we don't have to `unsqueeze` to add a fake sequence length dimension, but we would need one `nn.LSTMCell` per layer in the decoder and to ensure each `nn.LSTMCell` receives the correct initial hidden state from the encoder. All of this makes the code less concise - hence the decision to stick with the regular `nn.LSTM`.

In [81]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        #add a sequence-length dimension of 1, since we predict one word at a time
        input = input.unsqueeze(0)  
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

### Seq2Seq
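The `Seq2Seq` module ties the encoder and the decoder together.
![](assets/seq2seq4.png)
The implementation below requires the encoder and decoder to have the same number of layers and the same hidden dimension. They could differ, but the encoder's states would then have to be transformed to the decoder's dimensions by hand.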

At each step of the decoding loop we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$/`output` in our tensor of predictions, $\hat{Y}$/`outputs`
- decide if we are going to "teacher force" or not
    - if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$/`trg[t]`
    - if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top1`, which we get by doing an `argmax` over the output tensor
    
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/`outputs`.

**Note**: our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$
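
The loop and the alignment of `trg` and `outputs` can be sketched without any tensors. In this toy version `predict_fn` is a hypothetical stand-in for the decoder followed by `argmax`, and the sequence is a single sentence rather than a batch.

```python
import random

def decode_with_teacher_forcing(trg, predict_fn, teacher_forcing_ratio, rng):
    """Run the decoding loop over one target sequence (no batching).

    trg        -- list of ground-truth tokens, starting with <sos>
    predict_fn -- stand-in for decoder + argmax: previous token -> next token
    """
    outputs = [None] * len(trg)   # position 0 is never filled
    inp = trg[0]                  # first input is <sos>
    for t in range(1, len(trg)):
        pred = predict_fn(inp)
        outputs[t] = pred
        teacher_force = rng.random() < teacher_forcing_ratio
        # With teacher forcing the next input is the ground truth,
        # otherwise it is the model's own prediction.
        inp = trg[t] if teacher_force else pred
    return outputs

trg = ["<sos>", "two", "young", "men", "<eos>"]

# A deliberately imperfect toy predictor: it answers "two" after "two".
toy_table = {"<sos>": "two", "two": "two", "young": "men", "men": "<eos>"}
predict = lambda tok: toy_table.get(tok, "<eos>")

rng = random.Random(0)
always = decode_with_teacher_forcing(trg, predict, 1.0, rng)
never = decode_with_teacher_forcing(trg, predict, 0.0, rng)

print(always[1:])  # errors are corrected by the ground-truth inputs
print(never[1:])   # one early error keeps feeding back into the decoder
```

Without teacher forcing the toy predictor gets stuck repeating its own mistake, which illustrates why feeding in ground-truth tokens part of the time can make training easier.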

In [82]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is the probability of using the ground-truth token as the next input
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs

## Training the Seq2Seq Model

In [83]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

The next step is to initialize the parameters. They are drawn from a uniform distribution between -0.08 and +0.08.

In [84]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

Count the number of trainable parameters in the model.

In [85]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 13,899,013 trainable parameters
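
The reported total can be reproduced by hand from the layer sizes printed above. The helper functions below are hypothetical; they assume a unidirectional `nn.LSTM` with four gates and two bias vectors per layer, and a linear layer with a bias.

```python
def embedding_params(vocab_size, emb_dim):
    """One emb_dim-sized vector per vocabulary entry."""
    return vocab_size * emb_dim

def lstm_params(input_dim, hid_dim, n_layers):
    """Parameters of a unidirectional LSTM: 4 gates, two bias vectors per layer."""
    total = 0
    for layer in range(n_layers):
        layer_input = input_dim if layer == 0 else hid_dim
        total += 4 * hid_dim * (layer_input + hid_dim)  # W_ih and W_hh
        total += 2 * 4 * hid_dim                        # b_ih and b_hh
    return total

def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features

SRC_VOCAB, TRG_VOCAB = 7855, 5893
EMB_DIM, HID_DIM, N_LAYERS = 256, 512, 2

encoder_total = (embedding_params(SRC_VOCAB, EMB_DIM)
                 + lstm_params(EMB_DIM, HID_DIM, N_LAYERS))
decoder_total = (embedding_params(TRG_VOCAB, EMB_DIM)
                 + lstm_params(EMB_DIM, HID_DIM, N_LAYERS)
                 + linear_params(HID_DIM, TRG_VOCAB))

total_params = encoder_total + decoder_total
print(f"{total_params:,}")
```

The two embedding tables and the output layer account for roughly 6.5 million of these parameters, so the vocabulary sizes have a large effect on model size.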


We use the Adam optimizer with its default settings.

In [86]:
optimizer = optim.Adam(model.parameters())

The loss function is `CrossEntropyLoss`.
Positions where the target is the `<pad>` token are ignored when computing the loss.

In [87]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]    # the index of <pad> is 1

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
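
What ignoring the padding index means for the averaged loss can be sketched in plain Python. The function below is a hypothetical illustration, not PyTorch's implementation: it computes the mean negative log-likelihood over only those positions whose target is not the padding index.

```python
import math

def cross_entropy_ignore_pad(logits, targets, pad_idx):
    """Mean negative log-likelihood over positions whose target is not pad_idx.

    logits  -- list of per-position score lists, shape [N][vocab]
    targets -- list of N target indices
    """
    total, counted = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == pad_idx:
            continue                      # padded positions contribute nothing
        m = max(scores)                   # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[target]
        counted += 1
    return total / counted

PAD_IDX = 1
logits = [
    [2.0, 0.0, 0.0],   # favours class 0
    [0.0, 0.0, 2.0],   # favours class 2
    [5.0, 0.0, 0.0],   # position whose target is <pad>
]
targets = [0, 2, PAD_IDX]

loss_value = cross_entropy_ignore_pad(logits, targets, PAD_IDX)
print(round(loss_value, 4))
```

Because the padded position is skipped entirely, changing its scores has no effect on the loss, and the average is taken over two positions rather than three.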

The targets and the predictions look like this:
$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

After dropping the first position, which does not take part in the loss:
$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

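In each iteration we:
- get a batch of source and target sentences, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the predictions $\hat{Y}$
- flatten the targets and the predictions so the loss can be computed
- compute the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding
- update the model parameters with the optimizer step
- add the batch loss to a running total

Finally, we return the loss averaged over all batches.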

In [92]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        print('The %s batch loss is %s' % (i, loss.item()))
    return epoch_loss / len(iterator)
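
The gradient clipping step rescales all gradients together when their joint norm exceeds `clip`. A minimal plain-Python sketch of that idea follows; the function name and the small `eps` term are assumptions for illustration, and gradients are represented as plain lists of floats.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale down only when the norm is too large; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]      # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is limited.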

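Evaluation differs from training in three ways:
- `model.eval()` switches to evaluation mode, which disables dropout
- `with torch.no_grad()` ensures no gradients are computed
- teacher forcing is turned off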

In [89]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
        
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

A helper to compute how long an epoch takes:

In [90]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Now we train the model. Whenever an epoch reaches the best validation loss so far, the model's parameters are saved.

In [93]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

The 0 batch loss is 5.227309703826904
The 1 batch loss is 5.18184757232666
The 2 batch loss is 5.241634368896484
The 3 batch loss is 5.3097453117370605
The 4 batch loss is 5.301510810852051
The 5 batch loss is 5.311967372894287
The 6 batch loss is 5.176915645599365
The 7 batch loss is 5.19198751449585
The 8 batch loss is 5.212375640869141
The 9 batch loss is 5.267341613769531
The 10 batch loss is 5.166488170623779
The 11 batch loss is 5.163708209991455
The 12 batch loss is 5.062734127044678
The 13 batch loss is 5.11533784866333
The 14 batch loss is 5.219797134399414
The 15 batch loss is 5.204813003540039
The 16 batch loss is 5.164453983306885
The 17 batch loss is 5.141345977783203
The 18 batch loss is 5.200850009918213
The 19 batch loss is 5.052821159362793
The 20 batch loss is 5.056247711181641
The 21 batch loss is 5.130373954772949
The 22 batch loss is 5.2228803634643555
The 23 batch loss is 5.107409477233887
The 24 batch loss is 5.150378704071045
The 25 batch loss is 5.0288772583007

The 208 batch loss is 4.624127388000488
The 209 batch loss is 4.8941802978515625
The 210 batch loss is 4.716525077819824
The 211 batch loss is 4.693122386932373
The 212 batch loss is 4.802249431610107
The 213 batch loss is 4.684479713439941
The 214 batch loss is 4.8478617668151855
The 215 batch loss is 4.644590854644775
The 216 batch loss is 4.612752437591553
The 217 batch loss is 4.643129825592041
The 218 batch loss is 4.735301494598389
The 219 batch loss is 4.7895188331604
The 220 batch loss is 4.69575834274292
The 221 batch loss is 4.765822887420654
The 222 batch loss is 4.633568286895752
The 223 batch loss is 4.453860759735107
The 224 batch loss is 4.858541965484619
The 225 batch loss is 4.764775276184082
The 226 batch loss is 4.652652263641357
Epoch: 01 | Time: 28m 52s
	Train Loss: 4.925 | Train PPL: 137.757
	 Val. Loss: 4.960 |  Val. PPL: 142.572
The 0 batch loss is 4.797477722167969
The 1 batch loss is 4.631030559539795
The 2 batch loss is 4.597670555114746
The 3 batch loss is 4

The 187 batch loss is 4.38511848449707
The 188 batch loss is 4.5495829582214355
The 189 batch loss is 4.470181465148926
The 190 batch loss is 4.445215225219727
The 191 batch loss is 4.3130292892456055
The 192 batch loss is 4.513171195983887
The 193 batch loss is 4.377730846405029
The 194 batch loss is 4.733048439025879
The 195 batch loss is 4.615927696228027
The 196 batch loss is 4.302975177764893
The 197 batch loss is 4.529049396514893
The 198 batch loss is 4.659888744354248
The 199 batch loss is 4.368391990661621
The 200 batch loss is 4.438083171844482
The 201 batch loss is 4.471482276916504
The 202 batch loss is 4.4749650955200195
The 203 batch loss is 4.561156749725342
The 204 batch loss is 4.47731876373291
The 205 batch loss is 4.310968399047852
The 206 batch loss is 4.344976425170898
The 207 batch loss is 4.321959972381592
The 208 batch loss is 4.3902812004089355
The 209 batch loss is 4.406135559082031
The 210 batch loss is 4.488180637359619
The 211 batch loss is 4.42556047439575

The 165 batch loss is 4.410851955413818
The 166 batch loss is 4.172938346862793
The 167 batch loss is 4.141534805297852
The 168 batch loss is 4.1402106285095215
The 169 batch loss is 4.24867057800293
The 170 batch loss is 4.18686056137085
The 171 batch loss is 4.1696600914001465
The 172 batch loss is 4.417567729949951
The 173 batch loss is 4.372066020965576
The 174 batch loss is 4.341028690338135
The 175 batch loss is 4.31783390045166
The 176 batch loss is 4.0787739753723145
The 177 batch loss is 4.081729888916016
The 178 batch loss is 4.504328727722168
The 179 batch loss is 4.191431999206543
The 180 batch loss is 3.9646105766296387
The 181 batch loss is 4.264227390289307
The 182 batch loss is 4.145939350128174
The 183 batch loss is 4.2639594078063965
The 184 batch loss is 4.156716823577881
The 185 batch loss is 4.309530735015869
The 186 batch loss is 4.12214469909668
The 187 batch loss is 4.108486652374268
The 188 batch loss is 4.094907283782959
The 189 batch loss is 4.182820796966553

The 143 batch loss is 4.033021926879883
The 144 batch loss is 4.098025321960449
The 145 batch loss is 4.2982635498046875
The 146 batch loss is 4.094449043273926
The 147 batch loss is 4.156161785125732
The 148 batch loss is 3.763213872909546
The 149 batch loss is 3.8338475227355957
The 150 batch loss is 3.9777514934539795
The 151 batch loss is 3.834501028060913
The 152 batch loss is 3.872028112411499
The 153 batch loss is 4.017601490020752
The 154 batch loss is 3.6380255222320557
The 155 batch loss is 3.899747371673584
The 156 batch loss is 3.8690738677978516
The 157 batch loss is 3.924701452255249
The 158 batch loss is 3.95959734916687
The 159 batch loss is 3.7458441257476807
The 160 batch loss is 4.267694473266602
The 161 batch loss is 4.167255878448486
The 162 batch loss is 3.954576253890991
The 163 batch loss is 4.109953880310059
The 164 batch loss is 4.117443084716797
The 165 batch loss is 3.9337501525878906
The 166 batch loss is 4.155518531799316
The 167 batch loss is 3.6702616214

The 119 batch loss is 3.696643114089966
The 120 batch loss is 3.658400774002075
The 121 batch loss is 3.7597103118896484
The 122 batch loss is 3.3438498973846436
The 123 batch loss is 3.5431594848632812
The 124 batch loss is 3.770787477493286
The 125 batch loss is 3.8425347805023193
The 126 batch loss is 3.7276768684387207
The 127 batch loss is 3.883430242538452
The 128 batch loss is 3.864579677581787
The 129 batch loss is 3.71492862701416
The 130 batch loss is 3.8233070373535156
The 131 batch loss is 3.770671844482422
The 132 batch loss is 3.8678715229034424
The 133 batch loss is 3.6873040199279785
The 134 batch loss is 3.79801607131958
The 135 batch loss is 3.6075408458709717
The 136 batch loss is 3.9487831592559814
The 137 batch loss is 3.631981134414673
The 138 batch loss is 4.290680408477783
The 139 batch loss is 3.7343101501464844
The 140 batch loss is 3.6792447566986084
The 141 batch loss is 4.1345133781433105
The 142 batch loss is 4.021808624267578
The 143 batch loss is 3.97473

The 95 batch loss is 3.7364118099212646
The 96 batch loss is 3.4881410598754883
The 97 batch loss is 3.8260157108306885
The 98 batch loss is 4.128074645996094
The 99 batch loss is 3.881185293197632
The 100 batch loss is 3.540809392929077
The 101 batch loss is 3.809256076812744
The 102 batch loss is 3.781315326690674
The 103 batch loss is 3.778160810470581
The 104 batch loss is 3.5176634788513184
The 105 batch loss is 3.729886293411255
The 106 batch loss is 3.6450741291046143
The 107 batch loss is 3.684741497039795
The 108 batch loss is 3.6268253326416016
The 109 batch loss is 3.5216970443725586
The 110 batch loss is 3.351209878921509
The 111 batch loss is 3.7819838523864746
The 112 batch loss is 3.777371406555176
The 113 batch loss is 3.997359037399292
The 114 batch loss is 3.4139771461486816
The 115 batch loss is 3.825582265853882
The 116 batch loss is 3.469015121459961
The 117 batch loss is 3.7952606678009033
The 118 batch loss is 3.8098208904266357
The 119 batch loss is 3.7193384170

The 70 batch loss is 3.3754324913024902
The 71 batch loss is 3.5493974685668945
The 72 batch loss is 3.517956018447876
The 73 batch loss is 3.509089469909668
The 74 batch loss is 3.2330641746520996
The 75 batch loss is 3.772731304168701
The 76 batch loss is 3.641932487487793
The 77 batch loss is 3.517904281616211
The 78 batch loss is 3.7057511806488037
The 79 batch loss is 3.5159432888031006
The 80 batch loss is 3.36368989944458
The 81 batch loss is 3.4531030654907227
The 82 batch loss is 3.3277313709259033
The 83 batch loss is 3.3838977813720703
The 84 batch loss is 3.7549831867218018
The 85 batch loss is 3.5066277980804443
The 86 batch loss is 3.8696632385253906
The 87 batch loss is 3.313004732131958
The 88 batch loss is 3.2742819786071777
The 89 batch loss is 3.412674903869629
The 90 batch loss is 3.772923707962036
The 91 batch loss is 3.324510335922241
The 92 batch loss is 3.477475643157959
The 93 batch loss is 3.4082701206207275
The 94 batch loss is 3.5108273029327393
The 95 batch

The 45 batch loss is 3.231532096862793
The 46 batch loss is 3.410384178161621
The 47 batch loss is 3.4966461658477783
The 48 batch loss is 3.2455992698669434
The 49 batch loss is 3.4959897994995117
The 50 batch loss is 3.258225679397583
The 51 batch loss is 3.3361918926239014
The 52 batch loss is 2.9442825317382812
The 53 batch loss is 3.950249433517456
The 54 batch loss is 3.7281887531280518
The 55 batch loss is 3.304412364959717
The 56 batch loss is 3.280369281768799
The 57 batch loss is 3.687957763671875
The 58 batch loss is 3.759880542755127
The 59 batch loss is 3.435725212097168
The 60 batch loss is 3.578972816467285
The 61 batch loss is 3.532464027404785
The 62 batch loss is 3.2556543350219727
The 63 batch loss is 3.5143375396728516
The 64 batch loss is 3.6539433002471924
The 65 batch loss is 3.356271982192993
The 66 batch loss is 3.5853617191314697
The 67 batch loss is 3.5372796058654785
The 68 batch loss is 3.684807300567627
The 69 batch loss is 3.3698835372924805
The 70 batch 

The 20 batch loss is 3.8712539672851562
The 21 batch loss is 3.5050997734069824
The 22 batch loss is 3.434964179992676
The 23 batch loss is 3.4981441497802734
The 24 batch loss is 3.1887590885162354
The 25 batch loss is 3.472649574279785
The 26 batch loss is 3.3422698974609375
The 27 batch loss is 3.27793288230896
The 28 batch loss is 3.0221009254455566
The 29 batch loss is 3.3563284873962402
The 30 batch loss is 3.22770357131958
The 31 batch loss is 2.981278896331787
The 32 batch loss is 3.2826895713806152
The 33 batch loss is 3.0839321613311768
The 34 batch loss is 3.510685920715332
The 35 batch loss is 3.0990681648254395
The 36 batch loss is 3.0563879013061523
The 37 batch loss is 3.0412447452545166
The 38 batch loss is 3.1715035438537598
The 39 batch loss is 3.224090337753296
The 40 batch loss is 3.467590808868408
The 41 batch loss is 3.3159754276275635
The 42 batch loss is 3.096254825592041
The 43 batch loss is 3.1249289512634277
The 44 batch loss is 3.4180502891540527
The 45 batc

The 225 batch loss is 3.1809329986572266
The 226 batch loss is 2.881571054458618
Epoch: 09 | Time: 24m 53s
	Train Loss: 3.268 | Train PPL:  26.258
	 Val. Loss: 3.972 |  Val. PPL:  53.088
The 0 batch loss is 3.29844069480896
The 1 batch loss is 2.8954949378967285
The 2 batch loss is 3.1408958435058594
The 3 batch loss is 3.2076969146728516
The 4 batch loss is 3.1362385749816895
The 5 batch loss is 3.12420916557312
The 6 batch loss is 2.9977762699127197
The 7 batch loss is 3.118506669998169
The 8 batch loss is 3.1913790702819824
The 9 batch loss is 3.3622617721557617
The 10 batch loss is 3.2785580158233643
The 11 batch loss is 3.0558714866638184
The 12 batch loss is 2.9261820316314697
The 13 batch loss is 3.349606990814209
The 14 batch loss is 3.099592447280884
The 15 batch loss is 3.0850260257720947
The 16 batch loss is 2.88156795501709
The 17 batch loss is 3.3520941734313965
The 18 batch loss is 3.0549144744873047
The 19 batch loss is 3.2173473834991455
The 20 batch loss is 3.075501203

The 201 batch loss is 3.1332168579101562
The 202 batch loss is 3.2177696228027344
The 203 batch loss is 3.1426827907562256
The 204 batch loss is 3.1447064876556396
The 205 batch loss is 3.222234010696411
The 206 batch loss is 2.9496994018554688
The 207 batch loss is 3.3220250606536865
The 208 batch loss is 3.1690926551818848
The 209 batch loss is 3.1986851692199707
The 210 batch loss is 3.609957456588745
The 211 batch loss is 3.264378309249878
The 212 batch loss is 3.2297916412353516
The 213 batch loss is 2.9110982418060303
The 214 batch loss is 3.213965654373169
The 215 batch loss is 3.351102352142334
The 216 batch loss is 2.8525805473327637
The 217 batch loss is 3.1235055923461914
The 218 batch loss is 3.174438238143921
The 219 batch loss is 3.238590717315674
The 220 batch loss is 2.8850996494293213
The 221 batch loss is 3.688338279724121
The 222 batch loss is 3.265326499938965
The 223 batch loss is 2.8652491569519043
The 224 batch loss is 3.0527756214141846
The 225 batch loss is 3.0

Load the best saved model and evaluate it on the test set.

In [94]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.825 | Test PPL:  45.832 |
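
The perplexity (PPL) reported next to each loss is simply the exponential of the mean cross-entropy. The short sketch below recomputes it from the rounded test loss, so the last digit can differ slightly from the value printed by the notebook, which used the unrounded loss.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

test_ppl = perplexity(3.825)
print(f"{test_ppl:.3f}")

# For reference: guessing uniformly over a vocabulary of size V gives
# a loss of log(V) and therefore a perplexity of V.
uniform_ppl = perplexity(math.log(5893))
print(round(uniform_ppl))
```

A perplexity of about 46 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 46 words at each position, compared with 5893 for a uniform guess over the target vocabulary.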
