# Lesson 3: Language Models

褚则伟 zeweichu@gmail.com

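Learning objectives
- Learn what a language model is and how to train one
- Learn the basics of torchtext
    - Building a vocabulary
    - Word to index and index to word
- Learn some basic building blocks in torch.nn
    - Linear
    - RNN
    - LSTM
    - GRU
- Training tricks for RNNs
    - Gradient Clipping
- How to save and load a model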

We will use [torchtext](https://github.com/pytorch/text) to build the vocabulary and then read the data in batches. Please read the README on your own to learn torchtext.

In [13]:
import torchtext
from torchtext.vocab import Vectors
import torch
import numpy as np
import random

USE_CUDA = torch.cuda.is_available()

# To make experimental results reproducible, we usually fix the various random seeds to a set value
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if USE_CUDA:
    torch.cuda.manual_seed(53113)

BATCH_SIZE = 8
EMBEDDING_SIZE = 100
MAX_VOCAB_SIZE = 10000

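- We will keep using text8 from last time as our training, validation, and test data.
- A key concept in TorchText is the `Field`, which determines how your data is processed. We use the `TEXT` field to handle text data. Our `TEXT` field is created with `lower=True`, so every word is lowercased.
- torchtext provides the `LanguageModelingDataset` class to help us handle language modeling datasets.
- `build_vocab` builds a vocabulary of the most frequent words from the training set we provide; `max_size` caps the total number of words.
- `BPTTIterator` yields consecutive, contiguous chunks of text. BPTT stands for back propagation through time.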
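
To make the `build_vocab` step concrete, here is a minimal pure-Python sketch of what building a capped vocabulary amounts to. It is not torchtext's implementation: `build_vocab_sketch`, the toy sentence, and the placement of the special tokens at the front are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab_sketch(tokens, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens plus special tokens (illustrative only)."""
    counts = Counter(tokens)
    # Sort by descending frequency, breaking ties alphabetically for determinism
    most_common = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    itos = list(specials) + [word for word, _ in most_common]  # index -> word
    stoi = {word: idx for idx, word in enumerate(itos)}         # word -> index
    return itos, stoi

tokens = "the cat sat on the mat the cat ran".lower().split()
itos, stoi = build_vocab_sketch(tokens, max_size=3)

# Words outside the vocabulary fall back to the <unk> index
unk = stoi["<unk>"]
encoded = [stoi.get(w, unk) for w in "the dog sat".split()]
decoded = [itos[i] for i in encoded]
print(len(itos))   # max_size words plus the two special tokens
print(decoded)
```

Under the same logic, a cap of `max_size=10000` yields a vocabulary of 10002 entries once the two special tokens are counted.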


In [14]:
TEXT = torchtext.data.Field(lower=True)
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(path=".", 
    train="text8.train.txt", validation="text8.dev.txt", test="text8.test.txt", text_field=TEXT)
TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
print("vocabulary size: {}".format(len(TEXT.vocab)))

VOCAB_SIZE = len(TEXT.vocab)
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=BATCH_SIZE, device=-1, bptt_len=32, repeat=False, shuffle=True)


The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


vocabulary size: 10002


### The argument above should be changed to device="cpu"

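- Why does our vocabulary have 10002 words rather than 10000? Because TorchText adds two special tokens: `<unk>` for unknown words and `<pad>` for padding.
- The model's input is a sequence of words and its output is also a sequence of words, offset from each other by one position, because the goal of a language model is to predict the next word from the words before it.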
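
The one-position offset between input and target can be sketched without torchtext. The helper below is a hypothetical illustration, not `BPTTIterator` itself: it cuts a token stream into chunks of `bptt_len` tokens and pairs each chunk with the same span shifted forward by one token.

```python
def bptt_chunks(tokens, bptt_len):
    """Yield (text, target) pairs where target is text shifted by one token."""
    # The last token has no successor, so it can only ever appear as a target
    for start in range(0, len(tokens) - 1, bptt_len):
        end = min(start + bptt_len, len(tokens) - 1)
        text = tokens[start:end]
        target = tokens[start + 1:end + 1]
        yield text, target

stream = "a jet powered vehicle and an aircraft".split()
pairs = list(bptt_chunks(stream, bptt_len=3))
for text, target in pairs:
    print(text, "->", target)
```

Each target chunk starts one word later than its text chunk, which is the same pattern visible in the printed `batch.text` and `batch.target` columns.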

In [15]:
it = iter(train_iter)
batch = next(it)
print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,1].data]))
print(" ".join([TEXT.vocab.itos[i] for i in batch.target[:,1].data]))

a jet powered vehicle and an aircraft typically a <unk> or an <unk> airplane historical <unk> one nine zero nine <unk> air meet in france in august one nine zero nine a
jet powered vehicle and an aircraft typically a <unk> or an <unk> airplane historical <unk> one nine zero nine <unk> air meet in france in august one nine zero nine a key


In [16]:
for i in range(5):
    batch = next(it)
    print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,2].data]))
    print(" ".join([TEXT.vocab.itos[i] for i in batch.target[:,2].data]))

three one nine six two six eight five three two three zero three four one zero four one two five five three five one zero six five four nine zero three six
one nine six two six eight five three two three zero three four one zero four one two five five three five one zero six five four nine zero three six three
three five three nine one two four five zero zero zero four one one two seven seven eight five five four three four zero four four one two seven one seven two
five three nine one two four five zero zero zero four one one two seven seven eight five five four three four zero four four one two seven one seven two five
five four two one three zero seven zero four zero four four four five four nine one one five one one five five three eight one one eight two three eight zero
four two one three zero seven zero four zero four four four five four nine one one five one one five five three eight one one eight two three eight zero three
three nine five zero five four one zero three three th

In [41]:
batch=next(it)
da=batch.text
print(da.size())
print(*da.size())

torch.Size([32, 8])
32 8


### Defining the model

- Subclass nn.Module
- The initializer
- The forward function
- Any other functions the model needs

In [28]:
import torch
import torch.nn as nn


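class RNNModel(nn.Module):
    """A simple recurrent neural network."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5):
        ''' The model consists of the following layers:
            - a word embedding layer
            - a recurrent layer (RNN, LSTM, GRU)
            - a linear layer mapping the hidden state to the output vocabulary
            - a dropout layer for regularization
        '''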
        super(RNNModel, self).__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        ''' Forward pass:
            - word embedding
            - feed the embeddings into the recurrent network
            - a linear layer maps the hidden state to the output vocabulary
        '''
        # input: seq_length * batch_size
        emb = self.drop(self.encoder(input))  # seq_length * batch_size * embed_size
        output, hidden = self.rnn(emb, hidden)
        # output: seq_length * batch_size * hidden_size
        # hidden (LSTM): (nlayers * batch_size * hidden_size, nlayers * batch_size * hidden_size)
        output = self.drop(output)
        # Flatten to (seq_length * batch_size) * hidden_size before the linear layer;
        # decoded is then (seq_length * batch_size) * vocab_size
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden

    def init_hidden(self, bsz, requires_grad=True):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad),
                    weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad))
        else:
            return weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad)

Initialize a model

In [29]:
model = RNNModel("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)
if USE_CUDA:
    model = model.cuda()
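
As a rough sanity check on the size of this model, the parameter count can be worked out by hand. The sketch below assumes the usual PyTorch LSTM parameterization (four gates, each with input weights, recurrent weights, and two bias vectors) and a vocabulary of 10002 words; `lstm_layer_params` is a hypothetical helper, not a library function.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters in one LSTM layer: 4 gates x (W_ih + W_hh + b_ih + b_hh)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, embed_size, hidden_size, num_layers = 10002, 100, 100, 2

embedding = vocab_size * embed_size
# The first layer reads embeddings; later layers read the previous layer's hidden state
lstm = (lstm_layer_params(embed_size, hidden_size)
        + (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size))
decoder = hidden_size * vocab_size + vocab_size  # weight matrix plus bias

total = embedding + lstm + decoder
print(embedding, lstm, decoder, total)
```

Under these assumptions most of the parameters sit in the embedding and output layers rather than in the LSTM itself.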

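- We first define the code for evaluating the model.
- Evaluation follows essentially the same logic as training; the only difference is that we need only the forward pass, not the backward pass.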

In [30]:
def evaluate(model, data):
    model.eval()
    total_loss = 0.
    it = iter(data)
    total_count = 0.
    with torch.no_grad():
        hidden = model.init_hidden(BATCH_SIZE, requires_grad=False)
        for i, batch in enumerate(it):
            data, target = batch.text, batch.target
            if USE_CUDA:
                data, target = data.cuda(), target.cuda()
            hidden = repackage_hidden(hidden)
            output, hidden = model(data, hidden)
            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
            total_count += np.multiply(*data.size())
            total_loss += loss.item()*np.multiply(*data.size())
            
    loss = total_loss / total_count
    model.train()
    return loss


We need to define the following function, which detaches a hidden state from the history of the computation graph that produced it.

In [31]:
# Used by both evaluate() and the training loop, so it must be defined before they run
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
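
The effect of detaching can be checked on toy tensors. In the sketch below, `w` and `h` are stand-ins for a model parameter and a hidden state; they are assumptions for illustration and not part of the lesson's model. After detaching, the values are unchanged but no longer connected to `w`, so a later backward pass cannot reach into earlier batches.

```python
import torch

def repackage_hidden(h):
    """Detach a tensor, or a (possibly nested) tuple of tensors, from its graph."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

w = torch.ones(3, requires_grad=True)
h = w * 2.0                                  # h carries history back to w
detached = repackage_hidden((h, h + 1.0))    # tuple, like an LSTM's (h, c)

print(h.requires_grad)                       # the original is still attached
print([t.requires_grad for t in detached])   # the detached copies are not
print(torch.equal(detached[0], h))           # values are unchanged
```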

Define the loss function and optimizer


In [32]:
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.5)
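
`ExponentialLR` multiplies the learning rate by `gamma` each time `scheduler.step()` is called. The pure-Python sketch below reproduces only that arithmetic for the values used here (initial rate 0.001, `gamma` 0.5); `exponential_lr` is a hypothetical helper, not part of PyTorch.

```python
def exponential_lr(initial_lr, gamma, num_steps):
    """Learning rate after 0..num_steps decay steps: initial_lr * gamma**k."""
    return [initial_lr * gamma ** k for k in range(num_steps + 1)]

schedule = exponential_lr(initial_lr=0.001, gamma=0.5, num_steps=3)
print(schedule)
```

With `gamma=0.5` the rate halves on every step, so a few non-improving validation checks are enough to shrink it considerably.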

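Training the model:
- A model is usually trained for several epochs
- In each epoch we split all the data into batches
- Wrap each batch's input and output as CUDA tensors
- Forward pass: predict the next word for every word of the input sentence
- Compute the cross entropy loss between the model's predictions and the correct next words
- Clear the model's current gradients
- Backward pass
- Gradient clipping, to prevent exploding gradients
- Update the model parameters
- Every so many iterations, print the loss at the current iteration and evaluate the model on the validation set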
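
The gradient clipping step can be sketched in plain Python. This is an illustration of norm-based clipping under simple assumptions (gradients as nested lists, a small `eps` to avoid division by zero), not the code behind `clip_grad_norm_`; `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]   # toy gradients for two parameters
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)

small, _ = clip_by_global_norm([[0.1, 0.2]], max_norm=1.0)
print(small)
```

Because every gradient is multiplied by the same factor, clipping shortens the update without changing its direction.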

In [33]:
import copy
GRAD_CLIP = 1.
NUM_EPOCHS = 2

val_losses = []
for epoch in range(NUM_EPOCHS):
    model.train()
    it = iter(train_iter)
    hidden = model.init_hidden(BATCH_SIZE)
    for i, batch in enumerate(it):
        data, target = batch.text, batch.target
        if USE_CUDA:
            data, target = data.cuda(), target.cuda()
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        output, hidden = model(data, hidden)
        loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        optimizer.step()
        if i % 500 == 0:
            print("epoch", epoch, "iter", i, "loss", loss.item())
    
        if i % 5000 == 0:
            val_loss = evaluate(model, val_iter)
            
            if len(val_losses) == 0 or val_loss < min(val_losses):
                print("best model, val loss: ", val_loss)
                torch.save(model.state_dict(), "lm-best.th")
            else:
                # No improvement on the validation set: decay the learning rate.
                # The optimizer is kept as-is so the decayed rate actually takes effect.
                scheduler.step()
            val_losses.append(val_loss)

epoch 0 iter 0 loss 9.2100248336792
best model, val loss:  9.203267803760312
epoch 0 iter 500 loss 6.336516857147217
epoch 0 iter 1000 loss 6.238970756530762
epoch 0 iter 1500 loss 6.127810955047607
epoch 0 iter 2000 loss 6.633368968963623
epoch 0 iter 2500 loss 6.150583744049072
epoch 0 iter 3000 loss 5.831256866455078
epoch 0 iter 3500 loss 6.034218788146973
epoch 0 iter 4000 loss 6.206968784332275
epoch 0 iter 4500 loss 6.000154972076416
epoch 0 iter 5000 loss 6.155076503753662
best model, val loss:  5.675728633221132
epoch 0 iter 5500 loss 5.967208385467529
epoch 0 iter 6000 loss 5.9928059577941895
epoch 0 iter 6500 loss 5.830376148223877
epoch 0 iter 7000 loss 6.240983009338379
epoch 0 iter 7500 loss 5.52305793762207
epoch 0 iter 8000 loss 5.768531799316406
epoch 0 iter 8500 loss 5.860937595367432
epoch 0 iter 9000 loss 5.940154075622559
epoch 0 iter 9500 loss 5.726594924926758
epoch 0 iter 10000 loss 5.735716819763184
best model, val loss:  5.4877998923957465
epoch 0 iter 10500 l

KeyboardInterrupt: 

In [42]:
best_model = RNNModel("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)
if USE_CUDA:
    best_model = best_model.cuda()
best_model.load_state_dict(torch.load("lm-best.th"))

<All keys matched successfully>

### Compute perplexity on the validation data with the best model

In [43]:
val_loss = evaluate(best_model, val_iter)
print("perplexity: ", np.exp(val_loss))

perplexity:  241.7248008249996
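
Perplexity is the exponential of the average per-token cross entropy. The short sketch below applies that to the best validation loss from the training log and, for comparison, to a model that assigns equal probability to every word; `perplexity` is a hypothetical helper and the vocabulary size of 10002 is taken from the output earlier in the lesson.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp of the average per-token cross entropy (in nats)."""
    return math.exp(mean_cross_entropy)

val_loss = 5.4877998923957465   # best validation loss from the training log
print(perplexity(val_loss))

# A model that spreads probability uniformly over V words has loss ln(V)
vocab_size = 10002
uniform_loss = math.log(vocab_size)
print(uniform_loss, perplexity(uniform_loss))
```

The uniform baseline also explains the first line of the training log: an untrained model starts with a loss of about 9.21, close to ln(10002).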


### Compute perplexity on the test data with the best model

In [16]:
test_loss = evaluate(best_model, test_iter)
print("perplexity: ", np.exp(test_loss))

perplexity:  178.54742013696125


Use the trained model to generate some sentences.

In [49]:
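best_model.eval()  # evaluate() leaves the model in train mode; turn dropout off for sampling
hidden = best_model.init_hidden(1)
print(hidden[0].shape)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)
print(input.shape)
words = []
with torch.no_grad():  # no gradients are needed for generation
    for i in range(100):
        output, hidden = best_model(input, hidden)
        # exp() of the logits gives unnormalized probabilities; multinomial normalizes them
        word_weights = output.squeeze().exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        input.fill_(word_idx)  # feed the sampled word back in as the next input
        word = TEXT.vocab.itos[word_idx]
        words.append(word)
print(" ".join(words))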

torch.Size([2, 1, 100])
torch.Size([1, 1])
scandal creators two zero zero zero deaths high expansion resolutions in one five zero one one five four three two zero three <unk> in one nine nine one one nine seven seven the series in the battle of the decree one nine three two july the congo novel one times still probable land party korea and companies over his trade of their game barrel one october two business four one nine seven three the <unk> army was <unk> from the <unk> weekly <unk> <unk> varies to the property must development it glass everyone traditions areas chile <unk> and alternatively some
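
The sampling step in the generation loop draws the next word with probability proportional to the exponential of its score. Here is a minimal pure-Python sketch of that idea on a toy three-word vocabulary. `sample_from_logits` is a hypothetical helper, and subtracting the maximum score is an added numerical safeguard that the cell above does not use.

```python
import math
import random

def sample_from_logits(logits, rng):
    """Draw an index with probability proportional to exp(logit)."""
    peak = max(logits)
    # Subtracting the max keeps exp() from overflowing and leaves the ratios unchanged
    weights = [math.exp(x - peak) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.0, -2.0]   # toy scores for a three-word vocabulary
draws = [sample_from_logits(logits, rng) for _ in range(10000)]
freqs = [draws.count(i) / len(draws) for i in range(len(logits))]
print(freqs)
```

Sampling rather than always taking the highest-scoring word is what makes each generated sentence different from run to run.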
