In [1]:
import numpy as np

# Description
In this tutorial, we'll see how we can train a language model using the built-in datasets in torchtext.

We'll also take a look at some more practical features of torchtext that you might want to use when training your own practical models.
This tutorial assumes that you have access to a GPU for the sake of training speed. If you don't have a GPU, you can change the following variable `USE_GPU` to False. Be warned though, since the training will be very slow.

In [2]:
USE_GPU = True
BATCH_SIZE = 32

# 1. What is Language Modeling?
Language modeling is a task where we build a model that can take a sequence of words as input and determine how likely that sequence is to be actual human language. For instance, we would want our model to predict "This is a sentence" to be a likely sequence and "cold his book her" to be unlikely.

The way we generally train language models is by training them to predict the next word given all previous words in a sentence or multiple sentences. Therefore, all we need to do language modeling is a large amount of language data (called a corpus).

In this tutorial, we'll be using the famous WikiText2 dataset.

# 2. Preparing the Data

In [3]:
import torchtext
from torchtext import data

In the last tutorial we tokenized on spaces. This time, we'll use a slightly more sophisticated tokenizer: the spacy tokenizer.

[Spacy](https://spacy.io/) is a framework that handles many natural language processing tasks, and torchtext is designed to work closely with it.

Using the tokenizer is easy with torchtext: all we have to do is pass in the tokenizer function!

In [4]:
import spacy

from spacy.symbols import ORTH
my_tok = spacy.load('en')
my_tok.tokenizer.add_special_case('<eos>', [{ORTH: '<eos>'}])
my_tok.tokenizer.add_special_case('<bos>', [{ORTH: '<bos>'}])
my_tok.tokenizer.add_special_case('<unk>', [{ORTH: '<unk>'}])
def spacy_tok(x):
    return [tok.text for tok in my_tok.tokenizer(x)]

`add_special_case` simply tells the tokenizer to parse a certain string in a certain way. The list after the special case string represents how we want the string to be tokenized. 

If we wanted to tokenize "don't" into "do" and "'nt", then we would write

`my_tok.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])`

We need to initialize the text field by ourselves.

In [5]:
TEXT = data.Field(lower=True, tokenize=spacy_tok)

Now we'll load the built-in datasets.
There are two effective ways of using these datasets: one is loading as a Dataset split into the train, validation, and test sets, and the other is loading as an Iterator. The dataset offers more flexibility, so we'll use that approach here.

There is currently one built-in dataset for language modeling: the WikiText2 dataset. (I've sent a pull request for the also commonly used and slightly smaller dataset called the Penn Treebank dataset. If you install the version on [my fork](https://github.com/keitakurita/text@penn_treebank), you can use it in place and have the code run faster!)

In [6]:
from torchtext.datasets import WikiText2

In [7]:
train, valid, test = WikiText2.splits(TEXT) # loading custom datasets requires passing in the field, but nothing else.

Let's take a quick look inside. Remember, datasets behave largely like normal lists, so we can measure the length using the `len` function.

In [8]:
len(train)

1

Only one training example?! Did we do something wrong?

Turns out not. It's just that the entire corpus of the dataset is contained within a single example. We'll see how this example gets batched and processed later.

Now that we have our data, let's build the vocabulary. This time, let's try using precomputed word embeddings.

We'll use GloVe vectors with 200 dimensions this time. There are various other precomputed word embeddings in torchtext (including GloVe vectors with 100 and 300 dimensions) as well which can be loaded in mostly the same way.

In [9]:
TEXT.build_vocab(train, vectors="glove.6B.200d")

Now we can build our iterator. This is the climax of this tutorial!
It turns out that torchtext has a very handy iterator that does most of the heavy lifting for us. It's called the `BPTTIterator`.
The `BPTTIterator` does the following for us:
- Divide the corpus into batches of sequence length `bptt`

For instance, suppose we have the following corpus: 

*"Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed."*

Though this sentence is short, the actual corpus is thousands of words long, so we can't possibly feed it in all at once. We'll want to divide the corpus into sequences of a shorter length. In the above example, if we wanted to divide the corpus into batches of sequence length 5, we would get the following sequences:

["*Machine*", "*learning*", "*is*", "*a*", "*field*"],

["*of*", "*computer*", "*science*", "*that*", "*gives*"],

["*computers*", "*the*", "*ability*", "*to*", "*learn*"],

["*without*", "*being*", "*explicitly*", "*programmed*", EOS]


- Generate batches that are the input sequences offset by one

In language modeling, the supervision data is the next word in a sequence of words. We, therefore, want to generate the sequences that are the input sequences offset by one. In the above example, we would get the following sequence that we train the model to predict:

["*learning*", "*is*", "*a*", "*field*", "*of*"],

["*computer*", "*science*", "*that*", "*gives*", "*computers*"],

["*the*", "*ability*", "*to*", "*learn*", "*without*"],

["*being*", "*explicitly*", "*programmed*", EOS, EOS]

In [10]:
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test),
    batch_size=BATCH_SIZE,
    bptt_len=30, # this is where we specify the sequence length
    device=(0 if USE_GPU else -1),
    repeat=False)

As always, it's a good idea to take a look into what is actually happening behind the scenes

In [11]:
b = next(iter(train_iter))

In [12]:
vars(b).keys()

dict_keys(['batch_size', 'dataset', 'train', 'text', 'target'])

We never specified a target field, so it must have been automatically generated. Hopefully, it's the original text offset by one. Let's see...

In [13]:
b.text[:, :3]

Variable containing:
     9    953      0
    10    324   5909
     9     11  20014
    12   5906     27
  3872  10434      2
  3892      3  10780
   886     11   3273
    12   9357      0
    10   8826  23499
     9   1228      4
    10      7    569
     9      2    235
 20059   2592   5909
    90      3     20
  3872    141      2
    95      8   1450
    49   6794    369
     0   9046      5
  3892   1497      2
    24     13   2168
   786      4    488
    49     26   5967
 28867     25    656
     3  18430     14
  6213     58     48
     4   4886   4364
  3872    217      4
     5      5     22
     2      2   1936
  5050    593     59
[torch.cuda.LongTensor of size 30x3 (GPU 0)]

In [14]:
b.target[:, :3]

Variable containing:
    10    324   5909
     9     11  20014
    12   5906     27
  3872  10434      2
  3892      3  10780
   886     11   3273
    12   9357      0
    10   8826  23499
     9   1228      4
    10      7    569
     9      2    235
 20059   2592   5909
    90      3     20
  3872    141      2
    95      8   1450
    49   6794    369
     0   9046      5
  3892   1497      2
    24     13   2168
   786      4    488
    49     26   5967
 28867     25    656
     3  18430     14
  6213     58     48
     4   4886   4364
  3872    217      4
     5      5     22
     2      2   1936
  5050    593     59
    95      7     14
[torch.cuda.LongTensor of size 30x3 (GPU 0)]

Be careful, the first dimension of the text and target is the sequence, and the next is the batch.
We see that the target is indeed the original text offset by 1 (shifted downwards by 1). Which means we have all the we need to start training a language model!

# 3. Training the Language Model

With the above iterators, training the language model is easy. 

First, we need to prepare the model. We'll be borrowing and customizing the model from the [examples](https://github.com/pytorch/examples/tree/master/word_language_model) in pytorch.

In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable as V

In [16]:
class RNNModel(nn.Module):
    def __init__(self, ntoken, ninp,
                 nhid, nlayers, bsz,
                 dropout=0.5, tie_weights=True):
        super(RNNModel, self).__init__()
        self.nhid, self.nlayers, self.bsz = nhid, nlayers, bsz
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)
        self.init_weights()
        self.hidden = self.init_hidden(bsz) # the input is a batched consecutive corpus
                                            # therefore, we retain the hidden state across batches

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input):
        emb = self.drop(self.encoder(input))
        output, self.hidden = self.rnn(emb, self.hidden)
        output = self.drop(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1))

    def init_hidden(self, bsz):
        weight = next(self.parameters()).data
        return (V(weight.new(self.nlayers, bsz, self.nhid).zero_().cuda()),
                V(weight.new(self.nlayers, bsz, self.nhid).zero_()).cuda())
    
    def reset_history(self):
        """Wraps hidden states in new Variables, to detach them from their history."""
        self.hidden = tuple(V(v.data) for v in self.hidden)

We need to explicitly pass the initial weights of the embedding matrix that are initialize with the GloVe vectors

In [17]:
weight_matrix = TEXT.vocab.vectors

In [18]:
model = RNNModel(weight_matrix.size(0),
                 weight_matrix.size(1), 200, 1, BATCH_SIZE)

In [19]:
model.encoder.weight.data.copy_(weight_matrix);

In [20]:
if USE_GPU:
    model.cuda()

Now we can begin training the language model. We'll use the Adam optimizer here.

For the loss, we'll use the `nn.CrossEntropyLoss` function. This loss takes the index of the correct class as the ground truth instead of a one-hot vector. Unfortunately, it only takes tensors of dimension 2 or 4, so we'll need to do a bit of reshaping.

In [21]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.7, 0.99))

In [22]:
n_epochs = 2

In [23]:
n_tokens = weight_matrix.size(0)

In [24]:
from tqdm import tqdm

Now we can start the training loop.

In [25]:
def train_epoch(epoch):
    """One epoch of a training loop"""
    epoch_loss = 0
    for batch in tqdm(train_iter):
        # reset the hidden state or else the model will try to backpropagate to the
        # beginning of the dataset, requiring lots of time and a lot of memory
        model.reset_history()
        
        optimizer.zero_grad()
        
        text, targets = batch.text, batch.target
        prediction = model(text)
        # pytorch currently only supports cross entropy loss for inputs of 2 or 4 dimensions.
        # we therefore flatten the predictions out across the batch axis so that it becomes
        # shape (batch_size * sequence_length, n_tokens)
        # in accordance to this, we reshape the targets to be
        # shape (batch_size * sequence_length)
        loss = criterion(prediction.view(-1, n_tokens), targets.view(-1))
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.data[0] * prediction.size(0) * prediction.size(1)

    epoch_loss /= len(train.examples[0].text)

    # monitor the loss
    val_loss = 0
    model.eval()
    for batch in valid_iter:
        model.reset_history()
        text, targets = batch.text, batch.target
        prediction = model(text)
        loss = criterion(prediction.view(-1, n_tokens), targets.view(-1))
        val_loss += loss.data[0] * text.size(0)
    val_loss /= len(valid.examples[0].text)
    
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

In [26]:
for epoch in range(1, n_epochs + 1):
    train_epoch(epoch)

100%|██████████| 2217/2217 [01:59<00:00, 18.59it/s]
  0%|          | 0/2217 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 6.2056, Validation Loss: 0.1711


100%|██████████| 2217/2217 [01:59<00:00, 18.56it/s]


Epoch: 2, Training Loss: 5.2659, Validation Loss: 0.1599


Let's examine the output at 2 epochs

In [27]:
b = next(iter(valid_iter))

In [28]:
def word_ids_to_sentence(id_tensor, vocab, join=None):
    """Converts a sequence of word ids to a sentence"""
    if isinstance(id_tensor, torch.LongTensor):
        ids = id_tensor.transpose(0, 1).contiguous().view(-1)
    elif isinstance(id_tensor, np.ndarray):
        ids = id_tensor.transpose().reshape(-1)

    batch = [vocab.itos[ind] for ind in ids]  # denumericalize
    if join is None:
        return batch
    else:
        return join.join(batch)

In [29]:
word_ids_to_sentence(b.text.cpu().data, TEXT.vocab, join=' ')[:210]

'  <eos>   = homarus gammarus = <eos>   <eos>   homarus gammarus , known as the european lobster or common lobster , is a species of <unk> lobster from . <unk> ceo hiroshi <unk> referred to <unk> as one of his f'

In [30]:
arrs = model(b.text).cpu().data.numpy()

In [31]:
word_ids_to_sentence(np.argmax(arrs, axis=2), TEXT.vocab, join=' ')[:210]

'<unk>   <eos> = = ( <eos>   <eos>   = = ( <unk> as the <unk> @-@ ( <unk> species , <unk> a <unk> of the <unk> ( the <eos> was <unk> <unk> <unk> to the the a of the first " , the , <eos>   <eos> reviewers were t'

Hmm.. doesn't seem to be making much sense yet.
Let's train for another 2 epochs and see how the results change

In [32]:
for epoch in range(n_epochs + 1, n_epochs * 2 + 1):
    train_epoch(epoch)

100%|██████████| 2217/2217 [01:59<00:00, 18.56it/s]
  0%|          | 0/2217 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 4.9020, Validation Loss: 0.1568


100%|██████████| 2217/2217 [01:59<00:00, 18.61it/s]


Epoch: 4, Training Loss: 4.6959, Validation Loss: 0.1549


In [33]:
arrs = model(b.text).cpu().data.numpy()
word_ids_to_sentence(np.argmax(arrs, axis=2), TEXT.vocab, join=' ')[:210]

'<unk>   <eos> = = ( <eos>   <eos>   <eos> ( ( is as the <unk> union <unk> <unk> starling <unk> <unk> the <unk> of the <unk> , the <eos> , <unk> <unk> , to the the a of the " " , the , <eos>   <eos> reviewers ha'

Is this getting better? The loss is certainly getting better.
This just goes to show how difficult it is to match a loss value with the quality of the predictions in language modeling.