# 08 - Neural Language Models
Prepared by Jan Christian Blaise Cruz

DLSU Machine Learning Group

In this notebook, we'll learn how to use RNNs for one of their most common use cases: language modeling in NLP. We'll start by processing our data then move on to an RNN walkthrough. At the end of the notebook, we'll implement ideas from a handful of papers, along with some de facto standard practices.

# Preliminaries

In [None]:
!nvidia-smi

Tue Aug 25 06:58:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

For this notebook, we'll use the WikiText-2 language modeling dataset (Merity et al., 2016). We first download the files.

In [None]:
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
!unzip wikitext-2-v1.zip && rm wikitext-2-v1.zip

--2020-08-25 07:00:14--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.96.134
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.96.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4475746 (4.3M) [application/zip]
Saving to: ‘wikitext-2-v1.zip’


2020-08-25 07:00:15 (7.96 MB/s) - ‘wikitext-2-v1.zip’ saved [4475746/4475746]

Archive:  wikitext-2-v1.zip
   creating: wikitext-2/
  inflating: wikitext-2/wiki.test.tokens  
  inflating: wikitext-2/wiki.valid.tokens  
  inflating: wikitext-2/wiki.train.tokens  


Then we'll import some preliminary packages and set the random seeds for reproducibility.

In [None]:
import torch
import numpy as np
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
np.random.seed(42)
torch.manual_seed(42);

# Data Preprocessing

We load the dataset as follows. WikiText-2 comes already tokenized, so little preprocessing is needed: we skip blank lines and section-heading lines (those starting with ```=```), and we mark the end of each remaining line with an ```<eos>``` token in place of the newline character.

In [None]:
train = []
with open('wikitext-2/wiki.train.tokens', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0 and not line.startswith('='):
            train.extend(line.split() + ['<eos>'])

valid = []
with open('wikitext-2/wiki.valid.tokens', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0 and not line.startswith('='):
            valid.extend(line.split() + ['<eos>'])

Let's see what the first ten tokens look like.

In [None]:
print(train[:10])

['Senjō', 'no', 'Valkyria', '3', ':', '<unk>', 'Chronicles', '(', 'Japanese', ':']


We then construct our vocabularies.

In [None]:
idx2word = ['<unk>', '<pad>', '<eos>']
for line in train:
    idx2word.append(line)

vocab_set = set(idx2word)
idx2word = list(vocab_set)
word2idx = {idx2word[i]: i for i in range(len(idx2word))}
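One caveat with building the index from a ```set```: Python randomizes string hashing by default, so the order of ```list(vocab_set)``` may differ between interpreter sessions. That matters whenever indexes have to line up with saved weights. Below is a minimal sketch of a deterministic alternative, assuming a toy token list; ```build_vocab``` is a hypothetical helper, not part of this notebook, and the outputs shown in the rest of the notebook come from the set-based version above.

```python
def build_vocab(tokens, specials=('<unk>', '<pad>', '<eos>')):
    # Hypothetical helper: special tokens first, then the rest in sorted order,
    # so the same corpus always yields the same indexes.
    rest = sorted(set(tokens) - set(specials))
    idx2word = list(specials) + rest
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return idx2word, word2idx

toy_tokens = ['the', 'cat', 'sat', '<eos>', 'the', 'dog', '<eos>']
toy_idx2word, toy_word2idx = build_vocab(toy_tokens)
print(toy_idx2word)
```

Sorting costs a little time up front but makes the mapping reproducible across runs.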

In [None]:
print(idx2word[42])
print(word2idx['residual'])

residual
42


In [None]:
valid = [token if token in vocab_set else '<unk>' for token in valid]

And convert each token into its corresponding index. We can turn the lists into Tensors afterwards.

In [None]:
X_train = [word2idx[word] for word in train]
X_valid = [word2idx[word] for word in valid]

X_train = torch.LongTensor(X_train)
X_valid = torch.LongTensor(X_valid)

print(X_train.shape, X_valid.shape)

torch.Size([2024702]) torch.Size([211179])


We can see that our training set has about 2 million contiguous tokens (this is how WikiText-2 gets its name).

Our next order of business is to figure out how to batch our data. We want to set a batch size (the number of sequences the model processes in parallel), then work out how to divide the dataset evenly.

In [None]:
def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data
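To make the reshaping concrete, here is a small sketch that runs the same function on a toy tensor. The toy data (the integers 0 to 9) and the names ```toy``` and ```toy_batched``` are made up for illustration.

```python
import torch

def batchify(data, bsz):
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    return data.view(bsz, -1).t().contiguous()

toy = torch.arange(10)          # "tokens" 0..9
toy_batched = batchify(toy, 3)  # 10 // 3 = 3 tokens per sequence; token 9 is trimmed
print(toy_batched)
# Each column is one contiguous sequence: 0,1,2 / 3,4,5 / 6,7,8
```

Reading down a column gives consecutive tokens, while reading across a row gives the tokens that the sequences see at the same time step.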

It's easier to see how that function works in action. Let's split our data into 40 parallel sequences by using a batch size of 40.

In [None]:
bs = 40

X_train = batchify(X_train, bs)
X_valid = batchify(X_valid, bs)

Let's see what that looks like.

In [None]:
X_train

tensor([[26728, 15475, 23915,  ...,  8914, 16368,  3832],
        [ 6939, 13704,  2565,  ...,   902, 27749, 26855],
        [ 7632, 22425, 17206,  ..., 24026, 22356, 17206],
        ...,
        [ 2846, 26557,  9952,  ..., 17206,  7066, 22392],
        [  550, 17940, 24426,  ..., 26557,  5745, 17206],
        [17206, 26407, 17206,  ...,  5627,  8914, 18571]])

In [None]:
idx2word[X_train[0][0]], idx2word[X_train[1][0]], idx2word[X_train[2][0]]

('Senjō', 'no', 'Valkyria')

And the shape.

In [None]:
X_train.shape

torch.Size([50617, 40])

Each column is one contiguous sequence: the indexes on the top row are the first tokens of their respective sequences, with the token that follows each one in the row below it.

We currently have 40 sequences of about 50 thousand tokens each. We can't feed this many tokens into our model at once, since backpropagating through that many time steps would be prohibitively expensive. We'll further divide our data into chunks of a specific "bptt" length or sequence length.

*Note: BPTT length means "backpropagation through time length" which is the number of steps the model needs to backpropagate through to process the sequence. In modern literature we usually just use the term "maximum sequence length" or MSL.*

In [None]:
def get_batch(source, i, bptt):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

Let's see how it works.

In [None]:
bptt = 35

x, y = get_batch(X_train, 0, bptt)

The shapes will be as we expected.

In [None]:
x.shape, y.shape

(torch.Size([35, 40]), torch.Size([1400]))

This is our training tensor.

In [None]:
x

tensor([[26728, 15475, 23915,  ...,  8914, 16368,  3832],
        [ 6939, 13704,  2565,  ...,   902, 27749, 26855],
        [ 7632, 22425, 17206,  ..., 24026, 22356, 17206],
        ...,
        [28392, 24026, 26557,  ..., 24026, 19439, 19817],
        [23288, 17861,   294,  ..., 28392, 15095, 16316],
        [18642, 13175, 28005,  ..., 11057,  9952, 11468]])

And our target tensor.

In [None]:
y

tensor([ 6939, 13704,  2565,  ...,  8914,  7498,  5005])

Our target tensor is just our training tensor shifted down by one row and then flattened: it contains rows 1 onward of the training tensor, plus one extra row holding the targets for the training tensor's last row.

In [None]:
y.view(x.shape)

tensor([[ 6939, 13704,  2565,  ...,   902, 27749, 26855],
        [ 7632, 22425, 17206,  ..., 24026, 22356, 17206],
        [14163, 28005, 20790,  ...,   902,   392, 14230],
        ...,
        [23288, 17861,   294,  ..., 28392, 15095, 16316],
        [18642, 13175, 28005,  ..., 11057,  9952, 11468],
        [10854, 23839,   447,  ...,  8914,  7498,  5005]])
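The one-step shift between inputs and targets is easier to verify on a toy tensor. This is a minimal sketch: the 4 x 3 tensor below is invented so that each column counts upward, which makes the "next token" of any entry simply that entry plus one.

```python
import torch

def get_batch(source, i, bptt):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

# 3 sequences (columns) of 4 steps each: 0..3, 4..7, 8..11
toy_source = torch.arange(12).view(3, 4).t().contiguous()
toy_x, toy_y = get_batch(toy_source, 0, 2)

print(toy_x)                      # rows 0 and 1 of toy_source
print(toy_y.view(toy_x.shape))    # rows 1 and 2: every entry is its input + 1
```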

We can iterate from 0 to the length of the full sequences but it's cleaner to simply put everything into a list we can iterate. We'll call these our "dataloaders."

In [None]:
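# Stop at size - 1: the final row has no next token to serve as a target,
# so starting a batch there would produce an empty batch.
train_loader = []
for i in range(0, X_train.size(0) - 1, bptt):
    train_loader.append(get_batch(X_train, i, bptt))

valid_loader = []
for i in range(0, X_valid.size(0) - 1, bptt):
    valid_loader.append(get_batch(X_valid, i, bptt))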

We can check the number of training batches we have.

In [None]:
len(train_loader), len(valid_loader)

(1447, 151)

And check the sizes of the batch.

In [None]:
x, y = train_loader[0]
print(x.shape, y.shape)

x, y = train_loader[-1]
print(x.shape, y.shape)

torch.Size([35, 40]) torch.Size([1400])
torch.Size([6, 40]) torch.Size([240])


# RNN Basics

Let's take a batch so we can see how forward propagation works.

In [None]:
x, y = train_loader[0]

We import our neural networks package from PyTorch.

In [None]:
import torch.nn as nn

Our data isn't immediately usable to our model. We have to produce features from our tokens. One way we can do this is by using Embeddings. We assign a vector representation to each token in our training set, which will be trained alongside the neural network.

Pretrained embeddings (like GloVe) can be used here to inject more information to the model, but we'll make do with untrained embeddings for now.

In [None]:
embedding = nn.Embedding(len(vocab_set), 100)

Modules in PyTorch (subclasses of the ```nn.Module``` class) can be called like functions. To embed our data, we call:

In [None]:
out = embedding(x)

The resulting tensor ```out``` now carries in its internal history all the operations that produced it. This history tracking is how PyTorch does automatic differentiation when we do backprop. More on that later.

We can check the resulting size.

In [None]:
out.shape

torch.Size([35, 40, 100])

Each of the 35 × 40 tokens in our tensor is now represented by a 100-dimensional vector. This results in a three-dimensional tensor.

In [None]:
out[0].shape

torch.Size([40, 100])
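Under the hood, an embedding layer is a lookup table: each token index selects one row of the layer's weight matrix. The sketch below checks this on a tiny, made-up vocabulary of 5 tokens with 3-dimensional vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(5, 3)             # 5 tokens, 3 dimensions each
toy_ids = torch.LongTensor([[0, 2], [4, 2]])   # shape (2, 2)

toy_out = toy_embedding(toy_ids)
print(toy_out.shape)                           # one 3-d vector per index

# Calling the layer gives the same result as indexing its weight matrix.
same = torch.equal(toy_out, toy_embedding.weight[toy_ids])
print(same)
```

Because token 2 appears twice in ```toy_ids```, both positions receive the identical vector.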

Instantiating an RNN is likewise straightforward. Here we make an LSTM, passing in the embedding dimension and our chosen hidden dimension for the LSTM's hidden weight matrices.

In [None]:
rnn = nn.LSTM(100, 128)

In PyTorch, we can explicitly pass the starting hidden and cell state tensors (if we omit them, they default to zeros). They follow the shape (1, batch size, hidden dimensions).

*Note: The first dimension of the hidden and cell states can be larger if the LSTM has more recurrent layers or is bidirectional. For now, since we're using a basic LSTM, we just specify 1. More on this in the future.*

In [None]:
hidden, cell = torch.zeros(1, 40, 128), torch.zeros(1, 40, 128)

We get new hidden and cell states plus our output by calling our LSTM.

In [None]:
out, (hidden, cell) = rnn(out, (hidden, cell))

We can check the shapes.

In [None]:
out.shape, hidden.shape, cell.shape

(torch.Size([35, 40, 128]), torch.Size([1, 40, 128]), torch.Size([1, 40, 128]))
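As the note above mentions, the first dimension of the states grows with more layers or with a bidirectional LSTM. Here is a small sketch, assuming a 2-layer bidirectional LSTM with the same input and hidden sizes; the names ```stacked``` and ```toy_in``` are illustrative only.

```python
import torch
import torch.nn as nn

toy_in = torch.zeros(35, 40, 100)   # (sequence length, batch size, embedding dim)
stacked = nn.LSTM(100, 128, num_layers=2, bidirectional=True)

# No initial states are passed here, so they default to zeros.
toy_out, (toy_h, toy_c) = stacked(toy_in)

print(toy_out.shape)   # last dimension is 2 * 128: forward and backward outputs concatenated
print(toy_h.shape)     # first dimension is num_layers * num_directions = 4
```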

After passing our data to the LSTM, we need to pass it through a linear transform to get a score for every token in our vocabulary.

We make a linear layer, passing the hidden dimension it is expecting, and the output dimension it will result in.

In [None]:
fc1 = nn.Linear(128, len(vocab_set))

Passing our current output is likewise easy.

In [None]:
out = fc1(out)

We can then check our shape.

For each of our 40 sequences at every step (35 total steps), we have a score (logit) for each of the 33,232 tokens in our vocabulary; the highest score corresponds to the predicted next token.

In [None]:
out.shape

torch.Size([35, 40, 33232])
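The linear layer outputs raw scores, not probabilities. A softmax over the last dimension turns them into a proper distribution, and the argmax is the same either way. The sketch below uses a made-up 3-token vocabulary to show this.

```python
import torch
import torch.nn.functional as F

# Two positions, three vocabulary entries each (toy numbers).
toy_logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.0, 0.0, 3.0]])

toy_probs = F.softmax(toy_logits, dim=-1)   # each row now sums to 1
toy_pred = toy_logits.argmax(dim=-1)        # index of the highest score per row

print(toy_probs)
print(toy_pred)
```

We don't apply the softmax ourselves during training because ```nn.CrossEntropyLoss``` works directly on the raw scores.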

We can check the loss by instantiating a loss function.

In [None]:
criterion = nn.CrossEntropyLoss()

Let's check our target tensor again.

In [None]:
y.shape

torch.Size([1400])

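```nn.CrossEntropyLoss``` expects the class scores on the second dimension, i.e. logits of shape (N, C) paired with targets of shape (N). Our logits have the shape (sequence length, batch size, vocabulary size), so we flatten the first and second dimensions like so: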

In [None]:
out.view(-1, len(vocab_set)).shape

torch.Size([1400, 33232])

We calculate the loss.

In [None]:
loss = criterion(out.view(-1, len(vocab_set)), y)

And display the results.

In [None]:
loss

tensor(10.4169, grad_fn=<NllLossBackward>)
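A quick sanity check on this number: an untrained model should do about as well as guessing uniformly, and the cross-entropy of a uniform guess over V tokens is ln(V). The sketch below compares that value against the loss PyTorch computes for all-zero logits (which correspond to a uniform distribution); the random targets are placeholders.

```python
import math
import torch
import torch.nn as nn

vocab_size = 33232
uniform_loss = math.log(vocab_size)
print(uniform_loss)   # roughly 10.41, close to the loss we just observed

# All-zero logits give every token the same probability.
toy_logits = torch.zeros(8, vocab_size)
toy_targets = torch.randint(0, vocab_size, (8,))
toy_loss = nn.CrossEntropyLoss()(toy_logits, toy_targets).item()
print(toy_loss)
```

If the initial loss were far from ln(V), that would usually point to a bug in the data pipeline or the model's initialization.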

# Putting it all together

Let's construct a simple training loop.

First we import the optimizers from PyTorch.

In [None]:
import torch.optim as optim

We implement a function that detaches a tensor from its history. Remember that any tensor produced by an operation keeps track of all the operations that led to it.

Our hidden and cell states will be reused per batch to carry information from the previous batch's timesteps to the current one, but we only want to backpropagate through the steps in our current batch. If we don't detach them from history, PyTorch will backpropagate our hidden and cell states *all the way to the start of the sequence* and we don't want that.

In [None]:
def repackage_hidden(h):
    if isinstance(h, torch.Tensor): 
        return h.detach()
    else: 
        return tuple(repackage_hidden(v) for v in h)
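To see what detaching does, the sketch below builds two small tensors that have a history, then repackages them. The tensors ```w```, ```h``` and ```c``` are toy stand-ins for real hidden and cell states.

```python
import torch

def repackage_hidden(h):
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

w = torch.ones(3, requires_grad=True)
h = w * 2    # has a grad_fn: it remembers the multiplication
c = w + 1    # has a grad_fn: it remembers the addition

h_det, c_det = repackage_hidden((h, c))

print(h.grad_fn is not None)   # the original still has its history
print(h_det.grad_fn)           # the detached copy has none
print(torch.equal(h_det, h))   # the values are unchanged
```

The values carry over to the next batch, but gradients can no longer flow back through them.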

We can create our model by subclassing the ```nn.Module``` class, overriding the constructor and the ```forward()``` function.

In [None]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_sz, emb_dim, hidden_dim):
        super(LSTMLanguageModel, self).__init__()
        self.embedding = nn.Embedding(vocab_sz, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim)
        self.fc1 = nn.Linear(hidden_dim, vocab_sz)
        self.hidden, self.cell = None, None

    # Initializes blank hidden and cell states
    # creates tensors in the same device as the model's parameters
    def init_hidden(self, bs):
        weight = next(self.parameters())
        hidden_dim = self.rnn.hidden_size

        h = weight.new_zeros(1, bs, hidden_dim)
        c = weight.new_zeros(1, bs, hidden_dim)
        return h, c

    # We want to reset the hidden states at the start of every epoch
    def reset_hidden(self):
        self.hidden, self.cell = None, None
    
    def forward(self, x):
        bptt, bs = x.shape

        # We initialize hidden states if we have none
        # Otherwise, we detach the current ones from history
        if self.hidden is None and self.cell is None:
            self.hidden, self.cell = self.init_hidden(bs)
        else:
            self.hidden = repackage_hidden(self.hidden)
            self.cell = repackage_hidden(self.cell)

        out = self.embedding(x)
        out, (self.hidden, self.cell) = self.rnn(out, (self.hidden, self.cell))
        out = self.fc1(out)

        return out

Let's instantiate a model, a loss function, and an optimizer.

In [None]:
model = LSTMLanguageModel(vocab_sz=len(vocab_set), emb_dim=100, hidden_dim=128)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=20)

We can check the behavior of the model per batch. Let's verify if it learns.

In [None]:
x, y = train_loader[0]

Same thing, except now we only call the model as a whole.

In [None]:
out = model(x)
print(out.shape)
loss = criterion(out.view(-1, len(vocab_set)), y)
print(loss)

torch.Size([35, 40, 33232])
tensor(10.4165, grad_fn=<NllLossBackward>)


We first zero out the gradients because PyTorch accumulates them across ```backward()``` calls. We then backpropagate from the loss (using its operation history). Finally, we let the optimizer perform one gradient descent step on each of the parameters in our model.

In [None]:
optimizer.zero_grad()
loss.backward()
optimizer.step()
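For plain SGD (no momentum or weight decay), the step is just ```p = p - lr * p.grad``` for every parameter. Here is a minimal sketch on a single toy parameter with a made-up quadratic loss, so the numbers can be checked by hand.

```python
import torch
import torch.optim as optim

p = torch.tensor([1.0, -2.0], requires_grad=True)
opt = optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()
opt.zero_grad()
loss.backward()                            # p.grad = 2 * p = [2, -4]

expected = p.detach() - 0.1 * p.grad       # manual update: [0.8, -1.6]
opt.step()                                 # the optimizer applies the same update in place

print(p.detach(), expected)
```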

Let's feed in the second batch.

In [None]:
x, y = train_loader[1]

Same thing.

In [None]:
out = model(x)
print(out.shape)
loss = criterion(out.view(-1, len(vocab_set)), y)
print(loss)

torch.Size([35, 40, 33232])
tensor(10.1407, grad_fn=<NllLossBackward>)


Notice that the loss has gone down.

In [None]:
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Training

Let's train the model for five epochs to see how it works.

In [None]:
model = LSTMLanguageModel(vocab_sz=len(vocab_set), emb_dim=100, hidden_dim=128).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=20)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("The model has {:,} trainable parameters".format(count_parameters(model)))

The model has 7,727,888 trainable parameters


A basic training loop in PyTorch looks like the following, iterating over training and validation set batches. We set ```torch.no_grad()``` on the validation code as we don't need to backpropagate there. 

In [None]:
epochs = 5

for e in range(1, epochs + 1):
    model.train()
    model.reset_hidden()
    train_loss = 0
    
    for batch in tqdm(train_loader):
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        out = model(x)
        loss = criterion(out.view(-1, len(vocab_set)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    train_loss /= len(train_loader)

    model.eval()
    model.reset_hidden()
    valid_loss = 0
    
    with torch.no_grad():
        for batch in tqdm(valid_loader):
            x, y = batch
            x = x.to(device)
            y = y.to(device)

            out = model(x)
            loss = criterion(out.view(-1, len(vocab_set)), y)

            valid_loss += loss.item()
    valid_loss /= len(valid_loader)

    print("\nEpoch {:3} | Train Loss {:.4f} | Train Ppl {:.4f} | Valid Loss {:.4f} | Valid Ppl {:.4f}".format(e, train_loss, np.exp(train_loss), valid_loss, np.exp(valid_loss)))

100%|██████████| 1447/1447 [00:19<00:00, 75.77it/s]
100%|██████████| 151/151 [00:00<00:00, 203.36it/s]

Epoch   1 | Train Loss 6.9435 | Train Ppl 1036.3765 | Valid Loss 6.5669 | Valid Ppl 711.1827

100%|██████████| 1447/1447 [00:19<00:00, 75.63it/s]
100%|██████████| 151/151 [00:00<00:00, 203.15it/s]

Epoch   2 | Train Loss 6.2582 | Train Ppl 522.2590 | Valid Loss 6.2196 | Valid Ppl 502.5220

100%|██████████| 1447/1447 [00:19<00:00, 75.75it/s]
100%|██████████| 151/151 [00:00<00:00, 202.51it/s]

Epoch   3 | Train Loss 6.0385 | Train Ppl 419.2703 | Valid Loss 6.0870 | Valid Ppl 440.0840

100%|██████████| 1447/1447 [00:19<00:00, 75.70it/s]
100%|██████████| 151/151 [00:00<00:00, 204.19it/s]

Epoch   4 | Train Loss 5.9082 | Train Ppl 368.0377 | Valid Loss 5.9948 | Valid Ppl 401.3236

100%|██████████| 1447/1447 [00:19<00:00, 75.74it/s]
100%|██████████| 151/151 [00:00<00:00, 202.77it/s]

Epoch   5 | Train Loss 5.8134 | Train Ppl 334.7700 | Valid Loss 5.9822 | Valid Ppl 396.2922
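The perplexity reported in the log is the exponential of the average cross-entropy loss. The sketch below recomputes it from the rounded epoch 5 validation loss, and also shows that a uniform guess over the vocabulary has a perplexity equal to the vocabulary size. The variable names are illustrative.

```python
import math

valid_loss = 5.9822                  # epoch 5 validation loss, as printed (rounded)
valid_ppl = math.exp(valid_loss)
print(valid_ppl)                     # close to the 396.29 reported above

# Baseline: uniform guessing has loss ln(V), hence perplexity V.
vocab_size = 33232
uniform_ppl = math.exp(math.log(vocab_size))
print(uniform_ppl)
```

One way to read the result: after five epochs the model is, on average, about as uncertain as if it were choosing among roughly 400 equally likely tokens, down from about 33 thousand.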





Then we can generate a sequence from a starting word to see how the language model performs. Not bad for our first try!

In [None]:
nwords = 30
temp = 1.0

# Pick starting word
word = 'this'
ix = word2idx[word if word in word2idx else '<unk>']
inp = torch.LongTensor([ix]).unsqueeze(0).to(device)

# Generate
print(word, end=' ')
model.reset_hidden()
with torch.no_grad():
    for i in range(nwords):
        output = model(inp)
        word_weights = output.squeeze().div(temp).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        inp.fill_(word_idx)

        word = idx2word[word_idx]
        print(word, end=' ')

this very could be attended , the Bishop of the kakapo , Uttar crimes on aggregate by an average and division with the closure in the rest of the Union , 
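The ```temp``` value in the generation code is a sampling temperature: the logits are divided by it before exponentiating. Here is a small sketch of its effect on made-up logits for a 3-token vocabulary; ```sampling_probs``` is a hypothetical helper that normalizes the weights so we can inspect them.

```python
import torch

toy_logits = torch.tensor([2.0, 1.0, 0.0])

def sampling_probs(logits, temp):
    # Same transform as in the generation loop, followed by normalization.
    weights = logits.div(temp).exp()
    return weights / weights.sum()

sharp = sampling_probs(toy_logits, 0.5)   # low temperature: mass concentrates on the top token
flat = sampling_probs(toy_logits, 2.0)    # high temperature: mass spreads out

print(sharp)
print(flat)

# torch.multinomial accepts unnormalized weights, so the loop can skip the division.
torch.manual_seed(0)
sample = torch.multinomial(toy_logits.div(0.5).exp(), 1)[0]
print(sample)
```

Lower temperatures give safer, more repetitive text, while higher temperatures give more varied but less coherent text.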

# Better Language Models

In this section, we'll improve on our basic setup by adding in standard practices during training, adding some regularization via dropout, and adding in weight tying (Press & Wolf, 2016; Inan et al., 2016). We'll also add in some initialization, which we will see is important for getting better solutions.

Weight tying ties the parameters of the embedding layer with the weights of the projection layer. This improves performance as well as reduces the number of parameters we have to train.

In [None]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_sz, emb_dim, hidden_dim, dropout=0.5, tie_weights=True, initrange=0.1):
        super(LSTMLanguageModel, self).__init__()
        self.embedding = nn.Embedding(vocab_sz, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim)
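        self.fc1 = nn.Linear(hidden_dim, vocab_sz)
        self.dropout = nn.Dropout(dropout)
        self.hidden, self.cell = None, None

        # Tying shares one (vocab_sz, emb_dim) matrix between the embedding and
        # the projection, so the LSTM output size must match the embedding size.
        if tie_weights:
            if emb_dim != hidden_dim:
                raise ValueError('tie_weights=True requires emb_dim == hidden_dim')
            self.fc1.weight = self.embedding.weight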

        self.init_weights(initrange)

    def init_hidden(self, bs):
        weight = next(self.parameters())
        hidden_dim = self.rnn.hidden_size

        h = weight.new_zeros(1, bs, hidden_dim)
        c = weight.new_zeros(1, bs, hidden_dim)
        return h, c

    def reset_hidden(self):
        self.hidden, self.cell = None, None

    # Initialize embedding and projection parameters to a uniform distribution
    def init_weights(self, initrange=0.1):
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc1.bias.data.zero_()
        self.fc1.weight.data.uniform_(-initrange, initrange)
    
    def forward(self, x):
        bptt, bs = x.shape

        if self.hidden is None and self.cell is None:
            self.hidden, self.cell = self.init_hidden(bs)
        else:
            self.hidden = repackage_hidden(self.hidden)
            self.cell = repackage_hidden(self.cell)

        out = self.embedding(x)
        out, (self.hidden, self.cell) = self.rnn(out, (self.hidden, self.cell))
        out = self.dropout(out)
        out = self.fc1(out)

        return out

We instantiate a training setup.

In [None]:
model = LSTMLanguageModel(vocab_sz=len(vocab_set), emb_dim=650, hidden_dim=650, tie_weights=True, dropout=0.5).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=30)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("The model has {:,} trainable parameters".format(count_parameters(model)))

The model has 25,019,232 trainable parameters
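To see why tying reduces the parameter count, the sketch below ties a tiny embedding and projection layer (toy sizes, wrapped in an ```nn.ModuleDict``` purely for counting) and then redoes the arithmetic for the model above by hand. The breakdown assumes the standard ```nn.LSTM``` parameter layout of two weight matrices and two bias vectors, each covering four gates.

```python
import torch.nn as nn

vocab_sz, dim = 10, 4
emb = nn.Embedding(vocab_sz, dim)
proj = nn.Linear(dim, vocab_sz)
proj.weight = emb.weight           # both layers now share one (vocab_sz, dim) Parameter

tied = nn.ModuleDict({'emb': emb, 'proj': proj})
n_params = sum(p.numel() for p in tied.parameters())   # shared tensor is counted once
print(n_params)                    # 10 * 4 shared weights + 10 biases

# Same arithmetic for vocab 33,232 and 650-d embedding/hidden sizes:
V, D = 33232, 650
shared = V * D                            # embedding matrix, reused by the projection
lstm = 4 * (D * D + D * D + D + D)        # input weights, hidden weights, two biases
bias = V                                  # projection bias
print(shared + lstm + bias)
```

Without tying, the projection would add another 33,232 × 650 weights on top of this.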


And train the model.

We add in gradient clipping to prevent exploding gradients, as well as learning rate annealing when the validation loss fails to improve.

*Note: We'll only train the model for 5 epochs to compare it with the earlier setup. The full model is trained for 40 epochs, for around an hour and a half on a Tesla K80 GPU. We'll load a copy of the fully trained weights later.*

In [None]:
epochs = 5
clip = 0.25
best_loss = np.inf

for e in range(1, epochs + 1):
    model.train()
    model.reset_hidden()
    train_loss = 0
    
    for batch in tqdm(train_loader):
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        out = model(x)
        loss = criterion(out.view(-1, len(vocab_set)), y)
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        train_loss += loss.item()
    train_loss /= len(train_loader)

    model.eval()
    model.reset_hidden()
    valid_loss = 0
    
    with torch.no_grad():
        for batch in tqdm(valid_loader):
            x, y = batch
            x = x.to(device)
            y = y.to(device)

            out = model(x)
            loss = criterion(out.view(-1, len(vocab_set)), y)

            valid_loss += loss.item()
    valid_loss /= len(valid_loader)

    # Anneal the learning rate when the validation loss fails to improve
    if valid_loss < best_loss:
        best_loss = valid_loss
    else:
        optimizer.param_groups[0]['lr'] /= 4.0

    print("\nEpoch {:3} | Train Loss {:.4f} | Train Ppl {:.4f} | Valid Loss {:.4f} | Valid Ppl {:.4f}".format(e, train_loss, np.exp(train_loss), valid_loss, np.exp(valid_loss)))
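Gradient clipping by norm rescales all gradients together whenever their combined norm exceeds the threshold, keeping their direction intact. Here is a minimal sketch on a single toy parameter whose gradient we set by hand, so the effect is easy to check.

```python
import torch
import torch.nn as nn

p = nn.Parameter(torch.tensor([3.0, 4.0]))
p.grad = torch.tensor([30.0, 40.0])    # hand-set gradient with norm 50

# Returns the total norm measured before clipping, and rescales p.grad in place.
total_norm = nn.utils.clip_grad_norm_([p], max_norm=0.25)
clipped_norm = p.grad.norm().item()

print(float(total_norm))   # norm before clipping
print(clipped_norm)        # approximately 0.25 after clipping
print(p.grad)              # same direction, much smaller magnitude
```

With a learning rate as large as the one used here, a single oversized gradient could otherwise throw the weights far off course.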

Save the model weights if we train it from scratch.

In [None]:
#with open('weight-tied-lstm-40e.pt', 'wb') as f:
#    torch.save(model.state_dict(), f)

I pretrained this model with our current setup so we can just download the weights.

In [None]:
!wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/pretrained-models/weight-tied-lstm-40e.pt

And load them.

In [None]:
with open('weight-tied-lstm-40e.pt', 'rb') as f:
    # map_location lets the checkpoint load on CPU-only machines as well
    model.load_state_dict(torch.load(f, map_location=device))

We can see how it fares on the validation set.

In [None]:
model.eval()
model.reset_hidden()
valid_loss = 0
    
with torch.no_grad():
    for batch in tqdm(valid_loader):
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        out = model(x)
        loss = criterion(out.view(-1, len(vocab_set)), y)

        valid_loss += loss.item()
valid_loss /= len(valid_loader)
print("\nValid Loss: {:.4f} | Valid Ppl: {:.4f}".format(valid_loss, np.exp(valid_loss)))

100%|██████████| 151/151 [00:02<00:00, 71.68it/s]


Valid Loss: 4.7992 | Valid Ppl: 121.4092





Let's try generating a sentence.

In [None]:
nwords = 100
temp = 0.5
torch.manual_seed(10)

# Pick starting word
word = 'Tomorrow'
ix = word2idx[word if word in word2idx else '<unk>'] 
inp = torch.LongTensor([ix]).unsqueeze(0).to(device)

# Generate
print(word, end=' ')
model.reset_hidden()
with torch.no_grad():
    for i in range(nwords):
        output = model(inp)
        word_weights = output.squeeze().div(temp).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        inp.fill_(word_idx)

        word = idx2word[word_idx]

        print(word, end=' ')

Tomorrow I 've been released in the United States . " The Dreamscape " is the sixth episode of the second season of the American science fiction television series The X @-@ Files . The episode was written by series creator Ryan Murphy , directed by <unk> <unk> , who had previously appeared in the episode , with the same name being directed by John <unk> . <eos> The episode received mostly positive reviews from television critics . The episode received mixed reviews from critics . The Edge said that " The Secret of Monkey Island was a very good episode 

In [None]:
nwords = 50
temp = 0.8
torch.manual_seed(1234)

# Pick starting word
word = 'People'
ix = word2idx[word if word in word2idx else '<unk>'] 
inp = torch.LongTensor([ix]).unsqueeze(0).to(device)

# Generate
print(word, end=' ')
model.reset_hidden()
with torch.no_grad():
    for i in range(nwords):
        output = model(inp)
        word_weights = output.squeeze().div(temp).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        inp.fill_(word_idx)

        word = idx2word[word_idx]

        print(word, end=' ')

People 's <unk> , a separate subject of a <unk> full @-@ size : <eos> The novel is a real @-@ life pop and pop culture , for a rest of the era , the red @-@ brick model of one of the surrounding art . The film is believed to 