# LSTM Language Model

In this notebook, we build a language model using LSTMs. This is the "old-school" approach: since the introduction of the Transformer architecture, Transformer-based language models generally reach better overall quality than LSTM-based ones.

In [1]:
%load_ext autoreload
%autoreload 2
from practicalnlp import settings
from practicalnlp.models import *
from practicalnlp.training import *
from practicalnlp.data import *
import torch

# Loading data

Here we load all the data with `batch_size = 20`. Note that the data is subdivided by two parameters: `nctx` and `batch_size`. `nctx` is the number of words processed in a single pass of the training phase. For example, the figure below illustrates each *step* of the training phase with `nctx = 3`, for a single batch element (one row of the batch) holding the entire sentence shown.


<img src="training_step_lm.svg" width="800" />

Each arrow points from a word to the word it is trained to predict, that is, the next word in the `nctx` window. Once the last word of the `nctx` window has been processed, the window is shifted by `nctx` words and the process repeats until the entire batch has been read. The `nctx` parameter is also known as `bptt` (*backpropagation through time*), which is the name used in the official PyTorch tutorial for Language Modeling.
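
The windowing described above can be sketched in plain Python on a toy sentence. This is only an illustration: `nctx_windows` is a hypothetical helper (not part of `practicalnlp`), and it assumes the convention of the official PyTorch tutorial, where the targets are the inputs shifted by one position.

```python
def nctx_windows(tokens, nctx):
    """Yield (inputs, targets) pairs for consecutive, non-overlapping windows.

    Each target is the token that follows the corresponding input.
    """
    # The last token has no successor, so it can only appear as a target.
    for start in range(0, len(tokens) - 1, nctx):
        end = min(start + nctx, len(tokens) - 1)
        yield tokens[start:end], tokens[start + 1:end + 1]


sentence = "the quick brown fox jumps over the lazy dog".split()
for inputs, targets in nctx_windows(sentence, nctx=3):
    print(inputs, "->", targets)
```

The final window may be shorter than `nctx` when the sequence length is not a multiple of it.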

Although this example shows the execution for only a single batch element, in practice we do it for all batch elements at the same time. This is easiest to picture as a 2-dimensional tensor, with one dimension for the batch size and the other for the sequence length. In the code below, we do it using PyTorch.

In [2]:
batch_size = 20
nctx = 35
TRAIN = settings.WIKI_TRAIN_DATA
VALID = settings.WIKI_VALID_DATA
reader = WordDatasetReader(nctx)
reader.build_vocab((TRAIN,))

train_set = reader.load(TRAIN, batch_size)
valid_set = reader.load(VALID, batch_size)

In [4]:
train_set.shape

torch.Size([20, 104431])
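
The shape above reads as `[batch_size, tokens per row]`. As a rough sketch of how a flat stream of token ids could be arranged this way and then cut into `nctx` windows for all rows at once, consider the following. It is an assumption about what `WordDatasetReader.load` and `fit_lm` might do internally, not their actual code; `batchify` and `iter_windows` are hypothetical names.

```python
import torch


def batchify(token_ids, batch_size):
    """Reshape a 1-D stream of token ids into [batch_size, tokens_per_row]."""
    tokens_per_row = token_ids.size(0) // batch_size
    # Drop the tail that does not fill a complete row.
    trimmed = token_ids[: tokens_per_row * batch_size]
    return trimmed.view(batch_size, tokens_per_row)


def iter_windows(data, nctx):
    """Yield (inputs, targets), each of shape [batch_size, <= nctx]."""
    # Targets are the inputs shifted one position to the right.
    for start in range(0, data.size(1) - 1, nctx):
        end = min(start + nctx, data.size(1) - 1)
        yield data[:, start:end], data[:, start + 1 : end + 1]


stream = torch.arange(23)  # stand-in for a corpus of 23 token ids
data = batchify(stream, batch_size=4)
print(data.shape)
for x, y in iter_windows(data, nctx=3):
    print(x.shape, y.shape)
```

Each row is a contiguous slice of the corpus, so every training step advances all rows through their own text in parallel.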

In [5]:
model = LSTMLanguageModel(len(reader.vocab), 512, 512)
model.to('cuda:0')

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(learnable_params, lr=0.001)
fit_lm(model, optimizer, 1, batch_size, nctx, train_set, valid_set)

Model has 21274623 parameters
EPOCH 1
Training Results
average_train_loss 7.039359 (7.674278)
average_train_loss 6.940264 (7.212206)
average_train_loss 6.476130 (6.988302)
average_train_loss 6.665786 (6.845469)
average_train_loss 6.520135 (6.741749)
average_train_loss 6.277775 (6.662471)
average_train_loss 6.214710 (6.606091)
average_train_loss 6.106718 (6.555020)
average_train_loss 6.054318 (6.509634)
average_train_loss 5.910986 (6.472560)
average_train_loss 6.157823 (6.441040)
average_train_loss 5.861746 (6.414230)
average_train_loss 6.177941 (6.391719)
average_train_loss 6.130061 (6.365675)
average_train_loss 6.221581 (6.348069)
average_train_loss 6.110457 (6.332552)
average_train_loss 6.081135 (6.311838)
average_train_loss 5.832395 (6.291147)
average_train_loss 5.744493 (6.275616)
average_train_loss 6.009275 (6.259222)
average_train_loss 5.699287 (6.242274)
average_train_loss 5.692174 (6.223206)
average_train_loss 5.925848 (6.205693)
average_train_loss 5.840515 (6.193715)
average_t
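
Language-model losses are often easier to read as perplexity. Assuming the values logged above are mean per-token cross-entropy in nats, and that the value in parentheses is a running average (both are assumptions, since the notebook does not show what `fit_lm` logs), perplexity is the exponential of the loss. `perplexity` below is a hypothetical helper:

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Parenthesised value from the last complete log line above.
print(f"{perplexity(6.193715):.1f}")
```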

In [4]:
model = TransformerModel(len(reader.vocab), 512, 2, 200, 2, 0.2)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model has {num_params} parameters") 


learnable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(learnable_params, lr=0.001)
fit_lm(model, optimizer, 1, batch_size, nctx, train_set, valid_set)

'1.1.0'

In [6]:
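def sample(model, index2word, start_word='the', maxlen=20):
    """Sample `maxlen` words from the model, starting at `start_word`."""
    model.eval()
    # Use whichever device the model lives on instead of hard-coding 'cuda:0'.
    device = next(model.parameters()).device
    words = [start_word]
    # `reader` is the global WordDatasetReader built above; its vocab maps word -> index.
    x = torch.tensor(reader.vocab.get(start_word)).long().reshape(1, 1).to(device)
    hidden = model.init_hidden(1)

    with torch.no_grad():
        # Honour `maxlen` (the loop bound was previously hard-coded to 20).
        for _ in range(maxlen):
            output, hidden = model(x, hidden)
            # exp() yields probabilities, assuming the model emits log-probabilities.
            word_softmax = output.squeeze().exp().cpu()
            selected = torch.multinomial(word_softmax, 1)[0]
            # Feed the sampled word back in as the next input.
            x.fill_(selected)
            word = index2word[selected.item()]
            words.append(word)
    words.append('...')
    return words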

index2word = {i: w for w, i in reader.vocab.items()}
words = sample(model, index2word)
print(' '.join(words))

the American album were 28 – 3 . During 1907 , suspicions remained the first time since 1972 with him just ...
