# LSTM Language Model

In this notebook, we show how to build a language model using LSTMs. This is the "old-school" way to make language models. Since the introduction of the Transformer architecture, Transformer-based language models generally reach better overall quality than LSTM-based ones.

In [4]:
%load_ext autoreload
%autoreload 2
from practicalnlp import settings
from practicalnlp.models import *
from practicalnlp.training import *
from practicalnlp.data import *
import torch



# Loading data

Here we load all the data with `batch_size = 20`. Note that the data is subdivided using two parameters: `nctx` and `batch_size`. `nctx` is the number of words processed in a single step of the training phase. For example, the figure below illustrates each *step* of the training phase for `nctx = 3`, following a single batch element over the entire sentence shown.


<img src="training_step_lm.svg" width="800" />

Each arrow indicates that the word at its origin is used to predict the next word within the `nctx` window. Once the last word of the `nctx` window has been processed, the window is shifted by `nctx` words, and the process repeats until the entire batch has been read. The `nctx` parameter is also known as `bptt` (*backpropagation through time*), which is the name used in the official PyTorch tutorial for Language Modeling.
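The windowing described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used by `WordDatasetReader`: the helper `nctx_windows` and the example sentence are hypothetical, and the sketch assumes that the targets are simply the inputs shifted by one word.

```python
def nctx_windows(tokens, nctx):
    """Yield (inputs, targets) pairs; targets are the inputs shifted by one word.

    Hypothetical helper for illustration, not part of practicalnlp.
    """
    # The last token has no next word to predict, so stop one short of the end.
    last = len(tokens) - 1
    for start in range(0, last, nctx):
        end = min(start + nctx, last)
        yield tokens[start:end], tokens[start + 1:end + 1]


# Made-up sentence, used only to show the window moving by nctx words.
sentence = "the quick brown fox jumps over the lazy dog".split()
windows = list(nctx_windows(sentence, nctx=3))
for inputs, targets in windows:
    print(inputs, "->", targets)
```

Each printed pair corresponds to one step: every input word is paired with the word that follows it, and the next step starts `nctx` words further along.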

Although this example shows the execution for a single batch element only, in practice we process all batch elements at the same time. This is done with a 2-dimensional tensor: one dimension for the batch size and the other for the sequence length.
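To make the 2-dimensional layout concrete, here is a minimal sketch that uses nested Python lists as a stand-in for a tensor. The helpers `batchify` and `batched_windows` are hypothetical names, and the sketch assumes the token stream is cut into `batch_size` rows of equal length with any leftover tail dropped. The actual reader may arrange the data differently.

```python
def batchify(tokens, batch_size):
    """Split one long token stream into batch_size rows of equal length."""
    row_len = len(tokens) // batch_size
    # Tokens that do not fill a complete row are dropped.
    return [tokens[i * row_len:(i + 1) * row_len] for i in range(batch_size)]


def batched_windows(rows, nctx):
    """Yield (inputs, targets) grids of shape batch_size x window length."""
    last = len(rows[0]) - 1  # the final token of each row has no target
    for start in range(0, last, nctx):
        end = min(start + nctx, last)
        inputs = [row[start:end] for row in rows]
        targets = [row[start + 1:end + 1] for row in rows]
        yield inputs, targets


stream = list(range(14))  # stand-in word ids
rows = batchify(stream, batch_size=2)
steps = list(batched_windows(rows, nctx=3))
for inputs, targets in steps:
    print(inputs, "->", targets)
```

Every step slices the same columns out of all rows, so each batch element advances through its own part of the text in lockstep with the others.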

In [5]:
batch_size = 20
nctx = 35
TRAIN = settings.WIKI_TRAIN_DATA
VALID = settings.WIKI_VALID_DATA
reader = WordDatasetReader(nctx)
reader.build_vocab((TRAIN,))

train_set = reader.load(TRAIN, batch_size)
valid_set = reader.load(VALID, batch_size)

In [9]:
# Build the model over the reader's vocabulary and move it to the first GPU
model = LSTMLanguageModel(len(reader.vocab), 512, 512)
model.to('cuda:0')

# Count only the parameters that will be updated during training
learnable_params = [p for p in model.parameters() if p.requires_grad]
num_params = sum(p.numel() for p in learnable_params)
print(f"Model has {num_params} parameters")

# Train for one epoch with Adam
optimizer = torch.optim.Adam(learnable_params, lr=0.001)
fit_lm(model, optimizer, 1, batch_size, nctx, train_set, valid_set)

Model has 21274623 parameters
EPOCH 1
Training Results
average_train_loss 7.032219 (7.693288)
average_train_loss 6.941036 (7.207089)
average_train_loss 6.441201 (6.979663)
average_train_loss 6.652155 (6.837320)
average_train_loss 6.503147 (6.733491)
average_train_loss 6.221502 (6.655999)
average_train_loss 6.215572 (6.598819)
average_train_loss 6.135547 (6.548828)
average_train_loss 6.034210 (6.503680)
average_train_loss 5.849336 (6.467311)
average_train_loss 6.151655 (6.435877)
average_train_loss 5.897994 (6.409550)
average_train_loss 6.146118 (6.387423)
average_train_loss 6.198409 (6.361967)
average_train_loss 6.185694 (6.344496)
average_train_loss 6.094780 (6.329319)
average_train_loss 6.078131 (6.308456)
average_train_loss 5.808933 (6.288046)
average_train_loss 5.748820 (6.272158)
average_train_loss 5.981767 (6.255495)
average_train_loss 5.676554 (6.238284)
average_train_loss 5.675500 (6.219382)
average_train_loss 5.981535 (6.202112)
average_train_loss 5.863875 (6.189984)
average_t