# Language model

In [2]:
from chapter import *

Our goal in this section is to train a character-level RNN language model that predicts the next token at *each* step, using a context whose length grows with the step. Hence, during training, the model makes a prediction at every time step. This is reflected in the following implementation; see also {numref}`04-char-rnn`.

<br>

```{figure} ../../../img/nn/04-char-rnn.svg
---
width: 550px
name: 04-char-rnn
align: center
---
Character-level RNN language model for predicting the next character at each step. [Source](https://www.d2l.ai/chapter_recurrent-neural-networks/rnn.html)
```

In [3]:
%%save
class RNNLanguageModel(nn.Module):
    """RNN based language model."""
    def __init__(self, dim_inputs, dim_hidden, vocab_size):
        super().__init__()
        self.rnn = SimpleRNN(dim_inputs, dim_hidden)
        self.linear = nn.Linear(dim_hidden, vocab_size)

    def forward(self, x, state=None):
        outs, _ = self.rnn(x, state)
        logits = self.linear(outs)         # (B, T, H) -> (B, T, C)
        return logits.permute(0, 2, 1)     # F.cross_entropy expects (B, C, T)

The linear layer acts on the rightmost dimension of `outs`, which holds the state vector at each time step. Thus, we get $T$ predictions with increasing context size[^1] $1, 2, \ldots, T$. As such, each example in our dataset is a pair of $T$ input characters and $T$ target characters, where the targets are the inputs shifted by one position.

[^1]: Consequently, during the backward pass, the model gets corrected at each time step, with variable-length dependency.
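
The shape bookkeeping above can be checked with a minimal sketch. It assumes `nn.RNN` with `batch_first=True` as a stand-in for the chapter's `SimpleRNN` (whose definition is not shown here), and the sizes are purely illustrative. It also checks that changing the last input character leaves the predictions at earlier steps untouched, which is what "increasing context size" means in practice.

```python
import torch
import torch.nn as nn

B, T, D, H, C = 4, 10, 28, 5, 28   # batch, steps, input dim, hidden dim, vocab size

# Assumption: nn.RNN stands in for the chapter's SimpleRNN
rnn = nn.RNN(D, H, batch_first=True)
linear = nn.Linear(H, C)

x = torch.randn(B, T, D)
outs, _ = rnn(x)                        # (B, T, H): hidden state at every time step
logits = linear(outs)                   # (B, T, C): linear acts on the last dim only
logits_bct = logits.permute(0, 2, 1)    # (B, C, T): layout used with F.cross_entropy

# Applying the linear layer one time step at a time gives the same logits
stepwise = torch.stack([linear(outs[:, t]) for t in range(T)], dim=1)
same = torch.allclose(logits, stepwise, atol=1e-5)

# Perturbing only the last input must not change predictions at steps 0..T-2
x2 = x.clone()
x2[:, -1] += 1.0
logits2 = linear(rnn(x2)[0])
causal = torch.allclose(logits[:, :-1], logits2[:, :-1], atol=1e-5)

print(outs.shape, logits.shape, logits_bct.shape, same, causal)
```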

In [3]:
%%save
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    def __init__(self, corpus: list, seq_len: int, vocab_size: int):
        super().__init__()
        self.corpus = corpus
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __getitem__(self, i):
        c = torch.tensor(self.corpus[i: i + self.seq_len + 1])
        x, y = c[:-1], c[1:]
        x = F.one_hot(x, num_classes=self.vocab_size).float()
        return x, y
    
    def __len__(self):
        return len(self.corpus) - self.seq_len
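
The windowing used by `SequenceDataset` can be traced by hand on a toy string. The sketch below is plain Python, with a hypothetical helper `window` and a made-up corpus; it mirrors the slicing in `__getitem__` and the count in `__len__`, but skips the tensor conversion and one-hot encoding.

```python
corpus = list("the time machine")   # toy corpus of 16 characters (illustrative)
seq_len = 5

def window(corpus, i, seq_len):
    """Return (inputs, targets): targets are the inputs shifted by one position."""
    c = corpus[i: i + seq_len + 1]  # seq_len + 1 consecutive characters
    return c[:-1], c[1:]

# Same count as SequenceDataset.__len__: the last valid start index is
# len(corpus) - seq_len - 1, so that every slice has seq_len + 1 characters.
num_windows = len(corpus) - seq_len

x0, y0 = window(corpus, 0, seq_len)
x_last, y_last = window(corpus, num_windows - 1, seq_len)

print(num_windows)                        # 11
print("".join(x0), "->", "".join(y0))     # the t -> he ti
print("".join(x_last), "->", "".join(y_last))   # achin -> chine
```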

We train the RNN language model on the *Time Machine* text.

In [4]:
tm = TimeMachine()
corpus, vocab = tm.build()
len(corpus), len(vocab)

(174215, 28)

In [5]:
from torch.utils.data import random_split

dataset = SequenceDataset(corpus, seq_len=10, vocab_size=len(vocab))
train_dataset, valid_dataset = random_split(dataset, [0.80, 0.20])
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=128)

Recall that the inputs are one-hot vectors. Note that the model is not a character-to-character mapping: e.g. the third output here depends explicitly on the third character, and implicitly on the first two characters. Each batch index corresponds to a randomly chosen anchor point in the corpus, but the character ordering within each sequence is intact.

In [6]:
x, y = next(iter(valid_loader))

a, T = 1, dataset.seq_len                              # a: index of an example in the batch
x_chars = vocab.to_tokens(torch.argmax(x[a], dim=1))   # one-hot -> indices -> characters
y_chars = vocab.to_tokens(y[a])
for i in range(T):
    print(f"{x_chars[i]} --> {y_chars[i]}")

t --> r
r --> e
e --> e
e --> s
s -->  
  --> t
t --> h
h --> e
e -->  
  --> f


In [7]:
print(x.shape, y.shape)
print("target:", y[0])
print("inputs:", torch.argmax(x, dim=-1)[0])

torch.Size([128, 10, 28]) torch.Size([128, 10])
target: tensor([ 6,  1, 19,  2, 15,  8,  6,  1, 16,  7])
inputs: tensor([ 9,  6,  1, 19,  2, 15,  8,  6,  1, 16])


Note that the model returns logits of shape $(B, C, T)$ and the targets have shape $(B, T)$, so `F.cross_entropy` can be used as usual:

In [8]:
import torch.nn.functional as F

x, y = next(iter(train_loader))
model = RNNLanguageModel(len(vocab), 5, len(vocab))     # output: (B, C, T)
loss = F.cross_entropy(model(x), y)
loss

tensor(3.3789, grad_fn=<NllLoss2DBackward0>)
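
As a sanity check on the loss above, the following sketch uses random stand-in tensors (not the actual model or data) to illustrate two points: `F.cross_entropy` on $(B, C, T)$ logits with $(B, T)$ targets equals the loss obtained after flattening batch and time into a single axis, and a model that predicts uniformly over $C$ classes has loss exactly $\log C$.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, T = 128, 28, 10

logits = torch.randn(B, C, T)        # stand-in for model(x), shape (B, C, T)
y = torch.randint(0, C, (B, T))      # stand-in targets, shape (B, T)

# Loss on the (B, C, T) layout: averaged over all B * T predictions
loss_bct = F.cross_entropy(logits, y)

# Equivalent: treat every (batch, time) position as a separate example
flat_logits = logits.permute(0, 2, 1).reshape(B * T, C)
loss_flat = F.cross_entropy(flat_logits, y.reshape(B * T))

# All-zero logits give a uniform distribution, hence loss = log(C)
loss_uniform = F.cross_entropy(torch.zeros(B, C, T), y)

print(loss_bct.item(), loss_flat.item())
print(loss_uniform.item(), math.log(C))
```

With $C = 28$, $\log C \approx 3.33$, so the value of about 3.38 reported above is what one would expect from an untrained model.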