# Language Models

In [1]:
import torch
from d2l import torch as d2l

## Learning Language Models

We could imagine tokenising a document into words, which would let us apply basic probability rules to compute its properties. For example, the probability of four words occurring one after the other would be

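$$ P(\text{Machine}, \text{learning}, \text{is}, \text{fun}) = P(\text{Machine}) \, P(\text{learning} \mid \text{Machine}) \, P(\text{is} \mid \text{Machine}, \text{learning}) \, P(\text{fun} \mid \text{Machine}, \text{learning}, \text{is}) $$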
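The chain rule can be checked numerically with a minimal sketch. The four conditional probabilities below are made-up values for illustration, not estimates from any corpus, and the function names are hypothetical.

```python
import math

# Hypothetical conditional probabilities, chosen only for illustration.
cond_probs = [
    0.01,  # P(Machine)
    0.20,  # P(learning | Machine)
    0.30,  # P(is | Machine, learning)
    0.05,  # P(fun | Machine, learning, is)
]

def sequence_probability(cond_probs):
    """Multiply the per-token conditional probabilities (chain rule)."""
    return math.prod(cond_probs)

def sequence_log_probability(cond_probs):
    """Sum log-probabilities instead of multiplying probabilities."""
    return sum(math.log(p) for p in cond_probs)

joint = sequence_probability(cond_probs)
log_joint = sequence_log_probability(cond_probs)
print(joint, math.exp(log_joint))
```

Working in log space is the usual choice in practice, since the product of many small probabilities quickly underflows for long sequences.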

### Markov Models and n-grams

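Under a Markov assumption, we estimate the probability of the next word from only the few words that precede it rather than from the full history. An n-gram model does this by conditioning on the previous n - 1 words.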
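As a rough sketch of the idea, a bigram model (each word conditioned on the single preceding word) can be estimated from counts. The toy corpus and the helper name `bigram_prob` are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat the cat ran".split()

# Count each adjacent pair, and how often each word appears as a context
# (that is, with some word following it).
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
context_counts = Counter(tokens[:-1])

def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # context never seen, so no estimate is available
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" in 2 of 3 cases
print(bigram_prob("the", "dog"))  # unseen pair gets probability zero
```

The zero assigned to the unseen pair is exactly the weakness that smoothing tries to address.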

### Word Frequency

The simplest estimate is the relative frequency of words and word combinations over a very large corpus. This is highly challenging because many two- and three-word combinations are not encountered often enough to give a good estimate.
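The sparsity problem shows up even on a tiny example. The sketch below counts n-grams in an invented toy corpus and reports what fraction of the distinct n-grams occur exactly once; the helper names are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def singleton_fraction(counts):
    """Fraction of distinct n-grams that were seen exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat the cat ran".split()

for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    print(n, len(counts), singleton_fraction(counts))
```

As n grows, a larger share of the n-grams is seen only once, so the frequency estimates rest on very little evidence.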

### Laplace Smoothing 

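Add a small constant to all word counts to smooth the estimates. This helps with singletons and with words or combinations that never occur in the corpus, which would otherwise receive unreliable or zero probabilities.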
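A minimal sketch of additive smoothing for single-word probabilities, assuming the common form (count + alpha) / (total + alpha * vocabulary size). The function name, the constant `alpha`, and the toy corpus are illustrative choices rather than anything fixed by these notes.

```python
from collections import Counter

def laplace_unigram_probs(tokens, vocab, alpha=1.0):
    """Smoothed estimate P(w) = (count(w) + alpha) / (total + alpha * |vocab|).

    Assumes every token in `tokens` is also in `vocab`.
    """
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens) | {"dog"})  # "dog" never occurs in the corpus

probs = laplace_unigram_probs(tokens, vocab, alpha=1.0)
print(probs["the"], probs["dog"], sum(probs.values()))
```

The unseen word now gets a small non-zero probability, and the estimates still sum to one. Larger values of `alpha` pull every estimate toward the uniform distribution.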

### In summary...

None of these simple statistical methods estimates the probability of a word given the previously observed words reliably, because most longer word sequences are seen rarely or never. So we use neural networks instead.

## Perplexity

How do we evaluate the performance of a language model? A good language model predicts the next tokens with high accuracy. For the text "it is raining", the word "outside" might be a good next token. "banana" would not be, although it is at least a real word. "wjslaia'ksapjw" would be worse still, coming from a model that has clearly learned nothing.

You could imagine evaluating the model by computing the likelihood of a sequence, but the longer the sequence, the lower its likelihood. A huge novel like War and Peace would necessarily have a much lower likelihood than a barely coherent tweet, so raw likelihood is not a good measure of quality. What is missing is the concept of an average.

Instead we use perplexity, which is the exponential of the cross-entropy loss averaged over the n tokens in a sequence. Perplexity is best understood as the geometric mean of the effective number of choices we have when deciding which token to pick next. A perfect model has a perplexity of 1, with 100% confidence in which token to pick. In the worst case, the model assigns probability 0 to the correct token, so the perplexity is positive infinity. A useful baseline is a uniform probability across all tokens in the vocabulary, for which the perplexity equals the vocabulary size. Any useful model must beat this.
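Written out, with $P(x_t \mid x_{t-1}, \ldots, x_1)$ the probability the model assigns to the true token at step $t$, perplexity is $\exp\left(-\frac{1}{n} \sum_{t=1}^{n} \log P(x_t \mid x_{t-1}, \ldots, x_1)\right)$. The sketch below computes it from a list of such probabilities and checks the three reference cases. The function name and the vocabulary size of 28 are assumptions made for illustration.

```python
import math

def perplexity(target_probs):
    """exp of the average negative log-probability given to the true tokens."""
    if any(p == 0.0 for p in target_probs):
        return math.inf  # log(0) diverges, so the perplexity is infinite
    cross_entropy = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(cross_entropy)

vocab_size = 28  # hypothetical vocabulary size

perfect = perplexity([1.0, 1.0, 1.0])         # always certain and correct
uniform = perplexity([1.0 / vocab_size] * 5)  # uniform guess at every step
worst = perplexity([0.5, 0.0, 0.5])           # one true token given probability 0

print(perfect, uniform, worst)
```

Because the log-probabilities are averaged before exponentiating, the result does not shrink or grow just because the sequence is longer.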

### Partitioning Sequences

We will use perplexity to evaluate how well a model predicts the next token given a set of previous tokens. The next question is how to select minibatches of sequences of some predefined length from a corpus for training and testing.

Let n be the subsequence length and T the total number of tokens. At the beginning of each epoch, discard the first d tokens, where d is sampled uniformly at random from 0 up to (but not including) n. Then subdivide the remaining text into floor((T - d) / n) subsequences of length n.
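The procedure above can be sketched in plain Python. This is an illustration of the random-offset idea only, not the d2l implementation; the function name and the integer toy corpus are assumptions.

```python
import random

def partition_with_random_offset(corpus, n, rng=random):
    """Drop a random prefix of d < n tokens, then cut the rest into length-n pieces."""
    d = rng.randrange(n)       # offset d drawn uniformly from 0, 1, ..., n - 1
    m = (len(corpus) - d) // n  # number of complete subsequences that fit
    subseqs = [corpus[d + i * n : d + (i + 1) * n] for i in range(m)]
    return d, subseqs

# Toy corpus of T = 23 token indices, invented for illustration only.
corpus = list(range(23))
d, subseqs = partition_with_random_offset(corpus, n=5, rng=random.Random(0))

print(d, subseqs)
```

Drawing a fresh offset each epoch means the subsequence boundaries move, so over many epochs the model sees the text split at different positions.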

In [2]:
@d2l.add_to_class(d2l.TimeMachine)
def __init__(self, batch_size, num_steps, num_train=10_000, num_val=5_000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()

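    # Download the text and build the token-index corpus and the vocabulary
    corpus, self.vocab = self.build(self._download())

    # One window of num_steps + 1 consecutive tokens per starting position
    array = torch.tensor([corpus[i:i+num_steps+1] for i in range(len(corpus) - num_steps)])
    # Inputs are the first num_steps tokens; targets are the same window shifted by one
    self.X, self.Y = array[:, :-1], array[:, 1:]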

In [3]:
@d2l.add_to_class(d2l.TimeMachine)
def get_dataloader(self, train):
    # The first num_train examples are for training, the next num_val for validation
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val
    )
    return self.get_tensorloader([self.X, self.Y], train, idx)

In [5]:
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break

X: tensor([[22,  3,  6,  0,  9,  2, 23,  6,  0,  2],
        [17,  2, 22, 20,  6,  0, 19,  6, 18, 22]]) 
Y: tensor([[ 3,  6,  0,  9,  2, 23,  6,  0,  2,  0],
        [ 2, 22, 20,  6,  0, 19,  6, 18, 22, 10]])
