<a href="https://colab.research.google.com/github/joshgregory42/practical_deep_learning/blob/main/ch_12_nlp_dive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Language Model from Scratch

## The Data

The dataset is called *Human Numbers*, and it contains the first 10,000 numbers written out in English. It is small and simple enough that we can try out methods quickly and interpret the results easily.

Download the dataset the usual way:

In [1]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [2]:
path.ls()

(#2) [Path('/root/.fastai/data/human_numbers/valid.txt'),Path('/root/.fastai/data/human_numbers/train.txt')]

Open up those two files and see what's inside. First we'll join everything together and ignore the train/valid split (we'll come back to it later):

In [3]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

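Take all those lines and concatenate them, separating them with `' . '` (a period with a space on each side, so that splitting on spaces later yields the period as its own token):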

In [4]:
text = ' . '.join([l.strip() for l in lines])

text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

Tokenize by splitting on spaces:

In [5]:
tokens = text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

To numericalize, have to create a list of all the unique tokens (our *vocab*):

In [6]:
vocab = L(*(tokens)).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

Convert our tokens into numbers by looking up the index of each in the vocab:

In [7]:
word2idx = {w:i for i, w in enumerate(vocab)}

nums = L(word2idx[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]
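
The steps above can be checked end to end on a toy input. The sketch below uses plain Python lists rather than fastai's `L`, and every `toy_*` name is made up for illustration. It shows that looking indices back up in the vocab recovers the original tokens.

```python
# Hypothetical stand-in for the lines read from train.txt and valid.txt.
toy_lines = ["one\n", "two\n", "three\n", "twenty one\n"]

# Same join and split steps as in the cells above.
toy_text = ' . '.join(l.strip() for l in toy_lines)
toy_tokens = toy_text.split(' ')

# dict.fromkeys keeps first-seen order, playing the role of L(...).unique().
toy_vocab = list(dict.fromkeys(toy_tokens))
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

# Numericalize, then decode again by indexing into the vocab.
toy_nums = [toy_word2idx[t] for t in toy_tokens]
toy_decoded = [toy_vocab[i] for i in toy_nums]

print(toy_vocab)
print(toy_nums)
```

Because the vocab is just a list, decoding needs no second dictionary: `toy_vocab[i]` is the inverse of `toy_word2idx[w]`.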

## First Language Model from Scratch

A simple way to turn this into a neural network would be to specify that we are going to predict each word based on the previous three words. Could create a list of every sequence of three words as our independent variables, and the next word after each sequence as the dependent variable.

Can do that with plain Python. First do it with tokens to confirm what it looks like:

In [8]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4,3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Now do it with tensors of the numericalized values, which is what the model will actually use:

In [9]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
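
To make the indexing in that comprehension easier to follow, here is a small sketch on a list of ten integers. The helper name `make_windows` and the parameter `sl` (sequence length) are assumptions introduced for illustration; the `range` bounds mirror the ones used above.

```python
def make_windows(items, sl=3):
    """Pair each non-overlapping window of `sl` items with the item that follows it."""
    # Stepping by `sl` means the windows never overlap; with sl=3 the stop value
    # len(items) - sl - 1 equals the len(...) - 4 used in the cells above.
    return [(items[i:i + sl], items[i + sl])
            for i in range(0, len(items) - sl - 1, sl)]

toy = list(range(10))
windows = make_windows(toy)
print(windows)

# The same range arithmetic applied to the 63,095 tokens of the real dataset.
n_seqs = len(range(0, 63095 - 4, 3))
print(n_seqs)
```

The second calculation reproduces the 21,031 sequences reported in the output above. Note that the target of one window is also the first input of the next, so every token is used, but each one appears in at most one input window.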

Can batch those using `DataLoaders.from_dsets`. For now we'll split the sequences in order rather than randomly: the first 80% for training and the remaining 20% for validation:

In [10]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)

Now we create a neural network architecture that takes three words as input, and returns a prediction of the probability of each possible next word in the vocab. Use three standard layers, with a few changes:

First change is that the first linear layer will only use the first word's embedding as activations, the second layer will use the second word's embedding plus the first layer's output activations, and the third layer will use the third word's embedding plus the second layer's output activations. The key effect is that every word is interpreted in the context of the words preceding it.

Second main change is that each of these three layers will use the same weight matrix. This means that the way one word impacts the activations from previous words should not change depending on the position of the word. So the layer cannot specialize to one sequence position; it must learn to handle all positions.

Since the layer weights don't change from one step to the next, we can think of the sequential layers as "the same layer" repeated.

## Our Language Model in PyTorch

Create the language model module that we described earlier:

In [11]:
class LMModel1(Module):
  def __init__(self, vocab_sz, n_hidden):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden, vocab_sz)

  def forward(self, x):
    h = F.relu(self.h_h(self.i_h(x[:, 0])))
    h = h + self.i_h(x[:, 1])
    h = F.relu(self.h_h(h))
    h = h + self.i_h(x[:, 2])
    h = F.relu(self.h_h(h))
    return self.h_o(h)

We've created three layers here:

* The embedding layer (`i_h`, for *input* to *hidden*)
* The linear layer to create the activations for the next word (`h_h`, for *hidden* to *hidden*)
* A final linear layer to predict the fourth word (`h_o`, for *hidden* to *output*)
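
One way to see the effect of sharing the hidden-to-hidden weights is to count parameters. The sketch below is plain arithmetic, not fastai or PyTorch code. It assumes the sizes used in this notebook (a 30-token vocab and 64 hidden units) and PyTorch's defaults (`nn.Embedding` has no bias, `nn.Linear` has one). The helper name `n_params` is made up for illustration.

```python
def n_params(vocab_sz, n_hidden, n_hidden_layers=1):
    """Count parameters for an embedding, some hidden linear layers, and an output layer."""
    i_h = vocab_sz * n_hidden             # embedding: one vector per token, no bias
    h_h = n_hidden * n_hidden + n_hidden  # linear: weight matrix plus bias
    h_o = n_hidden * vocab_sz + vocab_sz  # linear: weight matrix plus bias
    return i_h + n_hidden_layers * h_h + h_o

shared = n_params(30, 64)                        # one h_h reused at every position
unshared = n_params(30, 64, n_hidden_layers=3)   # hypothetical: a separate h_h per position

print(shared, unshared)
```

Under these assumptions the shared version has 8,030 parameters, against 16,350 if each of the three positions had its own hidden layer, and the shared count would not grow if the sequence were longer.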



Try training this model and see what happens:

In [12]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
                              metrics=accuracy)

learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.791373,1.968191,0.478726,00:04
1,1.407304,1.744796,0.474923,00:03
2,1.402936,1.648738,0.496553,00:02
3,1.372682,1.625802,0.49275,00:02


Let's compare this to what a really simple model would give us. We could always predict the most common token, so let's find out which token is most often the target in our validation set:

In [13]:
n, counts = 0, torch.zeros(len(vocab))

for x, y in dls.valid:
  n += y.shape[0]
  for i in range_of(vocab): counts[i] += (y==i).long().sum()

idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

(tensor(29), 'thousand', 0.15165200855716662)

The most common token has index 29, which corresponds to `thousand`. Always predicting this token would give an accuracy of roughly 15%, so our model is doing much better.
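
The same "always predict the most common target" baseline can be sketched without tensors, using `collections.Counter`. The target list here is invented toy data, and `majority_baseline` is a hypothetical helper name; the point is only to show what the counting loop above computes.

```python
from collections import Counter

def majority_baseline(targets):
    """Return the most frequent target and the accuracy of always predicting it."""
    token, count = Counter(targets).most_common(1)[0]
    return token, count / len(targets)

# Invented toy targets, not taken from the real validation set.
toy_targets = ['thousand', 'one', 'thousand', '.', 'two', 'thousand', '.', 'nine']
token, acc = majority_baseline(toy_targets)
print(token, acc)
```

Any model worth keeping should beat this number on the same validation targets.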

This baseline is okay. Let's see how we can refactor it with a loop.

## Our First Recurrent Neural Network

We could simplify our module code by replacing it with code that calls the layers in a `for` loop. This makes the code simpler and, in principle, lets us apply the module to token sequences of other lengths, not just length three (the version below still loops a fixed three times):

In [14]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden,vocab_sz)

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

Check that we get similar results using this refactoring (the exact numbers differ from run to run because the weights are initialized randomly):

In [15]:
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.857394,1.903886,0.458522,00:02
1,1.393145,1.731786,0.466128,00:02
2,1.4063,1.594919,0.491799,00:03
3,1.380854,1.658734,0.410744,00:03


Note that a neural network that has a loop like this is called a *recurrent neural network* (RNN). An RNN isn't something new; it's just a refactoring of a multilayer neural network using a `for` loop. Could just call it a "looping neural network" and it would mean the same thing.
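
To see that the loop really is just a refactoring, the sketch below replaces the layers with scalar stand-ins and compares an unrolled forward pass with a looped one. Everything here is a toy assumption: `emb` plays the role of `i_h`, and `w` and `b` play the role of `h_h`. The values are arbitrary, and the output layer is left out.

```python
def relu(v):
    return max(0.0, v)

# Hypothetical scalar "layers": an embedding table and one shared linear map.
emb = {0: 0.5, 1: -1.0, 2: 2.0}
w, b = 0.8, 0.1

def unrolled(x):
    """Three explicit steps, written out like LMModel1.forward."""
    h = relu(w * emb[x[0]] + b)
    h = h + emb[x[1]]
    h = relu(w * h + b)
    h = h + emb[x[2]]
    h = relu(w * h + b)
    return h

def looped(x):
    """The same computation as a loop, in the style of LMModel2.forward."""
    h = 0.0
    for tok in x:            # iterates over however many tokens there are
        h = h + emb[tok]
        h = relu(w * h + b)
    return h

same = abs(unrolled([0, 1, 2]) - looped([0, 1, 2])) < 1e-12
longer = looped([0, 1, 2, 0, 1])   # the loop also accepts a longer sequence
print(same, longer)
```

Iterating over the tokens themselves, rather than over `range(3)`, is what lets the looped version take sequences of any length with the same weights.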

## Improving the RNN

