# Language model from scratch -- RNN

In [2]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [3]:
path

Path('/home/jupyter/.fastai/data/human_numbers')

In [4]:
Path.BASE_PATH = path

In [5]:
path.ls()

(#2) [Path('valid.txt'),Path('train.txt')]

In [6]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

In [7]:
lines[-5:]

(#5) ['nine thousand nine hundred ninety five \n','nine thousand nine hundred ninety six \n','nine thousand nine hundred ninety seven \n','nine thousand nine hundred ninety eight \n','nine thousand nine hundred ninety nine \n']

The dataset contains the numbers 1 to 9999 written out in English.

## Tokenize and numericalize

In [8]:
text = ' . '.join([l.strip() for l in lines])
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

In [17]:
tokens = L(text.split(' '))
tokens[:10]

(#10) ['one','.','two','.','three','.','four','.','five','.']

In [18]:
tokens[40:50]

(#10) ['twenty','one','.','twenty','two','.','twenty','three','.','twenty']

In [19]:
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

In [20]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
tokens, nums

((#63095) ['one','.','two','.','three','.','four','.','five','.'...],
 (#63095) [0,1,2,1,3,1,4,1,5,1...])
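Numericalization is only useful if it can be reversed. The sketch below repeats the same steps in plain Python on a toy token list (an assumption; the real `tokens` list is far longer) and decodes the indices back to words. It uses `dict.fromkeys` to keep the first-appearance ordering that the `vocab` output above shows.

```python
# Toy stand-in for the notebook's `tokens` list (illustrative only).
tokens = ['one', '.', 'two', '.', 'three', '.']

# Unique tokens in order of first appearance (dicts preserve insertion order).
vocab = list(dict.fromkeys(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}

# Numericalize: replace each token with its index in the vocabulary.
nums = [word2idx[t] for t in tokens]

# Decode: index back into the vocabulary to recover the original tokens.
decoded = [vocab[n] for n in nums]

print(vocab)
print(nums)
print(decoded == tokens)
```

Because `vocab` is just a list, decoding needs no second dictionary: the index is the position.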

## A language model 

Let's take three consecutive tokens and predict the fourth:

In [23]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4, 3))

(#21031) [((#3) ['one','.','two'], '.'),((#3) ['.','three','.'], 'four'),((#3) ['four','.','five'], '.'),((#3) ['.','six','.'], 'seven'),((#3) ['seven','.','eight'], '.'),((#3) ['.','nine','.'], 'ten'),((#3) ['ten','.','eleven'], '.'),((#3) ['.','twelve','.'], 'thirteen'),((#3) ['thirteen','.','fourteen'], '.'),((#3) ['.','fifteen','.'], 'sixteen')...]

In [22]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))

So `seqs` is our complete dataset, with the independent and dependent variables.

`seqs` is a legitimate dataset because it has a length and we can index into it.
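To make the windowing concrete, here is a minimal plain-Python sketch of the same slicing on a toy index list. `make_windows` is a hypothetical helper, not part of fastai; it mirrors the `range(0, len(nums)-4, 3)` bounds used above, generalized to a window length `sl`.

```python
def make_windows(seq, sl=3):
    """Hypothetical helper: non-overlapping (input, target) pairs.

    Each input is `sl` consecutive items; the target is the item that follows.
    The range bound mirrors the notebook's `len(nums) - 4` for sl=3.
    """
    return [(seq[i:i + sl], seq[i + sl]) for i in range(0, len(seq) - sl - 1, sl)]

toy = [0, 1, 2, 1, 3, 1, 4, 1, 5, 1]   # made-up token indices
windows = make_windows(toy)

for x, y in windows:
    print(x, '->', y)

# A plain list already supports the two operations a dataset needs.
print(len(windows), windows[0])
```

Two things are worth noticing: the windows step forward by `sl`, so the inputs do not overlap, and the conservative upper bound can drop a final window that would still have fit.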

In [24]:
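bs = 62  # matches the batch dimension in the shapes printed below
cut = int(len(seqs) * 0.8)  # first 80% of sequences for training, the rest for validation
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)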

In [34]:
x, y = first(dls.train)

In [35]:
x.shape, y.shape

(torch.Size([62, 3]), torch.Size([62]))

### Create a regular neural net

In [27]:
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden) # input -> hidden
        self.h_h = nn.Linear(n_hidden, n_hidden) # hidden -> hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz) # hidden -> output
    
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

- First create an embedding for each word in the vocab with `n_hidden` latent variables 

Then, for each three-word sequence in the mini-batch:

1. Pass the embedding of the first word through the hidden layer `h_h` and apply ReLU to get the activation
1. Add the activation to the embedding vector of the second word
1. Pass that through the same hidden layer and ReLU
1. Add the activation to the embedding vector of the third word
1. Pass that through the same hidden layer and ReLU
1. Pass through a final linear layer `h_o` to map the output to size `vocab_sz`

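The final layer gives us one raw score (logit) per word in the vocabulary. `F.cross_entropy` applies a softmax to these scores, which turns them into a probability distribution over all words.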
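The outputs of `h_o` are unnormalized scores; the softmax inside the loss is what makes them probabilities. The following is a minimal plain-Python sketch of that step for a single example, with made-up scores for a three-word vocabulary. It illustrates the math only and is not how PyTorch implements `F.cross_entropy` internally.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5, -1.0]    # made-up scores, one per vocabulary word
probs = softmax(logits)

print(probs, sum(probs))
print(cross_entropy(logits, target=0))   # low loss: the top score is the target
print(cross_entropy(logits, target=2))   # high loss: the lowest score is the target
```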

Diagrammatically, we can represent the neural net like this:

- input is rectangle
- arrow is a computation
- circle is computed activations
- output is triangle
- Two arrows impinging on a circle is an addition (or concatenation, either is fine -- though it'll change the shape of the hidden layers if you concatenate)


<img src="./figures/rnn.png" width="500">

There is a subtlety in the choice of architecture here: we use the **same layer** `self.h_h` for every hidden-to-hidden step. This is because we expect there to be a single rule to transition from word to word in the language of human numbers.

In my own wording, there needs to be a _spatial invariance_ to the model architecture. There should be no special significance to the position of the first, second, or third activations because there is no special significance to the first, second, or third position of the input data. What matters is how you transition between adjacent words/tokens. That's all.
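One practical consequence of reusing `h_h` is that the parameter count does not grow with the number of input positions. The sketch below counts weights and biases for an `LMModel1`-style network with and without sharing. `n_params` is a hypothetical helper written for this illustration, and the unshared variant is a thought experiment rather than a model from this notebook.

```python
def n_params(vocab_sz, n_hidden, n_positions, share_h_h=True):
    """Count weights and biases for an LMModel1-style network."""
    emb = vocab_sz * n_hidden               # nn.Embedding: one vector per word
    h_h = n_hidden * n_hidden + n_hidden    # nn.Linear(n_hidden, n_hidden): weights + bias
    h_o = n_hidden * vocab_sz + vocab_sz    # nn.Linear(n_hidden, vocab_sz): weights + bias
    n_h_h = 1 if share_h_h else n_positions # one shared layer, or one layer per position
    return emb + n_h_h * h_h + h_o

# Sizes used in this notebook: 30 vocabulary words, 64 hidden units, 3 input tokens.
shared = n_params(30, 64, 3, share_h_h=True)
separate = n_params(30, 64, 3, share_h_h=False)
print(shared, separate)
```

With sharing, feeding in a longer context would add computation but no new weights.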

Question: Seems quite limiting in terms of the richness of the model. How do you improve on this...?

In [28]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)

In [28]:
learn.fit_one_cycle(4, 1e-3)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 1.83132 | 1.877368 | 0.47207 | 00:02 |
| 1 | 1.402578 | 1.770471 | 0.475398 | 00:01 |
| 2 | 1.396065 | 1.664922 | 0.491086 | 00:01 |
| 3 | 1.375362 | 1.690605 | 0.432612 | 00:01 |


Is this any good? A baseline would be to always predict the most common token:

In [44]:
c = Counter(tokens[cut:])
mc = c.most_common(5)
mc

[('thousand', 7104),
 ('.', 7103),
 ('hundred', 6405),
 ('nine', 2440),
 ('eight', 2344)]

In [45]:
mc[0][1]/len(tokens[cut:])

0.15353028894988222

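So always predicting the token "thousand" gives only about 15% accuracy. Our model reaches between 43% and 49% validation accuracy across the four epochs, substantially better than that, so it is learning something of the structure of the language. Note that `cut` was computed from `len(seqs)`, not `len(tokens)`, so `tokens[cut:]` covers more than the validation split; treat this baseline as approximate.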
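A stricter variant of this baseline would pick the most common target in the training split and score that single guess against the validation targets only. Here is a minimal sketch of that idea; `majority_baseline` is a hypothetical helper and the toy lists are made up. With the notebook's data, the targets could be extracted along the lines of `[y for _, y in seqs[:cut]]` and `[y for _, y in seqs[cut:]]`, which is an assumption about usage rather than something run here.

```python
from collections import Counter

def majority_baseline(train_targets, valid_targets):
    """Accuracy of always predicting the most common training target."""
    most_common, _ = Counter(train_targets).most_common(1)[0]
    hits = sum(1 for t in valid_targets if t == most_common)
    return most_common, hits / len(valid_targets)

# Made-up targets standing in for the real training and validation splits.
train_targets = ['.', 'thousand', '.', 'hundred', '.']
valid_targets = ['thousand', '.', 'hundred', '.']

token, acc = majority_baseline(train_targets, valid_targets)
print(token, acc)
```

Choosing the token from the training split avoids peeking at the validation data when defining the baseline.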