## 00:56:22 - Language model from scratch

* Going to learn how a recurrent neural network works.

## 00:56:52 - Question: are there model interpretabilty tools for language models?

* There are some, but won't be covered in this part of the course.

## 00:58:11 - Preparing the dataset: tokenisation and numericalisation

* Jeremy created a dataset called human numbers that contains first 10k numbers written out in English.
  * Seems very few people create datasets, even though it's not particularly hard.

In [2]:
from fastai.text.all import *
from fastcore.foundation import L

path = untar_data(URLs.HUMAN_NUMBERS)

In [3]:
Path.BASE_PATH = path

In [4]:
path.ls()

(#2) [Path('valid.txt'),Path('train.txt')]

In [5]:
lines = L()
with open(path/'train.txt') as f:
    lines += L(*f.readlines())
with open(path/'valid.txt') as f:
    lines += L(*f.readlines())
    
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

* Concat them together and put a dot between them for tokenising.

In [6]:
text = ' . '.join([l.strip() for l in lines])
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

* Tokenise by splitting on spaces:

In [7]:
tokens = L(text.split(' '))
tokens[100:110]

(#10) ['.','forty','two','.','forty','three','.','forty','four','.']

* Create a vocab by getting unique tokens:

In [8]:
vocab = L(tokens).unique()
vocab, len(vocab)

((#30) ['one','.','two','three','four','five','six','seven','eight','nine'...],
 30)

* Convert tokens into numbers by looking up index of each word:

In [9]:
word2idx = {w: i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
tokens, nums

((#63095) ['one','.','two','.','three','.','four','.','five','.'...],
 (#63095) [0,1,2,1,3,1,4,1,5,1...])

* That gives us a small easy dataset for building a language model.

## 01:01:31 - Language model from scratch

* Create a independent and dependent pair: first 3 words are input, next is dependent.

In [10]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4, 3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

* Same thing numericalised:

In [11]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

* Can batch those with a `DataLoader` object
  * Take first 80% as training, last 20% as validation.

In [12]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[:cut], bs=64, shuffle=False)

## 01:03:35 - Simple Language model

* 3 layer neural network.
  * One linear layer that is reused 3 times.
  * Each time the result is added to the embedding.

In [13]:
class LMModel(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input2hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden2hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden2output = nn.Linear(n_hidden, vocab_sz)
    
    def forward(self, x):
        hidden = self.input2hidden(x[:,0])
        hidden = F.relu(self.hidden2hidden(hidden))
    
        hidden = hidden + self.input2hidden(x[:,1])
        hidden = F.relu(self.hidden2hidden(hidden))
    
        hidden = hidden + self.input2hidden(x[:,2])
        hidden = F.relu(self.hidden2hidden(hidden))

        return self.hidden2output(hidden)

## 01:04:49 - Question: can you speed up fine-tuning the NLP model?

* Do something else while you wait or leave overnight.

## 01:05:44 - Simple Language model cont.

* 2 interesting happening:
  * Some of the inputs are being fed into later layers, instead of just the first.
  * The model is reusing hidden state throughout layers.

In [14]:
learn = Learner(dls, LMModel(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.800355,1.929378,0.462197,00:03
1,1.364388,1.685311,0.46612,00:02
2,1.399541,1.464157,0.490787,00:03
3,1.362921,1.412893,0.495007,00:03


* We can find out if that accuracy is good by making a simple model that predicts most common token:

In [15]:
c = Counter(tokens[cut:])
mc = c.most_common(5)
mc

[('thousand', 7104),
 ('.', 7103),
 ('hundred', 6405),
 ('nine', 2440),
 ('eight', 2344)]

In [16]:
mc[0][1]/len(tokens[cut:])

0.15353028894988222

## 01:14:41 - Recurrent neural network

* We can refactor in Python using a for loop.
  * Note that `hidden = 0` is being broadcast into the hidden state.

In [17]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input2hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden2hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden2output = nn.Linear(n_hidden, vocab_sz)
    
    def forward(self, x):
        hidden = 0.
        
        for i in range(3):
            hidden = hidden + self.input2hidden(x[:,i])
            hidden = F.relu(self.hidden2hidden(hidden))

        return self.hidden2output(hidden)

In [18]:
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.827151,1.976822,0.457739,00:02
1,1.383131,1.667376,0.468141,00:03
2,1.398363,1.497497,0.490252,00:03
3,1.382042,1.432224,0.490965,00:03


* That's what a Recurrent Neural Network is!

## 01:18:39 - Improving the RNN

* Note that we're setting the previous state to 0.
  * However, the hidden state from sequence to sequence contains useful information.
  * We can rewrite to maintain state of RNN.

In [19]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input2hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden2hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden2output = nn.Linear(n_hidden, vocab_sz)
        self.hidden = 0.
        
    
    def forward(self, x):
        for i in range(3):
            self.hidden = self.hidden + self.input2hidden(x[:,i])
            self.hidden = F.relu(self.hidden2hidden(self.hidden))
    
        output = self.hidden2output(self.hidden)
        self.hidden = self.hidden.detach()

        return output
    
    def reset(self):
        self.h = 0.

## 01:19:41 - Back propagation through time

* Note that we called `self.hidden.detach()` each forward pass to ensure we're not back propogating through all the previous forward passes.
  * Known as Back propagation through time (BPTT)
  
## 01:22:19 - Ordered sequences and callbacks

* Samples must be seen in correct order - each batch needs to connect to previous batch.
* At the start of each epoch, we need to call reset.

In [20]:
def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m):
        new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds

cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False
)

* At start of epoch, we call `reset`
  * Last thing to add is little tweak of training model via a`Callback` called `ModelReseter`

In [21]:
from fastai.callback.rnn import ModelResetter

In [22]:
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.708671,1.819227,0.483173,00:02
1,1.231601,1.618543,0.513221,00:02
2,1.09424,1.709943,0.462019,00:02
3,1.020911,1.706639,0.529808,00:02
4,0.97725,1.65954,0.542788,00:02
5,0.909005,1.68144,0.56875,00:02
6,0.873126,1.604766,0.578125,00:02
7,0.811297,1.745289,0.6,00:02
8,0.774836,1.684849,0.60625,00:02
9,0.762098,1.62969,0.61851,00:02


* Training is doing a lot better now.

## 01:25:00 - Creating more signal

* Instead of putting output stage outside the loop, can put it in the loop.
  * After every hidden state, we get a prediction.
* Can we change the data so dependant variable has each of the three words after each three input words.

In [23]:
sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
        for i in range(0, len(nums) - sl - 1, sl))
cut = int(len(seqs) * 0.8)

dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

In [24]:
[L(vocab[o] for o in s) for s in seqs[0]]

[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]

* We can modify the model to return a list of outputs.

In [25]:
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input2hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden2hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden2output = nn.Linear(n_hidden, vocab_sz)
        self.hidden = 0.

    def forward(self, x):
        outputs = []

        for i in range(sl):
            self.hidden = self.hidden + self.input2hidden(x[:,i])
            self.hidden = F.relu(self.hidden2hidden(self.hidden))
            outputs.append(self.hidden2output(self.hidden))

        self.hidden = self.hidden.detach()

        return torch.stack(outputs, dim=1)
    
    def reset(self):
        self.h = 0.

* We have to write a custom loss function to flatten the outputs:

In [26]:
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))

In [27]:
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.248546,3.115103,0.164551,00:01
1,2.386219,1.933674,0.466797,00:01
2,1.760738,1.806952,0.472331,00:01
3,1.467624,1.882069,0.503092,00:01
4,1.259531,1.883375,0.535889,00:01
5,1.079385,2.0109,0.555339,00:01
6,0.939819,1.925642,0.572917,00:01
7,0.831508,1.87717,0.59139,00:01
8,0.748246,2.093921,0.574544,00:01
9,0.673372,2.032348,0.602865,00:01


## 01:28:29 - Multilayer RNN

* Even though the RNN seemed to have a lot of layers, each layer is sharing the same weight matrix.
  * Not that much better than a simple linear model.
* A multilayer RNN can stake multiple linear layers within the for loop.
* PyTorch provides the `nn.RNN` class for creating multilayers RNNs.

In [28]:
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.input2hidden = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.hidden2output = nn.Linear(n_hidden, vocab_sz)
        self.hidden = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res, h = self.rnn(self.input2hidden(x), self.hidden)
        self.hidden = h.detach()
        return self.hidden2output(res)
    
    def reset(self):
        self.hidden.zero_()

In [29]:
learn = Learner(dls, LMModel5(len(vocab), 64, 2), loss_func=loss_func, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.041824,2.657038,0.445638,00:01
1,2.138109,1.801325,0.470947,00:01
2,1.706817,1.811761,0.436361,00:01
3,1.530675,1.894353,0.373617,00:01
4,1.37585,1.788797,0.450928,00:01
5,1.222879,1.682562,0.478353,00:01
6,1.038785,1.689814,0.497152,00:01
7,0.894064,1.736756,0.507975,00:01
8,0.791512,1.795761,0.511068,00:01
9,0.720978,1.885496,0.516113,00:01


* The model seems to be doing worse. Validation loss appears to be really bad now.
  
## 01:32:39 - Exploding and vanishing gradients 

* Very deep models can be hard to train due to exploding or vanishing gradients.
* Doing a lot of matrix multiplications across layers can give you very big or very small results.
  * This can also cause gradients to grow.
* This is because numbers in a computer aren't stored precisely: they're stored as floating point numbers.
  * Really big or small numbers become very close together and differences become practically 0.
* Lots of ways to deal with this:
  * Batch norm.
  * Smart initialisation.
* One simple technique is to use an RNN architecture called LSTM.

## 01:36:29 - LSTM

* Designed such that there's mini neural networks that decide how much previous state should be kept or discarded.
* Main detail: can replace matrix multiplication with LSTMCell sequence below:

In [32]:
class LSTMCell(nn.Module):
    def __init__(self, num_inputs, num_hidden):
        self.forget_gate = nn.Linear(num_inputs + num_hidden, num_hidden)
        self.input_gate = nn.Linear(num_inputs + num_hidden, num_hidden)
        self.cell_gate = nn.Linear(num_inputs + num_hidden, num_hidden)
        self.output_gate = nn.Linear(num_inputs + num_hidden, num_hidden)
        
    def forward(self, input, state):
        h, c = state
        h = torch.stack([h, input], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget
        inp = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell
        outgate = torch.sigmoid(self.output_gate(h))
        h = outgate * torch.tanh(c)
        return h, (h, c)

* RNN that uses LSTMCell is called `LSTM`

In [35]:
class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.input2hidden = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.hidden2output = nn.Linear(n_hidden, vocab_sz)
        self.hidden = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        res, h = self.rnn(self.input2hidden(x), self.hidden)
        self.hidden = [h_.detach() for h_ in h]
        return self.hidden2output(res)
    
    def reset(self):
        for h in self.hidden:
            h.zero_()

In [36]:
learn = Learner(dls, LMModel6(len(vocab), 64, 2), loss_func=loss_func, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.012093,2.717043,0.217204,00:01
1,2.21824,2.029626,0.351644,00:01
2,1.658893,1.942375,0.456868,00:01
3,1.369292,2.049241,0.499268,00:02
4,1.129897,2.095356,0.525228,00:01
5,0.875359,1.999668,0.559977,00:01
6,0.627453,1.819402,0.649333,00:01
7,0.430494,1.841276,0.679118,00:01
8,0.288909,1.796739,0.671224,00:01
9,0.1932,1.857251,0.688558,00:01


## 01:40:00 - Questions

* Can we use regularisation to make RNN params close to identity matrix?
  * Will look at regularisation approaches.
* Can you check if activations are exploding or vanishing?
  * Yes. You can output activations of each layer with print statements.
 
## 01:42:23 - Regularisation using Dropout

* Dropout is basically deleting activations at random.
  * By removing activations at random, no single activations can become too "overspecialised"
* Dropout implementation:

In [37]:
class Dropout(nn.Module):
    def __init__(self, p):
        self.p = p
        
    def forward(self, x):
        if not self.training:
            return x
        
        mask = x.new(*x.shape).bernoulli_(1-p)
        return x * mask.div_(1-p)

* A bermoulli random variable is a bunch of 1 and 0s with `1-p` probability of getting a 1.
  * By multipying that by our input, we end up removing some layers.

## 01:47:16 - AR and TAR regularisation

* Jeremy has only seen in RNNs.
* AR (for activation regularisation)
  * Similar to Weight Decay.
  * Rather than adding a multiplier * sum of squares * weights.
  * We add multiplier * sum of squares * activations.
* TAR (for temporal activation regularisation).
  * TAR is used to calculate the difference of activations between each layer.
  
## 01:49:09 - Weight tying

* Since predicting the next word is about converting activations to English words.
 * An embedding is about converting words to activations.
 
* Hypothesis: since they're roughly the same idea, can't they use the same weight matrix?
  * Yes! This appears to work in practice. 

In [44]:
class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        
        # This new line of code ensures that the weights of the embedding will always be the same as linear weights.
        self.h_o.weight = self.i_h.weight
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
        
    def forward(self, x):
        raw, h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out), raw, out
    
    def reset(self):
        for h in self.h:
            h.zero_()

## 01:51:00 - TextLearner

* We pass in `RNNRegularizer` callback:

In [45]:
learn = Learner(
    dls,
    LMModel7(len(vocab), 64, 2, 0.5),
    loss_func=CrossEntropyLossFlat(),
    metrics=accuracy, cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])

* Or use the `TextLearner` which passes it for us:

In [48]:
learn = TextLearner(
    dls,
    LMModel7(len(vocab), 64, 2, 0.5),
    loss_func=CrossEntropyLossFlat(),
    metrics=accuracy
)

In [49]:
learn.fit_one_cycle(15, 1e-2, wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,2.583476,2.096658,0.452637,00:01
1,1.619627,1.360577,0.628581,00:01
2,0.874051,0.910055,0.776449,00:01
3,0.441865,0.803067,0.819824,00:02
4,0.226623,0.656291,0.85262,00:01
5,0.127504,0.559974,0.862549,00:01
6,0.077614,0.45028,0.887614,00:01
7,0.056029,0.559095,0.87443,00:02
8,0.038954,0.491778,0.877441,00:02
9,0.030191,0.524026,0.879964,00:01


* We've now reproduced everything in AWD LSTM, which was state of the art a few years ago.

## 01:52:48 - Conclusions

* Go idea to connect with other people in your community or forum who are along the learning journey.