# Recurrent Neural Network — RNN

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

We're going to download the collected works of Nietzsche to use as our data for this class.

In [None]:
PATH='data/nietzsche/'

In [None]:
get_data("https://s3.amazonaws.com/text-datasets/nietzsche.txt", f'{PATH}nietzsche.txt')
text = open(f'{PATH}nietzsche.txt').read()
print('corpus length:', len(text))

In [None]:
text[:400]

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

Sometimes it's useful to have a zero value in the dataset, e.g. for padding

In [None]:
chars.insert(0, "\0")

''.join(chars[1:-6])

Map from chars to indices and back again

In [None]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

idx will be the data we use from now on - it simply converts all the characters to their index (based on the mapping above)

In [None]:
idx = [char_indices[c] for c in text]

idx[:10]

In [None]:
''.join(indices_char[i] for i in idx[:70])
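
As a quick sanity check of the two mappings, here is a minimal self-contained sketch on a made-up toy string (`toy_text` and the other `toy_*` names are illustrative and not part of the notebook). It builds the same kind of vocabulary, including the `"\0"` padding entry at index 0, and confirms that encoding and then decoding returns the original text.

```python
toy_text = "hello world"

# Sorted unique characters, with a padding character at index 0
toy_chars = sorted(set(toy_text))
toy_chars.insert(0, "\0")

# Forward and reverse lookups, built the same way as in the notebook
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as indices, then decode it back to a string
toy_idx = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in toy_idx)

print(toy_idx)
print(decoded)
```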

## Three char model

- Generally, you want to combine a character-level model and a word-level model (e.g. for translation).
- A character-level model is useful when a vocabulary contains unusual words, which a word-level model will just treat as “unknown”. When you see a word you have not seen before, you can use a character-level model.
- There is also something in between called Byte Pair Encoding (BPE), which looks at n-grams of characters.
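
To make the "in between" option a little more concrete, here is a minimal sketch of a single merge step in the spirit of byte pair encoding: count adjacent symbol pairs and merge the most frequent pair into one new symbol. This is a simplified illustration on a toy symbol list, not the tokenizer of any particular library; `most_frequent_pair` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("aaabdaaabac")          # start from single characters
pair = most_frequent_pair(symbols)     # the pair that occurs most often
symbols = merge_pair(symbols, pair)    # vocabulary now contains a 2-character unit
print(pair, symbols)
```

Repeating this step grows a vocabulary of multi-character units, which sits between pure characters and whole words.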

### Create inputs

Create four lists, each taking every 3rd character of the text (step `cs`), starting at the 0th, 1st, 2nd, and 3rd characters respectively

In [None]:
cs=3
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)]

Our Inputs

In [None]:
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

Our Output

In [None]:
y = np.stack(c4_dat)

The first 4 inputs and outputs

In [None]:
x1[:4], x2[:4], x3[:4]

In [None]:
y[:4]

In [None]:
x1.shape, y.shape
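
To see what these strided lists look like, the same comprehensions can be run on a small made-up sequence (`toy_idx` is illustrative). Position `k` across the four lists gives three consecutive inputs followed by the character to predict, and consecutive examples start `cs` characters apart.

```python
toy_idx = list(range(10))  # stand-in for the encoded text: 0, 1, ..., 9
cs = 3

c1 = [toy_idx[i]     for i in range(0, len(toy_idx) - cs, cs)]
c2 = [toy_idx[i + 1] for i in range(0, len(toy_idx) - cs, cs)]
c3 = [toy_idx[i + 2] for i in range(0, len(toy_idx) - cs, cs)]
c4 = [toy_idx[i + 3] for i in range(0, len(toy_idx) - cs, cs)]

# Each tuple is (input 1, input 2, input 3, label)
examples = list(zip(c1, c2, c3, c4))
print(examples)
```

Note that the label of one example is reused as the first input of the next.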

### Create and train model

Pick a size for our hidden state

In [None]:
n_hidden = 256

The number of latent factors to create (i.e. the size of the embedding matrix)

In [None]:
n_fac = 42

In [None]:
class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h), dim=-1)

In [None]:
md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1, x2, x3], axis=1), y, bs=512)

We will reuse ColumnarModelData. If we stack x1, x2, and x3, we will get c1, c2, c3 in the forward method. ColumnarModelData.from_arrays comes in handy when you want to train a model in a more raw fashion: whatever you put in [x1, x2, x3], you get back in def forward(self, c1, c2, c3).

In [None]:
m = Char3Model(vocab_size, n_fac).cuda()

We create a standard PyTorch model with cuda

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

- `iter` to grab an iterator
- `next` returns a mini-batch
- “Variabize” the `xs` tensor and put it through the model, which gives us a 512x85 tensor of predictions (batch size by number of unique characters)

In [None]:
opt = optim.Adam(m.parameters(), 1e-2)

Create a standard PyTorch optimizer — for which you need to pass in a list of things to optimize, which is returned by m.parameters()

In [None]:
fit(m, md, 1, opt, F.nll_loss)

In [None]:
set_lrs(opt, 0.001)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

We do not have the learning rate finder or SGDR here because we are not using Learner, so we need to do learning rate annealing manually (set the LR a little bit lower).

### Test model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

This function takes three characters and returns what the model predicts as the fourth. Note: np.argmax returns the index of the maximum value.

In [None]:
get_next('y. ')

In [None]:
get_next('ppl')

In [None]:
get_next(' th')

In [None]:
get_next('and')

## Our first RNN!

### Create inputs

This is the size of our unrolled RNN.

This time, we will use the first 8 characters to predict the 9th

In [None]:
cs = 8

For each starting position in the text, create a list of the 8 consecutive characters beginning there. These will be the 8 inputs to our model.

In [None]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [None]:
c_out_dat = [idx[j+cs] for j in range(len(idx)-cs)]

In [None]:
xs = np.stack(c_in_dat, axis=0)

In [None]:
xs.shape

In [None]:
y  = np.stack(c_out_dat)

So each row below is one series of 8 characters from the text.

In [None]:
xs[:cs, :cs]

...and this is the next character after each sequence.

In [None]:
y[:cs]

### Create and train model

In [None]:
val_idx = get_cv_idxs(len(idx)-cs-1)

In [None]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

In [None]:
class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        return F.log_softmax(self.l_out(h), dim=-1)

Most of the code is the same as before. You will notice that there is now a for loop in the forward function.

This is now quite a deep network, as it uses 8 characters instead of 3. And as networks get deeper, they become harder to train.



In [None]:
m = CharLoopModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

In [None]:
set_lrs(opt, 0.001)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

**Adding vs. Concatenating**
We will now try something else for self.l_hidden(h+inp). The reason is that the input state and the hidden state are qualitatively different: the input is the encoding of a single character, while h is an encoding of a series of characters. By adding them together, we might lose information. Let’s concatenate them instead. Don’t forget to change the size of the input layer to match (n_fac+n_hidden instead of n_fac).
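
The shape bookkeeping can be checked with plain Python lists standing in for tensors. This is only a sketch with made-up values; the sizes `n_fac = 42` and `n_hidden = 256` come from the notebook, everything else is illustrative. Adding merges the two vectors element by element and needs them to have the same length, while concatenating keeps both parts intact and yields a vector of length `n_fac + n_hidden`.

```python
n_fac, n_hidden = 42, 256

emb = [0.1] * n_fac      # stand-in for one character embedding
h = [0.5] * n_hidden     # stand-in for the hidden state

# Adding: the embedding has to be projected to n_hidden first
# (a fake projection here), and the two sources get mixed together
projected = [0.1] * n_hidden
added = [a + b for a, b in zip(h, projected)]

# Concatenating: both parts survive side by side in one longer vector
concatenated = h + emb

print(len(added), len(concatenated))
```

The second length is what the input layer of the concatenating model has to accept.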

## Concat Model

In [None]:
class CharLoopConcatModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        # The input layer now receives the hidden state concatenated with the embedding
        self.l_in = nn.Linear(n_fac+n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = torch.cat((h, self.e(c)), 1)
            inp = F.relu(self.l_in(inp))
            h = F.tanh(self.l_hidden(inp))
        return F.log_softmax(self.l_out(h), dim=-1)

In [None]:
m = CharLoopConcatModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

In [None]:
fit(m, md, 1, opt, F.nll_loss)

In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 1, opt, F.nll_loss)

### Test Model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [None]:
get_next('for thos')

In [None]:
get_next('part of ')

In [None]:
get_next('queens a')

## RNN with pytorch

PyTorch will write the for loop for us automatically, and also the linear input layer.

In [None]:
class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

- For reasons that will become apparent later on, self.rnn returns not only the output but also the hidden state.
- A minor difference in PyTorch is that self.rnn appends each new hidden state to a tensor instead of replacing the previous one (in other words, it gives back all the ellipses in the diagram). We only want the final one, so we do outp[-1].
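
The relationship between the two return values can be illustrated with a scalar stand-in for an RNN layer. This is a toy sketch, not `nn.RNN` itself; `toy_rnn` and its weights are made up. Every intermediate hidden state is appended to the output list, so the last output is the same value as the final hidden state.

```python
import math

def toy_rnn(inputs, h0, w_ih=0.5, w_hh=0.8):
    """Scalar stand-in for an RNN layer: returns every hidden state and the final one."""
    outputs, h = [], h0
    for x in inputs:
        h = math.tanh(w_ih * x + w_hh * h)
        outputs.append(h)   # appended, not overwritten
    return outputs, h

outp, h = toy_rnn([1.0, -0.5, 0.25, 2.0], h0=0.0)
print(len(outp), outp[-1] == h)
```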

In [None]:

m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [None]:
t = m.e(V(torch.stack(xs)))
t.size()

In [None]:
ht = V(torch.zeros(1, 512,n_hidden))
outp, hn = m.rnn(t, ht)
outp.size(), hn.size()

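Note that the hidden state here is a rank 3 tensor (1, bs, n_hidden), with an extra leading axis. The reason is that you can have a second RNN that runs backwards through the sequence; the idea is that it is going to be better at finding relationships that go backwards. This is called a “bi-directional RNN”. You can also have an RNN that feeds into another RNN, which is called a “multi-layer RNN”. For these RNNs, you need the additional axis in the tensor to keep track of the additional layers of hidden state. For now, we just have 1 there, and get back 1.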

In [None]:
t = m(*V(xs)); t.size()

In [None]:
fit(m, md, 4, opt, F.nll_loss)

In [None]:
set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 2, opt, F.nll_loss)

### Test model

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [None]:
get_next('for thos')

In [None]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [None]:
get_next_n('for thos', 40)

This time, we loop n times calling get_next each time, and each time we will replace our input by removing the first character and adding the character we just predicted.
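
The sliding-window logic can be exercised without a trained model by swapping in a stub predictor. In this sketch, `fake_next` is a hypothetical stand-in for `get_next` that simply returns the next letter of the alphabet, and `generate` mirrors `get_next_n` while also recording each window that was fed to the predictor.

```python
def fake_next(inp):
    """Hypothetical stand-in for the model: the letter after the last character."""
    return chr((ord(inp[-1]) - ord('a') + 1) % 26 + ord('a'))

def generate(inp, n, predict=fake_next):
    res, windows = inp, []
    for _ in range(n):
        windows.append(inp)
        c = predict(inp)
        res += c
        inp = inp[1:] + c   # drop the oldest character, append the prediction
    return res, windows

res, windows = generate('abc', 4)
print(res)
print(windows)
```

The window keeps a fixed length throughout, so the model always sees the same number of characters it was trained on.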

## Multi-output model

Setup

Let's take non-overlapping sets of characters this time

In [None]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(0, len(idx)-cs-1, cs)]

Then create the exact same thing, offset by 1, as our labels

In [None]:
c_out_dat = [[idx[i+j] for i in range(cs)] for j in range(1, len(idx)-cs, cs)]

In [None]:
xs = np.stack(c_in_dat)
xs.shape

In [None]:
ys = np.stack(c_out_dat)
ys.shape

One reason we may want to do this is the redundancy we saw before: with overlapping windows, each example reprocesses almost the same characters as the previous one.

We can make it more efficient by taking non-overlapping sets of characters this time. Because we are doing multi-output, for input characters 0 to 7, the output is the predictions for characters 1 to 8.
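
Running the same two comprehensions on a small made-up sequence shows the layout (`toy_idx` is illustrative). Inputs are non-overlapping blocks of `cs` characters, and each label row is the matching input row shifted by one position.

```python
toy_idx = list(range(20))  # stand-in for the encoded text
cs = 8

c_in = [[toy_idx[i + j] for i in range(cs)]
        for j in range(0, len(toy_idx) - cs - 1, cs)]
c_out = [[toy_idx[i + j] for i in range(cs)]
         for j in range(1, len(toy_idx) - cs, cs)]

for x_row, y_row in zip(c_in, c_out):
    print(x_row, '->', y_row)
```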



In [None]:
xs[:cs,:cs]

In [None]:
ys[:cs,:cs]

In [None]:
val_idx = get_cv_idxs(len(xs)-cs-1)

In [None]:

md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=512)

This will not make our model any more accurate, but we can train it more efficiently.

In [None]:
class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

Notice that we are no longer doing outp[-1], since we want to keep all of the outputs. Everything else is identical. One complexity is that we want to use the negative log-likelihood loss function as before, but F.nll_loss expects a rank 2 tensor of predictions (a mini-batch of vectors) and a rank 1 tensor of targets. Here, the predictions are a rank 3 tensor:
- 8 characters (time steps)
- vocab_size log-probabilities (one per character in the vocabulary)
- for a mini-batch of 512

In [None]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
it = iter(md.trn_dl)
*xst,yt = next(it)

In [None]:
def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size()
    targ = targ.transpose(0,1).contiguous().view(-1)
    return F.nll_loss(inp.view(-1,nh), targ)

- F.nll_loss is the PyTorch loss function.
- Flatten our inputs and targets.
- Transpose the first two axes of the targets, because PyTorch expects 1. sequence length (how many time steps), 2. batch size, 3. hidden state itself. yt.size() is 512 by 8, whereas sl, bs is 8 by 512.
- PyTorch does not generally shuffle the memory order when you do things like ‘transpose’; instead it keeps some internal metadata to treat the tensor as if it were transposed. When you transpose a matrix, PyTorch just updates the metadata. If you ever see an error that says the tensor is not contiguous, add .contiguous() after it and the error goes away.
- .view is the same as np.reshape. -1 means “as long as it needs to be”.
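
Why the transpose matters can be seen with plain Python lists. This is a toy sketch with made-up labels, not the PyTorch code: predictions come out time-first, so flattening them visits (time step 0, sequence 0), (time step 0, sequence 1), and so on, and the batch-first targets must be transposed before flattening to line up with them.

```python
sl, bs = 3, 2   # toy sequence length and batch size

# Targets arrive batch-first: one row per sequence in the mini-batch
targ = [[10, 11, 12],    # sequence 0: labels for time steps 0, 1, 2
        [20, 21, 22]]    # sequence 1

# Order in which flattened time-first predictions appear: (time step, sequence)
pred_order = [(t, b) for t in range(sl) for b in range(bs)]

# Transpose to time-first, then flatten (the role of transpose(0,1) and view(-1))
targ_t = [list(col) for col in zip(*targ)]
flat_targ = [v for row in targ_t for v in row]

# Flattening without the transpose pairs predictions with the wrong labels
naive_flat = [v for row in targ for v in row]

print(flat_targ)
print(naive_flat)
```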

In [None]:

fit(m, md, 4, opt, nll_loss_seq)

In [None]:

set_lrs(opt, 1e-4)

In [None]:
fit(m, md, 1, opt, nll_loss_seq)

### Gradient Explosion

self.rnn(inp, h) is a loop applying the same matrix multiply again and again. If that matrix multiply tends to increase the activations each time, we are effectively raising that increase to the power of 8; we call this a gradient explosion. We want to make sure the initial hidden-to-hidden weight matrix (what we called l_hidden) will not cause our activations to increase or decrease on average.
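
A scalar version of this effect is easy to compute. In the sketch below a single number stands in for the weight matrix, which is a simplifying assumption, and `apply_repeatedly` is an illustrative helper: multiplying by the same weight 8 times either blows the activation up, leaves it unchanged, or shrinks it towards zero.

```python
def apply_repeatedly(weight, steps=8, activation=1.0):
    """Multiply an activation by the same scalar weight `steps` times."""
    for _ in range(steps):
        activation *= weight
    return activation

for w in (1.5, 1.0, 0.5):
    print(w, apply_repeatedly(w))
```

A weight of 1.5 grows the activation to about 25.6 after 8 steps, 0.5 shrinks it to about 0.004, and a weight of exactly 1 (the scalar analogue of an identity matrix) leaves it unchanged.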

### Identity init!

In [None]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [None]:
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))

In [None]:
fit(m, md, 4, opt, nll_loss_seq)

In [None]:
set_lrs(opt, 1e-3)

In [None]:
fit(m, md, 4, opt, nll_loss_seq)

## Stateful model

### Setup

In [None]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

# Note: The student needs to practice her shell skills and prepare her own dataset before proceeding:
# - trn/trn.txt (first 80% of nietzsche.txt)
# - val/val.txt (last 20% of nietzsche.txt)

%ls {PATH}

In [None]:
%ls {PATH}trn

In [None]:
TEXT = data.Field(lower=True, tokenize=list)
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

## RNN

In [None]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
    
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

There is one additional line in the constructor: self.init_hidden(bs) sets self.h to a bunch of zeros.

- If we were to simply do self.h = h and train on a document that is a million characters long, the unrolled version of the RNN would have a million layers (ellipses). A one-million-layer fully connected network is going to be very memory intensive, because in order to apply the chain rule we have to multiply through one million layers while remembering all one million gradients every batch.

- To avoid this, we tell it to forget its history from time to time. We can still remember the state (the values in our hidden matrix) without remembering everything about how we got there.

```
def repackage_var(h):
    return Variable(h.data) if type(h) == Variable else tuple(repackage_var(v) for v in h)
```
- Grab the tensor out of Variable h (remember, a tensor itself does not have any concept of history), and create a new Variable out of that. The new variable has the same value but no history of operations, therefore when it tries to back-propagate, it will stop there.
- forward will process 8 characters and then back-propagate through eight layers; it keeps track of the values in our hidden state but throws away its history of operations. This is called back-prop through time (bptt).
- In other words, after the for loop, just throw away the history of operations and start afresh. So we are keeping our hidden state but we are not keeping our hidden state history.
- Another good reason not to back-propagate through too many layers is that if you have any kind of gradient instability (e.g. gradient explosion or gradient vanishing), the more layers you have, the harder the network gets to train (slower and less resilient).
- On the other hand, the longer bptt means that you are able to explicitly capture a longer memory and more state.
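
The "keep the state, drop the history" idea can be mimicked without autograd. In this toy sketch, `TracedState` is a made-up class whose `history` list stands in for the recorded chain of operations (it is not real autograd), and `repackage` plays the role that `repackage_var` plays above: after every chunk of `bptt` steps the value is carried forward but the history starts again from empty.

```python
import math

class TracedState:
    """Toy hidden state: a value plus the list of inputs that produced it."""
    def __init__(self, value, history=None):
        self.value = value
        self.history = history or []

    def step(self, x, w_ih=0.5, w_hh=0.8):
        new_value = math.tanh(w_ih * x + w_hh * self.value)
        return TracedState(new_value, self.history + [x])

def repackage(state):
    """Keep the value, drop the history."""
    return TracedState(state.value)

bptt = 8
stream = [0.1 * i for i in range(24)]   # made-up input stream: 3 chunks of 8

h = TracedState(0.0)
history_lengths = []
for start in range(0, len(stream), bptt):
    for x in stream[start:start + bptt]:
        h = h.step(x)
    history_lengths.append(len(h.history))  # how far back-prop would have to reach
    h = repackage(h)

print(history_lengths, h.value)
```

Without the `repackage` call the recorded history would keep growing (8, 16, 24 steps) instead of staying at 8.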

Another wrinkle is how to create mini-batches. We do not want to process one section at a time, but a bunch in parallel.
- When we started looking at TorchText for the first time, we talked about how it creates these mini-batches.
- We take a whole long document, consisting of the entire works of Nietzsche or all of the IMDB reviews concatenated together, and split it into 64 equal-sized chunks (NOT chunks of size 64).
- For a document that is 64 million characters long, each “chunk” will be 1 million characters. We stack them together and then split them by bptt, so 1 mini-batch consists of a 64 by bptt matrix.
- The first character of the second chunk (the 1,000,001st character) is likely to be in the middle of a sentence. But that is okay, since it only happens once every million characters.
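
The chunk-then-slice scheme described above can be sketched in a few lines of plain Python. `batchify` and `get_batch` are hypothetical helper names, and the details (dropping the remainder, the bs-by-bptt orientation) follow the description in these notes rather than torchtext's actual implementation, which may differ.

```python
def batchify(tokens, bs):
    """Split one long token list into `bs` equal-length chunks, dropping any remainder."""
    chunk_len = len(tokens) // bs
    return [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(bs)]

def get_batch(chunks, start, bptt):
    """Take `bptt` consecutive tokens from every chunk, giving a bs-by-bptt block."""
    return [chunk[start:start + bptt] for chunk in chunks]

tokens = list(range(26))          # stand-in for the encoded document
chunks = batchify(tokens, bs=4)   # 4 chunks of 6 tokens; the last 2 tokens are dropped
first = get_batch(chunks, 0, bptt=3)
second = get_batch(chunks, 3, bptt=3)
print(first)
print(second)
```

Row `k` of the second mini-batch continues exactly where row `k` of the first one stopped, which is what makes it meaningful to carry the hidden state from one mini-batch to the next.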

**How do we choose the size of bptt?**
- The first consideration is that the mini-batch matrix has a size of bs (# of chunks) by bptt, so your GPU RAM must be able to fit that by your embedding matrix. If you get a CUDA out of memory error, you need to reduce one of these.
- If your training is unstable (e.g. your loss suddenly shoots off to NaN), then you could try decreasing your bptt, because you have fewer layers for gradients to explode through.
- If it is too slow, try decreasing your bptt, because it does one of those steps at a time. The for loop cannot be parallelized (in the current version). There is a recent approach called QRNN (Quasi-Recurrent Neural Network) which does parallelize it, and we hope to cover it in part 2.
- So pick the highest number that satisfies all of these.

In [None]:
m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 4, opt, F.nll_loss)

In [None]:
set_lrs(opt, 1e-4)

fit(m, md, 4, opt, F.nll_loss)

## RNN

In [None]:
# From the pytorch source

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

In [None]:
class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
    
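    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(0) != bs:
            self.init_hidden(bs)
        outp = []
        o = self.h
        # Step through the sequence one time step at a time
        for c in cs:
            o = self.rnn(self.e(c), o)
            outp.append(o)
        outp = self.l_out(torch.stack(outp))
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)

    # RNNCell works on a rank 2 hidden state: (batch size, n_hidden)
    def init_hidden(self, bs): self.h = V(torch.zeros(bs, n_hidden))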

In [None]:
m = CharSeqStatefulRnn2(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [None]:
fit(m, md, 4, opt, F.nll_loss)

## Gated Recurrent Unit (GRU)

In practice, nobody really uses `RNNCell`, since even with `tanh`, gradient explosions are still a problem and we need to use a low learning rate and a small bptt to get them to train. So what we do is replace RNNCell with something like GRUCell.

In [None]:
class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
    
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs:
            self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)

    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [None]:
def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih)
    gh = F.linear(hidden, w_hh, b_hh)
    i_r, i_i, i_n = gi.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)
    
    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n)
    return newgate + inputgate * (hidden - newgate)

- Normally, the input gets multiplied by a weight matrix to create new activations h, which get added to the existing activations straight away. That is not what happens here.
- The input goes into h˜, and it doesn’t just get added to the previous activations: the previous activations first get multiplied by r (the reset gate), which has a value between 0 and 1.
- r is calculated as a matrix multiplication of some weight matrix and the concatenation of our previous hidden state and the new input. In other words, this is a little one-hidden-layer neural net. It gets put through the sigmoid function as well. This mini neural net learns to determine how much of the hidden state to remember (maybe forget it all when it sees a full-stop character, which marks the beginning of a new sentence).
- The z gate (update gate) determines to what degree to use h˜ (the new input version of the hidden state) and to what degree to leave the hidden state the same as before.
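
The last line of the `GRUCell` function above is where this blending happens, with the variable `inputgate` playing the role of the z gate. A scalar sketch makes the behaviour easy to check; `gru_blend` is an illustrative helper and the numbers are made up.

```python
def gru_blend(hidden, newgate, inputgate):
    """Final line of the GRUCell function, applied to scalars."""
    return newgate + inputgate * (hidden - newgate)

hidden, candidate = 0.75, -0.25
kept = gru_blend(hidden, candidate, 1.0)       # gate at 1: old hidden state is kept
replaced = gru_blend(hidden, candidate, 0.0)   # gate at 0: candidate replaces it
halfway = gru_blend(hidden, candidate, 0.5)    # gate at 0.5: midpoint of the two
print(kept, replaced, halfway)
```

Since the gate comes out of a sigmoid, real values fall strictly between these two extremes, giving a weighted average of the old hidden state and the new candidate.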


In [None]:
m = CharSeqStatefulGRU(md.nt, n_fac, 512).cuda()

opt = optim.Adam(m.parameters(), 1e-3)

In [None]:

fit(m, md, 6, opt, F.nll_loss)

In [None]:
set_lrs(opt, 1e-4)
fit(m, md, 3, opt, F.nll_loss)

## Putting it all together: LSTM

LSTM has one more piece of state in it called “cell state” (not just hidden state), so if you do use a LSTM, you have to return a tuple of matrices in init_hidden (exactly the same size as hidden state):

In [None]:
from fastai import sgdr

n_hidden=512

In [None]:
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size, self.nl = vocab_size, nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: 
            self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

In [None]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

- After creating a standard PyTorch model, we usually do something like opt = optim.Adam(m.parameters(), 1e-3). Instead, we will use fast.ai LayerOptimizer which takes an optimizer optim.Adam , our model m , learning rate 1e-2 , and optionally weight decay 1e-5 .
- A key reason LayerOptimizer exists is to do differential learning rates and differential weight decay. The reason we need to use it is that all of the mechanics inside fast.ai assume that you have one of these. If you want to use callbacks or SGDR in code that does not use the Learner class, you need to use this.
- lo.opt returns the optimizer.

In [None]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [None]:
fit(m, md, 2, lo.opt, F.nll_loss)

In [None]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb)

- When we call fit, we can now pass the optimizer from the LayerOptimizer (lo.opt) and also callbacks.
- Here, we use the cosine annealing callback, which requires a LayerOptimizer object. It does cosine annealing by changing the learning rate inside the lo object.
- Concept: create a cosine annealing callback which is going to update the learning rates in the layer optimizer lo. The length of an epoch is equal to len(md.trn_dl): the number of mini-batches in an epoch is the length of the data loader. Since it is doing cosine annealing, it needs to know how often to reset. You can pass in cycle_mult in the usual way. We can even save our model automatically, just like we did with cycle_save_name in Learner.fit.
- We can run a callback at the start of training, of an epoch, or of a batch, or at the end of training, of an epoch, or of a batch.
- Callbacks have been used for CosAnneal (SGDR), decoupled weight decay (AdamW), loss-over-time graphs, etc.
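
As a rough illustration of the schedule, here is a sketch of the standard cosine annealing formula together with the cycle lengths implied by `cycle_mult=2`. It is written from the general description of SGDR rather than copied from fast.ai's `CosAnneal`, so the helper names and exact details are assumptions. With cycles of 1, 2, 4, and 8 epochs, four complete cycles take 15 epochs, which is presumably why `fit` is called with `2**4-1` epochs.

```python
import math

def cosine_lr(lr_max, iteration, cycle_len):
    """Learning rate at `iteration` within a cycle of `cycle_len` iterations."""
    return lr_max / 2 * (1 + math.cos(math.pi * iteration / cycle_len))

def cycle_lengths(n_cycles, first_len=1, cycle_mult=2):
    """Length of each cycle in epochs when each cycle is `cycle_mult` times longer."""
    return [first_len * cycle_mult ** k for k in range(n_cycles)]

lengths = cycle_lengths(4)
total_epochs = sum(lengths)
print(lengths, total_epochs)

# Within one cycle the rate falls from lr_max towards zero, then resets
start = cosine_lr(1e-2, 0, 100)
middle = cosine_lr(1e-2, 50, 100)
end = cosine_lr(1e-2, 100, 100)
print(start, middle, end)
```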

In [None]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)

### Test

In [None]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [None]:
get_next('for thos')

In [None]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [None]:
print(get_next_n('for thos', 400))