<a href="https://colab.research.google.com/github/joshgregory42/practical_deep_learning/blob/main/ch_12_nlp_dive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Language Model from Scratch

## The Data

Dataset is called *Human Numbers*, which contains the first 10,000 numbers written out in English. This is a dataset that will let us try out methods quickly and easily and interpret the results.

Download the dataset the usual way:

In [1]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

In [2]:
path.ls()

(#2) [Path('/root/.fastai/data/human_numbers/train.txt'),Path('/root/.fastai/data/human_numbers/valid.txt')]

Open up those two files and see what's inside. First will join everything together and ignore the train/valid split (will come back to it later):

In [3]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

Take all those lines and concatenate them, separating them with '.':

In [4]:
text = ' . '.join([l.strip() for l in lines])

text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

Tokenize by splitting on spaces:

In [5]:
tokens = text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

To numericalize, have to create a list of all the unique tokens (our *vocab*):

In [6]:
vocab = L(*(tokens)).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

Convert our tokens into numbers by looking up the index of each in the vocab:

In [7]:
word2idx = {w:i for i, w in enumerate(vocab)}

nums = L(word2idx[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]
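
The lookup can be sanity-checked by mapping the indices back to words. This is a minimal sketch in plain Python on a made-up seven-token string, not the real dataset; `dict.fromkeys` is used here only as a stand-in for `L(...).unique()` because it keeps first-seen order, matching the vocab output above.

```python
# Toy round trip: tokens -> indices -> tokens (plain Python, no fastai).
text = "one . two . three . four"
tokens = text.split(" ")

# dict.fromkeys keeps first-seen order, like the vocab shown above.
vocab = list(dict.fromkeys(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}

nums = [word2idx[t] for t in tokens]
decoded = [vocab[i] for i in nums]

print(vocab)              # ['one', '.', 'two', 'three', 'four']
print(nums)               # [0, 1, 2, 1, 3, 1, 4]
print(decoded == tokens)  # True
```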

## First Language Model from Scratch

A simple way to turn this into a neural network would be to specify that we are going to predict each word based on the previous three words. Could create a list of every sequence of three words as our independent variables, and the next word after each sequence as the dependent variable.

Can do that with plain Python. First do it with tokens to confirm what it looks like:

In [8]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4,3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Now do it with tensors of the numericalized values, which is what the model will actually use:

In [9]:
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

Can batch those using the `DataLoader` class. For now we'll split the sequences at the 80% mark, keeping them in order (no shuffling):

In [10]:
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

Now we create a neural network architecture that takes three words as input, and returns a prediction of the probability of each possible next word in the vocab. Use three standard layers, with a few changes:

First change is that the first linear layer will only use the first word's embedding as activations, the second layer will use the second word's embedding plus the first layer's output activations, and the third layer will use the third word's embedding plus the second layer's output activations. Key effect here is that every word is interpreted in the information context of any words preceding it.

Second main change is that each of these three layers will use the same weight matrix. This means that the way one word impacts the activations from previous words should not change depending on the position of the word. So a layer does not learn one sequence position; must learn to handle all positions.

Since layer weights don't change, could think of the sequential layers as "the same layer" repeated.

## Our Language Model in PyTorch

Create the language model module that we described earlier:

In [11]:
class LMModel1(Module):
  def __init__(self, vocab_sz, n_hidden):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden, vocab_sz)

  def forward(self, x):
    h = F.relu(self.h_h(self.i_h(x[:, 0])))
    h = h + self.i_h(x[:, 1])
    h = F.relu(self.h_h(h))
    h = h + self.i_h(x[:, 2])
    h = F.relu(self.h_h(h))
    return self.h_o(h)

We've created three layers here:

* The embedding layer (`i_h`, for *input* to *hidden*)
* The linear layer to create the activations for the next word (`h_h`, for *hidden* to *hidden*)
* A final linear layer to predict the fourth word (`h_o`, for *hidden* to *output*)



Try training this model and see what happens:

In [12]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
                              metrics=accuracy)

learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.831179,1.91837,0.470644,00:02
1,1.394328,1.802045,0.468267,00:02
2,1.410896,1.697776,0.491086,00:01
3,1.393379,1.648621,0.490849,00:01


Let's compare this to what a really simple model would give us. We could always predict the most common token, so let's find out which token is most often the target in our validation set:

In [13]:
n, counts = 0, torch.zeros(len(vocab))

for x, y in dls.valid:
  n += y.shape[0]
  for i in range_of(vocab): counts[i] += (y==i).long().sum()

idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

(tensor(29), 'thousand', 0.15165200855716662)

Most common token has the index of 29, which corresponds to `thousand`. So if we always predicted this token we would have an accuracy of roughly 15%, so our model is doing much better.
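
The same "always predict the most common target" baseline can be sketched without tensors. The helper name `majority_baseline` and the toy target list are assumptions for illustration; they are not the validation targets used above.

```python
from collections import Counter

def majority_baseline(targets):
    """Return (most common target, accuracy of always predicting it)."""
    counts = Counter(targets)
    token, freq = counts.most_common(1)[0]
    return token, freq / len(targets)

# Hypothetical toy targets, not the Human Numbers validation set.
toy_targets = ["thousand", ".", "thousand", "one", "thousand", "."]
token, acc = majority_baseline(toy_targets)
print(token, acc)  # thousand 0.5
```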

This baseline is okay. Let's see how we can refactor it with a loop.

## Our First Recurrent Neural Network

We could simplify our module code by replacing it with code that calls the layers with a `for` loop. This would make our code simpler and also let us apply our module equally well to token sequences of different lengths. Won't be limited to token lists of length three:

In [14]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden,vocab_sz)

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

Check that we get the same results using this refactoring:

In [15]:
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.845536,2.057746,0.478488,00:01
1,1.396628,1.765957,0.48348,00:02
2,1.397797,1.665134,0.493939,00:02
3,1.347163,1.651769,0.494176,00:01


Note that a neural network that has a loop like this is called a *recurrent neural network* (RNN). An RNN isn't something new, it's just a refactoring of a multilayer neural network using a `for` loop. Could just call it a "looping neural network" and it would mean the same thing.
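
`LMModel2` still hardcodes `range(3)`. As a sketch of the "any sequence length" point, the loop bound can be read off the input instead. This version uses plain `torch.nn.Module` rather than fastai's `Module`, and the class name `LoopRNN` is an assumption, not something from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopRNN(nn.Module):
    """Hypothetical variant of LMModel2 that loops over however many tokens it gets."""
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0
        for i in range(x.shape[1]):   # sequence length taken from the input
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

model = LoopRNN(vocab_sz=30, n_hidden=64)
short = torch.randint(0, 30, (8, 3))      # 8 sequences of 3 tokens
longer = torch.randint(0, 30, (8, 7))     # same weights, 7 tokens each
print(model(short).shape, model(longer).shape)
```

Both calls return one row of 30 scores per sequence, because the same three layers are reused at every step.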

## Improving the RNN



Some issues from the code for our RNN:

We are initializing our hidden state to zero for every new input sequence. This is a problem because we made our sample sequences short so they would fit easily into batches. But if we order the samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence.

Can also look at having more signal. Why predict only the fourth word when we could use the intermediate predictions to also predict the second and third words?

### Maintaining the State of an RNN

Since we initialize the model's hidden state to zero for each new sample, we're throwing away all the info about the sentences we've already seen. The fix here is to put the initialization of the hidden state into `__init__`.

But here's the catch: this makes our NN as deep as the number of tokens in our document. So if we have 10,000 tokens in our dataset, we would have a 10,000-layer neural network. Problem here is that if we get to the 10,000th word in the dataset, need to calculate all of those derivatives, which would be really slow and probably not work.

Solution: Tell PyTorch that we don't want to backpropagate the derivatives through the entire implicit neural network. Instead we'll just keep the last three layers of gradient.

To remove all of the gradient history in PyTorch, use the `detach` method.
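
A tiny sketch of what `detach` does to the gradient history, using scalar tensors instead of a real hidden state (the variable names are made up for the example):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
h = w * 3.0           # h is part of the autograd graph
h_cut = h.detach()    # same value, but no gradient history

out = h_cut * w       # only this last multiplication is tracked
out.backward()

print(h.requires_grad, h_cut.requires_grad)  # True False
print(w.grad)                                # tensor(6.)
```

Without the `detach`, `out` would be `3 * w * w` and the gradient would be 12. With it, `h_cut` is treated as the constant 6, so backpropagation stops there.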

So here's the new RNN, which is now stateful:

In [16]:
class LMModel3(Module):
  def __init__(self, vocab_sz, n_hidden):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden, vocab_sz)
    self.h = 0

  def forward(self, x):
    for i in range(3):
      self.h = self.h + self.i_h(x[:, i])
      self.h = F.relu(self.h_h(self.h))

    out = self.h_o(self.h)
    self.h = self.h.detach()
    return out

  def reset(self): self.h = 0


This model has the same activations whatever sequence length we pick, since the hidden state will remember the last activation from the previous batch.

Difference here is that we'll use something called *backpropagation through time* (BPTT), where the gradients computed at each step will only be calculated on sequence length tokens in the past, **not** the whole stream.

BPTT: Treating a neural net with effectively one layer per step (usually refactored using a loop) as one big model, and calculating gradients on it in the usual way. To avoid running out of memory and time, usually use *truncated* BPTT, which "detaches" the history of computation steps in the hidden state every few time steps.

Split the dataset into groups:

In [17]:
m = len(seqs)//bs # // is floor division (divide then round down)

m, bs, len(seqs)

(328, 64, 21031)

First batch will be composed of the samples

$$ (0, m, 2m, \ldots, (bs-1) m) $$

Second batch:

$$ (1, m+1, 2m+1, \ldots (bs-1)m + 1) $$

and so on. So at each epoch, the model will see a chunk of contiguous text of size $3m$ (since each text is of size 3) on each line of the batch.

Function that does that reindexing:

In [18]:
def group_chunks(ds, bs):
  m = len(ds) // bs
  new_ds = L()
  for i in range(m): new_ds += L(ds[i+m*j] for j in range(bs))
  return new_ds
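
To see what the reindexing does, here is a plain-list sketch on 12 stand-in samples with a batch size of 3. `group_chunks_list` is a hypothetical name for a list-only version of the function above.

```python
def group_chunks_list(ds, bs):
    """Plain-list version of the reindexing above (hypothetical helper name)."""
    m = len(ds) // bs
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

ds = list(range(12))   # stand-in for 12 consecutive samples
bs = 3
grouped = group_chunks_list(ds, bs)
batches = [grouped[k:k + bs] for k in range(0, len(grouped), bs)]
print(batches)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]

# Row j of the batches, read batch after batch, is a contiguous run of samples.
rows = [[b[j] for b in batches] for j in range(bs)]
print(rows)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Each position in the batch follows its own contiguous stretch of text, which is what lets the hidden state carry over from one batch to the next.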

Then pass `drop_last=True` when building our `DataLoaders` to drop the last batch that does not have the correct shape. Also pass `shuffle=False` so that the texts are read in order:

In [19]:
cut = int(len(seqs) * 0.8)

dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False
)

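Also add the `ModelResetter` callback, which calls the model's `reset` method at the start of each epoch and before each validation phase, so each pass starts from a clean hidden state.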

In [20]:
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)

learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.717644,1.880074,0.436538,00:01
1,1.261081,1.610183,0.515385,00:01
2,1.077195,1.648596,0.521154,00:01
3,1.004429,1.610231,0.543269,00:02
4,0.957333,1.632679,0.567308,00:02
5,0.894831,1.969951,0.556731,00:01
6,0.873104,1.62545,0.581971,00:01
7,0.812236,1.727251,0.609615,00:01
8,0.77313,1.743647,0.622115,00:01
9,0.759798,1.74221,0.620913,00:02


## Creating More Signal

An issue here is that we only predict one output word for every three input words. So the amount of signal that we're feeding back to update weights with is not as large as it could be. Would be better if we predicted the next word after every single word rather than every three words.

To do this, we need to first change our data so that the dependent variable has each of the three next words after each of our three input words. Instead of `3`, we use the attribute `sl` (sequence length) and make it a bit bigger:

In [21]:
sl = 16

seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
          for i in range(0, len(nums)-sl-1, sl))

cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

First element of `seqs` contains two lists of the same size. Second list is the same as the first, but offset by one element:

In [22]:
[L(vocab[o] for o in s) for s in seqs[0]]

[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]

Now modify our model so it outputs a prediction after every word, rather than just at the end of a three-word sequence:

In [23]:
class LMModel4(Module):
  def __init__(self, vocab_sz, n_hidden):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden, vocab_sz)
    self.h = 0

  def forward(self, x):
    outs = []
    for i in range(sl):
      self.h = self.h + self.i_h(x[:, i])
      self.h = F.relu(self.h_h(self.h))
      outs.append(self.h_o(self.h))
    self.h = self.h.detach()
    return torch.stack(outs, dim=1)

  def reset(self): self.h = 0

Returns outputs of shape `bs x sl x vocab_sz` (since we stacked on `dim=1`). Targets are of shape `bs x sl`, so need to flatten them before using cross entropy:

In [24]:
def loss_func(inp, targ):
  return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
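
A quick shape check of the flattening, on random stand-in tensors rather than real model output (the sizes below are toy values chosen for the example):

```python
import torch
import torch.nn.functional as F

bs, sl, vocab_sz = 4, 16, 30                     # toy sizes; 30 matches the vocab above
logits = torch.randn(bs, sl, vocab_sz)           # stand-in for the model output
targets = torch.randint(0, vocab_sz, (bs, sl))   # stand-in for the shifted tokens

flat_logits = logits.view(-1, vocab_sz)          # (bs*sl, vocab_sz)
flat_targets = targets.view(-1)                  # (bs*sl,)
loss = F.cross_entropy(flat_logits, flat_targets)

print(flat_logits.shape, flat_targets.shape, loss.shape)
```

Every position in every sequence becomes its own row, so the loss is averaged over all `bs * sl` predictions.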

In [25]:
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)

learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.329616,3.229458,0.106771,00:01
1,2.423797,1.894349,0.467204,00:01
2,1.787894,1.823074,0.466634,00:00
3,1.472199,1.701404,0.515299,00:00
4,1.240452,1.842897,0.556478,00:00
5,1.079337,1.897636,0.556396,00:00
6,0.947961,1.920025,0.565674,00:00
7,0.846254,1.978843,0.574137,00:00
8,0.779088,2.027684,0.599609,00:00
9,0.70787,2.146893,0.605143,00:00


Accuracy is okay, but we need to train for longer, since the task has changed a bit and is more complicated now.

Obvious way to get a better model is to go deeper. Only have one linear layer between the hidden state and the output activations in our basic RNN, so maybe need to go deeper.

## Multilayer RNNs

In a multilayer RNN, pass the activations from one RNN into another RNN.

### The Model

Can save time by using PyTorch's `RNN` class, which does what we were doing earlier, but also lets us stack multiple RNNs:

In [26]:
class LMModel5(Module):
  def __init__(self, vocab_sz, n_hidden, n_layers):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
    self.h_o = nn.Linear(n_hidden, vocab_sz)
    self.h = torch.zeros(n_layers, bs, n_hidden)

  def forward(self, x):
    res, h = self.rnn(self.i_h(x), self.h)
    self.h = h.detach()
    return self.h_o(res)

  def reset(self): self.h.zero_()

In [27]:
learn = Learner(dls, LMModel5(len(vocab), 64, 2),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.039893,2.63324,0.442057,00:01
1,2.146945,1.770291,0.468831,00:01
2,1.703722,1.892327,0.352702,00:02
3,1.489365,1.773435,0.426514,00:01
4,1.331773,1.70386,0.480713,00:01
5,1.199094,1.65688,0.521647,00:01
6,1.054824,1.616002,0.524333,00:01
7,0.934596,1.797078,0.523275,00:02
8,0.829899,1.858169,0.518148,00:01
9,0.749235,1.770244,0.549723,00:01


Things got worse: the two-layer RNN ends with lower accuracy than the single-layer model. Why?

### Exploding or Disappearing Activations

Creating accurate models from this kind of RNN is difficult. Would get better results if we call `detach` less often and have more layers. But this also means we have a deeper model to train. Key challenge in DL has been to figure out how to train these kinds of models.

The reason this is challenging is because of what happens when you multiply by a matrix many times. Think about what happens when you multiply by a number many times. For example, if you multiply by 2, starting at 1, you get the sequence 1, 2, 4, 8,... after 32 steps you are already at 4,294,967,296. A similar issue happens if you multiply by 0.5: you get 0.5, 0.25, 0.125… and after 32 steps it's 0.00000000023. As you can see, multiplying by a number even slightly higher or lower than 1 results in an explosion or disappearance of our starting number, after just a few repeated multiplications.

Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And that's all a deep neural network is: each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large or extremely small numbers.
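
The arithmetic above is easy to reproduce. This sketch repeats the scalar example and then does the same with a 2x2 matrix in plain Python; the matrices scaled by 1.1 and 0.9 are made-up values chosen only to show the effect, not weights from any model here.

```python
def repeat_scale(factor, steps, start=1.0):
    """Multiply `start` by `factor` `steps` times."""
    value = start
    for _ in range(steps):
        value *= factor
    return value

print(repeat_scale(2.0, 32))   # 4294967296.0
print(repeat_scale(0.5, 32))   # about 2.3e-10

def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matrix_power(m, steps):
    """Multiply the identity by `m` `steps` times."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        result = matmul2(result, m)
    return result

grow = [[1.1, 0.0], [0.0, 1.1]]     # hypothetical weights slightly above 1
shrink = [[0.9, 0.0], [0.0, 0.9]]   # hypothetical weights slightly below 1
print(matrix_power(grow, 100)[0][0])    # roughly 1.4e4
print(matrix_power(shrink, 100)[0][0])  # roughly 2.7e-5
```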

This is a problem, because the way computers store numbers (known as "floating point") means that they become less and less accurate the further away the numbers get from zero. The excellent article ["What You Never Wanted to Know About Floating Point but Will Be Forced to Find Out"](http://www.volkerschatz.com/science/float.html) includes a diagram showing how the precision of floating-point numbers varies over the number line.
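
The loss of precision can be seen directly with `math.ulp`, which returns the gap between a float and the next representable one (Python 3.9+). Note this sketch uses Python's 64-bit floats; PyTorch defaults to 32-bit floats, where the gaps are larger still.

```python
import math

# The gap between neighbouring floats grows with the magnitude of the number.
for x in (1.0, 1e8, 1e16):
    print(x, math.ulp(x))

# At 1e16 the gap is 2.0, so adding 1.0 is lost to rounding.
print(1e16 + 1.0 == 1e16)  # True
```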

This inaccuracy means that often the gradients calculated for updating the weights end up as zero or infinity for deep networks. This is commonly referred to as the *vanishing gradients* or *exploding gradients* problem. It means that in SGD, the weights are either not updated at all or jump to infinity. Either way, they won't improve with training.

For RNNs, there are two types of layers that are frequently used to avoid exploding activations: *gated recurrent units* (GRUs) and *long short-term memory* (LSTM) layers. Both of these are available in PyTorch, and are drop-in replacements for the RNN layer. We will only cover LSTMs in this book; there are plenty of good tutorials online explaining GRUs, which are a minor variant on the LSTM design.

## LSTM

In this picture, our input $x_{t}$ enters on the left with the previous hidden state ($h_{t-1}$) and cell state ($c_{t-1}$). The four orange boxes represent four layers (our neural nets) with the activation being either sigmoid ($\sigma$) or tanh. tanh is just a sigmoid function rescaled to the range -1 to 1. Its mathematical expression can be written like this:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x}+e^{-x}} = 2 \sigma(2x) - 1$$

where $\sigma$ is the sigmoid function. The green circles are elementwise operations. What goes out on the right is the new hidden state ($h_{t}$) and new cell state ($c_{t}$), ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.
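
The identity relating tanh and the sigmoid above can be checked numerically. A minimal sketch using only the standard library; the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    lhs = math.tanh(x)
    rhs = 2.0 * sigmoid(2.0 * x) - 1.0
    print(f"x={x:+.1f}  tanh={lhs:+.6f}  2*sigmoid(2x)-1={rhs:+.6f}")
```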

Let's go over the four neural nets (called *gates*) one by one and explain the diagram—but before this, notice how very little the cell state (at the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.

First, the arrows for input and old hidden state are joined together. In the RNN we wrote earlier in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`; thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.

The first gate (looking from left to right) is called the *forget gate*. Since it’s a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell state to determine which information to keep and which to throw away: values closer to 0 are discarded and values closer to 1 are kept. This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect it to (have learned to) reset its cell state.

The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance, we may see a new gender pronoun, in which case we'll need to replace the information about gender that the forget gate removed. Similar to the forget gate, the input gate decides which elements of the cell state to update (values close to 1) or not (values close to 0). The third gate determines what those updated values are, in the range of –1 to 1 (thanks to the tanh function). The result is then added to the cell state.

The last gate is the *output gate*. It determines which information from the cell state to use to generate the output. The cell state goes through a tanh before being combined with the sigmoid output from the output gate, and the result is the new hidden state.
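
The four gates and the two state updates described above can be summarized as equations. The symbols $W_f, W_i, W_g, W_o$ and $b_f, b_i, b_g, b_o$ (weights and biases of the four linear layers), $g_t$ (output of the cell gate), and $[h_{t-1}, x_t]$ (the stacked hidden state and input) are notation assumed here for the summary; $\odot$ is elementwise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
g_t &= \tanh(W_g [h_{t-1}, x_t] + b_g) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$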

In terms of code, we can write the same steps like this:

In [28]:
class LSTMCell(Module):
  def __init__(self, ni, nh):
    self.forget_gate = nn.Linear(ni + nh, nh)
    self.input_gate = nn.Linear(ni + nh, nh)
    self.cell_gate = nn.Linear(ni + nh, nh)
    self.output_gate = nn.Linear(ni + nh, nh)

  def forward(self, input, state):
    h, c = state
    h = torch.cat([h, input], dim=1)
    forget = torch.sigmoid(self.forget_gate(h))
    c = c * forget
    inp = torch.sigmoid(self.input_gate(h))
    cell = torch.tanh(self.cell_gate(h))
    c = c + inp * cell
    out = torch.sigmoid(self.output_gate(h))
    h = out * torch.tanh(c)
    return h, (h, c)

Can refactor the code so that we do one giant matrix multiplication instead of four smaller ones (which is better for the GPU). Stacking the input and hidden state takes a bit of time (need to move one of the tensors around on the GPU to have it all in one array), so use two separate layers for the input and hidden states. Code is now:

In [29]:
class LSTMCell(Module):
  def __init__(self, ni, nh):
    self.ih = nn.Linear(ni, 4*nh)
    self.hh = nn.Linear(nh, 4*nh)

  def forward(self, input, state):
    h, c = state
    # One big multiplication per layer instead of four smaller ones, then split into the four gates
    gates = (self.ih(input) + self.hh(h)).chunk(4, 1)
    ingate, forgetgate, outgate = map(torch.sigmoid, gates[:3])
    cellgate = gates[3].tanh()

    c = (forgetgate * c) + (ingate * cellgate)
    h = outgate * c.tanh()

    return h, (h, c)

Use PyTorch `chunk` to split tensor into four pieces. Example:

In [30]:
t = torch.arange(0, 10); t

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [31]:
t.chunk(2)

(tensor([0, 1, 2, 3, 4]), tensor([5, 6, 7, 8, 9]))
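
In the LSTM cell the call is `chunk(4, 1)`, which splits along dimension 1 (the feature axis) rather than dimension 0. A small sketch with toy sizes, where `bs` and `nh` are made-up values for the example:

```python
import torch

bs, nh = 2, 3                          # toy batch size and hidden size
gates = torch.arange(bs * 4 * nh).view(bs, 4 * nh)

pieces = gates.chunk(4, 1)             # 4 chunks along dim 1
print(len(pieces), pieces[0].shape)    # 4 torch.Size([2, 3])
print(pieces[0])
```

Each piece keeps the full batch dimension and gets `nh` columns, one piece per gate.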

### Training a Language Model Using LSTMs

Same network as `LMModel5`, using a two-layer LSTM. Can train it at a higher learning rate, for a shorter time, and get a better accuracy:

In [35]:
class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)

    def reset(self):
        for h in self.h: h.zero_()



In [36]:
learn = Learner(dls, LMModel6(len(vocab), 64, 2),
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.031378,2.795523,0.245036,00:01
1,2.242504,1.741623,0.471029,00:01
2,1.668135,1.780295,0.466878,00:01
3,1.355659,1.869319,0.513184,00:01
4,1.121206,2.043567,0.528239,00:01
5,0.938162,1.95034,0.584229,00:01
6,0.763232,2.022436,0.589193,00:02
7,0.610504,1.821868,0.618815,00:01
8,0.473664,1.482878,0.664388,00:01
9,0.369229,1.297714,0.704834,00:01


There's some overfitting, so let's try regularizing things.

## Regularizing an LSTM

Can use dropout:

In [37]:
class Dropout(Module):
  def __init__(self, p): self.p = p

  def forward(self, x):
    if not self.training: return x
    mask = x.new(*x.shape).bernoulli_(1-self.p)
    return x * mask.div_(1-self.p)

`bernoulli_` method creates a tensor of random zeros (with probability `p`) and ones (prob. `1-p`), which is then multiplied by our input before dividing by `1-p`. Note that when using `Dropout`, need to specify whether we are training or validating, since it behaves differently in each mode (the `training` attribute is what turns it off at validation time).
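
The reason for dividing by `1-p` is that it keeps the expected activation unchanged. A sketch that checks this on a tensor of ones; `dropout_mask` is a hypothetical standalone function (not the class above), and the seed is arbitrary.

```python
import torch

def dropout_mask(x, p):
    """Zero each element with probability p and scale survivors by 1/(1-p)."""
    mask = torch.bernoulli(torch.full_like(x, 1 - p))
    return x * mask / (1 - p)

torch.manual_seed(0)          # arbitrary seed, only for repeatability
x = torch.ones(10_000)
y = dropout_mask(x, p=0.5)

print(y.unique())   # survivors are scaled to 2.0, the rest are 0.0
print(y.mean())     # close to 1.0, so the average activation is preserved
```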

### Activation Regularization and Temporal Activation Regularization

Both are similar to weight decay. When we use weight decay, we add a small penalty to the loss that aims at making the weights as small as possible.

For activation regularization, it's the final activations from the LSTM that we try to make as small as possible, instead of the weights.

First need to store the activations somewhere, then add the means of the squares of them to the loss (with `alpha`, which is just like `wd` for weight decay).

Temporal activation regularization is linked to the fact we are predicting tokens in a sentence. That means it's likely that the outputs of our LSTMs should somewhat make sense when we read them in order. TAR is there to encourage that behavior by adding a penalty to the loss to make the difference between two consecutive activations as small as possible: our activations tensor has a shape `bs x sl x n_hid`, and we read consecutive activations on the sequence length axis (the dimension in the middle). With this, TAR can be expressed as:

``` python
loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()
```

`alpha` and `beta` are then two hyperparameters to tune. To make this work, we need our model with dropout to return three things: the proper output, the activations of the LSTM pre-dropout, and the activations of the LSTM post-dropout. AR is often applied on the dropped-out activations (to not penalize the activations we turned into zeros afterward) while TAR is applied on the non-dropped-out activations (because those zeros create big differences between two consecutive time steps). There is then a callback called `RNNRegularizer` that will apply this regularization for us.
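
As a rough sketch of the two penalties on toy tensors (this is not the `RNNRegularizer` implementation): `ar_tar_penalty` is a hypothetical helper, and the `alpha` and `beta` values are arbitrary, not tuned.

```python
import torch

def ar_tar_penalty(dropped, raw, alpha, beta):
    """AR on the dropped-out activations plus TAR on the raw ones."""
    ar = alpha * dropped.pow(2).mean()
    tar = beta * (raw[:, 1:] - raw[:, :-1]).pow(2).mean()
    return ar + tar

bs, sl, n_hid = 2, 5, 3        # toy sizes
alpha, beta = 2.0, 1.0         # hypothetical values, not tuned

# Identical activations at every time step: TAR is zero, only AR contributes.
constant = torch.full((bs, sl, n_hid), 0.5)
print(ar_tar_penalty(constant, constant, alpha, beta))   # 2.0 * 0.25 = 0.5

# Activations that rise by 1 per time step: every consecutive difference is 1.
ramp = torch.arange(sl, dtype=torch.float32).view(1, sl, 1).expand(bs, sl, n_hid)
print(ar_tar_penalty(torch.zeros_like(ramp), ramp, alpha, beta))  # TAR only: 1.0
```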