<a href="https://colab.research.google.com/github/hduongck/AI-ML-Learning/blob/master/2019%20Fastai%20Deep%20Learning/2019_Deep_Learning_7_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Recurrent Neural Network (RNN) [1:38:31](https://youtu.be/nWpdkZE2_cc?t=5911)

One thing we haven't gotten to yet is RNNs, so that's the last thing we're going to do. I'm going to introduce a little diagrammatic method here to explain RNNs, and I'll start by showing you a basic neural net with a single hidden layer.

![](https://github.com/hiromis/notes/raw/master/lesson7/46.png?raw=true)

Rectangle means an input. That'll be batch size by number of inputs. An arrow means a layer (broadly defined) such as matrix product followed by ReLU. A circle is activations. So in this case, we have one set of hidden activations, and given that the input was number of inputs, this here (the first arrow) is a matrix of number of inputs by number of activations. So the output will be batch size by number of activations.

It's really important you know how to calculate these shapes. So run **learn.summary()** a lot to see all the shapes. Then here's another arrow, so that means it's another layer; matrix product followed by non-linearity. In this case, we go into the output, so we use softmax.

Then triangle means an output. This matrix product will be number of activations by number of classes, so our output is batch size by number of classes.
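As a quick way to practice the shape bookkeeping described above, here is a minimal sketch in plain Python. The function name `mlp_shapes` and the sizes passed to it (100 inputs, 50 activations, 10 classes) are made up for illustration; only the rule "rows of one matrix product feed the columns of the next" comes from the text.

```python
def mlp_shapes(bs, n_in, n_act, n_classes):
    """Return the shape of every tensor in a net with one hidden layer."""
    return {
        "input": (bs, n_in),              # rectangle
        "weights_1": (n_in, n_act),       # first arrow (matrix product + ReLU)
        "hidden": (bs, n_act),            # circle
        "weights_2": (n_act, n_classes),  # second arrow (matrix product + softmax)
        "output": (bs, n_classes),        # triangle
    }

# Hypothetical sizes, chosen only to make the shapes easy to tell apart.
shapes = mlp_shapes(bs=64, n_in=100, n_act=50, n_classes=10)
for name, shape in shapes.items():
    print(f"{name:10s} {shape}")
```

In a real model, `learn.summary()` reports the activation shapes for you; the sketch is only for checking your own arithmetic.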

![](https://github.com/hiromis/notes/raw/master/lesson7/47.png?raw=true)

Let's reuse that key: triangle is output, circle is activations (we also call that hidden state), and rectangle is input. Let's now imagine that we wanted to get a big document, split it into sets of three words at a time, and grab each set of three words and then try to predict the third word using the first two words. If we had the dataset in place, we could:

1. Grab word 1 as an input.
2. Chuck it through an embedding, create some activations.
3. Pass that through a matrix product and nonlinearity.
4. Grab the second word.
5. Put it through an embedding.
6. Then we could either add those two things together or concatenate them. Generally speaking, when you see two sets of activations coming together in a diagram, you normally have a choice of concatenate or add. And that's going to create the second bunch of activations.
7. Then you can put it through one more fully connected layer and softmax to create an output.

So that would be a totally standard, fully connected neural net with one very minor tweak which is concatenating or adding at this point, which we could use to try to predict the third word from pairs of two words.
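The "dataset in place" mentioned above could be built roughly like this. This is a sketch under assumptions: the helper name `three_word_samples`, the whitespace tokenization, and the example sentence are all invented here, and the groups are taken as non-overlapping sets of three, as the text describes.

```python
def three_word_samples(text):
    """Split text into consecutive groups of three words.

    Each sample is ((word1, word2), word3): two inputs and one target.
    """
    words = text.split()
    samples = []
    # Step by 3 so the groups don't overlap; an incomplete final group is dropped.
    for i in range(0, len(words) - 2, 3):
        samples.append(((words[i], words[i + 1]), words[i + 2]))
    return samples

samples = three_word_samples("the cat sat on the mat and then slept")
for inputs, target in samples:
    print(inputs, "->", target)
```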

Remember, arrows represent layer operations and I removed in this one the specifics of what they are because they're always an affine function followed by a non-linearity.

![](https://github.com/hiromis/notes/blob/master/lesson7/48.png?raw=true)

Let's go further. What if we wanted to predict word 4 using words 1 and 2 and 3? It's basically the same picture as last time except with one extra input and one extra circle. But I want to point something out which is each time we go from rectangle to circle, we're doing the same thing - we're doing an embedding. Which is just a particular kind of matrix multiply where you have a one hot encoded input. Each time we go from circle to circle, we're basically taking one piece of hidden state (a.k.a activations) and turning it into another set of activations by saying we're now at the next word. Then when we go from circle to triangle, we're doing something else again which is we're saying let's convert the hidden state (i.e. these activations) into an output. So I've colored each of those arrows differently. Each of those arrows should probably use the same weight matrix because it's doing the same thing. So why would you have a different set of embeddings for each word or a different matrix to multiply by to go from this hidden state to this hidden state versus this one? So this is what we're going to build.
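The claim that an embedding is just a matrix multiply with a one-hot encoded input can be checked by hand. The sketch below uses plain Python lists; the 4-word vocabulary, the matrix values, and the helper names are hypothetical.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Row vector times matrix."""
    n_cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(n_cols)]

# Hypothetical embedding matrix: 4 words in the vocab, 3 activations per word.
emb = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_idx = 2
via_matmul = vec_mat(one_hot(word_idx, len(emb)), emb)  # one-hot times matrix
via_lookup = emb[word_idx]                              # plain row lookup
print(via_matmul, via_lookup)
```

Both routes pick out the same row, which is why an embedding layer can skip the multiplication and index into the matrix directly.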

**Human numbers** [1:43:11](https://youtu.be/nWpdkZE2_cc?t=6191)

We're now going to jump into human numbers which is [lesson7-human-numbers.ipynb](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson7-human-numbers.ipynb). This is a dataset that I created which literally just contains all the numbers from 1 to 9,999 written out in English.
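To get a feel for what the file contains, here is a guess at how such text could be generated. This is not the script used to build the dataset; `to_words` and its lookup tables are written here only to match the style visible in the outputs further down (no hyphens, no "and").

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n):
    """Spell out an integer 1 <= n <= 9999 in English."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts += [ONES[thousands], "thousand"]
    if hundreds:
        parts += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens])
        if ones:
            parts.append(ONES[ones])
    elif rest:
        parts.append(ONES[rest])
    return " ".join(parts)

print(", ".join(to_words(n) for n in range(1, 13)))
```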

We're going to try to create a language model that can predict the next word in this document. It's just a toy example for this purpose. In this case, we only have one document. That one document is the list of numbers. So we can use a TextList to create an item list with text in it for the training and the validation sets.





In [0]:
from fastai.text import *

In [0]:
bs = 64

In [3]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

[PosixPath('/root/.fastai/data/human_numbers/train.txt'),
 PosixPath('/root/.fastai/data/human_numbers/valid.txt')]

In [0]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

In [6]:
train_txt = readnums('train.txt');train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [7]:
valid_txt = readnums('valid.txt');valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

In [0]:
train = TextList(train_txt,path=path)
valid = TextList(valid_txt,path=path)

src = ItemLists(path=path,train=train,valid=valid).label_for_lm()
data = src.databunch(bs=bs)

In this case, the validation set is the numbers from 8,001 onwards, and the training set is 1 to 8,000. We can combine them together and turn that into a data bunch.

In [9]:
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

We only have one document, so **train[0]** is the document. Grab its **.text** (that's how you grab the contents of a text list), and here are the first 80 characters. **It starts with a special token xxbos**. Anything starting with **xx is a special fast.ai token**; **bos is the beginning of stream token**. It basically says this is the start of a document, and it's very helpful in NLP to know when documents start so that your models can learn to recognize them.

In [10]:
len(data.valid_ds[0][0].data)

13017

The validation set contains 13,000 tokens. So 13,000 words or punctuation marks because everything between spaces is a separate token.

In [11]:
data.bptt, len(data.valid_dl)

(70, 3)

In [12]:
13017/70/bs

2.905580357142857

- The batch size that we asked for was 64.
- Then by default it uses something called **bptt** of 70. bptt, as we briefly mentioned, stands for **"back prop through time"**. That's the sequence length. For each of our 64 document segments, we split it up into lists of 70 words that we look at at one time.

So what we do for the validation set is we grab this entire string of 13,000 tokens, and then we split it into 64 roughly equal sized sections. People very very very often think I'm saying something different. I did not say "they are of length 64" - they're not. **They're 64 roughly equally sized segments.** So we take the first 1/64 of the document - piece 1. The second 1/64 - piece 2.

Then for each of those 1/64 of the document, we then split those into pieces of length 70. So let's now say for those 13,000 tokens, how many batches are there? Well, divide by batch size and divide by 70, so there's going to be 3 batches.
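The layout described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not fastai's data loader: `lm_batches` is a hypothetical helper, it works on a list of integers in place of numericalized text, and it simply drops the few tokens that don't fit into equal rows.

```python
import math

def lm_batches(tokens, bs, bptt):
    """Lay a token stream out as bs contiguous rows, then cut it into bptt-wide batches.

    Returns a list of (x, y) pairs; y is x shifted by one token.
    """
    seg_len = (len(tokens) - 1) // bs  # -1 leaves room for the shifted targets
    rows_x = [tokens[r * seg_len:(r + 1) * seg_len] for r in range(bs)]
    rows_y = [tokens[r * seg_len + 1:(r + 1) * seg_len + 1] for r in range(bs)]
    batches = []
    for start in range(0, seg_len, bptt):
        x = [row[start:start + bptt] for row in rows_x]
        y = [row[start:start + bptt] for row in rows_y]
        batches.append((x, y))
    return batches

tokens = list(range(100))  # stand-in for a numericalized document
batches = lm_batches(tokens, bs=4, bptt=10)
print(len(batches), "batches for the toy stream")

# The same arithmetic for the real validation set: 13,017 tokens, bptt 70, bs 64.
print(math.ceil(13017 / 70 / 64), "batches for the validation set")
```

Note how row 0 of the second batch picks up exactly where row 0 of the first batch stopped; that contiguity is what the rest of this section relies on.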

In [0]:
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

In [14]:
x1.numel()+x2.numel()+x3.numel()

13440

Let's grab an iterator for a data loader, grab batches 1, 2, and 3 (the X and the Y), and add up the number of elements. We get 13,440, which is 3 × 64 × 70: three full batches. That's slightly more than 13,017, because 13,017 tokens don't divide evenly into 64 rows of 70, and in this run the last batch is filled out by wrapping around to the start of the document (you can see xxbos reappear at the end of x3[-1] further down). This is the kind of stuff you should play around with a lot - lots of shapes and sizes and stuff and iterators.

In [15]:
x1.shape,y1.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [16]:
x2.shape,y2.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [17]:
x1[:,0]

tensor([ 2,  9, 11, 12, 13, 11, 10,  9, 10, 14, 19, 25, 19, 15, 16, 11, 19,  9,
        10,  9, 19, 25, 19, 11, 19, 11, 10,  9, 19, 20, 11, 26, 20, 23, 20, 20,
        24, 20, 11, 14, 11, 11,  9, 14,  9, 20, 10, 20, 35, 17, 11, 10,  9, 17,
         9, 20, 10, 20, 11, 20, 11, 20, 20, 20], device='cuda:0')

In [18]:
y1[:,0]

tensor([19, 19, 27, 10,  9, 12, 32, 19, 26, 10, 11, 15, 11, 10,  9, 15, 11, 19,
        26, 19, 11, 18, 11, 18,  9, 18, 21, 19, 10, 10, 20,  9, 11, 16, 11, 11,
        13, 11, 13,  9, 13, 14, 20, 10, 20, 11, 24, 11,  9,  9, 16, 17, 20, 10,
        20, 11, 24, 11, 19,  9, 19, 11, 11, 10], device='cuda:0')

So here, you can see the first column of the first batch of X (remember, we've numericalized all these), that is, the first token of each of the 64 rows, and here's the first column of Y. Y is offset by 1 from X, because that's what you want to do with a language model: we want to predict the next word. Row 0 of x1 starts with 2 and row 0 of y1 starts with 19, so after 2 should come 19.

In [0]:
v = data.valid_ds.vocab

In [25]:
v.textify(x1[0])

'xxbos eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight'

In [26]:
v.textify(y1[0])

'eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight thousand'

You can grab the vocab for this dataset, and a vocab has a textify, so if we look at exactly the same thing but with textify, that will just look it up in the vocab. So here you can see xxbos eight thousand one, whereas in the y there's no xxbos, it's just eight thousand one. So after xxbos is eight, after eight is thousand, after thousand is one.
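Here is a toy version of the round trip between tokens and ids. `TinyVocab` is a hypothetical stand-in written for this note, so the ids it assigns will not match the fastai vocab used in the cells above; it only borrows the `itos` and `textify` names to show the idea.

```python
class TinyVocab:
    """Minimal vocab: itos maps ids to tokens, stoi maps tokens to ids."""

    def __init__(self, tokens):
        self.itos = sorted(set(tokens))
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def numericalize(self, tokens):
        return [self.stoi[t] for t in tokens]

    def textify(self, ids):
        return " ".join(self.itos[i] for i in ids)

tokens = "xxbos eight thousand one , eight thousand two".split()
vocab = TinyVocab(tokens)
ids = vocab.numericalize(tokens)

# For a language model, the target is the input shifted by one token.
x, y = ids[:-1], ids[1:]
print(vocab.textify(x))
print(vocab.textify(y))
```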

In [27]:
v.textify(x2[0])

'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'

In [28]:
v.textify(x3[0])

'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'

Then after x1[0] stops partway through 8,018, along comes x2[0] and picks up at "thousand eighteen". Look at this: we're always looking at row 0, the first row of each mini batch. x2[0] runs up to 8,032, and then x3[0] goes from 8,033 all the way up to 8,046.

In [29]:
v.textify(x1[1])

', eight thousand forty six , eight thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine ,'

In [30]:
v.textify(x2[1])

'eight thousand sixty , eight thousand sixty one , eight thousand sixty two , eight thousand sixty three , eight thousand sixty four , eight thousand sixty five , eight thousand sixty six , eight thousand sixty seven , eight thousand sixty eight , eight thousand sixty nine , eight thousand seventy , eight thousand seventy one , eight thousand seventy two , eight thousand seventy three , eight thousand'

In [31]:
v.textify(x3[1])

'seventy four , eight thousand seventy five , eight thousand seventy six , eight thousand seventy seven , eight thousand seventy eight , eight thousand seventy nine , eight thousand eighty , eight thousand eighty one , eight thousand eighty two , eight thousand eighty three , eight thousand eighty four , eight thousand eighty five , eight thousand eighty six , eight thousand eighty seven , eight thousand eighty'

In [32]:
v.textify(x3[-1])

'ninety , nine thousand nine hundred ninety one , nine thousand nine hundred ninety two , nine thousand nine hundred ninety three , nine thousand nine hundred ninety four , nine thousand nine hundred ninety five , nine thousand nine hundred ninety six , nine thousand nine hundred ninety seven , nine thousand nine hundred ninety eight , nine thousand nine hundred ninety nine xxbos eight thousand one , eight'

Then we can go right back to the start, but look at row index 1, which is the second row of the batch. Now we can continue. x1[1] starts at 8,046, a few tokens before the point where x3[0] stopped: in this run each row spans 3 × 70 = 210 tokens, while 1/64 of the document is only about 203 tokens, so the last mini batch runs slightly into the next segment. What this means is that every mini batch joins up with the previous mini batch. So you can go straight from x1[0] to x2[0]: it continues "eight thousand seventeen , eight" with "thousand eighteen". If you take the same thing for row 1, you'll also see they join up. So all the mini batches join up.



In [33]:
data.show_batch(ds_type=DatasetType.Valid)

idx,text
0,"thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty"
1,"eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one"
2,"thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand"
3,"three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three"
4,"thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand"


That's the data. We can do show_batch to see it.

##Single fully connected model

In [0]:
data = src.databunch(bs=bs,bptt=3)

In [50]:
x,y=data.one_batch()
x.shape,y.shape

(torch.Size([64, 3]), torch.Size([64, 3]))

In [51]:
nv = len(v.itos); nv

39

In [0]:
nh = 64

In [0]:
def loss4(input,target): return F.cross_entropy(input,target[:,-1])
def acc4(input,target): return accuracy(input,target[:,-1])
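To make it concrete what `loss4` measures, here is a plain-Python sketch of cross-entropy against only the last target in each row. The helper names and the example numbers are invented; the real `F.cross_entropy` works on tensors, but with its default settings it also averages the per-row losses like this.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def loss_last(batch_logits, batch_targets):
    """Mean cross-entropy, comparing each row's output to its last target only."""
    losses = [cross_entropy(row, targets[-1])
              for row, targets in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)

# Hypothetical batch: 2 rows, 3 classes, 3 target words per row.
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
batch_targets = [[1, 2, 0], [0, 0, 2]]
print(loss_last(batch_logits, batch_targets))

# A model that knows nothing predicts uniformly over the 39 vocab items.
print(cross_entropy([0.0] * 39, 0))
```

The uniform-guess loss is ln 39, about 3.66, which is close to the losses you see in the first epoch of the training runs below.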

Here is our model which is doing what we saw in the diagram:

In [0]:
class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh) # green arrow
        self.h_h = nn.Linear(nh,nh) # brown arrow
        self.h_o = nn.Linear(nh,nv) # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self,x):
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h += self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h += self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
            
        return self.h_o(h)
        

This is just the code copied over:

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/49.png?raw=true)

It contains one embedding (i.e. the green arrow), one hidden to hidden - brown arrow layer, and one hidden to output. So each colored arrow has a single matrix. Then in the forward pass, we take our first input x[:,0] and put it through input to hidden (the green arrow), then ReLU and batch norm, to create our first set of activations which we call h. Then, assuming there is a second word (because sometimes we might be at the end of a batch where there isn't one), we add to h the result of x[:,1] put through the green arrow (that's i_h). Then we would say, okay, our new h is the result of those two added together, put through our hidden to hidden (orange arrow), and then ReLU then batch norm. Then for the third word, do exactly the same thing. Then finally blue arrow - put it through h_o.

So that's how we convert our diagram to code. Nothing new here at all. We can chuck that in a learner, and we can train it - 46%.



In [0]:
learn = Learner(data,Model0(),loss_func=loss4,metrics=acc4)

In [58]:
learn.fit_one_cycle(6,1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.692921,3.704324,0.017233,00:01
1,3.063164,3.19332,0.436351,00:01
2,2.440366,2.677097,0.464614,00:01
3,2.110219,2.399459,0.464614,00:01
4,1.987081,2.302276,0.464384,00:01
5,1.962168,2.288489,0.464614,00:01


##Same thing with a loop [1:50:48](https://youtu.be/nWpdkZE2_cc?t=6648)

Let's take this code and recognize it's pretty awful. There's a lot of duplicate code, and as coders, when we see duplicate code, what do we do? We refactor. So we should refactor this into a loop.



In [0]:
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh) #green arrow
        self.h_h = nn.Linear(nh,nh) #brown arrow
        self.h_o = nn.Linear(nh,nv) #blue arrow
        self.bn = nn.BatchNorm1d(nh)
    def forward(self,x):
        h = torch.zeros(x.shape[0],nh).to(device=x.device)
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

Here we are. We've refactored it into a loop. So now we're going for each xi in x, and doing it in the loop. Guess what? That's an RNN. An RNN is just a refactoring. It's not anything new. This is now an RNN. And let's refactor our diagram:

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/50.png?raw=true)

This is the same diagram, but I've just replaced it with my loop. It does the same thing, so here it is. It's got exactly the same `__init__`, literally exactly the same, just popped a loop here. Before I start, I just have to make sure that I've got a bunch of zeros to add to. And of course, I get essentially the same result when I train it.

In [0]:
learn = Learner(data,Model1(),loss_func=loss4,metrics=acc4)

In [61]:
learn.fit_one_cycle(6,1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.570172,3.416073,0.1983,00:01
1,2.947914,2.813697,0.452436,00:01
2,2.370738,2.346581,0.462086,00:01
3,2.065823,2.141318,0.467831,00:01
4,1.948797,2.076144,0.46921,00:01
5,1.924532,2.067348,0.46921,00:01
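The "an RNN is just a refactoring" point can be checked on a toy example. The sketch below uses plain Python lists, made-up weights, and no batch norm, so it is a simplified cousin of the models above and not a copy of them; it only shows that the unrolled version and the loop compute the same thing.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(w, b, v):
    """Matrix-vector product plus bias."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical tiny weights: vocab of 3 words, 2 hidden activations.
EMB = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # input to hidden (green arrow)
W_HH = [[0.9, -0.5], [0.2, 0.7]]              # hidden to hidden
B_HH = [0.0, 0.1]

def unrolled(x):
    """Three words, every step written out by hand."""
    h = add([0.0, 0.0], EMB[x[0]])
    h = relu(linear(W_HH, B_HH, h))
    h = add(h, EMB[x[1]])
    h = relu(linear(W_HH, B_HH, h))
    h = add(h, EMB[x[2]])
    h = relu(linear(W_HH, B_HH, h))
    return h

def looped(x):
    """The same computation as a loop, for a sequence of any length."""
    h = [0.0, 0.0]
    for word in x:
        h = relu(linear(W_HH, B_HH, add(h, EMB[word])))
    return h

print(unrolled([0, 2, 1]), looped([0, 2, 1]))
print(looped([0, 2, 1, 1, 0]))  # the loop also handles longer sequences
```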


##Multi fully connected model

One nice thing about the loop though, is now this will work even if I'm not predicting the fourth word from the previous three, but the ninth word from the previous eight. It'll work for any arbitrary-length sequence, which is nice.

So let's up the bptt to 20 since we can now. And let's now say, okay, instead of just predicting the nth word from the previous n-1, let's try to predict the second word from the first, the third from the second, and the fourth from the third, and so forth. Look at our loss function.

```
def loss4(input,target): return F.cross_entropy(input,target[:,-1])
def acc4(input,target): return accuracy(input,target[:,-1])
```

Previously we were comparing the result of our model to just the last word of the sequence. It is very wasteful, because there's a lot of words in the sequence. So let's compare every word in x to every word in y. To do that, we need to change the diagram so it's not just one triangle at the end of the loop, but the triangle is inside the loop:

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/52.png?raw=true)

In other words, after every loop, predict, loop, predict, loop, predict.

In [0]:
data = src.databunch(bs=bs,bptt=20)

In [63]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 20]), torch.Size([64, 20]))

In [0]:
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
    def forward(self,x):
        h = torch.zeros(x.shape[0],nh).to(device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        return torch.stack(res,dim=1)

Here's this code. It's the same as the previous code, but now I've created an array, and every time I go through the loop, I append h_o(bn(h)) to the array. Now, for n inputs, I create n outputs. So I'm predicting after every word.
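Once there is a prediction at every position, accuracy is counted over every (row, position) pair instead of one per row. A rough plain-Python sketch of that bookkeeping, with a hypothetical helper name and made-up numbers:

```python
def accuracy_all_positions(logits, targets):
    """Fraction of (row, position) pairs where the argmax of the logits equals the target.

    logits:  nested list of shape (bs, seq_len, n_classes)
    targets: nested list of shape (bs, seq_len)
    """
    correct = total = 0
    for row_logits, row_targets in zip(logits, targets):
        for step_logits, target in zip(row_logits, row_targets):
            pred = max(range(len(step_logits)), key=step_logits.__getitem__)
            correct += pred == target
            total += 1
    return correct / total

# Hypothetical batch: 2 rows, 3 positions, 4 classes.
logits = [
    [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]],
    [[0.9, 0.1, 0.0, 0.2], [0.0, 0.0, 1.0, 0.5], [0.3, 0.2, 0.1, 0.0]],
]
targets = [[1, 0, 2], [0, 2, 1]]
acc = accuracy_all_positions(logits, targets)
print(acc)
```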

In [0]:
learn = Learner(data,Model2(),metrics=accuracy)

In [66]:
learn.fit_one_cycle(10,1e-4,pct_start=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,3.686221,3.663586,0.00696,00:00
1,3.591855,3.552743,0.065057,00:00
2,3.476271,3.450134,0.23608,00:00
3,3.361939,3.358895,0.323935,00:00
4,3.257956,3.280302,0.349787,00:00
5,3.169981,3.220957,0.368466,00:00
6,3.101793,3.181773,0.386861,00:00
7,3.054198,3.160476,0.396733,00:00
8,3.024985,3.152318,0.398722,00:00
9,3.009749,3.151104,0.399077,00:00


Previously I had 46%, now I have 40%. Why is it worse? It's worse because now when I'm trying to predict the second word, I only have one word of state to use. When I'm looking at the third word, I only have two words of state to use. So it's a much harder problem for it to solve. The key problem is here:

![alt text](https://github.com/hiromis/notes/blob/master/lesson7/53.png?raw=true)

I go **h = torch.zeros**. I reset my state to zero every time I start another BPTT sequence. Let's not do that. Let's keep h. And we can, because remember, each batch connects to the previous batch. It's not shuffled like happens in image classification. So let's take this exact model and replicate it again, but let's move the creation of **h** into the constructor.

In [0]:
class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs,nh).cuda()
        
    def forward(self,x):
        res = []
        h = self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))
        self.h = h.detach()
        res = torch.stack(res,dim=1)
        res = self.h_o(res)
        return res
            
                            

There it is. So it's now self.h. So this is now exactly the same code, but at the end, let's put the new h back into self.h. It's now doing the same thing, but it's not throwing away that state.
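Why keeping the state helps can be seen on a scalar toy model. `TinyStatefulRNN` and its weights are hypothetical, and there are no gradients here, so the `.detach()` call in Model3 has no counterpart; the sketch only shows that carrying h across chunks gives the same outputs as reading the whole sequence in one go, while resetting h does not.

```python
class TinyStatefulRNN:
    """Scalar toy RNN that keeps its hidden state between calls."""

    def __init__(self, w_in=0.5, w_hh=0.8):
        self.w_in, self.w_hh = w_in, w_hh
        self.h = 0.0

    def reset(self):
        self.h = 0.0

    def forward(self, xs):
        h = self.h
        outputs = []
        for x in xs:
            h = max(0.0, self.w_hh * (h + self.w_in * x))  # ReLU of the update
            outputs.append(h)
        self.h = h  # keep the state for the next chunk
        return outputs

seq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rnn = TinyStatefulRNN()

whole = rnn.forward(seq)  # one long sequence

rnn.reset()
chunked = rnn.forward(seq[:3]) + rnn.forward(seq[3:])  # two chunks, state carried over

rnn.reset()
first = rnn.forward(seq[:3])
rnn.reset()  # throw the state away between chunks
with_reset = first + rnn.forward(seq[3:])

print(whole == chunked, whole == with_reset)
```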

In [0]:
learn = Learner(data,Model3(),metrics=accuracy)

In [72]:
learn.fit_one_cycle(20,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.629223,3.596083,0.06108,00:00
1,3.28198,2.983214,0.408878,00:00
2,2.59961,2.046736,0.467258,00:00
3,2.012351,1.942376,0.330256,00:00
4,1.706188,1.93568,0.343608,00:00
5,1.514285,1.820297,0.466477,00:00
6,1.328042,1.748945,0.481108,00:00
7,1.14567,1.772853,0.50696,00:00
8,0.977849,1.788621,0.525497,00:00
9,0.832399,1.685223,0.56179,00:00


Now we actually get above the original. We get up to 56% accuracy in the epochs shown above. So this is what a real RNN looks like. You always want to keep that state. But just keep remembering, there's nothing different about an RNN, and it's a totally normal fully connected neural net. It's just that you've got a loop you refactored.

##nn.RNN

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/54.png?raw=true)

What you could do though is at the end of your every loop, you could not just spit out an output, but you could spit it out into another RNN. So you have an RNN going into an RNN. That's nice because we now got more layers of computation, you would expect that to work better.

To get there, let's do some more refactoring. Let's take this code (Model3) and replace it with the equivalent built in PyTorch code which is you just say that:

In [0]:
class Model4(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.RNN(nh,nh,batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(1,bs,nh).cuda()
        
    def forward(self,x):
        res,h = self.rnn(self.i_h(x),self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

So nn.RNN basically says do the loop for me. We've still got the same embedding, we've still got the same output, still got the same batch norm, we still got the same initialization of h, but we just got rid of the loop. One of the nice things about nn.RNN is that you can now say how many layers you want. This gives roughly the same accuracy, of course:

In [0]:
learn = Learner(data,Model4(),metrics=accuracy)

In [84]:
learn.fit_one_cycle(20,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.479915,3.365526,0.251349,00:00
1,3.06282,2.591632,0.451989,00:00
2,2.393446,1.946349,0.467117,00:00
3,1.901078,1.950912,0.318892,00:00
4,1.62741,1.843968,0.392401,00:00
5,1.393521,1.716843,0.402699,00:00
6,1.170689,1.528136,0.467401,00:00
7,0.996274,1.602331,0.492259,00:00
8,0.86348,1.554383,0.521378,00:00
9,0.753672,1.563642,0.524432,00:00
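One caveat on "the same": the update inside `nn.RNN` is not literally the loop from Model3. Writing $e_t$ for the embedding of word $t$ (a symbol introduced here for convenience), the two updates are as follows. The first line is read off the Model3 code above; the second is the default update given in the PyTorch documentation for `nn.RNN`.

$$
\begin{aligned}
\text{Model3 loop:}\quad & h_t = \mathrm{ReLU}\big(W_{hh}\,(h_{t-1} + e_t) + b_{hh}\big) \\
\texttt{nn.RNN}\text{ (default):}\quad & h_t = \tanh\big(W_{ih}\,e_t + b_{ih} + W_{hh}\,h_{t-1} + b_{hh}\big)
\end{aligned}
$$

So `nn.RNN` gives the input its own weight matrix and uses tanh unless you pass `nonlinearity='relu'`. That is a likely reason the two training runs land close to each other but not on identical numbers.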


##2-layer GRU

So here, I'm going to do it with two layers:

In [0]:
class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.GRU(nh,nh,2,batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(2,bs,nh).cuda()
        
    def forward(self,x):
        res,h = self.rnn(self.i_h(x),self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

But here's the thing. When you think about this:

![alt text](https://github.com/hiromis/notes/raw/master/lesson7/54.png?raw=true)

Think about it without the loop. It looks like this:

![alt text](https://github.com/hiromis/notes/blob/master/lesson7/55.png?raw=true)

It keeps on going, and we've got a BPTT of 20, so there's 20 layers of this. And we know from that paper on visualizing loss landscapes that deep networks have awful bumpy loss surfaces. So when you start creating long timescales and multiple layers, these things get impossible to train. There's a few tricks you can do. One thing is you can add skip connections, of course. But what people normally do is, instead of just adding these together (green and orange arrows), they actually use a little mini neural net to decide how much of the green arrow to keep and how much of the orange arrow to keep. When you do that, you get something that's either called GRU or LSTM depending on the details of that little neural net. And we'll learn about the details of those neural nets in part 2. They really don't matter though, frankly.
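For the curious, here is roughly what that little mini neural net looks like for a GRU, shrunk down to scalars. This is a simplified sketch: the weights in `p` are hypothetical, there is one bias per gate where `nn.GRU` keeps two, and the blend on the last line follows the convention in the PyTorch `nn.GRU` documentation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h, p):
    """One scalar GRU step: gates decide how much old state to keep."""
    r = sigmoid(p["w_ir"] * x + p["w_hr"] * h + p["b_r"])          # reset gate
    z = sigmoid(p["w_iz"] * x + p["w_hz"] * h + p["b_z"])          # update gate
    n = math.tanh(p["w_in"] * x + r * (p["w_hn"] * h) + p["b_n"])  # candidate state
    return (1.0 - z) * n + z * h  # blend the candidate with the old state

base = {"w_ir": 0.5, "w_hr": 0.5, "b_r": 0.0,
        "w_iz": 0.5, "w_hz": 0.5, "b_z": 0.0,
        "w_in": 0.5, "w_hn": 0.5, "b_n": 0.0}

keep = dict(base, b_z=20.0)        # update gate pushed towards 1: keep the old state
overwrite = dict(base, b_z=-20.0)  # update gate pushed towards 0: take the candidate

h_old = 0.7
print(gru_step(1.0, h_old, keep))
print(gru_step(1.0, h_old, overwrite))
```

Because the gate can hold the state almost unchanged across many steps, information and gradients have an easier path through long sequences than in the plain add-then-ReLU loop.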

So we can now say let's create a GRU instead. It's just like what we had before, but it'll handle longer sequences in deeper networks. Let's use two layers.



In [0]:
learn = Learner(data,Model5(),metrics=accuracy)

In [87]:
learn.fit_one_cycle(10,1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.886179,2.395688,0.440341,00:00
1,1.783277,1.495697,0.605256,00:00
2,0.921582,1.002496,0.802912,00:00
3,0.445754,1.143247,0.822514,00:00
4,0.221266,1.003409,0.825923,00:00
5,0.115465,1.051694,0.833594,00:00
6,0.063862,1.110775,0.828622,00:00
7,0.03738,1.056298,0.828693,00:00
8,0.02338,1.141393,0.828338,00:00
9,0.016222,1.136422,0.825923,00:00


And we're up to 82%. That's RNNs and the main reason I wanted to show it to you was to remove the last remaining piece of magic, and this is one of the least magical things we have in deep learning. It's just a refactored fully connected network. So don't let RNNs ever put you off. With this approach where you basically have a sequence of n inputs and a sequence of n outputs we've been using for language modeling, you can use that for other tasks.

For example, for every word the output could say whether this is something sensitive that I want to anonymize, so it says private data or not. Or it could be a part of speech tag for that word, or it could be something saying how that word should be formatted, or whatever. These are called **sequence labeling tasks**, and you can use this same approach for pretty much any sequence labeling task. Or you can do what I did in the earlier lesson, which is once you finish building your language model, you can throw away the h_o bit, pop a standard classification head there instead, and then you can do NLP classification, which as you saw earlier will give you state-of-the-art results even on long documents. So this is a super valuable technique, and not remotely magical.

What now? [1:58:59]()