<a href="https://colab.research.google.com/github/hduongck/AI-ML-Learning/blob/master/Fastai%20NLP%20course/8_Predicting_English_word_version_of_numbers_using_an_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook was part of [Lesson 7](https://course.fast.ai/videos/?lesson=7) of the Practical Deep Learning for Coders course.

# Predicting English word version of numbers using an RNN

[Video 10](https://youtu.be/MDX_x6rKXAs?t=5353)

We were using RNNs as part of our language model in the previous lesson. Today, we will dive into more details of what RNNs are and how they work. We will do this using the problem of trying to predict the English word version of numbers.

Let's predict what should come next in this sequence:

`eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve...`

Jeremy created this **synthetic dataset** to have a better way to check whether things are working, to debug, and to understand what is going on. When experimenting with new ideas, it can be nice to have a smaller dataset, so you can quickly get a sense of whether your ideas are promising (for other examples, see [Imagenette and Imagewoof](https://github.com/fastai/imagenette)). This English-word numbers dataset will serve as a good dataset for learning about RNNs. Our task today will be to predict which word comes next when counting.

- **Synthetic dataset technique** (Jeremy's explanation): While developing the fastai library, I wrote something wrong about 12 times, and it was very hard to tell that anything was wrong. With machine learning in particular, code can be wrong in subtle ways. With a dataset where I clearly know what the next answer ought to be, it was much easier to see how the model was wrong. It is also much easier to develop an algorithm on something that trains in 5 or 10 seconds rather than 5 to 10 hours.
- Imagenette and Imagewoof are a slightly different variant: they are not synthetic, but they are smaller computer vision datasets that address a problem people struggled with for a long time. People either trained on ImageNet, which used to take days, or on CIFAR, which was so small that it turned out to be useless: its images are 32x32 pixels, and things that work at 32x32 often don't work well on normal-sized images. So I created a dataset with full-sized images but fewer of them, in one version that is easy to classify and one that is hard to classify.
- In general, coming up with smaller, sampled versions of the problem you are trying to solve is one of the really important skills of a machine learning practitioner. It lets you iterate quickly and identify your mistakes, even if you don't make as many mistakes as I do.
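
To make the dataset concrete, here is a minimal sketch of how text like the sequence shown earlier could be generated. The helper `to_words` is a hypothetical name introduced for illustration, and this is not necessarily how the `HUMAN_NUMBERS` files were produced; it simply follows the style visible in the sample (no hyphens, no "and").

```python
# Hypothetical helper: spell out 1..9999 in the style of the sample text.
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n):
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only handles 1..9999")
    parts = []
    if n >= 1000:
        parts += [ONES[n // 1000], "thousand"]
        n %= 1000
    if n >= 100:
        parts += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if n:
        parts.append(ONES[n])
    return " ".join(parts)

sample = " , ".join(to_words(n) for n in range(8001, 8004))
print(sample)  # eight thousand one , eight thousand two , eight thousand three
```

Because the next word is fully determined by the counting rule, any wrong prediction points at a problem in the model or training code rather than at noise in the data.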
    

**In deep learning, there are 2 types of numbers**

**Parameters** are numbers that are learned. **Activations** are numbers that are calculated (by affine functions & element-wise non-linearities).

When you learn about any new concept in deep learning, ask yourself: is this a parameter or an activation?
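
As a small illustration of the distinction, the sketch below uses plain Python with made-up values (the names `params` and `layer` are assumptions for this example only). The weights and bias are parameters; the number the layer computes from them is an activation.

```python
# Parameters: numbers the optimizer would update during training.
params = {"w": [0.5, -1.0], "b": 0.25}

def relu(z):
    return max(0.0, z)

def layer(x, params):
    # Affine function followed by an element-wise non-linearity.
    z = sum(w_i * x_i for w_i, x_i in zip(params["w"], x)) + params["b"]
    return relu(z)

# Activation: a number calculated from the input and the parameters.
activation = layer([2.0, 0.5], params)
print(activation)  # 0.75
```

In the models later in this notebook, the weights inside `i_h`, `h_h` and `h_o` are parameters, while the hidden state `h` is an activation.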

Note to self: Point out the hidden state, going from the version without a for-loop to the for loop. This is the step where people get confused.

## Data

You would not see a dataset like this in the real world, but it is very informative for understanding RNNs.

In [0]:
from fastai.text import *

In [0]:
bs = 64


In [3]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

[PosixPath('/root/.fastai/data/human_numbers/train.txt'),
 PosixPath('/root/.fastai/data/human_numbers/valid.txt')]

In [0]:
def readnums(d): return [', '.join(o.strip() for o  in open(path/d).readlines())]

train.txt gives us a sequence of numbers written out as English words:

In [5]:
train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [6]:
valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

In [0]:
train = TextList(train_txt,path=path)
valid = TextList(valid_txt,path=path)

src = ItemLists(path=path,train=train,valid=valid).label_for_lm()
data = src.databunch(bs=bs)

In [8]:
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

In [9]:
len(data.valid_ds[0][0].data)

13017

**bptt** stands for back-propagation through time. This tells us how many steps of history we are considering.

With `bptt` = 70, the network sees 70 tokens of history, and we want it to predict the next word, the 71st.

In [10]:
data.bptt,len(data.valid_dl)

(70, 3)

We have 3 batches in our validation set:

There are 13017 tokens, with about 70 tokens in a line of text and 64 lines of text per batch.

In [11]:
13017/70/bs

2.905580357142857
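
The ratio is a little under 3, and the batch count is 3 because a partly filled final batch still counts as a batch. The arithmetic can be sketched as follows; how fastai treats the leftover positions internally is not shown here, so take this only as the counting argument.

```python
import math

n_tokens, bptt, bs = 13017, 70, 64   # values from the cells above

tokens_per_batch = bptt * bs                         # one full 64 x 70 batch
n_batches = math.ceil(n_tokens / tokens_per_batch)   # round the ratio up
print(tokens_per_batch, n_batches)                   # 4480 3
```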

We will store each batch in a separate variable, so we can walk through this to understand better what the RNN does at each step:

In [0]:
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

In [13]:
x1

tensor([[ 2, 19, 11,  ..., 36,  9, 19],
        [ 9, 19, 11,  ..., 24, 20,  9],
        [11, 27, 18,  ...,  9, 19, 11],
        ...,
        [20, 11, 20,  ..., 11, 20, 10],
        [20, 11, 20,  ..., 24,  9, 20],
        [20, 10, 26,  ..., 20, 11, 20]], device='cuda:0')

numel() is a PyTorch method to return the number of elements in a tensor:

In [14]:
x1.numel()+x2.numel()+x3.numel()

13440

In [15]:
x1.shape,y1.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [16]:
x2.shape,y2.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [17]:
x3.shape,y3.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [0]:
v = data.valid_ds.vocab

In [19]:
v.itos

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 ',',
 'hundred',
 'thousand',
 'one',
 'two',
 'three',
 'four',
 'five',
 'six',
 'seven',
 'eight',
 'nine',
 'twenty',
 'thirty',
 'forty',
 'fifty',
 'sixty',
 'seventy',
 'eighty',
 'ninety',
 'ten',
 'eleven',
 'twelve',
 'thirteen',
 'fourteen',
 'fifteen',
 'sixteen',
 'seventeen',
 'eighteen',
 'nineteen',
 'xxfake']

[1:40](https://youtu.be/MDX_x6rKXAs?t=5859)

Tokenization is always implementation-dependent: how things are tokenized depends on the particular library.

In [20]:
x1[:,0]

tensor([ 2,  9, 11, 12, 13, 11, 10,  9, 10, 14, 19, 25, 19, 15, 16, 11, 19,  9,
        10,  9, 19, 25, 19, 11, 19, 11, 10,  9, 19, 20, 11, 26, 20, 23, 20, 20,
        24, 20, 11, 14, 11, 11,  9, 14,  9, 20, 10, 20, 35, 17, 11, 10,  9, 17,
         9, 20, 10, 20, 11, 20, 11, 20, 20, 20], device='cuda:0')

In [21]:
y1[:,0]

tensor([19, 19, 27, 10,  9, 12, 32, 19, 26, 10, 11, 15, 11, 10,  9, 15, 11, 19,
        26, 19, 11, 18, 11, 18,  9, 18, 21, 19, 10, 10, 20,  9, 11, 16, 11, 11,
        13, 11, 13,  9, 13, 14, 20, 10, 20, 11, 24, 11,  9,  9, 16, 17, 20, 10,
        20, 11, 24, 11, 19,  9, 19, 11, 11, 10], device='cuda:0')

In [22]:
v.itos[9],v.itos[11],v.itos[12],v.itos[13],v.itos[10]

(',', 'thousand', 'one', 'two', 'hundred')

In [23]:
v.textify(x1[0])

'xxbos eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight'
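
Decoding a row is just a lookup from integer ids to tokens. The sketch below copies the first 21 entries of the vocabulary printed earlier and uses a hypothetical `decode` function as a stand-in for `v.textify`; fastai's own implementation may differ in details.

```python
# First 21 entries of the vocabulary printed above (index -> token).
itos = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld", "xxmaj", "xxup",
        "xxrep", "xxwrep", ",", "hundred", "thousand", "one", "two",
        "three", "four", "five", "six", "seven", "eight", "nine"]

def decode(ids, itos):
    # Hypothetical stand-in for v.textify: look up each id, join with spaces.
    return " ".join(itos[i] for i in ids)

# Ids chosen to match the start of the decoded first row.
print(decode([2, 19, 11, 12, 9], itos))  # xxbos eight thousand one ,
```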

In [24]:
v.textify(x2[0])

'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'

In [25]:
v.textify(x3[0])

'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'

Note how these batches line up from one batch to the next: the first line of the first batch is followed by the first line of the second batch, and so on for every row.
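
One way to get this alignment is to split the token stream into `bs` contiguous rows and then cut every row into chunks of `bptt` tokens. The function `lm_batches` below is a hypothetical, simplified sketch of that idea in plain Python; it is not fastai's data loader, which handles details such as leftover tokens and shuffling differently.

```python
def lm_batches(tokens, bs, bptt):
    row_len = len(tokens) // bs                      # leftover tokens are dropped
    rows = [tokens[r * row_len:(r + 1) * row_len] for r in range(bs)]
    batches = []
    for start in range(0, row_len - 1, bptt):
        end = min(start + bptt, row_len - 1)
        x = [row[start:end] for row in rows]         # inputs
        y = [row[start + 1:end + 1] for row in rows] # targets, shifted by one
        batches.append((x, y))
    return batches

batches = lm_batches(list(range(24)), bs=2, bptt=4)
(x1, y1), (x2, y2) = batches[0], batches[1]
print(x1[0], x2[0])  # [0, 1, 2, 3] [4, 5, 6, 7]
print(y1[0])         # [1, 2, 3, 4]
```

Row 0 of the second batch picks up exactly where row 0 of the first batch stopped, which is what makes it meaningful to carry a hidden state from one batch to the next.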

`data.show_batch()` gives another view of what a batch looks like:

In [26]:
data.show_batch(ds_type=DatasetType.Valid)

idx,text
0,"thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty"
1,"eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one"
2,"thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand"
3,"three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three"
4,"thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand"


## Single fully connected model

We will iteratively consider a few different models, building up to a more traditional RNN.

First, we create a model that predicts the next word after three words:

In [0]:
data = src.databunch(bs=bs,bptt=3)

In [28]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 3]), torch.Size([64, 3]))

In [29]:
nv = len(v.itos);nv

40

In [0]:
nh =64


In [0]:
def loss4(input,target): return F.cross_entropy(input,target[:,-1])
def acc4(input,target): return accuracy(input,target[:,-1])


In [32]:
x[:,0]

tensor([13, 13, 10,  9, 18,  9, 11, 11, 13, 19, 16, 23, 24,  9, 12,  9, 13, 14,
        15, 11, 10, 22, 15,  9, 10, 14, 11, 16, 10, 28, 11,  9, 20,  9, 15, 15,
        11, 18, 10, 28, 23, 24,  9, 16, 10, 16, 19, 20, 12, 10, 22, 16, 17, 17,
        17, 11, 24, 10,  9, 15, 16,  9, 18, 11])

**Layer names:**

i_h: input to hidden

h_h: hidden to hidden

h_o: hidden to output

bn: batchnorm

![alt text](https://github.com/hduongck/AI-ML-Learning/blob/master/Pic/RNN.PNG?raw=true)

![alt text](https://github.com/hduongck/AI-ML-Learning/blob/master/Pic/RNN1.PNG?raw=true)

![alt text](https://github.com/hduongck/AI-ML-Learning/blob/master/Pic/RNN2.PNG?raw=true)

In [0]:
class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self,x):
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h = h + self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h = h + self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
            
        return self.h_o(h)

In [0]:
learn = Learner(data,Model0(),loss_func=loss4,metrics=acc4)

In [35]:
learn.fit_one_cycle(6,1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.644249,3.632674,0.061581,00:01
1,3.04418,3.223973,0.365809,00:01
2,2.411806,2.737695,0.459789,00:01
3,2.088001,2.458073,0.463695,00:01
4,1.969697,2.35956,0.465303,00:01
5,1.945775,2.345535,0.465993,00:01


## Same thing with a loop

Let's refactor this to use a for-loop. This does the same thing as before:

![alt text](https://github.com/hduongck/AI-ML-Learning/blob/master/Pic/RNN3.PNG?raw=true)

In [0]:
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh) #green arrow
        self.h_h = nn.Linear(nh,nh) #brown arrow
        self.h_o = nn.Linear(nh,nv) # blue line
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self,x):
        h = torch.zeros(x.shape[0],nh).to(device=x.device)
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = self.bn(F.relu(self.h_h(h)))
        
        return self.h_o(h)
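
The refactoring can be checked on a toy version of the recurrence. The sketch below replaces the embedding and the hidden-to-hidden layer with scalar stand-ins (`emb` and `w` are made-up values, and batchnorm is left out), so it illustrates the loop idea rather than reproducing `Model0` or `Model1`.

```python
def relu(z):
    return max(0.0, z)

# Toy scalar stand-ins (assumptions): emb plays the role of i_h, w of h_h.
emb = {0: 0.5, 1: -1.0, 2: 2.0}
w = 0.5

def step(h, token):
    # One recurrence step: add the token's embedding, then hidden-to-hidden.
    return relu(w * (h + emb[token]))

def unrolled(tokens):
    h = 0.0
    h = step(h, tokens[0])
    h = step(h, tokens[1])
    h = step(h, tokens[2])
    return h

def looped(tokens):
    h = 0.0
    for t in tokens:
        h = step(h, t)
    return h

print(unrolled([1, 0, 2]), looped([1, 0, 2]))  # 1.125 1.125
```

The loop version also works for any sequence length, whereas the unrolled version is fixed at three tokens.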

This is the difference between unrolled (what we had before) and rolled (what we have now) RNN diagrams:

In [0]:
learn = Learner(data,Model1(),loss_func=loss4,metrics=acc4)

In [39]:
learn.fit_one_cycle(6,1e-4)

epoch,train_loss,valid_loss,acc4,time
0,1.937885,2.1588,0.350414,00:01
1,1.764818,2.045693,0.339844,00:01
2,1.63629,1.997185,0.331572,00:01
3,1.577181,1.995895,0.324219,00:01
4,1.554903,2.012139,0.323759,00:01
5,1.549862,2.018103,0.32261,00:01


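Model1 performs essentially the same computation as Model0, so we expect similar accuracy. In this particular run the final accuracy came out somewhat lower (about 0.32 versus 0.47).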

## Multi fully connected model

Before, we were just predicting the last word in a line of text. Given 70 tokens, what is token 71? That approach was throwing away a lot of data. Why not predict token 2 from token 1, then predict token 3, then predict token 4, and so on? We will modify our model to do this.
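
The difference in how much training signal each line provides can be sketched with a short token list. This is plain Python for illustration only; the variable names are assumptions and no model is involved.

```python
tokens = ["eight", "thousand", "one", ",", "eight", "thousand", "two"]

x, y = tokens[:-1], tokens[1:]      # inputs and next-word targets

# Before: one training signal per line (predict only the final word).
last_only = [(x, y[-1])]

# Now: one training signal per position (predict the word after each prefix).
every_position = [(x[:i + 1], y[i]) for i in range(len(x))]

print(len(last_only), len(every_position))  # 1 6
print(every_position[0])                    # (['eight'], 'thousand')
```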

The Python loop version is a lot slower than `nn.RNN`. `nn.RNN` implements the loop in CUDA C, so the whole loop runs on the GPU, whereas the Python version has to tell the GPU to run one step of the loop, then the next, and each of those calls takes time. This is one of the annoying things about working with RNNs: hand-written PyTorch loops like the one above are not really fast enough, so you have to work with the existing machinery. The loop version is therefore not something you would use in practice.

In [0]:
data = src.databunch(bs=bs,bptt=20)

In [41]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 20]), torch.Size([64, 20]))

In [0]:
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self,x):
        h = torch.zeros(x.shape[0],nh).to(device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        return torch.stack(res,dim=1)

In [0]:
learn = Learner(data,Model2(),metrics=accuracy)

In [44]:
learn.fit_one_cycle(10,1e-4,pct_start=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,3.698139,3.687875,0.03956,00:01
1,3.602434,3.583583,0.150639,00:00
2,3.478865,3.468043,0.238849,00:01
3,3.346398,3.354091,0.273651,00:01
4,3.221698,3.258121,0.299077,00:01
5,3.116115,3.188085,0.312287,00:01
6,3.034905,3.142476,0.318608,00:01
7,2.978496,3.117679,0.331108,00:01
8,2.943944,3.108143,0.335085,00:01
9,2.925942,3.106726,0.335653,00:01


Note that our accuracy is worse now, because we are doing a harder task. When we predict word k (k<70), we have less history to help us than when we were only predicting word 71.

### Maintain state

To address this issue, let's keep the hidden state from the previous line of text, so we are not starting over again on each new line of text.
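
The effect of keeping state can be seen with a toy recurrence. In the sketch below, `run`, `Stateless` and `Stateful` are hypothetical names and the update rule is made up; there are no gradients here, so the `detach()` call used in `Model3` has no counterpart.

```python
def run(h, xs):
    # Toy recurrence standing in for the RNN loop (an assumption, not Model3).
    for x in xs:
        h = 0.5 * h + x
    return h

class Stateless:
    def forward(self, xs):
        return run(0.0, xs)          # hidden state restarts at zero

class Stateful:
    def __init__(self):
        self.h = 0.0
    def forward(self, xs):
        self.h = run(self.h, xs)     # carry the state into the next call
        return self.h

chunk1, chunk2 = [1.0, 2.0], [3.0, 4.0]
stateless, stateful = Stateless(), Stateful()
stateless.forward(chunk1); stateful.forward(chunk1)
print(stateless.forward(chunk2))     # 5.5
print(stateful.forward(chunk2))      # 6.125
print(run(0.0, chunk1 + chunk2))     # 6.125
```

The stateful version gives the same result as processing both chunks in one pass, which is why it matters that consecutive batches line up row by row.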

In [0]:
class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs,nh).cuda()
    def forward(self,x):
        
        res = []
        h = self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))
        self.h = h.detach()
        res = torch.stack(res,dim=1)
        res = self.h_o(res)
        return res

In [0]:
learn = Learner(data,Model3(),metrics=accuracy)

In [52]:
learn.fit_one_cycle(20,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.621734,3.508891,0.114134,00:00
1,3.283303,3.012769,0.375994,00:00
2,2.637859,2.105673,0.465128,00:00
3,2.038377,1.935312,0.338849,00:00
4,1.696197,1.819489,0.430398,00:00
5,1.473343,1.779536,0.471094,00:00
6,1.288067,1.931935,0.555398,00:00
7,1.090811,1.769333,0.497443,00:00
8,0.912858,1.534575,0.570739,00:00
9,0.768918,1.568856,0.578125,00:00


Now we are getting greater accuracy than before!

## nn.RNN

Let's refactor the above to use PyTorch's RNN. This is what you would use in practice, but now you know the inside details!

In [0]:
class Model4(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.RNN(nh,nh,batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(1,bs,nh).cuda()
        
    def forward(self,x):
        res,h = self.rnn(self.i_h(x),self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [0]:
learn = Learner(data,Model4(),metrics=accuracy)

In [55]:
learn.fit_one_cycle(20,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.586282,3.447632,0.143608,00:00
1,3.116198,2.594665,0.441406,00:00
2,2.404595,1.953668,0.409162,00:00
3,1.903025,2.05882,0.316832,00:00
4,1.63015,1.78913,0.460014,00:00
5,1.419893,1.735064,0.518395,00:00
6,1.225648,1.564154,0.51527,00:00
7,1.02775,1.508618,0.526278,00:00
8,0.847054,1.431779,0.543253,00:00
9,0.71427,1.392497,0.563139,00:00


## 2-layer GRU

When you have long time scales and deeper networks, these become impossible to train. One way to address this is to add mini neural networks that decide how much of the green arrow and how much of the orange arrow to keep. These mini-NNs can be GRUs or LSTMs. We will cover more details of this in a later lesson.
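
The idea of "deciding how much to keep" can be sketched with a single scalar gate. This is a heavily simplified illustration with fixed, made-up weights; a real GRU works on vectors, learns its weights, and has an additional reset gate, so see the `nn.GRU` documentation for the actual equations.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_update(h_old, x, gate_bias):
    # Simplified, scalar update gate; all weights are fixed toy values.
    candidate = math.tanh(0.8 * x + 0.3 * h_old)    # proposed new state
    z = sigmoid(0.5 * x + 0.5 * h_old + gate_bias)  # how much old state to keep
    return z * h_old + (1.0 - z) * candidate

h_old, x = 0.9, -1.0
print(gated_update(h_old, x, gate_bias=50.0))   # stays close to h_old
print(gated_update(h_old, x, gate_bias=-50.0))  # close to the candidate
```

When the gate is near 1 the old hidden state passes through almost unchanged, which is what lets information (and gradients) survive over many time steps.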

In [0]:
class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(2, bs, nh).cuda()
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [0]:
learn = Learner(data, Model5(), metrics=accuracy)

In [58]:
learn.fit_one_cycle(10,1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.883967,2.312016,0.456676,00:00
1,1.789046,1.644801,0.559801,00:00
2,0.922462,1.224658,0.778977,00:00
3,0.451269,0.856259,0.839418,00:00
4,0.224592,0.950845,0.83679,00:00
5,0.117308,1.000393,0.835369,00:00
6,0.06527,1.019392,0.834517,00:00
7,0.039372,1.061049,0.836861,00:00
8,0.025762,1.044991,0.835653,00:00
9,0.018671,1.064459,0.83402,00:00


**Connection to ULMFiT**

In the previous lesson, we were essentially swapping out **self.h_o** with a classifier in order to do classification on text.


**RNNs are just a refactored, fully-connected neural network.**

You can use the same approach for any sequence labeling task (part-of-speech tagging, classifying whether material is sensitive, and so on).