# Recurrent neural networks

In the previous module, we used rich semantic representations of text with a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it does not take the **order** of words into account, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models cannot model word order, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need another neural network architecture, called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence through the network one symbol at a time, and the network produces a **state**, which we then pass back to the network together with the next symbol.

<img alt="RNN" src="../../../../../translated_images/rnn.27f5c29c53d727b546ad3961637a267f0fe9ec5ab01f2a26a853c92fcefbb574.pcm.png" width="60%"/>

Given the input sequence of tokens $X_0,\dots,X_n$, an RNN creates a sequence of neural network blocks. Each block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as a result. The final state, or the output of the last block, goes into a linear classifier to produce the result. All network blocks share the same weights, and the whole sequence is trained end-to-end in one back propagation pass.

Because the state vectors $S_0,\dots,S_n$ are passed through the network, it can learn sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to negate certain elements within the state vector, resulting in negation.
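As a hedged illustration, the update performed by one block of a simple RNN can be written as follows. The symbols $W_x$, $W_s$ and $b$ are notation introduced here for the shared weights and are not used elsewhere in this lesson; $X_i$ stands for the vector representation of the $i$-th token, and $\tanh$ is assumed as the non-linearity (it is the default for PyTorch's `RNN` layer):

$$
S_{i+1} = \tanh\left(W_x X_i + W_s S_i + b\right)
$$

Applying the same formula step by step gives $S_1$ from $(X_0, S_0)$, then $S_2$ from $(X_1, S_1)$, and so on, so information from early tokens can reach later states only through the chain of state vectors.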

> Since the weights of all RNN blocks in the picture are shared, the same picture can be drawn as one block (on the right) with a recurrent feedback loop, which passes the output state of the network back to the input.

Let's see how recurrent neural networks can help us classify our news dataset.


```python
import torch
import torchtext
from torchnlp import *  # helper module of this lesson (torchnlp.py)

train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
```

Loading dataset...
Building vocab...


## Simple RNN classifier

In a simple RNN, each recurrent unit is a simple linear network (followed by a non-linearity) that takes the concatenated input vector and state vector and produces a new state vector. PyTorch represents this unit with the `RNNCell` class, and a network of such cells with the `RNN` layer.
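The relationship between `RNNCell` and `RNN` can be checked with a small sketch. The sizes and tensor names below are arbitrary choices for illustration, not values from this lesson. The sketch copies the weights of a single cell into a one-layer `RNN` and compares the final states, which should agree up to floating-point error:

```python
import torch

torch.manual_seed(0)
embed_dim, hidden_dim, seq_len = 4, 3, 5   # illustrative sizes (assumption)

cell = torch.nn.RNNCell(embed_dim, hidden_dim)                # one recurrent unit
rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)   # a layer of such units

# Give the layer the same weights as the single cell.
with torch.no_grad():
    rnn.weight_ih_l0.copy_(cell.weight_ih)
    rnn.weight_hh_l0.copy_(cell.weight_hh)
    rnn.bias_ih_l0.copy_(cell.bias_ih)
    rnn.bias_hh_l0.copy_(cell.bias_hh)

xs = torch.randn(1, seq_len, embed_dim)   # one sequence of input vectors
state = torch.zeros(1, hidden_dim)        # initial state

# Manual unrolling: the same cell (same weights) is applied at every step.
for i in range(seq_len):
    state = cell(xs[:, i, :], state)

outputs, h = rnn(xs)                      # the layer runs the same loop internally
same = torch.allclose(h[0], state, atol=1e-6)
print(same)
```

In practice the `RNN` layer is preferable to a hand-written loop, since it processes the whole sequence in one call.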

To define an RNN classifier, we first apply an embedding layer to lower the dimensionality of the input vocabulary, and then put an RNN layer on top of it:


```python
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)          # (batch, seq_len) -> (batch, seq_len, embed_dim)
        x, h = self.rnn(x)             # x: per-step outputs, h: final hidden state
        return self.fc(x.mean(dim=1))  # classify the outputs averaged over time steps
```

> **Note:** We use an untrained embedding layer here for simplicity, but for better results we can use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as described in the previous unit. For better understanding, you might want to adapt this code to work with pre-trained embeddings.

In our case, we use a padded data loader, so each batch contains padded sequences of the same length. The RNN layer takes the sequence of embedding tensors and produces two outputs:
* $x$ is the sequence of RNN cell outputs at each step
* $h$ is the final hidden state for the last element of the sequence

We then apply a fully-connected linear classifier to produce a score for each class. In the code above, the classifier takes the mean of the per-step outputs $x$.
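The two outputs are easiest to tell apart by their shapes. The following is a minimal sketch with made-up sizes (a batch of 2 sequences, 5 steps, 8-dimensional embeddings), not the data from this lesson:

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=8, hidden_size=4, batch_first=True)
batch = torch.randn(2, 5, 8)    # (batch, seq_len, embed_dim)

x, h = rnn(batch)
print(tuple(x.shape))           # (2, 5, 4): one output vector per time step
print(tuple(h.shape))           # (1, 2, 4): (num_layers, batch, hidden_dim)

mean_features = x.mean(dim=1)   # average over time steps, as in RNNClassifier
last_features = h[-1]           # alternative: use only the final hidden state
```

For a single-layer, one-directional RNN the last slice of `x` along the time axis equals `h[-1]`, so the two variants differ only in whether earlier steps contribute to the classifier input.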

> **Note:** RNNs are quite difficult to train, because once the RNN cells are unrolled along the sequence length, the number of layers involved in back propagation becomes quite large. Thus we need to select a small learning rate and train the network on a larger dataset to produce good results. This can take quite a long time, so using a GPU is preferred.


```python
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
net = RNNClassifier(vocab_size, 64, 32, len(classes)).to(device)
train_epoch(net, train_loader, lr=0.001)
```

3200: acc=0.3090625
6400: acc=0.38921875
9600: acc=0.4590625
12800: acc=0.511953125
16000: acc=0.5506875
19200: acc=0.57921875
22400: acc=0.6070089285714285
25600: acc=0.6304296875
28800: acc=0.6484027777777778
32000: acc=0.66509375
35200: acc=0.6790056818181818
38400: acc=0.6929166666666666
41600: acc=0.7035817307692308
44800: acc=0.7137276785714286
48000: acc=0.72225
51200: acc=0.73001953125
54400: acc=0.7372794117647059
57600: acc=0.7436631944444444
60800: acc=0.7503947368421052
64000: acc=0.75634375
67200: acc=0.7615773809523809
70400: acc=0.7662642045454545
73600: acc=0.7708423913043478
76800: acc=0.7751822916666666
80000: acc=0.7790625
83200: acc=0.7825
86400: acc=0.7858564814814815
89600: acc=0.7890513392857142
92800: acc=0.7920474137931034
96000: acc=0.7952708333333334
99200: acc=0.7982258064516129
102400: acc=0.80099609375
105600: acc=0.8037594696969697
108800: acc=0.8060569852941176


## Long Short Term Memory (LSTM)

One of the main problems of classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in one back-propagation pass, they have a hard time propagating the error to the first layers of the network, and thus the network cannot learn relationships between distant tokens. One way to avoid this problem is to introduce **explicit state management** by using so-called **gates**. The two best-known architectures of this kind are **Long Short Term Memory** (LSTM) and **Gated Recurrent Unit** (GRU).
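The vanishing effect can be illustrated on a toy example. The sketch below is not the network from this lesson: it assumes a scalar recurrence $s_{t+1} = \tanh(w s_t + x_t)$ with an arbitrarily chosen weight $w = 0.5$, and uses the chain rule to track how strongly the final state depends on the initial one:

```python
import math

def state_gradient(w, inputs, s0=0.0):
    """Return d s_T / d s_0 for the scalar recurrence s_{t+1} = tanh(w * s_t + x_t)."""
    s, grad = s0, 1.0
    for x in inputs:
        s = math.tanh(w * s + x)
        grad *= w * (1.0 - s * s)   # chain rule factor d s_{t+1} / d s_t
    return grad

# The longer the sequence, the smaller the influence of the first state.
for steps in (1, 5, 10, 30):
    print(steps, state_gradient(0.5, [0.1] * steps))
```

Each step multiplies the gradient by a factor whose magnitude is at most $|w|$, so with $|w| < 1$ the gradient shrinks at least geometrically with the sequence length.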

![Image wey dey show example of long short term memory cell](../../../../../lessons/5-NLP/16-RNN/images/long-short-term-memory-cell.svg)

An LSTM network is organized similarly to an RNN, but two states are passed from layer to layer: the actual state $c$ and the hidden vector $h$. At each unit, the hidden vector $h_i$ is concatenated with the input $x_i$, and together they control what happens to the state $c$ via **gates**. Each gate is a neural network with sigmoid activation (output in the range $[0,1]$), which can be thought of as a soft element-wise mask when multiplied by the state vector. The gates are as follows (from left to right in the picture above):
* The **forget gate** takes the hidden vector and determines which components of the vector $c$ we need to forget and which to pass through.
* The **input gate** takes some information from the input and hidden vector and inserts it into the state.
* The **output gate** passes the state through a $\tanh$ activation, then uses the hidden vector $h_i$ and input $x_i$ to select some of its components, producing the new hidden vector $h_{i+1}$.

Components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we may assume that it refers to a female character and raise the flag in the state indicating a female noun in the sentence. When we later encounter the phrase *and Tom*, we raise the flag indicating a plural noun. Thus, by manipulating the state, the network can keep track of grammatical properties of sentence parts.
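To make the role of the gates concrete, here is a hedged sketch that recomputes one step of PyTorch's `LSTMCell` by hand. The sizes are arbitrary, and the variable names are chosen for this illustration; the split into four chunks follows the gate order that PyTorch documents for its LSTM weights (input, forget, candidate, output):

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 4, 3            # illustrative sizes (assumption)
cell = torch.nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(1, input_dim)           # input x_i
h = torch.randn(1, hidden_dim)          # hidden vector h_i
c = torch.randn(1, hidden_dim)          # state c_i

# All gate pre-activations are computed from the input and the hidden vector.
z = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
z_in, z_forget, z_cand, z_out = z.chunk(4, dim=1)

forget_gate = torch.sigmoid(z_forget)   # which parts of c to keep
input_gate = torch.sigmoid(z_in)        # which new values to write into c
candidate = torch.tanh(z_cand)          # proposed new values
output_gate = torch.sigmoid(z_out)      # which parts of the state to expose

c_next = forget_gate * c + input_gate * candidate
h_next = output_gate * torch.tanh(c_next)

# Compare with the built-in cell.
h_ref, c_ref = cell(x, (h, c))
matches = torch.allclose(h_next, h_ref, atol=1e-6) and torch.allclose(c_next, c_ref, atol=1e-6)
print(matches)
```

Because every gate value lies between 0 and 1, multiplying by a gate acts as a soft switch for each component of the state.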

> **Note**: A great resource for understanding the internals of LSTM is the article [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.

While the internal structure of an LSTM cell may look complex, PyTorch hides this implementation inside the `LSTMCell` class and provides the `LSTM` object to represent the whole LSTM layer. Thus, the implementation of an LSTM classifier is very similar to the simple RNN we have seen above:


```python
class LSTMClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        # re-initialize the embedding weights with normal noise shifted by -0.5
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data) - 0.5
        self.rnn = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)
        x, (h, c) = self.rnn(x)   # h: final hidden vector, c: final state
        return self.fc(h[-1])     # classify the final hidden vector of the last layer
```

Now let's train our network. Note that training an LSTM is also quite slow, and you may not see much improvement in accuracy at the beginning of training. You may also need to experiment with the `lr` learning rate parameter to find a value that gives a reasonable training speed without making training unstable.


```python
net = LSTMClassifier(vocab_size, 64, 32, len(classes)).to(device)
train_epoch(net, train_loader, lr=0.001)
```

3200: acc=0.259375
6400: acc=0.25859375
9600: acc=0.26177083333333334
12800: acc=0.2784375
16000: acc=0.313
19200: acc=0.3528645833333333
22400: acc=0.3965625
25600: acc=0.4385546875
28800: acc=0.4752777777777778
32000: acc=0.505375
35200: acc=0.5326704545454546
38400: acc=0.5557552083333334
41600: acc=0.5760817307692307
44800: acc=0.5954910714285714
48000: acc=0.6118333333333333
51200: acc=0.62681640625
54400: acc=0.6404779411764706
57600: acc=0.6520138888888889
60800: acc=0.662828947368421
64000: acc=0.673546875
67200: acc=0.6831547619047619
70400: acc=0.6917897727272727
73600: acc=0.6997146739130434
76800: acc=0.707109375
80000: acc=0.714075
83200: acc=0.7209134615384616
86400: acc=0.727037037037037
89600: acc=0.7326674107142858
92800: acc=0.7379633620689655
96000: acc=0.7433645833333333
99200: acc=0.7479032258064516
102400: acc=0.752119140625
105600: acc=0.7562405303030303
108800: acc=0.76015625
112000: acc=0.7641339285714286
115200: acc=0.7677777777777778
118400: acc=0.77112331081

(0.03487814127604167, 0.7728)

## Packed sequences

In our example, we had to pad all sequences in the minibatch with zeros. While this wastes some memory, the more critical issue with RNNs is that additional RNN cells are created for the padded input items. These cells take part in training, yet they carry no meaningful input information. It would be much better to train the RNN only up to the actual sequence length.

To do that, PyTorch introduces a special storage format for padded sequences. Suppose we have a padded input minibatch that looks like this:
```
[[1,2,3,4,5],
 [6,7,8,0,0],
 [9,0,0,0,0]]
```
Here 0 represents padded values, and the actual length vector of the input sequences is `[5,3,1]`.

To train an RNN efficiently on padded sequences, we want to start with the first group of RNN cells on a large minibatch (`[1,6,9]`), then stop processing the third sequence and continue with smaller minibatches (`[2,7]`, `[3,8]`), and so on. Thus, a packed sequence is represented as one flat vector, in our case `[1,6,9,2,7,3,8,4,5]`, together with the number of sequences still active at each step (`[3,2,2,1,1]`), which follows from the length vector (`[5,3,1]`). From these we can easily reconstruct the original padded minibatch.
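The packing order described above can be reproduced in a few lines of plain Python. This is only an illustration of the layout, with a hypothetical helper named `pack`; it assumes the rows are already sorted by decreasing length and is not how PyTorch implements packing internally:

```python
def pack(padded, lengths):
    """Flatten a padded minibatch step by step, skipping padded positions.

    Assumes the rows are already sorted by decreasing length.
    """
    data, batch_sizes = [], []
    for step in range(max(lengths)):
        # keep only the sequences that still have a real token at this step
        active = [row[step] for row, n in zip(padded, lengths) if step < n]
        data.extend(active)
        batch_sizes.append(len(active))
    return data, batch_sizes

padded = [[1, 2, 3, 4, 5],
          [6, 7, 8, 0, 0],
          [9, 0, 0, 0, 0]]
data, batch_sizes = pack(padded, [5, 3, 1])
print(data)         # [1, 6, 9, 2, 7, 3, 8, 4, 5]
print(batch_sizes)  # [3, 2, 2, 1, 1]
```

The second list records how many sequences are still active at each step, which is what lets a recurrent layer shrink the minibatch as shorter sequences end.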

To produce a packed sequence, we can use the `torch.nn.utils.rnn.pack_padded_sequence` function. All recurrent layers, including RNN, LSTM and GRU, accept packed sequences as input and produce packed output, which can be decoded using `torch.nn.utils.rnn.pad_packed_sequence`.
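As a quick check of these two functions, the following sketch packs the small minibatch from the example above and unpacks it again. The tensor names are chosen for this illustration:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.tensor([[1, 2, 3, 4, 5],
                       [6, 7, 8, 0, 0],
                       [9, 0, 0, 0, 0]])
lengths = torch.tensor([5, 3, 1])   # true lengths, kept on the CPU

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data.tolist())         # [1, 6, 9, 2, 7, 3, 8, 4, 5]
print(packed.batch_sizes.tolist())  # [3, 2, 2, 1, 1]

# Unpacking restores the padded minibatch and the length vector.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(torch.equal(unpacked, padded))
```

Here the sequences are already sorted by decreasing length; for unsorted minibatches, `pack_padded_sequence` accepts `enforce_sorted=False`.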

To be able to produce a packed sequence, we need to pass the length vector to the network, and thus we need a different function to prepare minibatches:


```python
def pad_length(b):
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # compute the length of each sequence and the max length in this minibatch
    len_seq = list(map(len, v))
    max_len = max(len_seq)
    return (  # tuple of three tensors - labels, padded features, length sequence
        torch.LongTensor([t[0] - 1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t), (0, max_len - len(t)), mode='constant', value=0) for t in v]),
        torch.tensor(len_seq)
    )

train_loader_len = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)
```

The network is very similar to the `LSTMClassifier` above, but the `forward` pass receives both the padded minibatch and the vector of sequence lengths. After computing the embedding, we compute the packed sequence, pass it to the LSTM layer, and then unpack the result.

> **Note**: We do not actually use the unpacked result `x`, because the following computations use the final hidden state `h`. Thus, we could remove the unpacking from this code altogether. We keep it here so that you can easily modify the code if you need to use the network output in further computations.
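The effect of packing on the final hidden state can be seen in a small experiment. This is a sketch with made-up sizes and random vectors standing in for an embedded minibatch (zeros are used as the padding here for simplicity); it is not the lesson's data:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)         # stand-in for an embedded minibatch
lengths = torch.tensor([5, 2])   # the second sequence has only 2 real steps
x[1, 2:] = 0.0                   # its remaining steps are padding

out, (h_padded, _) = lstm(x)     # padded input: runs over all 5 steps
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (h_packed, _) = lstm(packed)  # packed input: stops at each true length

# With packing, the final hidden state of the short sequence is its state at the last real step.
stops_at_true_end = torch.allclose(h_packed[-1, 1], out[1, 1], atol=1e-6)
# Without packing, the state keeps changing while the padding is processed.
padding_changes_state = not torch.allclose(h_padded[-1, 1], out[1, 1], atol=1e-6)
print(stops_at_true_end, padding_changes_state)
```

This is why a classifier built on `h[-1]` benefits from packed input: the hidden state it reads is not diluted by padded steps.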


```python
class LSTMPackClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data) - 0.5
        self.rnn = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x, lengths):
        x = self.embedding(x)
        # pack the padded minibatch so the LSTM skips padded positions
        pad_x = torch.nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        pad_x, (h, c) = self.rnn(pad_x)
        # unpack the per-step outputs (not used below, kept for further experiments)
        x, _ = torch.nn.utils.rnn.pad_packed_sequence(pad_x, batch_first=True)
        return self.fc(h[-1])
```

Now let's do the training:


```python
net = LSTMPackClassifier(vocab_size, 64, 32, len(classes)).to(device)
train_epoch_emb(net, train_loader_len, lr=0.001, use_pack_sequence=True)
```


3200: acc=0.285625
6400: acc=0.33359375
9600: acc=0.3876041666666667
12800: acc=0.44078125
16000: acc=0.4825
19200: acc=0.5235416666666667
22400: acc=0.5559821428571429
25600: acc=0.58609375
28800: acc=0.6116666666666667
32000: acc=0.63340625
35200: acc=0.6525284090909091
38400: acc=0.668515625
41600: acc=0.6822596153846154
44800: acc=0.6948214285714286
48000: acc=0.7052708333333333
51200: acc=0.71521484375
54400: acc=0.7239889705882353
57600: acc=0.7315277777777778
60800: acc=0.7388486842105263
64000: acc=0.74571875
67200: acc=0.7518303571428572
70400: acc=0.7576988636363636
73600: acc=0.7628940217391305
76800: acc=0.7681510416666667
80000: acc=0.7728125
83200: acc=0.7772235576923077
86400: acc=0.7815393518518519
89600: acc=0.7857700892857142
92800: acc=0.7895043103448276
96000: acc=0.7930520833333333
99200: acc=0.7959072580645161
102400: acc=0.798994140625
105600: acc=0.802064393939394
108800: acc=0.8051378676470589
112000: acc=0.8077857142857143
115200: acc=0.8104600694444445
118400

(0.029785829671223958, 0.8138166666666666)

> **Note:** You may have noticed the `use_pack_sequence` parameter that we pass to the training function. Currently, the `pack_padded_sequence` function requires the length sequence tensor to be on the CPU device, so the training function needs to avoid moving the length sequence data to the GPU during training. You can look at the implementation of the `train_epoch_emb` function in the [`torchnlp.py`](../../../../../lessons/5-NLP/16-RNN/torchnlp.py) file.


## Bidirectional and multilayer RNNs

In our examples, all recurrent networks operated in one direction, from the beginning of a sequence to the end. This looks natural, because it resembles the way we read and listen to speech. However, since in many practical cases we have random access to the input sequence, it can make sense to run the recurrent computation in both directions. Such networks are called **bidirectional** RNNs, and they can be created by passing the `bidirectional=True` parameter to the RNN/LSTM/GRU constructor.

When dealing with a bidirectional network, we need two hidden state vectors, one for each direction. In the per-step output, PyTorch concatenates the two directions into one vector of twice the size, while the final hidden state tensor holds a separate entry for each direction. This is quite convenient, because you normally pass the result to a fully-connected linear layer, and you only need to account for the increased size when creating that layer.
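The size change can be seen in a short sketch. The dimensions and the classifier head below are illustrative assumptions, not part of the lesson code:

```python
import torch

torch.manual_seed(0)
hidden_dim, num_class = 4, 3     # illustrative sizes (assumption)
lstm = torch.nn.LSTM(input_size=8, hidden_size=hidden_dim,
                     batch_first=True, bidirectional=True)
fc = torch.nn.Linear(2 * hidden_dim, num_class)   # twice the size: two directions

x = torch.randn(2, 5, 8)         # (batch, seq_len, embed_dim)
out, (h, c) = lstm(x)
print(tuple(out.shape))          # (2, 5, 8): both directions concatenated per step
print(tuple(h.shape))            # (2, 2, 4): (directions, batch, hidden_dim)

# Join the final forward state and the final backward state for the classifier.
features = torch.cat([h[0], h[1]], dim=1)
logits = fc(features)
print(tuple(logits.shape))       # (2, 3)
```

Note that the forward direction finishes at the last time step while the backward direction finishes at the first one, so taking `out[:, -1]` alone would not give both final states.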

A recurrent network, whether one-directional or bidirectional, captures certain patterns within a sequence and can store them in the state vector or pass them to the output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher-level patterns, built from the low-level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

![Image showing a Multilayer long-short-term-memory- RNN](../../../../../translated_images/multi-layer-lstm.dd975e29bb2a59fe58b429db833932d734c81f211cad2783797a9608984acb8c.pcm.jpg)

*Picture from [this wonderful post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López*

PyTorch makes constructing such networks easy: you just need to pass the `num_layers` parameter to the RNN/LSTM/GRU constructor to build several layers of recurrence automatically. This also means that the hidden/state tensor grows proportionally (it holds one final state per layer), and you need to take this into account when handling the output of the recurrent layers.
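A minimal sketch of a two-layer LSTM shows what changes in the returned tensors. The sizes are arbitrary and chosen only for this illustration:

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=8, hidden_size=4, num_layers=2, batch_first=True)

x = torch.randn(2, 5, 8)     # (batch, seq_len, embed_dim)
out, (h, c) = lstm(x)
print(tuple(out.shape))      # (2, 5, 4): per-step outputs of the top layer only
print(tuple(h.shape))        # (2, 2, 4): one final hidden state per layer

top_state = h[-1]            # final hidden state of the last (top) layer
```

Indexing with `h[-1]`, as the classifiers in this lesson do, therefore keeps working when more layers are added, because it always selects the top layer.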


## RNNs for other tasks

In this unit, we have seen that RNNs can be used for sequence classification, but in fact they can handle many more tasks, such as text generation, machine translation, and more. We will consider those tasks in the next unit.


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
