# Recurrent neural networks

In the previous module, we covered rich semantic representations of text. The architecture we have been using captures the aggregated meaning of the words in a sentence, but it does not take the **order** of the words into account, because the aggregation operation applied to the embeddings removes this information from the original text. Because these models cannot represent word order, they cannot solve more complex or ambiguous tasks such as text generation or question answering.
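
To make the loss of word order concrete, here is a minimal sketch. It assumes a hypothetical three-word vocabulary and a random embedding matrix (neither comes from this lesson), and it averages the word embeddings of two sentences that contain the same words in a different order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and a random embedding matrix (one row per word).
vocab = {"dog": 0, "bites": 1, "man": 2}
embeddings = rng.normal(size=(len(vocab), 4))

def mean_embedding(tokens):
    """Aggregate a sentence by averaging its word embeddings."""
    ids = [vocab[t] for t in tokens]
    return embeddings[ids].mean(axis=0)

a = mean_embedding(["dog", "bites", "man"])
b = mean_embedding(["man", "bites", "dog"])

# The two sentences mean different things, yet the aggregate cannot tell them apart.
same_vector = np.allclose(a, b)
print(same_vector)
```

Any classifier placed on top of such an aggregate receives identical input for both sentences, which is the limitation that recurrent networks are meant to address.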

To capture the meaning of a text sequence, we will use a neural network architecture called a **recurrent neural network**, or RNN. With an RNN, we pass the sentence through the network one token at a time, and the network produces some **state**, which we then pass to the network again together with the next token.

![Image showing an example of recurrent neural network generation.](../../../../../translated_images/rnn.27f5c29c53d727b546ad3961637a267f0fe9ec5ab01f2a26a853c92fcefbb574.pcm.png)

Given an input sequence of tokens $X_0,\dots,X_n$, the RNN creates a sequence of neural network blocks. Each block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as a result. The final state $S_n$ or output $Y_n$ goes into a linear classifier to produce the result. All network blocks share the same weights, and the whole sequence is trained end-to-end in a single backpropagation pass.

> The figure above shows a recurrent neural network in unrolled form (on the left) and in a more compact recurrent representation (on the right). It is important to understand that all RNN cells share the same **weights**.

Because the state vectors $S_0,\dots,S_n$ are passed through the network, the RNN can learn sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to negate certain elements of the state vector.

Inside, each RNN cell contains two weight matrices, $W_H$ and $W_I$, and a bias $b$. At each RNN step, given the input $X_i$ and the input state $S_i$, the output state is computed as $S_{i+1} = f(W_H\times S_i + W_I\times X_i+b)$, where $f$ is an activation function (often $\tanh$).
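
The state update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes, the random weights, and the helper name `rnn_forward` are made up for the example, and $f$ is taken to be $\tanh$.

```python
import numpy as np

def rnn_forward(X, S0, W_H, W_I, b):
    """Apply S_{i+1} = tanh(W_H @ S_i + W_I @ X_i + b) to every token in X.

    X has shape (steps, input_dim). The same W_H, W_I and b are reused at
    every step, which is what "shared weights" means for an unrolled RNN.
    """
    states = [S0]
    for x in X:
        states.append(np.tanh(W_H @ states[-1] + W_I @ x + b))
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, state_dim, steps = 3, 5, 4   # illustrative sizes
W_H = rng.normal(scale=0.5, size=(state_dim, state_dim))
W_I = rng.normal(scale=0.5, size=(state_dim, input_dim))
b = np.zeros(state_dim)

X = rng.normal(size=(steps, input_dim))
states = rnn_forward(X, np.zeros(state_dim), W_H, W_I, b)
print(states.shape)  # (steps + 1, state_dim): the initial state plus one state per token
```

The last row of `states` is the vector that would be handed to the linear classifier.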

> For problems such as text generation (covered in the next unit) or machine translation, we also want an output value at each RNN step. In that case there is another matrix $W_O$, and the output is calculated as $Y_i=f(W_O\times S_i+b_O)$.

Let's see how recurrent neural networks can help us classify our news dataset.

> In the sandbox environment, we need to run the following cell to make sure the required library is installed and the data is prefetched. If you are running locally, you can skip the following cell.


In [1]:
import sys
!{sys.executable} -m pip install --quiet tensorflow_datasets==4.4.0
!cd ~ && wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/data/tfds-ag-news.tgz | tar xz

In [2]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

# We are going to be training pretty large models. In order not to face errors, we need
# to set tensorflow option to grow GPU memory allocation when required
physical_devices = tf.config.list_physical_devices('GPU') 
if len(physical_devices)>0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

ds_train, ds_test = tfds.load('ag_news_subset').values()

When training large models, GPU memory allocation can become a problem. We may also need to experiment with different minibatch sizes, so that the data fits into GPU memory while training stays fast enough. If you are running this code on your own GPU machine, you can experiment with the minibatch size to speed up training.

> **Note**: Certain versions of NVidia drivers are known not to release memory after a model has been trained. We run several examples in this notebook, which can exhaust memory in some setups, especially if you add your own experiments to the same notebook. If you encounter strange errors when you start training a model, it makes sense to restart the notebook kernel.


In [3]:
batch_size = 16
embed_size = 64

## Simple RNN classifier

In a simple RNN, each recurrent unit is a simple linear network that takes an input vector and a state vector and produces a new state vector. In Keras, this is represented by the `SimpleRNN` layer.

Although we could pass one-hot encoded tokens to the RNN layer directly, this is not a good idea because of their high dimensionality. Instead, we use an embedding layer to lower the dimensionality of the word vectors, followed by an RNN layer and finally a `Dense` classifier.

> **Note**: When the dimensionality is not so high, for example with character-level tokenization, it can make sense to pass one-hot encoded tokens directly into the RNN cell.
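
One way to see what the embedding layer saves is to compare it with the one-hot route. The sketch below uses the vocabulary and embedding sizes from this notebook; the random matrix `E` and the token index are placeholders, not trained values.

```python
import numpy as np

vocab_size, embed_size = 20000, 64   # the sizes used in this notebook
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_size))  # stand-in for a trained embedding matrix

token_id = 1234  # arbitrary token index, chosen for illustration
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ E   # multiplies a 20000-dimensional vector by the matrix
via_lookup = E[token_id]   # reads a single 64-dimensional row

same_result = np.allclose(via_matmul, via_lookup)
print(same_result)
```

Both routes give the same 64-dimensional vector, but the lookup never materializes the 20000-dimensional input.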


In [4]:
vocab_size = 20000

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size,
    input_shape=(1,))

model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(4,activation='softmax')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 64)          1280000   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 16)                1296      
_________________________________________________________________
dense (Dense)                (None, 4)                 68        
Total params: 1,281,364
Trainable params: 1,281,364
Non-trainable params: 0
_________________________________________________________________
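
The parameter counts in the summary can be checked by hand. The arithmetic below assumes the layer sizes used above and the weight layout described earlier ($W_I$, $W_H$ and a bias $b$ for the recurrent layer).

```python
vocab_size, embed_size, rnn_units, num_classes = 20000, 64, 16, 4

embedding_params = vocab_size * embed_size
# W_I (input -> state), W_H (state -> state) and the bias b
rnn_params = rnn_units * embed_size + rnn_units * rnn_units + rnn_units
# Dense layer: weight matrix plus bias
dense_params = rnn_units * num_classes + num_classes
total_params = embedding_params + rnn_params + dense_params

print(embedding_params, rnn_params, dense_params, total_params)
# prints: 1280000 1296 68 1281364
```

Almost all of the parameters sit in the embedding layer; the recurrent layer itself is tiny.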


> **Note:** We use an untrained embedding layer here for simplicity, but for better results we could use a pretrained embedding layer based on Word2Vec, as described in the previous unit. Adapting this code to work with pretrained embeddings is a good exercise.

Now let's train our RNN. RNNs are in general quite difficult to train, because once the RNN cells are unrolled along the sequence length, the number of layers involved in backpropagation becomes large. We therefore need to choose a smaller learning rate and train the network on a larger dataset to get good results. This can take a long time, so using a GPU is preferred.

To speed things up, we will train the RNN model only on news titles and omit the description. You can try training with the description included to see whether the model still trains.


In [5]:
def extract_title(x):
    return x['title']

def tupelize_title(x):
    return (extract_title(x),x['label'])

print('Training vectorizer')
vectorizer.adapt(ds_train.take(2000).map(extract_title))

Training vectorizer


In [6]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize_title).batch(batch_size),validation_data=ds_test.map(tupelize_title).batch(batch_size))



<tensorflow.python.keras.callbacks.History at 0x7f3e0030d350>

> **Note** that accuracy is likely to be lower here, because we are training only on news titles.


## Revisiting variable sequences

Remember that the `TextVectorization` layer automatically pads sequences of variable length in a minibatch with pad tokens. However, those pad tokens also take part in training, and they can make it harder for the model to converge.

There are several ways to reduce the amount of padding. One is to reorder the dataset by sequence length and group sequences of similar size together. This can be done with the `tf.data.experimental.bucket_by_sequence_length` function (see the [documentation](https://www.tensorflow.org/api_docs/python/tf/data/experimental/bucket_by_sequence_length)).
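
The effect of bucketing on padding can be illustrated without TensorFlow. The following plain-Python sketch mimics the idea only, not the TensorFlow API; the helper names `padding_needed` and `bucket_by_length`, the sequence lengths, and the single bucket boundary are all assumptions made for the example.

```python
from collections import defaultdict

def padding_needed(batches):
    """Count pad tokens required to bring each batch up to its longest sequence."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

def bucket_by_length(seqs, boundaries, batch_size):
    """Group sequences into length buckets, then cut each bucket into batches."""
    buckets = defaultdict(list)
    for s in seqs:
        # Bucket index = number of boundaries that the sequence length has reached.
        idx = sum(len(s) >= bnd for bnd in boundaries)
        buckets[idx].append(s)
    batches = []
    for idx in sorted(buckets):
        group = buckets[idx]
        batches += [group[i:i + batch_size] for i in range(0, len(group), batch_size)]
    return batches

# Hypothetical token-id sequences of very different lengths.
lengths = [3, 40, 5, 38, 4, 41, 6, 39]
seqs = [[1] * n for n in lengths]

naive = [seqs[i:i + 4] for i in range(0, len(seqs), 4)]
bucketed = bucket_by_length(seqs, boundaries=[10], batch_size=4)

print(padding_needed(naive), padding_needed(bucketed))
# prints: 148 12
```

With short and long sequences mixed in one batch, most of each batch is padding; once similar lengths are grouped, only a handful of pad tokens remain.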

Another approach is to use **masking**. In Keras, some layers support an additional input that indicates which tokens should be taken into account during training. To add masking to our model, we can either include a separate `Masking` layer ([docs](https://keras.io/api/layers/core_layers/masking/)), or set the `mask_zero=True` parameter of our `Embedding` layer.
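
The sketch below shows one common way a recurrent layer can honor such a mask: at masked positions the previous state is simply carried forward. It is a NumPy illustration under assumptions (random weights, a toy embedding table whose row 0 stands for the pad token, a hypothetical helper `rnn_final_state`), not the Keras implementation itself.

```python
import numpy as np

def rnn_final_state(X, mask, W_H, W_I, b):
    """Run a tanh RNN, carrying the previous state through masked-out steps."""
    s = np.zeros(W_H.shape[0])
    for x, keep in zip(X, mask):
        if keep:
            s = np.tanh(W_H @ s + W_I @ x + b)
    return s

rng = np.random.default_rng(0)
embed_dim, state_dim = 4, 3
W_H = rng.normal(scale=0.5, size=(state_dim, state_dim))
W_I = rng.normal(scale=0.5, size=(state_dim, embed_dim))
b = rng.normal(scale=0.1, size=state_dim)
E = rng.normal(size=(10, embed_dim))   # toy embedding table; row 0 is the pad token

tokens = np.array([5, 2, 7])
padded = np.array([5, 2, 7, 0, 0])     # same sentence padded with zeros at the end

unpadded_state = rnn_final_state(E[tokens], tokens != 0, W_H, W_I, b)
masked_state = rnn_final_state(E[padded], padded != 0, W_H, W_I, b)
unmasked_state = rnn_final_state(E[padded], np.ones(5, dtype=bool), W_H, W_I, b)

print(np.allclose(masked_state, unpadded_state))    # masking makes the padding invisible
print(np.allclose(unmasked_state, unpadded_state))  # without a mask, pad tokens alter the state
```

With the mask, the padded and unpadded versions of the sentence end in the same state, so the classifier sees the same input regardless of how much padding the minibatch required.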

> **Note**: This training takes around 5 minutes to complete one epoch on the whole dataset. Feel free to interrupt the training at any time if you run out of patience. You can also limit the amount of data used for training by adding a `.take(...)` clause after the `ds_train` and `ds_test` datasets.


In [7]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size,embed_size,mask_zero=True),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(4,activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))



<tensorflow.python.keras.callbacks.History at 0x7f3dec118850>

Now that we are using masking, we can train the model on the whole dataset of titles and descriptions.

> **Note**: Did you notice that we have been using a vectorizer trained on the news titles rather than on the whole body of the article? This can cause some tokens to be ignored, so it would be better to re-train the vectorizer. However, the effect is likely to be small, so we stick with the previously trained vectorizer for simplicity.


## LSTM: Long short-term memory

One of the main problems of RNNs is **vanishing gradients**. RNNs can be quite long, which makes it hard to propagate gradients all the way back to the first layer of the network during backpropagation. When this happens, the network cannot learn relationships between distant tokens. One way to avoid this problem is to introduce **explicit state management** by using **gates**. The two most common architectures that use gates are **long short-term memory** (LSTM) and the **gated recurrent unit** (GRU). We cover LSTMs here.
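
The vanishing-gradient effect can be reproduced numerically. The sketch below is a deliberately simplified setting: inputs are omitted, and the recurrent matrix is assumed to be 0.5 times an orthogonal matrix so that the shrinkage is easy to reason about. It backpropagates a gradient through 30 unrolled `tanh` steps and records its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, steps = 8, 30

# Assumption for illustration: W_H is 0.5 times an orthogonal matrix, so every
# backward step shrinks the gradient norm by at least a factor of 2.
Q, _ = np.linalg.qr(rng.normal(size=(state_dim, state_dim)))
W_H = 0.5 * Q

# Forward pass (inputs omitted for brevity): s_next = tanh(W_H @ s_prev)
states = [rng.normal(size=state_dim)]
for _ in range(steps):
    states.append(np.tanh(W_H @ states[-1]))

# Backward pass: pretend the loss gradient at the last state is all ones.
grad = np.ones(state_dim)
norms = [np.linalg.norm(grad)]
for s_next in reversed(states[1:]):
    # Chain rule: dL/ds_prev = W_H^T @ ((1 - s_next^2) * dL/ds_next)
    grad = W_H.T @ ((1.0 - s_next ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])
```

After 30 steps almost nothing of the gradient is left, so the earliest tokens receive practically no learning signal. Real networks do not have such a neatly scaled matrix, but the same multiplication of many per-step factors is what makes long sequences hard to train.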

![Image showing an example of a long short-term memory cell](../../../../../lessons/5-NLP/16-RNN/images/long-short-term-memory-cell.svg)

An LSTM network is organized in a manner similar to an RNN, but two states are passed from layer to layer: the actual state $c$ and the hidden vector $h$. At each unit, the hidden vector $h_{t-1}$ is combined with the input $x_t$, and together they control what happens to the state $c_t$ and the output $h_{t}$ through **gates**. Each gate has a sigmoid activation (output in the range $[0,1]$), which can be thought of as a bitwise mask when multiplied by the state vector. LSTMs have the following gates (from left to right in the picture above):
* **forget gate**, which determines which components of the vector $c_{t-1}$ we need to forget and which to pass through.
* **input gate**, which determines how much information from the input vector and the previous hidden vector should be incorporated into the state vector.
* **output gate**, which takes the new state vector and decides which of its components are used to produce the new hidden vector $h_t$.
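
The three gates can be written out as a single step function. This is a minimal NumPy sketch of the standard LSTM equations; the function name `lstm_step`, the stacking order of the gates inside `W`, and the random weights are choices made for this example, and a library implementation may order them differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps the concatenated [h_prev, x_t] to four stacked pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])          # forget gate: which parts of c_prev to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new information to admit
    g = np.tanh(z[2 * n:3 * n])  # candidate values computed from h_prev and x_t
    o = sigmoid(z[3 * n:4 * n])  # output gate: which parts of the state to expose
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
units, input_dim = 8, 64   # e.g. an 8-unit LSTM fed with 64-dimensional embeddings
W = rng.normal(scale=0.1, size=(4 * units, units + input_dim))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, input_dim)):   # five illustrative time steps
    h, c = lstm_step(x, h, c, W, b)

print(h.shape, W.size + b.size)  # W.size + b.size is 4 * (units * (units + input_dim) + units)
```

Because every gate output lies between 0 and 1, multiplying by it acts as a soft mask on the state, and the additive update of `c_t` gives gradients a more direct path back through time than the repeated `tanh` of a simple RNN.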

The components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we guess that it refers to a woman and raise the flag in the state that says we have a female noun in the sentence. When we later encounter the words *and Tom*, we raise the flag that says we have a plural noun. Thus, by manipulating the state, we can keep track of the grammatical properties of the sentence.

> **Note**: Here is a great resource for understanding the internals of LSTMs: [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.

Although the internal structure of an LSTM cell may look complex, Keras hides this implementation inside the `LSTM` layer, so the only thing we need to change in the example above is the recurrent layer:


In [8]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size),
    keras.layers.LSTM(8),
    keras.layers.Dense(4,activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(8),validation_data=ds_test.map(tupelize).batch(8))



<tensorflow.python.keras.callbacks.History at 0x7f3d6af5c350>

> **Note** that training LSTMs is also quite slow, and you may not see much increase in accuracy at the beginning of training. You may need to continue training for some time to achieve good accuracy.


## Bidirectional and multilayer RNNs

In the examples so far, the recurrent networks operate from the beginning of a sequence to the end. This feels natural to us because it follows the direction in which we read or listen to speech. However, for scenarios that require random access to the input sequence, it makes more sense to run the recurrent computation in both directions. RNNs that allow computation in both directions are called **bidirectional** RNNs, and they can be created by wrapping the recurrent layer in a special `Bidirectional` layer.

> **Note**: The `Bidirectional` layer makes two copies of the layer inside it and sets the `go_backwards` property of one of the copies to `True`, so that it runs in the opposite direction along the sequence.
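
Conceptually, a bidirectional layer amounts to two independent recurrent passes whose results are joined. The NumPy sketch below assumes simple `tanh` cells, random weights, and concatenation as the way of merging the two directions; the helper names are hypothetical.

```python
import numpy as np

def rnn_final_state(X, W_H, W_I, b):
    """Final state of a tanh RNN run over the rows of X in the given order."""
    s = np.zeros(W_H.shape[0])
    for x in X:
        s = np.tanh(W_H @ s + W_I @ x + b)
    return s

def init_params(rng, input_dim, state_dim):
    """Random weights for one direction (illustrative values only)."""
    return (rng.normal(scale=0.5, size=(state_dim, state_dim)),
            rng.normal(scale=0.5, size=(state_dim, input_dim)),
            np.zeros(state_dim))

rng = np.random.default_rng(0)
input_dim, state_dim, steps = 4, 3, 6
forward_params = init_params(rng, input_dim, state_dim)
backward_params = init_params(rng, input_dim, state_dim)  # separate weights per direction

X = rng.normal(size=(steps, input_dim))
h_forward = rnn_final_state(X, *forward_params)
h_backward = rnn_final_state(X[::-1], *backward_params)   # reads the sequence right to left
h_bi = np.concatenate([h_forward, h_backward])

print(h_bi.shape)  # (2 * state_dim,)
```

The doubled output size is the reason the layer that follows a bidirectional wrapper sees twice as many features as the wrapped layer has units.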

Recurrent networks, unidirectional or bidirectional, capture patterns within a sequence and store them in state vectors or return them as output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher-level patterns, built from the lower-level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

![Image showing a multilayer long short-term memory RNN](../../../../../translated_images/multi-layer-lstm.dd975e29bb2a59fe58b429db833932d734c81f211cad2783797a9608984acb8c.pcm.jpg)

*Picture from [this wonderful post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López.*

Keras makes constructing these networks easy, because you just need to add more recurrent layers to the model. For all layers except the last one, we need to specify the `return_sequences=True` parameter, because we need the layer to return all intermediate states, and not just the final state of the recurrent computation.
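
Why the lower layer has to return every intermediate state becomes clear when two layers are stacked by hand. The NumPy sketch below uses a hypothetical `rnn_layer` helper with a `return_sequences` flag that imitates the meaning of the Keras parameter; the sizes and weights are arbitrary.

```python
import numpy as np

def rnn_layer(X, W_H, W_I, b, return_sequences):
    """tanh RNN over the rows of X; returns all states or only the last one."""
    s = np.zeros(W_H.shape[0])
    states = []
    for x in X:
        s = np.tanh(W_H @ s + W_I @ x + b)
        states.append(s)
    return np.stack(states) if return_sequences else s

def init_params(rng, input_dim, state_dim):
    """Random weights for one layer (illustrative values only)."""
    return (rng.normal(scale=0.5, size=(state_dim, state_dim)),
            rng.normal(scale=0.5, size=(state_dim, input_dim)),
            np.zeros(state_dim))

rng = np.random.default_rng(0)
steps, input_dim, units1, units2 = 6, 4, 5, 2
layer1 = init_params(rng, input_dim, units1)
layer2 = init_params(rng, units1, units2)   # its input size is the first layer's state size

X = rng.normal(size=(steps, input_dim))
seq1 = rnn_layer(X, *layer1, return_sequences=True)      # one state per time step
out2 = rnn_layer(seq1, *layer2, return_sequences=False)  # only the final state

print(seq1.shape, out2.shape)  # (steps, units1) and (units2,)
```

If the first layer returned only its final state, the second layer would receive a single vector instead of a sequence and would have nothing to iterate over.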

Let's build a two-layer bidirectional LSTM for our classification problem.

> **Note** that this code again takes quite a long time to complete, but it gives us the highest accuracy we have seen so far. So it may be worth waiting to see the result.


In [9]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, 128, mask_zero=True),
    keras.layers.Bidirectional(keras.layers.LSTM(64,return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),    
    keras.layers.Dense(4,activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(batch_size),
          validation_data=ds_test.map(tupelize).batch(batch_size))



## RNNs for other tasks

So far, we have used RNNs to classify sequences of text. But they can handle many more tasks, such as text generation and machine translation; we will consider those tasks in the next unit.


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
