# Code recurrent neural networks

This demo walks you through building recurrent neural networks to solve problems with text data. The same methods can also be used for any other sequential data, such as time series or sound.

## What will you learn in this course? 🧐🧐

This course focuses on the technical side of building recurrent neural networks and shows in detail how to code the three new layers we have studied!
Here is the outline:

* Recurrent layers
  * SimpleRNN
  * GRU
  * LSTM
* Build a recurrent neural network

## Recurrent layers

In this section we will focus strictly on the code for the three new layers we just learned about: SimpleRNN, GRU, and LSTM.

### SimpleRNN

The simplest recurrent layer is `tf.keras.layers.SimpleRNN`. You can find the documentation here: [SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN).

In [4]:
import tensorflow as tf

from tensorflow.keras.layers import SimpleRNN

srnn = SimpleRNN(units=16, return_sequences=False, return_state=False)

# units indicates the number of neurons in this layer
# return_sequences indicates whether the layer should output the full sequence
#   of outputs computed while processing the input sequence or just the last
#   output 
# return_state indicates whether to return the hidden state in a separate object

In [3]:
# Let's create an example input for this layer and see how it works
batch_size = 1
seq_len = 10
channels = 4
input = tf.random.normal(shape=(batch_size, seq_len, channels))

input

<tf.Tensor: shape=(1, 10, 4), dtype=float32, numpy=
array([[[-1.1635678 ,  0.16855568, -0.75133115, -0.87421376],
        [-0.79280365,  0.18734804,  1.9284809 , -0.6769638 ],
        [-0.3092225 , -0.5739422 ,  0.12904677,  0.71961206],
        [-1.0049052 ,  0.7402705 , -0.5423828 , -0.67694044],
        [-0.3986196 ,  0.9453884 ,  0.44661582, -0.4180978 ],
        [ 0.5253731 ,  1.0866215 , -1.3690103 , -0.40669233],
        [-0.6830413 , -0.19420567, -0.44058374, -0.05411001],
        [-1.7652224 ,  0.3612688 ,  2.461906  ,  2.5993505 ],
        [-1.3611202 ,  0.0565971 ,  1.7649431 ,  1.009644  ],
        [ 0.30490953,  0.48030594, -0.42446727, -0.7473773 ]]],
      dtype=float32)>

In [5]:
# now let's apply the simpleRNN layer and see what comes out
srnn(input)

# the output is a batch of one observation with 16 representation channels,
# which corresponds to the number of units in the layer

<tf.Tensor: shape=(1, 16), dtype=float32, numpy=
array([[-0.69963205, -0.8901289 , -0.65305644,  0.89033777, -0.5493674 ,
        -0.30030796, -0.87140274, -0.17787921, -0.710449  , -0.48489323,
         0.82255244,  0.10862522, -0.8405781 , -0.29141235, -0.58941853,
        -0.23059027]], dtype=float32)>

In [7]:
# let's change things up by returning the whole output sequence
srnn = SimpleRNN(units=16, return_sequences=True, return_state=False)

srnn(input)
# the layer now preserves the sequential structure of the input: instead of
# a 2D tensor, it outputs a 3D tensor of shape (batch_size, seq_len, units)

<tf.Tensor: shape=(1, 10, 16), dtype=float32, numpy=
array([[[ 0.27807614,  0.24998415, -0.206076  , -0.18585703,
         -0.13200063, -0.30927563,  0.30215526, -0.6047987 ,
          0.06832609,  0.09680297, -0.48995888,  0.74964297,
          0.12882937,  0.5754363 ,  0.26132122,  0.19606563],
        [ 0.4108236 , -0.4610422 , -0.06890181, -0.94440496,
          0.3286013 , -0.9087894 ,  0.90982765,  0.62182766,
          0.71168864,  0.2125421 ,  0.5943476 ,  0.11756188,
          0.33666435,  0.15585865, -0.67677826,  0.18991011],
        [ 0.7265542 , -0.32163596, -0.6203475 ,  0.14377876,
          0.5476527 , -0.23619634,  0.14602633, -0.62572193,
         -0.01003208,  0.30781385,  0.94770104,  0.52899766,
         -0.15797052,  0.06652664,  0.02946644,  0.21035738],
        [ 0.51004964,  0.6138037 ,  0.02415972, -0.29176757,
         -0.8195665 , -0.20461814,  0.752831  , -0.7716506 ,
         -0.23259985,  0.19450921, -0.19646755,  0.6881266 ,
         -0.45885983,  0.5566

In [8]:
# what happens if we return the state as well?

srnn = SimpleRNN(units=16, return_sequences=False, return_state=True)

srnn(input)
# now the layer returns two objects: the output and the hidden state.
# For SimpleRNN they carry the same values, as you can see

[<tf.Tensor: shape=(1, 16), dtype=float32, numpy=
 array([[-0.1098024 , -0.60982233, -0.32496956,  0.15012479, -0.85631555,
          0.10025096, -0.80801576, -0.7833567 ,  0.57048374,  0.366127  ,
         -0.06957667,  0.11576582,  0.27912173,  0.30287528, -0.6361747 ,
          0.21078372]], dtype=float32)>,
 <tf.Tensor: shape=(1, 16), dtype=float32, numpy=
 array([[-0.1098024 , -0.60982233, -0.32496956,  0.15012479, -0.85631555,
          0.10025096, -0.80801576, -0.7833567 ,  0.57048374,  0.366127  ,
         -0.06957667,  0.11576582,  0.27912173,  0.30287528, -0.6361747 ,
          0.21078372]], dtype=float32)>]
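
To make the effect of `return_sequences` and `return_state` concrete, here is a minimal NumPy sketch of the recurrence usually used to describe a simple RNN, `h_t = tanh(x_t @ W + h_{t-1} @ U + b)`. The function `simple_rnn_forward` and the random weights are illustrative assumptions rather than Keras internals, so the values will not match the outputs above; only the shapes and the relationship between output and state are meant to carry over.

```python
import numpy as np

def simple_rnn_forward(x, kernel, recurrent_kernel, bias,
                       return_sequences=False, return_state=False):
    """Run h_t = tanh(x_t @ kernel + h_{t-1} @ recurrent_kernel + bias)."""
    batch_size, seq_len, _ = x.shape
    units = kernel.shape[1]
    h = np.zeros((batch_size, units))  # the hidden state starts at zero
    outputs = []
    for t in range(seq_len):
        h = np.tanh(x[:, t, :] @ kernel + h @ recurrent_kernel + bias)
        outputs.append(h)
    # Full sequence: (batch, seq_len, units); last step only: (batch, units)
    output = np.stack(outputs, axis=1) if return_sequences else h
    return (output, h) if return_state else output

rng = np.random.default_rng(0)
batch_size, seq_len, channels, units = 1, 10, 4, 16
x = rng.normal(size=(batch_size, seq_len, channels))
kernel = rng.normal(scale=0.1, size=(channels, units))
recurrent_kernel = rng.normal(scale=0.1, size=(units, units))
bias = np.zeros(units)

last = simple_rnn_forward(x, kernel, recurrent_kernel, bias)
full = simple_rnn_forward(x, kernel, recurrent_kernel, bias,
                          return_sequences=True)
out, state = simple_rnn_forward(x, kernel, recurrent_kernel, bias,
                                return_state=True)

print(last.shape)               # (1, 16)
print(full.shape)               # (1, 10, 16)
print(np.allclose(out, state))  # True
```

In this sketch the last output and the hidden state are literally the same array, which is why the two tensors printed above are identical.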

### GRU

Now let's see how we can code GRU layers. You can read the documentation here: [GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU).

In [10]:
from tensorflow.keras.layers import GRU

gru = GRU(units=16, return_sequences=False, return_state=False)

gru(input)

# it works in much the same way as the SimpleRNN layer

<tf.Tensor: shape=(1, 16), dtype=float32, numpy=
array([[-1.0808587e-02,  1.6081883e-02,  2.5624454e-01, -1.7327070e-04,
        -3.6342442e-04, -1.3345486e-01, -1.9287676e-02,  1.6523668e-01,
         1.5990761e-01, -1.6242927e-01,  9.7698882e-02,  7.2466955e-04,
         1.1103213e-02, -2.1072997e-01, -1.5388536e-01,  9.7298890e-02]],
      dtype=float32)>

In [11]:
gru = GRU(units=16, return_sequences=True, return_state=False)

gru(input)

# you can still use return_sequences in order to preserve the sequential
# nature of the data

<tf.Tensor: shape=(1, 10, 16), dtype=float32, numpy=
array([[[-0.06934597,  0.0994537 ,  0.19374685,  0.06849788,
         -0.24599184,  0.20088878, -0.03349971,  0.04861541,
         -0.05434639,  0.07263847, -0.20479624, -0.00352959,
          0.1506898 , -0.30446586,  0.01592659, -0.21474351],
        [-0.23376793, -0.21181841,  0.49358296,  0.00247522,
          0.09905542, -0.07676595, -0.11009975,  0.03201354,
         -0.24974006,  0.31153888, -0.13860452,  0.18306912,
         -0.01656908, -0.07648543, -0.00100983, -0.03535505],
        [-0.14887926, -0.01346983,  0.16340686,  0.04806547,
          0.03504476, -0.02840108,  0.10916819, -0.06649324,
         -0.11411075,  0.00463158, -0.05082737,  0.10810807,
          0.0317376 , -0.11074291,  0.14496876,  0.00664135],
        [-0.20463498, -0.03600414,  0.30112484,  0.18771484,
         -0.13481319, -0.00141429,  0.01309087, -0.0438042 ,
         -0.16769245,  0.15123951, -0.11137815, -0.01092391,
          0.14892133, -0.2592

In [15]:
gru = GRU(units=16, return_sequences=True, return_state=True)

gru(input)

# the state is always equal to the output at the last time step, i.e. the
# values returned after processing the whole sequence

[<tf.Tensor: shape=(1, 10, 16), dtype=float32, numpy=
 array([[[ 1.68479919e-01,  8.23669434e-02,  3.23924780e-01,
          -8.42784345e-02, -9.56254266e-03, -1.59966320e-01,
           2.76096463e-01, -1.02438785e-01,  2.03701615e-01,
          -1.92683958e-03,  1.70403346e-01,  1.54588059e-01,
          -8.10326338e-02, -5.28476685e-02,  1.08160235e-01,
          -2.27760404e-01],
         [-1.32329106e-01, -1.27311483e-01,  2.83936799e-01,
           1.67768389e-01, -3.10528845e-01, -3.13403666e-01,
           1.92587525e-02, -2.44162321e-01,  3.72009426e-01,
           9.21765193e-02,  5.12781441e-01,  9.05069336e-02,
           8.67996067e-02, -1.11345544e-01,  2.73031592e-02,
          -2.54611522e-02],
         [-6.42747879e-02, -1.70516461e-01,  1.37872756e-01,
           9.04626772e-02, -2.21704453e-01,  3.52517068e-02,
          -1.12043321e-01, -2.15143144e-01,  1.13908403e-01,
           2.45724559e-01,  2.97635645e-01, -1.03899032e-01,
          -9.39854234e-03, -4.962749

### LSTM

Last but not least, let's see how to code an LSTM layer. Check the documentation: [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM).

In [17]:
from tensorflow.keras.layers import LSTM

lstm = LSTM(units=16, return_sequences=False, return_state=False)

lstm(input)

# it works in much the same way as GRU

<tf.Tensor: shape=(1, 16), dtype=float32, numpy=
array([[ 0.07914715,  0.07655661, -0.02766241, -0.12622504,  0.02012474,
        -0.28260395, -0.07853369,  0.1230915 , -0.15065426,  0.01768514,
         0.07268874, -0.12879959,  0.08355555, -0.01858526, -0.13285434,
        -0.05467749]], dtype=float32)>

In [18]:
lstm = LSTM(units=16, return_sequences=True, return_state=True)

lstm(input)

# When using return_state, the layer returns
# 1. the output (full sequence or last step, depending on return_sequences)
# 2. the hidden state, which is equal to the output at the last time step
# 3. the cell state

[<tf.Tensor: shape=(1, 10, 16), dtype=float32, numpy=
 array([[[-7.14064986e-02, -4.10236940e-02,  3.96738835e-02,
           2.14840518e-03,  2.57249698e-02, -3.26019898e-02,
           5.92622161e-02,  5.29569872e-02, -7.35800155e-03,
           1.77562386e-01, -1.37414224e-02, -1.07611351e-01,
           1.18049234e-01,  2.45713983e-02,  9.78207067e-02,
           1.91633254e-02],
         [ 2.09015124e-02, -1.21876560e-01,  8.26791860e-03,
          -1.52107090e-01,  1.10287145e-02,  1.33250102e-01,
           5.73449582e-02,  9.98590067e-02, -3.48599367e-02,
           1.83004320e-01,  5.15443534e-02,  3.12008243e-02,
           4.25682738e-02,  2.20248178e-01,  1.50843382e-01,
          -1.47773579e-01],
         [ 3.81631702e-02, -4.77095842e-02,  8.64316244e-03,
          -3.09723355e-02, -7.94051439e-02,  4.97619249e-02,
           6.55366480e-02,  1.13652036e-01, -6.12777434e-02,
           7.59801492e-02,  7.06352741e-02,  1.14705414e-01,
          -2.13253293e-02,  1.701525
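
The difference between the hidden state and the cell state is easier to see in a small NumPy sketch of the textbook LSTM equations. The function `lstm_forward`, the gate ordering, and the random weights are assumptions made for illustration, not the Keras implementation, so only the shapes and the roles of the three returned objects should be read from it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, kernel, recurrent_kernel, bias):
    """Return (all outputs, final hidden state, final cell state)."""
    batch_size, seq_len, _ = x.shape
    units = recurrent_kernel.shape[0]
    h = np.zeros((batch_size, units))  # hidden state, also the layer output
    c = np.zeros((batch_size, units))  # cell state, the internal memory
    outputs = []
    for t in range(seq_len):
        z = x[:, t, :] @ kernel + h @ recurrent_kernel + bias
        i, f, g, o = np.split(z, 4, axis=1)  # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the memory
        h = sigmoid(o) * np.tanh(c)                   # expose part of it
        outputs.append(h)
    return np.stack(outputs, axis=1), h, c

rng = np.random.default_rng(0)
channels, units = 4, 16
x = rng.normal(size=(1, 10, channels))
kernel = rng.normal(scale=0.5, size=(channels, 4 * units))
recurrent_kernel = rng.normal(scale=0.5, size=(units, 4 * units))
bias = np.zeros(4 * units)

sequence, hidden, cell = lstm_forward(x, kernel, recurrent_kernel, bias)
print(sequence.shape, hidden.shape, cell.shape)  # (1, 10, 16) (1, 16) (1, 16)
print(np.allclose(sequence[:, -1, :], hidden))   # True
print(np.allclose(hidden, cell))                 # False
```

The hidden state is the last element of the output sequence, while the cell state is a separate tensor that is never returned as an output.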

Now that you know how to code the three different recurrent layers, let's build a recurrent neural network on text data.
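
Before moving on, one practical difference between the three layers is their size. The sketch below counts trainable weights using the usual formulas, for the same toy setting as above (4 input channels, 16 units). The helper names are hypothetical, and the GRU formula assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate.

```python
def simple_rnn_params(input_dim, units):
    # one kernel, one recurrent kernel, one bias vector
    return units * (input_dim + units + 1)

def gru_params(input_dim, units):
    # three blocks (update gate, reset gate, candidate),
    # each with two bias vectors (assumes reset_after=True)
    return 3 * (units * (input_dim + units) + 2 * units)

def lstm_params(input_dim, units):
    # four blocks (input, forget, output gates and candidate)
    return 4 * units * (input_dim + units + 1)

channels, units = 4, 16
print(simple_rnn_params(channels, units))  # 336
print(gru_params(channels, units))         # 1056
print(lstm_params(channels, units))        # 1344
```

For the same number of units, a GRU carries roughly three times the weights of a SimpleRNN and an LSTM four times.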

## Build a recurrent network

Let's walk through an example on a toy dataset.

In [24]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
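try:  # TensorFlow 2.6 and later expose the layer directly
  from tensorflow.keras.layers import TextVectorization
except ImportError:  # older releases keep it under the experimental namespace
  from tensorflow.keras.layers.experimental.preprocessing import TextVectorization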

We'll use the same dataset as in the embedding and word2vec demos: the IMDB movie review dataset.

In [25]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

# after downloading the data we remove the unlabeled examples stored in the
# unsup folder (relative path, consistent with cache_dir='.' above)
remove_dir = os.path.join("aclImdb", "train", "unsup")
shutil.rmtree(remove_dir)

Now let's load the data into a batch generator.

In [35]:
batch_size = 128
seed = 123 # seed is mandatory here to prevent overlap between train and validation
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', # path to the folder containing the text files
    batch_size=batch_size, # the size of a batch of data
    validation_split=0.2, # The proportion of data in the validation set
    subset='training', # Forms the train set
    seed=seed) # similar to random_state
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2,
    subset='validation', # forms the validation set
    seed=seed)  

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
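
The comment on `seed` deserves a short illustration. The sketch below is a simplified stand-in for what happens when a dataset is shuffled and then split; `split_indices` is a hypothetical helper, not the Keras implementation. It shows why both calls must share the same seed: the two subsets are slices of the same shuffled order.

```python
import random

def split_indices(n_samples, validation_split, seed, subset):
    """Shuffle indices with a fixed seed, then slice off one subset."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffled order
    n_val = int(n_samples * validation_split)
    if subset == "training":
        return indices[:n_samples - n_val]
    return indices[n_samples - n_val:]

train_idx = split_indices(25000, 0.2, seed=123, subset="training")
val_idx = split_indices(25000, 0.2, seed=123, subset="validation")
print(len(train_idx), len(val_idx))        # 20000 5000
print(set(train_idx).isdisjoint(val_idx))  # True

# With a different seed the second call shuffles differently, so some
# "validation" files also appear in the training set
leaky_val_idx = split_indices(25000, 0.2, seed=456, subset="validation")
print(len(set(train_idx) & set(leaky_val_idx)))
```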


Let's take a look at a batch of data:

In [47]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])

tf.Tensor(0, shape=(), dtype=int32) tf.Tensor(b'I thought it was a New-York located movie: wrong! It\'s a little British countryside setting.<br /><br />I thought it was a comedy: wrong! It\'s a drama.... Well, up to the last third, because after the story becomes totally "abracadabrantesque", the symbolic word for a French presidential mandate. It means, close to nonsense even it the motives would like to bring a sincere feeling.<br /><br />What Do I have left? Maybe, a good duo of actress: Yes, I know, they are 3 friends, but the redhead policewoman is a bit invisible for me. The tall doctoress surprises by her punch, and McDowell delivers a fine acting as usual, all in delicate, soft and almost mute attitude. This gentleness puzzles me, because as other fine artists or directors, the same pattern is repeating over and over. In her case, it\'s like, whatever the movie, it\'s always the same character defined by her feelings, her values, who lives infinite different stories. I still d

Some preprocessing remains to be done. One option is to use spaCy: load all the texts in memory, remove stop words, and lemmatize the tokens. A more memory-friendly option is to create a preprocessing layer.

In [37]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  # transform all characters to lowercase
  lowercase = tf.strings.lower(input_data)
  # replace every '<br />' tag with a space
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # replace punctuation with an empty string
  # '[%s]' % re.escape(string.punctuation) uses printf-style formatting, a
  # syntax borrowed from C: [] defines a character class, and %s gets replaced
  # by re.escape(string.punctuation) (the escaped punctuation characters)
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')


# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because the samples are not all the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization, # string tensor input -> string tensor output
    max_tokens=vocab_size, # int, keep only the vocab_size most common tokens
    output_mode='int', # encode each token as an integer index
    output_sequence_length=sequence_length) # truncates or pads sequences to
    # this length

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x) # build a text-only tf dataset
vectorize_layer.adapt(text_ds) # build the vocabulary from the most frequent tokens
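
To see what this cell does to a single review, here is a plain-Python approximation of the same steps: standardize, build a vocabulary by frequency, then map tokens to integers with padding. The helpers `standardize`, `build_vocabulary`, and `vectorize` and the two sample reviews are made up for illustration. The sketch assumes, as in Keras's `int` mode, that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
import re
import string
from collections import Counter

def standardize(text):
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

def build_vocabulary(texts, max_tokens):
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    # two slots are taken by the padding token and the unknown token
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]                       # truncate long texts
    return ids + [0] * (sequence_length - len(ids))   # pad short texts

reviews = ["Great movie!<br /><br />I loved it.", "Terrible movie, I hated it."]
vocabulary = build_vocabulary(reviews, max_tokens=6)
encoded = vectorize("I loved this movie", vocabulary, sequence_length=8)

print(standardize(reviews[0]))  # great movie  i loved it
print(len(vocabulary))          # 6
print(encoded)
```

Every index produced this way is strictly smaller than `max_tokens`, which matters when sizing the embedding layer.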

Now let's define a model that includes recurrent layers. Note that if you wish to stack recurrent layers, every layer that feeds another recurrent layer has to preserve the sequential structure of the data with `return_sequences=True`. The last recurrent layer may use `return_sequences=False`: it then returns only its final output, a 2D tensor of shape `(batch_size, units)`, so you can use dense layers afterwards.
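
The shape bookkeeping behind this rule can be checked with a small NumPy sketch. `rnn_layer` is a hypothetical stand-in for a recurrent layer with random weights; it is only meant to track shapes, assuming an embedded batch of shape `(batch, sequence_length, embedding_dim)`.

```python
import numpy as np

def rnn_layer(x, units, return_sequences, rng):
    """Simple tanh RNN with random weights, used only to track shapes."""
    batch_size, seq_len, input_dim = x.shape  # requires a 3D input
    kernel = rng.normal(scale=0.1, size=(input_dim, units))
    recurrent_kernel = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((batch_size, units))
    outputs = []
    for t in range(seq_len):
        h = np.tanh(x[:, t, :] @ kernel + h @ recurrent_kernel)
        outputs.append(h)
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
embedded = rng.normal(size=(2, 100, 32))  # (batch, sequence_length, embedding_dim)

first = rnn_layer(embedded, units=64, return_sequences=True, rng=rng)
second = rnn_layer(first, units=32, return_sequences=False, rng=rng)
print(first.shape)   # (2, 100, 64)
print(second.shape)  # (2, 32)

# A 2D output has no time axis left, so it cannot feed another recurrent layer
try:
    rnn_layer(second, units=16, return_sequences=False, rng=rng)
except ValueError as err:
    print("cannot stack on a 2D tensor:", err)
```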

In [51]:
embedding_dim=32 # the dimensionality of the representation space

model = Sequential([
  vectorize_layer, # this layer encodes each string as a sequence of integers
  Embedding(vocab_size, embedding_dim, name="embedding"), # the embedding layer
  # input_dim=vocab_size is enough here: max_tokens already counts the padding
  # index 0 and the out-of-vocabulary index 1, so every index is < vocab_size
  SimpleRNN(units=64, return_sequences=True), # maintains the sequential nature
  SimpleRNN(units=32, return_sequences=False), # returns the last output
  Dense(16, activation='relu'), # a dense layer
  Dense(1, activation="sigmoid") # the prediction layer
])

We need to compile the model so it can train on the data.

In [52]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
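
For reference, binary cross-entropy is the average of `-(y*log(p) + (1-y)*log(1-p))` over the batch, where `p` is the sigmoid output of the last layer. A minimal pure-Python sketch follows; the function name and the clipping constant `eps` are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# confident and correct predictions give a small loss
print(binary_crossentropy([1, 0], [0.9, 0.1]))  # about 0.105
# confident and wrong predictions give a large loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # about 2.303
```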

We can now train the model.

In [53]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5a14718f10>

In [54]:
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_1 (TextVe (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 32)           320000    
_________________________________________________________________
simple_rnn_12 (SimpleRNN)    (None, 100, 64)           6208      
_________________________________________________________________
simple_rnn_13 (SimpleRNN)    (None, 32)                3104      
_________________________________________________________________
dense_12 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 17        
Total params: 329,857
Trainable params: 329,857
Non-trainable params: 0
________________________________________________
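
The parameter counts in this summary can be reproduced by hand, which is a useful sanity check on how each layer is wired. The sketch below uses the standard formulas for embedding, simple recurrent, and dense layers; the helper names are hypothetical.

```python
vocab_size, embedding_dim = 10000, 32

def simple_rnn_params(input_dim, units):
    # kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # weights + bias
    return units * (input_dim + 1)

counts = {
    "embedding": vocab_size * embedding_dim,
    "simple_rnn (64 units)": simple_rnn_params(embedding_dim, 64),
    "simple_rnn (32 units)": simple_rnn_params(64, 32),
    "dense (16 units)": dense_params(32, 16),
    "dense (1 unit)": dense_params(16, 1),
}
for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()))  # total: 329857
```

Almost all of the weights sit in the embedding layer; the recurrent and dense layers together account for fewer than 10,000 parameters.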

There is a lot of overfitting! Let's try the other two types of layers.

In [41]:
embedding_dim=32 # the dimensionality of the representation space

model = Sequential([
  vectorize_layer, # this layer encodes each string as a sequence of integers
  Embedding(vocab_size, embedding_dim, name="embedding"), # the embedding layer
  # input_dim=vocab_size is enough here: max_tokens already counts the padding
  # index 0 and the out-of-vocabulary index 1, so every index is < vocab_size
  GRU(units=64, return_sequences=True), # maintains the sequential nature
  GRU(units=32, return_sequences=False), # returns the last output
  Dense(16, activation='relu'), # a dense layer
  Dense(1, activation="sigmoid") # the prediction layer
])

In [42]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

In [43]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5a148f7710>

It seems that using GRU instead of SimpleRNN helps the model a lot with the overfitting problem. Now let's compare this with LSTM.

In [44]:
embedding_dim=32 # the dimensionality of the representation space

model = Sequential([
  vectorize_layer, # this layer encodes each string as a sequence of integers
  Embedding(vocab_size, embedding_dim, name="embedding"), # the embedding layer
  # input_dim=vocab_size is enough here: max_tokens already counts the padding
  # index 0 and the out-of-vocabulary index 1, so every index is < vocab_size
  LSTM(units=64, return_sequences=True), # maintains the sequential nature
  LSTM(units=32, return_sequences=False), # returns the last output
  Dense(16, activation='relu'), # a dense layer
  Dense(1, activation="sigmoid") # the prediction layer
])

In [45]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

In [46]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5a11c0d110>

It looks like the results we obtain for GRU and LSTM are quite comparable. Both are able to solve the overfitting problem, which was probably caused by the fact that the input data consists of long sequences.
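
One commonly cited reason long sequences are hard for a simple RNN, which this notebook does not test directly, is that the influence of early time steps on the final state shrinks multiplicatively with sequence length. The toy sketch below uses a hypothetical scalar RNN, `h_t = tanh(w * h_{t-1} + x_t)`, with arbitrary values for `w` and the inputs, and tracks the derivative of the final state with respect to the initial one.

```python
import math

def gradient_through_time(w, inputs):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = 0.0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule through one time step
    return grad

inputs = [0.5] * 100
for steps in (1, 10, 100):
    print(steps, gradient_through_time(0.9, inputs[:steps]))
```

With `w = 0.9` every step multiplies the gradient by a factor below 0.9, so after 100 steps (the `sequence_length` used here) almost nothing of the first token's contribution remains. Gated layers such as GRU and LSTM were designed to mitigate this.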

## Conclusion
We conclude that GRU and LSTM layers seem much better suited to supervised learning tasks than the simple RNN. If you are looking for other best practices for building recurrent neural networks, this [blog post](https://danijar.com/tips-for-training-recurrent-neural-networks/) contains lots of great ideas for improving your results and getting a better overall understanding of these types of models.