# Natural Language Processing with RNNs and Attention

In [1]:
# FIXME: make autocompletion work again
%config Completer.use_jedi = False

import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')

if not physical_devices:
    print("No GPU was detected.")
else:
    # https://stackoverflow.com/a/60699372
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
    
from tensorflow import keras

No GPU was detected.


## Char-RNN
Let's build an RNN that processes a sequence of text and predicts the next character.

### Loading the Data and Preparing the Dataset
The following example uses Shakespeare's famous texts.

In [2]:
# Set RNG state
np.random.seed(42)
tf.random.set_seed(42)

# Download the dataset
filepath = keras.utils.get_file(
    "shakespeare.txt",
    "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
)

# Load raw dataset
with open(filepath) as f:
    shakespeare_text = f.read()
    
# Show a piece of the text
print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [3]:
# Setup a character-based text tokenizer
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [4]:
# Convert a text to a sequence of character IDs
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [5]:
# Convert a sequence of character IDs back to text
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']
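
To make the character-level encoding more tangible, here is a minimal plain-Python sketch of the idea. It is not the Keras implementation: the helper names (`build_char_index`, `encode`, `decode`) are made up for this illustration, and it only assumes that text is lower-cased and that IDs start at 1 with the most frequent character first.

```python
from collections import Counter


def build_char_index(text):
    """Map each lower-cased character to an ID; the most frequent one gets ID 1."""
    counts = Counter(text.lower())
    # Sort by descending frequency; ties keep their first-seen order (sorted is stable)
    ranked = sorted(counts, key=lambda ch: -counts[ch])
    return {ch: i for i, ch in enumerate(ranked, start=1)}


def encode(text, char_index):
    """Convert text to a list of character IDs."""
    return [char_index[ch] for ch in text.lower()]


def decode(ids, char_index):
    """Convert a list of character IDs back to text."""
    id_to_char = {i: ch for ch, i in char_index.items()}
    return "".join(id_to_char[i] for i in ids)


sample = "to be or not to be"
char_index = build_char_index(sample)
ids = encode("Be", char_index)
print(ids, decode(ids, char_index))
```

Because no ID 0 is ever assigned, shifting all IDs down by one gives a range that starts at 0, which is convenient for one-hot encoding.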

In [6]:
# Set RNG state
np.random.seed(42)
tf.random.set_seed(42)

# number of distinct characters
max_id = len(tokenizer.word_index)

# total number of characters
dataset_size = tokenizer.document_count

# Encode the whole dataset
#  - The Keras tokenizer assigns IDs starting from 1 (the most frequent character gets ID=1), so we shift them to start from 0
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

# Build a training TF Dataset from the first 90% of the text
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

# Preprocessing parameters
# - length of a training instance (sequence of text)
# - size of a training micro-batch
n_steps = 100
batch_size = 32

# target = input shifted 1 character ahead
window_length = n_steps + 1

# Create training instances (sequences of text) by sliding a window over the text
#  - each time we shift it by single character (`shift=1`)
#  - `drop_remainder=True` means that we don't want to include final shortened windows with length < window length 
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

# Because `window()` creates a nested Dataset (containing sub-datasets), we want to flatten and convert it to single dataset of tensors
#  - the trick here is that we batch the windows to the same length they already have
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# Now we can safely shuffle the dataset without breaking the text
#  - note: shuffling ensures some degree of i.i.d. which is necessary for SGD to work well
#  - we also create training micro-batches
dataset = dataset.shuffle(10000).batch(batch_size)

# Split the instances to (inputs, target) where the target is the next character
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

# As the last step we must either encode or embed categorical features (characters)
#  - here we use 1-hot encoding since there are fairly few distinct characters
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

# Finally we prefetch the data for better training performance
dataset = dataset.prefetch(1)

# Show shapes of 1st batch tensors
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
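
The windowing and the input/target split can be sketched on a short string in plain Python. This is only an illustration of the idea, not the `tf.data` pipeline itself; `sliding_windows` and `split_input_target` are hypothetical helper names.

```python
def sliding_windows(sequence, window_length, shift=1):
    """Yield windows of exactly `window_length` items (shortened tail windows are dropped)."""
    for start in range(0, len(sequence) - window_length + 1, shift):
        yield sequence[start:start + window_length]


def split_input_target(window):
    """Inputs are all but the last item, targets are all but the first."""
    return window[:-1], window[1:]


text = "hello world"
n_steps = 4
pairs = [split_input_target(w) for w in sliding_windows(text, n_steps + 1)]
print(pairs[0])    # ('hell', 'ello')
print(len(pairs))  # 7
```

Each target is the input shifted by one character, so the model is asked to predict the next character at every time step, not only at the last one.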


### Creating and Training the Model

In [7]:
# Build a simple Char-RNN model:
# - there are two GRU recurrent layers with 128 units each
# - both layers use a 20% input dropout (`dropout`) and a 20% dropout on the hidden state (`recurrent_dropout`)
# - the output layer is a time-distributed dense layer with `max_id` (39) units and softmax activation to predict each character's class probability
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id], dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Train the model for 10 epochs
# - Note: Training on the full dataset would take forever on my PC, so let's use just a few batches
history = model.fit(dataset.take(40), epochs=10)
# history = model.fit(dataset, steps_per_epoch=train_size // batch_size, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Using the Model to Generate Text

In [8]:
def preprocess(texts):
    """Preprocess given text to conform to Char-RNN's input"""
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

# Make a new prediction using the model
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model.predict(X_new), axis=-1)

# Show the prediction as text: 1st sentence, last char
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]

'u'

Next, let's generate not just a single letter but a whole new text. One approach is to repeatedly call the model as above and always append the most likely character. However, this often leads to the same characters being repeated over and over again. A better approach is to pick the next letter randomly, using the learned class probabilities as the sampling distribution.

In [9]:
def next_char(text, temperature=1):
    """
    Generate new characters based on given text.
     1. we pre-process and predict as before but keep all character probabilities of the last time step
     2. then we compute the log of the probabilities and divide it by the `temperature` parameter (the lower the temperature, the more high-probability letters are favoured; a high temperature flattens the distribution)
     3. finally we sample a single character ID from these rescaled log-probabilities and convert it back to text
    """
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]


def complete_text(text, n_chars=50, temperature=1):
    """Extend given text with `n_chars` new letters"""
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text


# Reset RNG state
tf.random.set_seed(42)

# Complete some text using different temperatures
#  - Note: this example doesn't present the model very well since it hasn't been trained on the full dataset
print(complete_text("t", temperature=0.2))

te and the beall the reake the belly the belly and 


In [10]:
print(complete_text("t", temperature=1))

tucio. ar you up. greccoun:
the beabudos the gile: 


In [11]:
print(complete_text("t", temperature=2))

ty no c't;
meracqniogtt chai! aekgld arkichbrben; g
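
The effect of the temperature seen in the three outputs above can be checked on a toy distribution. The sketch below is plain Python and uses made-up probabilities; `apply_temperature` is a hypothetical helper that computes the distribution from which a categorical sampler would effectively draw after the log-probabilities are divided by the temperature.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize with a softmax."""
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.6, 0.3, 0.1]  # made-up class probabilities
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature below 1 sharpens the distribution towards the most likely character (more repetitive text), a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it (more random text).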


## Stateful RNN
The premise of a *Stateful RNN* is simple: so far we've thrown away the hidden states after applying BPTT on each training batch. In other words, the hidden states were re-initialized for every partial update, so the model had a hard time learning long-term patterns. The idea of a *Stateful RNN* is to keep the final hidden state from the previous batch and use it as the initial state for the next one.

This has, however, a consequence for the pre-processing logic. If the state is carried over from the previous batch, the training instances in consecutive batches cannot overlap: each input sequence must start exactly where the corresponding sequence in the previous batch ended. In our text-generating example, this means we can no longer use overlapping windows or shuffling.
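
The required batch layout can be sketched in plain Python on a short sequence of integers. This is an illustration only, with hypothetical helper names, and it assumes the sequence length is divisible by the batch size.

```python
def consecutive_windows(sequence, n_steps):
    """Non-overlapping inputs: windows of n_steps + 1 items, shifted by n_steps."""
    window_length = n_steps + 1
    return [
        sequence[start:start + window_length]
        for start in range(0, len(sequence) - window_length + 1, n_steps)
    ]


def stateful_batches(sequence, batch_size, n_steps):
    """Split the sequence into `batch_size` contiguous parts and window each part.

    Row i of every batch comes from part i, so it continues row i of the previous batch.
    Assumes len(sequence) is divisible by batch_size (an assumption of this sketch).
    """
    part_length = len(sequence) // batch_size
    parts = [sequence[i * part_length:(i + 1) * part_length] for i in range(batch_size)]
    return list(zip(*(consecutive_windows(part, n_steps) for part in parts)))


batches = stateful_batches(list(range(26)), batch_size=2, n_steps=4)
for batch in batches:
    print(batch)
```

Note that the last item of a window equals the first item of the next window in the same row: it serves as the final target of one batch and as the first input of the following batch.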

In [12]:
# Reset RNG state
tf.random.set_seed(42)

# (a) Updated pre-processing logic for Stateful Char-RNN
# - In this version we apply single window at a time

dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

# Contrary to before, we shift windows by full `n_steps` to create non-overlapping inputs
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# We skip shuffling altogether so that we don't break the preserved state and batch by 1
#  - batching by 1 means that we apply just single window at a time and, again, preserve the state
dataset = dataset.repeat().batch(1)

# The rest of the logic is analogous
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

# (b) Updated pre-processing logic for Stateful Char-RNN
# - In this more complicated version we apply a micro-batch of windows as before
batch_size = 32

@tf.function
def make_windowed_ds(encoded_part):
    """Creates a flat windowed TF Dataset of non-overlapping windows"""
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    return dataset.flat_map(lambda window: window.batch(window_length))

# Contrary to before, we make a windowed Dataset in two steps:
#  1. We split the dataset into `batch_size` parts of (nearly) equal length and make a windowed Dataset from each part
#  2. Then we put all these parts back together and stack the windows so that
#     the n-th input sequence of a batch starts where the n-th sequence of the previous batch ended
datasets = map(make_windowed_ds, np.array_split(encoded[:train_size], batch_size))
dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))

# Final steps are the same:
#  - Split each window to (inputs, target)
#  - 1-hot encode the categorical input features
#  - Prefetch the data for better performance
dataset = dataset.repeat().map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

# Build a Stateful RNN model
# The architecture is basically the same as before, notice two distinctions:
#  - `stateful=True` on the recurrent layers to preserve hidden state
#  - `batch_input_shape` set for the initial recurrent layer to let the model know the shape (batch size) for the hidden state
model = keras.models.Sequential([
    keras.layers.GRU(
        128,
        return_sequences=True,
        stateful=True,
        dropout=0.2,
        recurrent_dropout=0.2,
        batch_input_shape=[batch_size, None, max_id],
    ),
    keras.layers.GRU(
        128, 
        return_sequences=True,
        stateful=True,
        dropout=0.2,
        recurrent_dropout=0.2,
    ),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax")),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Train the model
#  - we use a custom callback to reset the model's state at the start of each epoch (instead of each batch)
#  - we train the model for 50 epochs; also notice the updated `steps_per_epoch`

class ResetStatesCallback(keras.callbacks.Callback):
    """Callback that resets model's state each epoch"""
    
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()


history = model.fit(
    dataset, 
    steps_per_epoch=train_size // batch_size // n_steps,
    epochs=50,
    callbacks=[ResetStatesCallback()],
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


To use the model with different batch sizes, we need to create a stateless copy. We can get rid of dropout since it is only used during training.

In [13]:
# Set RNG state
tf.random.set_seed(42)

# Create a stateless Char-RNN model
# - This model is based on our stateful Char-RNN but is used only for making predictions
# - Notice: We don't need dropout since it's used only during training
stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax")),
])

# Build the stateless model
#  - Firstly, we can loosen the fixed batch size restriction
#  - Secondly, we copy learned weights from the stateful model (this works fine since dropout adds no trainable params)
stateless_model.build(tf.TensorShape([None, None, max_id]))
stateless_model.set_weights(model.get_weights())

# Replace our main model by this one
#  - because `complete_text()` implicitly works with `model`
model = stateless_model

# Try to complete some text
print(complete_text("t"))

the court,
when they shortime she down.
peserve, ab


## Sentiment Analysis
Let's take a step further from character-level RNNs to word-level sentiment analysis. A typical dataset for this task is the IMDb reviews dataset, so let's play with it.

In [15]:
# Reset RNG state
tf.random.set_seed(42)

# Load the IMDb reviews dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

# Show a training instance
#  - The dataset is already preprocessed: each instance is a sequence of integers, each of which is the ID of a word
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [16]:
# In order to reconstruct a word we can load the word to ID index
word_index = keras.datasets.imdb.get_word_index()

# And then create an inverse mapping
# - Note: We shift the ID by 3 to reserve first three IDs for special markers
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}

# These special markers are for the:
#  - padding symbol
#  - start of sequence
#  - unknown word
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
    
# Show a sample of decoded words
" ".join(id_to_word[id_] for id_ in X_train[0][:10])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


'<sos> this film was just brilliant casting location scenery story'

Now, let's create the same pre-processing logic and trainable dataset using TensorFlow's Datasets API.

In [17]:
import tensorflow_datasets as tfds

# Load the IMDb reviews TF Dataset
#  - Note: Using TF-only functions allows us to reuse the same pre-processing logic in every environment
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

# List the dataset content
datasets.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [18]:
# Save and show training and test set sizes
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

train_size, test_size

(25000, 25000)

In [19]:
# Peek the training dataset
for X_batch, y_batch in datasets["train"].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative



In [20]:
def preprocess(X_batch, y_batch):
    """
    Pre-process an input batch:
     1. Crops each instance to the first 300 characters (speeds up training; the sentiment can usually be deduced from the first few sentences)
     2. Replaces '<br />' tags with a space character
     3. Replaces every character other than letters and apostrophes with a space
     4. Splits instances by space creating a ragged tensor
     5. Returns a dense tensor (and original label) made by padding the splits with '<pad>'
    """
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

# Try the preprocessing logic on the first training batch
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(2, 53), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'pi', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', b'warm', b'and', b'comfortable',
         b'on', b'the', b'sette', b'and', b'having', b'j

In [27]:
from collections import Counter

batch_size = 32

# Do a word-count over the whole pre-processed training dataset (in one pass)
vocabulary = Counter(
    word.numpy()
    for X_batch, _ in datasets["train"].batch(batch_size).map(preprocess)
    for review in X_batch
    for word in review
)

# Show first 3 most common words in the training corpus
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [28]:
len(vocabulary)

53893

In [29]:
# Drop the least important words and keep just 10k most frequent ones
vocab_size = 10_000
truncated_vocabulary = [word for word, _ in vocabulary.most_common(vocab_size)]

# Make a word index from the truncated vocabulary
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}

# Test the word index on an example sentence
for word in b"This movie was faaaaaantastic".split():
    print(word_to_id.get(word, vocab_size))

22
12
11
10000


In [30]:
# Build a static vocabulary table with 1k OOV buckets
num_oov_buckets = 1000

# Initialize the vocabulary from our truncated vocabulary and word index
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)

# Build the lookup table
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

# Test the lookup table on the example sentence we used before
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>
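
The way out-of-vocabulary (OOV) words end up with IDs at or above the vocabulary size can be sketched in plain Python. This is not the lookup table's actual implementation: `lookup` is a hypothetical helper, the toy vocabulary is made up, and `zlib.crc32` is only a stand-in for the real hash function, so the concrete bucket numbers will differ from TensorFlow's.

```python
import zlib


def lookup(word, word_to_id, num_oov_buckets):
    """Return the word's ID, or an ID in [vocab_size, vocab_size + num_oov_buckets) if unknown."""
    if word in word_to_id:
        return word_to_id[word]
    # crc32 is only a stand-in for the table's real hash function
    bucket = zlib.crc32(word) % num_oov_buckets
    return len(word_to_id) + bucket


toy_vocab = {b"<pad>": 0, b"this": 1, b"movie": 2, b"was": 3}
ids = [lookup(w, toy_vocab, num_oov_buckets=5) for w in b"this movie was faaaaaantastic".split()]
print(ids)
```

Since the bucket is derived from a hash of the word, the same unknown word always maps to the same ID, while different unknown words may or may not collide.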

In [31]:
def encode_words(X_batch, y_batch):
    """Encode each word in an input batch using the static vocabulary table"""
    return table.lookup(X_batch), y_batch

# Preprocess and encode the whole training set
train_set = (
    datasets["train"]
    .repeat()
    .batch(batch_size)
    .map(preprocess)
    .map(encode_words)
    .prefetch(1)
)

# Display the 1st training batch
for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


In [32]:
# The embedding dimension hyperparameter
embed_size = 128

# Build a classification RNN with initial word embedding layer
#  - This layer's matrix has shape [ID count = vocabulary size + OOV buckets, embedding dimension]
#  - So the model's inputs are 2D tensors of shape [batch size, time steps], the embedding output is 3D tensor [batch size, time steps, embedding size]
#  - `mask_zero=True` means that we ignore ID=0, the most frequent word, which in our case is `<pad>` (so the model doesn't have to learn to ignore it)
#  - note: It would be cleaner to ensure that the padding word really has ID 0 than to count on the fact that it's the most frequent one.
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, mask_zero=True, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model for 5 epochs
history = model.fit(train_set, steps_per_epoch=train_size // batch_size, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Manual Masking

In [33]:
K = keras.backend

# Define an input layer
inputs = keras.layers.Input(shape=[None])

# Create a mask that ignores inputs equal to 0
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)

# Build the same model structure as before but with explicit masking of layer inputs
#  - Note: In the previous example the output dense layer didn't receive the implicit mask because the time dimension was not preserved,
#          so explicit masking is necessary if we want to propagate this information all the way to the loss function.
#  - Note 2: The downside is that LSTMs and GRUs won't use the optimized GPU implementation, so the training might be slower.
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)

# Define model's outputs
outputs = keras.layers.Dense(1, activation="sigmoid")(z)

# Compose and compile the model
model = keras.models.Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model for 5 epochs
history = model.fit(train_set, steps_per_epoch=train_size // batch_size, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
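
What a mask does inside a recurrent layer can be sketched in plain Python: at masked time steps the layer skips the update and carries the previous state over unchanged. The step function below is a toy stand-in for a GRU cell and `run_masked_rnn` is a hypothetical helper, so this only illustrates the principle.

```python
def run_masked_rnn(inputs, mask, step, initial_state=0.0):
    """Apply `step(state, x)` over time, but carry the state unchanged where mask is False."""
    state = initial_state
    states = []
    for x, keep in zip(inputs, mask):
        if keep:
            state = step(state, x)
        states.append(state)
    return states


def toy_step(state, x):
    """Toy recurrence standing in for a real recurrent cell."""
    return 0.5 * state + x


padded_ids = [5, 3, 8, 0, 0]          # 0 is the padding ID
mask = [i != 0 for i in padded_ids]   # same rule as the not-equal-to-zero mask above
states = run_masked_rnn(padded_ids, mask, toy_step)
print(states)
```

The final state equals the state after the last real token, so trailing padding has no influence on what the following layers see.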


### Reusing Pretrained Embeddings

In [34]:
import tensorflow_hub as hub

# Reset RNG state
tf.random.set_seed(42)

# Build a model with pre-trained layers:
#  - Main portion of this model reuses Google's model that pre-processes and embeds words from an input text to 50 dimensional vectors
#  - Then we just add two dense layers for our classification task of sentiment analysis
#  - Note: By default TF Hub downloads models to /tmp, one can override this by setting `TFHUB_CACHE_DIR` env. variable
#  - Note 2: TF Hub layers are also by default non-trainable - if we want to tweak their weights we must unfreeze them
model = keras.Sequential([
    hub.KerasLayer(
        "https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
        dtype=tf.string,
        input_shape=[],
        output_shape=[50],
    ),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Then we can just load the IMDb reviews dataset
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

# Take the training set and just batch it (and prefetch)
#  - Note: The rest of the preprocessing logic is handled by the TF Hub portion of the model
train_size = info.splits["train"].num_examples
train_set = datasets["train"].repeat().batch(batch_size).prefetch(1)

# Finally we just train the model on our IMDb dataset
history = model.fit(train_set, steps_per_epoch=train_size // batch_size, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Encoder-Decoder Network for Neural Machine Translation

As the name suggests, in the *Encoder-Decoder* architecture we split a *sequence-to-sequence* RNN into two parts:
1. Encoder - takes reversed sequences of words as inputs (or rather embeddings thereof; the input is reversed so that the beginning of the sentence is fed to the encoder last, because that is generally the first thing the decoder needs to translate)
1. Decoder - this part actually has two inputs: the first is the hidden state of the encoder and the second is either the previous target word (during training; embedded) or the actual token that was output in the previous step (during inference; embedded)

Additional notes to the architecture:
* The outputs of the decoder are scores for each word in the vocabulary which are turned to probabilities using time-distributed *softmax*. Because we can easily get to very high-dimensional outputs, typically a *sampled softmax* is used for training and regular *softmax* for inference
* In this task we cannot simply truncate input sequences to a common length as before because we want complete translations. Padding everything to some large common length does not work well either. Instead, we can bucket the sentences into sets of similar length and pad them to match the longest one in each set.
* Finally, we should ignore part of the output after an `<EOS>` token - both from the output and loss function

In [35]:
import tensorflow_addons as tfa

# Set the RNG state
tf.random.set_seed(42)

# Set up vocabulary and embedding size hyperparameters
vocab_size = 100
embed_size = 10

# Define Encoder and Decoder inputs
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

# Create embedding layers for the Encoder and Decoder parts
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# Encoder is a 512 unit LSTM layer
#  - we can ignore encoder outputs but we return both the short-term and long-term states with `return_state=True`
#  - the complete hidden state of the encoder is a pair of the short and long-term states
encoder = keras.layers.LSTM(512, return_state=True)
_, state_h, state_c = encoder(encoder_embeddings)

# Decoder is based on the `BasicDecoder` from TF Addons
#  - Decoder cell is a 512 unit LSTM cell
#  - Sampler is a component that tells the Decoder what it should pretend the last step's output was:
#    - in this case `TrainingSampler` takes the embedding of the previous target token
#    - another option is `ScheduledEmbeddingTrainingSampler` which randomly chooses between target and actual outputs
#  - Model's output is a dense layer with one unit per word in the vocabulary
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(
    cell=keras.layers.LSTMCell(512),
    sampler=tfa.seq2seq.sampler.TrainingSampler(),
    output_layer=keras.layers.Dense(vocab_size),
)

# Construct the Decoder
#  - Initial state is the complete encoder state
#  - We can ignore final decoder state and sequence lengths but we do care about the final outputs
final_outputs, _, _ = decoder(
    decoder_embeddings,
    initial_state=[state_h, state_c],
    sequence_length=sequence_lengths,
)

# Final class (word) probabilities are obtained as the softmax of the decoder's final outputs
Y_proba = tf.nn.softmax(final_outputs.rnn_output)

# Build an Encoder-Decoder model
#  - Note: Because the task is basically a classification task, we can use `sparse_categorical_crossentropy` as the loss function
model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba],
)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Build a random sequence dataset
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)

# Train the model on the random dataset
history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)

Epoch 1/2
Epoch 2/2
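
The decoder inputs used for training are simply the targets shifted one step to the right, with a start-of-sequence token in front. A minimal plain-Python sketch follows; `make_decoder_inputs` is a hypothetical helper and ID 0 is assumed to be the start token, matching the column of zeros in the cell above.

```python
SOS_ID = 0  # assumed start-of-sequence ID


def make_decoder_inputs(targets, sos_id=SOS_ID):
    """Shift each target sequence one step to the right and prepend the start token."""
    return [[sos_id] + list(seq[:-1]) for seq in targets]


targets = [[7, 2, 9, 4], [3, 3, 8, 1]]
decoder_inputs = make_decoder_inputs(targets)
print(decoder_inputs)  # [[0, 7, 2, 9], [0, 3, 3, 8]]
```

At every time step the decoder therefore sees the previous target token and is trained to predict the current one.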


## Bidirectional RNNs
For forecasting future values in a time series we want to have a *causal* model - a model in which future values are predicted solely on the basis of past values. On the other hand, in NLP tasks (such as Neural Machine Translation) it can be beneficial to embed a word based on both the past and future contexts.

A *Bidirectional* layer is composed of two layers working on the same input. One layer reads the input in the original direction (left to right) and the other one is a clone that reads it in the reverse direction (right to left). The final output is some combination of both outputs, typically a concatenation.
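
The mechanics can be sketched in plain Python with a toy recurrence. This is an illustration only: `run_rnn`, `bidirectional` and `running_sum` are hypothetical names, and unlike a real bidirectional layer, where each direction has its own weights, both passes here share the same step function.

```python
def run_rnn(inputs, step, initial_state=0):
    """Return the state after every time step."""
    state, outputs = initial_state, []
    for x in inputs:
        state = step(state, x)
        outputs.append(state)
    return outputs


def bidirectional(inputs, step):
    """Concatenate, per time step, a left-to-right pass and a right-to-left pass."""
    forward = run_rnn(inputs, step)
    backward = run_rnn(inputs[::-1], step)[::-1]  # re-align with the original order
    return list(zip(forward, backward))


def running_sum(state, x):
    """Toy step function standing in for a recurrent cell."""
    return state + x


print(bidirectional([1, 2, 3, 4], running_sum))
# [(1, 10), (3, 9), (6, 7), (10, 4)]
```

At each position the first value summarizes everything up to and including that step, and the second value summarizes everything from that step to the end, which is why the output dimension doubles.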

In [36]:
# Build an example RNN with a bidirectional GRU layer
#  - `Bidirectional` wrapper creates a clone in the reverse direction of a layer passed as an argument and concatenates outputs
#  -  Note: Adding a bidirectional wrapper implicitly doubles the number of output units (10 forward + 10 backward = 20)
model = keras.models.Sequential([
    keras.layers.GRU(10, return_sequences=True, input_shape=[None, 10]),
    keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))
])

# Show model's topology
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_10 (GRU)                 (None, None, 10)          660       
_________________________________________________________________
bidirectional (Bidirectional (None, None, 20)          1320      
Total params: 1,980
Trainable params: 1,980
Non-trainable params: 0
_________________________________________________________________
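
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU layer uses three gates and, as in the `reset_after=True` variant, two bias vectors per gate; `gru_param_count` is a hypothetical helper written for this check.

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Parameters of one GRU layer: 3 gates, each with input, recurrent and bias weights."""
    biases_per_gate = 2 if reset_after else 1
    return 3 * (input_dim * units + units * units + biases_per_gate * units)


single = gru_param_count(input_dim=10, units=10)
print(single)      # 660
print(2 * single)  # 1320
```

The bidirectional layer holds two independent GRUs of the same size, one per direction, which is why its parameter count is exactly twice that of the plain GRU layer.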
