# Deep Learning
### Week 6: Recurrent neural networks

## Contents

[1. Introduction](#introduction)

[2. Recurrent neural networks (\*)](#rnns)

[3. Long Short Term Memory (LSTM) (\*)](#lstm)

[4. Preprocessing and Embedding layers (\*)](#preprocessingembedding)

[References](#references)

<a class="anchor" id="introduction"></a>
## Introduction

In the last week of the module we studied a very important neural network architecture: the convolutional neural network. You learned the operations that are carried out by convolutional layers and pooling layers, as well as the hyperparameter choices within those layers and the effect they can have on the layer outputs. We also covered transposed convolutions, which can be thought of as a reverse analog to the regular convolutional layers.

One of the main motivations for developing convolutional neural networks was to design a model that captures important structural properties that we know are contained in the data. CNNs have an equivariance property that means they're adapted well for image data, because we know that in images, we want to be able to detect the same features in different regions of the input.

In this week of the course, we will look at another very important and widespread model architecture, which is the recurrent neural network (RNN). This is another network type where we deliberately build structure into the network itself in order to capture certain aspects of the data. In the case of recurrent neural networks, these are intended for sequence data.

We will examine the different possible types of sequence modelling tasks, and see how RNNs are very flexible models that can be used in many different configurations. You'll learn the basic RNN computation, and more sophisticated architectures such as stacked RNNs, bidirectional layers, and the long short term memory architecture (LSTM).

You will also see how to implement all of these models and layer types using the Keras RNN API, as well as learning about some of the preprocessing and embedding layers available in Keras.

<a class="anchor" id="rnns"></a>
## Recurrent neural networks

A particular challenge with sequential data and modelling tasks is that the sequence lengths can vary from one dataset example to the next. This makes the use of a fixed input size architecture such as the MLP unsuitable. In addition, there can be many different types of sequential modelling tasks that we might want to consider, each of which could have different architectural requirements, as illustrated in the following diagram. 

<center><img src="figures/schematic_rnn_architectures.png" alt="Schematic RNN architectures" style="width: 1000px;"/></center>
<center>Different architectures for recurrent neural networks</center>
<br>

Typical sequence modelling tasks could include:

* Text sentiment analysis (many-to-one)
* Image captioning (one-to-many)
* Language translation (many-to-many)
* Part-of-speech tagging (many-to-many)

Recurrent neural networks ([Rumelhart et al 1986b](#Rumelhart86b)) are designed to handle this variability of data lengths and diversity of problem tasks.

#### Basic RNN computation

Let $\{\mathbf{x}_t\}_{t=1}^T$ be an example sequence input, with each $\mathbf{x}_t\in\mathbb{R}^D$. Suppose that we are in the many-to-many setting, and there is a corresponding sequence of labels $\{{y}_t\}_{t=1}^T$, with $y_t\in Y$, where $Y$ could be $\{0, 1\}$ for a binary classification task for example.

The basic RNN computation is given as follows:

$$
\begin{align}
\left.
\begin{array}{rcl}
\mathbf{h}^{(1)}_t &=& \sigma\left( \mathbf{W}_{hh}^{(1)}\mathbf{h}^{(1)}_{t-1} + \mathbf{W}_{xh}^{(1)}\mathbf{x}_t + \mathbf{b}^{(1)}_h \right),\\
\hat{\mathbf{y}}_t &=& \sigma_{out}\left( \mathbf{W}_{hy}\mathbf{h}_t^{(1)} + \mathbf{b}_{y} \right),
\end{array}
\right\}\tag{1}
\end{align}
$$

for $t=1,\ldots,T$, where $\mathbf{h}^{(1)}\in\mathbb{R}^{n_1}$,
$\mathbf{W}^{(1)}_{hh}\in\mathbb{R}^{n_1\times n_1}$, $\mathbf{W}^{(1)}_{xh}\in\mathbb{R}^{n_1\times D}$, $\mathbf{b}^{(1)}_h\in\mathbb{R}^{n_{1}}$, $\mathbf{W}_{hy}\in\mathbb{R}^{n_y\times n_1}$, $\mathbf{b}_y\in\mathbb{R}^{n_{y}}$,  $\sigma$ and $\sigma_{out}$ are activation functions, $n_1$ is the number of units in the hidden layer, and $n_y$ is the dimension of the output space $Y$.

<center><img src="figures/rnn_computation.png" alt="Basic RNN computation" style="width: 400px;"/></center>
<center>Basic computation for recurrent neural networks</center>
<br>

Recurrent neural networks make use of weight sharing, similar to convolutional neural networks, but this time the weights are shared across time. This allows the RNN to be 'unrolled' for as many time steps as there are in the data input $\mathbf{x}$.

The RNN also has a **persistent state**, in the form of the hidden layer $\mathbf{h}^{(1)}$. This hidden state can carry information over an arbitrary number of time steps, and so predictions at a given time step $t$ can depend on events that occurred at any point in the past, at least in principle. As with MLPs, the hidden state stores **distributed representations** of information, which allows it to store a lot of information, in contrast to hidden Markov models.

Note that the computation $(1)$ requires an **initial hidden state** $\mathbf{h}^{(1)}_0$ to be defined. In practice, this is often just set to the zero vector, although it can also be learned as additional parameters.
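
To make computation $(1)$ concrete, the following is a minimal NumPy sketch of the forward pass for a single sequence. It is illustrative only and is not how Keras implements `SimpleRNN`; the function name `simple_rnn_forward`, the choice of `tanh` for $\sigma$ and the sigmoid for $\sigma_{out}$, and the random weights are all assumptions made for this example.

```python
import numpy as np

def simple_rnn_forward(x, W_hh, W_xh, b_h, W_hy, b_y, h0=None):
    """Run equation (1) over one sequence x of shape (T, D)."""
    n_1 = W_hh.shape[0]
    # Initial hidden state: zeros unless one is supplied
    h = np.zeros(n_1) if h0 is None else h0
    hidden_states, outputs = [], []
    for x_t in x:
        # The same weights are reused at every time step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        # Assumed sigmoid output activation (binary labels)
        y_hat = 1.0 / (1.0 + np.exp(-(W_hy @ h + b_y)))
        hidden_states.append(h)
        outputs.append(y_hat)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(0)
T, D, n_1, n_y = 10, 2, 32, 1
params = dict(
    W_hh=rng.normal(scale=0.1, size=(n_1, n_1)),
    W_xh=rng.normal(scale=0.1, size=(n_1, D)),
    b_h=np.zeros(n_1),
    W_hy=rng.normal(scale=0.1, size=(n_y, n_1)),
    b_y=np.zeros(n_y),
)

x = rng.normal(size=(T, D))
hs, ys = simple_rnn_forward(x, **params)
print(hs.shape, ys.shape)  # (10, 32) (10, 1)
```

Because the weights do not depend on $t$, the same `params` can be applied to a sequence of any length, which is what allows the network to be unrolled for as many steps as the input requires.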

In Keras, the RNN is available as the layer `SimpleRNN` in the `keras.layers` module (see [the docs](https://keras.io/api/layers/recurrent_layers/simple_rnn/)). It can be included in the list of layers passed to the `Sequential` constructor, or using the functional API.

In [None]:
import keras
from keras import ops

In [None]:
# Demonstrate the SimpleRNN layer

from keras.models import Sequential
from keras.layers import Input, SimpleRNN

rnn_model = Sequential([
    Input(shape=(10, 2)),
    SimpleRNN(32, activation='tanh')  # 'tanh' is the default activation
])

The Tensor shape expected by a recurrent neural network layer is of the form `(batch_size, sequence_length, num_features)`. In the above, the `shape` argument passed to `Input` specifies that the sequence length is 10 and there are 2 features; the batch dimension is left out.

In [None]:
# Print the model summary

rnn_model.summary()
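
As a check on the parameter count reported in the summary, the number of trainable parameters can be worked out from the shapes in equation $(1)$: one recurrent weight matrix, one input weight matrix and one bias vector. The helper name below is hypothetical, and the sketch assumes the layer uses a bias term.

```python
def simple_rnn_param_count(num_features, units):
    """Count the weights of a single simple recurrent layer with a bias."""
    recurrent_weights = units * units      # W_hh
    input_weights = units * num_features   # W_xh
    bias = units                           # b_h
    return recurrent_weights + input_weights + bias

print(simple_rnn_param_count(num_features=2, units=32))  # 1120
```

Note that the count does not depend on the sequence length, since the weights are shared across time steps.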

By default, the RNN only returns the final hidden state output.

In [None]:
# Call the RNN on a dummy input

inputs = keras.random.normal((1, 10, 2))
rnn_model(inputs)

The default initial hidden state is zeros, but it can be explicitly set in the layer's `call` method:

In [None]:
# Set the initial hidden state of a SimpleRNN layer

rnn_layer = SimpleRNN(3)
dummy_inputs = keras.random.normal((16, 5, 2))
layer_output = rnn_layer(dummy_inputs, initial_state=ops.ones((16, 3)))
layer_output.shape

#### Stacked RNNs
RNNs can also be made more powerful by stacking recurrent layers on top of each other:

$$
\begin{align}
\left.
\begin{array}{rcl}
\mathbf{h}^{(k)}_t &=& \sigma\left( \mathbf{W}_{hh}^{(k)}\mathbf{h}^{(k)}_{t-1} + \mathbf{W}_{xh}^{(k)}\mathbf{h}^{(k-1)}_{t} + \mathbf{b}^{(k)}_h \right),\quad k=1,\ldots, L,\\
\hat{\mathbf{y}}_t &=& \sigma_{out}\left( \mathbf{W}_{hy}\mathbf{h}^{(L)}_t + \mathbf{b}_{y} \right),
\end{array}
\right\}\tag{2}
\end{align}
$$

where $\mathbf{h}^{(k)}\in\mathbb{R}^{n_k}$,
$\mathbf{W}^{(k)}_{hh}\in\mathbb{R}^{n_k\times n_k}$, $\mathbf{W}^{(k)}_{xh}\in\mathbb{R}^{n_k\times n_{k-1}}$, $\mathbf{b}^{(k)}_h\in\mathbb{R}^{n_{k}}$, $\mathbf{W}_{hy}\in\mathbb{R}^{n_y\times n_L}$, $\mathbf{b}_y\in\mathbb{R}^{n_{y}}$, and we have set $n_{L+1}=n_y$, $n_0=D$, and $\mathbf{h}^{(0)}_t = \mathbf{x}_t$.

<center><img src="figures/stacked_rnn_computation.png" alt="Stacked RNN" style="width: 400px;"/></center>
<center>Stacked recurrent neural network</center>
<br>

To create a stacked RNN in Keras, we need to obtain the full sequence of hidden states in the lower layer. This can be done using the `return_sequences` keyword argument in the layer constructor.

In [None]:
# Create a SimpleRNN layer that returns sequences

rnn_layer_1 = SimpleRNN(16, return_sequences=True)

In [None]:
# Create the second SimpleRNN layer, this only returns the final state

rnn_layer_2 = SimpleRNN(8)

In [None]:
# Build the stacked RNN model using the functional API

from keras.models import Model
from keras.layers import Input

inputs = Input(shape=(32, 5))
h = rnn_layer_1(inputs)
outputs = rnn_layer_2(h)
stacked_rnn_model = Model(inputs=inputs, outputs=outputs)

In [None]:
# Print the model summary

stacked_rnn_model.summary()

Note the output shapes in the above summary. The first RNN layer returns a sequence (length 32) of hidden states (of size 16), and the second RNN layer only returns the final hidden state.
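
The same shapes can be reproduced with a small NumPy sketch of the stacked computation $(2)$. This is only an illustration under assumed `tanh` activations and random weights; the helper names `rnn_layer` and `init_layer` are hypothetical and are not part of Keras.

```python
import numpy as np

def rnn_layer(inputs, W_hh, W_xh, b_h, return_sequences=False):
    """Apply one recurrent layer to a sequence of shape (T, n_in)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    # Either every hidden state, or only the final one
    return np.stack(states) if return_sequences else h

def init_layer(rng, n_in, n_units):
    """Random weights (W_hh, W_xh, b_h) for illustration."""
    return (rng.normal(scale=0.1, size=(n_units, n_units)),
            rng.normal(scale=0.1, size=(n_units, n_in)),
            np.zeros(n_units))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))     # sequence length 32, 5 features

layer_1 = init_layer(rng, n_in=5, n_units=16)
layer_2 = init_layer(rng, n_in=16, n_units=8)

h1 = rnn_layer(x, *layer_1, return_sequences=True)  # input to the upper layer
h2 = rnn_layer(h1, *layer_2)                        # final state only
print(h1.shape, h2.shape)  # (32, 16) (8,)
```

The lower layer has to return its full sequence of hidden states, because the upper layer consumes one of them at each time step.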

#### Bidirectional RNNs
Standard recurrent neural networks are uni-directional; that is, they only take past context into account. In some applications (where the full input sequence is available to make predictions) it is possible and desirable for the network to take both past and future context into account. 

For example, consider a part-of-speech (POS) tagging problem, where the task is to label each word in a sentence according to its particular part of speech, e.g. noun, adjective, verb etc.

<center><img src="figures/pos_tagging.png" alt="POS tagging" style="width: 350px;"/></center>
<center>Part-of-speech (POS) tagging example</center>
<br>

In some cases the correct label can be ambiguous given only the past context, for example the word `light` in the sentence `"There's a light ..."` could be a noun or an adjective depending on how the sentence continues (e.g. `"There's a light on upstairs"` or `"There's a light breeze"`).

Bidirectional RNNs ([Schuster & Paliwal 1997](#Schuster97)) are designed to look at both future and past context. They consist of two RNNs running forward and backwards in time, whose states are combined in some way (e.g. adding or concatenating) to produce the final hidden state of the layer. 

<center><img src="figures/bidirectional_rnn.png" alt="Bidirectional RNN" style="width: 550px;"/></center>
<center>Bidirectional recurrent neural network</center>
<br>

Bidirectional recurrent neural networks (BRNNs) are implemented in Keras using the `Bidirectional` wrapper (see [the docs](https://keras.io/api/layers/recurrent_layers/bidirectional/)):

In [None]:
# Build a bidirectional recurrent neural network

from keras.layers import Bidirectional

brnn_model = Sequential([
    Input(shape=(64, 7)),
    Bidirectional(SimpleRNN(16, return_sequences=True), merge_mode='concat')
])

The `Bidirectional` wrapper constructs two RNNs running in different time directions. The `merge_mode='concat'` setting is the default for the `Bidirectional` constructor, and means that the bidirectional layer concatenates the hidden states from the forward and backward RNNs. This means that the number of units per time step in the output of the layer is $2\times 16 = 32$:

In [None]:
# Print the model summary

brnn_model.summary()

The `Bidirectional` wrapper can also operate on RNN layers with `return_sequences=False`, in which case it combines the final hidden states of the forward and backward RNNs.
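
The idea behind the wrapper with `merge_mode='concat'` and `return_sequences=True` can be sketched in NumPy as follows. This is an illustrative sketch under assumed `tanh` activations and random weights, not the Keras implementation; `run_rnn`, `bidirectional` and `init_params` are hypothetical names.

```python
import numpy as np

def run_rnn(inputs, W_hh, W_xh, b_h):
    """Return the hidden state at every time step for inputs of shape (T, D)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

def bidirectional(inputs, fwd_params, bwd_params):
    """Concatenate a forward-in-time and a backward-in-time RNN."""
    forward = run_rnn(inputs, *fwd_params)
    # Process the reversed sequence, then flip back so time steps line up
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=-1)

def init_params(rng, n_in, n_units):
    return (rng.normal(scale=0.1, size=(n_units, n_units)),
            rng.normal(scale=0.1, size=(n_units, n_in)),
            np.zeros(n_units))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 7))
fwd_params = init_params(rng, 7, 16)
bwd_params = init_params(rng, 7, 16)

out = bidirectional(x, fwd_params, bwd_params)
print(out.shape)  # (64, 32)
```

At each time step, the first 16 entries of the output depend only on the inputs up to that step, while the last 16 depend on the inputs from that step onwards.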

#### Training RNNs
RNNs are trained in the same way as multilayer perceptrons and convolutional neural networks. A loss function $L(\mathbf{y}_1, \ldots, \mathbf{y}_T, \hat{\mathbf{y}}_1,\ldots, \hat{\mathbf{y}}_T)$ is defined according to the problem task and learning principle, and the network is trained using the backpropagation algorithm and a selected network optimiser. In the many-to-one case (e.g. sentiment analysis), the loss function may be defined as $L(\mathbf{y}_T, \hat{\mathbf{y}}_T)$. 

Recall the equation describing the backpropagation of errors in the MLP case:

$$
\mathbf{\delta}^{(k)} = \mathbf{\sigma}'(\mathbf{a}^{(k)})(\mathbf{W}^{(k)})^T \mathbf{\delta}^{(k+1)},\qquad k=1,\ldots,L \tag{3}
$$

where $k$ indexes the hidden layers. In the case of recurrent neural networks, the errors primarily backpropagate along the time direction, and we obtain the following propagation of errors in the hidden states:

$$
\mathbf{\delta}^{(k)}_{t-1} = \mathbf{\sigma}'(\mathbf{a}^{(k)}_{t-1})(\mathbf{W}^{(k)}_{hh})^T \mathbf{\delta}^{(k)}_{t}, \quad t=T,\ldots,1 \tag{4}
$$

<center><img src="figures/backpropagation-through-time.png" alt="Backpropagation through time (BPTT)" style="width: 550px;"/></center>
<center>When training RNNs, the errors backpropagate along the time axis</center>
<br>

For this reason, the backpropagation algorithm for RNNs is referred to as **backpropagation through time** (BPTT).

Recurrent neural networks can also be trained as generative models for unlabelled sequence data, by re-wiring the network to send the output back as the input to the next step:

<center><img src="figures/schematic_unsupervised_example.png" alt="Generative RNN model" style="width: 450px;"/></center>
<center>Generative RNN model, with the outputs fed back as inputs at the next time step</center>
<br>

This is an example of **self-supervised learning**, which is where we use an unlabelled dataset to frame a supervised learning problem. This can be used to train language models, or generative music models for example. In practice we treat this case the same as a supervised learning problem, where the outputs are the same as the inputs but shifted by one time step. This particular technique is also sometimes referred to as **teacher forcing**.
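
As a small illustration of how the shifted targets are formed, the sketch below builds input/target pairs from a single token sequence. The example sentence is made up for illustration.

```python
# A toy token sequence (made-up example)
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Inputs are all tokens except the last; targets are the same
# sequence shifted forward by one time step
inputs = tokens[:-1]
targets = tokens[1:]

for x_t, y_t in zip(inputs, targets):
    print(f"input: {x_t!r:8} target: {y_t!r}")
```

At every time step the network is given the true token from the data as input and is asked to predict the token that follows it.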

<a class="anchor" id="lstm"></a>
## Long Short Term Memory (LSTM)

As mentioned previously, recurrent neural networks can in principle use information from events that occurred many time steps earlier to make predictions at the current time step. However, in practice RNNs struggle to make use of long-term dependencies in the data. 

Recall the equation describing the backpropagation of errors in an MLP:

$$
\mathbf{\delta}^{(k)} = \mathbf{\sigma}'(\mathbf{a}^{(k)})(\mathbf{W}^{(k)})^T \mathbf{\delta}^{(k+1)},\qquad k=1,\ldots,L 
$$

where $k$ indexes the hidden layers, and the corresponding equation for the backpropagation through time (BPTT) algorithm:

$$
\mathbf{\delta}^{(k)}_{t-1} = \mathbf{\sigma}'(\mathbf{a}^{(k)}_{t-1})(\mathbf{W}^{(k)}_{hh})^T \mathbf{\delta}^{(k)}_{t}, \qquad t=T,\ldots,1
$$

where $k$ now indexes the stacked recurrent layers and $t$ indexes the time steps. The above equations indicate a fundamental problem of training neural networks: the **vanishing gradients problem**. Gradients can explode or vanish with a large number of layers, or a large number of time steps. This problem was pointed out by [Hochreiter](#Hochreiter91), and is particularly bad in the case of RNNs, where the length of sequences can be long (e.g. 100 time steps).
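
The effect of repeatedly applying the BPTT recursion can be seen in a simplified numerical experiment. The sketch below makes two simplifying assumptions that are not in the text above: the activation derivative is set to one (a linear hidden unit), and the recurrent weight matrix is a scaled orthogonal matrix, so that each backward step multiplies the error norm by exactly `scale`. The function name `backpropagated_norm` is hypothetical.

```python
import numpy as np

def backpropagated_norm(scale, num_steps=100, n=16, seed=0):
    """Norm of the error after num_steps backward steps through time."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix from a QR decomposition, rescaled (assumption)
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    W_hh = scale * Q
    # Start from an error vector of unit norm at the final time step
    delta = rng.normal(size=n)
    delta /= np.linalg.norm(delta)
    for _ in range(num_steps):
        # BPTT recursion with the activation derivative set to 1
        delta = W_hh.T @ delta
    return np.linalg.norm(delta)

for scale in (0.9, 1.0, 1.1):
    print(f"scale={scale}: error norm after 100 steps = {backpropagated_norm(scale):.3e}")
```

In this idealised setting the norm is `scale ** 100`, which is roughly `2.7e-5` for a scale of 0.9 and roughly `1.4e4` for a scale of 1.1, so a small change in the recurrent weights decides whether the signal vanishes or explodes over 100 time steps.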

The Long Short Term Memory (LSTM) network was introduced by [Hochreiter and Schmidhuber](#Hochreiter97) (and later updated by [Gers](#Gers99)) to mitigate the effect of vanishing gradients and allow the recurrent neural network to remember things for a long time.

The LSTM has inputs $\mathbf{x}_t\in\mathbb{R}^{n_{k-1}}$ and $\mathbf{h}_{t-1}\in\mathbb{R}^{n_{k}}$ just as regular RNNs. However, it also includes an internal **cell state** $\mathbf{c}_t\in\mathbb{R}^{n_{k}}$ that allows the unit to store and retain information (we drop the superscript $(k)$ in this section to ease notation). 

The LSTM cell works with a gating mechanism, consisting of logistic and linear units with multiplicative interactions. Information is written into the cell state when the 'write' (input) gate is on, the 'forget' gate decides which information in the cell state is erased, and information is read from the cell state when the 'read' (output) gate is on.

The following schematic diagram outlines the gating system of the LSTM unit.

<center><img src="figures/lstm.png" alt="LSTM" style="width: 600px;"/></center>
<center>The Long Short Term Memory (LSTM) gating system</center>
<br>

First of all, note that there is no neural network layer that operates directly on the cell state. This means that information is more freely able to travel across time steps in the cell state. The role of the hidden state is to manage the information flow in and out of the cell state, according to the signals provided in the inputs $\mathbf{h}_{t-1}$ and $\mathbf{x}_t$. 

The first of these operations is the _forget gate_.

#### The forget gate
The forget gate determines what information should be erased from the cell state.

<center><img src="figures/lstm-forget-gate.png" alt="LSTM forget gate" style="width: 450px;"/></center>
<center>The Long Short Term Memory (LSTM) forget gate</center>
<br>

The information is controlled by signals in the inputs $\mathbf{h}_{t-1}$ and $\mathbf{x}_t$ according to the following equation:

$$
\mathbf{f}_t = \sigma \left( \mathbf{W}_{f}\cdot [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_{f}\right),
$$

where $[\mathbf{x}_t, \mathbf{h}_{t-1}]\in\mathbb{R}^{n_k + n_{k-1}}$ is the concatenation of $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$, $\mathbf{W}_{f}\in\mathbb{R}^{n_k \times (n_k + n_{k-1})}$, $\mathbf{b}_{f}\in\mathbb{R}^{n_k}$ and $\sigma$ is the sigmoid activation function. Note that entries of $\mathbf{f}_t$ will be close to one for large positive pre-activation values, and close to zero for large negative pre-activation values. The cell state is then updated

$$
\mathbf{c}_t \leftarrow \mathbf{f}_t \odot \mathbf{c}_{t-1}
$$

where $\odot$ is the Hadamard (element-wise) product, so that selected entries of the cell state $\mathbf{c}_{t-1}$ are erased, while others are retained.

#### The input and content gates
The input gate determines when information should be written into the cell state. The content gate contains the information to be written.

<center><img src="figures/lstm-input-content-gate.png" alt="LSTM input-content gates" style="width: 450px;"/></center>
<center>The Long Short Term Memory (LSTM) input and content gates</center>
<br>

The input and content gates are a combination of sigmoid and tanh activation gates:

$$
\begin{align}
\mathbf{i}_t &= \sigma \left( \mathbf{W}_{i}\cdot [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_{i}\right)\\
\mathbf{\tilde{c}}_t &= \tanh\left( \mathbf{W}_{c}\cdot [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_c\right),
\end{align}
$$

where $\mathbf{W}_{i}, \mathbf{W}_c\in\mathbb{R}^{n_k \times (n_k + n_{k-1})}$ and $\mathbf{b}_i, \mathbf{b}_c\in\mathbb{R}^{n_k}$. In a similar way to the forget gate, the input gate $\mathbf{i}_t$ is used to 'zero out' selected entries in the content signal $\mathbf{\tilde{c}}_t$. The content entries that are allowed through the gate are then added into the cell state:

$$
\mathbf{c}_t \leftarrow \mathbf{c}_{t} + \mathbf{i}_t \odot \mathbf{\tilde{c}}_t
$$

#### The output gate
Finally, the output gate decides which cell state values should be output in the hidden state.

<center><img src="figures/lstm-output-gate.png" alt="LSTM output gate" style="width: 450px;"/></center>
<center>The Long Short Term Memory (LSTM) output gate</center>
<br>

The output gate is another sigmoid gate that releases information from the cell state after passing through a tanh activation:

$$
\begin{align}
\mathbf{o}_t &= \sigma\left(\mathbf{W}_o \cdot [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_o\right)\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{align}
$$
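
Putting the gate equations together, a single LSTM time step can be sketched in NumPy as below. This is a minimal illustration of the equations in this section, not the Keras implementation; the function names, the dictionary of parameters and the random initialisation are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([x_t, h_prev])                         # [x_t, h_{t-1}]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])          # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])          # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])      # content
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])          # output gate
    c_t = f_t * c_prev + i_t * c_tilde                        # erase, then write
    h_t = o_t * np.tanh(c_t)                                  # read
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_units = 4, 6

# Random weights for illustration only
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_units, n_in + n_units))
    params[f"b_{gate}"] = np.zeros(n_units)

# Unroll over a sequence of 8 time steps from zero initial states
h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(8, n_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (6,) (6,)
```

Note that the cell state is only modified by element-wise multiplication and addition, so when the forget gate is close to one and the input gate is close to zero, the cell state is carried forward almost unchanged.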

The LSTM network has been immensely successful in sequence modelling tasks, including handwriting recognition ([Graves et al 2009](#Graves09)), speech recognition ([Graves et al 2013](#Graves13)), machine translation ([Wu et al 2016](#Wu16)) and reinforcement learning for video games ([Vinyals et al 2019](#Vinyals19)).

Another type of gated recurrent cell that should be mentioned is the Gated Recurrent Unit (GRU), proposed in [Cho et al 2014](#Cho14), which simplifies the architecture by combining the forget and input gates into a single 'update' gate, and also merges the cell state and hidden state. We will not go into the detail of this cell architecture here; for more information, refer to the paper.

In Keras, the LSTM is implemented as another layer in the `keras.layers` module:

In [None]:
import keras

In [None]:
# Build an LSTM model

from keras.models import Sequential
from keras.layers import Input, LSTM

lstm = Sequential([
    Input(shape=(None, 12)),
    LSTM(16, return_sequences=True),
    LSTM(16),
])

In [None]:
# Print the model summary

lstm.summary()
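
The parameter counts in the summary can be checked against the gate equations: each of the four sets of weights (forget, input, content and output) has a matrix of shape $n_k \times (n_k + n_{k-1})$ and a bias of size $n_k$. The helper name below is hypothetical, and the sketch assumes the layer uses bias terms.

```python
def lstm_param_count(num_features, units):
    """Count the weights of one LSTM layer: four gates, each with W and b."""
    weights_per_gate = units * (units + num_features)
    bias_per_gate = units
    return 4 * (weights_per_gate + bias_per_gate)

first_layer = lstm_param_count(num_features=12, units=16)
second_layer = lstm_param_count(num_features=16, units=16)
print(first_layer, second_layer, first_layer + second_layer)  # 1856 2112 3968
```

The second layer has more parameters than the first because its input at each time step is the 16-dimensional hidden state of the first layer, rather than the 12 input features.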

RNN layers also have the optional keyword argument `return_state`, which defaults to `False`. When `True`, the layer returns the final internal state, in addition to its output. In the case of LSTM, this internal state is the hidden state $\mathbf{h}_t$ and cell state $\mathbf{c}_t$. So the `LSTM` layer would return `(outputs, hidden_state, cell_state)` when `return_state=True`.

In [None]:
# Build an LSTM model that returns its final internal state

from keras.models import Model

inputs = Input(shape=(8, 4))
outputs = LSTM(6, return_state=True, return_sequences=True)(inputs)
lstm2 = Model(inputs=inputs, outputs=outputs)

In [None]:
# Print the model summary

lstm2.summary()

In [None]:
# View the model outputs

lstm2.outputs

In [None]:
# Test the model on a dummy input

lstm2(keras.random.normal((1, 8, 4)))

The `LSTM` layer can also be called using the `initial_state` argument; in this case, a list of `[hidden_state, cell_state]` should be passed to this argument.

The GRU is also available as the `GRU` layer in `keras.layers`, and has a similar API.

<a class="anchor" id="preprocessingembedding"></a>
## Preprocessing and Embedding layers

In this final section of the week we will look at layers that are particularly useful when working with text data. [Preprocessing layers](https://keras.io/api/layers/preprocessing_layers/) can be used to convert text data into a numerical representation that can be used by neural networks. Embedding layers take data that has been tokenized into integer sequences, and act as a look-up table to map each integer token to its own embedding vector in $\mathbb{R}^D$.

In [None]:
import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

In [None]:
import keras
from keras import ops
import torch

For this tutorial we will use the [Twitter airline sentiment dataset](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) from Kaggle, which consists of 14,640 tweets labelled as having positive, negative or neutral sentiment.

#### Loading and preparing the data

In [None]:
# Load the data

import pandas as pd
from pathlib import Path

df = pd.read_csv(Path('./data/tweets.csv'))
print(df.shape)
df.head()

In [None]:
# Extract the relevant columns

df = df[['text', 'airline_sentiment', 'airline_sentiment_confidence']]

In [None]:
# View a sample tweet and its label

df.sample(1).values

In [None]:
# Split the data into training, validation and test sets

from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.4)
val_df, test_df = train_test_split(val_df, test_size=0.5)

When working with text data, it is useful to know that in TensorFlow, Tensors of string type are allowed but in PyTorch they are not. Therefore, in order to remain backend independent, we will avoid creating string Tensors. 

We will choose to work with PyTorch DataLoaders, as the dataset preparation is slightly more involved and will make a useful example for how to make custom Datasets in PyTorch.

_NB: If using TensorFlow Datasets, we could save the above DataFrames to CSV files and use the [`CsvDataset`](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset) class to load the CSVs directly into Dataset objects._

#### Preprocessing layers

We will need to convert the string data into a numeric representation for the models to process it. We will do this using [preprocessing layers](https://keras.io/guides/preprocessing_layers/). 

Our custom Datasets will need to tokenise the input text as well as convert the output text labels to a numeric representation. We will use the [`TextVectorization`](https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/) and [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup/) Keras layers to help with these respective tasks.

In [None]:
# Create a TextVectorization layer

from keras.layers import TextVectorization

textvectorization = TextVectorization(max_tokens=1000)

In [None]:
# Configure the layer to the dataset

textvectorization.adapt(train_df['text'])

In [None]:
# Test the TextVectorization layer

input_text = train_df['text'].sample(1)
print(input_text.values)
textvectorization(input_text)

`TextVectorization` layers have a `get_vocabulary()` method which can be used to obtain the (ordered) list of words in the vocabulary. Note that the token index 0 is used for zero padding, and the token index 1 is the OOV (`[UNK]`) token.

In [None]:
# Get the word vocabulary

vocabulary = textvectorization.get_vocabulary()
inx2word = {i: word for i, word in enumerate(vocabulary)}
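
To illustrate what this kind of vectorization does, here is a toy pure-Python sketch that lowercases the text, strips punctuation, splits on whitespace and maps each word to an integer index, with index 0 reserved for padding and index 1 for out-of-vocabulary words. It is a simplified stand-in written for illustration, not the `TextVectorization` implementation; the function names and the tiny corpus are made up.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts, max_tokens):
    """Most frequent words first; two slots are reserved for padding and OOV."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary):
    """Map each word to its index, using 1 for words not in the vocabulary."""
    word2inx = {word: i for i, word in enumerate(vocabulary)}
    return [word2inx.get(word, 1) for word in standardize(text).split()]

# Made-up corpus for illustration
corpus = ["The flight was late", "The crew was great!", "Late again..."]
vocab = build_vocabulary(corpus, max_tokens=6)
token_ids = vectorize("The pilot was great", vocab)

print(vocab)
print(token_ids)
```

Words that were never seen when building the vocabulary (here `pilot`), or that were too rare to make the cut, are all mapped to the same `[UNK]` index.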

In [None]:
# Create a StringLookup layer

from keras.layers import StringLookup

output_labels = ['positive', 'negative', 'neutral']
stringlookup = StringLookup(vocabulary=output_labels, num_oov_indices=0)

In [None]:
# Test the StringLookup layer

sample_labels = train_df['airline_sentiment'].sample(12)
print(sample_labels.values)
stringlookup(sample_labels)

As an extra preprocessing step, we will also filter out examples where the confidence score is too low.

In [None]:
# Plot a histogram of confidence scores

import matplotlib.pyplot as plt

plt.hist(df['airline_sentiment_confidence'], bins=50)
plt.title("Sentiment confidence scores")
plt.xlabel("Score")
plt.ylabel("Count")
plt.show()

We will choose 0.5 as a cutoff threshold for the confidence score.

Now we are ready to create our custom Dataset. Custom Datasets in PyTorch should subclass from the `torch.utils.data.Dataset` class, and should implement the  `__init__`, `__len__`, and `__getitem__` methods. See [this tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files) for more information and examples.

In [None]:
# Custom Dataset to read and preprocess the CSV data

class TweetDataset(torch.utils.data.Dataset):
    
    def __init__(self, df, textvectorization_layer, stringlookup_layer):
        # NB: base class doesn't define an __init__, so no need to call super().__init__()
        
        # Filter out low confidence labels
        self.df = df[df['airline_sentiment_confidence'] > 0.5]
        
        self.textvectorization = textvectorization_layer
        self.stringlookup = stringlookup_layer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        elem = self.df.iloc[index]
        tokenized_text = self.textvectorization(elem["text"])
        label = self.stringlookup(elem["airline_sentiment"])
        return tokenized_text, label

In [None]:
# Create the custom Dataset objects

train_dataset = TweetDataset(train_df, textvectorization, stringlookup)
val_dataset = TweetDataset(val_df, textvectorization, stringlookup)
test_dataset = TweetDataset(test_df, textvectorization, stringlookup)

In [None]:
# Test the training Dataset

import numpy as np

print(len(train_dataset))
sample_inx = np.random.choice(len(train_dataset))
train_dataset[sample_inx]

The Datasets are now ready to be loaded into DataLoaders as normal. However, these Datasets have the property that the input sentence tokens can have different lengths, making batching difficult. We will solve this problem by adding zero tokens to each example in a batch up to the length of the longest token sequence in the batch (this is what the zero token is for). We could have chosen to add zero padding in the custom Dataset class above, but we will instead use the following function to pass to the `collate_fn` argument in the DataLoader initializer.

In [None]:
# Define batching function

def padded_batch(batch):
    inputs, outputs = zip(*batch)
    
    # The pad_sequence fn expects torch Tensors. The following conversion is only necessary for TF backend
    inputs = [torch.tensor(ops.convert_to_numpy(t)) for t in inputs]
    outputs = [torch.tensor(ops.convert_to_numpy(l)) for l in outputs]
    
    inputs = torch.nn.utils.rnn.pad_sequence(inputs, batch_first=True, padding_value=0)
    outputs = torch.tensor(outputs)
    return inputs, outputs

_NB: If working with TensorFlow Datasets, using [`.padded_batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch) instead of `.batch` pads the examples as above._
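
The padding step itself is simple, and can be sketched without any framework. The following pure-Python function is an illustration of the idea only (the name `pad_to_longest` and the token values are made up); it pads each sequence in a batch with zeros up to the length of the longest sequence in that batch.

```python
def pad_to_longest(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]

# A made-up batch of three tokenized tweets of different lengths
batch = [[5, 12, 7], [3], [8, 2, 9, 4, 6]]
padded = pad_to_longest(batch)

for row in padded:
    print(row)
# [5, 12, 7, 0, 0]
# [3, 0, 0, 0, 0]
# [8, 2, 9, 4, 6]
```

Since padding is done per batch, different batches can have different sequence lengths, which avoids padding every example up to the longest tweet in the whole dataset.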

In [None]:
# Create the DataLoaders

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=padded_batch)
val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False, collate_fn=padded_batch)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=padded_batch)

In [None]:
# Test the training DataLoader

t, l = next(iter(train_dataloader))
print(t)
print(l)

#### Embedding layer
We are now able to process the text data into numerical form. However, the input integer tokens should be further processed to transform them into a representation that is more useful for the network. This is where the `Embedding` layer can be used - it creates a lookup table of vectors in $\mathbb{R}^D$ such that each integer token in the vocabulary has its own $D$-dimensional embedding vector.

In [None]:
# Create an Embedding layer

from keras.layers import Embedding

embedding = Embedding(1000, 2, mask_zero=True)

In [None]:
# View the output of the Embedding layer

t, l = next(iter(train_dataloader))
print(embedding(t).shape)
embedding(t)._keras_mask
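
Conceptually, the embedding lookup and the zero mask can be sketched in NumPy as an indexing operation into a table of vectors. This is an illustration of the idea rather than the Keras implementation; the random table stands in for learned embedding weights and the token values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 2

# One row per token index (random here; learned during training in practice)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A made-up padded batch of token indices, shape (batch_size, sequence_length)
tokens = np.array([[5, 12, 7, 0, 0],
                   [3, 0, 0, 0, 0]])

# Look up the embedding vector of every token
embedded = embedding_table[tokens]

# Mask that is False wherever the token is the zero padding token
mask = tokens != 0

print(embedded.shape)  # (2, 5, 2)
print(mask)
```

The mask marks which time steps contain real tokens, so that downstream recurrent layers can skip over the padded positions.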

In [None]:
# Build the classifier model

from keras.models import Sequential
from keras.layers import Input, Bidirectional, LSTM, Dense

lstm_classifier = Sequential([
    Input(shape=(None,)),
    embedding,
    Bidirectional(LSTM(8)),
    Dense(3, activation='softmax')
])
lstm_classifier.summary()

In [None]:
# Compile and train the model

lstm_classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = lstm_classifier.fit(train_dataloader, validation_data=val_dataloader, epochs=5)

In [None]:
# Evaluate the model on the test Dataset

lstm_classifier.evaluate(test_dataloader)

In [None]:
# Plot the learning curves

fig = plt.figure(figsize=(15, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.title("Loss vs epoch")

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.title("Accuracy vs epoch")

plt.show()

In [None]:
# View some model predictions on the test set

text, label = next(iter(test_dataloader))

ground_truth = np.array(output_labels)[label]
predicted_label_ints = np.argmax(ops.convert_to_numpy(lstm_classifier(text)), axis=1)
predicted_labels = np.array(output_labels)[predicted_label_ints]

for t, l, g in zip(text, predicted_labels, ground_truth):
    print(' '.join([inx2word[i] for i in ops.convert_to_numpy(t)]))
    print("True label: {}\nPredicted label: {}\n".format(g, l))

_Exercise 1._ Train a new LSTM classifier model using `tf.data.Dataset` objects, loading the training/validation/test splits with `CsvDataset`, and carrying out all preprocessing using the `map` and `filter` methods.

_Exercise 2._ Test the trained model as above, but pull the examples directly from the test DataFrame.

<a class="anchor" id="references"></a>
## References

<a class="anchor" id="Cho14"></a>
* Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014), "Learning phrase representations using RNN encoder–decoder for statistical machine translation", in *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 1724–1734.
<a class="anchor" id="Gers99"></a>
* Gers, F.A. (1999), "Learning to forget: Continual prediction with LSTM", *9th International Conference on Artificial Neural Networks: ICANN '99*, 850–855.
<a class="anchor" id="Graves09"></a>
* Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2009), "A Novel Connectionist System for Unconstrained Handwriting Recognition", *IEEE Transactions on Pattern Analysis and Machine Intelligence*, **31** (5), 855–868.
<a class="anchor" id="Graves13"></a>
* Graves, A., Mohamed, A.-R., Hinton, G. (2013), "Speech Recognition with Deep Recurrent Neural Networks", arXiv preprint, abs/1303.5778.
<a class="anchor" id="Hochreiter91"></a>
* Hochreiter, S. (1991), "Untersuchungen zu dynamischen neuronalen Netzen", Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
<a class="anchor" id="Hochreiter97"></a>
* Hochreiter, S. and Schmidhuber, J. (1997), "Long short-term memory", *Neural Computation*, **9** (8), 1735–1780.
<a class="anchor" id="Rumelhart86b"></a>
* Rumelhart, D. E., Hinton, G., and Williams, R. (1986b), "Learning representations by back-propagating errors", *Nature*, **323**, 533-536.
<a class="anchor" id="Schuster97"></a>
* Schuster, M. & Paliwal, K. K. (1997), "Bidirectional Recurrent Neural Networks", *IEEE Transactions on Signal Processing*, **45** (11), 2673-2681.
<a class="anchor" id="Vinyals19"></a>
* Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., & Silver, D. (2019), "Grandmaster level in StarCraft II using multi-agent reinforcement learning", *Nature*, **575** (7782), 350-354.
<a class="anchor" id="Wu16"></a>
* Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M, Macherey, W., Krikun, M., Cao, Y., & Gao, Q. (2016), "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv preprint, abs/1609.08144.