# Deep Learning
## Formative assessment
### Week 7: Transformers

#### Instructions

In this notebook, you will write code to implement and train a Transformer classifier model.

Some code cells are provided for you in the notebook. You should avoid editing provided code, and make sure to execute the cells in order to avoid unexpected errors. Some cells begin with the line:

`#### GRADED CELL ####`

These cells require you to write your own code to complete them.

#### Let's get started!

We'll start by running some imports, and loading the dataset.

In [None]:
#### PACKAGE IMPORTS ####

# Run this cell to import all required packages. 

import keras
from keras import ops
from keras.models import Sequential
from keras.layers import (Input, Layer, TextVectorization, Dense, MultiHeadAttention, 
                          LayerNormalization, Embedding, Dropout, GlobalAveragePooling1D)
import numpy as np
import matplotlib.pyplot as plt

<center><img src="figures/IMDb.png" title="IMDb" style="width: 550px;"/></center>
  
#### The IMDb dataset

In this assignment, you will use the [IMDb dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). This is a sentiment analysis dataset of movie reviews with binary labels. It contains 25,000 training examples and a further 25,000 for testing. 

* Maas, A.L.,  Daly, R.E.,  Pham, P.T.,  Huang, D.,  Ng, A.Y. & Potts, C. (2011), "Learning Word Vectors for Sentiment Analysis", _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_, 142-150.

Your goal is to build and train an encoder-only Transformer classifier model to predict the sentiment labels from the review text.

#### Load and prepare the data
For this assignment, you will load the IMDb dataset from the TensorFlow Datasets library:

In [None]:
# Run this cell to load the data and print the element_spec

import tensorflow_datasets as tfds

train_data = tfds.load("imdb_reviews", split="train", read_config=tfds.ReadConfig(try_autocache=False))
test_data = tfds.load("imdb_reviews", split="test", read_config=tfds.ReadConfig(try_autocache=False))

train_data.element_spec

In [None]:
# View some samples

for example in train_data.shuffle(100).take(2):
    print(ops.convert_to_numpy(example['text']).item().decode('utf-8'))
    print(f"Label: {ops.convert_to_numpy(example['label'])}")
    print()

#### Tokenizing the input sentences

We need to convert the text into integer tokens before the Transformer can process it. To do this, we will use a `TextVectorization` layer and adapt it to the training data. You should now complete the following function to create and prepare this layer as follows:

* The function takes a `dataset` (a `tf.data.Dataset` object) as an argument, which has the same spec as `train_data` or `test_data` above. It also takes a `max_tokens` argument
* The `TextVectorization` should be configured to use a maximum of `max_tokens` tokens (including the masking and OOV tokens)
* It should standardize the input text by lower-casing the text and removing punctuation
* It should split the text on whitespace
* You should use the `adapt` method to compute the vocabulary using `dataset`
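
To make the specification above concrete, here is a pure-Python sketch of what the standardisation, splitting and vocabulary steps amount to. It is an illustration only and not the Keras implementation: the helper names `standardize`, `build_vocab` and `tokenize` are hypothetical, `string.punctuation` is assumed to be a reasonable stand-in for the punctuation that gets stripped, and the index convention assumed is 0 for the masking token and 1 for the OOV token.

```python
import string
from collections import Counter

def standardize(text):
    # Lower-case the text and strip punctuation (approximation of the layer's behaviour)
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    # Count whitespace-separated words over the whole corpus
    counts = Counter(word for text in texts for word in standardize(text).split())
    # Two indices are reserved: 0 (masking/padding) and 1 (out-of-vocabulary)
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def tokenize(text, vocab):
    # Unknown words map to the OOV index 1
    return [vocab.get(word, 1) for word in standardize(text).split()]

toy_reviews = ["A great film, truly great!", "Not a great film."]
vocab = build_vocab(toy_reviews, max_tokens=5)
print(vocab)
print(tokenize("A GREAT movie.", vocab))
```

With `max_tokens=5` only the three most frequent words receive their own index, so the unseen word "movie" is mapped to the OOV token.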

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def configure_textvectorization(dataset, max_tokens):
    """
    This function should create a TextVectorization layer and configure it as above.
    The function should then return the TextVectorization layer.
    """
    
    

In [None]:
# Use your function to create and configure the TextVectorization layer

MAX_TOKENS = 20000
text_vectorization = configure_textvectorization(train_data, max_tokens=MAX_TOKENS)

In [None]:
# Test your TextVectorization layer

for example in train_data.shuffle(100).take(2):
    print(text_vectorization(example['text']))

#### Preprocess the datasets

You should now complete the following function `preprocess_dataset` which you will use to preprocess the `train_data` and `test_data` Dataset objects according to the following spec:

* The function takes `dataset`, `text_vectorization_layer`, `max_seq_len`, `batch_size` and `shuffle_buffer_size` as arguments
    * `dataset` is a `tf.data.Dataset` object with the same spec as `train_data` or `test_data` above
* The `text_vectorization_layer` should be used to convert the text into integer tokens
* The maximum length of the token sequences should be `max_seq_len`. Any token sequences longer than this should be truncated
* The Datasets should return a tuple of `(tokens, label)` Tensors
* The Datasets should be shuffled with buffer size `shuffle_buffer_size`, and then batched with `batch_size`. Note that the sequences will not be the same length, so the batches should be padded with zero masking tokens where necessary (see [the docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch))
* The function should then return the preprocessed `dataset` Dataset object
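
The truncation and padding described above can be illustrated with a small NumPy sketch. This only mimics the effect on a single batch and is not the `tf.data` pipeline you are asked to write; the helper name `truncate_and_pad` and the toy token values are made up for illustration.

```python
import numpy as np

def truncate_and_pad(sequences, max_seq_len):
    # Truncate every sequence to at most max_seq_len tokens
    truncated = [seq[:max_seq_len] for seq in sequences]
    # Pad with zeros (the masking token) up to the longest sequence in this batch
    longest = max(len(seq) for seq in truncated)
    batch = np.zeros((len(truncated), longest), dtype=np.int64)
    for row, seq in enumerate(truncated):
        batch[row, :len(seq)] = seq
    return batch

batch = truncate_and_pad([[5, 8, 2], [7], [3, 9, 4, 6, 11]], max_seq_len=4)
print(batch)
```

Note that the batch is padded to the longest sequence it contains, so different batches can have different lengths, all of which are at most `max_seq_len`.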

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def preprocess_dataset(dataset, text_vectorization_layer, max_seq_len, batch_size, shuffle_buffer_size):
    """
    This function should preprocess the Dataset object as above.
    The function should then return the preprocessed Dataset.
    """
    
    

In [None]:
# Run your function to preprocess the Datasets

MAX_SEQ_LEN = 200
train_data = preprocess_dataset(train_data, text_vectorization, MAX_SEQ_LEN, 
                                batch_size=32, shuffle_buffer_size=500)
test_data = preprocess_dataset(test_data, text_vectorization, MAX_SEQ_LEN, 
                               batch_size=32, shuffle_buffer_size=500)

In [None]:
# Print the element_spec

train_data.element_spec

In [None]:
# Inspect a data minibatch

for example in train_data.take(1):
    print(example)

Note that when we pass this Tensor of integer tokens through our Transformer, we will need to be careful not to use the zero padding tokens. The mechanism to handle this is masking (see [this guide](https://www.tensorflow.org/guide/keras/masking_and_padding)), and our custom layers will need to make use of this masking mechanism.
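
As a minimal NumPy illustration of the idea (with made-up token values), the boolean mask simply marks which positions hold real tokens rather than zero padding:

```python
import numpy as np

# A toy padded batch: zeros are padding tokens
tokens = np.array([[12, 7, 431, 0, 0],
                   [5, 0, 0, 0, 0]])

# True where there is a real token, False where there is padding
padding_mask = tokens != 0
print(padding_mask)
```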

#### Transformer architecture

We will use an encoder-only Transformer classifier architecture for the task of sentiment prediction. This will consist of a single encoder block, followed by a classifier head.

<center><img src="figures/encoder-only_transformer.png" alt="Transformer" style="width: 250px;"/></center>

#### Positional encodings and Embedding layer

You will now implement the input embedding and positional encoding stage of the Transformer. Your model will use the deterministic positional encoding scheme $\mathbf{P}\in\mathbb{R}^{n\times d_{model}}$ as in the original Transformer:

$$
\begin{align}
P_{ti} &= \left\{
\begin{array}{ll}
\sin(\omega_k t) & \text{for }i=2k+1\quad(\text{for some }k\in\mathbb{N}_0),\\
\cos(\omega_k t) & \text{for }i=2k+2\quad(\text{for some }k\in\mathbb{N}_0),
\end{array}
\right.\\
\omega_k &= \frac{1}{10000^{2k/d_{model}}},
\end{align}
$$

where $t=1,\ldots,n$ and $i=1,\ldots,d_{model}$.

You should now complete the following `positional_encodings` function to compute the positional encoding $\mathbf{P}\in\mathbb{R}^{n\times d_{model}}$.

* The function takes `seq_len` (the number of time steps) and `d_model` (the embedding dimension) as integer arguments
* The function should compute a 2D Tensor of shape `(seq_len, d_model)` of positional encodings according to the above equations (be careful with Python's zero-indexing: the Python indices are off by one from the mathematical indices above)
* The function should then return the Tensor of positional encodings with type `float32`
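
To check your understanding of the index conventions, the following sketch evaluates a single entry $P_{ti}$ using the one-based mathematical indices exactly as they appear in the equations above. The helper `pe_entry` is hypothetical and is not the graded function, which should build the whole Tensor at once.

```python
import math

def pe_entry(t, i, d_model):
    # t and i are the ONE-BASED mathematical indices from the equations above
    k = (i - 1) // 2                       # i = 2k+1 or i = 2k+2 both give this k
    omega_k = 1.0 / (10000 ** (2 * k / d_model))
    if i % 2 == 1:                         # odd i = 2k+1 -> sine
        return math.sin(omega_k * t)
    return math.cos(omega_k * t)           # even i = 2k+2 -> cosine

d_model = 4
for i in range(1, d_model + 1):
    print(i, pe_entry(1, i, d_model))
```

For `d_model = 4` the two frequencies are $\omega_0 = 1$ and $\omega_1 = 10000^{-1/2} = 0.01$, so the first row is $(\sin 1, \cos 1, \sin 0.01, \cos 0.01)$.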

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def positional_encodings(seq_len, d_model):
    """
    This function should compute the positional encodings as above.
    The function should then return the Tensor of positional encodings.
    """
    
    

In [None]:
# Run your function to get the positional encodings

D_MODEL = 32
pos_encodings = positional_encodings(MAX_SEQ_LEN, D_MODEL)

The positional encodings should be added to the token embeddings in the first stage of the Transformer.

You should now complete the `__init__` and `call` methods for the following custom layer `InputEmbeddings`, which converts the integer token sequence into a sequence of embeddings, and then adds positional encodings.

* The initialiser takes the following required arguments:
    * `d_model`: the dimension of the embedding vectors
    * `pos_encodings`: the Tensor of positional encodings of shape `(max_seq_len, d_model)`, as computed by the `positional_encodings` function
    * `max_tokens`: The maximum number of integer tokens used in the input (including the masking and OOV tokens)
* The custom layer should create an `Embedding` layer with the correct input and output dimensions, that is set to mask incoming zero tokens. This layer should be set as the attribute `self.embedding`
* The `call` method takes a Tensor of integer tokens of shape `(batch_size, n)` as input, where `n` is the maximum sequence length in the batch
* The `call` method should use the `Embedding` lookup layer to convert the inputs to embedding vectors, and return the sum of the embedding vectors and positional encodings

_NB: The custom layer also implements the `compute_mask` method, which has been completed for you. This method should be implemented whenever a layer should produce a mask (see [this guide](https://www.tensorflow.org/guide/keras/masking_and_padding) for more information). Note that this method references `self.embedding`. In this instance we want to pass on the same mask that is produced by the `Embedding` layer, so the `compute_mask` method simply calls the existing method from the `Embedding` layer._

In [None]:
#### GRADED CELL ####

# Complete the following class.
# Make sure not to change the class or method names or arguments.

class InputEmbeddings(Layer):
    """
    This custom layer should take a batch of integer tokens as input,
    convert them into embeddings and add the positional encodings.
    """
    
    def __init__(self, d_model, pos_encodings, max_tokens, name='input_embeddings', **kwargs):
        super().__init__(name=name, **kwargs)
        
        
        
    def compute_mask(self, inputs, mask=None):
        return self.embedding.compute_mask(inputs)
        
    def call(self, inputs):
        """
        inputs is an integer Tensor of shape (batch_size, n), where n is 
        the maximum sequence length in this batch (n <= max_seq_len)
        """       
        
        

In [None]:
# Create an instance of your custom layer

input_embeddings = InputEmbeddings(D_MODEL, pos_encodings, MAX_TOKENS)

In [None]:
# Test your custom layer on an input

for tokens, _ in train_data.take(1):
    h = input_embeddings(tokens)

In [None]:
# Check that our custom layer is producing a mask

h._keras_mask

#### Encoder block

You will now implement the encoder block of the Transformer model. This block consists of a multi-head attention block with a residual connection followed by layer normalisation, and then a pointwise feedforward network with residual connection again followed by layer normalisation.

The multi-head attention block will need to account for the masking corresponding to the zero padding tokens in the input. The way the `MultiHeadAttention` layer handles this is through the `attention_mask` argument when it is called. 

The incoming mask is a boolean Tensor with shape `(batch_size, seq_len)`, where `seq_len` is the length of the sequence of embedding vectors being input to the `MultiHeadAttention` layer. The multi-head attention is performing self-attention, so the shape of the mask required by the `attention_mask` argument will be `(batch_size, seq_len, seq_len)`.

Before implementing the encoder block, you should complete the following function `get_attention_mask`, which takes a single argument `mask`, which is a boolean Tensor of shape `(batch_size, seq_len)`, or `None`. If `mask` is `None`, then this function should return `None`. Otherwise, the function should return a boolean Tensor of shape `(batch_size, seq_len, seq_len)` which will be used by the `MultiHeadAttention` layer.

For a single example in the batch, suppose the vector mask is $\mathbf{m}\in\mathbb{R}^n$, where $n$ is the sequence length. We would like to convert this vector mask to a matrix mask $\mathbf{M}\in\mathbb{R}^{n\times n}$, where the $(i,j)$-th element $M_{ij}$ is given by element $\max(i,j)$ of the vector $\mathbf{m}$. Because the padding sits at the end of each sequence, this is the same as requiring both $m_i$ and $m_j$ to be `True`.

For example, if the incoming boolean mask was as follows:

```
mask = [[True, True, False],
        [True, False, False]]
```

where the batch size is 2 and the sequence length is 3, then the mask returned by the function `get_attention_mask` should look like:

```
returned_mask = [[[True, True, False],
                  [True, True, False],
                  [False, False, False]],
                 [[True, False, False],
                  [False, False, False],
                  [False, False, False]]]
```
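
The example can be checked with a short NumPy sketch in which entry `[b, i, j]` is `True` only when positions `i` and `j` of example `b` are both unmasked. This is one possible reading of the construction, written with NumPy broadcasting rather than the Tensor operations your graded function should use, and the function name below is hypothetical.

```python
import numpy as np

def attention_mask_from_padding_mask(mask):
    # mask: boolean array of shape (batch_size, seq_len), or None
    if mask is None:
        return None
    mask = np.asarray(mask, dtype=bool)
    # Broadcast (batch, seq, 1) against (batch, 1, seq) -> (batch, seq, seq)
    return mask[:, :, None] & mask[:, None, :]

toy_mask = np.array([[True, True, False],
                     [True, False, False]])
toy_attention_mask = attention_mask_from_padding_mask(toy_mask)
print(toy_attention_mask.shape)
print(toy_attention_mask)
```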

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def get_attention_mask(mask=None):
    """
    This function should compute the attention mask as described above.
    The function should then return the boolean Tensor.
    """
    
    

In [None]:
# Test your mask function

input_mask = ops.array([[True, True, False], [True, False, False]])
get_attention_mask(input_mask)

You should complete the following `EncoderBlock` custom layer, that implements the encoder block. 

* The initialiser takes the following required arguments:
    * `num_heads`: the number of attention heads to use in the multi-head attention
    * `key_dim`: the key (and query and value) dimension to use in the multi-head attention
    * `d_model`: the embedding dimension of the Transformer
    * `ff_dim`: the width of the hidden layer in the feedforward network in the encoder block
* The initialiser sets `self.supports_masking = True` (this has been done for you), so that the incoming boolean mask will also be included in the output of the `EncoderBlock` layer
* The `EncoderBlock` layer should create `MultiHeadAttention`, `LayerNormalization` and `Dense` layers in the initialiser as required
    * The feedforward network should have one hidden layer of size `ff_dim` with a ReLU activation
    * The output layer of the feedforward network should have size `d_model` and no activation
* The operation of the custom layer is as follows:
    * The input to the layer is a Tensor of shape `(batch_size, seq_len, d_model)`
    * The call method also takes a `mask` argument, which will be a boolean mask of shape `(batch_size, seq_len)`. The call method should use your `get_attention_mask` function to compute the attention mask
    * Pass the input through the `MultiHeadAttention` layer (performing self-attention), passing the attention mask in the `attention_mask` argument. Add the resulting Tensor output to the input (residual connection) and pass through a `LayerNormalization` layer
    * Pass the resulting Tensor $h$ through the feedforward network, add this to $h$ (residual connection) and pass through a `LayerNormalization` layer
    * The custom layer should then return the resulting Tensor
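
The "sublayer, residual connection, layer normalisation" pattern used twice in the block can be sketched in NumPy for the feedforward half. This is an illustration under simplifying assumptions: the weights are random, the layer normalisation has no learnable scale or offset, the `eps` value is arbitrary, and the attention sublayer is omitted entirely.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each embedding vector over the last (feature) axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feedforward(x, w1, b1, w2, b2):
    # One hidden layer with ReLU, then a linear output layer back to d_model
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, ff_dim = 2, 5, 8, 16
h = rng.normal(size=(batch_size, seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, ff_dim)) * 0.1, np.zeros(ff_dim)
w2, b2 = rng.normal(size=(ff_dim, d_model)) * 0.1, np.zeros(d_model)

# Residual connection followed by layer normalisation
out = layer_norm(h + feedforward(h, w1, b1, w2, b2))
print(out.shape)
```

The feedforward network is applied independently at every time step, and its output size must equal `d_model` so that the residual addition is well defined.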

In [None]:
#### GRADED CELL ####

# Complete the following class.
# Make sure not to change the class or method names or arguments.

class EncoderBlock(Layer):
    """
    This custom layer should take a Tensor of shape (batch_size, seq_len, d_model) as input.
    It should carry out the operations as described above and return the resulting Tensor.
    """
    
    def __init__(self, num_heads, key_dim, d_model, ff_dim, name='encoder_block', **kwargs):
        super().__init__(name=name, **kwargs)
        self.supports_masking = True  # This will pass on any incoming mask
        
        
        
    def call(self, inputs, mask=None):
        """
        inputs is a Tensor of shape (batch_size, seq_len, d_model)
        """    
        
        

In [None]:
# Create an EncoderBlock instance

encoder_block = EncoderBlock(num_heads=2, key_dim=16, d_model=32, ff_dim=32)

In [None]:
# Test your layer on a dummy input

inputs = keras.random.normal((16, 200, 32))
h = encoder_block(inputs)

In [None]:
# Test your layer on the output from the input_embeddings layer

for tokens, _ in train_data.take(1):
    h = input_embeddings(tokens)
    h = encoder_block(h)

In [None]:
# Check that the mask has been propagated

h._keras_mask

#### Classifier head

The final stage of our Transformer model is the classifier head. This stage consists of the following layers:

* A `GlobalAveragePooling1D` layer, that takes an incoming Tensor of shape `(batch_size, seq_len, d_model)` and reduces out the time axis to produce a Tensor of shape `(batch_size, d_model)`
* A dropout layer
* A dense layer with ReLU activation
* A dropout layer
* A dense layer with a single neuron output and sigmoid activation function

The final dense layer outputs the probability of a positive sentiment label.

You should now complete the following `get_classifier_head` function, which takes the arguments `d_model`, `dropout_rate` and `units`. The function should build and return a `Sequential` model object, according to the above specification, where `dropout_rate` is used in both `Dropout` layers and `units` is used to define the width of the intermediate `Dense` layer. The `d_model` input should be used to set the input shape of the model, `(None, d_model)`, for example with an `Input` layer at the start of the `Sequential` model.

Note that the [`GlobalAveragePooling1D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer will automatically use the incoming mask when used in the `Sequential` model, see [the guide](https://www.tensorflow.org/guide/keras/masking_and_padding) here.
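
To see why the mask matters for the pooling step, here is a NumPy sketch of a masked average over the time axis. It is only an illustration of the idea, not the Keras layer itself; the function name is hypothetical, and it assumes every sequence has at least one unmasked position.

```python
import numpy as np

def masked_average_pool(h, mask):
    # h: (batch_size, seq_len, d_model), mask: boolean (batch_size, seq_len)
    weights = mask.astype(h.dtype)[:, :, None]
    summed = (h * weights).sum(axis=1)   # sum over unmasked time steps only
    counts = weights.sum(axis=1)         # number of unmasked time steps
    return summed / counts

h = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[True, True, False]])
pooled = masked_average_pool(h, mask)
print(pooled)
```

The third time step is padding, so it is excluded from the average; an unmasked mean would instead be dominated by its values.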

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def get_classifier_head(d_model, dropout_rate, units):
    """
    This function should build the classifier head model as described above.
    The function should then return the Model object.
    """
    
    

In [None]:
# Create an instance of the classifier head

classifier_head = get_classifier_head(D_MODEL, dropout_rate=0.1, units=20)

In [None]:
# Print the model summary

classifier_head.summary()

In [None]:
# Test your classifier head model on a dummy input

inputs = keras.random.normal((8, 200, D_MODEL))
classifier_head(inputs)

#### Build the Transformer classifier

We now have all the components to build the complete Transformer classifier. You should now complete the following function `get_transformer_classifier` to build and compile the model.

The function takes the arguments `input_embeddings_layer`, `encoder_block_layer` and `classifier_head_layer`. It should use these layers to build a `Sequential` model, and compile it with a suitable loss function and optimizer, and an accuracy metric.

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def get_transformer_classifier(input_embeddings_layer, encoder_block_layer, classifier_head_layer):
    """
    This function should build and compile the Transformer classifier model as described above.
    The function should then return the Model object.
    """
    
    

In [None]:
# Get the compiled Transformer classifier model

input_embeddings = InputEmbeddings(D_MODEL, pos_encodings, MAX_TOKENS)
encoder_block = EncoderBlock(num_heads=2, key_dim=16, d_model=32, ff_dim=32)
classifier_head = get_classifier_head(D_MODEL, dropout_rate=0.1, units=20)
transformer = get_transformer_classifier(input_embeddings, encoder_block, classifier_head)

In [None]:
# Test the Transformer classifier model

for tokens, _ in train_data.take(1):
    outputs = transformer(tokens)

#### Train the model

The following call to `model.fit` uses the `steps_per_epoch` and `validation_steps` keyword arguments, which specify the number of iterations that define an epoch and validation phase respectively. When using these options, you should `repeat()` the `tf.data.Dataset` to create an infinitely repeating Dataset.
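
As a quick piece of arithmetic, using the batch size of 32, the 25,000 training examples and the `steps_per_epoch=200` setting from this notebook, each "epoch" here covers only part of the training set:

```python
import math

num_examples = 25_000      # size of the training split
batch_size = 32            # batch size used in preprocess_dataset
steps_per_epoch = 200      # value passed to fit

examples_per_epoch = steps_per_epoch * batch_size
batches_per_full_pass = math.ceil(num_examples / batch_size)

print(examples_per_epoch)       # 6400 examples seen per epoch
print(batches_per_full_pass)    # 782 batches for one full pass over the data
```

So one epoch in this setting sees roughly a quarter of the training data, which is why the Dataset needs to repeat rather than stop when it is exhausted.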

In [None]:
history = transformer.fit(train_data.repeat(), validation_data=test_data.repeat(), epochs=10,
                          steps_per_epoch=200, validation_steps=200)

#### Test on unlabelled data

The IMDb dataset also contains a split without labels. The following cell loads this dataset split and applies a shuffle.

In [None]:
unsupervised_data = tfds.load("imdb_reviews", split="unsupervised", 
                              read_config=tfds.ReadConfig(try_autocache=False))
unsupervised_data = unsupervised_data.shuffle(1000)

Now let's take a look at some model predictions.

In [None]:
for example in unsupervised_data.take(1):
    print(ops.convert_to_numpy(example['text']).item().decode("utf-8"))
    tokens = text_vectorization(example['text'])
    tokens = tokens[None, :MAX_SEQ_LEN]  # Add dummy batch dimension and truncate
    prob = ops.convert_to_numpy(transformer(tokens)).squeeze()
    print(f"\nTransformer probability of positive label: {prob:.4f}")

Congratulations on completing this week's assignment! You have now implemented and trained an encoder-only Transformer classifier model for the task of sentiment prediction.