<a href="https://colab.research.google.com/github/aisha-partha/AIMLOps-Assignments/blob/main/M5_AST_04_Transformer_Encoder_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A Program by IISc and TalentSprint
### Assignment 4: Transformer Encoders, Self Attention

## Learning Objectives

At the end of the experiment, you will be able to:

* understand the big picture of transformers
* understand the concept of self attention
* explore transformer encoder and positional embedding
* train an encoder-only transformer model for text classification

### The Big Picture of Transformer

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST%205%20Big%20Picture.png" width=800px/>
</center>

The figure above shows the overall Transformer pipeline: a TextVectorization layer, an Embedding layer, an Encoder, and a Decoder.

Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Transformer architecture was originally designed for translation. In the encoder, the attention layers can use all the words in a sentence (since the translation of a given word can depend on what comes after it as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
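
The "no access to future words" rule is usually implemented as a lower-triangular (causal) mask over the attention scores. Below is a minimal sketch in plain Python. The helper name `causal_mask` is hypothetical, positions are 0-based, and it assumes the usual setup in which the decoder input is the target shifted right by one start token, so row `i` lists what is visible when predicting target word `i + 1`.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1 if
    position i may attend to position j (j <= i) and 0 otherwise."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The last row is the step that predicts the fourth word: it sees the start token and words 1 to 3, and nothing later.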

**Encoder-only Transformer:**

<br>

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5_AST_05_Image1_Transformer.png" width=900px/>
</center>

The decoder is not covered in this assignment; the main focus is on the Transformer Encoder,
which is discussed in detail in the later sections of this notebook.

## Dataset Description

The **IMDb Movie Reviews dataset** is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as *positive* or *negative*. The dataset contains an equal number of positive and negative reviews.

This dataset is processed and used in the later sections of this notebook.

### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2304896" #@param {type:"string"}

In [2]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "9916583736" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M5_AST_04_Transformer_Encoder_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    ipython.magic("sx curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz")
    ipython.magic("sx tar -xvzf aclImdb_v1.tar.gz")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Importing required packages

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import os, pathlib, shutil, random

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization, Embedding, Dense
from tensorflow.keras.utils import text_dataset_from_directory

#  **Part A** : Text Pre-processing Before Transformer Block

## Processing the dataset using TextVectorization layer of keras

### Data Preparation

The previous assignments used a pre-processed version of the IMDB dataset provided by Keras.

The original IMDB dataset ships with *train* and *test* folders.
Here, the original dataset is used and the pre-processing it requires is explored.

In [5]:
# List subdirectories
!cd aclImdb && ls -d */

test/  train/


In [6]:
# Remove unnecessary folder
!rm -r aclImdb/train/unsup

In [7]:
# Check a sample review
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

### Create a validation directory and move 20% of the train data to it

In [8]:
# move 20% of the training data to the validation folder
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # random.Random(1337).shuffle(files) # We should shuffle. Only commenting for demonstration
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
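
The cell above takes the last 20% of the files in whatever order `os.listdir` returns them, and the commented-out line notes that the list should really be shuffled. The sketch below shows one way to get a reproducible shuffled split. It only operates on file names and moves nothing; the helper `split_filenames` and the sample file names are hypothetical.

```python
import random


def split_filenames(filenames, val_fraction=0.2, seed=1337):
    """Return (train_files, val_files) after a seeded shuffle.

    Sorting first makes the result independent of the order in which
    the operating system lists the directory.
    """
    files = sorted(filenames)
    random.Random(seed).shuffle(files)
    num_val = int(val_fraction * len(files))
    if num_val == 0:              # guard: files[-0:] would select everything
        return files, []
    return files[:-num_val], files[-num_val:]


names = [f"{i}_7.txt" for i in range(10)]   # made-up file names
train_files, val_files = split_filenames(names)
print(len(train_files), len(val_files))     # 8 2
```

Using a fixed seed keeps the validation set identical across runs, which makes validation metrics comparable between experiments.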

### Create batches of data using `text_dataset_from_directory`

In [9]:
# Create dataset using utility
batch_size = 32

# Q: Name other such utilities seen earlier ?
train_ds = text_dataset_from_directory("aclImdb/train", batch_size=batch_size)

val_ds = text_dataset_from_directory("aclImdb/val", batch_size=batch_size)

test_ds = text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

# Extracting only the review text(not labels); to be used later to adapt the TextVec layer
text_only_train_ds = train_ds.map(lambda x, y: x)             # lambda x, y: x  --> replace x,y with x. That is remove labels, just keep text data.


Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [14]:
list(train_ds.take(1))

[(<tf.Tensor: shape=(32,), dtype=string, numpy=
  array([b"A LAUREL & HARDY Comedy Short. The Boys arrive to sweep the chimneys at the home of Professor Noodle, a mad scientist who's just perfected his rejuvenation serum. Stan & Ollie proceed with their DIRTY WORK, spreading destruction inside the house and on the roof. Then the Professor wants to try out his new potion...<br /><br />A very funny little film. The ending is a bit abrupt, but much of the slapstick leading up to it is terrific. Especially good is Stan & Ollie's contest of wills at opposite ends of the chimney. That's Lucien Littlefield as the Professor.",
         b'An unusual take on time travel: instead of traveling to Earth\'s past, the main trio get stuck in the past history of another planet. They beam down to this planet, whose sun is scheduled to go nova in 3 or 4 hours (that\'s cutting it close!). In some kind of futuristic library, they meet Mr. Atoz (A to Z, get it? ha-ha) and his duplicates. It turns out, inste

The train, validation, and test directories contain 20,000, 5,000, and 25,000 reviews respectively, each divided into two classes: positive and negative.

In [10]:
# Check shapes

for inputs, targets in train_ds:

    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)

    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)

    print("inputs[2]:", inputs[2])
    print("targets[2]:", targets[2])

    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[2]: tf.Tensor(b"Johny To makes here one of his best style exercises, making a strong film with a good Yakuza's story. The election of the new Yakuza's boss is the beginning of a war inside the organization.<br /><br />In my opinion the violence is wise used in the context, making a very strong gangs film. I specially love the way he tells the history, moving around all the roles inside the Yakuza's family, and making that we see the violence, like the only way they have to solve their problems...<br /><br />Talking about, the technical aspects, the film is a good example of paused, rythmic and planified way of shooting a film. One of the Hong Kong Films of the year. Is like Infernal affairs, but without the easy action-violence scenes, and the confused storyline. Strongly recommended to all Asian films lovers.<br /><br />(sorry for my English, better do in Spanish lol)", shap

### Create TextVectorization layer and adapt to dataset

In [11]:
# Vectorizing the data
# output_sequence_length is only valid in "int" mode. If set, the output has its time dimension padded or truncated
# to exactly output_sequence_length values, resulting in a tensor of shape (batch_size, output_sequence_length)

max_length = 600    # Output sequence length; shorter reviews are zero-padded, longer ones are truncated
max_tokens = 20000  # Maximum vocabulary size built by adapt()
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,    # Q: What is the vocabulary size?
    output_mode="int",        # Q: What will be the type of output for a token (say), 'amazing' ?
    output_sequence_length=max_length,      # Q: What is the maximum length of review? Is it a fair assumption?
    )

text_vectorization.adapt(text_only_train_ds)


In [12]:
text_vectorization.get_vocabulary()[:10]

['', '[UNK]', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'it']

In [13]:
len(text_vectorization.get_vocabulary())

20000

In [14]:
text_vectorization('The film was good and it had a lot of action scenes')

<tf.Tensor: shape=(600,), dtype=int64, numpy=
array([  2,  20,  14,  50,   4,   9,  67,   3, 169,   5, 214, 137,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,  
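
To make the output above less opaque, here is a simplified pure-Python sketch of what `output_mode="int"` does: standardize the text, split on whitespace, map each token to an index (0 is reserved for padding, 1 for `[UNK]`), then truncate or zero-pad to a fixed length. This is an illustration only, not the Keras implementation; the function names are hypothetical, and tie-breaking between equally frequent tokens may differ from Keras.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and drop ASCII punctuation (a simplification of the
    default Keras standardization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is [UNK]; the rest are the most
    frequent tokens, so the vocabulary has at most max_tokens entries."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, output_sequence_length):
    """Map tokens to integer indices, then truncate or zero-pad."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))


vocab = build_vocab(["The film was good.", "The film was long!"], max_tokens=5)
print(vocab)   # ['', '[UNK]', 'the', 'film', 'was']
print(vectorize("The film was great", vocab, output_sequence_length=6))
# [2, 3, 4, 1, 0, 0]
```

The unseen word "great" maps to index 1 (`[UNK]`), and the two trailing zeros are padding, which mirrors the long run of zeros in the 600-long tensor above.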

 # **Part B** : Building Encoder Transformer

The figure below shows the **Transformer Model Architecture** from the paper ["**Attention Is All You Need**"](https://arxiv.org/pdf/1706.03762v6.pdf). Here, we implement the **Encoder** and try to understand how it functions. Throughout this notebook, the phrase **'as per the paper'** refers to this paper. To fully understand the Transformer Encoder, it is essential to understand Self Attention, which is used inside Multi-head Attention. After passing through the TextVectorization layer and the Embedding layer, the data passes through the Self Attention layer.
<br><br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Transformer.png" width=750px/>

</center>

## Self Attention
The attention mechanism being depicted in the picture below can be understood as the attention scores highlighting the most important features of the cat so that it can be identified.


<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/Attention%20scores%20pic.png" width=700px/>
</center>



What most popular language models share is not just the term 'BERT' in their names but an important technique called 'self-attention'. Transformer-based architectures, which are primarily used for language understanding tasks, eschew recurrence and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs.

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST5%20Self%20Attention%20Scores.png" width=900px/>
</center>




outputs = sum(inputs * pairwise_scores(inputs, inputs))


According to the self attention scores depicted in the picture, the word 'train' pays more attention to 'station' than to other words such as 'on' or 'the'.

The attention mechanism allows the output to focus on the input while the output is being produced, whereas self-attention allows the inputs to interact with each other (i.e. the attention of all other inputs is calculated with respect to one input).

Inside each attention head is a **Scaled Dot Product Self-Attention** operation, which returns an attention vector as given by the equation below:

$$ SelfAttention(x_i) = \sum_j softmax_j\left(\frac{x^{T}_i x_j}{\sqrt{d_k}}\right) x_j $$

The term **$x^{T}_i x_j$** is the dot product of the input vector $x_i$ with a vector $x_j$ of the same sequence (including $x_i$ itself), and the softmax is taken over $j$. The 'pivot_vector' and the 'vector' in the code below correspond to the $x_i$ and $x_j$ of the above Self Attention function. Since no projection is applied yet, $d_k$ here is simply the embedding dimension.

### Demonstrating self_attention with dummy data

#### Define a custom self attention function

In [15]:
# Custom self attention function
def self_attention(input_sequence):
    output = np.zeros(shape=input_sequence.shape)
    for i, pivot_vector in enumerate(input_sequence): # iterate over each token in ip seq
        scores = np.zeros(shape=(len(input_sequence), ))

        for j, vector in enumerate(input_sequence):
            scores[j] = np.dot(pivot_vector, vector.T)    # Pairwise scores

        scores /= np.sqrt(input_sequence.shape[1]) # scale #[1] is the embedding dim
        scores = tf.nn.softmax(scores)              # softmax
        new_pivot_representation = np.zeros(shape=pivot_vector.shape)
        for j, vector in enumerate(input_sequence):
            new_pivot_representation += vector*scores[j] # weighted sum
        output[i] = new_pivot_representation
    return output

# Optional HW: Add to the code to print the attention_score matrix

#### Use a dummy data and find its vectors

The data first passes through the TextVectorization layer, then through the Embedding layer, and finally the Self Attention scores are calculated.

In [16]:
# First, vectorize raw text using _________
dummy_vocab = ["movie was very nice", "film was good"]
text_vec = layers.TextVectorization(max_tokens=5, output_sequence_length=3)
text_vec.adapt(dummy_vocab)
print(text_vec("movie"))
#Q: why the zeros ? why shape 3?
#Q: output type?

tf.Tensor([1 0 0], shape=(3,), dtype=int64)


In [17]:
# Then obtain embeddings from text indices using the Embedding layer
int_text = text_vec(["movie was good"])
print(int_text)
embedding = layers.Embedding(input_dim=5, output_dim=4)(int_text)  # Why 5 ? input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
print(embedding)

tf.Tensor([[1 2 1]], shape=(1, 3), dtype=int64)
tf.Tensor(
[[[ 0.01905343 -0.00803129  0.02016748 -0.03094834]
  [ 0.03039265  0.04239618  0.01108522 -0.00826528]
  [ 0.01905343 -0.00803129  0.02016748 -0.03094834]]], shape=(1, 3, 4), dtype=float32)


#### Output from the attention function

In [18]:
# Compute output of attention module
attention_outputs = self_attention(embedding[0].numpy())
print(attention_outputs.shape)
print(attention_outputs)

(3, 4)
[[ 0.02283182  0.00877185  0.01714115 -0.02339003]
 [ 0.02283594  0.00879017  0.01713785 -0.02338179]
 [ 0.02283182  0.00877185  0.01714115 -0.02339003]]
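
The three output rows above are almost identical, and the reason is worth spelling out. Freshly initialised embeddings are tiny (around 0.01), so every scaled dot product is close to zero, the softmax is close to uniform, and each output is roughly the mean of the input vectors. The plain-Python sketch below checks this using the embedding values printed above; the helper `attention_weights` is a hypothetical name.

```python
import math

# Embedding rows copied from the printed output above ("movie was good";
# the first and third tokens both map to index 1, hence identical rows).
embedding = [
    [0.01905343, -0.00803129, 0.02016748, -0.03094834],
    [0.03039265,  0.04239618, 0.01108522, -0.00826528],
    [0.01905343, -0.00803129, 0.02016748, -0.03094834],
]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(sequence):
    """Row i holds the softmax-normalised, scaled dot products between
    token i and every token in the sequence."""
    scale = math.sqrt(len(sequence[0]))          # sqrt of embedding dim
    return [
        softmax([sum(a * b for a, b in zip(x_i, x_j)) / scale
                 for x_j in sequence])
        for x_i in sequence
    ]


weights = attention_weights(embedding)
for row in weights:
    print([round(w, 4) for w in row])            # each entry is close to 1/3

# With near-uniform weights, every output row is close to the column mean.
column_mean = [sum(col) / len(col) for col in zip(*embedding)]
print([round(m, 5) for m in column_mean])
```

Once the embeddings are trained (or learnable projections are added), the scores spread out and the attention weights become far from uniform.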


## Attention Eqn. with Queries, Keys and Values

So far, Self Attention was computed from the input vectors themselves. This means that for fixed inputs the attention weights are always fixed; in other words, there are no learnable parameters. Learnable parameters are needed to make the self attention mechanism flexible and tunable for various tasks. For this purpose, three weight matrices are introduced and multiplied with the input $x_i$ separately, which gives three new terms, **Queries (Q), Keys (K) and Values (V)**, as given by the equations below. The vectorized implementation and shape tracking are shown along with the equations.

**Vectorized implementation & Shape tracking**

$ T $ = number of tokens in the input sequence.

$ d_{model} $ = dimension of the embedding vector for each word (512 as per the paper).

$ X   \Rightarrow (T \times d_{model}) $


$ Q = X W^{Q}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_k  )  \Rightarrow   (T \times d_{k}) $


$ K = X W^{K}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_k  )  \Rightarrow   (T \times d_{k}) $


$ V = X W^{V}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_v  )  \Rightarrow   (T \times d_{v}) $

Dot product of Queries and Keys:

$ Q K^{T}   \Rightarrow (T \times d_{k}) \times (d_{k} \times T  )  \Rightarrow   (T \times T) $

There are T query vectors and T key vectors (one per token of the input sequence), so $T \times T$ attention weights are needed, which matches the shape above. Taking the softmax does not change the shape.

 **Shapes as per the paper**

$
\begin{array}{|c|c|c|} \hline
Object   &  Shape & Values  \\ \hline
q_i, k_i  &  d_k  &  (64,) \\
v_i   &   d_v   &   (64,)  \\
x_i   &   d_{model}   & (512,)  \\
W^{Q}, W^{K}  &   d_{model} \times d_k   &   (512, 64)  \\
W^{V}   &   d_{model} \times d_v   &  (512, 64)  \\ \hline
\end{array}
$

**Batch consideration**

In code, a batch of N samples is processed at a time. Every shape gains a leading batch dimension of size **N**, e.g. $ N \times T \times d_k $ instead of just $ T \times d_k$.

 **Final Scaled Dot Product Attention** equation inside each attention head with **Queries(Q)**, **Keys(K)**, and **Values(V)**, which returns an attention vector.

<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Scaled_dot_product_Attention.png" width=250px/>

</center>


$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
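
The equation above can be traced shape by shape in a few lines of NumPy. This is a minimal sketch under stated assumptions: the weight matrices are random stand-ins for learned parameters, and the sizes (`T=3`, `d_model=8`, `d_k=d_v=4`) are chosen only to keep the printout small; they are not the paper's 512 and 64.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V; returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # (T, d_v), (T, T)


rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 3, 8, 4, 4        # illustrative sizes only

X = rng.normal(size=(T, d_model))        # one embedded sequence
W_Q = rng.normal(size=(d_model, d_k))    # random stand-ins for the
W_K = rng.normal(size=(d_model, d_k))    # learned projection matrices
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, weights = scaled_dot_product_attention(Q, K, V)
print(Q.shape, K.shape, V.shape)         # (3, 4) (3, 4) (3, 4)
print(weights.shape, output.shape)       # (3, 3) (3, 4)
```

Because the projections are learnable, the same input sequence can produce different attention weights after training, which is exactly what the weight-free version could not do.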

## Multihead Attention

In the full encoder-decoder Transformer, the output of the final Encoder in the stack is passed to the Value and Key parameters in the Encoder-Decoder Attention.

The Encoder-Decoder Attention is therefore getting a representation of both the target sequence (from the Decoder Self-Attention) and a representation of the input sequence (from the Encoder stack). It, therefore, produces a representation with the attention scores for each target sequence word that captures the influence of the attention scores from the input sequence as well.

As this passes through all the Decoders in the stack, each Self-Attention and each Encoder-Decoder Attention also add their own attention scores into each word’s representation.

In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters h ways (h is the number of heads) and passes each split independently through a separate Head. All of these similar Attention calculations are then combined together to produce a final Attention score. This is called Multi-head attention and gives the Transformer greater power to encode multiple relationships and nuances for each word.

In the paper, the diagram for a scaled dot product attention does not use any weights at all. Instead, the weights are included only in the multi head attention block, shown in figure below :
<br><br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Multi_head_attention_with_weights.png" width=900px/>
</center>

### Understanding the shape
<br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Multi_head_attention_shape_tracking.png" width=750px>
</center>
<br>

**Final Projection :** $ Output = concat(A_1, A_2, ..., A_h)  W^{o} $

**Shape of :**  $ concat (A_1, A_2, ..., A_h) \Rightarrow  (T \times hd_v) $

**Shape of:**  $  W^{o} \Rightarrow (hd_v \times d_{model}) $

**Shape of final:**  $ Output = concat (A_1, A_2, ..., A_h) W^{o} \Rightarrow  (T \times hd_v) \times (hd_v \times d_{model})  \Rightarrow  (T \times d_{model}) \Leftarrow $ **Back to the initial input shape.**

Batch size is not displayed here.
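
The shape bookkeeping above can be verified with a small NumPy sketch. It assumes one separate set of projection matrices per head, random weights in place of learned ones, and the paper's convention $d_k = d_v = d_{model} / h$; the sizes (`T=5`, `d_model=16`, `h=4`) are illustrative, and the batch dimension is left out as in the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (T, d_model); W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v);
    W_O: (h * d_v, d_model). Returns a (T, d_model) array."""
    heads = []
    for w_q, w_k, w_v in zip(W_Q, W_K, W_V):
        Q, K, V = X @ w_q, X @ w_k, X @ w_v                 # (T, d_k), (T, d_k), (T, d_v)
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T, T)
        heads.append(weights @ V)                           # A_i: (T, d_v)
    concat = np.concatenate(heads, axis=-1)                 # (T, h * d_v)
    return concat @ W_O                                     # (T, d_model)


rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4
d_k = d_v = d_model // h

X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(X.shape, "->", out.shape)   # (5, 16) -> (5, 16)
```

The output has the same shape as the input, which is what allows encoder blocks to be stacked on top of one another.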

## Transformer Encoder Block

The Transformer Encoder consists of a stack of identical layers (Encoder Blocks; 6 as per the paper), as shown in the figure below, where each layer further consists of two main sub-layers:

* The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
* A second sub-layer comprises a fully-connected feed-forward network.

Following each of these two sub-layers is layer normalization, into which the sub-layer input (through a residual/skip connection) and output are fed.

Regularization is also introduced into the model by applying a dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the encoder.
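
The "Add & Norm" step described above computes `LayerNorm(x + Sublayer(x))`. Below is a minimal NumPy sketch that uses a position-wise feed-forward network as the sub-layer. The assumptions are: random weights instead of learned ones, no dropout (as at inference time), and a layer normalization without the learnable scale and shift that a real layer would have.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalise each position (row) to zero mean and unit variance
    over the feature axis. Learnable scale/shift are omitted."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: Linear -> ReLU -> Linear."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 8, 32                      # illustrative sizes

x = rng.normal(size=(T, d_model))                # sub-layer input
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection followed by layer normalisation ("Add & Norm").
y = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(y.shape)                                   # (4, 8)
print(np.round(y.mean(axis=-1), 6))              # approximately 0 per row
```

The residual term `x + ...` requires the sub-layer output to have the same width as its input, which is why the second linear layer maps back to `d_model`.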



<br>
<center>
<img src="https://cdn.extras.talentsprint.com/aiml/Experiment_related_data//Images/Encoder_tfr_block_unfolded.png"  width=600 px />$⇒$
<img src="https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Encoder_tfr_block.png" width=180 px/>

</center>

The transformer encoder architecture typically consists of multiple layers, each of which includes a self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence using dot products between the (projected) embeddings. Running several such self-attention operations in parallel and combining their outputs is known as multi-head attention.

The feed-forward network allows the model to extract higher-level features from the input and to represent the input more compactly and usefully. This network usually comprises two linear layers with a ReLU activation function in between. In the paper, it is implemented as a network with one hidden layer, a ReLU activation in the middle, and no activation function at the output layer.

The transformer encoder is a crucial part of the transformer encoder-decoder architecture, which is widely used for natural language processing tasks.

## Encoder Transformer
Stacking transformer blocks gives a Transformer! This is shown in the figure below:

<br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Encoder_transfomer.png" width=1000px/>
</center>

#### Define TransformerEncoder class

In [19]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim    # Dimension of embedding. 4 in the dummy example
        self.dense_dim = dense_dim    # No. of neurons in dense layer
        self.num_heads = num_heads    # No. of heads for MultiHead Attention layer
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)    # MultiHead Attention layer
        self.dense_proj = keras.Sequential([layers.Dense(dense_dim, activation="relu"),
                                            layers.Dense(embed_dim),]    # encoders are stacked on top of the other.
                                           )                             # So output dimension is also embed_dim
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    # Call function based on figure above
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(query=inputs,             # Query: inputs,
                                          value=inputs,             # Value: inputs,
                                          key=inputs,               # Keys: Same as Values by default
                                          attention_mask=mask
                                          )                         # Q: Can you see how this is self attention? A: all args are the same

        proj_input = self.layernorm_1(inputs + attention_output) # LayerNormalization; + Recall cat picture
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)  # LayerNormalization + Residual connection

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

#### Model definition

In [20]:
# Build the Transformer encoder
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(1,), dtype=tf.string)
x = text_vectorization(inputs)                                         # TextVectorization layer
x = layers.Embedding(vocab_size, embed_dim)(x)                         # Embedding layer
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)             # Transformer Encoder block
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)                     # Dense layer for classification

model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()
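
The output of `model.summary()` is not reproduced here, but the trainable parameter count can be estimated by hand. The arithmetic sketch below assumes the default Keras parameterisation: `MultiHeadAttention` with biases, a value dimension equal to `key_dim`, and an output projection back to the input width; `LayerNormalization` with one scale and one offset vector. Note that `TransformerEncoder` passes `key_dim=embed_dim`, so each head works in 256 dimensions rather than `embed_dim // num_heads` as in the paper.

```python
vocab_size, embed_dim, num_heads, dense_dim = 20000, 256, 2, 32
key_dim = embed_dim  # as passed to MultiHeadAttention in TransformerEncoder

# Embedding: one embed_dim vector per vocabulary entry.
embedding = vocab_size * embed_dim

# Multi-head attention: Q, K, V projections (kernel + bias) for all heads,
# plus the output projection back to embed_dim.
qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
out_proj = num_heads * key_dim * embed_dim + embed_dim
attention = qkv + out_proj

# Feed-forward block: Dense(dense_dim) followed by Dense(embed_dim).
feed_forward = (embed_dim * dense_dim + dense_dim) + (dense_dim * embed_dim + embed_dim)

# Two LayerNormalization layers, each with a scale and an offset vector.
layer_norms = 2 * (2 * embed_dim)

# Final Dense(1) classifier on the pooled embed_dim vector.
classifier = embed_dim * 1 + 1

encoder_block = attention + feed_forward + layer_norms
total = embedding + encoder_block + classifier
print(f"encoder block: {encoder_block:,}")
print(f"total:         {total:,}")
```

Under these assumptions the embedding table dominates the count, so the encoder block itself is a small fraction of the model.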

#### Train and evaluate the performance of the model *(Switch to GPU runtime if needed)*

In [21]:
# Fit the model on train set
callbacks = [keras.callbacks.ModelCheckpoint("transformer_encoder.keras", save_best_only=True)]

# Change target shape from (None,) to (None, 1)
train_dataset = train_ds.map(lambda x, y: (x, tf.reshape(y, (-1,1))))
val_dataset = val_ds.map(lambda x, y: (x, tf.reshape(y, (-1,1))))

model.fit(train_dataset,
          validation_data = val_dataset,
          epochs = 10,
          callbacks = callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 53s 69ms/step - accuracy: 0.5381 - loss: 0.8638 - val_accuracy: 0.8182 - val_loss: 0.4057
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 74s 68ms/step - accuracy: 0.8142 - loss: 0.4175 - val_accuracy: 0.8502 - val_loss: 0.3490
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 43s 69ms/step - accuracy: 0.8429 - loss: 0.3557 - val_accuracy: 0.8644 - val_loss: 0.3263
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 44s 70ms/step - accuracy: 0.8639 - loss: 0.3236 - val_accuracy: 0.8728 - val_loss: 0.3083
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 70ms/step - accuracy: 0.8717 - loss: 0.2987 - val_accuracy: 0.8686 - val_loss: 0.3137
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 44s 71ms/step - accuracy: 0.8854 - loss: 0.2723 - val_accuracy: 0.8740 - val_loss: 0.2997
Epoch 7/10
[1m6

<keras.src.callbacks.history.History at 0x7aac63f695a0>

In [22]:
model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder}
    )

print(f"Test acc: {model.evaluate(test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 19s 24ms/step - accuracy: 0.8706 - loss: 0.3047
Test acc: 0.868


#### Prediction

In [23]:
txt = "It was a great movie"
myTensor = tf.convert_to_tensor(txt, dtype=tf.string)
print(myTensor)

tf.Tensor(b'It was a great movie', shape=(), dtype=string)


In [24]:
pred = model(tf.reshape(myTensor, (-1,1)))
pred

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.930238]], dtype=float32)>

In [25]:
label_index = int(pred.numpy()[0,0] + 0.5)
label_index

1

In [26]:
mapping = {0: 'Negative', 1: 'Positive'}
label = mapping[label_index]
label

'Positive'

In [27]:
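def make_prediction(input_text, tf_model):
    """Return 'Positive' or 'Negative' for a single review string."""
    myTensor = tf.convert_to_tensor(input_text, dtype=tf.string)
    pred = tf_model(tf.reshape(myTensor, (-1,1)))      # the model expects inputs of shape (batch, 1)
    label_index = int(pred.numpy()[0,0] + 0.5)         # round the sigmoid output at the 0.5 threshold
    mapping = {0: 'Negative', 1: 'Positive'}
    label = mapping[label_index]
    return label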

In [28]:
make_prediction("It was a great movie", model)

'Positive'

In [29]:
make_prediction("It was a bad movie", model)

'Negative'
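
The rounding step in `make_prediction` handles one review at a time. As a minimal sketch of the same 0.5-threshold rule applied to a whole batch of sigmoid outputs, the helper below uses plain NumPy. The name `probs_to_labels` and its `threshold` parameter are hypothetical and do not appear in the notebook.

```python
import numpy as np

def probs_to_labels(probs, threshold=0.5):
    """Map sigmoid probabilities to sentiment labels (hypothetical helper)."""
    # Flatten model-style output of shape (batch, 1) to (batch,)
    probs = np.asarray(probs, dtype=np.float32).reshape(-1)
    mapping = {0: "Negative", 1: "Positive"}
    # Same decision as int(p + 0.5) for p in [0, 1]
    indices = (probs >= threshold).astype(int)
    return [mapping[int(i)] for i in indices]

labels = probs_to_labels([[0.930238], [0.12]])
print(labels)
```

The threshold is kept as a parameter only for illustration; the notebook itself always rounds at 0.5.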

## Positional Embedding

**Positional Embedding = Word Embedding + Positional Encoding**

**Positional Encoding**

Passing embeddings directly into the transformer block loses the information about the order of tokens, because attention is permutation invariant, i.e. the order of tokens does not matter to attention: reordering the input tokens only reorders the outputs.
Although the transformer is a sequence model, this important detail would otherwise be lost. Positional encoding comes to the rescue.

Positional encoding adds positional information to the existing embeddings.

**A unique set of numbers is added at each position of the existing embeddings**, such that this new set of numbers uniquely identifies the position it is located at.
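
The claim that attention ignores token order can be checked numerically. The sketch below is a simplification and not the notebook's `TransformerEncoder`: it assumes single-head scaled dot-product self-attention with identity projections, written in NumPy. Shuffling the input tokens only shuffles the output rows, and after max pooling over the sequence (as `GlobalMaxPooling1D` does) the result is identical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections (assumption)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)             # (seq_len, seq_len) similarity scores
    return softmax(scores) @ x                # weighted sum of the token vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))              # 5 tokens, embedding size 8
perm = rng.permutation(5)                     # a random reordering of the tokens

out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Shuffling the inputs only shuffles the outputs in the same way
rows_follow_permutation = np.allclose(out[perm], out_shuffled)
# After pooling over the sequence axis, the order has no effect at all
pooled_is_identical = np.allclose(out.max(axis=0), out_shuffled.max(axis=0))
print(rows_follow_permutation, pooled_is_identical)
```

Without positional information, a classifier built this way cannot distinguish two reviews that contain the same words in a different order.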


Two schemes are commonly used:

1. Positional Encoding with a custom Keras layer that learns position embeddings through an Embedding layer (trainable)
2. Positional Encoding scheme as per the paper (non-trainable)

In the second scheme, the encoding is created using a set of sines and cosines at different frequencies. The paper uses the following formula for calculating the positional encoding. [Positional Encoding Visualization.](https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers)

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$
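
The two formulas above can be sketched in NumPy as follows. This is only an illustration of the non-trainable scheme, not code from the notebook. The function name `sinusoidal_positional_encoding` is hypothetical, and the sketch assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(sequence_length, d_model):
    """Build the (sequence_length, d_model) sinusoidal encoding table (assumes even d_model)."""
    positions = np.arange(sequence_length)[:, np.newaxis]    # pos, shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]      # 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((sequence_length, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(sequence_length=600, d_model=256)
print(pe.shape)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each row of the table is the vector that would be added to the word embedding at that position. Nothing in it is learned during training.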

We are going to implement the first scheme, a trainable positional embedding. Instead of using the Embedding layer from Keras directly, we define a PositionalEmbedding class and create a new model that uses it as the embedding layer.

In [30]:
# Using positional encoding to re-inject order information

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        # input_dim = (token) vocabulary size, output_dim = embedding size
        super().__init__(**kwargs)
        # Token embedding: one vector per vocabulary entry
        self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
        # Position embedding: its "vocabulary" is the set of possible positions, so input_dim = sequence_length
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim


    def call(self, inputs):   # inputs is a batch of token-id sequences, shape (batch, seq_len)
        length = tf.shape(inputs)[-1]     # length is just the sequence length
        positions = tf.range(start=0, limit=length, delta=1) # position indices 0..length-1, the input to the position embedding
        embedded_tokens = tf.reshape(self.token_embeddings(inputs), (-1, length, self.output_dim))
        embedded_positions = tf.reshape(self.position_embeddings(positions), (-1, length, self.output_dim))
        return layers.Add()([embedded_tokens, embedded_positions])     # add the token and position embeddings

    def compute_mask(self, inputs, mask=None):     # lets this layer pass a padding mask on
        if mask is None:                           # no incoming mask: no mask is generated
            return None
        return tf.math.not_equal(inputs, 0)        # True for non-padding tokens; propagated to the next layer

    # When using custom layers, this enables the layer to be reinstantiated from its config dict,
    # which is useful during model saving and loading.
    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
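
To make the shapes in `call` concrete, here is a minimal NumPy sketch of what the layer computes: two table lookups followed by an addition. The tables are filled with random numbers as stand-ins for learned weights, and the toy sizes and variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embed_dim = 10, 4, 3            # toy sizes (assumption)

# Random stand-ins for the two trainable Embedding weight matrices
token_table = rng.normal(size=(vocab_size, embed_dim))
position_table = rng.normal(size=(sequence_length, embed_dim))

token_ids = np.array([[5, 2, 2, 0],                           # batch of 2 sequences of token ids
                      [2, 5, 0, 0]])

embedded_tokens = token_table[token_ids]                      # (batch, seq_len, embed_dim)
positions = np.arange(token_ids.shape[-1])                    # 0, 1, 2, 3
embedded_positions = position_table[positions]                # (seq_len, embed_dim)
combined = embedded_tokens + embedded_positions               # broadcast over the batch axis

print(combined.shape)
# The same token (id 2) at positions 1 and 2 now has different vectors
same_token_differs = not np.allclose(combined[0, 1], combined[0, 2])
print(same_token_differs)
```

Because the position vector depends only on the index within the sequence, the same word receives a different representation at each position, which is what re-injects the order information.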

In [31]:
# What does tf.math.not_equal() do?

a = tf.constant([1,0,2,0,3]) # a is a tensor
print(a)
print(tf.math.not_equal(a, 0))   # which elements of 'a' are not equal to 0

tf.Tensor([1 0 2 0 3], shape=(5,), dtype=int32)
tf.Tensor([ True False  True False  True], shape=(5,), dtype=bool)
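
A boolean mask like the one printed above is typically used to stop attention from looking at padding tokens. The NumPy sketch below shows one common way to apply it, by pushing the scores of masked positions to a large negative value before the softmax. It is an illustration under that assumption, not the internals of Keras `MultiHeadAttention`, and `masked_softmax` is a hypothetical helper.

```python
import numpy as np

def masked_softmax(scores, key_mask):
    """Softmax over the last axis that ignores positions where key_mask is False."""
    scores = np.where(key_mask, scores, -1e9)                 # masked keys get a very low score
    scores = scores - scores.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

token_ids = np.array([1, 0, 2, 0, 3])
key_mask = token_ids != 0                                     # same result as tf.math.not_equal(a, 0)

scores = np.zeros((5, 5))                                     # uniform scores, for illustration only
weights = masked_softmax(scores, key_mask)
print(weights[0])                                             # weight is spread over the 3 non-zero tokens
```

Each row of `weights` still sums to 1, but the columns that correspond to token id 0 receive no attention.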


#### Model Definition

In [32]:
#  Combining the Transformer encoder with positional embedding

vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(1,), dtype=tf.string)
x = text_vectorization(inputs)                                             # Text Vectorization layer
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(x)         # Embedding layer  +  Positional Encoding
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)                 # Transformer Encoder block
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)                         # Dense layer for classification

model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

#### Train and evaluate the model *(Switch to GPU runtime if needed)*

In [33]:
# Fit the model on train set
callbacks = [keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras", save_best_only=True)]

# Change target shape from (None,) to (None, 1)
train_dataset = train_ds.map(lambda x, y: (x, tf.reshape(y, (-1,1))))
val_dataset = val_ds.map(lambda x, y: (x, tf.reshape(y, (-1,1))))

model.fit(train_dataset,
          validation_data = val_dataset,
          epochs = 10,
          callbacks = callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 49s 72ms/step - accuracy: 0.5863 - loss: 0.8340 - val_accuracy: 0.8144 - val_loss: 0.4171
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 45s 73ms/step - accuracy: 0.8014 - loss: 0.4350 - val_accuracy: 0.8214 - val_loss: 0.3899
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 47s 75ms/step - accuracy: 0.8354 - loss: 0.3786 - val_accuracy: 0.8346 - val_loss: 0.3728
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 45s 72ms/step - accuracy: 0.8487 - loss: 0.3430 - val_accuracy: 0.8482 - val_loss: 0.3512
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 45s 72ms/step - accuracy: 0.8657 - loss: 0.3148 - val_accuracy: 0.8394 - val_loss: 0.3629
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 45s 73ms/step - accuracy: 0.8890 - loss: 0.2749 - val_accuracy: 0.8510 - val_loss: 0.3474
Epoch 7/10
[1m6

<keras.src.callbacks.history.History at 0x7aac12156350>

In [34]:
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})

print(f"Test acc: {model.evaluate(test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 20s 25ms/step - accuracy: 0.8519 - loss: 0.3425
Test acc: 0.850


#### Prediction

In [35]:
make_prediction("It was a great movie", model)

'Positive'

In [36]:
make_prediction("It was a bad movie", model)

'Negative'

## References

If you want to go deeper or plan to work closely with Transformers, the following resources explain the concepts in a simplified manner.

1. [Attention Is All You Need](https://arxiv.org/pdf/1706.03762v6.pdf)

2. [Understanding Positional Encoding](https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers)

3. [Implement Multi-Head Attention](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras)

4. [Implementing the Transformer Encoder](https://machinelearningmastery.com/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras/)

5. [Illustrated-transformer](https://jalammar.github.io/illustrated-transformer/)





### Please answer the questions below to complete the experiment:




In [37]:
#@title Which layer in the Transformer model is responsible for capturing local dependencies within an input sequence? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "Self-Attention Layer" #@param ["", "Self-Attention Layer", "Feedforward Neural Network Layer", "Positional Encoding Layer", "Layer Normalization"]

In [38]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [39]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "na" #@param {type:"string"}


In [40]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [41]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [42]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [43]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 6349
Date of submission:  26 Aug 2024
Time of submission:  23:46:01
View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions
