<a href="https://colab.research.google.com/github/jpatra85/ColabTF_EDU/blob/master/Copy_of_M5_AST_03_Transformer_Encoder_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A Program by IISc and TalentSprint
### Assignment 3: Transformer Encoders, Self Attention

## Learning Objectives

At the end of the experiment, you will be able to:

* understand the big picture of transformers
* understand and work with the TextVectorization layer
* understand and work with the embedding layer
* understand the concept of self attention
* explore transformer encoder and positional embedding

### The Big Picture

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST%205%20Big%20Picture.png" width=800px/>
</center>

The figure above shows the entire Transformer architecture: a TextVectorization layer, an Embedding layer, an Encoder and a Decoder.

Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Transformer architecture was originally designed for translation. In the encoder, the attention layers can use all the words in a sentence, since the translation of a given word can depend on what comes after it as well as before it. The decoder, however, works sequentially and can only pay attention to the words it has already translated (that is, only the words before the word currently being generated). For example, once the first three words of the translated target have been predicted, they are given to the decoder, which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
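
The "no future words" rule described above is usually expressed as a lower-triangular (causal) mask. The sketch below is a minimal plain-Python illustration; the helper name `causal_mask` is hypothetical and not part of this notebook or of Keras.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where mask[i][j] is 1 if
    position i may attend to position j (that is, j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = causal_mask(5)
for row in mask:
    print(row)

# Row index 2 corresponds to the third target position. It can see
# positions 1 to 3 only, and its output is used to predict the fourth word.
visible_for_fourth_word = [j + 1 for j, allowed in enumerate(mask[2]) if allowed]
print(visible_for_fourth_word)
```

In a real decoder the zeros in this mask are applied to the attention scores before the softmax, so that masked positions receive no attention weight.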

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5_AST_05_Image1_Transformer.png" width=900px/>
</center>

The decoder is not discussed in this assignment; the main focus is the Transformer Encoder,
which is covered in detail in the later sections of this notebook.

## Dataset Description

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an equal number of positive and negative reviews.

This dataset is processed and used in the later sections of this notebook.

### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M5_AST_03_Transformer_Encoder_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    ipython.magic("sx curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz")
    ipython.magic("sx tar -xvzf aclImdb_v1.tar.gz")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



### Importing Required Packages

In [None]:
import numpy as np
import os, pathlib, shutil, random
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.utils import text_dataset_from_directory

#  **Part A** : Text Pre-processing and Embedding Before Transformer Block

### TextVectorization

Preparing the text data:
  * Text standardization
  * Text splitting
  * Vocabulary indexing
  



The flowchart below depicts the sequence of steps followed by a TextVectorization layer. 'Standardization' takes care of basic preprocessing of the text, such as removing punctuation and converting the text to lower case. 'Tokenization' splits the sentence into a list of words. These words are then represented by integer indices, and an embedding later turns each index into a vector encoding.

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5_AST_05_Transformer_Encoder_Text_data_prep.png" width=650px/>
</center>


All these steps are performed in a TextVectorization Layer.


*   Keras provides a TextVectorization layer which can be dropped directly into
      - a tf.data pipeline **or**
      - a Keras model

*  Moreover, TextVectorization also handles both approaches of representing groups of words:
      - Words as a set or Bag-of-words
      - Words as a sequence







### Define a dummy dataset and a test sentence


In [None]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]

test_sentence = "I write, rewrite, and still rewrite again"
# dataset_t = ["I write, rewrite, and still rewrite again"]
#Q: Is the word 'still' in the dataset (vocabulary)? Is it there in the test_sentence?
#Q: How many words in test_sentence?

### Function depicting TextVectorization layer

A function is defined here to demonstrate the working of a TextVectorization layer. It adapts the layer to a dataset, encodes a test sentence and, for integer output, decodes the result so that it can be compared with the original sentence.

This function will be called multiple times in the subsequent code cells to demonstrate TextVectorization with different parameters, such as combinations of unigrams and bigrams and different output modes.

In [None]:
# To see the workings of TextVectorization
def demonstrate_TxVec(text_vectorization, dataset, test_sen, mode=None):
  # arguments: a TextVectorization layer, the dataset to adapt it on, a test sentence, and the output mode
  text_vectorization.adapt(dataset) # Computes a vocabulary of string terms from tokens in a dataset
  vocabulary = text_vectorization.get_vocabulary()
  print(f"vocabulary = {vocabulary}")
  print(f"len(vocabulary) = {len(vocabulary)}")

  # To see how the text_vectorization layer transforms/vectorizes the raw text
  encoded_sentence = text_vectorization(test_sen)
  print(f"encoded sentence = {encoded_sentence}")
  print(f"len(encoded sentence) = {len(encoded_sentence)}")
  # print(f"encoded dataset_t = {text_vectorization(dataset_t)}")

  # decode back for comparison with test_sentence
  if mode=="int":
    inverse_vocab = dict(enumerate(vocabulary)) # making a dictionary to decode the integer indices back to tokens
    print(f"inverse_vocab = {inverse_vocab}")
    decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
    print(f"decoded sentence = {decoded_sentence}")

  # for the bag-of-words modes, show how each vocabulary token is encoded on its own
  if mode in ("multi_hot", "count", "tf"):
    for token in vocabulary:
      print(f"{token}:\t {text_vectorization(token)}")

  print(f"test_sentence = {test_sen}")



### Instantiating a TextVectorization object with output mode as integer

In [None]:
#Q: What 3 things does a TV layer do?
# instantiating a TextVectorization layer/object
text_vectorization = TextVectorization(
    output_mode="int",  # int is the default. We will see different kinds of modes
    # we can use custom functions for standardizing and splitting the text - see Chollet
    # standardize=custom_standardization_fn,
    # split=custom_split_fn,
)

demonstrate_TxVec(text_vectorization, dataset, test_sentence, mode="int")

# Q: What is a vocabulary?
# Q: No. of tokens in vocabulary?
# Q: Length of encoded_sentence (output of TV layer)?
# Q: Type of elements in encoded_sentence (embedding)?
# Q: Is decoded sentence the same as the test_sentence? Why?

# dataset = [
#     "I write, erase, rewrite",
#     "Erase again, and then",
#     "A poppy blooms.",
# ]

vocabulary = ['', '[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy', 'i', 'blooms', 'and', 'again', 'a']
len(vocabulary) = 12
encoded sentence = [ 7  3  5  9  1  5 10]
len(encoded sentence) = 7
inverse_vocab = {0: '', 1: '[UNK]', 2: 'erase', 3: 'write', 4: 'then', 5: 'rewrite', 6: 'poppy', 7: 'i', 8: 'blooms', 9: 'and', 10: 'again', 11: 'a'}
decoded sentence = i write rewrite and [UNK] rewrite again
test_sentence = I write, rewrite, and still rewrite again
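
The three steps described earlier (standardization, splitting, vocabulary indexing) can be sketched in plain Python. This is only an illustration, not the Keras implementation: the helper names are hypothetical, and the ordering rule (most frequent tokens first, ties in reverse alphabetical order) is an assumption chosen to mirror the vocabulary printed above.

```python
import string
from collections import Counter

def standardize(text):
    # lower-case the text and strip punctuation
    lowered = text.lower()
    return "".join(ch for ch in lowered if ch not in string.punctuation)

def tokenize(text):
    # split the standardized text on whitespace
    return standardize(text).split()

def build_vocabulary(corpus):
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    # assumption: frequent tokens first, ties in reverse alphabetical order
    ordered = sorted(counts, reverse=True)
    ordered.sort(key=lambda tok: -counts[tok])  # stable sort keeps the tie order
    # index 0 is reserved for padding, index 1 for out-of-vocabulary tokens
    return ["", "[UNK]"] + ordered

def encode(text, vocabulary):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text)]

corpus = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vocabulary = build_vocabulary(corpus)
encoded = encode("I write, rewrite, and still rewrite again", vocabulary)
print(vocabulary)
print(encoded)
```

The word 'still' never occurs in the corpus, so it is mapped to index 1 (`[UNK]`), which is why the decoded sentence differs from the test sentence.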


### Instantiating a TextVectorization object with output mode as integer and considering bigrams

In [None]:
# Bigrams with integer encoding
text_vectorization = TextVectorization(
    ngrams = 2,
    output_mode="int",
    # output_sequence_length=20
)
demonstrate_TxVec(text_vectorization, dataset, test_sentence, mode="int")
# Q: Can we have 'rewrite erase' in vocab?
# Q: Why the extra 'i write'?
# Q: Why the extra 'UNK' ?

# dataset = [
#     "I write, erase, rewrite",
#     "Erase again, and then",
#     "A poppy blooms.",
# ]

vocabulary = ['', '[UNK]', 'erase', 'write erase', 'write', 'then', 'rewrite', 'poppy blooms', 'poppy', 'i write', 'i', 'erase rewrite', 'erase again', 'blooms', 'and then', 'and', 'again and', 'again', 'a poppy', 'a']
len(vocabulary) = 20
encoded sentence = [10  4  6 15  1  6 17  9  1  1  1  1  1]
len(encoded sentence) = 13
inverse_vocab = {0: '', 1: '[UNK]', 2: 'erase', 3: 'write erase', 4: 'write', 5: 'then', 6: 'rewrite', 7: 'poppy blooms', 8: 'poppy', 9: 'i write', 10: 'i', 11: 'erase rewrite', 12: 'erase again', 13: 'blooms', 14: 'and then', 15: 'and', 16: 'again and', 17: 'again', 18: 'a poppy', 19: 'a'}
decoded sentence = i write rewrite and [UNK] rewrite again i write [UNK] [UNK] [UNK] [UNK] [UNK]
test_sentence = I write, rewrite, and still rewrite again
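
To see where the 13 output tokens come from, the n-gram expansion can be sketched in plain Python. The function name `with_ngrams` is hypothetical; the vocabulary is copied from the output above, and the "unigrams first, then bigrams" order is an assumption that mirrors that output.

```python
def with_ngrams(tokens, max_n=2):
    """Return all n-grams for n = 1..max_n, shorter n-grams first."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# vocabulary learned from the dummy dataset (copied from the cell output)
vocabulary = ['', '[UNK]', 'erase', 'write erase', 'write', 'then', 'rewrite',
              'poppy blooms', 'poppy', 'i write', 'i', 'erase rewrite',
              'erase again', 'blooms', 'and then', 'and', 'again and',
              'again', 'a poppy', 'a']

# the test sentence after standardization and splitting
tokens = "i write rewrite and still rewrite again".split()
grams = with_ngrams(tokens, max_n=2)
decoded = [g if g in vocabulary else "[UNK]" for g in grams]

print(len(grams))         # 7 unigrams + 6 bigrams
print(" ".join(decoded))
```

Only the bigram 'i write' was seen during `adapt`; the other five bigrams of the test sentence, as well as the unigram 'still', fall back to `[UNK]`.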


### Instantiating a TextVectorization object with unigrams, maximum tokens as 20 and output mode as `'multi_hot'`

In [None]:
#  Unigrams with binary encoding
text_vectorization = TextVectorization(
    ngrams = 1, # Default value
    max_tokens = 20, # let's change this value to 8 and see
    output_mode="multi_hot",
)
demonstrate_TxVec(text_vectorization, dataset, test_sentence, mode="multi_hot")

# Recall we saw this in tutorial 2



vocabulary = ['[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy', 'i', 'blooms', 'and', 'again', 'a']
len(vocabulary) = 11
encoded sentence = [1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0.]
len(encoded sentence) = 11
[UNK]:	 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
erase:	 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
write:	 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
then:	 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
rewrite:	 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
poppy:	 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
i:	 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
blooms:	 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
and:	 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
again:	 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
a:	 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
test_sentence = I write, rewrite, and still rewrite again


### Instantiating a TextVectorization object with output mode as `'multi_hot'` and considering bigrams

In [None]:
text_vectorization = TextVectorization(
    ngrams = 2, # bag of 2 or less words. See cell o/p
    output_mode="multi_hot",
)
# Bi-grams typically perform better than unigrams- Word order matters!
demonstrate_TxVec(text_vectorization, dataset, test_sentence, mode="multi_hot")

vocabulary = ['[UNK]', 'erase', 'write erase', 'write', 'then', 'rewrite', 'poppy blooms', 'poppy', 'i write', 'i', 'erase rewrite', 'erase again', 'blooms', 'and then', 'and', 'again and', 'again', 'a poppy', 'a']
len(vocabulary) = 19
encoded sentence = [1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
len(encoded sentence) = 19
[UNK]:	 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
erase:	 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
write erase:	 [0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
write:	 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
then:	 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
rewrite:	 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
poppy blooms:	 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
poppy:	 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
i write:	 [0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
i:	 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.

### Instantiating a TextVectorization object with output mode as `'count'` and considering unigrams

In [None]:
# Unigrams with token counts
text_vectorization = TextVectorization(
    ngrams = 1,
    # max_tokens = 8,
    output_mode="count",
)
demonstrate_TxVec(text_vectorization, dataset, test_sentence, mode="count")
#Q: which token comes twice in the test_sentence?




vocabulary = ['[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy', 'i', 'blooms', 'and', 'again', 'a']
len(vocabulary) = 11
encoded sentence = [1. 0. 1. 0. 2. 0. 1. 0. 1. 1. 0.]
len(encoded sentence) = 11
[UNK]:	 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
erase:	 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
write:	 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
then:	 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
rewrite:	 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
poppy:	 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
i:	 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
blooms:	 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
and:	 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
again:	 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
a:	 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
test_sentence = I write, rewrite, and still rewrite again
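
The difference between the `'multi_hot'` and `'count'` modes can be sketched in a few lines of plain Python. The helper names are hypothetical, and the vocabulary is copied from the output above (in these modes index 0 holds `[UNK]` and there is no padding token).

```python
vocabulary = ['[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy',
              'i', 'blooms', 'and', 'again', 'a']
index = {tok: i for i, tok in enumerate(vocabulary)}

def count_encode(tokens):
    # one slot per vocabulary entry; unknown tokens are counted in slot 0
    vec = [0] * len(vocabulary)
    for tok in tokens:
        vec[index.get(tok, 0)] += 1
    return vec

def multi_hot_encode(tokens):
    # same as counting, but only records presence (1) or absence (0)
    return [1 if c > 0 else 0 for c in count_encode(tokens)]

tokens = "i write rewrite and still rewrite again".split()
print(count_encode(tokens))
print(multi_hot_encode(tokens))
```

Both encodings discard word order; they differ only in the slot for 'rewrite', which occurs twice in the test sentence.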


### Instantiating a TextVectorization object with output mode as `'tf_idf'` and considering unigrams

In [None]:
# Unigrams with TF-IDF weighted outputs
text_vectorization = TextVectorization(
    ngrams = 1,
    output_mode="tf_idf",
)
demonstrate_TxVec(text_vectorization, dataset, test_sentence, mode="tf")

# Often leads to about a 1% increase in performance



vocabulary = ['[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy', 'i', 'blooms', 'and', 'again', 'a']
len(vocabulary) = 11
encoded sentence = [0.8939764  0.         0.91629076 0.         1.8325815  0.
 0.91629076 0.         0.91629076 0.91629076 0.        ]
len(encoded sentence) = 11
[UNK]:	 [0.8939764 0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.       ]
erase:	 [0.        0.6931472 0.        0.        0.        0.        0.
 0.        0.        0.        0.       ]
write:	 [0.         0.         0.91629076 0.         0.         0.
 0.         0.         0.         0.         0.        ]
then:	 [0.         0.         0.         0.91629076 0.         0.
 0.         0.         0.         0.         0.        ]
rewrite:	 [0.         0.         0.         0.         0.91629076 0.
 0.         0.         0.         0.         0.        ]
poppy:	 [0.         0.         0.         0.         0.         0.91629076
 0.         0.         0.         0.
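
The numbers printed above can be reproduced with a short plain-Python sketch. The weighting used here, `idf = log(1 + N / (1 + df))` with the `[UNK]` weight set to the average of the other weights, is inferred from the printed values and is an assumption, not a statement about the Keras source code. The helper names are hypothetical.

```python
import math

# the dummy dataset after standardization and splitting
documents = [
    ["i", "write", "erase", "rewrite"],
    ["erase", "again", "and", "then"],
    ["a", "poppy", "blooms"],
]
vocabulary = ['[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy',
              'i', 'blooms', 'and', 'again', 'a']

def idf_weights(documents, vocabulary):
    n_docs = len(documents)
    weights = {}
    for tok in vocabulary[1:]:
        df = sum(1 for doc in documents if tok in doc)  # document frequency
        weights[tok] = math.log(1 + n_docs / (1 + df))  # assumed formula
    # assumption: unknown tokens get the mean weight of the known tokens
    weights["[UNK]"] = sum(weights.values()) / len(weights)
    return weights

def tf_idf_encode(tokens, vocabulary, weights):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vec = [0.0] * len(vocabulary)
    for tok in tokens:
        key = tok if tok in index else "[UNK]"
        vec[index[key]] += weights[key]  # term count times idf weight
    return vec

weights = idf_weights(documents, vocabulary)
tokens = "i write rewrite and still rewrite again".split()
encoded = tf_idf_encode(tokens, vocabulary, weights)
print([round(v, 4) for v in encoded])
```

Under this weighting 'erase', which appears in two of the three documents, gets a smaller weight than words that appear in only one.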

## TextVectorization in Keras Model

This technique is useful for production, where a stand-alone model that accepts raw strings is needed.

Take a trained (for example, saved and reloaded) model and add a 'text_vectorization' layer in front of it.

### Define a model architecture in a function

The processed inputs from the TextVectorization layer will later be fed to this model.

In [None]:
# Dense network which may be used repetitively
def get_model(max_tokens=20000, hidden_dim=16):
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop",
                    loss="binary_crossentropy",
                    metrics=["accuracy"])
  return model

In [None]:
try:
  model=get_model()
  inputs = keras.Input(shape=(1,), dtype="string")
  processed_inputs = text_vectorization(inputs)
  outputs = model(processed_inputs) #some trained model
  inference_model = keras.Model(inputs, outputs)
  inference_model.summary()
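except Exception as err:
  # the vectorizer's output width may not match the model's expected input size (max_tokens);
  # report the problem instead of silently ignoring it
  print(f"Could not build the inference model: {err}")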

The method shown above will be discussed in more detail in the next assignment.




### Data Preparation Example

A pre-processed version of the IMDB dataset provided by Keras was used in the previous assignments.

The original IMDB dataset contains the *train* and *test* folders.
Here, the original dataset will be used and the pre-processing related to it will be explored.

In [None]:
# List subdirectories
!cd aclImdb && ls -d */

test/  train/


In [None]:
# Remove unnecessary folder
!rm -r aclImdb/train/unsup

In [None]:
# Visualise a sample
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

### Create a validation directory and move 20% of the train data to it

In [None]:
# move 20% of the training data to the validation folder
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # random.Random(1337).shuffle(files) # We should shuffle. Only commenting for demonstration
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

### Create batches of data using `text_dataset_from_directory`

In [None]:
# Create dataset using utility
batch_size = 32

# Q: Name other such utilities seen earlier ?
train_ds = text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size)

val_ds = text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size)

test_ds = text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size)

text_only_train_ds = train_ds.map(lambda x, y: x) #replace x,y with x. That is remove labels, just keep text data.


Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


There are 20000, 5000, and 25000 records in the train, validation, and test directories respectively, each split across two classes: positive and negative.

In [None]:
# Check shapes
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)

    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)

    print("inputs[2]:", inputs[2])
    print("targets[2]:", targets[2])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[2]: tf.Tensor(b'Problem with these type of movies is that literally dozens of them are being made each year. Luckily for use only a handful are given a theatrical release, while the others are being pushed straight to video or TV, such as this movie.<br /><br />The foremost problem of this movie is really its originality. It\'s one of those movies which uses the "Die Hard" formula of a tough but troubled guy being at the wrong place at the wrong time. In this case it\'s a character played by Casper Van Dien, who works for a security agency that thoroughly test safety procedures for companies and individuals. In this case he\'s being send to a cruise ship, which of course gets hijacked. You can see this movie as a sort of mix of "Die Hard" and "Air Force One" and the movie doesn\'t even try to conceal that those two movies were probably its biggest source of \'inspiration\'. S

### Processing the data using TextVectorization layer of keras

In [None]:
# Vectorizing the data
max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,    # Q: What is the vocabulary size?
    output_mode="int",        # Q: What will be the type of output for a token (say), 'amazing' ?
    output_sequence_length=max_length,      # Q: What is the maximum length of review? Is it a fair assumption?
    )

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)



<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5_AST_05_Transformer_Encoder_Text_data_prep.png" width=650px/>
</center>


### Visualize and compare the raw and processed data

In [None]:
# Let's visualize the raw text and the vectorized (to int) text
for text, label in train_ds:
  print(text[0])
  print(label[0])
  break

for int_of_text, label in int_train_ds:
  print(int_of_text[0])
  print(label[0])
  break

# Q: How can you verify whether the index of movie is 18?


tf.Tensor(b"This romantic adventure must have seemed shockingly subversive in its day. A wealthy upper class English woman schemes, plots and manipulates everyone around her for her own satisfaction. She uses her privileged position to embark on secret activities of a decidedly anti-social kind. There's a clever sex-role reversal as her activities prove her more daring and dashing than most of the male characters. But naturally there's a tall, dark and handsome stranger to keep up the love interest, and this wicked lady is not backward in coming forward when she meets the right man.<br /><br />The wishy-washy weakness and gullibility of every other character make the plot unconvincing in the extreme, but those who thirst for Romance will overlook that.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(
[   11    18     7   271   310    16     2  1074   898     3   732   113
   583    19    48    79     7    34   222  2732  6508     5 10059   370
     3  3003   193

Vector representation of the word movie

In [None]:
text_vectorization("movie")
# Q: What is the shape of the TV output?
# Q: Why so many 0s?


<tf.Tensor: shape=(600,), dtype=int64, numpy=
array([18,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,

Vector representation of "great movie" and "a fine story"

In [None]:
text_vectorization(["great movie","a fine story"])
#Q: shape?

<tf.Tensor: shape=(2, 600), dtype=int64, numpy=
array([[ 87,  18,   0, ...,   0,   0,   0],
       [  4, 473,  64, ...,   0,   0,   0]])>
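
The zeros come from padding: with `output_sequence_length` set, shorter inputs are padded with index 0 and longer ones are cut off. A minimal plain-Python sketch of that behaviour follows; the helper name `pad_or_truncate` is hypothetical, and the token indices are the ones shown in the output above.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Force a list of token indices to exactly `length` entries."""
    return (token_ids + [pad_id] * length)[:length]

great_movie = [87, 18]         # "great movie"
a_fine_story = [4, 473, 64]    # "a fine story"

batch = [pad_or_truncate(ids, 6) for ids in (great_movie, a_fine_story)]
print(batch)

# a sequence longer than the target length is truncated
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], 6))
```

With a target length of 600, as in this notebook, a two-word input leaves 598 padded positions, and every row in the batch ends up with the same length.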

### Embeddings


Word embeddings are vector representations of words that map human language into a structured geometric space. Compared with one-hot or multi-hot encodings, they are:

* dense (floats)
* low-dimensional (for example, 256 to 1,024 dimensions even for very large vocabularies)

There are two ways to obtain word embeddings:

* Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors, in the same way you learn the weights of a neural network. **Move away from manual feature engineering.**
* Load into your model word embeddings that were precomputed using a different machine learning task than the one you’re trying to solve. These are called pretrained word embeddings.

**Q: Do these two ways remind you of something we studied in CNNs?**

In this assignment, the main focus is on the first approach: learning word embeddings jointly with the task.




### Embedding Layer


The Embedding layer works as follows:

*   Like a dictionary that **maps integer indices** (which stand for specific words) **to dense vectors**

*   Input: a rank-2 tensor of integers, of shape (batch_size, sequence_length)
*   Output: 3D floating-point tensor of shape (batch_size, sequence_length, embedding_dimensionality)
*   WORD INDEX ⭢ EMBEDDING LAYER ⭢ CORRESPONDING WORD VEC

*   Initial weights are random
*   Learns specialized structure upon training
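
The "dictionary" view of the Embedding layer can be sketched in plain Python as a table lookup. This is an illustration only: the function names, the tiny sizes and the initialization range are assumptions, and no training is performed.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    # one row of small random floats per word index (untrained weights)
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(batch_of_ids, table):
    # replace every integer index by its row in the table
    return [[table[i] for i in seq] for seq in batch_of_ids]

table = make_embedding_table(vocab_size=12, embedding_dim=4)
batch = [[7, 3, 5, 0],     # shape (batch_size=2, sequence_length=4)
         [2, 10, 0, 0]]
embedded = embed(batch, table)

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (batch_size, sequence_length, embedding_dimensionality)
```

The same index always maps to the same vector, so the padded positions (index 0) in both sequences share one row of the table. During training, the rows of this table are the weights that get updated.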



<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST5%20Embedding%20Layer.png" width=750px/>
</center>


### Define an LSTM architecture with an Embedding layer and a Bidirectional layer

In [None]:
max_tokens = 20000
inputs = keras.Input(shape=(None,), dtype="int64")
# The Embedding layer
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)  # the largest integer (i.e. word index) in the input
                                                                             # should be no larger than 19999 (vocabulary size).
# Q: What is the input to the Embedding layer?
# Q: What is the dimension of the output embeddings
# Q: In embedding layer shape, what are None and None ?

x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
#Q: Weights in the embedding layer?
#Hint: Dict; 1 input word => embedding of size ___ .

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         5120000   
                                                                 
 bidirectional (Bidirection  (None, 64)                73984     
 al)                                                             
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5194049 (19.81 MB)
Trainable params: 5194049 (19.81 MB)
Non-trainable params: 0 (0.00 Byte)
___________________
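
The parameter counts in the summary above can be checked with simple arithmetic. The sketch below uses the standard parameter formulas for an embedding table, an LSTM (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are for illustration only.

```python
vocab_size, embedding_dim, lstm_units = 20000, 256, 32

# Embedding: one vector of size embedding_dim per word index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with (input + recurrent + bias) weights per unit
lstm_params_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
# Bidirectional: two independent LSTMs (forward and backward)
bidirectional_params = 2 * lstm_params_one_direction

# Dense: the concatenated forward/backward outputs (2 * 32) plus a bias
dense_params = 2 * lstm_units * 1 + 1

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

Almost all of the weights sit in the Embedding layer, which holds one 256-dimensional vector for each of the 20000 vocabulary entries.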

### Train the model and make a prediction

In [None]:
# Fit the model - This code cell can be commented for brevity
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,callbacks=callbacks)

model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.862


#  **Part B** : Building Encoder Transformer

The figure below shows the **Transformer Model Architecture** from the paper ["**Attention Is All You Need**"](https://arxiv.org/pdf/1706.03762v6.pdf). Here, we implement the **Encoder** and try to understand how it functions. Throughout this notebook, the phrase **'as per the paper'** refers to this paper. To understand the Transformer Encoder fully, it is essential to understand Self Attention, which is used inside Multi-head Attention. After passing through the TextVectorization layer and the Embedding layer, the data passes through the Self Attention layer.
<br><br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Transformer.png" width=750px/>

</center>

## Self Attention
The attention mechanism depicted in the picture below can be understood as attention scores highlighting the most important features of the cat so that it can be identified.


<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/Attention%20scores%20pic.png" width=700px/>
</center>



Most popular language models share more than the term 'BERT' in their names: they share an important technique called 'self-attention'. Transformer-based architectures, which are primarily used for language understanding tasks, eschew recurrence and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs.

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST5%20Self%20Attention%20Scores.png" width=900px/>
</center>




outputs = sum(inputs * pairwise_scores(inputs, inputs))


According to the self attention scores depicted in the picture, the word 'train' pays more attention to 'station' than to other words such as 'on' or 'the'.

The attention mechanism lets the output focus on relevant parts of the input while it is being produced, whereas self-attention lets the inputs interact with each other (i.e. it calculates the attention of every other input with respect to one input).

 Inside each attention head is a **Scaled Dot Product Self-Attention** operation, which returns an attention vector given by the equation below:

$$ \text{SelfAttention}(x_i) = \sum_j \text{softmax}_j\left(\frac{x^{T}_i x_j}{\sqrt{d_k}}\right)x_j $$

The term **$x^{T}_i x_j$** is the dot product of the input vector $x_i$ with an input vector $x_j$ from the same sequence. In the code below, 'pivot_vector' and 'vector' play the roles of $x_i$ and $x_j$ in the Self Attention function above, and $d_k$ is the embedding dimension.

### Demonstrating self_attention with dummy data

#### Define a custom self attention function

In [None]:
# Custom self attention function
def self_attention(input_sequence):
  output = np.zeros(shape=input_sequence.shape)
  for i, pivot_vector in enumerate(input_sequence): # iterate over each token in ip seq
    scores = np.zeros(shape=(len(input_sequence), ))

    for j, vector in enumerate(input_sequence):
      scores[j] = np.dot(pivot_vector, vector.T)    # Pairwise scores

    scores /= np.sqrt(input_sequence.shape[1]) # scale #[1] is the embedding dim
    scores = tf.nn.softmax(scores)              # softmax
    new_pivot_representation = np.zeros(shape=pivot_vector.shape)
    for j, vector in enumerate(input_sequence):
      new_pivot_representation += vector*scores[j] # weighted sum
    output[i] = new_pivot_representation
  return output

# Optional HW: Add to the code to print the attention_score matrix
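The loop-based function above can also be written with matrix operations, which makes the attention score matrix directly available. The following is a minimal NumPy sketch, not the notebook's own solution to the optional exercise; the name `self_attention_with_scores` and the random stand-in data are assumptions made for illustration.

```python
import numpy as np

def self_attention_with_scores(input_sequence):
    """Scaled dot-product self-attention without learnable weights.

    input_sequence has shape (T, d). Returns (output, scores) with
    shapes (T, d) and (T, T).
    """
    x = np.asarray(input_sequence, dtype=np.float64)
    d = x.shape[1]
    logits = x @ x.T / np.sqrt(d)                  # (T, T) scaled pairwise scores
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for stability
    weights = np.exp(logits)
    scores = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return scores @ x, scores                      # weighted sum of all vectors

# Hypothetical stand-in for a (3, 4) embedding in which tokens 1 and 3
# share the same index, so their rows are identical.
rng = np.random.default_rng(0)
rows = rng.normal(size=(2, 4))
dummy_embedding = np.stack([rows[0], rows[1], rows[0]])

attn_out, attn_scores = self_attention_with_scores(dummy_embedding)
print(attn_scores)       # (3, 3) matrix, each row sums to 1
print(attn_out.shape)    # (3, 4)
```

Row `i` of `attn_scores` holds the weights that token `i` assigns to every token in the sequence, including itself.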

#### Use dummy data and find its vectors

The data first passes through the TextVectorization layer, then through the Embedding layer, and finally the Self Attention scores are calculated.

In [None]:
# First, vectorize raw text using _________
dummy_vocab = ["movie was very nice", "film was good"]
text_vec = layers.TextVectorization(max_tokens=5, output_sequence_length=3)
text_vec.adapt(dummy_vocab)
print(text_vec("movie"))
#Q: why the zeros ? why shape 3?
#Q: output type?

tf.Tensor([1 0 0], shape=(3,), dtype=int64)


In [None]:
# Then obtain embeddings from text indices using the Embedding layer
int_text = text_vec(["movie was good"])
print(int_text)
embedding = layers.Embedding(input_dim=5, output_dim=4)(int_text)  # Why 5 ?
print(embedding)

tf.Tensor([[1 2 1]], shape=(1, 3), dtype=int64)
tf.Tensor(
[[[ 0.01873472  0.04067269  0.03172047 -0.03395281]
  [-0.04117931  0.04232437  0.04808872  0.01169556]
  [ 0.01873472  0.04067269  0.03172047 -0.03395281]]], shape=(1, 3, 4), dtype=float32)
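In the printed output above, the first and third rows are identical because both tokens map to index 1. An embedding layer behaves like a lookup table with one row per vocabulary index. The sketch below illustrates this with plain NumPy; `embedding_table` is a hypothetical random table, not the weights of the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim = 5, 4

# Hypothetical embedding table: one row of size embed_dim per vocabulary index.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

token_ids = np.array([[1, 2, 1]])        # same indices as the printed int_text
embedded = embedding_table[token_ids]    # integer indexing performs the row lookup

print(embedded.shape)                                   # (1, 3, 4)
print(np.array_equal(embedded[0, 0], embedded[0, 2]))   # True: both are row 1
```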


#### Output from the attention function

In [None]:
# Compute output of attention module
attention_outputs = self_attention(embedding[0].numpy())
print(attention_outputs.shape)
print(attention_outputs)
# These are z1, z2, z3 in the picture below

(3, 4)
[[-0.00122274  0.04122287  0.03717276 -0.01874726]
 [-0.00126232  0.04122396  0.03718357 -0.01871711]
 [-0.00122274  0.04122287  0.03717276 -0.01874726]]


## Attention Eqn. with Queries, Keys and Values

We computed the Self Attention from the input vectors themselves. This means that for fixed inputs, the attention weights are always fixed. In other words, there are no learnable parameters. We need to introduce some learnable parameters that make the self attention mechanism more flexible and tunable for various tasks. For this purpose, three weight matrices are introduced and multiplied with the input $x_i$ separately, giving three new terms, **Queries (Q), Keys (K) and Values (V)**, as given by the equations below. The vectorized implementation and shape tracking are shown along with the equations.

**Vectorized implementation & Shape tracking**

$ T $ = Number of tokens in the input sequence.

$ d_{model} $ = Dimension of the embedding vector for each word (512 as per the paper).

$ X   \Rightarrow (T \times d_{model}) $


$ Q = X W^{Q}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_k  )  \Rightarrow   (T \times d_{k}) $


$ K = X W^{K}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_k  )  \Rightarrow   (T \times d_{k}) $


$ V = X W^{V}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_v  )  \Rightarrow   (T \times d_{v}) $

Dot product of Queries and Keys:

$ Q K^{T}   \Rightarrow (T \times d_{k}) \times (d_{k} \times T  )  \Rightarrow   (T \times T) $

There are T query vectors and T key vectors (one per token of the input sequence), so we need $T \times T$ attention weights, which matches the shape above. Taking the softmax does not change the shape.
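The shape tracking above can be checked numerically. The sketch below uses random matrices as stand-ins for the learned weights $W^Q$, $W^K$ and $W^V$; the sequence length `T = 5` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 512, 64, 64   # T is arbitrary; the rest follow the paper

X = rng.normal(size=(T, d_model))       # one embedding vector per token

# Random stand-ins for the learned projection matrices.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T                        # one score per (query, key) pair

print(Q.shape, K.shape, V.shape)        # (5, 64) (5, 64) (5, 64)
print(scores.shape)                     # (5, 5)
```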

 **Shapes as per the paper**

$
\begin{array}{|c|c|c|} \hline
Object   &  Shape & Values  \\ \hline
q_i, k_i  &  d_k  &  (64,) \\
v_i   &   d_v   &   (64,)  \\
x_i   &   d_{model}   & (512,)  \\
W^{Q}, W^{K}  &   d_{model} \times d_k   &   (512, 64)  \\
W^{V}   &   d_{model} \times d_v   &  (512, 64)  \\ \hline
\end{array}
$

**Batch consideration**

In code, a batch of N samples is processed at a time. Every shape gains a leading batch dimension of **N**, for example $ N \times T \times d_k $ instead of just $ T \times d_k$.

 **Final Scaled Dot Product Attention** equation inside each attention head with **Queries(Q)**, **Keys(K)**, and **Values(V)**, which returns an attention vector.

<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Scaled_dot_product_Attention.png" width=250px/>

</center>


$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
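The equation above, together with the batch dimension, can be sketched in a few lines of NumPy. This is an illustrative version that omits masking and dropout; the function names and the random inputs are assumptions, not part of the notebook.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K have shape (..., T, d_k) and V has shape (..., T, d_v).

    Returns the attention output (..., T, d_v) and the weights (..., T, T).
    """
    d_k = Q.shape[-1]
    logits = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (..., T, T)
    weights = softmax(logits, axis=-1)                   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(1)
N, T, d_k, d_v = 2, 4, 64, 64            # batch of N sequences of length T
Q = rng.normal(size=(N, T, d_k))
K = rng.normal(size=(N, T, d_k))
V = rng.normal(size=(N, T, d_v))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)    # (2, 4, 64)
print(w.shape)      # (2, 4, 4)
```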

## Multihead Attention

The output of the final Encoder in the stack is passed to the Value and Key parameters in the Encoder-Decoder Attention.

The Encoder-Decoder Attention is therefore getting a representation of both the target sequence (from the Decoder Self-Attention) and a representation of the input sequence (from the Encoder stack). It, therefore, produces a representation with the attention scores for each target sequence word that captures the influence of the attention scores from the input sequence as well.

As this passes through all the Decoders in the stack, each Self-Attention and each Encoder-Decoder Attention also add their own attention scores into each word’s representation.

In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters $h$ ways (one split per head) and passes each split independently through a separate Head. The outputs of all these similar Attention calculations are then combined to produce the final Attention output. This is called Multi-head attention and gives the Transformer greater power to encode multiple relationships and nuances for each word.

In the paper, the diagram for scaled dot product attention does not use any weights at all. Instead, the weights are included only in the multi-head attention block, shown in the figure below:
<br><br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Multi_head_attention_with_weights.png" width=900px/>
</center>

### Understanding the shape
<br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Multi_head_attention_shape_tracking.png" width=750px>
</center>
<br>

**Final Projection :** $ Output = concat(A_1, A_2, ..., A_h)  W^{o} $

**Shape of :**  $ concat (A_1, A_2, ..., A_h) \Rightarrow  (T \times hd_v) $

**Shape of:**  $  W^{o} \Rightarrow (hd_v \times d_{model}) $

**Shape of final:**  $ Output = concat (A_1, A_2, ..., A_h) W^{o} \Rightarrow  (T \times hd_v) \times (hd_v \times d_{model})  \Rightarrow  (T \times d_{model}) \Leftarrow $ **Back to the initial input shape.**

Batch size is not displayed here.
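The concatenation and final projection shapes above can be reproduced with a short NumPy sketch. It uses random matrices in place of learned weights and the paper's sizes ($h = 8$, $d_k = d_v = 64$, $d_{model} = 512$); the sequence length and the 0.02 weight scale are arbitrary choices for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
T, d_model, h = 6, 512, 8
d_k = d_v = d_model // h                 # 64 per head

X = rng.normal(size=(T, d_model))

heads = []
for _ in range(h):
    # Each head has its own (randomly initialized) projection matrices.
    W_q = rng.normal(size=(d_model, d_k)) * 0.02
    W_k = rng.normal(size=(d_model, d_k)) * 0.02
    W_v = rng.normal(size=(d_model, d_v)) * 0.02
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # (T, d_v) output of one head
    heads.append(A)

concat = np.concatenate(heads, axis=-1)       # (T, h * d_v)
W_o = rng.normal(size=(h * d_v, d_model)) * 0.02
output = concat @ W_o                         # (T, d_model)

print(concat.shape)    # (6, 512)
print(output.shape)    # (6, 512)
```

Because the output has the same shape as the input, blocks of this kind can be stacked one after another.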

## Transformer Encoder Block

The Transformer Encoder consists of a stack of identical layers (Encoder Blocks), as shown in the figure below, where each layer consists of two main sub-layers:

* The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
* A second sub-layer comprises a fully-connected feed-forward network.

Following each of these two sub-layers is layer normalization, into which the sub-layer input (through a residual/skip connection) and output are fed.

Regularization is also introduced into the model by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the encoder.



<br>
<center>
<img src="https://cdn.extras.talentsprint.com/aiml/Experiment_related_data//Images/Encoder_tfr_block_unfolded.png"  width=600 px />$⇒$
<img src="https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Encoder_tfr_block.png" width=180 px/>

</center>

The transformer encoder architecture typically consists of multiple layers, each of which includes a self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence by calculating dot products between their embeddings. Running several such attention operations in parallel gives multi-head attention.

The feed-forward network allows the model to extract higher-level features from the input and to represent it more compactly and usefully. This network usually comprises two linear layers with a ReLU activation function in between. In the paper, it is a network with one hidden layer, a ReLU activation in the middle, and no activation function at the output layer.
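The feed-forward sub-layer and the residual-plus-normalization step described above can be sketched in NumPy as follows. This is a simplified illustration: the layer normalization here has no learned scale and shift, dropout is omitted, the weights are random stand-ins, and the inner size `d_ff = 2048` is the value used in the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between and no output activation."""
    hidden = np.maximum(0.0, x @ W1 + b1)    # (T, d_ff)
    return hidden @ W2 + b2                  # (T, d_model)

rng = np.random.default_rng(3)
T, d_model, d_ff = 4, 512, 2048
x = rng.normal(size=(T, d_model))            # stand-in for the attention sub-layer output

W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

# Residual connection followed by layer normalization.
y = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(y.shape)    # (4, 512)
```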

The transformer encoder is a crucial part of the transformer encoder-decoder architecture, which is widely used for natural language processing tasks.

## Encoder Transformer
Stacking transformer blocks gives a Transformer! This is shown in the figure below:

<br>
<center>
<img src= "https://cdn.extras.talentsprint.com/aiml/Experiment_related_data/Images/Encoder_transfomer.png" width=1000px/>
</center>

#### Define TransformerEncoder class

In [None]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim    # Dimension of embedding. 4 in the dummy example
        self.dense_dim = dense_dim    # No. of neurons in dense layer
        self.num_heads = num_heads    # No. of heads for MultiHead Attention layer
        self.attention = layers.MultiHeadAttention(# MultiHead Attention layer -
            num_heads=num_heads, key_dim=embed_dim)   # see coloured pic above
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]    # encoders are stacked on top of the other.
        )                                 # So output dimension is also embed_dim
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    # Call function based on figure above
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]   # Will discuss in next tutorial
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)  # Query: inputs, Value: inputs, Keys: Same as Values by default
                                                  # Q: Can you see how this is self attention?
        proj_input = self.layernorm_1(inputs + attention_output) # LayerNormalization; + Recall cat picture
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)  # LayerNormalization + Residual connection

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

#### Model definition

In [None]:
# Build the Transformer encoder
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 256)         5120000   
                                                                 
 transformer_encoder (Trans  (None, None, 256)         543776    
 formerEncoder)                                                  
                                                                 
 global_max_pooling1d (Glob  (None, 256)               0         
 alMaxPooling1D)                                                 
                                                                 
 dropout_2 (Dropout)         (None, 256)               0         
                                                                 
 dense_5 (Dense)             (None, 1)                 257 
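The parameter counts in the summary above can be reproduced by hand. The arithmetic below is a sketch that assumes the Keras defaults for `MultiHeadAttention` (biases enabled, value dimension equal to `key_dim`) and the hyperparameters from the model definition.

```python
vocab_size, embed_dim, num_heads, dense_dim = 20000, 256, 2, 32
key_dim = embed_dim        # the TransformerEncoder class passes key_dim=embed_dim

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Query, key and value projections: kernel (embed_dim, num_heads, key_dim)
# plus bias (num_heads, key_dim), three times.
qkv_params = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
# Output projection: kernel (num_heads, key_dim, embed_dim) plus bias (embed_dim,).
out_proj_params = num_heads * key_dim * embed_dim + embed_dim
attention_params = qkv_params + out_proj_params

# Feed-forward part: Dense(dense_dim) followed by Dense(embed_dim).
dense_params = (embed_dim * dense_dim + dense_dim) + (dense_dim * embed_dim + embed_dim)

# Two LayerNormalization layers, each with a scale and an offset vector.
layernorm_params = 2 * (2 * embed_dim)

encoder_params = attention_params + dense_params + layernorm_params
classifier_params = embed_dim * 1 + 1

print(embedding_params)     # 5120000
print(encoder_params)       # 543776
print(classifier_params)    # 257
```

Most of the encoder's parameters sit in the attention projections, because each head uses a key dimension as large as the full embedding.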

#### Train and evaluate the performance of the model

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test acc: 0.873


## Positional Embedding

**Positional Embedding = Word Embedding + Positional Encoding**

**Positional Encoding**

Passing embeddings directly into the transformer block loses the information about the order of tokens, because attention is permutation invariant, i.e. the order of tokens does not matter to attention.
Although transformers are sequence models, this important detail would otherwise be lost. Positional encoding comes to the rescue.
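The loss of order information can be seen in a small experiment. The sketch below uses a plain self-attention function without learned weights (an assumption made to keep the example short): shuffling the input tokens only shuffles the output rows in the same way, so a pooled result such as the per-feature maximum does not change at all.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def plain_self_attention(x):
    """Self-attention on a (T, d) array with no positional information."""
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

rng = np.random.default_rng(4)
x = rng.normal(size=(5, 8))            # 5 tokens, 8 features each
perm = np.array([3, 0, 4, 1, 2])       # an arbitrary reordering of the tokens

out = plain_self_attention(x)
out_perm = plain_self_attention(x[perm])

# The outputs are the same vectors, just reordered like the inputs.
print(np.allclose(out_perm, out[perm]))                      # True
# After max pooling over the sequence, the order has no effect.
print(np.allclose(out_perm.max(axis=0), out.max(axis=0)))    # True
```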

Positional encoding adds positional information to the existing embeddings.

**A unique set of numbers is added at each position of the existing embeddings**, such that this new set of numbers uniquely identifies the position at which it is located.


Two schemes are considered here:

1. Positional Encoding by SubClassing the Keras Embedding Layer (Trainable)
2. Positional Encoding scheme as per the paper (Non-Trainable)

In the second scheme, the encoding is created using a set of sines and cosines at different frequencies. The paper uses the following formula for calculating the positional encoding. [Positional Encoding Visualization.](https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers)

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$
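The two formulas above can be turned into a small NumPy function. This is a sketch of the fixed (non-trainable) scheme only; the function name is hypothetical, `d_model` is assumed to be even, and the sizes in the usage lines are example values.

```python
import numpy as np

def sinusoidal_positional_encoding(sequence_length, d_model):
    """Return a (sequence_length, d_model) array of fixed encodings."""
    positions = np.arange(sequence_length)[:, np.newaxis]          # pos, shape (L, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]            # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)    # (L, d_model / 2)

    pe = np.zeros((sequence_length, d_model))
    pe[:, 0::2] = np.sin(angles)    # even feature indices use sine
    pe[:, 1::2] = np.cos(angles)    # odd feature indices use cosine
    return pe

pe = sinusoidal_positional_encoding(sequence_length=600, d_model=256)
print(pe.shape)      # (600, 256)
print(pe[0, :4])     # [0. 1. 0. 1.]
```

Each row would be added to the word embedding at the corresponding position. Since the values are computed rather than learned, this scheme adds no trainable parameters.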

We are going to implement the first scheme, Positional Encoding by SubClassing the Keras Embedding Layer (Trainable). Thus, instead of using the Embedding layer from Keras, we define a PositionalEmbedding class and create a new model that uses it as the embedding layer.

In [None]:
# Using positional encoding to re-inject order information

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

#### Model Definition

In [None]:
#  Combining the Transformer encoder with positional embedding

vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 positional_embedding (Posi  (None, None, 256)         5273600   
 tionalEmbedding)                                                
                                                                 
 transformer_encoder_1 (Tra  (None, None, 256)         543776    
 nsformerEncoder)                                                
                                                                 
 global_max_pooling1d_1 (Gl  (None, 256)               0         
 obalMaxPooling1D)                                               
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                           

#### Train and evaluate the model

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test acc: 0.880


## References

If you are very interested in Transformers or plan to work closely with them, the following are good resources that explain them in a simplified manner.

1. [Attention Is All You Need](https://arxiv.org/pdf/1706.03762v6.pdf)

2. [Understanding Positional Encoding](https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers)

3. [Implement Multi-Head Attention](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras)

4. [Implementing the Transformer Encoder](https://machinelearningmastery.com/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras/)

5. [Illustrated-transformer](https://jalammar.github.io/illustrated-transformer/)





### Please answer the questions below to complete the experiment:




In [None]:
#@title Which layer in the Transformer model is responsible for capturing local dependencies within an input sequence? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["", "Self-Attention Layer", "Feedforward Neural Network Layer", "Positional Encoding Layer", "Layer Normalization"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")