# Transformer Chatbot Using Python/TensorFlow/Keras

<table>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/gsathisk/Chatbot/blob/main/TensorFlow_Transformer_Chatbot.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
</table>

## Transformer Chatbot

Chatbots are computer programs that respond to text or voice input in a way that is as close as possible to how a human would respond. They use Natural Language Processing (NLP) techniques to interpret human language. Chatbots are useful in industries such as entertainment, information retrieval, education, e-commerce, and business. For businesses, their primary advantages are the ability to serve many users at the same time and a significant reduction in customer service costs. With this in mind, this project builds an open-domain chatbot using a Transformer model. The core idea behind the Transformer model is _self-attention_, the ability to attend to different positions of the input sequence to compute a representation of that sequence. A Transformer stacks self-attention layers, which are explained below in the sections _Scaled dot product attention_ and _Multi-head attention_.


The implementation of the Transformer using TensorFlow 2 and Python involves the following:
- Importing the required library modules for Python and TensorFlow
- Setting up the Google Colab TPU environment
- Defining the hyperparameters for the input, batch, and transformer model
    - Input:
        - Input sentence length to be used for training the model
        - Number of samples to be used for training the model
    - Batch:
        - Batch size (number of samples per batch)
        - Buffer size for shuffling the dataset
    - Model:
        - Number of layers [Nx] for Encoder and Decoder
        - Number of expected features in the encoder/decoder inputs
        - Number of heads for Multi-head attention
        - Number of hidden units
        - Probability at which outputs of the layer are dropped out
        - Number of passes over the entire dataset (EPOCHS)
- Importing and unzipping the "Cornell movie corpus" dataset
- Creating the question & answer set using the "movie_conversations.txt" and "movie_lines.txt" files.
    - Involves preprocessing of sentences
    - Involves marrying conversations based on conversation ids
- Attention process:
    - Creating Scaled dot-product attention (Attention calculation)
    - Creating Multi-head attention (Calls --> Scaled dot-product attention)
- Transformer model creation:
    - Create definition for padded masking of inputs
    - Create definition for look-ahead masking of outputs
    - Create definition for positional encoding
    - Create definition for encoder layer (Single layer)
    - Create definition for encoder (Iterate encoder layer based on [Nx] hyperparameter)
    - Create definition for decoder layer (Single layer)
    - Create definition for decoder (Iterate decoder layer based on [Nx] hyperparameter)
    - Create definition for transformer model
        - Call encoder
        - Call decoder
        - Call linear layer
- Compile the model:
    - Create definition for loss function
    - Create definition for custom learning rate
    - Create definition for optimizer (Adam)
    - Create definition for accuracy (Used for metrics report)
    - Initialize the model
    - Compile the model
    - Display the summary of the model
- Training the model:
    - Train (fit) the model using input dataset for defined number of EPOCHS (hyperparameter)
- Create definitions for evaluating and predicting the model
- Testing the model

## Initial Setup

### Import Libraries

The following Python libraries are imported to help with the development of this transformer model.

In [1]:
# Import library modules required for this program
from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import os
import re
import numpy as np
from time import time
import matplotlib.pyplot as plt

#%pip install tensorflow==2.3.1
import tensorflow as tf
tf.random.set_seed(1234)
AUTO = tf.data.experimental.AUTOTUNE

#%pip install tensorflow-datasets==4.1.0
import tensorflow_datasets as tfds

print("Tensorflow imported. Version: {}".format(tf.__version__))


Tensorflow imported. Version: 2.7.0


### GPU / TPU initialization (Required for Google Colab)

- Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. Learn more about TPU here --> __[What is TPU?](https://cloud.google.com/tpu/docs/tpus)__
- Use Google Colab to run this code to take advantage of TPU/GPU hardware acceleration. To enable it in Colab, go to "Runtime --> Change Runtime Type" and select `TPU` or `GPU` as the hardware accelerator.


In [2]:
# Define TPU cluster config
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU {}'.format(tpu.cluster_spec().as_dict()['worker']))
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    # tf.distribute.TPUStrategy replaces the deprecated experimental alias (TF >= 2.3)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

print("TPU Replication strategy (Num_replicas): {}".format(strategy.num_replicas_in_sync))

TPU Replication strategy (Num_replicas): 1


## Hyperparameters

- To keep this example small and relatively fast, the values for *num_layers, d_model, and units* have been reduced.
    - See the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) for the other configurations of the transformer.

In [4]:
# Filters for input dataset
MAX_LENGTH = 40 # Maximum sentence length (in tokens) to keep. Longer sentences are dropped.
MAX_SAMPLES = 50000 # Maximum number of samples to preprocess and feed to training.

# Parameters for the Dataset. Question and Answer sets (NumPy arrays) are split into batches.
BATCH_SIZE = 64 * strategy.num_replicas_in_sync # Global batch size: 64 training examples per replica in one training step
BUFFER_SIZE = 20000 # Number of elements held in the shuffle buffer

# Parameters for Transformer Model
NUM_LAYERS = 2 # Number of layers [Nx] for Encoder and Decoder
D_MODEL = 256 # the number of expected features in the encoder/decoder inputs
NUM_HEADS = 8 # Number of heads for Multi-head attention
UNITS = 512 # Number of hidden units in the feed-forward sublayer
DROPOUT = 0.1 # The probability at which outputs of the layer are dropped out
EPOCHS = 40 # Number of passes over the entire training dataset

print('Hyperparameter definition completed !!!')

Hyperparameter definition completed !!!
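
Two quantities follow directly from these hyperparameters: the per-head depth used by multi-head attention and the number of training steps in one epoch. The sketch below works them out in plain Python. It assumes a single replica (so the batch size is 64) and borrows the pair count of 44131 from the filtering output reported later in this notebook; both are assumptions that will differ on other hardware or data.

```python
import math

# Values copied from the hyperparameter cell above.
D_MODEL = 256
NUM_HEADS = 8
BATCH_SIZE = 64          # assumes a single replica (num_replicas_in_sync == 1)

# Assumption: 44131 is the number of pairs left after filtering,
# as reported later in this notebook; adjust to your own run.
num_pairs = 44131

# Each attention head works on an equal slice of the model dimension.
assert D_MODEL % NUM_HEADS == 0, "D_MODEL must be divisible by NUM_HEADS"
depth_per_head = D_MODEL // NUM_HEADS

# One epoch visits every batch once; the last batch may be partial.
steps_per_epoch = math.ceil(num_pairs / BATCH_SIZE)

print("depth per head :", depth_per_head)
print("steps per epoch:", steps_per_epoch)
```

Under these assumptions each head attends over 32 features, and one epoch takes 690 steps.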


## Prepare Dataset

- Conversations from the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) are used as the input dataset.
- The dataset contains more than 220 thousand conversational exchanges between more than 10k pairs of movie characters.
- `movie_conversations.txt` contains the list of line IDs that make up each conversation, and `movie_lines.txt` contains the text associated with each line ID.
    - For further information regarding the dataset, see the README file in the zip file.

In [5]:
path_to_zip = tf.keras.utils.get_file('cornell_movie_dialogs.zip',
    origin= 'http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip',
    extract=True)

path_to_dataset = os.path.join(
    os.path.dirname(path_to_zip), "cornell movie-dialogs corpus")

path_to_movie_lines = os.path.join(path_to_dataset, 'movie_lines.txt')
path_to_movie_conversations = os.path.join(path_to_dataset, 'movie_conversations.txt')

print('Dataset unzipped. Path to movie lines file and conversation file are set')

Dataset unzipped. Path to movie lines file and conversation file are set


### Data preprocessing:

#### 1) Subset the training data:
* To keep this example simple and fast, the training set is limited to `MAX_SAMPLES=50000` sentence pairs, and sentences are limited to `MAX_LENGTH=40` tokens.

#### 2) Preprocessing 

##### step #1:
* The following preprocessing steps are applied:
    * Spaces are added around punctuation (e.g., "he is a boy." => "he is a boy .")
    * Contractions are expanded (e.g., "i'm" => "i am")
    * Everything except (a-z, A-Z, ".", "?", "!", ",") is replaced with a space

In [6]:
# Preprocessing to work with spaces and contractions
def preprocess_sentence(sentence):
  sentence = sentence.lower().strip()
  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
  sentence = re.sub(r'[" "]+', " ", sentence)
  # removing contractions
  sentence = re.sub(r"i'm", "i am", sentence)
  sentence = re.sub(r"he's", "he is", sentence)
  sentence = re.sub(r"she's", "she is", sentence)
  sentence = re.sub(r"it's", "it is", sentence)
  sentence = re.sub(r"that's", "that is", sentence)
  sentence = re.sub(r"what's", "what is", sentence)
  sentence = re.sub(r"where's", "where is", sentence)
  sentence = re.sub(r"how's", "how is", sentence)
  sentence = re.sub(r"\'ll", " will", sentence)
  sentence = re.sub(r"\'ve", " have", sentence)
  sentence = re.sub(r"\'re", " are", sentence)
  sentence = re.sub(r"\'d", " would", sentence)
  sentence = re.sub(r"won't", "will not", sentence)
  sentence = re.sub(r"can't", "cannot", sentence)
  sentence = re.sub(r"n't", " not", sentence)
  sentence = re.sub(r"n'", "ng", sentence)
  sentence = re.sub(r"'bout", "about", sentence)
  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
  sentence = sentence.strip()
  return sentence
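
To see the rules in action, the sketch below applies a reduced subset of them to two short inputs. `mini_preprocess` is a hypothetical helper written for this illustration only; it handles just two contractions and is not a replacement for `preprocess_sentence`.

```python
import re

def mini_preprocess(sentence):
    """Reduced, illustrative version of the preprocessing rules."""
    sentence = sentence.lower().strip()
    # Put spaces around punctuation so each mark becomes its own token.
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    # Collapse the repeated spaces introduced by the previous step.
    sentence = re.sub(r" +", " ", sentence)
    # Expand two contractions (the notebook handles many more).
    sentence = re.sub(r"i'm", "i am", sentence)
    sentence = re.sub(r"n't", " not", sentence)
    # Replace anything other than letters and kept punctuation with a space.
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    return sentence.strip()

examples = ["He is a boy.", "I'm here, aren't you?"]
for text in examples:
    print(repr(text), "->", repr(mini_preprocess(text)))
```

The first input becomes `he is a boy .` and the second becomes `i am here , are not you ?`, which shows both the punctuation spacing and the contraction expansion.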

##### step #2:
* Line IDs (from `movie_conversations.txt`) and the corresponding utterances (from `movie_lines.txt`) are joined through a dictionary that maps each line ID to its text. An example conversation is given below. At the end, the entire conversation file is converted into a question list and an answer list.

    ##### Example Conversation ID:<br>
    - u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

    ##### Example Conversation Text:<br>
    - L194 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
    - L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you.<br>
    - L196 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Not the hacking and gagging and spitting part.  Please.<br>
    - L197 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?<br>

    ##### Processed inputs (Questions):
    - 'can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .',
    - 'well , i thought we would start with pronunciation , if that is okay with you .',
    - 'not the hacking and gagging and spitting part . please .'
    
    ##### Processed outputs (Answers):
    - 'well , i thought we would start with pronunciation , if that is okay with you .', <br>
    - 'not the hacking and gagging and spitting part . please .', <br>
    - 'okay . . . then how about we try out some french cuisine . saturday ? night ?'
    <br><br>
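
The pairing logic can be sketched on the example record alone, without reading the corpus files. The snippet below splits the record on the ` +++$+++ ` separator, recovers the line IDs, and pairs each line with the one that follows it. The variable names are illustrative and not taken from the notebook.

```python
SEPARATOR = " +++$+++ "

# Conversation record copied from the example above.
record = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

fields = record.split(SEPARATOR)
# The last field is a Python-style list literal; strip the brackets and
# quotes to recover the bare line IDs.
line_ids = [item.strip("'") for item in fields[3].strip("[]").split(", ")]

# Each utterance is the "question" for the utterance that follows it.
pairs = list(zip(line_ids[:-1], line_ids[1:]))

print(line_ids)
print(pairs)
```

A conversation with four lines therefore yields three question/answer pairs, which matches the three processed inputs and outputs listed above.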

In [8]:
# Map actual conversations from "movie_lines.txt" using Id's from "movie_conversations.txt"
def load_conversations():
  # dictionary of line id to text
  id2line = {}
  with open(path_to_movie_lines, errors='ignore') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    id2line[parts[0]] = parts[4]

  inputs, outputs = [], []
  with open(path_to_movie_conversations, 'r') as file:
    lines = file.readlines()

  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    # get conversation in a list of line ID
    conversation = [line[1:-1] for line in parts[3][1:-1].split(', ')]
    for i in range(len(conversation) - 1):
      inputs.append(preprocess_sentence(id2line[conversation[i]]))
      outputs.append(preprocess_sentence(id2line[conversation[i + 1]]))
      # Maximum number of samples to load for processing
      if len(inputs) >= MAX_SAMPLES:
        return inputs, outputs
  return inputs, outputs


print('Max samples to load set at ==> ', MAX_SAMPLES)
questions, answers = load_conversations()
print('Conversations are mapped and preprocessing completed...')

print('\nSample question ==> {}'.format(questions[0]))
print('Sample answer   ==> {}'.format(answers[0]))
print('\nQuestion and Answer list created !!!')

Max samples to load set at ==>  50000
Conversations are mapped and preprocessing completed...

Sample question ==> can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .
Sample answer   ==> well , i thought we would start with pronunciation , if that is okay with you .

Question and Answer list created !!!


##### step #3:

* Build the tokenizer (each unique subword in the question and answer lists is assigned an integer ID, or `token`).
    #### Examples:
    <i>Note: In this example, 8277 is the ID assigned to `START_TOKEN`, which marks the beginning of a sentence, and 8278 is the ID assigned to `END_TOKEN`, which marks the end of each sentence.</i>

    - Sentence ==> 'can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .'
        - Tokenized sentence ==> 8277, 58, 22, 110, 33, 3055, 23, 956, 8141, 2941, 8053, 2373, 3587, 2260, 14, 7494, 937, 6596, 8053, 17, 431, 82, 5533, 7898, 220, 3081, 3926, 130, 1451, 741, 74, 39, 6, 2109, 8121, 2, 231, 1, 8278
    - Sentence ==> 'well , i thought we would start with pronunciation , if that is okay with you .'
        - Tokenized sentence ==> 8277, 69, 3, 4, 175, 22, 42, 353, 38, 1079, 1587, 4185, 820, 3, 50, 15, 9, 1109, 38, 31, 1, 8278
    - Sentence ==> 'not the hacking and gagging and spitting part . please .'
        - Tokenized sentence ==> 8277, 11, 6, 4388, 88, 14, 3635, 3087, 14, 2803, 1029, 1954, 2, 257, 1, 8278
    - Sentence ==> 'okay . . . then how about we try out some french cuisine . saturday ? night ?'
        - Tokenized sentence ==> 8277, 135, 28, 108, 54, 53, 22, 302, 64, 91, 1833, 986, 4209, 633, 2, 4709, 23, 230, 7, 8278

In [9]:
# Create tokens for question and answer list (Entire corpus)
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    questions + answers, target_vocab_size=2**13)
print('Token creation for corpus completed ...')

# Define Start/End token to indicate the beginning and end of a sentence
START_TOKEN, END_TOKEN = [tokenizer.vocab_size], [tokenizer.vocab_size + 1]
# Vocabulary size includes Start & End token
VOCAB_SIZE = tokenizer.vocab_size + 2

print('Tokenized sample question:\n {} \n {}'.format(questions[0], tokenizer.encode(questions[0])))
print('Tokenized sample answer:\n {} \n {}'.format(answers[0], tokenizer.encode(answers[0])))
print('\nStart token: {}. End Token: {}. Vocabulary size: {}.'.format(START_TOKEN, END_TOKEN, VOCAB_SIZE))

Token creation for corpus completed ...
Tokenized sample question:
 can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again . 
 [58, 22, 110, 33, 3055, 23, 956, 8141, 2941, 8053, 2373, 3587, 2260, 14, 7494, 937, 6596, 8053, 17, 431, 82, 5533, 7898, 220, 3081, 3926, 130, 1451, 741, 74, 39, 6, 2109, 8121, 2, 231, 1]
Tokenized sample answer:
 well , i thought we would start with pronunciation , if that is okay with you . 
 [69, 3, 4, 175, 22, 42, 353, 38, 1079, 1587, 4185, 820, 3, 50, 15, 9, 1109, 38, 31, 1]

Start token: [8277]. End Token: [8278]. Vocabulary size: 8279.


##### step #4:

* Filter out sentence pairs in which either sentence has more than `MAX_LENGTH` tokens.
* Pad tokenized sentences with `0` at the end, up to `MAX_LENGTH` (40 in this case).
    #### Example:
    
    - Question ==> i really , really , really wanna go , but i cannot . not unless my sister goes .
    - Tokenized and Padded Question ==> [8277, 4, 271, 3, 271, 3, 141, 385, 173, 3, 40, 4, 611, 2, 11, 864, 30, 2021, 3086, 1, 8278, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
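
The filter-and-pad step can be reproduced for the example question with plain Python lists. The token IDs below are copied from the example above; `wrap_and_pad` is a hypothetical helper for illustration, whereas the notebook itself uses `pad_sequences` from Keras.

```python
MAX_LENGTH = 40
START_TOKEN, END_TOKEN = [8277], [8278]   # IDs from the example run above

# Token IDs of the example question, without start/end markers.
encoded = [4, 271, 3, 271, 3, 141, 385, 173, 3, 40,
           4, 611, 2, 11, 864, 30, 2021, 3086, 1]

def wrap_and_pad(token_ids, max_length=MAX_LENGTH):
    """Return the padded sequence, or None if it is too long to keep."""
    sequence = START_TOKEN + token_ids + END_TOKEN
    if len(sequence) > max_length:
        return None                      # filtered out
    return sequence + [0] * (max_length - len(sequence))

padded = wrap_and_pad(encoded)
print(len(padded), padded)
print(wrap_and_pad(list(range(1, 60))))  # too long, prints None
```

Note that the length check happens after the start and end tokens are added, so a sentence may contain at most `MAX_LENGTH - 2` subword tokens of its own.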

In [10]:
# Tokenize, filter and pad sentences
def tokenize_and_filter(inputs, outputs):
  tokenized_inputs, tokenized_outputs = [], []
  
  for (sentence1, sentence2) in zip(inputs, outputs):
    # Tokenize sentence
    sentence1 = START_TOKEN + tokenizer.encode(sentence1) + END_TOKEN
    sentence2 = START_TOKEN + tokenizer.encode(sentence2) + END_TOKEN
    # Filter: Drop sentences, if its length exceeds defined max length.
    if len(sentence1) <= MAX_LENGTH and len(sentence2) <= MAX_LENGTH:
      tokenized_inputs.append(sentence1)
      tokenized_outputs.append(sentence2)
  
  # Pad tokenized sentences
  tokenized_inputs = tf.keras.preprocessing.sequence.pad_sequences(
      tokenized_inputs, maxlen=MAX_LENGTH, padding='post')
  tokenized_outputs = tf.keras.preprocessing.sequence.pad_sequences(
      tokenized_outputs, maxlen=MAX_LENGTH, padding='post')
  
  return tokenized_inputs, tokenized_outputs


print('Maximum sentence length defined ==> ', MAX_LENGTH)
questions, answers = tokenize_and_filter(questions, answers)
print('Tokenization of sentences, Filtering of bigger sentences, Padding of sentences, Completed !!!')

print('\nTotal number of questions: {}'.format(len(questions)))
for i in range(0,5):
    print('\nTokenized/Filtered/Padded sample question: \n{}'.format(questions[i]))
    print('Tokenized/Filtered/Padded sample answer: \n{}'.format(answers[i]))

Maximum sentence length defined ==>  40
Tokenization, Filtering of bigger sentences, Padding of sentences, Completed !!!

Total number of questions: 44131

Tokenized/Filtered/Padded sample question: 
[8277   58   22  110   33 3055   23  956 8141 2941 8053 2373 3587 2260
   14 7494  937 6596 8053   17  431   82 5533 7898  220 3081 3926  130
 1451  741   74   39    6 2109 8121    2  231    1 8278    0]
Tokenized/Filtered/Padded sample answer: 
[8277   69    3    4  175   22   42  353   38 1079 1587 4185  820    3
   50   15    9 1109   38   31    1 8278    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]

Tokenized/Filtered/Padded sample question: 
[8277   69    3    4  175   22   42  353   38 1079 1587 4185  820    3
   50   15    9 1109   38   31    1 8278    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]
Tokenized/Filtered/Padded sample answer: 
[8277   11    6 4388   88   14 3635 3087   14 2803 1029 195

### Create `Dataset`

- The [tf.data.Dataset API](https://www.tensorflow.org/api_docs/python/tf/data) is used to construct the input pipeline so that features such as caching and prefetching can speed up the training process.
- The transformer is an auto-regressive model: it makes predictions one part at a time, and uses its output so far to decide what to do next.
- During training, this example uses teacher-forcing. Teacher forcing is passing the true output to the next time step regardless of what the model predicts at the current time step.
- As the transformer predicts each word, self-attention allows it to look at the previous words in the input sequence, to better predict the next word.
- To prevent the model from peeking at the expected output (in decoder self attention), the model uses a look-ahead mask.
- Each training example is divided into:
    - `inputs` --> Input to the encoder (the question)
    - `dec_inputs` --> Input to the decoder (the answer without its last position)
    - `outputs` --> Targets used for calculating loss and accuracy (the answer without `START_TOKEN`)

#### Example:
input[18] = array([8277, 4, 271, 3, 271, 3, 141, 385, 173, 3, 40, 4, 611, 2, ...], dtype=int32)
<br>dec_inputs[17] = array([8277, 4, 271, 3, 271, 3, 141, 385, 173, 3, 40, 4, 611, 2, ...], dtype=int32)
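
Teacher forcing comes down to shifting the answer sequence by one position. The sketch below uses a shortened, made-up answer (the real sequences have 40 positions) to show which prefix the decoder sees at each step and which token it is expected to predict.

```python
# Shortened, padded answer sequence (START=8277, END=8278, PAD=0).
answer = [8277, 69, 3, 4, 8278, 0, 0]

dec_inputs = answer[:-1]   # drop the last position
targets = answer[1:]       # drop START_TOKEN

# At step t the decoder has seen dec_inputs[:t + 1] and must predict targets[t].
for step, target in enumerate(targets):
    print(f"step {step}: sees {dec_inputs[:step + 1]} -> predicts {target}")
```

Both slices have the same length, so every decoder position has exactly one target; the look-ahead mask is what keeps position `t` from reading the later entries of `dec_inputs`.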

In [11]:
# decoder inputs use the previous target as input
# remove START_TOKEN from targets
dataset = tf.data.Dataset.from_tensor_slices((
    {
        'inputs': questions,
        'dec_inputs': answers[:, :-1]
    },
    {
        'outputs': answers[:, 1:]
    },
))

print('\nDataset Details ==> ', dataset)
for idx, element in enumerate(dataset.as_numpy_iterator()):
    print(element)
    if idx == 10:
        break

# Cache, shuffle, batch, and prefetch the dataset
dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)


Dataset Details ==>  <TensorSliceDataset shapes: ({inputs: (40,), dec_inputs: (39,)}, {outputs: (39,)}), types: ({inputs: tf.int32, dec_inputs: tf.int32}, {outputs: tf.int32})>
({'inputs': array([8277,   58,   22,  110,   33, 3055,   23,  956, 8141, 2941, 8053,
       2373, 3587, 2260,   14, 7494,  937, 6596, 8053,   17,  431,   82,
       5533, 7898,  220, 3081, 3926,  130, 1451,  741,   74,   39,    6,
       2109, 8121,    2,  231,    1, 8278,    0], dtype=int32), 'dec_inputs': array([8277,   69,    3,    4,  175,   22,   42,  353,   38, 1079, 1587,
       4185,  820,    3,   50,   15,    9, 1109,   38,   31,    1, 8278,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0], dtype=int32)}, {'outputs': array([  69,    3,    4,  175,   22,   42,  353,   38, 1079, 1587, 4185,
        820,    3,   50,   15,    9, 1109,   38,   31,    1, 8278,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
    

## Attention



### Scaled dot product Attention

<img src="https://www.tensorflow.org/images/tutorials/transformer/scaled_attention.png" width="500" alt="Scaled Dot-product attention">

The scaled dot-product attention function used by the transformer takes three inputs: Q (query), K (key), V (value). The equation used to calculate the attention weights is:

$$\Large{Attention(Q, K, V) = softmax_k(\frac{QK^T}{\sqrt{d_k}}) V} $$

- The softmax is normalized over the `key` axis, so for each `query` the attention weights over all keys sum to 1 and decide how much importance each key receives.
- The output is the product of the attention weights and the `value` vectors. This ensures that the words we want to focus on are kept as is, and the irrelevant words are flushed out.
- The dot-product attention is scaled by the square root of the depth. This is done because, for large values of depth, the dot product grows large in magnitude, pushing the softmax into regions where it has small gradients and resulting in a very hard softmax.
    - For example, consider that the components of `query` and `key` are independent with a mean of 0 and variance of 1. Their dot product then has a mean of 0 and variance of `dk`. Hence, the *square root of `dk`* is used for scaling (and not any other number): after scaling, the dot product of `query` and `key` has a mean of 0 and variance of 1, which gives a gentler softmax.
- The mask is multiplied by *-1e9 (close to negative infinity)*. This is done because the mask is added to the scaled matrix multiplication of `query` and `key` immediately before the softmax. The goal is to zero out these cells, and large negative inputs to softmax are near zero in the output.
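
The variance argument can be checked empirically. The following NumPy sketch draws random query and key vectors with unit-variance components and compares the variance of their dot products before and after dividing by the square root of `d_k`. The depth of 64 and the sample count are arbitrary choices for the illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64            # illustrative depth, not a notebook hyperparameter
num_pairs = 10_000

# Query/key components drawn with mean 0 and variance 1.
q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

dots = np.sum(q * k, axis=1)        # one dot product per (q, k) pair
scaled = dots / np.sqrt(d_k)

print("variance of q.k          :", dots.var())
print("variance after / sqrt(dk):", scaled.var())
```

The first number should land near `d_k` and the second near 1, which is the behaviour the scaling factor is chosen to produce.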

In [12]:
# Definition for Scaled-dot-product-attention
def scaled_dot_product_attention(query, key, value, mask):
  """Calculate the attention weights. """
  matmul_qk = tf.matmul(query, key, transpose_b=True)

  # scale matmul_qk
  depth = tf.cast(tf.shape(key)[-1], tf.float32)
  logits = matmul_qk / tf.math.sqrt(depth)

  # add the mask to zero out padding tokens
  if mask is not None:
    logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k)
  attention_weights = tf.nn.softmax(logits, axis=-1)

  output = tf.matmul(attention_weights, value)

  return output
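
The function above returns only the attention output, so the weights are not directly visible. As a way to inspect them, here is a NumPy sketch of the same computation that also returns the weights. `attention_np` and the random inputs are illustrative and assume a mask in which 1 marks a position to hide, as in this notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(query, key, value, mask=None):
    """NumPy sketch that returns both the output and the attention weights."""
    depth = key.shape[-1]
    logits = query @ key.swapaxes(-1, -2) / np.sqrt(depth)
    if mask is not None:
        logits = logits + mask * -1e9     # masked positions get ~0 weight
    weights = softmax(logits, axis=-1)
    return weights @ value, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 3, 4))        # (batch, seq_len_q, depth)
k = rng.standard_normal((1, 5, 4))        # (batch, seq_len_k, depth)
v = rng.standard_normal((1, 5, 4))
mask = np.array([[[0., 0., 1., 0., 1.]]]) # 1 marks a padded key position

output, weights = attention_np(q, k, v, mask)
print(output.shape)                        # (1, 3, 4)
print(np.round(weights[0], 3))
```

Each row of the printed weights sums to 1, and the columns for the two masked key positions are zero, so those keys contribute nothing to the output.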

### Multi-head attention (Calls --> Scaled Dot-Product attention)

<img src="https://www.tensorflow.org/images/tutorials/transformer/multi_head_attention.png" width="500" alt="multi-head attention">

- Multi-head attention consists of four parts:
    * Linear layers and split of heads.
    * Scaled dot-product attention.
    * Concatenation of heads.
    * Final linear layer.
- Each multi-head attention block gets three inputs: Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads.
- The `scaled_dot_product_attention` defined above is applied to each head (broadcast for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using `tf.transpose` and `tf.reshape`) and put through a final `Dense` layer.
- Instead of one single attention head, `query`, `key`, and `value` are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split, each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.
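
The split and concatenation steps are pure reshapes and transposes, which a shape-only NumPy sketch makes concrete. The batch size and sequence length below are arbitrary; `d_model` and `num_heads` match the hyperparameters of this notebook.

```python
import numpy as np

batch_size, seq_len = 2, 5
d_model, num_heads = 256, 8          # values from the hyperparameter cell
depth = d_model // num_heads

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Concatenate: undo the transpose, then merge heads back into d_model.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape)    # (2, 8, 5, 32)
print(merged.shape)   # (2, 5, 256)
```

Merging recovers the original tensor exactly, which confirms that splitting into heads only rearranges the features and adds no parameters of its own.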

In [13]:
# Multi-Head Attention class definition
class MultiHeadAttention(tf.keras.layers.Layer):

  def __init__(self, d_model, num_heads, name="multi_head_attention"):
    super(MultiHeadAttention, self).__init__(name=name)
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.query_dense = tf.keras.layers.Dense(units=d_model)
    self.key_dense = tf.keras.layers.Dense(units=d_model)
    self.value_dense = tf.keras.layers.Dense(units=d_model)

    self.dense = tf.keras.layers.Dense(units=d_model)
  
  def get_config(self):
    config = super(MultiHeadAttention, self).get_config()
    config.update({
        'num_heads': self.num_heads,
        'd_model': self.d_model,
    })
    return config

  def split_heads(self, inputs, batch_size):
    inputs = tf.keras.layers.Lambda(lambda inputs:tf.reshape(
        inputs, shape=(batch_size, -1, self.num_heads, self.depth)))(inputs)
    return tf.keras.layers.Lambda(lambda inputs: tf.transpose(inputs, perm=[0, 2, 1, 3]))(inputs)

  # Call each layer in multi-head attention in order
  def call(self, inputs):
    query, key, value, mask = inputs['query'], inputs['key'], inputs[
        'value'], inputs['mask']
    batch_size = tf.shape(query)[0]

    # linear layers
    query = self.query_dense(query)
    key = self.key_dense(key)
    value = self.value_dense(value)

    # split heads
    query = self.split_heads(query, batch_size)
    key = self.split_heads(key, batch_size)
    value = self.split_heads(value, batch_size)

    # scaled dot-product attention
    scaled_attention = scaled_dot_product_attention(query, key, value, mask)
    scaled_attention = tf.keras.layers.Lambda(lambda scaled_attention: tf.transpose(
        scaled_attention, perm=[0, 2, 1, 3]))(scaled_attention)

    # concatenation of heads
    concat_attention = tf.keras.layers.Lambda(lambda scaled_attention: tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model)))(scaled_attention)

    # final linear layer
    outputs = self.dense(concat_attention)

    return outputs    

## Transformer

### Transformer Model


This transformer model is modeled after the model proposed by Vaswani et al. (2017).

<img src="https://www.tensorflow.org/images/tutorials/transformer/transformer.png" width="500" alt="Transformer Model">

### Step1: Create masking definitions:

`create_padding_mask` and `create_look_ahead_mask` are helper functions for creating masks that hide padded tokens and future tokens, respectively.
These helper functions are used as `tf.keras.layers.Lambda` layers.
- Mask all the pad tokens (value `0`) in the batch to ensure the model does not treat padding as input.

In [15]:
# Definition to mask the padded zeros. Output is encoded to "1" for padded "0"s
def create_padding_mask(x):
  mask = tf.cast(tf.math.equal(x, 0), tf.float32)
  # (batch_size, 1, 1, sequence length)
  return mask[:, tf.newaxis, tf.newaxis, :]

print(create_padding_mask(tf.constant([[1, 2, 0, 3, 0], [0, 0, 0, 4, 5]])))

tf.Tensor(
[[[[0. 0. 1. 0. 1.]]]


 [[[1. 1. 1. 0. 0.]]]], shape=(2, 1, 1, 5), dtype=float32)


- The look-ahead mask hides the future tokens in a sequence.
    - Pad tokens are also masked out. For example, to predict the third word, only the first and second words are used.

In [16]:
# Definition to mask the future tokens to facilitate decoder learning
def create_look_ahead_mask(x):
  seq_len = tf.shape(x)[1]
  look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
  padding_mask = create_padding_mask(x)
  return tf.maximum(look_ahead_mask, padding_mask)

print(create_look_ahead_mask(tf.constant([[1, 2, 0, 4, 5]])))

tf.Tensor(
[[[[0. 1. 1. 1. 1.]
   [0. 0. 1. 1. 1.]
   [0. 0. 1. 1. 1.]
   [0. 0. 1. 0. 1.]
   [0. 0. 1. 0. 0.]]]], shape=(1, 1, 5, 5), dtype=float32)


### Step2: Positional encoding

Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence. 

The positional encoding vector is added to the embedding vector. Embeddings represent a token in a d-dimensional space where tokens with similar meaning will be closer to each other. But the embeddings do not encode the relative position of words in a sentence. So after adding the positional encoding, words will be closer to each other based on the *similarity of their meaning and their position in the sentence*, in the d-dimensional space. See the notebook on [positional encoding](https://github.com/tensorflow/examples/blob/master/community/en/position_encoding.ipynb) to learn more about it. 

The formula for calculating the positional encoding is as follows:

$$\Large{PE_{(pos, 2i)} = sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = cos(pos / 10000^{2i / d_{model}})} $$
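
A direct NumPy transcription of the two formulas is sketched below, with sine values in the even columns and cosine values in the odd columns. `positional_encoding_np` is a hypothetical helper for illustration; the class in this notebook concatenates the sine and cosine halves instead of interleaving them, so its column order differs from this sketch.

```python
import numpy as np

def positional_encoding_np(num_positions, d_model):
    """Interleaved sin/cos encoding, following the formula term by term."""
    pos = np.arange(num_positions)[:, np.newaxis]          # (positions, 1)
    i = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    # Columns 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even columns
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd columns
    return pe

pe = positional_encoding_np(num_positions=50, d_model=8)
print(pe.shape)                 # (50, 8)
print(np.round(pe[0], 3))       # position 0: sin terms are 0, cos terms are 1
```

The small `d_model` of 8 is chosen only to keep the printed row readable.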

In [17]:
# Class definition for positional Encoding
class PositionalEncoding(tf.keras.layers.Layer):

  def __init__(self, position, d_model):
    super(PositionalEncoding, self).__init__()
    # Keep the constructor arguments so that get_config() can return them
    self.position = position
    self.d_model = d_model
    self.pos_encoding = self.positional_encoding(position, d_model)

  def get_config(self):
    config = super(PositionalEncoding, self).get_config()
    config.update({
        'position': self.position,
        'd_model': self.d_model,
    })
    return config

  def get_angles(self, position, i, d_model):
    angles = 1 / tf.pow(10000, (2 * (i // 2)) / tf.cast(d_model, tf.float32))
    return position * angles

  def positional_encoding(self, position, d_model):
    angle_rads = self.get_angles(
        position=tf.range(position, dtype=tf.float32)[:, tf.newaxis],
        i=tf.range(d_model, dtype=tf.float32)[tf.newaxis, :],
        d_model=d_model)
    # apply sin to even index in the array
    sines = tf.math.sin(angle_rads[:, 0::2])
    # apply cos to odd index in the array
    cosines = tf.math.cos(angle_rads[:, 1::2])

    pos_encoding = tf.concat([sines, cosines], axis=-1)
    pos_encoding = pos_encoding[tf.newaxis, ...]
    return tf.cast(pos_encoding, tf.float32)

  def call(self, inputs):
    return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

In [33]:
# sample_pos_encoding = PositionalEncoding(50, 512)

# plt.pcolormesh(sample_pos_encoding.pos_encoding.numpy()[0], cmap='RdBu')
# plt.xlabel('Depth')
# plt.xlim((0, 512))
# plt.ylabel('Position')
# plt.colorbar()
# plt.show()

### Step3: Create Encoder Layer (Single Layer)

- Each encoder layer consists of sublayers:
    1. Multi-head attention (with padding mask) 
    2. Two dense layers followed by dropout
<br><br>

- Sublayer:
    - Each of these sublayers has a residual connection around it followed by a layer normalization. 
        - Residual connections help in avoiding the vanishing gradient problem in deep networks.
    - The output of each sublayer is `LayerNorm(x + Sublayer(x))`. 
        - The normalization is done on the `d_model` (last) axis.
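
The `LayerNorm(x + Sublayer(x))` pattern can be sketched in NumPy as follows. This is a simplified illustration: the helper names `layer_norm` and `add_and_norm` are hypothetical, the sublayer is a stand-in ReLU projection, and the learnable scale and offset that `tf.keras.layers.LayerNormalization` also applies are omitted.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize over the last axis (no learned scale/offset in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))           # (batch, seq_len, d_model)
w = rng.normal(size=(16, 16)) * 0.1       # stand-in weights for a sublayer
out = add_and_norm(x, lambda t: np.maximum(t @ w, 0.0))

print(out.shape)  # (2, 5, 16)
```

Because the residual path adds `x` directly to the sublayer output, the sublayer must return a tensor with the same last dimension (`d_model`) as its input.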

In [32]:
# Definition for Encoder layer (Single layer - N1)
def encoder_layer(units, d_model, num_heads, dropout, name="encoder_layer"):
  
  inputs = tf.keras.Input(shape=(None, d_model), name="inputs")
  # Create padding mask
  padding_mask = tf.keras.Input(shape=(1, 1, None), name="padding_mask")

  # Call for Multi-head attention
  attention = MultiHeadAttention(
      d_model, num_heads, name="attention")({
          'query': inputs,
          'key': inputs,
          'value': inputs,
          'mask': padding_mask
      })
  attention = tf.keras.layers.Dropout(rate=dropout)(attention)
  add_attention = tf.keras.layers.add([inputs,attention])
  attention = tf.keras.layers.LayerNormalization(epsilon=1e-6)(add_attention)

  # Two dense layers followed by dropout
  outputs = tf.keras.layers.Dense(units=units, activation='relu')(attention)
  outputs = tf.keras.layers.Dense(units=d_model)(outputs)
  outputs = tf.keras.layers.Dropout(rate=dropout)(outputs)
  add_attention = tf.keras.layers.add([attention,outputs])
  outputs = tf.keras.layers.LayerNormalization(epsilon=1e-6)(add_attention)

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

In [34]:
# sample_encoder_layer = encoder_layer(
    # units=512,
    # d_model=128,
    # num_heads=4,
    # dropout=0.3,
    # name="sample_encoder_layer")

# tf.keras.utils.plot_model(sample_encoder_layer, to_file='encoder_layer.png', show_shapes=True)

### Step4: Create Encoder (Consists of Nx Layers)

The Encoder consists of:
1.   Input Embedding
2.   Positional Encoding
3.   `num_layers` encoder layers (Nx)

Explanation:
- The input is put through an embedding which is summed with the positional encoding. 
- The output of this summation is the input to the encoder layers. 
- The output of the encoder is the input to the decoder.
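
The first two steps can be sketched in NumPy as shown below. The embedding table here is random and only stands in for a trained `Embedding` layer; the token ids and sizes are made up for illustration. The scaling by `sqrt(d_model)` and the sine/cosine column layout mirror the notebook code.

```python
import numpy as np

vocab_size, d_model, seq_len = 100, 8, 5
rng = np.random.default_rng(0)

embedding_table = rng.normal(size=(vocab_size, d_model))   # stand-in for a trained Embedding layer
token_ids = np.array([[5, 17, 42, 0, 0]])                   # (batch=1, seq_len), 0 = padding

# Look up embeddings and scale them by sqrt(d_model)
x = embedding_table[token_ids] * np.sqrt(d_model)           # (1, seq_len, d_model)

# Positional encoding laid out as [all sines | all cosines]
pos = np.arange(seq_len)[:, np.newaxis]
i = np.arange(d_model // 2)[np.newaxis, :]
angles = pos / np.power(10000.0, 2 * i / d_model)
pos_encoding = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

encoder_input = x + pos_encoding[np.newaxis, ...]            # broadcast over the batch
print(encoder_input.shape)  # (1, 5, 8)
```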

In [35]:
# Definition for Encoder (Consists of Nx Layers)
def encoder(vocab_size,
            num_layers,
            units,
            d_model,
            num_heads,
            dropout,
            name="encoder"):
  inputs = tf.keras.Input(shape=(None,), name="inputs")
  # Create padding mask
  padding_mask = tf.keras.Input(shape=(1, 1, None), name="padding_mask")

  # Create word embeddings
  embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(inputs)
  embeddings *= tf.keras.layers.Lambda(lambda d_model: tf.math.sqrt(tf.cast(d_model, tf.float32)))(d_model)
  embeddings = PositionalEncoding(vocab_size,d_model)(embeddings)

  outputs = tf.keras.layers.Dropout(rate=dropout)(embeddings)

  # Loop through for Nx Layers
  for i in range(num_layers):
    outputs = encoder_layer(
        units=units,
        d_model=d_model,
        num_heads=num_heads,
        dropout=dropout,
        name="encoder_layer_{}".format(i),
    )([outputs, padding_mask])

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

In [36]:
# sample_encoder = encoder(
#     vocab_size=8192,
#     num_layers=2,
#     units=512,
#     d_model=128,
#     num_heads=4,
#     dropout=0.3,
#     name="sample_encoder")

# tf.keras.utils.plot_model(
#    sample_encoder, to_file='encoder.png', show_shapes=True)

### Step5: Create decoder layer (Single Layer)

Each decoder layer consists of sublayers:

1.   Masked multi-head attention (with look ahead mask and padding mask)
2.   Multi-head attention (with padding mask). 
    - `value` and `key` receive the *encoder output* as inputs. 
    - `query` receives the *output from the masked multi-head attention sublayer.*
3.   Two dense layers followed by dropout

Description of the sublayers:
- Each of these sublayers has a residual connection around it followed by a layer normalization. 
- The output of each sublayer is `LayerNorm(x + Sublayer(x))`. The normalization is done on the `d_model` (last) axis.

Description of Multi-Head Attention:
- As `query` receives the output from the decoder's first attention block and `key` receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output. See the demonstration above in the scaled dot product attention section.
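
A minimal single-head sketch of this second attention block is shown below. The function `cross_attention` is a hypothetical helper, not the notebook's `MultiHeadAttention` layer, and the mask convention (1.0 marks a padded encoder position) is an assumption of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value, padding_mask=None):
    """Single-head scaled dot-product attention.

    query: (tgt_len, depth) from the decoder
    key, value: (src_len, depth) from the encoder
    padding_mask: (src_len,) with 1.0 at padded encoder positions
    """
    depth = query.shape[-1]
    logits = query @ key.T / np.sqrt(depth)          # (tgt_len, src_len)
    if padding_mask is not None:
        logits = logits + padding_mask * -1e9        # suppress padded positions
    weights = softmax(logits, axis=-1)
    return weights @ value, weights

rng = np.random.default_rng(0)
dec_states = rng.normal(size=(3, 4))    # 3 decoder positions
enc_outputs = rng.normal(size=(5, 4))   # 5 encoder positions, last two are padding
mask = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

context, weights = cross_attention(dec_states, enc_outputs, enc_outputs, mask)
print(weights.shape)         # (3, 5)
print(weights.sum(axis=-1))  # each row sums to 1
```

Each row of `weights` belongs to one decoder position and distributes its attention over the encoder positions, which is why the weight matrix has shape `(tgt_len, src_len)`.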

In [37]:
# Definition for decoder layer (Single layer)
def decoder_layer(units, d_model, num_heads, dropout, name="decoder_layer"):
  inputs = tf.keras.Input(shape=(None, d_model), name="inputs")
  enc_outputs = tf.keras.Input(shape=(None, d_model), name="encoder_outputs")
  
  # Look ahead masking and padded masking
  look_ahead_mask = tf.keras.Input(
      shape=(1, None, None), name="look_ahead_mask")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')

  # Masked Multi-head attention
  attention1 = MultiHeadAttention(
      d_model, num_heads, name="attention_1")(inputs={
          'query': inputs,
          'key': inputs,
          'value': inputs,
          'mask': look_ahead_mask
      })
  add_attention = tf.keras.layers.add([attention1,inputs])    
  attention1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)(add_attention)

  # Multi-Head attention
  attention2 = MultiHeadAttention(
      d_model, num_heads, name="attention_2")(inputs={
          'query': attention1,
          'key': enc_outputs,
          'value': enc_outputs,
          'mask': padding_mask
      })
  attention2 = tf.keras.layers.Dropout(rate=dropout)(attention2)
  add_attention = tf.keras.layers.add([attention2,attention1])
  attention2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)(add_attention)

  # Two dense layer and dropout
  outputs = tf.keras.layers.Dense(units=units, activation='relu')(attention2)
  outputs = tf.keras.layers.Dense(units=d_model)(outputs)
  outputs = tf.keras.layers.Dropout(rate=dropout)(outputs)
  add_attention = tf.keras.layers.add([outputs,attention2])
  outputs = tf.keras.layers.LayerNormalization(epsilon=1e-6)(add_attention)

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

In [38]:
# sample_decoder_layer = decoder_layer(
#     units=512,
#     d_model=128,
#     num_heads=4,
#     dropout=0.3,
#     name="sample_decoder_layer")

# tf.keras.utils.plot_model(
#     sample_decoder_layer, to_file='decoder_layer.png', show_shapes=True)

### Step6: Create decoder (Consists of Nx layers)

The Decoder consists of:
1. Output Embedding
2. Positional Encoding
3. N decoder layers

Explanation:
- The target is put through an embedding which is summed with the positional encoding. 
- The output of this summation is the input to the decoder layers. 
- The output of the decoder is the input to the final linear layer.

In [39]:
# Definition of Decoder (Consists of Nx layers)
def decoder(vocab_size,
            num_layers,
            units,
            d_model,
            num_heads,
            dropout,
            name='decoder'):
  inputs = tf.keras.Input(shape=(None,), name='inputs')
  enc_outputs = tf.keras.Input(shape=(None, d_model), name='encoder_outputs')
  look_ahead_mask = tf.keras.Input(
      shape=(1, None, None), name='look_ahead_mask')
  padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')
  
  # Word embeddings for input to decoder layer
  embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(inputs)
  embeddings *= tf.keras.layers.Lambda(lambda d_model: tf.math.sqrt(tf.cast(d_model, tf.float32)))(d_model)
  embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)

  outputs = tf.keras.layers.Dropout(rate=dropout)(embeddings)

  # Loop through Nx Layers
  for i in range(num_layers):
    outputs = decoder_layer(
        units=units,
        d_model=d_model,
        num_heads=num_heads,
        dropout=dropout,
        name='decoder_layer_{}'.format(i),
    )(inputs=[outputs, enc_outputs, look_ahead_mask, padding_mask])

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

In [40]:
# sample_decoder = decoder(
#     vocab_size=8192,
#     num_layers=2,
#     units=512,
#     d_model=128,
#     num_heads=4,
#     dropout=0.3,
#     name="sample_decoder")

# tf.keras.utils.plot_model(
#     sample_decoder, to_file='decoder.png', show_shapes=True)

### Step7: Create transformer

Transformer consists of the encoder, decoder and a final linear layer. The output of the decoder is the input to the linear layer and its output is returned.

In [41]:
# definition for transformer
def transformer(vocab_size,
                num_layers,
                units,
                d_model,
                num_heads,
                dropout,
                name="transformer"):
  
  # Inputs to Encoder and Decoder
  inputs = tf.keras.Input(shape=(None,), name="inputs")
  dec_inputs = tf.keras.Input(shape=(None,), name="dec_inputs")

  # Padding mask for encoder
  enc_padding_mask = tf.keras.layers.Lambda(
      create_padding_mask, output_shape=(1, 1, None),
      name='enc_padding_mask')(inputs)
  
  # mask the future tokens for decoder inputs at the 1st attention block
  look_ahead_mask = tf.keras.layers.Lambda(
      create_look_ahead_mask,
      output_shape=(1, None, None),
      name='look_ahead_mask')(dec_inputs)
  
  # mask the encoder outputs for the 2nd attention block
  dec_padding_mask = tf.keras.layers.Lambda(
      create_padding_mask, output_shape=(1, 1, None),
      name='dec_padding_mask')(inputs)

  # Call Encoder with Encoder input (inputs)
  # Output of encoder = enc_outputs
  enc_outputs = encoder(
      vocab_size=vocab_size,
      num_layers=num_layers,
      units=units,
      d_model=d_model,
      num_heads=num_heads,
      dropout=dropout,
  )(inputs=[inputs, enc_padding_mask])

  # Call Decoder with Decoder input (dec_inputs)
  # Also, Encoder output (enc_outputs), look-ahead-mask & padding-mask
  dec_outputs = decoder(
      vocab_size=vocab_size,
      num_layers=num_layers,
      units=units,
      d_model=d_model,
      num_heads=num_heads,
      dropout=dropout,
  )(inputs=[dec_inputs, enc_outputs, look_ahead_mask, dec_padding_mask])

  # Final linear layer
  outputs = tf.keras.layers.Dense(units=vocab_size, name="outputs")(dec_outputs)

  return tf.keras.Model(inputs=[inputs, dec_inputs], outputs=outputs, name=name)

In [42]:
# sample_transformer = transformer(
#     vocab_size=8192,
#     num_layers=4,
#     units=512,
#     d_model=128,
#     num_heads=4,
#     dropout=0.3,
#     name="sample_transformer")

# tf.keras.utils.plot_model(
#     sample_transformer, to_file='transformer.png', show_shapes=True)

## Train model

### Step1: Create loss function (Input to model)

Since the target sequences are padded, it is important to apply a padding mask when calculating the loss.
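
The effect of the padding mask can be illustrated with a small NumPy sketch. The helper `masked_sparse_cross_entropy` is hypothetical and assumes that token id 0 is the padding id. It also contrasts two ways of averaging: over all positions, as `tf.reduce_mean` does in the notebook's `loss_function`, and over real tokens only, which is a common alternative.

```python
import numpy as np

def masked_sparse_cross_entropy(y_true, logits, pad_id=0):
    """Return per-token losses with padded positions zeroed, plus the mask."""
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Pick the log-probability of the true token at every position
    token_loss = -np.take_along_axis(log_probs, y_true[..., np.newaxis], axis=-1)[..., 0]

    mask = (y_true != pad_id).astype(np.float64)
    return token_loss * mask, mask

y_true = np.array([[3, 1, 0, 0]])        # two real tokens, two padding tokens
logits = np.zeros((1, 4, 5))             # uniform prediction over a 5-token vocabulary

loss, mask = masked_sparse_cross_entropy(y_true, logits)
mean_all = loss.mean()                   # averages over padded positions too
mean_real = loss.sum() / mask.sum()      # averages over real tokens only
print(mean_all, mean_real)
```

With uniform logits each real token contributes `ln(5)`, so averaging over all four positions halves the reported loss compared with averaging over the two real tokens.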

In [43]:
# Definition for loss function
def loss_function(y_true, y_pred):
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
  
  loss = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction='none')(y_true, y_pred)

  mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
  loss = tf.multiply(loss, mask)

  return tf.reduce_mean(loss)

### Step2: Custom learning rate (Input to model)

Use the Adam optimizer with a custom learning rate scheduler according to the formula in the [paper](https://arxiv.org/abs/1706.03762).

$$\Large{lrate = d_{model}^{-0.5} * min(step{\_}num^{-0.5}, step{\_}num * warmup{\_}steps^{-1.5})}$$
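
To get a feel for the formula, it can be evaluated in plain Python. The function name `transformer_lr` is hypothetical, and the step is assumed to start at 1 because `step ** -0.5` is undefined at 0.

```python
def transformer_lr(step, d_model, warmup_steps=4000):
    """Learning rate from the formula above for a step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

d_model = 128
for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step, d_model))
```

The rate grows linearly during the first `warmup_steps` steps, peaks at `step == warmup_steps`, and then decays proportionally to the inverse square root of the step number.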

In [44]:
# Class definition for custom learning rate
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    # Keep plain Python values so get_config() stays serializable
    self.d_model = float(d_model)
    self.warmup_steps = warmup_steps

  def get_config(self):
    return {"d_model": self.d_model, "warmup_steps": self.warmup_steps}

  def __call__(self, step):
    # The optimizer may pass an integer step; cast before taking rsqrt
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps**-1.5)

    return tf.math.multiply(tf.math.rsqrt(self.d_model), tf.math.minimum(arg1, arg2))

In [50]:
# sample_learning_rate = CustomSchedule(d_model=128)

# plt.plot(sample_learning_rate(tf.range(200000, dtype=tf.float32)))
# plt.ylabel("Learning Rate")
# plt.xlabel("Train Step")

### Step3: Initialize and compile model (Model creation)

Initialize and compile model with our predefined custom learning rate and Adam optimizer under the strategy scope.

In [46]:
# Code to compile the transformer model.

# clear backend
tf.keras.backend.clear_session()

# Information on learning rate
learning_rate = CustomSchedule(D_MODEL)

# Definition for optimizer (Uses learning rate as input)
optimizer = tf.keras.optimizers.Adam(
    learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

# Definition for model accuracy - Used for metrics
def accuracy(y_true, y_pred):
  # ensure labels have shape (batch_size, MAX_LENGTH - 1)
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
  return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)

# initialize and compile model within strategy scope
with strategy.scope():
  model = transformer(
      vocab_size=VOCAB_SIZE,
      num_layers=NUM_LAYERS,
      units=UNITS,
      d_model=D_MODEL,
      num_heads=NUM_HEADS,
      dropout=DROPOUT)

  # Compile the model
  model.compile(optimizer=optimizer, loss=loss_function, metrics=[accuracy])

# Display model summary
model.summary()

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 inputs (InputLayer)            [(None, None)]       0           []                               
                                                                                                  
 dec_inputs (InputLayer)        [(None, None)]       0           []                               
                                                                                                  
 enc_padding_mask (Lambda)      (None, 1, 1, None)   0           ['inputs[0][0]']                 
                                                                                                  
 encoder (Functional)           (None, None, 256)    3173632     ['inputs[0][0]',                 
                                                                  'enc_padding_mask[0][0

### Step4: Fit model (Training)

Train our transformer by simply calling `model.fit()`

In [None]:
model.fit(dataset, epochs=EPOCHS)
# model.fit(dataset, epochs=5)

## Evaluate and predict

The following steps are used for evaluation:

* Apply the same preprocessing method we used to create our dataset for the input sentence.
* Tokenize the input sentence and add `START_TOKEN` and `END_TOKEN`. 
* Calculate the padding masks and the look ahead masks.
* The decoder then outputs the predictions by looking at the encoder output and its own output.
* Select the last word and calculate the argmax of that.
* Concatenate the predicted word to the decoder input and pass it to the decoder.
* In this approach, the decoder predicts the next word based on the previous words it predicted.

Note: The model used here has limited capacity and was trained on a subset of the full dataset, so its performance can be improved further.
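
The decoding loop described in the steps above can be sketched in plain Python. Everything here is a simplified stand-in: `greedy_decode` and `toy_scores` are hypothetical names, and the toy scoring function replaces the trained model so that the loop can be run on its own.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Greedy decoding: repeatedly append the highest-scoring next token.

    next_token_scores: hypothetical callable mapping the tokens decoded so far
    to a list of scores, one per vocabulary entry.
    """
    output = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(output)
        predicted_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if predicted_id == end_id:
            break
        output.append(predicted_id)
    return output[1:]  # drop the start token

# Toy stand-in for the trained model: always favors (last token + 1)
VOCAB, START, END = 6, 0, 5

def toy_scores(tokens):
    scores = [0.0] * VOCAB
    scores[min(tokens[-1] + 1, END)] = 1.0
    return scores

print(greedy_decode(toy_scores, START, END, max_length=10))  # [1, 2, 3, 4]
```

Greedy decoding always commits to the single most likely token at each step; the `max_length` bound ensures the loop ends even if the end token is never predicted.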

### Step1: Definition for Evaluation & Prediction

In [None]:
# Definition for evaluation of the model

# Same steps need to be performed for the input sentence as in training
# Otherwise, the prediction will not work
def evaluate(sentence):
  sentence = preprocess_sentence(sentence)

  # Tokenize the input sentence
  sentence = tf.expand_dims(
      START_TOKEN + tokenizer.encode(sentence) + END_TOKEN, axis=0)

  # Initialize the first output with start token
  output = tf.expand_dims(START_TOKEN, 0)

  # Loop through predicted words until the End-token is reached
  for i in range(MAX_LENGTH):
    predictions = model(inputs=[sentence, output], training=False)

    # select the last word from the seq_len dimension
    predictions = predictions[:, -1:, :]
    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # return the result if the predicted_id is equal to the end token
    if tf.equal(predicted_id, END_TOKEN[0]):
      break

    # concatenate the predicted_id to the output, which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0)


# Definition for prediction (Calls evaluate())
def predict(sentence):
  prediction = evaluate(sentence)

  # Decode the predicted output into a human-readable sentence
  predicted_sentence = tokenizer.decode(
      [i for i in prediction if i < tokenizer.vocab_size])

  print('Input: {}'.format(sentence))
  print('Output: {}'.format(predicted_sentence))

  return predicted_sentence

### Step2: Testing the model (Calls Evaluate/Predict)

Testing the model. 
- Hit "_Enter_" to predict an output for an input sentence
- Hit "_Q_" anytime to quit the program

In [None]:
# Sample testing
output = predict('Where have you been?')

# Test using user input
while True:
    input_option = input('Hit "Enter" to continue; "Q" to quit;  ==> ')
    if input_option == "Q":
        break
    else:
        predict(input('Enter your question ==> '))

Hit "Enter" to continue; "Q" to quit;  ==> 
Enter your question ==> Where have you been?
Input: Where have you been?
Output: i do not know , i just came up with the military . . .
Hit "Enter" to continue; "Q" to quit;  ==> 
Enter your question ==> it is frustrating
Input: it is frustrating
Output: and your head ?
Hit "Enter" to continue; "Q" to quit;  ==> i am hungry
Enter your question ==> i am hungry
Input: i am hungry
Output: not so late . you are so smart to see if you are trying to get me a job , i want to get involved with your own people . jeez , what do you want you need ?
Hit "Enter" to continue; "Q" to quit;  ==> Q


In [None]:
# feed the model with its previous output
sentence = 'I am not crazy, my mother had me tested.'
for _ in range(5):
  sentence = predict(sentence)
  print('')

Input: I am not crazy, my mother had me tested.
Output: you don t have a choice .

Input: you don t have a choice .
Output: i don t have a choice .

Input: i don t have a choice .
Output: what s wrong ?

Input: what s wrong ?
Output: you don t have to go . you don t understand

Input: you don t have to go . you don t understand
Output: i don t know . i m not sure that i do .



## Conclusion

The execution experience is summarized in the steps below.

1. The required libraries were imported successfully, including TensorFlow version 2.7.0.
2. The Google Colab TPU strategy configuration was initialized successfully. 8 TPU cores were secured with one worker node.
3. Hyperparameters were initialized successfully.
4. The zipped Cornell-Movie corpus file was imported from the "www.cs.cornell.edu" website using the Keras file utility. The "Movie_lines.txt" and "Movie_conversations.txt" files were extracted and their file paths were defined.
5. Conversations were created from the corpus using the two input files, with pre-processing steps applied. A list of 50,000 conversations was created.
6. The corpus was tokenized successfully. The resulting vocabulary size was 8279.
7. Tokens were applied to the question-and-answer list (conversations). Sentences shorter than the maximum length of 40 were padded. The total number of questions was 44131.
8. The dataset was created successfully as 3 dictionary objects.
    - Encoder inputs
    - Decoder inputs
    - Outputs (answers)
9. Definitions for self-attention, multi-head attention, masking of padded zero tokens, look-ahead masks, and positional encoding were created successfully.
10. Definitions for the encoder layer, encoder unit (Nx encoder layers), decoder layer, decoder unit (Nx decoder layers), and finally the Transformer were created successfully.
11. The Transformer model was compiled successfully using the Adam optimizer with a custom learning rate, a loss function definition, and a metrics definition.
12. Model training was executed successfully for 40 epochs. Each epoch took about 5 seconds to train.
<br>`Note: The TPU configuration must be enabled successfully. Otherwise, each epoch could take several minutes to train.`
13. Definitions for prediction, which include pre-processing, tokenizing, padding, and the call to the model, were created successfully.
14. The prediction phase was interesting and promising. The answers are sometimes relevant to the questions and sometimes not; even when they are not relevant, the model is still able to create meaningful sentences.
15. A sample prediction was run at the end by feeding an example question and then feeding each answer back as the next question. The answers were found to be interesting.

## References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In _Advances in neural information processing systems_ (pp. 5998-6008). https://arxiv.org/abs/1706.03762

TensorFlow. (2022). _Text generation with an RNN_. TensorFlow. https://tensorflow.org/alpha/tutorials/text/text_generation

TensorFlow. (2022). _Neural machine translation with attention_. TensorFlow. https://www.tensorflow.org/alpha/tutorials/text/nmt_with_attention

TensorFlow. (2022). _Transformer model for language understanding_. https://www.tensorflow.org/alpha/tutorials/text/transformer

Bryan. L. (2022). _Transformer Chatbot_. GitHub. https://github.com/bryanlimy/tf2-transformer-chatbot/blob/master/tf2_tpu_transformer_chatbot.ipynb