# <center> MSBA 6460: Advanced AI for Business Applications </center>
<center> Summer 2022, Mochen Yang </center>

## <center> Sequence to Sequence Model </center>

# Table of Contents
1. [Setup](#setup)
1. [Seq2Seq Model and the Encoder-Decoder Architecture](#encode_decode)
    - [Sequence-to-Sequence (Seq2Seq) Modeling](#encode_decode_s2s)
    - [Basic Ideas of the Encoder-Decoder Architecture](#encode_decode_architecture)
    - [Technical Details of the Encoder-Decoder Architecture](#encode_decode_tech)
    - [Implement Encoder-Decoder Architecture in Keras](#encode_decode_implement)
1. [Attention Mechanism](#attention)
    - [What is it and Why do We Need it?](#attention_motivation)
    - [Technical Details of Attention Mechanism](#attention_tech)
    - [Other Types of Attention](#attention_other)
    - [Implement Attention Mechanism in Keras](#attention_implement)
1. [Transformer](#transformer)
    - [What is the Transformer Architecture?](#transformer_intro)
    - [Key Components of Transformer](#transformer_components)
    - [Other Components of Transformer and its Implementation](#transformer_other)
1. [Application Case: BERT](#bert)
    - [What is BERT?](#bert_intro)
    - [Use BERT](#bert_example)
1. [Application Case: GPT](#gpt)
1. [Additional Resources](#resource)

# Setup <a name="setup"></a>

We will try out the models and architectures covered in this notebook on a machine translation task, using the [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/) dataset. This dataset contains pairs of corresponding sentences in two different languages. For all the demos below, I will specifically take the English-Spanish dataset (you can download it from Canvas). You are encouraged to try out other language pairs (though some languages may require special preprocessing steps). A lot of the code in this notebook is adapted from the TensorFlow tutorial [Neural machine translation with attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention).

Before building any models, we need to complete a few pre-processing steps, including:
1. Read in the dataset, parse each line, and store the corresponding sentences in English and Spanish into two numpy arrays;
2. For each sentence, tokenize it and represent it as a sequence of integer indices (just like what we did in the text classification case).
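As a quick illustration of the two steps on single sentences, here is a minimal sketch. The helper names (`add_spaces_and_markers`, `to_indices`) and the toy vocabulary are made up for this example; the notebook itself uses a Keras vectorization layer for step 2, and this sketch collapses only runs of spaces.

```python
import re

def add_spaces_and_markers(sentence, punctuation="?.!,"):
    """Separate punctuation from words, then wrap the sentence in <start>/<end>."""
    # Put a space on both sides of every punctuation character.
    sentence = re.sub("([" + re.escape(punctuation) + "])", r" \1 ", sentence)
    # Collapse runs of spaces into a single space and trim the ends.
    sentence = re.sub(r" +", " ", sentence).strip()
    return "<start> " + sentence + " <end>"

def to_indices(sentence, vocabulary):
    """Map each whitespace-separated token to an integer index (1 = unknown word)."""
    return [vocabulary.get(token, 1) for token in sentence.lower().split()]

english = add_spaces_and_markers("Hi.")
spanish = add_spaces_and_markers("¿Cómo estás?", punctuation="?.!,¡¿")
print(english)  # <start> Hi . <end>
print(spanish)  # <start> ¿ Cómo estás ? <end>

# Hypothetical toy vocabulary, following the convention 0 = padding, 1 = unknown.
toy_vocabulary = {"": 0, "[UNK]": 1, "<start>": 2, "<end>": 3, ".": 4, "hi": 5}
print(to_indices(english, toy_vocabulary))  # [2, 5, 4, 3]
```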

In [1]:
# Import necessary packages
import numpy as np
import tensorflow as tf
from tensorflow import keras
import re

In [2]:
# Read in the dataset - note that we need to specify encoding="utf-8" when the language contains non ascii words.
sentences_english = []
sentences_spanish = []
for line in open('../datasets/spa.txt', 'r', encoding = 'utf-8'):
    s_english, s_spanish, other = line.rstrip('\n').split('\t')
    
    # Before we store the English sentence, we need to do a few processing steps.
    # 1. We don't want to just discard punctuation. To make sure tokenization works properly, let's add a space between each punctuation mark and the preceding word. E.g., "Hi." -> "Hi ." This is done with regular expressions (don't worry about it if you are not familiar with them).
    # 2. For reasons that will become clear later, we need to add two special "words" to indicate the beginning and end of each sentence.
    s_english = re.sub(r"([?.!,])", r" \1 ", s_english)
    s_english = re.sub(r'[" "]+', " ", s_english)
    s_english = s_english.strip()
    s_english = '<start> ' + s_english + ' <end>'
    sentences_english.append(s_english)
    
    # Similarly, do the two steps for Spanish sentences.
    s_spanish = re.sub(r"([?.!,¡¿])", r" \1 ", s_spanish)
    s_spanish = re.sub(r'[" "]+', " ", s_spanish)
    s_spanish = s_spanish.strip()
    s_spanish = '<start> ' + s_spanish + ' <end>'
    sentences_spanish.append(s_spanish)   

sentences_english = np.array(sentences_english)
sentences_spanish = np.array(sentences_spanish)
# print to check
print(sentences_english)
print()
print(sentences_spanish)
print()
print('In total: ' + str(len(sentences_spanish)) + ' pairs of sentences.')

# The original data is quite large, and may result in high memory usage and long training time. Let's take a sample of 15000
idx = np.random.choice(list(range(len(sentences_spanish))), size = 15000, replace = False)
sentences_english = sentences_english[idx]
sentences_spanish = sentences_spanish[idx]

['<start> Go . <end>' '<start> Go . <end>' '<start> Go . <end>' ...
 '<start> A carbon footprint is the amount of carbon dioxide pollution that we produce as a result of our activities . Some people try to reduce their carbon footprint because they are concerned about climate change . <end>'
 '<start> Since there are usually multiple websites on any given topic , I usually just click the back button when I arrive on any webpage that has pop-up advertising . I just go to the next page found by Google and hope for something less irritating . <end>'
 '<start> If you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>']

['<start> Ve . <end>' '<start> Vete . <end>' '<start> Vaya . <end>' ...
 '<start> Una huella de carbono es la cantidad de contaminación de dióxido de carbono que producimos como 

In [12]:
# Text preprocessing - just like what we did for text classification
# we need one vectorization layer for each language

# For English, we want to lowercase, but don't want to strip punctuations.
def lowercase_only(text):
    return tf.strings.lower(text)

vectorize_layer_english = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens = None,
    standardize = lowercase_only,
    split = 'whitespace',
    ngrams = None,
    output_mode = 'int',
    output_sequence_length = None
)

vectorize_layer_spanish = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens = None,
    standardize = lowercase_only,
    split = 'whitespace',
    ngrams = None,
    output_mode = 'int',
    output_sequence_length = None
)

In [13]:
# Apply to English and check vocabulary
vectorize_layer_english.adapt(sentences_english)
english_vocabulary = vectorize_layer_english.get_vocabulary()
print(english_vocabulary)
print(len(english_vocabulary))
# also create a variable max_english to store max length of English sentences (will come in handy later)
max_english = vectorize_layer_english(sentences_english).shape[1]
print(max_english)

6088
37


In [14]:
# Apply to Spanish and check vocabulary
vectorize_layer_spanish.adapt(sentences_spanish)
spanish_vocabulary = vectorize_layer_spanish.get_vocabulary()
print(spanish_vocabulary)
print(len(spanish_vocabulary))
# also create a variable max_spanish to store max length of Spanish sentences (will come in handy later)
max_spanish = vectorize_layer_spanish(sentences_spanish).shape[1]  # use the Spanish layer (not the English one) on Spanish sentences
print(max_spanish)

['', '[UNK]', '<start>', '<end>', '.', 'de', 'que', 'no', 'a', 'tom', 'la', '?', '¿', 'el', 'en', 'es', 'un', 'me', ',', 'se', 'por', 'lo', 'una', 'los', 'qué', 'mi', 'Él', 'con', 'su', 'está', 'te', 'ella', 'le', 'para', 'mary', 'y', 'más', 'las', 'al', 'yo', 'muy', 'eso', 'del', 'tu', 'este', 'esta', 'tiene', 'estoy', 'tengo', 'quiero', 'estaba', 'fue', 'como', 'él', 'si', 'aquí', 'hacer', 'tiempo', 'puedo', 'todo', 'ha', 'casa', 'todos', 'hay', 'esto', 'algo', 'tan', 'mucho', 'puede', 'nada', 'favor', 'son', 'bien', 'ir', 'he', '!', '¡', 'gusta', 'vez', 'nos', 'trabajo', 'era', 'creo', 'quién', 'ser', 'ellos', 'solo', 'dijo', 'cuando', 'sé', 'ya', 'dos', 'estás', 'sus', 'cómo', 'nunca', 'mañana', 'ahora', 'dónde', 'verdad', 'están', 'pero', 'puedes', 'hablar', 'ese', 'dinero', 'tienes', 'hoy', 'había', 'tomás', 'hace', 'soy', 'tú', 'va', 'ver', 'quiere', 'nadie', 'mejor', 'día', 'poco', 'has', 'parece', 'noche', 'sabe', 'siempre', 'voy', 'libro', 'tenía', 'francés', 'quieres', 'ante

# Seq2Seq Model and Encoder-Decoder Architecture <a name="encode_decode"></a>

## Sequence-to-Sequence (Seq2Seq) Modeling <a name="encode_decode_s2s"></a>

The goal of a seq2seq model is to take a sequence (e.g., of text) as input and predict another sequence (e.g., of text). The input sequence and the output sequence _may not have the same length_. It is relevant for a lot of NLP tasks including:
- Machine Translation (turn a piece of text in language A into a piece of text in language B);
- Q&A (take a question and produce an answer);
- Speech recognition (predict text based on speech/sound);
- Conversational AI (e.g., chatbots).

<font color="blue"> The **encoder-decoder architecture** is one of the canonical neural network architectures for seq2seq modeling tasks. </font>

## Basic Ideas of the Encoder-Decoder Architecture <a name="encode_decode_architecture"></a>

The encoder-decoder architecture has three components: the **encoder**, the **context**, and the **decoder**. Among them, the encoder and the decoder are typically RNNs and the context is typically a vector. The three components have different roles:
- The encoder "reads" the input sequence, then produces a vector output at the end. That vector output is the context. In other words, the encoder "encodes" information in the input sequence into a numeric vector;
- The context vector can be thought of as a "summary" or "representation" of the information in the input sequence;
- The decoder takes the context vector, then "decodes" its information to produce the output sequence.

Taking machine translation as an example, this architecture conceptually mimics how humans translate: read a sentence in language A (encoder), understand / digest it (context), then produce the translated sentence in language B (decoder).

<font color="red">Question: why do we need the context vector? Why can't we just directly predict the output sequence, one word at a time, as the encoder reads through the input sequence?</font> (recall the many-to-many RNN architecture that you learned in Yicheng's class). **Hint:** think about the lengths of input and output sequences.

Based on the roles of encoder and decoder, we can easily map out the RNN structures that we need:
- For the encoder, we need an RNN that takes a sequence as input and produces a vector output at the end (i.e., many-to-one structure). This is exactly the same structure that we have used for classification, _except that we don't need a final categorical prediction - we just need the hidden state of the RNN at the final time step_. <font color="red">Question: why use the hidden state?</font>;
- For the decoder, we need an RNN that takes a single vector as input and produces a sequence as output (i.e., one-to-many structure). More details below.

## Technical Details of the Encoder-Decoder Architecture <a name="encode_decode_tech"></a>

![Encoder-Decoder Architecture Illustration](images/encoder_decoder.gif)
image credit: https://medium.com/analytics-vidhya/machine-translation-using-neural-networks-61ea85b39ad4

Consider a training dataset with $N$ **pairs** of input sequences and output sequences (e.g., in the case of machine translation, these may be $N$ pairs of corresponding sentences in two languages). Let's represent a particular input sequence as $\boldsymbol{X_n} = (x_1, \ldots, x_T)$ and the corresponding output sequence as $\boldsymbol{Y_n} = (y_1, \ldots, y_{T'})$. Importantly, $T$ may not equal $T'$, i.e., the two sequences may have different lengths.

**The encoder RNN**: The encoder RNN behaves just like a regular RNN that you have seen in the case of text classification. Its hidden state changes as it reads through the input sequence:
$$h_t^{(encoder)} = f(h_{t-1}^{(encoder)}, x_t)$$
where function $f()$ represents the activation function, and can be as simple as the hyperbolic tangent (i.e., simple RNN), or as complex as an LSTM.

**The context vector**: The context vector $\boldsymbol{C}$, which serves as a "summary" of the entire input sequence, is simply the last hidden state of encoder RNN. In other words,
$$\boldsymbol{C} = h_T^{(encoder)}$$

**The decoder RNN**: The decoder RNN takes the context vector and predicts the output sequence. In particular, it works as follows:
1. Set $h_0^{(decoder)} = \boldsymbol{C}$ (the context vector becomes the initial hidden state of decoder). Set $y_0$ as an empty string or a special character that indicates the beginning of a sequence;
2. Compute next hidden state as $h_t^{(decoder)} = f(h_{t-1}^{(decoder)},y_{t-1})$. Again, $f()$ is the activation function of your choice. (BTW, feeding the ground-truth $y_{t-1}$, rather than the model's own prediction, into the computation of the next hidden state during training is called "teacher forcing");
3. Predict $\hat{y_t} = softmax(h_t^{(decoder)})$ as the next word in sequence (in practice, a dense layer first maps the hidden state to a vector with one entry per word in the vocabulary);
4. Repeat steps 2-3, until some max length of the output sequence or a special character indicating the end of a sequence is reached.

During training, the loss function is constructed by comparing $(\hat{y_1}, \ldots, \hat{y_{T'}})$ with the ground truth $(y_1, \ldots, y_{T'})$, via categorical cross-entropy.

<font color="blue">During out-of-sample prediction, in step 2 of decoder, because $y_{t-1}$ is unobserved, you can simply replace it with $\hat{y_{t-1}}$.</font> This way of putting the model's output back into the model to generate the next output is called _auto-regressive_ (related to, but not to be confused with, the auto-regressive model in time-series forecasting).
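The encoder recurrence, the context vector, and the auto-regressive decoder loop described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the weights are random (untrained), $f$ is a simple tanh RNN, and all sizes, token ids, and names (`encode`, `greedy_decode`, `W_softmax`, etc.) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_in, vocab_out, hidden = 6, 7, 4   # toy sizes (assumptions)
START, END = 1, 2                        # hypothetical special token ids

# Randomly initialized (untrained) parameters of a simple tanh RNN.
E_in = rng.normal(size=(vocab_in, hidden))     # input word embeddings
E_out = rng.normal(size=(vocab_out, hidden))   # output word embeddings
W_enc, U_enc = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, hidden))
W_dec, U_dec = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, hidden))
W_softmax = rng.normal(size=(hidden, vocab_out))  # dense layer before softmax

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    return np.exp(z) / np.exp(z).sum()

def encode(x_ids):
    """Read the input sequence; the last hidden state is the context vector C."""
    h = np.zeros(hidden)
    for x in x_ids:
        h = np.tanh(h @ W_enc + E_in[x] @ U_enc)
    return h

def greedy_decode(context, max_len=10):
    """Auto-regressive decoding: feed each prediction back in as the next input."""
    h, y_prev, output = context, START, []
    for _ in range(max_len):
        h = np.tanh(h @ W_dec + E_out[y_prev] @ U_dec)
        probs = softmax(h @ W_softmax)
        y_prev = int(np.argmax(probs))   # pick the most likely next word
        output.append(y_prev)
        if y_prev == END:                # stop at the end-of-sequence token
            break
    return output

context = encode([3, 4, 5])
decoded = greedy_decode(context)
print(context.shape)  # (4,)
print(decoded)
```

Because the weights are random, the decoded ids are meaningless; the point is only the flow of information from encoder to context to decoder.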

## Implement Encoder-Decoder Architecture in Keras <a name="encode_decode_implement"></a>

<font color="red">Disclaimer: </font>Different from what we did for text classification (i.e., mostly calling existing functions), we need to do quite a bit of implementation ourselves here. As such, there are multiple ways to implement the same thing, and my code below should be treated as _a demonstration_, rather than the most efficient / general-purpose way of implementing the encoder-decoder architecture.

As demonstration, I use GRU cells for both the encoder RNN and the decoder RNN. Of course, one can use LSTM cells as well. See [Character-level recurrent sequence-to-sequence model](https://keras.io/examples/nlp/lstm_seq2seq/) as an example.

In [17]:
# Assemble the Encoder RNN
# It consists of an input layer, an embedding layer, and a GRU layer
# For demonstration, I will set the embedding dimensions and number of GRU units to be 64 (note: the two do not necessarily need to have the same dimension in general)

encoder_inputs = keras.layers.Input(shape = (None,), # shape = (None,) indicates that the length of each input is not known in advance
                                    name = "encoder_input") 
x = keras.layers.Embedding(input_dim = len(vectorize_layer_english.get_vocabulary()),
                           output_dim = 64,
                           mask_zero = True,
                           name = "encoder_embedding")(encoder_inputs)
encoder_h_all, encoder_h_final = keras.layers.GRU(units = 64,
                                                   return_state = True,  # Set return_state = True to get the hidden state at the end, i.e., the context vector
                                                   name = "encoder_gru")(x)

In [18]:
# Assemble the Decoder RNN
# It also consists of an input layer, an embedding layer, and a GRU layer

decoder_inputs = keras.layers.Input(shape = (None,), name = "decoder_input")
x = keras.layers.Embedding(input_dim = len(vectorize_layer_spanish.get_vocabulary()),
                           output_dim = 64,
                           mask_zero = True,
                           name = "decoder_embedding")(decoder_inputs)
decoder_h_all, decoder_h_final = keras.layers.GRU(units = 64,
                                                  return_sequences = True,  # MUST set return_sequences = True to allow the RNN to output its hidden state at each step (rather than only at the end)
                                                  return_state = True,
                                                  name = "decoder_gru")(x, initial_state = encoder_h_final)  # Setting initial_state = encoder_h_final passes the context vector to decoder RNN
# Then turn decoder RNN hidden state at each step into a prediction
decoder_predictions = keras.layers.Dense(units = len(vectorize_layer_spanish.get_vocabulary()),
                                         activation='softmax',
                                         name = "decoder_dense")(decoder_h_all)

<font color="blue">Coding Note: what's the difference between return_state and return_sequences here?</font>

When specifying RNN cells/layers in Keras, return_state controls whether you want the layer to return the hidden state at the _final step_, and return_sequences controls whether you want the hidden state at _every step_. Take GRU as an example:
- `output = GRU(units, return_state = False, return_sequences = False)` means you only want the final hidden state stored in output;
- `output, state = GRU(units, return_state = True, return_sequences = False)` also means you only want the final hidden state. Here, output and state will be two tensors of the same values;
- `output, state = GRU(units, return_state = True, return_sequences = True)` means you want both the final state and state at every step. Here, output will contain hidden state of every step (i.e., it will be a tensor with a "time" dimension), whereas state will only store the final hidden state.
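To make the three cases concrete, here is a toy NumPy RNN that mimics only the output convention described above. It is a stand-in for illustration, not Keras code: `toy_rnn`, its random weights, and the choice of making the number of units equal to the number of features are all assumptions of this sketch.

```python
import numpy as np

def toy_rnn(inputs, return_sequences=False, return_state=False):
    """Toy tanh RNN over inputs of shape (batch, time, features)."""
    batch, time, features = inputs.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(features, features))
    U = rng.normal(size=(features, features))
    h = np.zeros((batch, features))
    all_states = []
    for t in range(time):
        h = np.tanh(inputs[:, t, :] @ W + h @ U)
        all_states.append(h)
    # With return_sequences the output keeps a "time" dimension; otherwise it is the final state.
    output = np.stack(all_states, axis=1) if return_sequences else h
    return (output, h) if return_state else output

x = np.ones((2, 5, 3))
final_only = toy_rnn(x)
output, state = toy_rnn(x, return_state=True)
sequence, last = toy_rnn(x, return_sequences=True, return_state=True)

print(final_only.shape)                       # (2, 3)
print(np.allclose(output, state))             # True
print(sequence.shape)                         # (2, 5, 3)
print(np.allclose(sequence[:, -1, :], last))  # True
```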

In [19]:
# Put encoder and decoder RNNs into a keras Model object, so it can be trained
# Documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/Model
model = keras.Model(inputs = [encoder_inputs, decoder_inputs],
                    outputs = decoder_predictions)

In [23]:
# We still need to do two things to prepare the actual data for decoder_outputs:
# 1. We need to move it "one step ahead" of the actual output Spanish sentences, because of the teacher forcing training strategy: we use encoder states + decoder input up to step t to predict the Spanish word at step t+1
# 2. Because the decoder predictions are generated by a dense layer via softmax activation, we need to one-hot encode the Spanish sentences
temp = vectorize_layer_spanish(sentences_spanish)

# the decoder_predictions_oneahead tensor should have shape (num_sentences, max_sentence_length, size_of_vocabulary)
decoder_predictions_oneahead = np.zeros((temp.shape[0], temp.shape[1], len(vectorize_layer_spanish.get_vocabulary())), dtype = "float32")
for i in range(temp.shape[0]):
    for j in range(1, temp.shape[1]): # start from 1, because we want to move "one step ahead"
        decoder_predictions_oneahead[i, j-1, temp[i,j]] = 1
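The double loop above can be slow for large arrays. A possible vectorized alternative is sketched below on made-up toy data (`token_ids` and `vocab_size` are illustrative, not the notebook's variables); it builds the same "one step ahead" one-hot tensor with NumPy indexing and checks it against the loop.

```python
import numpy as np

# Toy padded token ids (0 = padding, 2 = <start>, 3 = <end>); values are made up.
token_ids = np.array([[2, 5, 4, 3, 0],
                      [2, 6, 7, 4, 3]])
vocab_size = 8

# Loop version, same logic as the cell above.
looped = np.zeros((*token_ids.shape, vocab_size), dtype="float32")
for i in range(token_ids.shape[0]):
    for j in range(1, token_ids.shape[1]):
        looped[i, j - 1, token_ids[i, j]] = 1

# Vectorized version: one-hot encode columns 1..end and write them to positions 0..end-1.
vectorized = np.zeros_like(looped)
vectorized[:, :-1, :] = np.eye(vocab_size, dtype="float32")[token_ids[:, 1:]]

print(np.array_equal(looped, vectorized))  # True
print(vectorized[0].argmax(axis=-1))       # [5 4 3 0 0]
```

If memory becomes a concern, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer targets directly so the one-hot tensor never has to be materialized.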

In [112]:
# Train the model
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Note: the following training will run for about 1 hour on a regular laptop
model.fit([vectorize_layer_english(sentences_english), vectorize_layer_spanish(sentences_spanish)],
          decoder_predictions_oneahead,
          batch_size = 32,
          epochs = 10,
          validation_split = 0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x25d67a73790>

We are not done yet! Using a trained encoder-decoder model to make predictions is more complicated than simply `model.predict()`. We need to build a prediction function that takes a given English sentence, processes it, runs it through the network, and generates Spanish predictions one word at a time. This is called **inference**.

First, we need to construct the encoder and decoder for deployment. This step looks very similar to what we did when we constructed them for training, but now we use the trained model object.

In [34]:
# Take a look at all the trained layers in the model object - we will use them to construct encoder/decoder for deployment
model.layers
# if you forgot what each layer represents, you can retrieve its name to remind yourself. For example
#print(model.layers[5].name)

[<tensorflow.python.keras.engine.input_layer.InputLayer at 0x1bdf3d2a640>,
 <tensorflow.python.keras.engine.input_layer.InputLayer at 0x1bdef9d85e0>,
 <tensorflow.python.keras.layers.embeddings.Embedding at 0x1bdedfd39d0>,
 <tensorflow.python.keras.layers.embeddings.Embedding at 0x1bdf26a61c0>,
 <tensorflow.python.keras.layers.recurrent_v2.GRU at 0x1bdf3dc6eb0>,
 <tensorflow.python.keras.layers.recurrent_v2.GRU at 0x1bddba32eb0>,
 <tensorflow.python.keras.layers.core.Dense at 0x1bdf26ab460>]

In [114]:
# Construct encoder
encoder_inputs = model.input[0]
encoder_embedding_layer = model.layers[2]
encoder_embedded_inputs = encoder_embedding_layer(encoder_inputs)
encoder_gru = model.layers[4]
encoder_h_all, encoder_h_final = encoder_gru(encoder_embedded_inputs)
encoder_model = keras.Model(inputs = encoder_inputs,
                            outputs = encoder_h_final)

In [115]:
# Example: use the constructed encoder to encode a sentence. The output will be a 64-dimensional context vector
# See the "encoder hidden state" with your own eyes!
encoder_model.predict(vectorize_layer_english(['Hi']))

array([[ 0.06379634, -0.06867911,  0.0224181 , -0.01005947,  0.01135689,
         0.02713817,  0.06173227,  0.0315777 ,  0.0854506 ,  0.04984908,
         0.03262766, -0.06452022,  0.02338152,  0.00326141,  0.00352042,
         0.04879302, -0.00934004,  0.02815293, -0.04780437, -0.0181174 ,
         0.02910002, -0.06143535, -0.068152  , -0.04027976,  0.02581373,
         0.00888741, -0.01559817,  0.07280658,  0.14587033, -0.06181121,
        -0.07616357, -0.00798858, -0.07944083,  0.056419  , -0.01836442,
         0.05653227,  0.03495847,  0.06762259,  0.00858808, -0.01498305,
        -0.06635191,  0.11128965,  0.1195935 ,  0.06607796, -0.02631752,
        -0.1413462 , -0.04873307,  0.08222912, -0.00267215,  0.08329066,
         0.01642701, -0.05979476, -0.09580483,  0.00491492, -0.03502396,
         0.06356701, -0.01445538, -0.05308008,  0.11230695, -0.1122926 ,
        -0.00118725,  0.04316691, -0.08154774, -0.00367362]],
      dtype=float32)

In [116]:
# Construct decoder
decoder_inputs = model.input[1]
decoder_embedding_layer = model.layers[3]
decoder_embedded_inputs = decoder_embedding_layer(decoder_inputs)
decoder_gru = model.layers[5]
decoder_states_inputs = keras.layers.Input(shape = (64,)) # during actual deployment, this will be the encoder final hidden states
decoder_h_all, decoder_h_final = decoder_gru(decoder_embedded_inputs,
                                             initial_state = decoder_states_inputs) 
decoder_dense = model.layers[6]
decoder_predictions = decoder_dense(decoder_h_all)
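# technical note: when the inputs / outputs of keras.Model() have multiple parts, they must be given as a list of objects
# we return the final hidden state (shape (batch_size, 64)) so that it can be fed back in as initial_state at the next decoding step
decoder_model = keras.Model(inputs = [decoder_inputs, decoder_states_inputs],
                            outputs = [decoder_predictions, decoder_h_final])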

In [117]:
# Now, let's build the inference function
def translate(input_sentence):
    # first, run it through the encoder model to get context vector
    states_value = encoder_model.predict(vectorize_layer_english(input_sentence))
    
    # next, let's construct a target_sentence and a predict_sentence
    # the target starts with the "<start>" word, and serves as the input to the decoder - it is the y_{t-1} element when t = 1 (at inference there is no ground truth, so later inputs are the model's own predictions)
    # the predict_sentence is our ultimate prediction
    target = '<start>'
    predict_sentence = ''
    predict_sentence_len = 0
    stop = False
    while not stop:
        # wrap the single word in a list so that the vectorized input has shape (1, 1), i.e., a batch with one sentence of one word
        output_softmax_probs, h = decoder_model.predict([vectorize_layer_spanish([target]), states_value])
        # get the output word based on "output_softmax_probs" (which is the softmax output probabilities)
        # the index [0,-1,:] is needed because the output shape is (batch_size, sequence_length, vocabulary_size)
        output_word_idx = np.argmax(output_softmax_probs[0,-1,:])
        output_word = spanish_vocabulary[output_word_idx]
        predict_sentence += output_word + ' '
        predict_sentence_len += 1
        
        # check stop conditions
        if output_word == '<end>' or predict_sentence_len > max_spanish:
            stop = True
            
        # update the target to be the current output, for next prediction
        target = output_word
        
        # also update the hidden state of decoder RNN, for next prediction
        states_value = h
    
    return predict_sentence

In [121]:
translate(['<start> Was it you that left the door open last night ? <end>'])

'¿ puedo ir a la casa de la casa ? <end> '

# Attention Mechanism <a name="attention"></a>

## What is it and Why do We Need it? <a name="attention_motivation"></a>

Not long after the success of encoder-decoder architecture in machine translation and other applications, people started to realize that it has some serious limitations. One important limitation that motivated the attention mechanism is the observation that **different parts of the input sequence are not equally important for predicting the output sequence** (see the simple example below). The basic encoder-decoder architecture cannot capture this aspect, because for each input sequence we only get a _single_ context vector that is used to generate the entire output sequence.
![An Example of Why We Need Attention](images/attention.gif)
image credit: https://medium.com/eleks-labs/neural-machine-translation-with-attention-mechanism-step-by-step-guide-989adb12127b

This also naturally gives rise to the basic idea behind the attention mechanism: Instead of a single context vector, we now compute a separate context vector for generating each word in the output sequence. That context vector should encode the information from the input sequence that is most useful for predicting the target word in the output sequence. In other words, we "align" the context vector for each target word in the output sequence.

## Technical Details of Attention Mechanism <a name="attention_tech"></a>

![Illustration of Attention Mechanism](images/attention_detail.png)
image credit: Figure 1 in https://arxiv.org/pdf/1409.0473.pdf. <font color="blue">Note:</font> I'm going to use slightly different notations than what's in the above picture, in order to be consistent with other parts of this notebook. Specifically, I will use $h_t^{(encoder)}$ and $h_t^{(decoder)}$ to represent encoder/decoder hidden states, whereas the same things are denoted as $h_t$ and $s_t$ in the picture.

**The encoder RNN**: same as the encoder step in a standard encoder-decoder architecture, _except that we typically use a bi-directional RNN_ (rather than a one-directional RNN). <font color="blue">Intuition for using bi-directional RNN:</font> we want the hidden states of the encoder RNN to contain information of both the preceding and following words in the input sequence, to help better learn the "alignment" with the target word. Formally, the forward and backward pass are:
$$\overrightarrow{h_t^{(encoder)}} = f(\overrightarrow{h_{t-1}^{(encoder)}}, x_t)$$
$$\overleftarrow{h_t^{(encoder)}} = f(\overleftarrow{h_{t+1}^{(encoder)}}, x_t)$$
and we concatenate the two to form the hidden state of encoder RNN at time $t$, i.e.,
$$h_t^{(encoder)} = \big[\overrightarrow{h_t^{(encoder)}}, \overleftarrow{h_t^{(encoder)}} \big]$$
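The forward pass, backward pass, and concatenation can be sketched in NumPy as follows. This assumes a simple tanh RNN for $f$ and random, untrained weights; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                      # sequence length, input dim, hidden dim (toy sizes)
X = rng.normal(size=(T, D))            # one input sequence of T embedded words
W_f, U_f = rng.normal(size=(D, H)), rng.normal(size=(H, H))   # forward RNN weights
W_b, U_b = rng.normal(size=(D, H)), rng.normal(size=(H, H))   # backward RNN weights

def run_rnn(inputs, W, U):
    """Return the hidden state at every step of a simple tanh RNN."""
    h, states = np.zeros(H), []
    for x_t in inputs:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)

forward = run_rnn(X, W_f, U_f)                # forward[t] has read x_1, ..., x_t
backward = run_rnn(X[::-1], W_b, U_b)[::-1]   # backward[t] has read x_T, ..., x_t
encoder_states = np.concatenate([forward, backward], axis=-1)

print(encoder_states.shape)  # (5, 8)
```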

**The context vector**: the context vector for target word $i$ is a **weighted sum** of all encoder hidden states:
$$\boldsymbol{C_i} = \sum_{t=1}^{T} \alpha_{it} h_t^{(encoder)}$$
So where do the weights, $\alpha_{it}$, come from? They are actually produced by a (standard) feed-forward neural network that is trained together with the encoder and decoder. Specifically, the feed-forward neural network has only 1 hidden layer with $tanh$ activation. For each encoder position $t$, it takes $[h_{i-1}^{(decoder)}, h_t^{(encoder)}]$ as input and outputs a single number $e_{it}$ (often called the _"score"_). The $T$ scores (one per encoder position) are then normalized via softmax to obtain the $T$ weights:
$$\alpha_{it} = \frac{\exp(e_{it})}{\sum_{k=1}^T \exp(e_{ik})}$$

**The decoder RNN**: at time step $i$ of the decoder RNN, it takes the hidden state from step $i-1$ as well as the context vector $\boldsymbol{C_i}$ as input to compute the hidden state at step $i$ and then produce a prediction at that step. So,
1. Compute next hidden state as $h_i^{(decoder)} = f(h_{i-1}^{(decoder)},y_{i-1},\boldsymbol{C_i})$. The context vector $\boldsymbol{C_i}$ is concatenated with the other inputs and fed into the activation function;
2. Predict $\hat{y_i} = softmax(h_i^{(decoder)})$ as the next word in sequence;
3. Repeat steps 1-2 until termination.
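The score, weight, and context computations for one decoder step can be sketched in NumPy. This is a minimal illustration with random, untrained parameters; the names `W_a` and `v_a` (hidden-layer weights and output weights of the scoring network) and all sizes are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 5, 8, 4, 6      # toy sizes
encoder_states = rng.normal(size=(T, enc_dim))  # h_1, ..., h_T of the encoder
decoder_prev = rng.normal(size=(dec_dim,))      # decoder hidden state at step i-1

W_a = rng.normal(size=(dec_dim + enc_dim, attn_dim))  # hidden layer of the scoring network
v_a = rng.normal(size=(attn_dim,))                    # maps the hidden layer to a scalar score

def additive_attention(decoder_prev, encoder_states):
    # Pair the previous decoder state with every encoder state: shape (T, dec_dim + enc_dim).
    repeated = np.tile(decoder_prev, (len(encoder_states), 1))
    pairs = np.concatenate([repeated, encoder_states], axis=1)
    scores = np.tanh(pairs @ W_a) @ v_a          # e_{it}, one score per encoder position
    weights = np.exp(scores - scores.max())      # softmax over the T scores
    weights = weights / weights.sum()
    context = weights @ encoder_states           # weighted sum of encoder states
    return weights, context

weights, context = additive_attention(decoder_prev, encoder_states)
print(weights.shape)                     # (5,)
print(np.isclose(weights.sum(), 1.0))    # True
print(context.shape)                     # (8,)
```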

## Other Types of Attention <a name="attention_other"></a>

Besides the attention mechanism discussed here, which is often referred to as Bahdanau Attention or Additive Attention, there are several other variations. They differ in terms of how the scores, $e_{it}$, are computed (i.e., computing the scores via a feed-forward neural net is only one of many reasonable ways). I highly recommend taking a look at [this article](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) for details.
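For comparison, two commonly cited alternative score functions, the dot-product score $e_{it} = s^{\top} h_t$ and the "general" (multiplicative) score $e_{it} = s^{\top} W h_t$, can be sketched as below. Here $s$ denotes the decoder state; the random values and the requirement that encoder and decoder states share the same dimension are simplifying assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 5, 4
encoder_states = rng.normal(size=(T, dim))   # h_1, ..., h_T
decoder_state = rng.normal(size=(dim,))      # s
W = rng.normal(size=(dim, dim))              # learned matrix in the "general" score (random here)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

dot_scores = encoder_states @ decoder_state            # e_t = s . h_t
general_scores = decoder_state @ W @ encoder_states.T  # e_t = s^T W h_t

dot_weights = softmax(dot_scores)
general_weights = softmax(general_scores)
print(dot_weights.shape, general_weights.shape)  # (5,) (5,)
```

Whichever score is used, the rest of the mechanism (softmax over scores, weighted sum of encoder states) stays the same.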

## Implement Attention Mechanism in Keras <a name="attention_implement"></a>

Like the encoder-decoder architecture, there is not yet a pre-packaged function that we can call to implement a particular attention mechanism. We have to do a bit of coding ourselves. Note that the [`tf.keras.layers.Attention`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) implements the dot product attention mechanism (i.e., not the same as what we discussed here).

# Transformer <a name="transformer"></a>

## What is the Transformer Architecture? <a name="transformer_intro"></a>

For a short while (about 2 years) after the attention mechanism was proposed in 2015, the encoder-decoder architecture plus the attention mechanism was the state-of-the-art for many language tasks (e.g., machine translation), until the transformer architecture was proposed in 2017.

The transformer architecture follows the same encoder-decoder structure, but seeks to completely throw away the RNNs for encoder/decoder, and only uses (a particular kind of) attention mechanism combined with fully-connected feed-forward neural networks (i.e., non-recurrent). 

<font color="red">But why would you want to throw away the RNNs?</font> One major reason is computational complexity. In an RNN, computations have to be done sequentially (e.g., processing one word after another), which prohibits parallelization. As a result, large-scale tasks with RNNs may become very slow. As you will see, most of the computations in a transformer (especially the self-attention component) can be done in a one-shot or parallel manner.
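The difference can be seen in a toy NumPy comparison (random weights, made-up sizes): the recurrent computation needs a loop in which every step waits for the previous one, whereas a non-recurrent layer transforms all positions with a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4
X = rng.normal(size=(T, D))
W, U = rng.normal(size=(D, D)), rng.normal(size=(D, D))

# Recurrent: step t needs the result of step t-1, so the loop cannot be parallelized over t.
h, recurrent_states = np.zeros(D), []
for x_t in X:
    h = np.tanh(x_t @ W + h @ U)
    recurrent_states.append(h)
recurrent_states = np.stack(recurrent_states)

# Non-recurrent: every position is transformed independently, so one matrix product covers all T positions.
parallel_states = np.tanh(X @ W)

print(recurrent_states.shape, parallel_states.shape)  # (6, 4) (6, 4)
```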

There are a number of technical components to a transformer architecture (see figure below). I will highlight two important/powerful components (self-attention and positional encoding) and explain the intuition behind them. The goal here is not to understand every single detail of a transformer model (which is still actively evolving as the field progresses), but to get a sense of how it works in general.

![Transformer Architecture](images/transformer.png)
image credit: [Attention is all You Need](https://arxiv.org/pdf/1706.03762.pdf) (Figure 1)

## Key Components of Transformer <a name="transformer_components"></a>

### Component 1: Self-Attention

The attention mechanism that we discussed before can be thought of as a "layer" that sits between an encoder and a decoder, which allows the decoder RNN to "attend to" different parts of the encoder hidden states. However, the transformer uses a different type of attention mechanism, called **self-attention**.

![Self-Attention Visual Illustration](images/self_attention.png)
image credit: [Self-Attention For Generative Models](https://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-transformers.pdf)

You can think of self-attention as a mechanism that applies to an input sequence _itself_ (like the visualization above), in order to generate a representation of the sequence that encodes information about how different words in the sequence are related to each other. In a (non-rigorous) sense, it allows the representation of the input sequence to contain information about "interactions" among different words in the sequence. <font color="red">So, how exactly does self-attention work?</font>

Suppose we are given an input sequence $\boldsymbol{X} = (x_1,\ldots, x_T)$, where each $x_i$ is a word embedding vector of dimension $D$ (i.e., we use $D$ numbers to represent each word), so that $\boldsymbol{X}$ has dimension $(T,D)$. First we create three sets of vectors (one row per word), respectively named query, key, and value:
- Query vectors: $Q = \boldsymbol{X} W_Q$;
- Key vectors: $K = \boldsymbol{X} W_K$;
- Value vectors: $V = \boldsymbol{X} W_V$.

Note that $W_Q, W_K, W_V$ are three matrices, each with dimension $(D,D)$, that are part of the model's parameters to learn. Their size depends only on the embedding dimension and not on the sequence length $T$, so the same weights can be applied to sequences of any length. The resulting query, key, and value vectors each have dimension $(T,D)$. In other words, the query, key, and value vectors are simply linear projections of the embedding of each word in the original input sequence. <font color="blue">Don't worry too much about why these vectors are called "query", "key", and "value". They are meant to be general terminologies to describe all kinds of attention mechanisms (more on this later).</font>

Then, we compute the attention weights, $\boldsymbol{\alpha}$, which is a matrix of dimension $(T,T)$. Each row of the matrix contains the weights that one word in the input sequence places on every word in the sequence (the softmax is applied row by row, so each row sums to 1):
$$\boldsymbol{\alpha} = \text{softmax} \left(\frac{QK^{'}}{\sqrt{D}} \right)$$
where $K^{'}$ denotes the transpose of $K$. In other words, the attention weights are computed from the dot-products of queries and keys - this is called **(scaled) dot-product attention**. The division by $\sqrt{D}$ keeps the dot-products from growing with the embedding dimension, which would otherwise push the softmax into regions with very small gradients. Finally, to get the self-attention representation of the input sequence, we simply multiply attention weights with value - just like the weighted sum that you learned from the additive attention mechanism:
$$Attention(Q, K, V) = \boldsymbol{\alpha} V$$
Note that the output here again has dimension $(T,D)$. If you want, you can write it out like $Attention(Q, K, V) = (e_1, \ldots, e_T)$ where each $e_i$ is a $D$-dimensional embedding vector - <font color="blue">This is the representation of the input sequence after applying self-attention!</font> Because of the attention weighting process discussed above, the $e_i$ embedding vectors now encode relationships among different words in the input sequence. They are sometimes referred to as **contextual embeddings**.
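
The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular library implements it. It assumes that $\boldsymbol{X}$ has shape $(T, D)$ with one row per word, that the three weight matrices have shape $(D, D)$ and multiply $\boldsymbol{X}$ from the right, and that the softmax is taken over each row. The weights here are random stand-ins for parameters that would normally be learned, and the function names are my own.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (T, D)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # each has shape (T, D)
    D = X.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(D), axis=-1)    # (T, T), each row sums to 1
    return alpha @ V, alpha                           # contextual embeddings, weights

rng = np.random.default_rng(0)
T, D = 5, 4                                           # hypothetical sizes
X = rng.normal(size=(T, D))                           # stand-in for word embeddings
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

E, alpha = self_attention(X, W_Q, W_K, W_V)
print(E.shape, alpha.shape)
print(alpha.sum(axis=1))
```

Row $i$ of `E` is the contextual embedding $e_i$, and row $i$ of `alpha` shows how much word $i$ attends to each word in the sequence.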

Finally, using the contextual embeddings of each word in the input sequence, you can even create an embedding representation of the entire input sequence, by simply averaging across the $T$ words, i.e., $e_{input} = \frac{1}{T} \sum\limits_{i=1}^T e_i$. This is called **pooling**.
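
As a tiny illustration of pooling, assume we already have contextual embeddings for $T=3$ words with $D=2$ (the numbers below are made up). Averaging over the word axis gives one vector for the whole sequence.

```python
import numpy as np

# Hypothetical contextual embeddings: T = 3 words, D = 2 numbers per word
E = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Mean pooling: average across the T words to get one D-dimensional vector
e_input = E.mean(axis=0)
print(e_input)  # [3. 4.]
```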

Three more comments about self-attention used in the transformer architecture:
1. As you can see, the entire process of calculating self-attention representation of an input sequence does NOT involve any RNNs or word-by-word recurrence. That's the point of transformer - it wants to get rid of RNNs;
2. The terminologies of query, key, and value in the description of an attention mechanism are actually quite general (i.e., they are not only applicable in the transformer context). For example, in the additive mechanism that we discussed before, the "query" is the decoder hidden state $h_{t-1}^{(decoder)}$, and both the "key" and the "value" are encoder hidden state $h_{t}^{(encoder)}$. In general, all types of attention mechanism can be described as some operations that involve the query, key, and value vectors;
3. In the actual transformer architecture, people typically use something called a **Multi-Head Self-Attention**. The technical details of it are less relevant here, but here's the high-level idea: you first cut the $D$-dimension embedding into $h$ smaller pieces (each with $D/h$ dimensions), then apply the same self-attention mechanism to each piece in parallel, and finally concatenate the $h$ outputs back into a $D$-dimensional representation. (In the paper, the pieces are produced by separate learned linear projections rather than by literally slicing the embedding.) See [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) Section 3.2.2 for more information if you are interested.
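
The high-level idea of multi-head self-attention can be sketched as follows. This is a simplified NumPy illustration of the "cut into $h$ pieces" description, with my own function names and random weights standing in for learned parameters. It assumes $\boldsymbol{X}$ has shape $(T, D)$ and $D$ is divisible by $h$. The version in the paper additionally applies a learned output projection after concatenation, which is omitted here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; scales by the dimension of the pieces it receives."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V

def multi_head_self_attention(X, W_Q, W_K, W_V, h):
    """Split the projected Q, K, V into h pieces of size D/h, attend per piece, concatenate."""
    T, D = X.shape
    assert D % h == 0, "D must be divisible by the number of heads"
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    pieces = zip(np.split(Q, h, axis=1), np.split(K, h, axis=1), np.split(V, h, axis=1))
    heads = [attention(q, k, v) for q, k, v in pieces]    # each head: (T, D/h)
    return np.concatenate(heads, axis=1)                  # back to (T, D)

rng = np.random.default_rng(0)
T, D, h = 6, 8, 2                                         # hypothetical sizes
X = rng.normal(size=(T, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

out = multi_head_self_attention(X, W_Q, W_K, W_V, h)
print(out.shape)
```

Each head computes its own $(T,T)$ attention weights, so different heads can focus on different relationships among the words.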

### Component 2: Positional Encoding

Remember that we throw away the encoder and decoder RNNs, and only rely on self-attention to generate representations of the sequences? Without the sequential RNNs, the model no longer knows the order of words in the input or output. To counter this loss of information, we try to encode the position of a word in a sequence into the embedding, using **Positional Encoding**. The positional encoding for each word at each position is another vector of the same dimension as the word embedding.

In the original paper that proposed transformer, the positional encoding is calculated as follows:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/D}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/D}}\right)$$
where $pos$ is a particular position in a sequence and $i \in \{0, \ldots, D/2 - 1\}$ is a running index, so that $2i$ and $2i+1$ together cover all $D$ dimensions. <font color="red">Looks like some mysterious magic... What does it mean?</font> Let me explain with a small example.

Suppose you have an input sequence of 5 words, $\boldsymbol{X} = (x_1,\ldots, x_5)$, and each $x_t$ is represented by a $4$-dimensional embedding. Now you want to also encode the positions of each word. For the sake of demonstration, let's say you want to encode the second position, i.e., $pos=2$. You would use the formula above to compute the following:
- Set $i=0$, $PE(2, 0) = \sin(\frac{2}{10000^0})=\sin(2) \approx 0.91$ and $PE(2, 1) = \cos(\frac{2}{10000^0})=\cos(2) \approx -0.42$;
- Set $i=1$, $PE(2, 2) = \sin(\frac{2}{10000^{0.5}}) = \sin(0.02) \approx 0.02$ and $PE(2, 3) = \cos(\frac{2}{10000^{0.5}}) = \cos(0.02) \approx 1.00$. Stop here because you only need 4 dimensions.

Then, the new embedding for the second word in this sequence will be:
$$x_2^{'} = x_2 + [0.91, -0.42, 0.02, 1.00]$$

This works because, after injecting the positional encoding, _the second word in this sequence will have a different embedding than the same word appearing at a different position in a different sequence_. Essentially, this allows the embedding to contain position-specific information that can help learning. Finally, why use trigonometric functions? It's mostly for mathematical convenience (the original paper hypothesized that sinusoids make it easy for the model to attend by relative position, because $PE(pos+k)$ is a linear function of $PE(pos)$ for any fixed offset $k$), and it works in practice.
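
The small example above ($pos=2$, $D=4$) can be reproduced with a short NumPy sketch of the sinusoidal formula. The function name is my own, and the sketch assumes $D$ is even.

```python
import numpy as np

def positional_encoding(pos, D):
    """Sinusoidal positional encoding for a single position (assumes D is even)."""
    i = np.arange(D // 2)                      # running index 0, ..., D/2 - 1
    angles = pos / 10000 ** (2 * i / D)
    pe = np.empty(D)
    pe[0::2] = np.sin(angles)                  # even dimensions 2i
    pe[1::2] = np.cos(angles)                  # odd dimensions 2i + 1
    return pe

pe = positional_encoding(pos=2, D=4)
print(np.round(pe, 2))                         # approximately 0.91, -0.42, 0.02, 1.00
```

Adding this vector to the word embedding at position 2 gives the position-aware embedding that is fed into the self-attention layers.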

## Other Components of Transformer and its Implementation <a name="transformer_other"></a>

In addition to self-attention and positional encoding, the transformer architecture also uses several other technical building blocks, such as layer normalization and residual connection. If you are interested, please see the [Additional Resources](#resource) section for articles you can read to help you understand these, and for a detailed demonstration of how to implement a transformer model.

# Application Case: BERT <a name="bert"></a>

## What is BERT? <a name="bert_intro"></a>

BERT stands for _**B**idirectional **E**ncoder **R**epresentations from **T**ransformers_. It is a **language representation model**, which means it takes raw text and generates a meaningful representation (e.g., embedding) of it. It was developed by Google in 2018. With everything we have discussed so far, you are ready to make sense of all the key components of BERT:

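1. **B**idirectional means that the representation of each word is conditioned on both its left and right context (the words before and after it) at the same time. During pre-training, this is achieved by randomly masking some input words and training the model to predict them from the surrounding words on both sides;
2. **E**ncoder **R**epresentations means that the model is aiming to generate a representation of the input sequence, i.e., it acts like an encoder;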
3. **T**ransformers means that BERT uses a transformer architecture with self-attention.

## Use BERT <a name="bert_example"></a>

Google has released a number of different BERT models, trained with different hyperparameters. [Here is a directory of all those models](https://www.tensorflow.org/tutorials/text/classify_text_with_bert#choose_a_bert_model_to_fine-tune). You see that each model is identified by three parameters:
- $L$: this is the number of transformer blocks. You can think of it as number of "layers";
- $H$: this is the dimension of embedding. We called this $D$ in our discussion of transformer;
- $A$: this is the number of heads in multi-head self-attention. This means cutting the embedding into $A$ pieces and applying self-attention to each piece.
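
To get a feel for how $L$, $H$, and $A$ determine model size, here is a rough back-of-the-envelope sketch. The helper `bert_size` is hypothetical (it is not part of any library). It assumes the usual BERT layout: a vocabulary of 30,522 word pieces, 512 positions, 2 segment types, a feed-forward layer of width $4H$ in each block, and a final pooling layer. Treat the result as an estimate rather than an official count.

```python
def bert_size(L, H, A, vocab_size=30522, max_positions=512, type_vocab=2):
    """Return (per-head dimension, approximate parameter count) for a BERT-style encoder."""
    head_dim = H // A                                   # each head works on H/A dimensions
    ffn = 4 * H                                         # assumed feed-forward width
    embeddings = (vocab_size + max_positions + type_vocab) * H + 2 * H
    attention = 4 * (H * H + H)                         # query, key, value, output projections
    feed_forward = (H * ffn + ffn) + (ffn * H + H)      # two dense layers with biases
    layer_norms = 2 * (2 * H)                           # two layer normalizations per block
    per_block = attention + feed_forward + layer_norms
    pooler = H * H + H
    return head_dim, embeddings + L * per_block + pooler

# BERT-base configuration: L = 12, H = 768, A = 12
head_dim, n_params = bert_size(L=12, H=768, A=12)
print(head_dim)
print(f"{n_params:,}")
```

For the base configuration, each head works on 64 dimensions, and the estimate comes out at roughly 110 million parameters.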

You can access pre-trained BERT models and potentially fine-tune them for your own ML tasks via [Hugging Face](https://huggingface.co/), an online platform that hosts many commonly used pre-trained models. In the following example, we access a basic BERT model and use it to encode some text. See this [page](https://huggingface.co/bert-base-uncased) for detailed documentation.

In [4]:
# install transformer package from Hugging Face
!pip install transformers


In [2]:
from transformers import BertTokenizer, TFBertModel

# fetch the pre-trained tokenizer and model (this downloads a model file of roughly 500 MB)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [12]:
# input text and encode
text = "We are using the BERT model!"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

In [13]:
# Look at the tokenized input
# Question: what are tokens 101 and 102?
encoded_input

{'input_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[  101,  2057,  2024,  2478,  1996, 14324,  2944,   999,   102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1]])>}

In [16]:
# Look at the encoded input
# Question: what is the dimension of encoding?
# Question: why are there two encoding outputs? What are they?
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(1, 9, 768), dtype=float32, numpy=
array([[[ 0.10261209,  0.18043919, -0.00554929, ..., -0.166134  ,
          0.26679957,  0.35773745],
        [ 0.263622  , -0.21110201, -0.57594675, ..., -0.20186077,
          1.308478  , -0.14822024],
        [ 0.12224663, -0.15183868, -0.36246365, ..., -0.56034166,
          0.18197185,  0.45692527],
        ...,
        [ 0.487611  ,  0.05848615, -0.26846886, ..., -0.64023006,
         -0.01316616, -0.00961822],
        [-0.16868652, -0.17555293, -0.15778571, ...,  0.54957277,
          0.45626837, -0.39924195],
        [ 0.52467674,  0.37009996, -0.21517405, ...,  0.00148578,
         -0.5219994 , -0.30393368]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-8.78734827e-01, -3.25698197e-01, -3.28317106e-01,
         6.70523882e-01,  6.76294491e-02, -4.97857258e-02,
         8.80656004e-01,  2.76587784e-01, -1.80702090e-0

# Application Case: GPT <a name="gpt"></a>

Another set of famous applications of transformers is the Generative Pre-trained Transformer (GPT) models developed by [OpenAI](https://openai.com). Here are some resources for you to read:
- GPT-1: [Blog Post](https://openai.com/blog/language-unsupervised/), [Research Paper](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf);
- GPT-2: [Blog Post](https://openai.com/blog/better-language-models/), [Research Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). GPT-2 is basically GPT-1 trained on much more data with many more parameters (1.5 billion parameters), but the underlying model structure is largely the same;
- GPT-3: [Blog Post](https://openai.com/blog/openai-api/), [Research Paper](https://arxiv.org/pdf/2005.14165.pdf). Again, GPT-3 largely uses the same architecture as GPT-2, with many more parameters (175 billion parameters).

~~For now, you can only use GPT models via OpenAI's proprietary API (i.e., it is not open-sourced).~~ You can now use GPT models via Hugging Face as well. See [this](https://huggingface.co/gpt2) as an example. 

<font color="blue">Finally, some of my personal opinions: </font> as you can see from the progression of GPT models, a general trend in the development of language models is to _build extremely large models_, i.e., take the state-of-the-art architecture and train it with more and more parameters and on larger and larger datasets. However, looking back on what we have learned so far, you really need _fundamentally new ideas_ (e.g., from bag-of-words to embeddings, from simple RNNs to LSTMs, from RNNs + attention to transformers) to achieve significant (non-incremental) improvement. Therefore, although the transformer architecture is the current state-of-the-art, I don't believe it's the end of the line for NLP. Some other breakthroughs will need to take place and bring the next leap in performance.

# Additional Resources <a name="resource"></a>

- Encoder-Decoder Architecture:
    - [Understanding Encoder-Decoder Sequence to Sequence Model](https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346). This is an accessible introduction to seq2seq modeling. I recommend reading it;
    - [Character-level recurrent sequence-to-sequence model](https://keras.io/examples/nlp/lstm_seq2seq/);
    - Original research papers that proposed the encoder-decoder architecture: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215); [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078);
- Attention Mechanism:
    - Original research paper that proposed the attention mechanism: [Neural machine translation by jointly learning to align and translate](https://arxiv.org/pdf/1409.0473.pdf?utm_source=ColumnsChannel);
    - Implementation of attention: [Neural machine translation with attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention);
- Transformer:
    - Original research paper that proposed the transformer architecture: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf);
    - Original paper on self-attention: [Long Short-Term Memory-Networks for Machine Reading](https://arxiv.org/pdf/1601.06733.pdf);
    - Additional articles to learn about self-attention: [Illustrated: Self-Attention](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a), [Introduction of Self-Attention Layer in Transformer](https://medium.com/lsc-psd/introduction-of-self-attention-layer-in-transformer-fc7bff63f3bc);
    - Additional articles on other components in a transformer: [Layer Normalization](https://arxiv.org/abs/1607.06450), [Normalization Techniques in Deep Neural Networks](https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8), [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385);
    - Implementation of Transformer: [Transformer model for language understanding](https://www.tensorflow.org/tutorials/text/transformer);
    - [Transformer for text classification](https://keras.io/examples/nlp/text_classification_with_transformer/)
- BERT:
    - Original research paper that proposed BERT: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). In particular, Section 3 talks about BERT model architecture;
    - [Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html);
    - [Text Classification with BERT](https://www.tensorflow.org/tutorials/text/classify_text_with_bert).