In [1]:
# First installment on transformers: the self-attention mechanism, their basic building block.
import numpy as np

In [3]:
# [Source]: https://www.youtube.com/watch?v=MVeOwsggkt4&list=PLZ2ps__7DhBZVxMrSkTIcG6zZBDKUXCnM&index=64

Disadvantages of seq-seq model architectures, such as the RNN-based encoder-decoder, in machine translation:<br>
<img src="Images/EncoderDecoder.png" alt="drawing" width="600"/><br>
1. Lack of context retention. Since the encoder compresses the input sentence into a single vector, part of the meaning of the input sequence may get lost.
2. Lack of parallelization. An RNN-based encoder processes the tokens one after another, which makes it computationally inefficient.
3. Lack of contextual representation. During encoding, no special attention is given to the dominant tokens.

Attention mechanism and contextual learning.
1. Once encoding completes, we take all the encoder's internal state vectors ($h_i$'s) and feed them to the decoder for translation.<br>
<img src="Images/attention1.png" alt="drawing" width="590"/> <img src="Images/attention2.png" alt="drawing" width="600"/><br>
2. During translation, the weights corresponding to "I" and "nan" should be higher than those of the other state vectors, as shown in the alignment matrix.<br>
3. The alignment score between the decoder state $s_t$ and the encoder state $h_i$, for an input of $n$ tokens, is given by
$$\alpha_{t,i}=\mathrm{align}(s_t, h_i)=\frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{i'=1}^{n}\exp(\mathrm{score}(s_{t-1}, h_{i'}))}$$
4. For a given $t$, all the $\alpha_{t,i}$'s can be calculated in parallel, but this cannot be done for all $t$'s in parallel, because $\alpha_{t,i}$ depends on the value of $s_{t-1}$.<br>
5. The contextual learning paradigm of RNNs therefore also results in computational inefficiency.
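
The alignment weights for a single decoder step can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes a plain dot product as the `score` function (the notes do not fix one), and `H` and `s_prev` are hypothetical random state vectors.

```python
import numpy as np

def alignment_weights(s_prev: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Softmax-normalised alignment weights alpha_{t,i} for one decoder step.

    s_prev : previous decoder state s_{t-1}, shape (d,)
    H      : encoder states h_1..h_n stacked row-wise, shape (n, d)
    """
    scores = H @ s_prev                 # assumed score(s_{t-1}, h_i) = h_i . s_{t-1}
    scores = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))             # hypothetical: 5 encoder states of size 4
s_prev = rng.normal(size=4)             # hypothetical previous decoder state

alpha_t = alignment_weights(s_prev, H)  # all alpha_{t,i} for this t in one shot
context_t = alpha_t @ H                 # attention-weighted sum of encoder states
```

All weights for one `t` come out of a single matrix-vector product, but the next step needs a new `s_prev`, so the loop over `t` stays sequential.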

SELF ATTENTION
1. There is no recurrence relation in the self-attention layers, yet they are still aware of the contextual representation of the input sequence.
2. The objective of the self-attention layer is to take the current embeddings ($h_i$'s) and find the contextual embeddings ($s_i$'s).<br>
The contextual embedding of the word "movie" ($s_4$) is evaluated as an attention-weighted sum of the embeddings of all the words in the sequence. Our objective is to evaluate these attention weights.
$$
    s_4 = \sum_{j=1}^5 \alpha_{4,j}h_j
$$
3. We can parallelize the calculation of the $\alpha_{i,j}$'s because there is no recurrence relation. We call it self-attention because the attention depends only on the input state vectors.
4. Consider the sentence "The animal didn't cross the street because it was too tired." Here the contextual embedding of the word "it" <br>
should have a higher weight corresponding to "animal". If the last word is changed to "congested", the contextual embedding should have a higher weight<br>
corresponding to "street". Therefore, the $\alpha$'s should have higher values corresponding to these words.
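
The weighted-sum view of self-attention can be sketched in a few lines. This is a simplified illustration under stated assumptions: the attention weights come from raw dot products between the embeddings (no learned projections yet), and `H` holds hypothetical random embeddings for a five-word sentence.

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax: each row of the result sums to 1."""
    x = x - x.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))     # hypothetical embeddings h_1..h_5, one per row

alpha = softmax_rows(H @ H.T)   # alpha[i, j]: weight of word j for word i (assumed dot-product score)
S = alpha @ H                   # S[i] = sum_j alpha[i, j] * H[j], computed for every i at once

s_4 = S[3]                      # contextual embedding of the 4th word (0-based index 3)
```

Because no row of `alpha` depends on another row, the whole matrix is produced by one matrix product, which is the parallelism the recurrent formulation lacks.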

# Attention is all you need
<img src="Images/transformers.png" alt="drawing" width="300"/><img src="Images/multihead_attention.png" alt="drawing" width="740"/><br>

INPUT EMBEDDING
1. A vector representation of each token in the sequence. In the transformer there is no static (fixed) embedding; the embeddings are learned while the model trains.<br>
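
An embedding layer is essentially a lookup table indexed by token id. The sketch below is only illustrative: the vocabulary, the embedding size and the random table values are all hypothetical, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "animal": 1, "street": 2, "it": 3}   # hypothetical toy vocabulary
d_model = 4                                              # hypothetical embedding size

# Embedding table: one row per token id. In a real model these values are
# parameters updated by gradient descent; here they are just random numbers.
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "animal", "it"]
token_ids = np.array([vocab[t] for t in tokens])
embedded = embedding_table[token_ids]                    # shape (3, d_model)
```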

POSITIONAL ENCODING [the sinusoidal layer next to the input embedding]<br>
1. The purpose of the positional encoding layer is to make the model aware of the position of the input tokens in the sequence. <br>
2. Since the recurrence layer of the traditional seq-seq model is removed, the contextual understanding is derived from the position-aware tokens.<br>
3. [TODO] Deep dive inside the positional encoding.
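
As a starting point for that deep dive, here is a minimal sketch of the sinusoidal encoding from the "Attention Is All You Need" paper, where $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The function name and the toy sizes are assumptions made for illustration, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2): the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
```

The resulting matrix is added element-wise to the input embeddings, so every token vector carries information about where it sits in the sequence.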

MULTIHEAD ATTENTION
1. Attention allows the model to focus on different parts of the input sequence when making a prediction. Multi-head attention runs several such attention computations in parallel.<br>
2. The three linear inputs shown in the figure above are three linear projections of the same token embedding. These are called queries (Q), keys (K) and values (V).<br>
3. The three representations of a single token learn different aspects of the relationships between tokens in the same input sequence.<br>
4. The query represents the token for which the attention weights are calculated. It assigns higher weights to the tokens that are most relevant to the prediction corresponding to the current token.
5. The key represents each token in the sequence that the query is compared against. It is used to calculate the attention weights with respect to the query and so determines the importance of the other tokens in the sequence.
6. The value represents the content associated with the tokens in a sequence.
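
The idea of several heads working side by side can be sketched as below. This is a hedged illustration rather than a reference implementation: all weights are random instead of learned, the sizes are hypothetical, and the output projection `W_O` that mixes the concatenated heads follows the original paper rather than these notes.

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Softmax along the last axis (over the keys)."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d_model). W_Q, W_K, W_V: (heads, d_model, d_head). W_O: (heads * d_head, d_model)."""
    head_outputs = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):          # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v           # per-head queries, keys, values
        weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)              # (n, d_head)
    concat = np.concatenate(head_outputs, axis=-1)    # (n, heads * d_head)
    return concat @ W_O                               # back to (n, d_model)

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_head = 4, 6, 2, 3       # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

Each head has its own projection matrices, so each can pick up a different kind of relationship between the same tokens. The loop is written for readability; the heads are independent and could be computed in one batched operation.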

In [8]:
# Code for the general attention mechanism in the transformer model
# Step 1: position-aware encoder representations of four different words. [Deeper analysis required.]
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 0, 1])

words = np.stack([word_1, word_2, word_3, word_4])      # matrix representation.
words.shape

(4, 3)

In [6]:
# Generating the weight matrices
np.random.seed(42)
# W_Q and W_K map R^3 -> R^2; W_V maps R^3 -> R^3
W_Q = np.random.randint(3, size=(3, 2))     # Query matrix
W_K = np.random.randint(3, size=(3, 2))     # Key matrix
W_V = np.random.randint(3, size=(3, 3))     # Value matrix

In [9]:
# Step1.1: Generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V
 
# Parallelizing the process
Q = words @ W_Q     # Queries
K = words @ W_K     # Keys
V = words @ W_V     # Values

In [11]:
# The next step in the multi-head self-attention paradigm is to calculate the similarity score between queries and keys.
score1 = np.array([Q[0].dot(K[0]), Q[0].dot(K[1]), Q[0].dot(K[2]), Q[0].dot(K[3])])     # The similarity score between query 1 and all the keys.

# Step2 [MATMUL]: Parallelizing the process, we have the score matrix called the attention filter.
score = Q@K.T       # score[i, j]: the similarity score of i'th query with the j'th key. 

In [14]:
# Step 3: Scaling the resultant attention filter. This is a design choice to manage the magnitude of the gradient during the training process.
# The scaling factor is square root of the dimension of key vector.
score_scaled = score / np.sqrt(len(K[0]))
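
Why the square root of the key dimension? A small numerical experiment makes the motivation visible. It assumes queries and keys with independent, zero-mean, unit-variance entries; the dimension `d_k = 64` and the sample count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64                                   # hypothetical key dimension
q = rng.normal(size=(10_000, d_k))         # 10,000 random queries with unit-variance entries
k = rng.normal(size=(10_000, d_k))         # 10,000 random keys

dots = (q * k).sum(axis=1)                 # one dot product per (query, key) pair
var_raw = dots.var()                       # grows with d_k (close to 64 here)
var_scaled = (dots / np.sqrt(d_k)).var()   # close to 1 after scaling
```

Without scaling, the scores spread out as the dimension grows, which pushes the softmax towards near one-hot outputs with very small gradients. Dividing by the square root of `d_k` keeps the spread roughly constant.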

In [18]:
# Step 4: Apply the softmax function along each row (over the keys), so that the weights for every query sum to 1.
weights = np.exp(score_scaled) / np.exp(score_scaled).sum(axis=1, keepdims=True)

In [26]:
# Step 5: The attention output is calculated as a weighted sum of the value vectors.
attention = weights@V

The complete process can be written in a single line:<br>
$$\mathrm{softmax}\left(\frac{QK^T}{\sqrt{\mathrm{dim}_k}}\right)V$$

In [27]:
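# Helper for the single-line version: row-wise softmax
def softmax(arr: np.ndarray) -> np.ndarray:
    # Normalise over the keys (axis=1) so that each query's weights sum to 1.
    weights = np.exp(arr) / np.exp(arr).sum(axis=1, keepdims=True)
    return weights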

In [28]:
attention = softmax(Q@K.T/np.sqrt(len(K[0])))@V
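
As a sanity check, the steps can be wrapped in one function and compared against attention computed for a single query on its own. The sketch reuses the toy words and seed from the cells above; the function name and the max-subtraction for numerical stability are additions, not part of the original notebook.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (attention output, attention weights) using a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stability; leaves the softmax unchanged
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Same toy setup as the cells above.
words = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 2))
W_K = np.random.randint(3, size=(3, 2))
W_V = np.random.randint(3, size=(3, 3))

Q, K, V = words @ W_Q, words @ W_K, words @ W_V
attention, weights = scaled_dot_product_attention(Q, K, V)

# Cross-check: row 0 must equal the attention computed for query 1 alone.
s = Q[0] @ K.T / np.sqrt(K.shape[-1])
w = np.exp(s - s.max())
w = w / w.sum()
row_0 = w @ V
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.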