In [2]:
# First installment on transformers: the self-attention mechanism, their basic building block.
import numpy as np

In [3]:
# [Source]: https://www.youtube.com/watch?v=MVeOwsggkt4&list=PLZ2ps__7DhBZVxMrSkTIcG6zZBDKUXCnM&index=64

Disadvantages of seq2seq architectures, such as the RNN-based encoder-decoder, in machine translation:<br>
<img src="Images/EncoderDecoder.png" alt="drawing" width="600"/><br>
1. Lack of context retention. Since the encoder compresses the input sentence into a single vector, part of the meaning of the input sequence may get lost.
2. RNN-based encoding processes tokens one after another, so it cannot be parallelized across time steps, which makes it computationally inefficient.
3. Lack of contextual representation. During encoding, no special attention is given to the dominant tokens.

Attention mechanism and contextual learning.
1. Once encoding completes, we take the encoder's internal state vectors ($h_i$'s) and feed them to the decoder for translation.<br>
2. During translation, the weights corresponding to "I" and "nan" should be higher than those of the other state vectors, as shown in the alignment matrix.<br>
3. The alignment weight between decoder state $s_t$ and encoder state $h_i$, for an input of $n$ tokens, is given by
$$\alpha_{t,i}=\mathrm{align}(s_t, h_i)=\frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{i'=1}^{n}\exp(\mathrm{score}(s_{t-1}, h_{i'}))}$$
4. For a given $t$, all the $\alpha_{t,i}$'s can be calculated in parallel, but this cannot be done for all $t$'s in parallel, because $\alpha_{t,i}$ depends on the value of $s_{t-1}$.<br>
5. This sequential dependence makes the contextual learning paradigm of RNNs computationally inefficient.
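
The alignment weights from item 3 can be sketched for a single decoder step as follows. This is only a minimal illustration: the notes do not fix a `score` function, so a plain dot product is assumed here, the state vectors are made-up values, and `alignment_weights` is a hypothetical helper name.

```python
import numpy as np

def alignment_weights(s_prev: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Softmax over score(s_{t-1}, h_i) for all encoder states h_i (rows of H)."""
    scores = H @ s_prev                   # assumed score: dot product, shape (n,)
    scores = scores - scores.max()        # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # alpha_{t,1..n}, sums to 1

# Illustrative values only: 3 encoder states of dimension 2.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
s_prev = np.array([2.0, 0.0])             # previous decoder state s_{t-1}

alpha_t = alignment_weights(s_prev, H)
context_t = alpha_t @ H                   # context vector: weighted sum of encoder states
print(alpha_t, context_t)
```

All weights for one step come out of a single matrix product, but the next step has to wait for the new decoder state, which is the sequential bottleneck described in item 4.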

SELF ATTENTION
1. There is no recurrence relation in the self-attention layers, yet they are still aware of the contextual representation of the input sequence.
2. The objective of the self-attention layer is to take the current embeddings ($h_i$'s) and find the contextual embeddings ($s_i$'s).<br>
The contextual embedding of a word such as "movie" is evaluated as an attention-weighted sum of the embeddings of all the words in the sequence. Our objective is to evaluate these attention weights. For example, for the fourth token of a five-token sequence:
$$
    s_4 = \sum_{j=1}^5 \alpha_{4,j}h_j
$$
3. We can parallelize the calculation of the $\alpha_{i,j}$'s because there is no recurrence relation. We call it self-attention because the attention depends only on the input state vectors.
4. Consider the sentence "The animal didn't cross the street because it was too tired." Here the contextual embedding of the word "it" <br>
should have a higher weight corresponding to "animal". If the last word is changed to "congested", the contextual embedding should have a higher weight<br>
corresponding to "street". Therefore, the $\alpha$'s should have higher values corresponding to these words.
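
The weighted sum in item 2 can be sketched for all tokens at once. This is a rough illustration, assuming the attention weights come from a softmax over plain dot products between the embeddings (no learned projections yet); the embeddings are random placeholders and `self_attention_weights` is a hypothetical name.

```python
import numpy as np

def self_attention_weights(H: np.ndarray) -> np.ndarray:
    """Row i holds alpha_{i,1..n}: a softmax over the dot products h_i . h_j."""
    scores = H @ H.T                                        # (n, n), all pairs at once
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))     # 5 tokens, embedding dimension 4 (made-up values)

alpha = self_attention_weights(H)
S = alpha @ H                   # s_i = sum_j alpha_{i,j} h_j for every i in one product
print(alpha.sum(axis=-1))       # each row of weights sums to 1
```

Because no row of `alpha` depends on any other row, every contextual embedding is obtained from the same two matrix products, which is the parallelism mentioned in item 3.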

# Attention is all you need
<img src="Images/transformers.png" alt="drawing" width="300"/><img src="Images/multihead_attention.png" alt="drawing" width="740"/><br>

INPUT EMBEDDING
1. Vector representation of each token in a sequence. The transformer does not use static (fixed, pre-trained) embeddings; the embeddings are learned during model training.<br>

POSITIONAL ENCODING [the sinusoidal layer after the input embedding]<br>
1. The purpose of the positional encoding layer is to make the model aware of the position of the input tokens in the sequence. <br>
2. Since the recurrence of the traditional seq2seq model is removed, the model derives the token order from these position-aware token representations.<br>
3. [TODO] Deep dive inside the positional encoding.
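
As a starting point for that deep dive, here is a small sketch of a sinusoidal encoding. It assumes the sine/cosine formulation of Vaswani et al., $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$, which is what the "sinusoidal layer" above appears to refer to; the function name and the example sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of position encodings (d_model assumed even)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
print(pe.shape)
```

The resulting matrix is added element-wise to the token embeddings, so two identical tokens at different positions end up with different input vectors.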

SELF ATTENTION
1. Attention allows the model to focus on different parts of the input sequence when making a prediction. Multi-head attention is the concatenation of multiple self-attention heads.<br>
2. The three linear inputs shown in the above figure are three different representations of the same token. These are called queries (Q), keys (K) and values (V).<br>
3. The three representations of a single token let the model learn different aspects of the relationships between tokens in the same input sequence.<br>
4. The query represents the token for which the attention weights are calculated. Higher weights are assigned to the tokens that are most relevant to the prediction<br> corresponding to the current token.
5. The key represents the other tokens in the sequence. It is compared against the query to calculate the attention weights, and so determines the importance of each token in the sequence.
6. The value represents the content associated with each token in the sequence; the output is the attention-weighted sum of the values.

In [4]:
# Code for general attention mechanism in transformer model
# Step 1: position-aware encoder representations of four different words. [deeper analysis required]
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 0, 1])

words = np.stack([word_1, word_2, word_3, word_4])      # matrix representation.
words.shape

(4, 3)

In [5]:
# The tensor `words` represents one sequence (a batch of size 1, with the batch dim. omitted) of 4 tokens (time dim.); each token has dimension 3 (embedding dim.).

In [14]:
# generating the weight matrices
np.random.seed(42)
# The transformations: queries and keys are projected R3 -> R2, values are projected R3 -> R3.
W_Q = np.random.randn(3, 2)     # Query matrix
W_K = np.random.randn(3, 2)     # Key matrix
W_V = np.random.randn(3, 3)     # Value matrix

In [15]:
# Step1.1: Generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V
 
# Parallelizing the process
Q = words @ W_Q     # Queries
K = words @ W_K     # Keys
V = words @ W_V     # Values

In [16]:
# The next step in self-attention is to calculate the similarity score between queries and keys.
score1 = np.array([Q[0].dot(K[0]), Q[0].dot(K[1]), Q[0].dot(K[2]), Q[0].dot(K[3])])     # The similarity score between query 1 and all the keys.

# Step 2 [MATMUL]: Parallelizing the process gives the score matrix, called the attention filter.
# Assuming an MT task, for time step t we want the influence of all the tokens. The influence is calculated as a similarity score.
score = Q@K.T       # score[i, j]: the similarity score (influence) of the j'th key on the i'th query token.

In [17]:
# Step 3: Scale the attention filter. This is a design choice to keep the magnitude of the gradients manageable during training.
# It normalizes the variance that results from the matrix multiplication.
# The scaling factor is the square root of the dimension of the key vectors.
print('Variance of Q', Q.var())
print('Variance of K', K.var())
print('Variance of score', score.var())
score_scaled = score / np.sqrt(K.shape[-1])
print('Variance of score after scaling', score_scaled.var())

Variance of Q 0.4635347453166358
Variance of K 0.6331985417208494
Variance of score 1.8052575483422657
Variance of score after scaling 0.9026287741711327
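
The effect of the scaling factor is easier to see with larger key dimensions. The sketch below is a small experiment, not part of the original walkthrough: it assumes query and key components drawn independently from a standard normal distribution, and `dot_product_variances` is a hypothetical helper.

```python
import numpy as np

def dot_product_variances(d_k: int, n_samples: int = 20000, seed: int = 0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))   # assumed: zero-mean, unit-variance components
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=-1)                 # one dot product per sample
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k_demo in (2, 16, 64):
    raw_var, scaled_var = dot_product_variances(d_k_demo)
    print(d_k_demo, round(raw_var, 2), round(scaled_var, 2))
```

Under these assumptions the unscaled variance grows roughly in proportion to the key dimension, while the scaled variance stays close to 1, which keeps the softmax away from its saturated region.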


In [19]:
# Softmax helper: normalizes the scores along the given axis so that they sum to 1 along that axis.
def softmax(arr: np.ndarray, axis: int = -1) -> np.ndarray:
    shifted = arr - arr.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    exp_arr = np.exp(shifted)
    weights = exp_arr / exp_arr.sum(axis=axis, keepdims=True)
    return weights

In [20]:
# Step 4: Apply the softmax function along the rows. Each row of the score matrix holds the similarity of one query to all the keys.
# We want the weight matrix to normalize these similarity scores, hence the softmax is applied along the rows.
weights = softmax(score_scaled, -1)      # axis -1 runs over the keys, so the weights of each query (row) sum to 1.

In [21]:
# Step 5: The attention output is calculated as a weighted sum of the value vectors.
attention = weights@V

The complete process can be done in a single line<br>
$$\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

In [23]:
# Summarizing the above steps in a single line of code.
attention = softmax(Q@K.T/np.sqrt(len(K[0])), -1)@V

**Masking**<br>
1. The encoder needs no masking, since all tokens in a sequence may influence one another.
2. During decoding we cannot use future tokens to calculate the score matrix, so we need a mask<br>
    that prevents future tokens from influencing the current token.
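
A decoder-side mask can be sketched in NumPy as follows. This is one common way to do it, assuming a lower-triangular (causal) mask and a large negative fill value before the softmax; the helper names and the all-zero scores are illustrative only.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) matrix: True where query i may attend to key j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the disallowed positions pushed to a large negative score."""
    masked = np.where(mask, scores, -1e9)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    exp_scores = np.exp(masked)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

demo_scores = np.zeros((4, 4))        # made-up scores: every token equally similar
masked_weights = masked_softmax(demo_scores, causal_mask(4))
print(masked_weights)
```

With equal scores, row `i` spreads its weight evenly over positions `0..i` and gives zero weight to every later position.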

# Multiheaded Self Attention

Multi-headed self-attention concatenates the outputs of multiple self-attention heads into one long vector, which is the<br>
input to the next step in the transformer architecture. The purpose of using multiple heads is that each individual <br>
head can focus on a different aspect of the semantics of the input sequence. After the concatenation, the concatenated vector is <br>
passed through another linear layer that projects it to the output dimension.<br>
<img src="Images/multihead_attention_eqs.png" alt="drawing" width="590"/>
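
Before moving to PyTorch, the split / attend / concatenate / project sequence can be sketched in NumPy for a single sequence. Everything here is an illustrative assumption: the weight matrices are random, the sizes are arbitrary, and `multihead_self_attention` is a hypothetical function, not the implementation from the paper.

```python
import numpy as np

def _softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_emb); Wq, Wk: (d_emb, n_heads*d_k); Wv: (d_emb, n_heads*d_v); Wo: (n_heads*d_v, d_out)."""
    seq_len = X.shape[0]

    def split(M):
        # (seq_len, n_heads*d) -> (n_heads, seq_len, d)
        return M.reshape(seq_len, n_heads, -1).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    d_k = Q.shape[-1]
    weights = _softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (n_heads, seq_len, seq_len)
    heads = weights @ V                                           # (n_heads, seq_len, d_v)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, -1)        # (seq_len, n_heads*d_v)
    return concat @ Wo                                            # (seq_len, d_out)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 3))                  # 4 tokens, embedding dimension 3
Wq, Wk = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))    # 2 heads with d_k = 2
Wv, Wo = rng.normal(size=(3, 6)), rng.normal(size=(6, 5))    # 2 heads with d_v = 3, d_out = 5
out_demo = multihead_self_attention(X_demo, Wq, Wk, Wv, Wo, n_heads=2)
print(out_demo.shape)
```

Note that the head axis is moved back next to the feature axis before the final reshape, so that each token's row contains its own heads side by side.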

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [4]:
max_seq_len = 4 # No. of maximum words in a sequence. For seq with length less than max_seq_len, padding is required.
batch_size = 1  # Batch size (no. of sequences) for parallel processing.
input_dim = 3   # Dimension of the embedding vectors.
d_k = 64        # Dimension of the linearly projected queries and keys.
d_v = 64        # Dimension of the linearly projected values.
d_model = 5     # Dimension of the output of the multihead attention layer.
x = torch.randn((batch_size, max_seq_len, input_dim))   # Random input tensor
print(x.shape)

torch.Size([1, 4, 3])


In [None]:
n_heads = 8
head_dims = 


<img src="Images/multihead_attention_unrolled1.png" alt="drawing" width="590"/> <img src="Images/multihead_attention_unrolled2.png" alt="drawing" width="700"/><br>

In [5]:
class ScaledDotProductAttention:
    def __init__(self) -> None:
        """
        Implementing scaled dot product attention.
        """

    def __call__(self, Qs: torch.Tensor, Ks: torch.Tensor, Vs: torch.Tensor, mask: bool) -> torch.Tensor:
        """
        Calling scaled dot product attention.

        Args:
            Qs: query matrix    [batch_size, n_heads, max_seq_len, d_k]
            Ks: key matrix      [batch_size, n_heads, max_seq_len, d_k]
            Vs: values  matrix  [batch_size, n_heads, max_seq_len, d_v]
            mask: if True, apply a causal mask so that a token cannot attend to later tokens (decoding).

        Returns:
            Attention output (weighted sum of the values) [batch_size, n_heads, max_seq_len, d_v]
        """
        score_mat = Qs @ Ks.permute(0, 1, 3, 2)                 # (batch_size, n_heads, max_seq_len, max_seq_len)
        score_mat_scaled = score_mat / (Ks.shape[-1] ** 0.5)    # scale by sqrt(d_k), the size of the last axis

        if mask:
            seq_len = score_mat_scaled.shape[-1]
            # Lower-triangular matrix: position i may attend to positions j <= i only.
            causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=score_mat_scaled.device))
            score_mat_scaled = score_mat_scaled.masked_fill(~causal, -1e9)

        attention = F.softmax(score_mat_scaled, -1)
        out = attention @ Vs
        return out

In [7]:
# Combining all the concepts to build a multi-head attention class using PyTorch.
class MultiheadAttention(nn.Module):
    def __init__(self, h: int, d_k: int, d_v: int, d_emb: int) -> None:
        """
        Initialize multihead attention mechanism.

        Args:
            h: number of heads
            d_k: dimension of key and query vectors (per head)
            d_v: dimension of value vectors (per head)
            d_emb: embedding dimension of each token.
        """
        super().__init__()
        self.n_heads = h                    # No. of heads.
        self.d_k = d_k                      # key dimension.
        self.d_v = d_v                      # value dimension.
        self.d_model = h * d_v              # Model dimension (size of the concatenated heads).
        self.W_Q = nn.Linear(d_emb, d_k*h)  # Query matrix.
        self.W_K = nn.Linear(d_emb, d_k*h)  # Key matrix.
        self.W_V = nn.Linear(d_emb, d_v*h)  # Value matrix.
        self.W_o = nn.Linear(self.d_model, self.d_model)  # Output matrix (uses the attribute, not a global d_model).
        self.scaled_dot_product_attention = ScaledDotProductAttention()

    def split_head(self, x: torch.Tensor) -> torch.Tensor:
        """
        Split the input tensor into multiple heads.

        Args:
            x: input tensor of shape (batch_size, max_seq_len, d_k*n_heads) or (batch_size, max_seq_len, d_v*n_heads)

        Returns:
            Reshaped tensor of dimension (batch_size, n_heads, max_seq_len, d_k/d_v)
        """
        x = x.view(x.shape[0], x.shape[1], self.n_heads, -1)    # (batch_size, max_seq_len, n_heads, d_k/d_v)
        out = x.permute(0, 2, 1, 3)                             # (batch_size, n_heads, max_seq_len, d_k/d_v)
        return out

    def forward(self, x: torch.Tensor, mask: bool) -> torch.Tensor:
        """
        Forward pass of multihead attention.

        Args:
            x: input tensor of shape (batch_size, max_seq_len, d_emb)
            mask: whether to apply mask based on whether called during encoding/decoding.

        Returns:
            Output tensor of dimension (batch_size, max_seq_len, d_model)
        """
        Qs = self.W_Q(x)       # Queries, shape: (batch_size, max_seq_len, d_k*n_heads)
        Ks = self.W_K(x)       # Keys, shape: (batch_size, max_seq_len, d_k*n_heads)
        Vs = self.W_V(x)       # Values, shape: (batch_size, max_seq_len, d_v*n_heads)

        Qs = self.split_head(Qs)   # (batch_size, n_heads, max_seq_len, d_k)
        Ks = self.split_head(Ks)   # (batch_size, n_heads, max_seq_len, d_k)
        Vs = self.split_head(Vs)   # (batch_size, n_heads, max_seq_len, d_v)

        multihead_vals = self.scaled_dot_product_attention(Qs, Ks, Vs, mask)   # (batch_size, n_heads, max_seq_len, d_v)
        # Move the head axis back next to d_v before merging, so each token keeps its own heads.
        multihead_vals = multihead_vals.permute(0, 2, 1, 3)                    # (batch_size, max_seq_len, n_heads, d_v)
        multihead_vals = multihead_vals.reshape(x.shape[0], x.shape[1], self.d_model)   # (batch_size, max_seq_len, d_model)
        out = self.W_o(multihead_vals)   # (batch_size, max_seq_len, d_model)
        return out

In [8]:
"""
[NOTE]
The transformer architecture proposed by Vaswani et al. uses a model dimension d_model = 512.
With 8 heads, the linearly projected queries, keys and values each have dimension d_k = d_v = 512 / 8 = 64.
The input embedding dimension is the same as d_model = num_heads * d_k.

To keep the architecture general, the suggested values are not hard-coded.
Here d_v = 32 differs from d_k, so the output dimension is num_heads * d_v = 256.
"""
input_dim = 512
d_k = 64
d_v = 32
num_heads = 8

batch_size = 30
sequence_length = 5
x = torch.randn( (batch_size, sequence_length, input_dim) )

model = MultiheadAttention(num_heads, d_k, d_v, input_dim)

In [9]:
model(x, False).shape

torch.Size([30, 5, 256])