# Transformers explained

In December 2017, Vaswani et al. published their seminal paper, *Attention Is All You Need*. They performed their work at Google Research and Google Brain. Let's look at the Transformer model that is described in that paper.


The Transformer model is built from stacks of 6 layers. The output of layer $l$ is the input of layer $l+1$ until the final prediction is reached. There is a 6-layer encoder stack on the left and a 6-layer decoder stack on the right.

![](data/Attention.png)

On the left, the inputs enter the encoder side of the Transformer through an **Attention** sub-layer and **FeedForward Network (FFN)** sub-layer. On the right, the target outputs go into the decoder side of the Transformer through two attention sub-layers and an FFN sub-layer.

The attention mechanism is a "word-to-word" operation. The attention mechanism will find how each word is related to all other words in a sequence, including the word being analyzed itself. Let's examine the following sequence:

*The cat sat on the mat.*

Attention will run dot products between word vectors and determine the strongest relationships of a word among all the other words, including itself ('cat' and 'cat'). The attention mechanism will provide a deeper relationship between words and produce better results. For each attention sub-layer, the Transformer model runs not one but eight attention mechanisms in parallel to speed up the calculations. We shall discuss how these attention mechanisms work in detail in the coming sections.

## The encoder stack

The encoder and the decoder of the original Transformer model are each a stack of layers. Each layer of the encoder stack has the following structure:

![](data/encoder.png)

The original encoder layer structure remains the same for all of the N=6 layers of the Transformer model. Each layer contains two main sub-layers: a **multi-headed attention mechanism** and a **fully connected position-wise feedforward network**.  
Notice that a **residual connection** surrounds each main sub-layer in the Transformer model. It carries the unprocessed **input x** of the sub-layer around it and adds it to the sub-layer's output, **Sublayer(x)**, before a layer normalization function. This way, we are certain that key information such as **positional encoding** is not lost on the way. The normalized output of each sub-layer is thus: **LayerNormalization(x + Sublayer(x))**. Though the structure of each of the N=6 layers of the encoder is identical, the content of each layer is not strictly identical to the previous layer. Each layer learns from the previous layer and explores different ways of associating the tokens in the sequence.

The designers of the Transformer introduced a very efficient constraint. The output of every sub-layer of the model has a constant dimension, including the **embedding layer** and the **residual connections**. This dimension is $d_{model}$ and can be set to another value depending on our goals. In the original Transformer architecture, $d_{model}$ = 512.

## Inputs

The input embedding sub-layer converts the input tokens to vectors of dimension $d_{model}$ = 512 using learned embeddings in the original Transformer model. The structure of the input embedding is classical.


**Representing Inputs**

We first represent each word of the input sentence using a one-hot vector. A one-hot vector is a vector in which every element is '0' except for a single element which is a '1'. The length of each one-hot vector is determined beforehand by the size of the vocabulary. If we want to represent 10,000 different words we need to use one-hot vectors of length 10,000 (so that we have a unique slot for the “one” for each word.) We don't want to feed the Transformer plain one-hot vectors because they're sparse, huge, and tell us nothing about the characteristics of the word. Therefore we learn a "word embedding" which is a smaller real-valued vector representation of the word that carries some information about the word. 

**Word Embeddings:**
Word embedding is the process of converting words into vector representations in such a way that similar words have similar representations.  
We can do this using `nn.Embedding` in PyTorch, or, more generally speaking, by multiplying our one-hot vector with a learned weight matrix $W$. `nn.Embedding` consists of a weight matrix $W$ that will transform a one-hot vector into a real-valued vector. The weight matrix has shape (**num_embeddings, embedding_dim**). num_embeddings is simply the vocabulary size: we need one embedding for each word in the vocabulary. embedding_dim is the size we want our real-valued representation to be; we can choose this to be whatever we want (3, 64, 256, 512, etc.). In the Transformer paper they choose 512 (the hyperparameter $d_{model}$ = 512).

People refer to nn.Embedding as a "lookup table" because you can imagine the weight matrix as merely a stack of the real-valued vector representations of the words:

![](data/lookup.png)

There are two options for dealing with the PyTorch `nn.Embedding` weight matrix. One option is to initialize it with pre-trained embeddings and keep it fixed, in which case it's really just a lookup table. Another option is to initialize it randomly, or with pre-trained embeddings, but keep it trainable. In that case the word representations will get refined and modified throughout training because the weight matrix will get refined and modified throughout training.

The Transformer uses a random initialization of the weight matrix and refines these weights during training, i.e. it learns its own word embeddings.

So now we get a $d_{model}$ = 512 dimensional vector for each word that looks something like this:
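To make the "lookup table" idea concrete, here is a minimal NumPy sketch (NumPy rather than PyTorch, purely to keep it small). The toy sizes, the random matrix `W` and the names `via_matmul` and `via_lookup` are assumptions for illustration. It shows that multiplying a one-hot vector by the weight matrix gives the same result as reading the matching row.

```python
import numpy as np

# Hypothetical toy sizes: a 5-word vocabulary and 3-dimensional embeddings
vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)

# Randomly initialised weight matrix, one row per word in the vocabulary
W = rng.normal(size=(vocab_size, embedding_dim))

# One-hot vector for the word with index 2
token_id = 2
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ W   # multiply the one-hot vector by W
via_lookup = W[token_id]   # read row `token_id` directly

print(np.allclose(via_matmul, via_lookup))  # True
```

In practice the row lookup is preferred because it never has to build the large, sparse one-hot vector.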

$$word = [1.35794589e-02,\  -2.18823571e-02,\ \ldots,\ 1.34526128e-02,\  6.74355254e-02]_{1 \times 512}$$


Now that we have a word embedding for each word in the sentence, we need to account for the positions of the words in the sentence. Since our word embeddings have dimension 512, the positional information we add to them must also have dimension 512.

**Positional Encoding**

Vaswani et al. provide sine and cosine functions with which we can generate different frequencies for the positional encoding (PE) for each position and each dimension i of the $d_{model}$ = 512 of the word embedding vector:

$$PE_{(pos,\, 2i)} = \sin \bigg(\frac{pos}{10000^{\frac{2i}{d_{model}}}} \bigg)$$

$$PE_{(pos,\, 2i+1)} = \cos \bigg(\frac{pos}{10000^{\frac{2i}{d_{model}}}} \bigg)$$


The Python implementation looks like this:

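In [1]:
import math

def positional_encoding(pos, d_model=512):
    """Return the sinusoidal positional encoding vector for position `pos`."""
    pe = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the exponent 2k / d_model from the formula
        exponent = (i - i % 2) / d_model
        angle = pos / (10000 ** exponent)
        # Even dimensions use sine, odd dimensions use cosine
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe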

In [2]:
pe2 = positional_encoding(2)
print(f'Size of position vector: {len(pe2)}')

Size of position vector: 512


So now we can say,  
**Positional embedding = word embedding vector + positional encoding vector** (both are of dim = 512).  
There is one problem with this. If we add these two vectors directly, we might lose some of the information in the word embedding. So we need to increase the value of the word embedding by multiplying it by a scalar, and here again they choose the value to be $\sqrt{d_{model}}$.

**Positional embedding = $\sqrt{d_{model}}$ * word embedding vector + positional encoding vector**. These positional embeddings go in as inputs to our encoder stack.
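The formula above can be sketched for a whole (toy) sentence as follows. The sizes `d_model = 8` and `n = 3`, the random `word_embeddings` standing in for learned embeddings, and the variable names are all assumptions made for illustration. The encoding follows the sine/cosine formulas from the paper, where dimensions 2i and 2i+1 share one frequency.

```python
import math
import numpy as np

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; dims 2i and 2i+1 share a frequency."""
    pe = np.zeros(d_model)
    for i in range(d_model):
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        pe[i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe

d_model = 8   # toy size (512 in the paper)
n = 3         # number of tokens in the toy sentence
rng = np.random.default_rng(0)

# Stand-in for the learned word embeddings, one row per token
word_embeddings = rng.normal(scale=0.02, size=(n, d_model))

# One positional encoding vector per position 0..n-1
pos_enc = np.stack([positional_encoding(pos, d_model) for pos in range(n)])

# Scale the word embeddings by sqrt(d_model), then add the positional encoding
encoder_input = math.sqrt(d_model) * word_embeddings + pos_enc
print(encoder_input.shape)  # (3, 8)
```

The output keeps the shape n x $d_{model}$, so it can be fed directly to the first encoder layer.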

## Encoder

![](data/encoder1.png)

That is our encoder part. As discussed above, it contains 6 layers. Our input sentences go into the 1st layer as positional embeddings, and what comes out is a different representation of these sentences. Each of the six encoder layers contains two sub-layers:

+ the first sub-layer is 'a multi-head self-attention mechanism'
+ the second sub-layer is 'a simple, position-wise fully connected feed-forward network' 

![](data/encoder2.png) 

**Sub-layer 1: Multi-head attention**

The multi-head attention sub-layer contains eight heads and is followed by post-layer normalization, which adds the residual connection to the output of the sub-layer and normalizes the result. To understand multi-head attention, let's first understand what the word attention means.

**Self-Attention**

Let's suppose we have the sentence  

*"I **swam** across the **river** to get to the other **BANK**"*  

At the end of our sentence we have the word bank. The question is what this word means at the end of our sentence. Does it mean sloping raised land or a financial institution? To answer this question, our attention goes to nearby words, and we see swam and river in our sentence. Now we can say that the word bank refers to sloping raised land and not to a financial institution. 

*"I **drove** across the **road** to get to the **BANK**"*  
This bank refers to a financial institution. See how things change with respect to the surrounding context. So context is important for finding the meaning of any word in a sentence.

Attention refers to the mechanism that weighs neighbouring words to enhance the meaning of the word of interest, such as how much drove and road weigh in to augment the meaning of bank in the sentence above. The main purpose of the self-attention mechanism is to add contextual information to the words in the sentence.

So the way self-attention works is: it takes the words across the sentence, converts them into tokens, transforms them into word embedding vectors, weighs each word vector according to the context, and finally produces contextualised representations of the word vectors. 

![](data/a1.png)

The weighing is done by taking the dot product between each pair of vectors. Similar word vectors tend to cluster close together, so their dot product is larger (close to 1 if the vectors are normalized to unit length), while vectors of words that are opposite in the context give smaller values (close to -1 for unit-length vectors). We refer to these dot products as scores. In other words, the higher the score, the closer the meanings and the higher the agreement between the words. 

![](data/a2.png)
![](data/a3.jpg)

After calculating the scores of each word with all the other words in the sentence, we see that the scores are all in different ranges. So we want to normalize them in such a way that they add up to 1. We use the softmax function to do so. We call the values we obtain weights.

![](data/a4.png)

Now we use these weights to weight the original word vectors.

![](data/a5.jpg)

Now these new $y_{1}, y_{2}, y_{3}...$ are new contextualized representations of our input vectors 

![](data/a6.png)

Now let's put it all together and visualize how we get a representation of a word with more context.


![](data/a7.png)

Suppose we want to get the contextualized representation of word vector $v_2$. First we take the dot product of $v_2$ with all the words in the sentence, $v_{1}$, $v_{2}$, $v_{3}$, and we get scores $s_{21}$, $s_{22}$, $s_{23}$. Note that scores are scalars, since they are obtained by the dot product of two vectors. Then we normalize the scores to obtain weights $w_{21}$, $w_{22}$, $w_{23}$. Then we multiply those weights with our original word vectors and sum the results to get the transformed, contextualized word representation $y_2$ of the original word vector $v_2$.

$$y_2 = w_{21}v_{1} + w_{22}v_{2} + w_{23}v_{3} $$

**Note:** The original word vector and transformed word vector are of same dimensions. We can do this to all our word vectors to obtain new contextualized representations.
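The steps so far (scores, softmax weights, weighted sum) can be sketched in a few lines of NumPy. The three toy vectors and the helper `softmax` are assumptions for illustration. Nothing is learned yet, exactly as in the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Three toy word vectors v_1, v_2, v_3 (one per row)
v = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

scores = v @ v.T                    # s_ij = v_i . v_j
weights = softmax(scores, axis=1)   # each row of weights sums to 1
y = weights @ v                     # y_i = sum_j w_ij * v_j

# The second row of y is the y_2 formula written out by hand
y2_by_hand = weights[1, 0] * v[0] + weights[1, 1] * v[1] + weights[1, 2] * v[2]
print(np.allclose(y[1], y2_by_hand))  # True
```

Because the same vectors play the role of query, key and value here, there is nothing to train, which is the gap the next paragraph addresses.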

But wait, we are not learning anything here. Where are our weights? So we now introduce three weight matrices $M_q, M_k, M_v$, and in a database analogy we call the resulting vectors Queries, Keys and Values. The query is the word whose context we are looking for, the keys are all the words in the sentence, and the value is what we want to obtain.

$Q = q * M_q$  
$K = k * M_k$  
$V = v * M_v$  
  
![](data/a8.jpg)

Overview

![](data/a9.png)

We see that the architecture is very similar to what we have discussed above. The only new things that we see here are Scale and Mask.
If we have a lot of dimensions, we can end up with large scores, and when large values go through the softmax function, the gradient signal is going to be very weak. So here they scale the dot product by dividing it by $\sqrt{d_k}$, which keeps the scores in a good range. The Mask is used because the decoder of the original Transformer was built for predicting words, so it must not attend to future words. It is optional in our case. This pretty much sums up **Scaled Dot-Product Attention**.

![](data/a10.png)
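As a rough sketch of scaled dot-product attention without the optional mask, assuming toy sizes and random matrices `M_q`, `M_k`, `M_v` in place of trained weights (the function name `scaled_dot_product_attention` is also just an illustrative choice):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scale to keep the scores in a good range
    weights = softmax(scores, axis=-1)   # one row of weights per query
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 4                # toy sizes

x = rng.normal(size=(n, d_model))        # stand-in for the input word vectors
# Random stand-ins for the learned projection matrices
M_q, M_k, M_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(x @ M_q, x @ M_k, x @ M_v)
print(out.shape, weights.shape)  # (4, 4) (4, 4)
```

During training, it is the three projection matrices that are updated, which is where the learning happens.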

**Multi head attention**

Now let's look at the problem that multi-head attention solves. We will again consider the sentence *I swam across the river to get on the other side.*
![](data/ma1.png) 

If we consider the word swam, it can have multiple contexts, like who swam? swam where? So in different cases we have different attention. To handle this, we come up with the solution of multi-head attention.

![](data/ma2.png) 

We see that when we have multiple heads for a single word vector $v_1$, we have multiple contextualized output vectors, say $y_{1}^{'} , y_{1}^{''}, y_{1}^{'''}$. As discussed above, all these attentions are computed with respect to different contexts. Now we concatenate these multiple outputs and send them through a linear layer to give a final output $y_1$. Now that we have understood the concept of multi-head attention, let's get back to the architecture that we were discussing in the paper.

Each head in the multi-head attention layer takes in the new embedding (the positional embedding generated in the last step), which has dimension n x 512, where n is the number of tokens in the sequence, and produces an output of shape n x 64. The outputs from all heads are then concatenated to produce a single output of the multi-head attention module of dimension n x 512. In the paper, 8 attention heads are used.

Inside each head $h_n$ of the attention mechanism, each word vector has three representations:
+ A query vector ($Q$) that has a dimension of $d_q$ = 64, which is activated and trained when a word vector $x_n$  seeks all of the key-value pairs of the other word vectors, including itself in self-attention
+ A key vector ($K$) that has a dimension of $d_k$ = 64, which will be trained to provide an attention value
+ A value vector ($V$) that has a dimension of $d_v$ = 64, which will be trained to provide another attention value

To obtain $Q$, $K$, and $V$, we must train the model with their respective weight matrices $Q_w$, $K_w$ and $V_w$, which have $d_k$ = 64 columns and $d_{model}$ = 512 rows. For example, $Q$ is obtained by a matrix product between $x$ and $Q_w$. $Q$ will have a dimension of $d_k$ = 64.


$d_{model}$ = 512 (dimension of embedding for each token)  
$d_k$ = 64 (dimension of Query & Key vector)  
$d_v$ = 64 (dimension of Value vector)  
  
Weight matrix dimensions:
  
$Q_w$: $d_{model} \times d_k = 512 \times 64$  
$K_w$: $d_{model} \times d_k = 512 \times 64$  
$V_w$: $d_{model} \times d_v = 512 \times 64$  

Let me demonstrate with a small example all the concepts that we have learned till now.

consider the sentence : *set yourself free*  
tokenize : [1,2,3]  ; $n$ = 3  
we will consider embedding size of each word is $d_{model}$ = 4 (in paper 512)

In [3]:
import numpy as np
from scipy.special import softmax

pe = np.array([[1, 0, 1, 0], [1, 2, 2, 0], [1, 2, 1, 1]])
pe

array([[1, 0, 1, 0],
       [1, 2, 2, 0],
       [1, 2, 1, 1]])

Here we have word embeddings of length 4 for the 3 words in the sentence.  
Let $d_k, d_v$ = 3 (in the paper, 64).  
Let's generate the weight matrices $Q_w, K_w, V_w$. These will be of dimensions $d_{model} \times d_k$ (or $d_{model} \times d_v$).

In [4]:
qw = np.array([[2, 1, 1], [0, 2, 2], [0, 1, 0], [2, 2, 1]])
kw = np.array([[1, 1, 2], [2, 0, 2], [0, 2, 0], [2, 2, 1]])
vw = np.array([[2, 1, 1], [1, 2, 1], [2, 2, 0], [0, 2, 0]])

In [5]:
q = np.matmul(pe,qw)
q

array([[2, 2, 1],
       [2, 7, 5],
       [4, 8, 6]])

In [6]:
k = np.matmul(pe,kw)
k

array([[1, 3, 2],
       [5, 5, 6],
       [7, 5, 7]])

In [7]:
v = np.matmul(pe,vw)
v

array([[4, 3, 1],
       [8, 9, 3],
       [6, 9, 3]])

In [8]:
# Scaled dot-product scores: every query against every key, divided by sqrt(d_k)
scores = np.matmul(q, k.T) / np.sqrt(3)
np.round(scores, 2)

array([[ 5.77, 15.01, 17.9 ],
       [19.05, 43.3 , 48.5 ],
       [23.09, 55.43, 63.51]])

In [9]:
s1 = scores[0]
s2 = scores[1]
s3 = scores[2]

In [10]:
# Softmax over each row turns the scores into weights that sum to 1
weights = softmax(scores, axis=1)
np.round(weights, 4)

array([[0.    , 0.0528, 0.9472],
       [0.    , 0.0055, 0.9945],
       [0.    , 0.0003, 0.9997]])

In [11]:
w1 = weights[0]
np.round(w1, 4)

array([0.    , 0.0528, 0.9472])

For our 1st token 'set', the weights are approximately [0.0000, 0.0528, 0.9472].

The importance for set of

+ set is approximately 0.0000  
+ yourself is approximately 0.0528  
+ free is approximately 0.9472  

The larger the weight, the more importance the current token gives to the corresponding token (including itself).

In [12]:
np.round(w1[0]*v[0], 4) #A1 dims: [1x1]x[1x3]

array([0., 0., 0.])

In [13]:
np.round(w1[1]*v[1], 4) #A2

array([0.4225, 0.4753, 0.1584])

In [14]:
np.round(w1[2]*v[2], 4) #A3

array([5.6831, 8.5246, 2.8415])

These 3 attention vectors are calculated for the 1st token. Now we need to add the vectors A1+A2+A3.

In [15]:
np.round(w1[0]*v[0] + w1[1]*v[1] + w1[2]*v[2], 4) # And finally, attention for 1st token is calculated.

array([6.1056, 9.    , 3.    ])

Similarly, we can calculate attention for the remaining 2 tokens (considering the 2nd & 3rd rows of the softmaxed matrix respectively) & hence, our attention matrix will be of the shape $n$ x $d_v$ (equal to $n$ x $d_k$ here), i.e. 3 x 3 in our case.

Now, coming back to the paper where we have 8 such attention heads. In this case, we will concatenate output matrices of shape $n$ x $d_k$ from all heads & this concatenated matrix is multiplied with a weights matrix such that output = $n$ x $d_{model}$ which was the input shape for this Multi-Head Attention layer.

$$MultiHead(output) = concat(z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7)\, W_0, \quad \text{of shape } n \times d_{model}$$

Here n is the total number of tokens in the sequence.
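The shapes involved in the 8-head case can be checked with a small sketch. The weights here are random stand-ins for trained matrices, `n = 5` is an arbitrary sequence length, and the helper names (`softmax`, `attention`, `head_outputs`) are assumptions for illustration rather than the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 512, 8
d_k = d_model // num_heads               # 64, as in the paper

x = rng.normal(size=(n, d_model))        # stand-in for the layer input

head_outputs = []
for _ in range(num_heads):
    # Each head has its own Q_w, K_w, V_w of shape d_model x d_k
    Q_w, K_w, V_w = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                     for _ in range(3))
    head_outputs.append(attention(x @ Q_w, x @ K_w, x @ V_w))  # n x 64

concat = np.concatenate(head_outputs, axis=-1)                 # n x 512
W_0 = rng.normal(scale=d_model ** -0.5, size=(num_heads * d_k, d_model))
multi_head_output = concat @ W_0                               # n x d_model

print(head_outputs[0].shape, concat.shape, multi_head_output.shape)
# (5, 64) (5, 512) (5, 512)
```

The output has the same shape as the input, which is what lets the residual connection add the two together.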

**Post Layer Normalization**

The Post-LN contains an add function and a layer normalization process. The add function processes the residual connections that come from the input of the sub-layer. The goal of the residual connections is to make sure critical information is not lost. 
The Post-LN or layer normalization can thus be described as follows:
$$ LayerNorm(x+Sublayer(x)) $$

Sublayer(x) is the sub-layer itself. x is the information available at the input step of Sublayer(x)

The input of LayerNorm is a vector v resulting from x + Sublayer(x). $d_{model}$ = 512 for every input and output of the Transformer, which standardizes all the processes. Many layer normalization methods exist, and variations exist from one model to another. The basic concept for v= x + Sublayer(x) can be defined by LayerNorm(v):  

$$ LayerNorm(v)= \gamma \frac{v - \mu}{\sigma} + \beta $$

The variables are:
+ $\mu$ is the mean of v of dimension d. As such:
$$\mu = \frac{1}{d}\sum_{k=1}^{d} v_k $$  


+ $\sigma$ is the standard deviation of v of dimension d. As such:
$$\sigma^2 = \frac{1}{d} \sum_{k=1}^{d} (v_k - \mu)^2$$  


+ $\gamma$ is a scaling parameter.  

+ $\beta$ is a bias vector.


This version of LayerNorm(v) shows the general idea of the many possible Post-LN methods.
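A minimal NumPy sketch of this Post-LN step is shown below. The small constant `eps` is an assumption added to avoid division by zero (it is not part of the formula above), and `gamma`, `beta` are set to ones and zeros instead of being trained. The function names are illustrative only.

```python
import numpy as np

def layer_norm(v, gamma, beta, eps=1e-6):
    """gamma * (v - mu) / sigma + beta, computed over the last dimension."""
    mu = v.mean(axis=-1, keepdims=True)
    sigma = v.std(axis=-1, keepdims=True)   # population std: divides by d
    return gamma * (v - mu) / (sigma + eps) + beta

def post_ln(x, sublayer_out, gamma, beta):
    """LayerNorm(x + Sublayer(x)): residual add followed by normalization."""
    return layer_norm(x + sublayer_out, gamma, beta)

d_model = 8                                   # toy size (512 in the paper)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, d_model))             # input of the sub-layer
sublayer_out = rng.normal(size=(3, d_model))  # stand-in for Sublayer(x)

out = post_ln(x, sublayer_out, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.shape)  # (3, 8)
```

With `gamma` = 1 and `beta` = 0, each row of `out` has mean close to 0 and standard deviation close to 1.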

The next sub-layer can now process the output of the Post-LN or LayerNorm(v). In this case, the sub-layer is a feedforward network.

**Sub-layer 2: Feedforward network**

The FFN sub-layer can be described as follows:
+ The FFNs in the encoder and decoder are fully connected.
+ The FFN is a position-wise network. Each position is processed separately and in an identical way.
+ The FFN contains two layers and applies a ReLU activation function.
+ The input and output of the FFN layers have dimension $d_{model}$ = 512, but the inner layer is larger with $d_{ff}$ = 2048.
+ The FFN can be viewed as performing two kernel size 1 convolutions.

Taking this description into account, we can describe the optimized and standardized FFN as follows:

$$ FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 $$

The output of the FFN goes to the Post-LN, as described above. This is repeated for each of the 6 layers: the output of one encoder layer is sent to the next layer of the encoder stack, and the output of the final encoder layer is sent to the multi-head attention sub-layers of the decoder stack.
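The FFN formula can be sketched as follows, assuming randomly initialised weights in place of trained ones (the names `W_1`, `b_1`, `W_2`, `b_2` follow the formula, the function name `ffn` is an illustrative choice). The last lines check the position-wise property: processing one position on its own gives the same result as processing it inside the full sequence.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

# Random stand-ins for the learned parameters of the two layers
W_1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b_1 = np.zeros(d_ff)
W_2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b_2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to each row of x."""
    hidden = np.maximum(0, x @ W_1 + b_1)   # ReLU on the inner layer (d_ff wide)
    return hidden @ W_2 + b_2               # back to d_model

x = rng.normal(size=(3, d_model))           # 3 positions
out = ffn(x)
print(out.shape)  # (3, 512)

# Position-wise: row 1 processed alone matches row 1 of the batch result
single = ffn(x[1:2])[0]
print(np.allclose(single, out[1]))  # True
```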

## Decoder

![](data/decoder.png)

The structure of the decoder layer remains the same for all the N=6 layers of the Transformer model, as in the encoder. Each layer contains three sub-layers: 
+ A multi-headed masked attention mechanism,
+ A multi-headed attention mechanism, and
+ A fully connected position-wise feedforward network.

The decoder has a third main sub-layer, which is the **masked multi-head attention mechanism**. In this sub-layer, at a given position, the following words are masked so that the Transformer bases its predictions on its own inferences without seeing the rest of the sequence. That way, in this model, it cannot see future parts of the sequence.  

A residual connection surrounds each of the three main sub-layers, Sublayer(x), in the Transformer model like in the encoder stack:  

$$ LayerNormalization(x + Sublayer(x)) $$

The embedding sub-layer is only present at the bottom level of the stack, like for the encoder stack. The output of every sub-layer of the decoder stack has a constant dimension, $d_{model}$, like in the encoder stack, including the embedding layer and the output of the residual connections.

The decoder is capped off with a linear layer that acts as a classifier, and a softmax to get the word probabilities. In the original paper the output is a translation.

Example:  
Input = The black cat sat on the couch and the brown dog slept on the rug.  
Output = Le chat noir était assis sur le canapé et le chien marron dormait sur le tapis

The output words go through the word embedding layer, and then the positional encoding function, like in the first layer of the encoder stack.

**The attention layers**

The Transformer is an auto-regressive model. It uses the previous output sequences as an additional input. The multi-head attention layers of the decoder use the same process as the encoder. However, the masked multi-head attention sub-layer 1 only lets attention apply to the positions up to and including the current position. The future words are hidden from the Transformer, and this forces it to learn how to predict.

To prevent the decoder from looking at future tokens, we apply a look ahead mask. The mask is added before calculating the softmax, and after scaling the scores. Let’s take a look at how this works. 


![](data/mask.png)

When we compute the softmax of the masked scores, the negative infinities get zeroed out, leaving zero attention weights for future tokens. This masking is the only difference in how the attention scores are calculated in the **masked multi-headed attention sub-layer**. This layer still has multiple heads, to which the mask is applied before their outputs are concatenated and fed through a linear layer for further processing. The output of the first multi-headed attention is a masked output vector with information on how the model should attend to the decoder's input.
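Here is a small sketch of the look-ahead mask, assuming a toy sequence of 4 tokens and random numbers in place of real scaled scores. The names `look_ahead_mask` and `masked_scores` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scaled_scores = rng.normal(size=(n, n))   # stand-in for Q K^T / sqrt(d_k)

# 0 on and below the diagonal, -inf above it (the future positions)
look_ahead_mask = np.triu(np.full((n, n), -np.inf), k=1)

masked_scores = scaled_scores + look_ahead_mask   # add the mask before the softmax
weights = softmax(masked_scores, axis=-1)         # exp(-inf) = 0 for future tokens

print(np.round(weights, 2))
```

Each row still sums to 1, but every entry above the diagonal is exactly 0, so position i only attends to positions up to and including i. The first row puts all of its weight on the first token.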


The multi-head attention sub-layer 2 draws information from the encoder by taking encoder (K, V) into account during the dot-product attention operations. This sublayer also draws information from the masked multi-head attention sub-layer 1 (masked attention) by also taking sub-layer 1(Q) into account during the dot-product attention operations. The decoder thus uses the trained information of the encoder.

We can define the input of the second multi-head attention sub-layer of a decoder as:  



$$ Input\_Attention=(Output\_decoder\_masked\_attention\_layer (Q), Output\_encoder\_layer(K,V)) $$
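The definition above can be illustrated with a single-head sketch in which the queries come from the decoder and the keys and values come from the encoder. All sizes and weights here are toy assumptions, and the random arrays stand in for the real encoder output and masked-attention output.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d_model, d_k = 6, 4, 16, 8   # toy sizes

encoder_output = rng.normal(size=(n_src, d_model))           # from the encoder stack
masked_attention_output = rng.normal(size=(n_tgt, d_model))  # from decoder sub-layer 1

# Random stand-ins for the learned projection matrices
Q_w, K_w, V_w = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = masked_attention_output @ Q_w   # queries come from the decoder
K = encoder_output @ K_w            # keys come from the encoder
V = encoder_output @ V_w            # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # n_tgt x n_src
context = weights @ V                                # n_tgt x d_k

print(weights.shape, context.shape)  # (4, 6) (4, 8)
```

Note that the weight matrix is n_tgt x n_src: each target position distributes its attention over the source positions, so the source and target sequences may have different lengths.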



A post-layer normalization process, which includes the residual connections, follows both the masked multi-head attention sub-layer 1 and the multi-head attention sub-layer 2, just like in the encoder. Then the Transformer goes to the FFN sub-layer, followed by post-layer normalization, followed by a linear layer.

**Linear layer and Outputs**

The output of the final position-wise feedforward layer goes through a final linear layer that acts as a classifier. The classifier is as big as the number of classes. For example, if we have 10,000 words in the vocabulary and each word is assigned a single class, the output of that classifier will be of size 10,000. The output of the classifier then gets fed into a softmax layer, which will produce probability scores between 0 and 1. We take the index of the highest probability score, and that equals our predicted word. The predicted word is then added to the decoder inputs and the cycle continues to predict the next word.
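One prediction step of this final stage can be sketched as follows. The weights `W_out` and `b_out` are random stand-ins for the trained linear layer, and `decoder_output` stands in for the decoder's vector at the last position. Picking the arg max is the simple choice described above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, d_model = 10_000, 512

# Random stand-ins for the learned classifier weights
W_out = rng.normal(scale=0.02, size=(d_model, vocab_size))
b_out = np.zeros(vocab_size)

decoder_output = rng.normal(size=(1, d_model))   # decoder vector at the last position

logits = decoder_output @ W_out + b_out          # one score per word in the vocabulary
probs = softmax(logits, axis=-1)                 # probabilities between 0 and 1
predicted_id = int(np.argmax(probs, axis=-1)[0]) # index of the most likely word

print(probs.shape, predicted_id)
```

In a full decoding loop, `predicted_id` would be appended to the decoder inputs and the step repeated until an end-of-sequence token is produced.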

That's it, we have finally completed the Transformer. 