# Transformers explained

In December 2017, Vaswani et al. published their seminal paper, *Attention Is All You Need*. They performed their work at Google Research and Google Brain. Let's look at the Transformer model described in that paper.


The Transformer model is built from stacks of 6 layers. The output of layer $l$ is the input of layer $l+1$ until the final prediction is reached. There is a 6-layer encoder stack on the left and a 6-layer decoder stack on the right.

![](data/Attention.png)

On the left, the inputs enter the encoder side of the Transformer through an **Attention** sub-layer and a **FeedForward Network (FFN)** sub-layer. On the right, the target outputs go into the decoder side of the Transformer through two attention sub-layers and an FFN sub-layer.

The attention mechanism is a "word-to-word" operation. The attention mechanism will find how each word is related to all other words in a sequence, including the word being analyzed itself. Let's examine the following sequence:

*The cat sat on the mat.*

Attention will run dot products between word vectors and determine the strongest relationships of a word among all the other words, including itself ('cat' and 'cat'). The attention mechanism will provide a deeper relationship between words and produce better results. For each attention sub-layer, the Transformer model runs not one but eight attention mechanisms in parallel to speed up the calculations. We shall discuss how these attention mechanisms work in detail in the coming sections.
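
To make the "word-to-word" idea concrete, here is a minimal sketch in plain Python. It assumes invented 3-dimensional word vectors and hypothetical helper names (`dot`, `softmax`, `attention_weights`), and it leaves out the learned projections and scaling that the full attention mechanism uses. It only shows dot products between one word and every word in the sequence, itself included, turned into weights that sum to 1.

```python
import math

# Toy 3-dimensional word vectors; the values are invented for illustration only.
vectors = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.1, 0.3],
    "sat": [0.2, 0.8, 0.1],
    "mat": [0.8, 0.2, 0.4],
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(word, vectors):
    """Softmax-normalized dot products between `word` and every word, itself included."""
    words = list(vectors)
    scores = [dot(vectors[word], vectors[w]) for w in words]
    return dict(zip(words, softmax(scores)))

weights = attention_weights("cat", vectors)
for w, a in weights.items():
    print(f"cat -> {w}: {a:.3f}")
```

With these made-up vectors, 'cat' relates most strongly to itself and then to 'mat', simply because those vectors point in similar directions.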

## The encoder stack

The encoder and the decoder of the original Transformer model are both stacks of layers. Each layer of the encoder stack has the following structure:

![](data/encoder.png)

The original encoder layer structure remains the same for all of the N=6 layers of the Transformer model. Each layer contains two main sub-layers: a **multi-headed attention mechanism** and a **fully connected position-wise feedforward network**.  
Notice that a **residual connection** surrounds each **main sub-layer, Sublayer(x)**, in the Transformer model. These connections transport the unprocessed **input x** of a sub-layer to a layer normalization function. This way, we are certain that key information such as **positional encoding** is not lost on the way. The normalized output of each sub-layer is thus: **LayerNormalization(x + Sublayer(x))**. Though the structure of each of the N=6 layers of the encoder is identical, the content of each layer is not strictly identical to the previous layer. Each layer learns from the previous layer and explores different ways of associating the tokens in the sequence.
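
The expression LayerNormalization(x + Sublayer(x)) can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the layer normalization here omits the learned gain and bias parameters.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: any function that keeps the dimension."""
    return [0.5 * v + 0.1 for v in x]

def residual_block(x, sublayer):
    """Compute LayerNormalization(x + Sublayer(x))."""
    summed = [a + b for a, b in zip(x, sublayer(x))]  # residual connection
    return layer_norm(summed)

x = [1.0, 2.0, 3.0, 4.0]
out = residual_block(x, toy_sublayer)
print(out)
```

Because the input x is added back before normalization, the sub-layer only has to learn a change to x rather than a full replacement for it, and the output keeps the same dimension as the input.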

The designers of the Transformer introduced a very efficient constraint. The output of every sub-layer of the model has a constant dimension, including the **embedding layer** and the **residual connections**. This dimension is $d_{model}$ and can be set to another value depending on your goals. In the original Transformer architecture, $d_{model}$ = 512.

## Inputs

The input embedding sub-layer converts the input tokens to vectors of dimension $d_{model}$ = 512 using learned embeddings in the original Transformer model. The structure of the input embedding is classical.


**Representing Inputs**

We first represent each word of the input sentence using a one-hot vector. A one-hot vector is a vector in which every element is '0' except for a single element which is a '1'. The length of each one-hot vector is determined beforehand by the size of the vocabulary. If we want to represent 10,000 different words, we need one-hot vectors of length 10,000 (so that each word has a unique slot for the “one”). We don't want to feed the Transformer plain one-hot vectors because they're sparse, huge, and tell us nothing about the characteristics of the word. Therefore we learn a "word embedding", which is a smaller real-valued vector representation of the word that carries some information about the word.

**Word Embeddings:**
Word embedding is the process of converting words into vector representations in such a way that similar words have similar representations.
We can do this using `nn.Embedding` in PyTorch, or, more generally speaking, by multiplying our one-hot vector by a learned weight matrix $W$. `nn.Embedding` holds a weight matrix $W$ that, in effect, transforms a one-hot vector into a real-valued vector (in practice it takes the word's index and returns the matching row). The weight matrix has shape (**num_embeddings, embedding_dim**). num_embeddings is simply the vocabulary size: we need one embedding for each word in the vocabulary. embedding_dim is the size we want our real-valued representation to be; we can choose this to be whatever we want – 3, 64, 256, 512, etc. In the Transformer paper the authors choose 512 (the hyperparameter $d_{model}$ = 512).

People refer to `nn.Embedding` as a "lookup table" because you can imagine the weight matrix as merely a stack of the real-valued vector representations of the words:

![](data/lookup.png)
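
As a minimal sketch of the lookup-table view, the plain-Python snippet below checks that multiplying a one-hot vector by a weight matrix gives the same result as reading the corresponding row. The toy vocabulary, the 4-dimensional embedding size, and the helper names are assumptions chosen for illustration; `nn.Embedding` itself is not used here.

```python
import random

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary (assumption)
word_to_index = {w: i for i, w in enumerate(vocab)}
embedding_dim = 4                            # 512 in the original Transformer

random.seed(0)
# Weight matrix W of shape (num_embeddings, embedding_dim), randomly initialized.
W = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(word):
    """Multiply the one-hot row vector by W."""
    h = one_hot(word_to_index[word], len(vocab))
    return [sum(h[r] * W[r][c] for r in range(len(vocab)))
            for c in range(embedding_dim)]

def embed_by_lookup(word):
    """Read the corresponding row of W directly."""
    return W[word_to_index[word]]

print(embed_by_matmul("cat"))
print(embed_by_lookup("cat"))
```

Both calls return the same vector, which is why the direct row lookup is preferred: it avoids building and multiplying a vocabulary-sized one-hot vector.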

There are two options for dealing with the PyTorch `nn.Embedding` weight matrix. One option is to initialize it with pre-trained embeddings and keep it fixed, in which case it’s really just a lookup table. Another option is to initialize it randomly, or with pre-trained embeddings, but keep it trainable. In that case the word representations are refined throughout training because the weight matrix itself is updated throughout training.

The Transformer uses a random initialization of the weight matrix and refines these weights during training – i.e. it learns its own word embeddings.

So now we get a 512-dimensional vector ($d_{model}$ = 512) for each word that looks something like this:

$$word = [1.35794589e-02,\  -2.18823571e-02,\ \ldots,\ 1.34526128e-02,\  6.74355254e-02]_{1 \times 512}$$


Now that we have a word embedding for each word in the sentence, we need to account for the position of each word in the sentence. Since the word embeddings have dimension 512, the positional information we add to them must also have dimension 512.

**Positional Encoding**

Vaswani et al. use sine and cosine functions of different frequencies to generate the positional encoding (PE) for each position $pos$ and each pair of dimensions $(2i, 2i+1)$ of the $d_{model}$ = 512 dimensions of the word embedding vector:

$$PE_{(pos,\ 2i)} = \sin \bigg(\frac{pos}{10000^{\frac{2i}{d_{model}}}} \bigg)$$

$$PE_{(pos,\ 2i+1)} = \cos \bigg(\frac{pos}{10000^{\frac{2i}{d_{model}}}} \bigg)$$


The Python implementation looks like this:

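```python
import math

def positional_encoding(pos, d_model=512):
    """Return the sinusoidal positional encoding vector for position `pos`."""
    pe = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share one frequency, so the exponent
        # uses 2i = 2 * (k // 2) rather than 2 * k.
        angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe
```

```python
pe2 = positional_encoding(2)
print(f'Size of position vector: {len(pe2)}')
```

```
Size of position vector: 512
```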


So now we can say:

**Positional embedding = word embedding vector + positional encoding vector** (both have dimension 512).

There is one problem with this. If we add the two vectors directly, we might lose some of the information in the word embedding. So we increase the magnitude of the word embedding by multiplying it by a scalar, and here again the authors choose the value $\sqrt{d_{model}}$:

**Positional embedding = $\sqrt{d_{model}}$ × word embedding vector + positional encoding vector**
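
The scaled sum can be sketched as follows. This is an illustration only: it uses a small $d_{model}$ of 8 and made-up embedding values instead of learned 512-dimensional embeddings, the name `encoder_input` is hypothetical, and `positional_encoding` is redefined here so that the snippet runs on its own.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the exponent 2i / d_model.
        angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def encoder_input(word_embedding, pos):
    """Scale the word embedding by sqrt(d_model), then add the positional encoding."""
    d_model = len(word_embedding)
    scale = math.sqrt(d_model)
    pe = positional_encoding(pos, d_model)
    return [scale * e + p for e, p in zip(word_embedding, pe)]

d_model = 8
embedding = [0.01 * (k + 1) for k in range(d_model)]  # made-up values
x0 = encoder_input(embedding, pos=0)
x5 = encoder_input(embedding, pos=5)
print(x0)
print(x5)
```

The same word embedding produces different vectors at positions 0 and 5, which is how the model can tell apart two occurrences of the same word in a sentence.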

## Encoder

![](data/encoder1.png)