# The Attention Mechanism
We have seen how an RNN layer implements a form of memory by feeding its output back in together with the next input. This memory is nonetheless limited, in particular for datasets in which an element may depend on many other elements and not only on the previous one. An attention mechanism allows a model to access all the elements of a sequence and to weight their relevance for the prediction of the next one. In NLP, an attention mechanism should give more weight to the relevant elements of a sentence, such as subject, verb and object, wherever they are in the sentence; in that sense the model learns the grammar from the examples used to train it.
![The attention mechanism](images/attention_mechanism.jpg)

A graphical representation of the attention mechanism as proposed in the paper "[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)" by D. Bahdanau, K.H. Cho and Y. Bengio is shown in the picture. My interpretation of this very relevant paper is the following.

A neural network learns the conditional probability distribution that links an input to the output. What the network has to learn during training are the parameters of this distribution, that is, the model. The model should represent:

1. The dependency of each element on the previous and the following elements in the input and output sequences
2. The weights between the elements in the input sequence and those in the output sequence

The first requirement can be implemented as an RNN or LSTM layer. The second requirement can be implemented as a matrix of weights, and this matrix is essentially what defines the attention mechanism. The model's parameters are computed from the training examples using the backpropagation algorithm.
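
As a rough illustration of the second requirement, the sketch below computes one row of such a weight matrix with an additive scoring function of the kind used in the cited paper: a score for each input element, a softmax over the scores, and a weighted sum of the encoder states. All sizes, names and random tensors are assumptions made for illustration; in a real model the states come from the recurrent layers and the parameters are learned.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 5 input elements, encoder/decoder/attention dimensions
T, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4

# Stand-ins for quantities a trained model would produce
encoder_states = torch.randn(T, enc_dim)   # h_j, one row per input element
decoder_state = torch.randn(dec_dim)       # s_{i-1}, previous decoder state

# Parameters of the scoring function (random here, learned in practice)
W_s = torch.randn(attn_dim, dec_dim)
W_h = torch.randn(attn_dim, enc_dim)
v = torch.randn(attn_dim)

# e_j = v^T tanh(W_s s_{i-1} + W_h h_j): one score per input element
scores = torch.tanh(decoder_state @ W_s.T + encoder_states @ W_h.T) @ v

# The softmax turns the scores into weights that sum to one
alignment_weights = torch.softmax(scores, dim=0)

# The context vector is the weighted sum of the encoder states
context = alignment_weights @ encoder_states

print(alignment_weights.shape, context.shape)
```

Repeating this computation for every output position would give the full matrix of weights between input and output elements.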

## Self-attention
In this section we build a small example of an attention mechanism to gain a basic intuition of the mechanism itself and of the transformer architecture. It has been found that the attention mechanism works better without the recurrent component. The input sequence is processed all at once, each element is compared with the other elements of the same sequence, and the modified mechanism is called self-attention. The architecture built on this modification is known as the transformer. The self-attention mechanism is implemented in three steps:

1. Compute the semantic similarity $\omega_{i,j}$ between the embedding $x^{(i)}$ of an element and the embedding $x^{(j)}$ of every element of the sequence
2. Compute the attention weights $\alpha_{ij}$ by computing the softmax of the similarity values
3. Use the attention weights and the elements of the sequence to compute the context-aware vectors $z^{(i)}$

$$\omega_{i,j} = x^{(i)} \cdot x^{(j)}$$

$$\alpha_{i,j} = \frac{\exp(\omega_{i,j})}{\sum_{k=1}^T \exp(\omega_{i,k})}$$

$$z^{(i)} = \sum_{j=1}^T \alpha_{i,j} x^{(j)}$$
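
To make the three formulas concrete, here is a minimal worked example on a toy sequence of two orthogonal 2-dimensional vectors (values chosen only for illustration). Each vector has similarity 1 with itself and 0 with the other one, so the softmax yields the weights $e/(1+e) \approx 0.7311$ and $1/(1+e) \approx 0.2689$.

```python
import torch
import torch.nn.functional as F

# Toy sequence of T=2 elements in a 2-dimensional space (illustrative values)
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0]])

omega = x @ x.T                  # step 1: pairwise dot products
alpha = F.softmax(omega, dim=1)  # step 2: normalize each row
z = alpha @ x                    # step 3: weighted sum of the elements

print(alpha)
# tensor([[0.7311, 0.2689],
#         [0.2689, 0.7311]])
```

Because the two input vectors form the identity matrix, the context vectors `z` coincide with the rows of `alpha`: each one is a blend of both elements, dominated by the element itself.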

Let's assume we have assigned an integer index to each word of a sentence, that is, of a sequence.

In [12]:
import torch

# input sequence / sentence:
#  "Can you help me to translate this sentence"

sentence = torch.tensor(
    [0, # can
     7, # you     
     1, # help
     2, # me
     5, # to
     6, # translate
     4, # this
     3] # sentence
)

sentence

tensor([0, 7, 1, 2, 5, 6, 4, 3])
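
The indices above are consistent with numbering the lowercased words in alphabetical order. The sketch below builds the same tensor under that assumption; the original text does not state how the indices were chosen, so the `vocab` dictionary is only one plausible way to obtain them.

```python
import torch

text = "Can you help me to translate this sentence"
words = text.lower().split()

# Assumed indexing scheme: position of each word in the sorted vocabulary
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}

sentence = torch.tensor([vocab[word] for word in words])
print(sentence)
# tensor([0, 7, 1, 2, 5, 6, 4, 3])
```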

We compute the embeddings of these words, a representation of the words in a $d = 16$ dimensional space $\mathbb{R}^{d}$, that is, 8 vectors of 16 real numbers. The embedding of a word is a set of parameters, 16 in our case, that captures the meaning of the word in the sense that, after training, similar words are close to each other in this space. PyTorch provides the [`torch.nn.Embedding`](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html) layer to compute the embeddings. The embeddings are initialized randomly and should be updated during a training process; in this example they stay untrained.

In [13]:
torch.manual_seed(123)
d = 16 # dimension of the embeddings space
embed = torch.nn.Embedding(10, d)
embedded_sentence = embed(sentence).detach()
embedded_sentence[0]

tensor([ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,
         0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,  0.4954,  0.2692])
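
It may help to see that the embedding layer is just a trainable lookup table: calling it with a tensor of indices returns the corresponding rows of its weight matrix. The following sketch checks this; the variable `looked_up` is introduced here only for the comparison.

```python
import torch

torch.manual_seed(123)
embed = torch.nn.Embedding(10, 16)   # table with 10 rows of 16 parameters
sentence = torch.tensor([0, 7, 1, 2, 5, 6, 4, 3])

# Calling the layer selects one row of the table per index
embedded_sentence = embed(sentence).detach()

# Indexing the weight matrix directly gives the same vectors
looked_up = embed.weight[sentence].detach()

print(torch.equal(embedded_sentence, looked_up))  # True
print(embed.weight.requires_grad)                 # True
```

The table has 10 rows while the sentence uses only 8 indices, so two rows are never selected here. The weights require gradients, which is what allows a training process to update them.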

We can compute the semantic similarity between each pair of word embeddings

In [5]:
omega = torch.empty(8, 8)

for i, x_i in enumerate(embedded_sentence):
    for j, x_j in enumerate(embedded_sentence):
        omega[i, j] = torch.dot(x_i, x_j)

We can achieve the same result but more efficiently by using the matrix multiplication operator

In [22]:
omega_mat = embedded_sentence.matmul(embedded_sentence.T)
omega_mat.shape

torch.Size([8, 8])

We can see that the result is the same

In [7]:
torch.allclose(omega_mat, omega)

True

We compute the attention weights for each word as a function of its semantic similarity with all the other words in the sentence.

In [8]:
import torch.nn.functional as F

attention_weights = F.softmax(omega, dim=1)
attention_weights.shape

torch.Size([8, 8])

With the attention weights we can compute the context-aware vectors of the input word sequence

In [11]:
context_vectors = torch.matmul(attention_weights, embedded_sentence)
context_vectors.shape

torch.Size([8, 16])
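
The three steps can also be collected into a single function. This is a minimal sketch; the name `simple_self_attention` and the random input standing in for the embedded sentence are assumptions made to keep the block self-contained.

```python
import torch
import torch.nn.functional as F

def simple_self_attention(x):
    """Parameter-free self-attention.

    x: tensor of shape (T, d), one embedding per row.
    Returns the context vectors (T, d) and the attention weights (T, T).
    """
    omega = x @ x.T                  # pairwise similarities
    alpha = F.softmax(omega, dim=1)  # each row is normalized over the sequence
    return alpha @ x, alpha

torch.manual_seed(123)
x = torch.randn(8, 16)               # stand-in for the embedded sentence

context, alpha = simple_self_attention(x)
row_sums = alpha.sum(dim=1)

print(context.shape)                 # torch.Size([8, 16])
print(torch.allclose(row_sums, torch.ones(8)))  # True
```

Each row of `alpha` sums to one, so every context vector is a weighted average of the embeddings in the sequence.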

## Scaled dot-product attention
Now we build a slightly more complex example that is closer to the implementation of the attention mechanism in the transformer. In this example we use three matrices to compute three transformations of the embeddings: the query embeddings $q^{(i)}$, the key embeddings $k^{(i)}$, and the value embeddings $v^{(i)}$. In a trained model these matrices are learned; here they are initialized randomly.

$$q^{(i)} = \hat U_q x^{(i)}$$
$$k^{(i)} = \hat U_k x^{(i)}$$
$$v^{(i)} = \hat U_v x^{(i)}$$

In [14]:
torch.manual_seed(123)

U_query = torch.rand(d, d)
U_key = torch.rand(d, d)
U_value = torch.rand(d, d)

In [18]:
queries = U_query.matmul(embedded_sentence.T).T
keys = U_key.matmul(embedded_sentence.T).T
values = U_value.matmul(embedded_sentence.T).T
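
The double transpose above applies the matrix to each embedding treated as a column vector. The same projection can be written with the embeddings as rows, or as a bias-free linear layer, which is how such projections are commonly expressed in PyTorch. The sketch below checks that the three forms agree; the random `x` stands in for the embedded sentence and `proj` is a name introduced for illustration.

```python
import torch

torch.manual_seed(123)
d = 16
x = torch.randn(8, d)        # stand-in for the embedded sentence
U_query = torch.rand(d, d)

# Column-vector form used in the text: q = U x for each embedding
queries_col = U_query.matmul(x.T).T

# Equivalent row-vector form: one matrix product, no double transpose
queries_row = x @ U_query.T

# Equivalent bias-free linear layer, which computes x @ weight.T
proj = torch.nn.Linear(d, d, bias=False)
with torch.no_grad():
    proj.weight.copy_(U_query)
queries_linear = proj(x).detach()

print(torch.allclose(queries_col, queries_row, atol=1e-4))
print(torch.allclose(queries_linear, queries_row, atol=1e-4))
```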

We compute the semantic similarities between the queries and the keys
$$\omega_{ij} = q^{(i)} \cdot k^{(j)}$$

In [20]:
omega = queries.matmul(keys.T)
omega.shape

torch.Size([8, 8])

In [23]:
# similarities between the query of the second word ("you", row index 1) and all the keys
omega_2 = omega[1]
omega_2

tensor([-25.1623,   9.3602,  14.3667,  32.1482,  53.8976,  46.6626,  -1.2131,
        -32.9392])

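We compute the normalized attention weights $\alpha_{ij}$ by applying the softmax function to the similarities divided by $\sqrt{d}$. This scaling is what gives scaled dot-product attention its name. Here we do it for the second word only.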

In [26]:
import torch.nn.functional as F

attention_weights_2 = F.softmax(omega_2 / d**0.5, dim=0)
attention_weights_2

tensor([2.2317e-09, 1.2499e-05, 4.3696e-05, 3.7242e-03, 8.5596e-01, 1.4026e-01,
        8.8897e-07, 3.1935e-10])
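
To see what the division by $\sqrt{d}$ does, the sketch below applies the softmax to the similarity values printed above with and without the scaling. The values are copied from the output of the previous cell; the names `unscaled` and `scaled` are introduced for the comparison.

```python
import torch
import torch.nn.functional as F

d = 16
# Similarities of the second word, copied from the output above
omega_2 = torch.tensor([-25.1623, 9.3602, 14.3667, 32.1482,
                        53.8976, 46.6626, -1.2131, -32.9392])

unscaled = F.softmax(omega_2, dim=0)
scaled = F.softmax(omega_2 / d**0.5, dim=0)

print(unscaled.max())   # about 0.999: nearly all weight on one word
print(scaled.max())     # about 0.856: weight shared with other words
```

Without the scaling the softmax is close to a hard selection of a single word. Dividing by $\sqrt{d}$ keeps the similarities in a range where several words can contribute to the context vector.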

Finally we compute the context vector of the second word as the sum of the value embeddings weighted by the attention weights

$$z^{(i)} = \sum_{j=1}^T \alpha_{ij} v^{(j)}$$

In [27]:
context_vector_2 = attention_weights_2.matmul(values)
context_vector_2

tensor([-1.2226, -3.4387, -4.3928, -5.2125, -1.1249, -3.3041, -1.4316, -3.2765,
        -2.5114, -2.6105, -1.5793, -2.8433, -2.4142, -0.3998, -1.9917, -3.3499])
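
The same computation can be carried out for all the words at once with matrix products. The following is a minimal sketch under stated assumptions: a random `x` stands in for the embedded sentence, and the comparison with `torch.nn.functional.scaled_dot_product_attention` assumes PyTorch 2.0 or later, so it is skipped when that function is not available.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
T, d = 8, 16
x = torch.randn(T, d)   # stand-in for the embedded sentence
U_query, U_key, U_value = (torch.rand(d, d) for _ in range(3))

queries = x @ U_query.T
keys = x @ U_key.T
values = x @ U_value.T

# One row of attention weights per word, normalized over the sequence
attention_weights = F.softmax(queries @ keys.T / d**0.5, dim=1)
context_vectors = attention_weights @ values

# Row 1 reproduces the single-word computation shown in the text
row_1 = F.softmax(queries[1] @ keys.T / d**0.5, dim=0) @ values

# Optional cross-check against the built-in function (PyTorch >= 2.0),
# which expects a leading batch dimension and scales by 1/sqrt(d) by default
builtin = None
if hasattr(F, "scaled_dot_product_attention"):
    builtin = F.scaled_dot_product_attention(
        queries.unsqueeze(0), keys.unsqueeze(0), values.unsqueeze(0)
    ).squeeze(0)

print(context_vectors.shape)   # torch.Size([8, 16])
```

Working with the full matrices avoids looping over the words and gives one context vector per row of `context_vectors`.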

## Multi-head attention
