# Attention and Self-Attention

_Attention Is All You Need_ is the influential paper published by researchers at Google in 2017 that led to a whole new class of NLP models named "Transformers" and their derivatives.
We will go through the stellar concept of "self-attention", which this paper uses in the Transformer architecture.

# What's the attention mechanism introduced in the paper "Attention is all you need"?

Before delving into the different kinds of attention discussed in the paper, we need to review some concepts about embeddings.


### Embeddings

As discussed in the module on NLP, an _embedding_ is a mapping from a set of words or "tokens" to a high-dimensional Euclidean space.
In machine learning, a good embedding is one that **encodes semantic relationships between words** and whose axes **represent some quality or attribute of words** (e.g. sentiment, whether the word denotes a big object, etc.).

![Example of what an embedding might look like in three dimensions](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*SYiW1MUZul1NvL1kc1RxwQ.png)
<div class="alert alert-block alert-info">
Tokens and words are not exactly the same thing. Besides a word, a token might be a class of punctuation symbols, a suffix or prefix of a word, or a placeholder for words not in the training dataset.
</div>

Some of the most used tools for embeddings are [word2vec](https://code.google.com/archive/p/word2vec/) and [GloVe](https://nlp.stanford.edu/projects/glove/).
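
To make "encoding semantic relationships" concrete, here is a minimal sketch that compares words by the cosine similarity of their vectors. The three vectors in `toy_embedding` are hypothetical, hand-picked values for illustration; they do not come from word2vec, GloVe, or any trained model.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
toy_embedding = {
    "dog":   (0.9, 0.8, 0.1),
    "puppy": (0.8, 0.9, 0.2),
    "car":   (0.1, 0.2, 0.9),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_dog_puppy = cosine_similarity(toy_embedding["dog"], toy_embedding["puppy"])
sim_dog_car = cosine_similarity(toy_embedding["dog"], toy_embedding["car"])

# Related words should point in more similar directions than unrelated ones.
print(sim_dog_puppy > sim_dog_car)
```

In a good embedding we would expect related words such as "dog" and "puppy" to score higher than unrelated ones such as "dog" and "car", which is what these toy vectors were built to show.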


## Queries, Keys and Values:

The core of the architecture is its multi-head attention blocks.

![Transformer architecture](https://daleonai.com/images/screen-shot-2021-05-06-at-12.12.21-pm.png)

To understand how they work, we first need to understand what we mean by "attention". Let's take a look at the following formula:
$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}(QK^T / \sqrt{d_k})V
$$
This is known as _scaled dot-product attention_. Let's break it down:


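- **Queries $Q$:** The embedding vectors that represent what the model is looking for.
- **Keys $K$:** The embedding vectors that represent what is being attended to. Each query is compared against every key.
- **Values $V$:** The information associated with each key. The output is a weighted average of the values.
- **Dimension $d_k$:** The dimension of each key vector. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.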
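
The role of $\sqrt{d_k}$ can be motivated with a short calculation. As a simplifying assumption (not something that holds exactly for trained embeddings), suppose the components $q_i$ and $k_i$ of a query and a key are independent random variables with mean $0$ and variance $1$. Then

$$
\begin{align*}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \\
\mathbb{E}[q \cdot k] &= \sum_{i=1}^{d_k} \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k.
\end{align*}
$$

Under this assumption the typical size of a dot product grows like $\sqrt{d_k}$, so $q \cdot k / \sqrt{d_k}$ has variance $1$ regardless of the dimension.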


**Example:** In a sentence translation task, the query might be a word in the target language, the keys might be words in the source language, and the values might be their corresponding translations. We'll implement it in simple Python.

```python
import math

# Let's create a simple 2D embedding for a bag of words.
wordbag = ['El', 'perro', 'tiene', 'el', 'pelo', 'suelto',
           'The', 'dog', 'has', 'loose', 'hair']

# For this example, we can ignore capitalization.
# Note that 'El' and 'el' then collapse into the single key 'EL'.
wordbag = [word.upper() for word in wordbag]

def circle_embedding(dataset):
    """
    Maps every element of the dataset to a point on the unit circle: list -> dict
    """
    d_s = len(dataset)
    return {
        word: (math.cos(2 * math.pi * i / d_s), math.sin(2 * math.pi * i / d_s))
        for i, word in enumerate(dataset)
    }

embedding = circle_embedding(wordbag)

# Let's create our Q, K and V. For this example we will translate 'PERRO'.
Q = embedding['PERRO']
K = [embedding[w] for w in ['PERRO', 'TIENE', 'PELO']]
V = [embedding[w] for w in ['DOG', 'HAS', 'HAIR']]

def softmax(v):
    """
    Simple softmax implementation for 1D vectors.
    """
    exps = [math.exp(e) for e in v]
    exp_sum = sum(exps)
    return [e / exp_sum for e in exps]

def dot_attention(Q, K, V):
    """
    Calculate the scaled dot-product attention weights for Q, K and V.
    """
    d_k = len(K[0])  # dimension of each key vector, not the number of keys
    scores = [(Q[0] * k[0] + Q[1] * k[1]) / math.sqrt(d_k) for k in K]
    # We return the softmax weights for easier visualization of the probabilities.
    # The full attention output would be the weighted sum of the values:
    # [sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))]
    return softmax(scores)

print([round(w, 3) for w in dot_attention(Q, K, V)])
```

```
[0.427, 0.382, 0.191]
```


As the example above shows, attention by itself doesn't have any learnable parameters. It acts as a weighting over the values, which is useful in the following ways:


- **Weighting mechanism:** Assigns weights to different parts of the input sequence.
- **Focus on relevant information:** Allows the model to focus on the most relevant parts of the input.
- **Example:** In a machine translation task, the attention mechanism might focus on the part of the source sentence that is most relevant to the current word in the target sentence.
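
The weighting described above can be sketched end to end for a single query. This is a minimal illustration, not the paper's implementation: the `query`, `keys` and `values` below are hypothetical toy vectors, and the `attention` helper is a name introduced here for the example.

```python
import math

def softmax(scores):
    """Softmax over a list of floats (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(keys[0])  # dimension of each key vector
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one coordinate at a time.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Hypothetical toy data: the query is aligned with the first key,
# orthogonal to the second, and opposite to the third.
query = (1.0, 0.0)
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
values = [(10.0, 0.0), (0.0, 10.0), (-10.0, 0.0)]

weights, output = attention(query, keys, values)
print(weights)  # largest weight goes to the key most similar to the query
print(output)   # the values blended according to those weights
```

The weights sum to one, so the output is always a blend of the value vectors, pulled toward the values whose keys best match the query.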


### What's self-attention?



For NLP tasks, especially predictive text tasks, we need to gather **meaning** and **context** from the input.
This is where self-attention comes into play. We extend the attention mechanism in the following manner:

- The _self_ in self-attention comes from the fact that $Q = K = V$: the queries, keys and values all come from the same input sequence.
- Each of $Q$, $K$ and $V$ gets its own learned **embedding**, obtained by multiplying it by one of the trainable square weight matrices $W^Q$, $W^K$ and $W^V$.

$$
\operatorname{SelfAttention}(Q, K, V) = \operatorname{Attention}(QW^Q, KW^K, VW^V).
$$

Remember that if $q \in \mathbb{R}^{d_q}$ is a row of the matrix $Q$, then $W^Q$ is a $d_q \times d_q$ matrix. The same goes for $K$ and $V$.
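
The formula above can be sketched in plain Python, with one token per row of the input. The input `x` and the weight matrices `w_q`, `w_k` and `w_v` are hypothetical hand-picked values; in a real model the weights would be learned during training. The helper names are also introduced here only for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Softmax over a list of floats (shifted by the max for stability)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with one token per row."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def self_attention(x, w_q, w_k, w_v):
    """Q, K and V are all projections of the same input x."""
    return attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

# Hypothetical input: 3 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical square weight matrices; in a real model these are learned.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0]]
w_v = [[2.0, 0.0], [0.0, 2.0]]

out = self_attention(x, w_q, w_k, w_v)
print(out)  # one 2-dimensional output row per input token
```

Because the three projections differ, the same token plays a different role as a query, as a key and as a value, even though all three start from the same input.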

![Multi-Head Attention](https://www.researchgate.net/publication/359127201/figure/fig4/AS:1140931101241348@1649030580668/The-structure-of-multi-head-attention-mechanism.ppm)

The actual Transformer architecture uses a mechanism called multi-head attention. This mechanism has some advantages:

**Projections as heads:**

- **Parallel attention mechanisms:** Project the input into several lower-dimensional spaces and run an attention mechanism on each one in parallel.
- **Different perspectives:** Each head can focus on different aspects of the input.

**Different heads encode different meanings:**

- **Diverse understanding:** Can capture different relationships between words.
- **Improved performance:** Enhances the model's ability to understand complex language patterns.

**Embeddings as a way to derive the meaning of a word from context:**

- **Contextual understanding:** Learn to capture the meaning of words based on their context.
- **Semantic relationships:** Represent semantic relationships between words.

The multi-head attention mechanism is implemented by taking $h$ projections of $Q$, $K$ and $V$ into lower-dimensional spaces of dimension $d_k = d_{\operatorname{model}}/h$. Each _head_ $\mathrm{head}_i$ uses its own set of projection matrices $W_i^Q, W_i^K, W_i^V$, and the attention outputs of all heads are concatenated. In the paper, the concatenated result is also multiplied by an output projection matrix $W^O$:

$$
\begin{align*}
\operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \\
\text{where  } \mathrm{head}_i &= \operatorname{Attention}(QW_i^Q, KW_i^K, VW_i^V).
\end{align*}
$$
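
The per-head projections and the concatenation step can be sketched as follows. This is a toy illustration under stated assumptions: $d_{\operatorname{model}} = 4$ and $h = 2$, so $d_k = 2$, and the projection matrices `select_first` and `select_last` are hypothetical fixed matrices that simply pick out two coordinates each, where a real model would learn them. The paper additionally applies an output projection $W^O$ after concatenation, which this sketch omits.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Softmax over a list of floats (shifted by the max for stability)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with one token per row."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(q, k, v, heads):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate along the feature axis: row i joins the i-th row of every head.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(q))]

# Hypothetical projections of shape d_model x d_k = 4 x 2.
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(select_first, select_first, select_first),
         (select_last, select_last, select_last)]

# Hypothetical input: 3 tokens with d_model = 4. Self-attention uses Q = K = V = x.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]

out = multi_head(x, x, x, heads)
print(out)  # 3 rows, each of length h * d_k = 4
```

Each head only sees its own 2-dimensional projection of the tokens, so the two heads can assign different attention weights to the same pair of tokens before their outputs are joined back into a $d_{\operatorname{model}}$-dimensional vector.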
