# Attention and Self-Attention

_Attention Is All You Need_ is the influential paper published by Google Research in 2017 that led to a whole new class of NLP models named "Transformers" and their derivatives.
We will go through the stellar concept of "self-attention", which this paper used as the basis of the Transformer architecture.

# What's the attention mechanism introduced in the paper "Attention is all you need"?

Before delving into the different kinds of attention discussed in the paper, we need to review some concepts about embeddings.


### Embeddings

As discussed in the module on NLP, an _embedding_ is a mapping from a set of words or "tokens" to a high-dimensional Euclidean space.
In Machine Learning, a good embedding is one that **can encode semantic relationships between words** and whose axes **represent some quality or attribute of words** (e.g. sentiment, whether the word names a big object, etc.).

![Example of what an embedding might look like in three dimensions](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*SYiW1MUZul1NvL1kc1RxwQ.png)
<div class="alert alert-block alert-info">
Tokens and words are not exactly the same thing. Besides whole words, a token might be a class of punctuation symbols, a prefix or suffix of a word, or a placeholder for words that are not in the training dataset.
</div>

Two of the most widely used tools for embeddings are [word2vec](https://code.google.com/archive/p/word2vec/) and [GloVe](https://nlp.stanford.edu/projects/glove/).
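To get a feeling for what "encoding semantic relationships" means, here is a minimal sketch that compares words through the cosine of the angle between their vectors. The three vectors and the meaning of their axes are made up for illustration; they do not come from word2vec, GloVe or any trained model.

```python
import math

# Hypothetical 3-D embeddings; the axes are invented for this example
# (roughly: "is an animal", "is large", "is a vehicle").
toy_embedding = {
    "dog":   (0.9, 0.3, 0.0),
    "wolf":  (0.9, 0.5, 0.0),
    "truck": (0.0, 0.8, 0.9),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_dog_wolf = cosine_similarity(toy_embedding["dog"], toy_embedding["wolf"])
sim_dog_truck = cosine_similarity(toy_embedding["dog"], toy_embedding["truck"])
print(f"dog vs wolf:  {sim_dog_wolf:.3f}")
print(f"dog vs truck: {sim_dog_truck:.3f}")
```

With these toy vectors, "dog" ends up much closer to "wolf" than to "truck", which is the kind of structure we hope a trained embedding captures.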


## Queries, Keys and Values:

The focal point of the architecture is its multi-head attention blocks.

![Transformer architecture](https://daleonai.com/images/screen-shot-2021-05-06-at-12.12.21-pm.png)

To understand how they work, we need to understand what we mean by "attention". Let's take a look at the following formula:
$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}(QK^T / \sqrt{d_k})V
$$
This is known as _scaled dot product attention_. Let's break it down:


- **Queries $Q$:** A matrix whose rows are vectors that represent what the model is looking for.
- **Keys $K$:** A matrix whose rows represent what is being attended to. Each query is compared against every key.
- **Values $V$:** A matrix whose rows hold the information associated with each key.
- **Dimension $d_k$:** The dimension of each query and key vector (not the number of keys).


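**Example:** In a sentence translation task, the query might be a word in the target language, the keys might be words in the source language, and the values might be the information associated with those source words. We'll implement a toy version in simple Python, using a Spanish word as the query, Spanish words as the keys and their English translations as the values.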

```python
import math

# Let's create a simple 2D embedding for a bag of words
wordbag = ['El', 'perro', 'tiene', 'el', 'pelo', 'suelto', 'The', 'dog', 'has', 'loose', 'hair']

# For this example, we can ignore capitalization.
# Note: 'El' and 'el' collapse into one key, and the dict keeps the last position.
wordbag = [word.upper() for word in wordbag]

def circle_embedding(dataset):
    """
    Maps every element of the dataset to a point on the unit circle: list -> dict
    """
    d_s = len(dataset)
    return {
        word: (math.cos(2 * math.pi * i / d_s), math.sin(2 * math.pi * i / d_s))
        for i, word in enumerate(dataset)
    }

embedding = circle_embedding(wordbag)

# Let's create our Q, K and V. For this example we will translate 'PERRO'.
Q = embedding['PERRO']
K = [embedding[w] for w in ['PERRO', 'TIENE', 'PELO']]
V = [embedding[w] for w in ['DOG', 'HAS', 'HAIR']]

def softmax(v):
    """
    Simple softmax implementation for 1D vectors.
    """
    exps = [math.exp(e) for e in v]
    exp_sum = sum(exps)
    return [e / exp_sum for e in exps]

def dot_attention(Q, K, V):
    """
    Scaled dot-product attention weights for a single query Q.
    V is accepted to mirror the formula, but we return only the softmax
    weights (not the weighted sum of V) for easier visualization.
    """
    d_k = len(Q)  # dimension of each query/key vector (2 here), not the number of keys
    scores = [(Q[0] * k[0] + Q[1] * k[1]) / math.sqrt(d_k) for k in K]
    return softmax(scores)

print([round(w, 3) for w in dot_attention(Q, K, V)])
```

```
[0.427, 0.382, 0.191]
```
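The cell above stops at the softmax weights. As a minimal sketch of the last step of the formula, the function below also multiplies those weights by $V$ to produce the attention output. The function name `scaled_dot_product_attention` and the 2-D vectors are our own choices for illustration, not something taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, keys, values):
    """Attention for one query vector q against lists of key and value vectors."""
    d_k = len(q)  # dimension of each query/key vector
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output coordinate at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

# Hypothetical 2-D vectors, chosen only for illustration.
q = (1.0, 0.0)
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
values = [(10.0, 0.0), (0.0, 10.0), (-10.0, 0.0)]

weights, output = scaled_dot_product_attention(q, keys, values)
print(weights)
print(output)
```

The output is a blend of the value vectors, pulled towards the values whose keys point in the same direction as the query.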


As we saw earlier, attention by itself doesn't really have any learnable parameters, but it serves several purposes:


- **Weighting mechanism:** Assigns a weight to each part of the input sequence.
- **Focus on relevant information:** Allows the model to focus on the most relevant parts of the input.
- **Example:** In a machine translation task, the attention mechanism might focus on the part of the source sentence that is most relevant to the current word in the target sentence.


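**Scaled dot-product attention:**

- **Calculation:** The same as plain dot-product attention, except that the scores are divided by $\sqrt{d_k}$ before the softmax.
- **Formula:** $\operatorname{Attention}(Q, K, V) = \operatorname{softmax}(QK^T / \sqrt{d_k})V$
- **Benefits:** For large values of $d_k$ the dot products grow large in magnitude and push the softmax into regions where its gradients are extremely small. The scaling factor counteracts this.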
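To see why the scaling matters, here is a small experiment under assumed settings ($d_k = 512$, 8 keys, components drawn from a standard normal distribution, an arbitrary random seed). It compares the largest softmax weight with and without dividing by $\sqrt{d_k}$.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # arbitrary seed, only for reproducibility
d_k = 512       # assumed key dimension
n_keys = 8      # assumed number of keys

query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw_scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

max_raw = max(softmax(raw_scores))
max_scaled = max(softmax(scaled_scores))
print(f"largest weight without scaling: {max_raw:.4f}")
print(f"largest weight with scaling:    {max_scaled:.4f}")
```

Without scaling, the spread of the scores grows with $d_k$, so the softmax tends to put almost all of its weight on a single key. That saturated regime is where the gradients become very small.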

### How did sequential RNNs work and perform?

**LSTM blocks:**

- **Long Short-Term Memory:** A type of recurrent neural network (RNN) designed to address the vanishing gradient problem.
- **Components:** Input gate, forget gate, output gate, cell state.
- **Benefits:** Can learn long-range dependencies in sequential data.
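The components listed above can be sketched as a single step of a scalar LSTM cell. This follows the common textbook formulation; the parameter names (`w_i`, `u_i`, `b_i`, and so on) and their values are hypothetical, and a real LSTM uses vectors and weight matrices instead of single numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds hypothetical weights and biases."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate cell value
    c = f * c_prev + i * g   # new cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # new hidden state
    return h, c

# All parameters set to 0.5 purely for illustration.
params = {name: 0.5 for name in (
    "w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
    "w_o", "u_o", "b_o", "w_g", "u_g", "b_g")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.2f}  h={h:+.4f}  c={c:+.4f}")
```

When the forget gate stays close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is how the cell can hold on to information over many steps.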

**Forgetfulness:**

- **Vanishing gradient problem:** As gradients are propagated back through many time steps, they can become very small, making it difficult for the model to learn long-range dependencies.
- **Impact:** Limits the ability of RNNs to capture long-term dependencies.
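A toy calculation makes the vanishing gradient visible. The sketch below assumes a scalar RNN $h_t = \tanh(w_h h_{t-1} + w_x x_t)$ with made-up weights and a constant input, and tracks how strongly the final hidden state still depends on the initial one.

```python
import math

def gradient_through_time(inputs, w_h=0.5, w_x=1.0):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w_h*h_{t-1} + w_x*x_t)."""
    h = 0.0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        grad *= w_h * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
        history.append(abs(grad))
    return history

history = gradient_through_time([0.1] * 50)
for step in (1, 10, 25, 50):
    print(f"after {step:2d} steps: |dh_t/dh_0| = {history[step - 1]:.3e}")
```

Every step multiplies the gradient by a factor smaller than 1, so after a few dozen steps the contribution of the early inputs is practically zero.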

**Sequential RNN model:**

- **Processing:** Processes input sequence one token at a time.
- **Hidden states:** Stores information about the previous inputs.
- **Limitations:** Each step depends on the previous hidden state, so the computation cannot be parallelized across the sequence and becomes expensive for long sequences.

### What's self-attention?

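**Embedding matrices:**

- **Representations:** Convert words into numerical representations. In self-attention the queries, keys and values are all computed from the same sequence, by multiplying its embeddings with three different matrices.
- **Learning:** Learned from the training data.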
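To make the role of these matrices concrete, here is a minimal sketch of single-head self-attention in plain Python. The token embeddings `x` and the projection matrices `w_q`, `w_k` and `w_v` are hypothetical values picked by hand; in a trained model they would be learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: queries, keys and values all come from x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three tokens with hypothetical 2-D embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical projection matrices; in a trained model these are learned.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [0.0, 2.0]]

output, weights = self_attention(x, w_q, w_k, w_v)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` tells us how much one token attends to every token in the same sequence, itself included.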

**Dot product as a way to measure similarity:**

- **Calculation:** Calculates the dot product between queries and keys.
- **Similarity measure:** For vectors of comparable length, the larger the dot product, the more closely the query and key point in the same direction.

**Trained embeddings as a mechanism to encode text characteristics:**

- **Contextual understanding:** Capture the meaning of words based on their context.
- **Semantic relationships:** Learn to represent semantic relationships between words.

**Masking in attention:**

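- **Prevention:** Prevents the model from attending to future tokens in the sequence, by setting their scores to $-\infty$ before the softmax.
- **Causality:** Ensures that the model only considers information from the current and earlier positions, never from the future.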
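Masking can be sketched by replacing the scores of future positions with $-\infty$ before the softmax, so that their weights come out as exactly zero. The score matrix below is hypothetical, and the helper names are our own.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over scores, treating disallowed positions as -infinity."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Row i may only attend to positions 0..i."""
    n = len(scores)
    return [masked_softmax(row, [j <= i for j in range(n)])
            for i, row in enumerate(scores)]

# Hypothetical raw attention scores for a 4-token sequence.
scores = [[0.2, 1.0, -0.3, 0.5],
          [0.7, 0.1, 0.9, -1.2],
          [0.0, 0.4, 0.4, 2.0],
          [1.5, -0.5, 0.3, 0.8]]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed matrix is lower triangular: the first token can only attend to itself, while the last token can attend to the whole sequence.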

### What's multi-headed attention?

**Projections as heads:**

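- **Parallel attention mechanisms:** The queries, keys and values are projected several times with different learned linear projections, attention runs on each projection in parallel, and the results are concatenated and projected once more.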
- **Different perspectives:** Each head can focus on different aspects of the input.

**Different heads encode different meanings:**

- **Diverse understanding:** Can capture different relationships between words.
- **Improved performance:** Enhances the model's ability to understand complex language patterns.
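Here is a minimal sketch of two heads running in parallel, in plain Python. Everything in the setup is an assumption made for illustration: the embeddings `x`, the hand-picked projections that make each head read a different half of the features, and an identity matrix as the output projection `w_o`. In a trained model all of these projections are learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for matrices given as lists of rows."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples, one per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the head outputs along the feature axis, then mix with w_o.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Hypothetical setup: 3 tokens, d_model = 4, two heads with d_k = d_v = 2.
x = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5],
     [1.0, 1.0, 0.5, 0.5]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # reads features 0-1
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # reads features 2-3
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

output = multi_head_attention(x, heads, w_o)
for row in output:
    print([round(v, 3) for v in row])
```

Because each head sees a different projection of the same tokens, the two heads compute different attention weights, and the concatenated output combines both views.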

**Embeddings as a way to derive the meaning of a word from its context:**

- **Contextual understanding:** Learn to capture the meaning of words based on their context.
- **Semantic relationships:** Represent semantic relationships between words.
