## A primer on transformers

This very basic introduction to self-attention and transformer models in deep learning is structured in the following sections:

* [General Introduction](#general-intro)
* [Basic attention concept](#attention-concept)
* [Detailed attention mechanism](#attention-mechanism)

## General intro <a name="general-intro"></a>

**Transformers** are a spectacularly effective type of neural network architecture that was introduced with the seminal paper "**Attention is all you need**" (Vaswani et al. 2017: [here](https://arxiv.org/abs/1706.03762); largely an effort from scientists at *Google Brain*).

- RNN, LSTM, GRU etc.: calculations are sequential, so no parallelization across time steps is possible (a severe computational limit)
- transformers offer a solution to this problem: transformers still capture long-range dependencies in the data (even more so than LSTM/GRU), and at the same time are amenable to parallelization
- transformers are an innovative approach that dispenses with recurrence and convolutions entirely

Transformers are based on **self-attention** (building block of transformers).
**Attention** is an innovative concept that revolutionized deep learning (initially introduced by [Bahdanau et al. 2014](https://arxiv.org/abs/1409.0473)).

- traditional basic **encoder/decoder** = **embedding layer** + LSTM/GRU units --> single context vector
- **attention** adds separate paths (vectors), one for each input value, from the encoder to the decoder
- for each embedded term (vector of continuous numbers on learned features) in the encoder, we can calculate **similarity scores** (e.g. cosine similarities) with terms in the decoder
- in this way, the network will give more weight to pairs with higher similarity scores
- these similarity scores, transformed through an activation function (usually softmax) and scaled, are the **attention values**
- with attention, the encoder basically remains the same (in a simple implementation, it can basically be an embedding layer, maybe + LSTM/GRU)
- however, the decoder now has access to the attention values (the separate paths) for each input relative to the output, and these values help predict the decoded output

**Attention** $\rightarrow$ **Cross-Attention** $\rightarrow$ **Self-Attention** $\rightarrow$ **Multi-Head Attention**

Self-attention refers to an attention mechanism that relates different positions
of a single sequence in order to compute a representation of the input data.
Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.

<!-- <img src="transformer.png" alt="transformer" style="width: 400px;"/> -->

![](https://drive.google.com/uc?export=view&id=1MTfoDOb0J7EbYOCFotiqnaMbeHnAdASB)

<u>Figure above</u>: left, **encoder**: stack of N = 6 layers, each with two sub-layers: i) multi-head self-attention mechanism on input embeddings (e.g. first language vocabulary), and ii) fully connected feed-forward network, each followed by layer normalization. Right, **decoder**: stack of N = 6 layers, each with i) multi-head self-attention on output embeddings (second language vocabulary); ii) multi-head attention over the output of the encoder; iii) fully connected feed-forward network. As in the encoder, layer normalization is employed. (Multi-head attention means that several attention functions, the "heads", run in parallel on different learned projections of the same input, and their outputs are concatenated.)

A **transformer** is a sequence-to-sequence (image-to-sequence, sequence-to-image, image-to-image etc.) **encoder-decoder model**: a simple transformer is similar to an encoder-decoder RNN model, the only difference being that the RNN layers are replaced with self-attention layers.
Both the **encoder** and the **decoder** consist of **multiple layers** of **self-attention** and **feed-forward networks**.

- **transformers** excel at modeling **sequential data** (NLP) - but also image data etc.
- unlike RNNs, **transformers are parallelizable**: the main reason is that **transformers replace recurrence with attention** (attention values can be calculated simultaneously for different parts of the sequence data). Outputs can also be computed in parallel (instead of one after the other, as in RNNs)
- **transformers** can capture long-range contexts and dependencies in the data better than RNNs or CNNs; thus, longer connections can be learned.
- **attention** allows each sequence position to have access to the entire input at each layer, while in RNNs and CNNs, the information needs to pass through many sequential processing steps, which makes it harder to learn
- transformers make **no assumptions about the temporal/spatial relationships** across the data
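
The contrast between sequential recurrence and parallel attention can be sketched in a few lines of NumPy. This is a minimal illustration on random toy data, not a full RNN or transformer layer: the names `W_h` and `W_x` and the sizes `T` and `d` are assumptions introduced here, and the attention part uses the raw inputs as queries, keys and values (no learned projections).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4                      # sequence length and feature size (toy values)
X = rng.normal(size=(T, d))      # one feature vector per sequence position

# Recurrent update: step t needs the hidden state from step t-1,
# so this loop cannot be parallelized across positions.
W_h = rng.normal(size=(d, d)) * 0.1   # random stand-ins for learned weights
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    rnn_states.append(h)
rnn_states = np.array(rnn_states)

# Self-attention: all pairs of positions are compared in one matrix product,
# and no row of the result depends on another row.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
attended = weights @ X

print(rnn_states.shape, attended.shape)   # (5, 4) (5, 4)
```

Both computations return one vector per position, but only the second one is free of a step-by-step dependency.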

The first step is **input embedding**, which in the case of NLP is a **word embedding layer**: each word in the input (dictionary) maps to a vector of continuous values.
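
As a rough sketch, an embedding layer can be thought of as a lookup table with one row per word. The vocabulary, the embedding size `d_model` and the random matrix below are hypothetical placeholders; in a real model the rows are learned during training.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> integer index
vocab = {"the": 0, "weather": 1, "is": 2, "sunny": 3}

d_model = 3                                   # embedding size (toy value)
rng = np.random.default_rng(0)
# One row per word; random stand-in for weights that would be learned
embedding_matrix = rng.normal(size=(len(vocab), d_model))

sentence = ["the", "weather", "is", "sunny"]
token_ids = np.array([vocab[w] for w in sentence])

# The embedding layer is a row lookup: one vector per input token
embedded = embedding_matrix[token_ids]
print(embedded.shape)   # (4, 3)
```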

The second step is **positional encoding**: RNNs explicitly model the order in sequence data by processing the input elements sequentially; transformers don't (they use attention instead). Therefore, the position in the sequence is introduced in the model using `sine` and `cosine` functions of the position (sine for the even dimensions of the encoding vector, cosine for the odd ones).
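
A minimal NumPy sketch of the sinusoidal encoding from Vaswani et al. 2017 is given below. The function name `positional_encoding` and the toy sizes are chosen here for illustration; the formula is $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, one row per sequence position."""
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]     # values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even embedding dimensions
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])  # odd embedding dimensions
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
print(pe.shape)   # (4, 6)
```

The resulting matrix has the same shape as the embedded input and is added to it element-wise, so that each token vector also carries information about its position.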

Now we have the encoder: first, the **multi-head attention** module, where the network learns to associate words in the input text (e.g. "weather" with "sunny"). It is called multi-head because several attention functions (heads) are computed in parallel, each with its own learned projections of the input, and their outputs are then combined.
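
The idea of several heads working in parallel can be sketched as follows. This is a simplified illustration under stated assumptions (no masking, no biases, random matrices standing in for learned weights); the function name `multi_head_self_attention` and the output projection `W_O` are introduced here and are not part of the text above.

```python
import numpy as np

def softmax_last_axis(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Toy multi-head self-attention for one sequence X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    # Each head computes its own scaled dot-product attention
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    weights = softmax_last_axis(scores)
    heads = weights @ V                                    # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 6, 2      # toy sizes
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)   # (4, 6)
```

Each head works on a slice of size `d_model // num_heads`, so the total cost stays close to that of a single full-size head.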

## Attention <a name="attention-concept"></a>

Attention in ML is the ability of the model to automatically, dynamically and independently **highlight** and **use** the **salient parts of the input data**:

- "reading" the raw data (e.g. words in a text) and converting these into distributed representations (one feature vector associated with each word position) $\rightarrow$ This is the **encoder**!
- list of feature vectors storing the output of the "reader" part of the model: sort of "memory" that can be retrieved later, not necessarily in the same order, without having to check/inspect all of them $\rightarrow$ (some) encoded feature vectors + previous decoder's hidden states = on what to focus to produce the output
- "exploits" the content of the memory to sequentially perform a task, at each time step having the ability to focus (bestow attention) on the content of one memory element (or a few, with different weights) $\rightarrow$ at each time step, the score values (encoded feature vectors + decoder's hidden states) are normalized (softmax) $\rightarrow$ **weights**: the encoded vectors are then scaled by the weights and summed $\rightarrow$ **context vector**, which is then fed into the decoder to generate the final output


1. **Alignment scores**: the encoded feature vectors ($\text{h}_i$) and the previous decoder hidden states ($\text{s}_{t-1}$) are used to compute a score (match between the input sequence and the current output at each position $t$), implemented with a feed-forward neural network:
  - $\text{e}_{t,i} = \text{a}(\text{s}_{t-1}, \text{h}_i)$

2. **Weights**: obtained by applying a softmax function to the alignment scores:
  - $\alpha_{t,i} = \text{softmax}(\text{e}_{t,i})$

3. **Context vector**: fed to the decoder at each time step $t$, and computed as weighted sum of the encoder feature vectors:
  - $\text{c}_t = \sum_{i=1}^T \alpha_{t,i} \text{h}_i$
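
The three steps above can be sketched for a single decoding step $t$. The scoring function $\text{a}$ is assumed here to be a one-hidden-layer feed-forward network, $\text{v}^\top \tanh(W_s \text{s}_{t-1} + W_h \text{h}_i)$, in the spirit of Bahdanau et al. 2014; the names `W_s`, `W_h`, `v` and all sizes are hypothetical, and the random matrices stand in for learned weights.

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoding step of additive attention.

    s_prev: previous decoder hidden state s_{t-1}, shape (d_s,)
    H:      encoder feature vectors h_i stacked by row, shape (T, d_h)
    """
    # 1. alignment scores e_{t,i} = a(s_{t-1}, h_i)
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v        # (T,)
    # 2. weights alpha_{t,i}: softmax over the input positions i
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # 3. context vector c_t = sum_i alpha_{t,i} h_i
    c = alpha @ H                                   # (d_h,)
    return alpha, c

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 4, 3, 5, 8            # toy sizes (hypothetical)
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))        # random stand-ins for learned weights
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)

alpha, c = additive_attention(s_prev, H, W_s, W_h, v)
print(alpha.shape, c.shape)   # (4,) (3,)
```

The weights `alpha` sum to one over the input positions, and `c` has the same size as a single encoder feature vector.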

### Attention mechanism <a name="attention-mechanism"></a>

**Input** + **query**:

- e.g. input = image; query = 'what's the weather like?'
- the network will probably have to <u>attend</u> to (focus on) the sky first
- this translates to computing the **similarity** between the **input matrix** and the **query vector** $\rightarrow$ **similarity score**

The triplet is: i) **query** (Q), ii) **key** (K), iii) **value** (V): queries are executed against key-value pairs

- Q: $\text{s}_{t-1}$: previous decoder's hidden states
- K: $\text{h}_i$: encoded feature vectors
- V: $\text{h}_i$: encoded feature values

Similarly to the basic attention concept (above): i) Q and K are multiplied together (dot product) to produce a **matrix of scores** which reflects how much attention to pay to each query-key pair;
ii) the matrix of scores is then normalised (`softmax function`) to produce the **attention weights**;
iii) attention weights are then multiplied by the values (V) to obtain the output.

1. each **query vector** $\text{q} = \text{s}_{t-1}$ is matched against keys (K: key vectors $\text{k}_i$) to compute a **score value** (dot/scalar product):
  - $\text{e}_{q,k_i} = \text{q} \cdot \text{k}_i$

2. scores are normalised (softmax) to generate **weights**:
  - $\alpha_{q,k_i} = \text{softmax}(\text{e}_{q,k_i})$

3. generalized attention is then computed as weighted sum of the value vectors (fetched from corresponding keys):
  - $\text{attention}(q,K,V) = \sum_i{\alpha_{q,k_i} \text{v}_{k_i}}$
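
When all queries, keys and values are stacked as the rows of matrices $Q$, $K$ and $V$, the three steps above collapse into a single expression. The form below is the scaled variant from Vaswani et al. 2017, where the scores are additionally divided by $\sqrt{d_k}$; the symbol $d_k$ (dimensionality of the key vectors) is introduced here, and the softmax is applied row-wise, i.e. separately for each query:

$$
\text{attention}(Q,K,V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

Row $q$ of the result is $\sum_i \alpha_{q,k_i} \text{v}_{k_i}$, with $\alpha_{q,k_i}$ obtained from the scaled scores $\text{e}_{q,k_i} / \sqrt{d_k}$.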

#### Numerical example

([adapted from here](https://machinelearningmastery.com/the-attention-mechanism-from-scratch/))

Let's start with the input embeddings: in practice, these would be generated by an encoder (embedding layer); here we declare the embeddings manually for four example words:

In [None]:
import numpy as np
import scipy as sp

In [None]:
# encoder representations of four different words
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 1, 1])

Next, we generate the weight matrices that are multiplied with the word embeddings to produce the queries, keys, and values (in actual practice, these weights would be learned during training $\rightarrow$ **learning the weights** is always the job of the network!)

In [None]:
# generating the weight matrices
np.random.seed(42) # to allow us to reproduce the same attention values
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

In [None]:
# generating the queries (decoder hidden states at t-1), keys (encoded feature vectors) and values (encoded feature values)
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V

In [None]:
print(word_1)
print(query_1)
print(key_1)
print(value_1)

We now have all the ingredients to calculate the alignment scores at time step $t$ for each input word (embedded):

In [None]:
from numpy import dot

# scoring the first query vector against all key vectors
scores = np.array([dot(query_1, key_1), dot(query_1, key_2), dot(query_1, key_3), dot(query_1, key_4)])

We now **normalise the score values**: i) divide the score values by the square root of the dimensionality of the key vectors (in this case, three); ii) softmax.

In this way, the **weights** are obtained:

In [None]:
from scipy.special import softmax

# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)

In [None]:
print(weights)

Finally, the **attention output** is calculated as weighted sum of all four value vectors:

In [None]:
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)

In a more compact matrix representation we can code this as:

In [None]:
# stacking the word embeddings into a single array
words = np.array([word_1, word_2, word_3, word_4])

# generating the weight matrices
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)

We now have attention values for each word (input element in the sequence) and for each embedded feature (3 in this toy example): a 4 × 3 matrix with one row per word.
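
As a sanity check, the step-by-step computation for the first word and the matrix version can be compared directly. The sketch below wraps the matrix version in a helper named `scaled_dot_product_attention` (a name chosen here) and uses a plain NumPy softmax in place of `scipy.special.softmax`, so that the block is self-contained.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (attention output, attention weights) for row-stacked Q, K, V."""
    weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

# Same toy embeddings and weight matrices as in the cells above
words = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]])
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

attention, weights = scaled_dot_product_attention(words @ W_Q, words @ W_K, words @ W_V)

# Per-word computation for the first word, as in the step-by-step cells
query_1 = words[0] @ W_Q
scores_1 = np.array([query_1 @ (w @ W_K) for w in words])
weights_1 = softmax_rows(scores_1 / np.sqrt(3))
attention_1 = sum(a * (w @ W_V) for a, w in zip(weights_1, words))

print(attention.shape)                          # (4, 3)
print(np.allclose(weights.sum(axis=1), 1.0))    # True
print(np.allclose(attention[0], attention_1))   # True
```

Each row of `weights` sums to one, and the first row of the matrix result matches the per-word calculation.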