# Transformer core concepts

This Jupyter notebook explores the core concepts of the Transformer model, a powerful architecture widely used in natural language processing (NLP) tasks. The notebook covers the following concepts:

1. __Matrix multiplication__: A refresher on matrix multiplication, which is a fundamental operation used in the Transformer model.
1. __Embedding vectors__: An explanation of how embedding vectors are used to represent words or subwords as numerical vectors, capturing their meaning and relationships.
1. __Positional encoding__: An introduction to positional encoding, a technique used in the Transformer model to inject information about word order or position into the embedding vectors.
1. __Self-attention__: An exploration of self-attention, a mechanism in the Transformer model that allows the model to weigh the importance of different parts of the input and capture relationships between elements.
1. __Basic self-attention__: A demonstration of basic self-attention in PyTorch, showcasing how input tokens are weighted and combined to produce attention-weighted outputs.
1. __Scaling the dot product__: An explanation of the scaling applied to the dot product in self-attention to prevent the inputs to the softmax function from growing too large.
1. __Queries, keys, and values__: An introduction to the query, key, and value components in self-attention, which enable the model to focus on relevant parts of the input and compute attention scores.
1. __Multi-head attention__: An exploration of multi-head attention, where multiple self-attention mechanisms, known as attention heads, are combined to enhance the model's ability to capture different relationships between tokens.

## Objectives

By completing this notebook, readers will:

1. Understand the core concepts of the Transformer model, including matrix multiplication, embedding vectors, positional encoding, self-attention, and multi-head attention.
1. Gain insights into how self-attention mechanisms work and their role in capturing relationships between input elements.
1. Learn how to implement basic self-attention in PyTorch and visualize the weighting and combination of input tokens.
1. Comprehend the importance of scaling the dot product in self-attention to prevent large inputs to the softmax function.
1. Gain familiarity with queries, keys, and values in self-attention and their roles in attention score computation.
1. Understand the benefits of multi-head attention in capturing different relationships between tokens and enhancing the model's performance.

With this knowledge, readers will be well-equipped to dive deeper into the Transformer model and apply it to various NLP tasks.

## Refresher: Matrix multiplication

Before we continue, however, it is a good idea to take a little refresher on _matrix multiplication_ ($*$). Let's take matrices $M$ and $N$ as an example. $M$ has 2 rows ($i$) and 4 columns ($j$), while $N$ has 4 rows and 2 columns. 

To multiply these matrices, we start by taking the dot product ($\times$) of the first row of $M$ with the first column of $N$. To do this, we multiply the first element in the first row of $M$ with the first element in the first column of $N$, then add the result to the product of the second element in the first row of $M$ and the second element in the first column of $N$, and so on for all four elements in the row and column.

We repeat this process for all rows in $M$ and all columns in $N$, and write the results into a new matrix that has the same number of rows as $M$ and the same number of columns as $N$. In this example, the resulting matrix would have 2 rows and 2 columns.

With matrix multiplication, the shapes must conform to a simple principle: the number of columns of the left-side matrix must match the number of rows in the right-side matrix. The resulting matrix will then have the row count of the left-side matrix and the column count of the right-side matrix. Let's define two matrices:

$$\begin{aligned}
M &= 2 \times 4 \enskip \text{matrix} \\
N &= 4 \times 2 \enskip \text{matrix} \\
\end{aligned}$$

To multiply the matrices we need to first check if $M_\text{columns} = N_\text{rows} \to 4 = 4$. The result of the matrix multiplication will then have the shape of $M_\text{rows} \times N_\text{cols} \to 2 \times 2$. The above process is depicted below.



![image](../../diagrams/matrix-multiplication.png)


In short, the dot product is the sum of products of values in two same-sized vectors and the matrix multiplication is a matrix version of the dot product with two matrices. The output of the dot product is a scalar whereas that of the matrix multiplication is a matrix whose elements are the dot products of pairs of vectors in each matrix.
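
The row-by-column procedure described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `matmul_by_hand` and the example values of `M` and `N` are assumptions made for this sketch, chosen to match the $2 \times 4$ and $4 \times 2$ shapes used in the text.

```python
import numpy as np


def matmul_by_hand(M, N):
    """Multiply two 2-D arrays using explicit row-by-column dot products."""
    rows, inner = M.shape
    inner_n, cols = N.shape
    # The column count of M must match the row count of N
    if inner != inner_n:
        raise ValueError(f"shape mismatch: {M.shape} and {N.shape}")
    out = np.zeros((rows, cols), dtype=np.result_type(M, N))
    for i in range(rows):
        for j in range(cols):
            # Dot product of row i of M with column j of N
            out[i, j] = sum(M[i, t] * N[t, j] for t in range(inner))
    return out


M = np.arange(1, 9).reshape(2, 4)  # 2 x 4 matrix
N = np.arange(1, 9).reshape(4, 2)  # 4 x 2 matrix
MN = matmul_by_hand(M, N)
print(MN.shape)                    # (2, 2)
print(np.array_equal(MN, M @ N))   # True
```

The `@` operator used in the last line is Python's built-in matrix multiplication operator, which NumPy arrays support.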

Keeping this in mind is important when dealing with data fed to neural networks. The data is predominantly multi-dimensional tensor data. Think of a matrix with an arbitrary number of dimensions, such as a 3-dimensional or even a 64-dimensional matrix. With NLP, the data fed to a transformer usually has three dimensions:
 - the batch size, i.e. the number of input sequences per batch
 - the number of tokens per sequence
 - the number of embedding values per token

Aligning the tensors requires extra attention even when just combining pre-trained models with a simple classification head: the outputs of preceding steps must match the expected input shapes, or errors abound.

A note of caution, though. When looking at how these are implemented in popular Python libraries, such as `numpy` or `torch`, the naming conventions seem to overlap. Let's look at `numpy` as an example. First, we initialize two differently shaped matrices, `a` and `b`:

In [22]:
import numpy as np
a = np.array([[1,2,3],[1,2,3]])
b = np.array([[0],[1],[2]])
print("a=")
print(a)
print("b=")
print(b)
print({
    "a.shape":a.shape,
    "b.shape":b.shape
})

a=
[[1 2 3]
 [1 2 3]]
b=
[[0]
 [1]
 [2]]
{'a.shape': (2, 3), 'b.shape': (3, 1)}


NumPy offers us several ways to multiply these matrices out-of-the-box. The first is the `dot` function, which produces a dot product of two _arrays_. In NumPy, arrays can have an arbitrary number of dimensions.

In [25]:
ab = np.dot(a,b)
print("ab=")
print(ab)
print("ab.shape=")
print(ab.shape)

ab=
[[8]
 [8]]
ab.shape=
(2, 1)


Then there is the `matmul` function, which produces the matrix multiplication of two _arrays_.

In [26]:
ab = np.matmul(a,b)
print("ab=")
print(ab)
print("ab.shape=")
print(ab.shape)

ab=
[[8]
 [8]]
ab.shape=
(2, 1)
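
For 2-D arrays `dot` and `matmul` give the same result, as seen above. The two differ once the arrays have more than two dimensions, which is worth knowing when working with batched data. The small sketch below uses made-up arrays `A` and `B` (not part of the original example) to show the difference in output shapes.

```python
import numpy as np

# Two stacks of matrices: 2 matrices of shape 3x4 and 2 matrices of shape 4x5
A = np.ones((2, 3, 4))
B = np.ones((2, 4, 5))

# matmul treats the leading dimension as a batch and multiplies pairwise
batched = np.matmul(A, B)
print(batched.shape)  # (2, 3, 5)

# dot combines every matrix in A with every matrix in B
dotted = np.dot(A, B)
print(dotted.shape)   # (2, 3, 2, 5)
```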


There is also the `multiply` function, which produces element-wise multiplication. Unlike the two functions above, it requires the arrays to have equal shapes, or shapes that NumPy can broadcast to a common shape.

In [17]:
print('multiply(a,a)=')
print(np.multiply(a,a))
print('multiply(b,b)=')
print(np.multiply(b,b))
print('multiply(a,b)=')
print(np.multiply(a,b))

multiply(a,a)=
[[1 4 9]
 [1 4 9]]
multiply(b,b)=
[[0]
 [1]
 [4]]
multiply(a,b)=


ValueError: operands could not be broadcast together with shapes (2,3) (3,1) 
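
The error arises because the shapes `(2, 3)` and `(3, 1)` cannot be aligned. Element-wise multiplication does work when the shapes differ only in dimensions of size one, which NumPy stretches automatically (broadcasting). A brief sketch, where the arrays `row` and `col` are made up for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 3]])  # shape (2, 3)
row = np.array([[10, 20, 30]])        # shape (1, 3), stretched over the rows of a
col = np.array([[2], [3]])            # shape (2, 1), stretched over the columns of a

scaled_cols = np.multiply(a, row)     # each column of a gets its own factor
scaled_rows = np.multiply(a, col)     # each row of a gets its own factor
print(scaled_cols)
print(scaled_rows)
```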

Additionally, PyTorch normally operates with batched matrix multiplication, where the first dimension is treated as the batch size and is, thus, left out of the multiplication itself.

In [27]:
import torch
# Create tensors and add one dimension to simulate batch size
a_tensor = torch.tensor(a).unsqueeze(0)
b_tensor = torch.tensor(b).unsqueeze(0)
ab_tensor = torch.bmm(a_tensor,b_tensor)
print("ab=")
print(ab_tensor)
print("ab_tensor.shape=")
print(ab_tensor.shape)

ab=
tensor([[[8],
         [8]]])
ab_tensor.shape=
torch.Size([1, 2, 1])



For additional reading, see the following resources:
- [What Are Dot Product and Matrix Multiplication?](https://mkang32.github.io/python/2020/08/23/dot-product.html)
- [What Should I Use for Dot Product and Matrix Multiplication?](https://mkang32.github.io/python/2020/08/30/numpy-matmul.html)

## Embedding vectors

The intuition behind embedding tokens to a vector is to represent each word or subword in a text as a numerical vector that captures its meaning and relationship to other words in the text. This is the basic idea behind word embeddings and subword embeddings, which are widely used in natural language processing (NLP) tasks.

The key intuition behind embeddings is that words or subwords that have similar meanings or are used in similar contexts should be represented by similar or nearby embedding vectors in the high-dimensional space. This allows the embedding vectors to capture semantic and syntactic relationships between words, such as synonyms, antonyms, and analogies. 

For example, the embedding vectors for "dog" and "cat" might be closer together than the embedding vectors for "dog" and "car", reflecting their semantic similarity. Let's demonstrate this with a code experiment. In the following code cell:
 - The embedding vectors are defined as a 2D array called ``embeddings``, where each row represents the embedding vector for a word. For example, the embedding vector for "cat" is ``[0.3, 0.2, -0.1]``, while the embedding vector for "dog" is ``[0.1, 0.4, 0.2]``, and so on. Vectors contain only three values so that they can be easily plotted in 3D.
 - By visualizing the embedding vectors in 3D space, we can gain insights into how similar or different words are in terms of their embeddings. Words with similar meanings or used in similar contexts should have embedding vectors that are closer together in the high-dimensional space.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define some example vocabulary and corresponding embedding vectors
vocab = ["cat", "dog", "car", "tree"]
embeddings = np.array(
    [[0.3, 0.2, -0.1], [0.1, 0.4, 0.2], [-0.2, 0.1, 0.4], [0.1, -0.1, 0.3]]
)


# Define a function to plot the embeddings in 3D space
def plot_embeddings_3d(vocab, embeddings):
    fig = plt.figure()
    ax = fig.add_subplot(
        111,
        projection="3d",
    )
    ax.scatter(embeddings[:, 0], embeddings[:, 1], embeddings[:, 2])
    for i, word in enumerate(vocab):
        ax.text(embeddings[i, 0], embeddings[i, 1], embeddings[i, 2], word)
    plt.show()


# Visualize the embeddings in 3D space
plot_embeddings_3d(vocab, embeddings)
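
Besides plotting, closeness of embedding vectors can be measured numerically. A common choice is cosine similarity, sketched below for the same toy vectors. The helper `cosine_similarity` is written here for illustration, and the embedding values are hand-picked, so the numbers only illustrate the idea and say nothing about real word embeddings.

```python
import numpy as np

vocab = ["cat", "dog", "car", "tree"]
embeddings = np.array(
    [[0.3, 0.2, -0.1], [0.1, 0.4, 0.2], [-0.2, 0.1, 0.4], [0.1, -0.1, 0.3]]
)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


index = {word: i for i, word in enumerate(vocab)}
sim_dog_cat = cosine_similarity(embeddings[index["dog"]], embeddings[index["cat"]])
sim_dog_car = cosine_similarity(embeddings[index["dog"]], embeddings[index["car"]])
print(f"dog-cat: {sim_dog_cat:.3f}")  # dog-cat: 0.525
print(f"dog-car: {sim_dog_car:.3f}")  # dog-car: 0.476
```

With these toy values "dog" is slightly more similar to "cat" than to "car".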

Embeddings can be learned from scratch on a specific task or dataset, or they can be pre-trained on large amounts of unlabeled data and then fine-tuned on smaller labeled datasets for specific tasks. Pre-trained embeddings, such as those used in BERT, have been shown to be highly effective for a wide range of NLP tasks and are widely used in the field. 

Most importantly, embeddings make it possible to give text to a neural network in numerical form.

### Positional encoding

In this part, we will introduce the concept of positional encoding as it is defined in the original Transformer paper [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf).

The intuition behind positional encoding is to inject some notion of word order or position into the embedding vectors, which is necessary for sequence-to-sequence models like transformers that process variable-length input sequences.

In natural language processing (NLP), word order is critical to understanding the meaning of a sentence or paragraph. However, traditional word embeddings like word2vec or GloVe do not capture any information about word order or position in the input sequence. This means that a traditional embedding alone would not be sufficient for a model like a transformer to fully understand a given text.

Positional encoding addresses this limitation by adding an additional vector to the word embeddings that encodes the position of each word in the sequence. This vector is added to the input embeddings before they are passed through the transformer layers, allowing the model to better understand the relationships between words based on their position in the sequence.

The formula for computing the positional encoding for a word at position $pos$ is defined separately for the even and odd dimensions of the encoding vector:

$$\begin{aligned}
PE_{pos,2i} & = \text{sin} \left( \frac{pos}{10000^{2i/d_{embed}}} \right) \\
PE_{pos,2i+1} & = \text{cos} \left( \frac{pos}{10000^{2i/d_{embed}}} \right) \\
\end{aligned}$$

Here, $pos$ is the position of the word in the sequence, $i$ indexes the sine-cosine pairs so that $2i$ is an even and $2i+1$ an odd dimension of the positional encoding vector, and $d_{embed}$ is the dimension of the input embeddings. The value 10000 is a constant chosen in the original paper that sets the range of wavelengths of the sine and cosine functions.

Let's use an example. Let's assume that the word ``hello`` maps to an embedding vector ``[0.1, 0.2]``. Below is an illustration of how applying positional encoding to the vector in different positions changes the original embedding values. In other words, the sentence we're using is

        ["hello", "hello", "hello", "hello"]

or

        [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2], [0.1, 0.2]]

The code example provides a hands-on demonstration of how positional encoding can capture the position-related information in the embedding vectors, enabling transformer models to better understand the relationships between words based on their positions in a sequence. In the example:
 - We define a function called ``positional_encoding`` that takes a position ``pos`` and the dimension of the embedding vector ``embedding_dim`` as input. It calculates the positional encoding vector for the given position using a sine and cosine function based on the provided formulas.
 - Using an example embedding vector ``[0.1, 0.2]`` we create an array of embeddings where each row represents the same embedding vector. Then we iterate over each position in the sequence and calculate the corresponding positional encoding using the ``positional_encoding`` function. The resulting positional encodings are stored in the ``pos_encodings`` matrix.
 - Finally, we visualize the positional embeddings in a scatter plot. Each embedding vector is represented as a point in the plot, and the positions are indicated by different colors and labels.

In [None]:
import numpy as np
import matplotlib.pyplot as plt


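def positional_encoding(pos, embedding_dim):
    # Assumes an even embedding_dim so that dimensions pair up as (sin, cos)
    pe = np.zeros((embedding_dim,))
    for i in range(0, embedding_dim, 2):
        # Both dimensions of a pair share the same frequency, as in the formula
        angle = pos / (10000 ** (i / embedding_dim))
        pe[i] = np.sin(angle)
        pe[i + 1] = np.cos(angle)
    return pe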


embedding = [0.1, 0.2]
embeddings = np.array([embedding, embedding, embedding, embedding])

pos_encodings = np.zeros((len(embeddings), len(embedding)))
for pos in range(len(embeddings)):
    pos_encodings[pos] = positional_encoding(pos, len(embedding))

pos_embed = embeddings + pos_encodings

fig, ax = plt.subplots(figsize=(8, 3))
for i in range(len(embeddings)):
    ax.scatter(pos_embed[i, 0], pos_embed[i, 1], label=f"Position {i}")
ax.set_title("Positional Embedding:\n Same token, different positions  ")
ax.legend()
plt.show()

By observing the scatter plot, we can see how the positional encodings modify the original embedding values for the same token. As the position of the token changes, the resulting embedding vectors also change, reflecting the influence of position on the embedding representation.
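
For longer sequences and larger embeddings, the encodings for all positions can be computed at once. The following is a minimal vectorized sketch of the sinusoidal formula from the paper. The function name `positional_encoding_matrix` and the sizes used in the example are assumptions made for illustration, and the sketch only handles an even number of embedding dimensions.

```python
import numpy as np


def positional_encoding_matrix(seq_len, d_embed):
    """Return a (seq_len, d_embed) array of sinusoidal positional encodings."""
    if d_embed % 2 != 0:
        raise ValueError("d_embed must be even in this sketch")
    positions = np.arange(seq_len)[:, np.newaxis]          # shape (seq_len, 1)
    even_dims = np.arange(0, d_embed, 2)[np.newaxis, :]    # the values 2i
    angles = positions / (10000 ** (even_dims / d_embed))  # shape (seq_len, d_embed / 2)
    pe = np.zeros((seq_len, d_embed))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe


pe = positional_encoding_matrix(seq_len=4, d_embed=6)
print(pe.shape)            # (4, 6)
print(np.round(pe[0], 3))  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each sine-cosine pair uses a different wavelength, so higher dimensions change more slowly from one position to the next.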



For additional reading, please see the following resources:
- [The Annotated Transformer: Positional Encoding](http://nlp.seas.harvard.edu/annotated-transformer/#positional-encoding)

## Self-attention


The self-attention mechanism is like a teacher who helps you focus on what's important in a classroom. In the same way, the self-attention mechanism helps the transformer focus on which parts of the input it should be paying attention to, when processing information. This is achieved by allowing the model to weigh the importance of different parts of the input, giving more emphasis to the parts that are more relevant to the output.

The weighing of inputs is done by computing **attention scores** between input elements (e.g. words in a sentence) and then using these scores to compute a **weighted sum of the input elements**. The attention **scores are computed using a learned function (_neural network_)** that takes as input both the current input element and a "query" vector that represents the current state of the model. The resulting **weighted sum is then used as input** to the next layer of the transformer. 

Below is an illustration of how an input sequence $x$ is weighted when fed to the first layer $TF_0$ of a transformer $TF$.

![image](../../diagrams/attention.png)

The self-attention mechanism in a transformer-based architecture allows the model to analyze all tokens in an input sequence simultaneously, unlike recurrent neural networks (RNNs) that process tokens sequentially. In RNNs, the model passes information from one token to the next through a hidden state, making it difficult to parallelize computation across the sequence. 

In contrast, transformers apply a self-attention mechanism that allows the model to look at all tokens in the sequence at once, and to dynamically weigh the importance of each token based on its relationship to the other tokens in the sequence. This parallelization across the sequence enables transformer models to process longer sequences more efficiently than RNNs.

Next we will go through the following topics:
 - __Basic self-attention__: The building block of attention mechanisms
 - __Scaling the dot product__: Ensuring that the computations are stable
 - __Queries, Keys and Values__: From non-parameterized self-attention to learned self-attention
 - __Multi-head attention__: Learning multiple ways to determine important parts in input sequences

### Basic self-attention

The basic self-attention serves as a building block for creating a complete attention mechanism in a transformer model. The basic self-attention operates on sequence level by weighing tokens based on the context of the sequence only (i.e. no learning).

> Note: In the original transformer model, the attention mechanism enables the model to focus on different parts of the input sequence and capture the dependencies or connections between tokens effectively. The attention mechanism is used in two key components of the transformer: the encoder and the decoder. The basic self-attention is only a part of the whole attention mechanism operating on sequences.

Let's try to understand this fundamental building block by implementing the basic self-attention operation in Python. The first thing we should do is work out how to express the self-attention in matrix multiplications.

There are no parameters in basic self-attention (yet). What the basic self-attention actually does is entirely determined by whatever mechanism creates the input sequence. Upstream mechanisms, like an embedding layer, drive the self-attention by learning representations with particular dot products (although we’ll add a few parameters later).

For an input sequence $x$, basic self-attention for each token $x_i$ can be defined as

$$\begin{aligned}
w'_{ij} & = x_i^Tx_j \\
w_{ij} & = \text{softmax}(w'_{ij}) \\
y_i & = \sum_jw_{ij}x_j \\
\end{aligned}$$

where $w'_{ij}$ are the raw weights, $w_{ij}$ the normalized weights (the softmax is taken over $j$, so the weights for each token $x_i$ sum to one) and $y_i$ the output of the self-attention mechanism. For a sequence of $n$ tokens the weights form an $n \times n$ matrix, regardless of the number of embedding dimensions in each token.

Using the above definitions, the process of performing basic self-attention on an input is depicted below. Let's define $x$ as an input with two tokens, each being a four-dimensional embedding vector. First the raw weights $w'_{ij}$ are calculated:


![image](../../diagrams/basic-self-attention-raw-weight.png)

In [23]:
import numpy as np
x = np.array([[1,2,1,2],[3,4,3,4]])
w = np.dot(x,x.T)
w

array([[10, 22],
       [22, 50]])

Next, the raw weights are scaled:

![image](../../diagrams/basic-self-attention-weight.png)

In [25]:
from scipy.special import softmax
# Normalize each row separately so that the weights of every token sum to one
w = softmax(w, axis=-1)
w

array([[6.14417460e-06, 9.99993856e-01],
       [6.91440011e-13, 1.00000000e+00]])

Lastly, the input is weighted, producing the self-attention output:

![image](../../diagrams/basic-self-attention-output.png)

In [26]:
y = np.dot(w,x)
y

array([[2.99998771, 3.99998771, 2.99998771, 3.99998771],
       [3.        , 4.        , 3.        , 4.        ]])

From this we can see that the basic self-attention would diminish the importance of the first token, leaving the second token to contribute to later stages of the computations. 
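
The same three steps can also be written in PyTorch using batched matrix multiplication. This is a minimal sketch rather than a reference implementation: it adds a batch dimension of size one to the same input `x` and applies the softmax row-wise, so that the weights of each token sum to one.

```python
import torch
import torch.nn.functional as F

# Input with shape (batch, tokens, embedding dimensions) = (1, 2, 4)
x = torch.tensor([[1.0, 2.0, 1.0, 2.0], [3.0, 4.0, 3.0, 4.0]]).unsqueeze(0)

# Raw weights: dot product of every token with every token, shape (1, 2, 2)
raw_weights = torch.bmm(x, x.transpose(1, 2))

# Normalize over the last dimension so each row of weights sums to one
weights = F.softmax(raw_weights, dim=2)

# Output: weighted sum of the input tokens, shape (1, 2, 4)
y = torch.bmm(weights, x)
print(weights)
print(y)
```

Since the dot products involving the second token are much larger, both output rows end up very close to the second token.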



For additional reading, see the following resources:
 - [Basics of Self-Attention](https://towardsdatascience.com/self-attention-5b95ea164f61)
 - [Transformers from scratch: Self attention](https://peterbloem.nl/blog/transformers#self-attention)

### Queries, keys and values

The query, key, and value-based attention mechanism is an extension of the basic self-attention mechanism. Basic self-attention uses each token's embedding as such and has no learnable parameters, whereas the query, key, and value-based attention mechanism gives each token three distinct roles and applies a separate learned linear transformation for each role.

In a transformer model, the query, key, and value are components of the self-attention mechanism used to compute attention scores between input elements. Overall, the query, key, and value enable the transformer to dynamically focus its attention on the most relevant parts of the input sequence, allowing for more effective processing and prediction. 

#### Query: 
- The query vector represents the current token and is used to identify the parts of the input sequence that are most relevant to the current task. It captures the information necessary for the model to make predictions or generate output. 
- An input token vector is compared to every other vector to establish the weights for its own output. Query is used to identify the parts of the input sequence that are most relevant to the current task.

#### Key: 
- The key vector represents other tokens in the input sequence and is used to "answer" the query by computing a similarity score between the query and each key vector. The key vectors encode the information about the relationships between tokens in the sequence. 
- An input token vector is compared to every other vector to establish the weights for the output of those vectors. Keys are used to "answer" the query by computing a similarity score between the query and each key vector.

#### Value: 
- The value vector provides additional information associated with each token and is used as part of the weighted sum to compute the output of the attention mechanism. The value vectors capture the content or context of each token. 
- An input token vector is used as part of the weighted sum to compute each output vector once the weights have been established, i.e., to compute the output of the attention mechanism. Specifically, the attention scores between the query and key are used to weight the value vectors, and then a weighted sum of the values is computed to produce the output.


#### Attention mechanism

The linear transformation for the query, key and value self-attention mechanism for each token $x_i$ can be defined as

$$\begin{aligned}
q_i = W_qx_i \enskip , \quad k_i & = W_kx_i \enskip , \quad v_i = W_vx_i \\
w'_{ij} & = q_i^Tk_j \\
w_{ij} & = \text{softmax}(w'_{ij}) \\
y_i & = \sum_jw_{ij}v_j \\
\end{aligned}$$

where $q_i$, $k_i$ and $v_i$ are the query, key and value vectors of token $x_i$. $W_q$, $W_k$ and $W_v$ are $d_k \times d_k$ weight matrices, where $d_k$ is the number of embedding dimensions. The $w'_{ij}$ are the raw weights, $w_{ij}$ the normalized weights (the softmax is taken over $j$) and $y_i$ the output of the self-attention mechanism.

Let's then work through an example of the self-attention with queries, keys and values. We stick to just figuring out the process and leave any architectural definitions of layers and such for later. Let's again define $x$ as an input with two tokens, each being a four-dimensional embedding vector.

In [114]:
import numpy as np
x = np.array([[1,2,1,2],[3,4,3,4]])
print({"x.shape": x.shape})
x

{'x.shape': (2, 4)}


array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

Let's then define the weight matrix for the queries and calculate the query matrix.

In [115]:
d_k = x.shape[-1]
W_q = np.random.random((d_k,d_k))
q = np.dot(W_q, x.T)
print({"q.shape": q.shape})
q


{'q.shape': (4, 2)}


array([[2.23068269, 6.23853766],
       [0.5516514 , 1.39332151],
       [3.39797841, 7.28216267],
       [1.14603798, 2.95227015]])

We will do the same with the keys.

In [116]:
W_k = np.random.random((d_k,d_k))
k = np.dot(W_k, x.T)
print({"k.shape": k.shape})
k


{'k.shape': (4, 2)}


array([[ 3.41049675,  7.65870809],
       [ 3.21822546,  6.78692766],
       [ 1.74981191,  4.13631379],
       [ 4.70445857, 10.59898094]])

And with values.

In [117]:
W_v = np.random.random((d_k,d_k))
v = np.dot(W_v, x.T)
print({"v.shape": v.shape})
v


{'v.shape': (4, 2)}


array([[3.4142163 , 7.5882181 ],
       [3.02124941, 6.80024817],
       [3.14950406, 7.31213798],
       [2.86978219, 6.73908348]])

Let's then calculate our weight matrix, applying softmax at the same time. Because the columns of `q` and `k` hold the tokens, multiplying `q.T` with `k` gives one raw weight for every pair of tokens. The softmax is applied row by row, so the weights of each token sum to one. The values are rounded to keep the output readable.

In [118]:
from scipy.special import softmax

w = np.dot(q.T, k)
w = softmax(w, axis=-1)
print({"w.shape": w.shape})
np.round(w, 3)


{'w.shape': (2, 2)}


array([[0., 1.],
       [0., 1.]])

As the last step, let's then produce our attention output. Since the columns of `v` hold the value vectors, we multiply the weights with `v.T` to get one output row per token. The output then has the same shape as the input, which is important for stacking multiple attention layers on top of each other, as every attention layer expects inputs in similar shape.

In [119]:
y = np.dot(w, v.T)
print({"y.shape": y.shape})
np.round(y, 3)

{'y.shape': (2, 4)}


array([[7.588, 6.8  , 7.312, 6.739],
       [7.588, 6.8  , 7.312, 6.739]])

With these random weights nearly all of the attention falls on the second token, so both output rows are practically equal to the value vector of the second token.

In this example we used only randomized weights. In a transformer model, these weights would be learned to best fit the training data.
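
The whole computation can also be collected into one function. The sketch below is one possible arrangement, not the notebook's reference implementation: the function names are made up, tokens are stored as rows (so `x @ W_q.T` gives one query per row), and the random generator is seeded only to make the run repeatable. The dot products are left unscaled here.

```python
import numpy as np


def softmax_rows(scores):
    """Numerically stable softmax applied to each row separately."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def qkv_self_attention(x, W_q, W_k, W_v):
    """Self-attention for x of shape (n_tokens, d_k), one row per token."""
    q = x @ W_q.T  # row i is W_q x_i
    k = x @ W_k.T  # row i is W_k x_i
    v = x @ W_v.T  # row i is W_v x_i
    weights = softmax_rows(q @ k.T)  # shape (n_tokens, n_tokens)
    return weights @ v, weights      # output has shape (n_tokens, d_k)


rng = np.random.default_rng(0)
x = np.array([[1.0, 2.0, 1.0, 2.0], [3.0, 4.0, 3.0, 4.0]])
d_k = x.shape[-1]
W_q, W_k, W_v = (rng.random((d_k, d_k)) for _ in range(3))

y, weights = qkv_self_attention(x, W_q, W_k, W_v)
print({"weights.shape": weights.shape, "y.shape": y.shape})
```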

For additional reading, see the following resources:
 - [The Annotated Transformer: Encoder and Decoder stacks](http://nlp.seas.harvard.edu/annotated-transformer/#encoder-and-decoder-stacks)
 - [Illustrated: Self-Attention](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a#570c)

### Scaling the dot product

The softmax function can be sensitive to very large input values. These kill the gradient, and slow down learning, or cause it to stop altogether. Since the typical magnitude of the dot product grows with the embedding dimension $d_k$, it helps to scale the dot product back a little to stop the inputs to the softmax function from growing too large. Using the previous definition of query, key and value based self-attention, the scaling of the dot products is applied before the softmax function:

$$\begin{aligned}
w'_{ij} & = q_i^Tk_j \\
\text{here} \to \quad w_{ij}^{'} & = {q_i^Tk_j \over \sqrt{d_k}} \\
w_{ij} & = \text{softmax}(w'_{ij}) \\
y_i & = \sum_jw_{ij}v_j \\
\end{aligned}$$

Citing from the original paper:
> We suspect that for large values of $d_k$ [the number of embedding dimensions] the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by ${1 \over \sqrt{d_k}}$
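
The effect of the scaling can be checked with a small numerical experiment. The sketch below assumes that the components of the query and key vectors are independent draws from a standard normal distribution, which is an assumption made for this illustration and not something the notebook establishes. Under that assumption the spread of the raw dot products grows roughly like $\sqrt{d_k}$, while the scaled dot products keep a spread close to one.

```python
import numpy as np

rng = np.random.default_rng(0)


def dot_product_std(d_k, n_samples=10_000):
    """Standard deviation of raw and scaled dot products of random vectors."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()


for d_k in (4, 64, 1024):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:5d}  raw std={raw_std:7.2f}  scaled std={scaled_std:5.2f}")
```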


For additional reading, see the following resources:
 - [Attention is all you need: 3.2.1 Scaled Dot-Product Attention](https://arxiv.org/pdf/1706.03762.pdf)
 - [The Annotated Transformer: Encoder and Decoder stacks](http://nlp.seas.harvard.edu/annotated-transformer/#encoder-and-decoder-stacks)
 - [Transformers from scratch: Additional tricks](https://peterbloem.nl/blog/transformers#additional-tricks)

### Multi-head attention

In a single query, key and value self-attention operation, all information about the sequence just gets summed together. This means there is just a single way the influence of tokens to other tokens gets modelled. 

We can give the self-attention greater power of discrimination by combining several self-attention mechanisms (which we'll index with $r$), each with different matrices $W_q^r$, $W_k^r$ and $W_v^r$. These are called __attention heads__. For input token $x_i$ each attention head produces a different output vector $y_i^r$. These are then concatenated and passed through a linear transformation to reduce the dimension back to $d_k$.

In practice, this means that the self-attention operation is simply copied as many times as there are heads. Each attention head gets its own separately initialized weight matrices. This helps the transformer model to learn multiple, different ways for the tokens to influence each other.

Below is an illustration of how multi-head attention enables the retrieval of different representations of the same input sequence.

![image](../../diagrams/multihead-attention.png)
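
As a rough preview, the idea can be sketched in NumPy. This is a simplified illustration under several assumptions: the function names, the number of heads and the output matrix `W_o` are made up for this sketch, all weights are random rather than learned, and every head works on the full $d_k$ dimensions (many implementations give each head a smaller dimension instead).

```python
import numpy as np


def softmax_rows(scores):
    """Numerically stable softmax applied to each row separately."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, heads, W_o):
    """x: (n_tokens, d_k); heads: list of (W_q, W_k, W_v); W_o: (d_k, n_heads * d_k)."""
    d_k = x.shape[-1]
    outputs = []
    for W_q, W_k, W_v in heads:
        q, k, v = x @ W_q.T, x @ W_k.T, x @ W_v.T
        # Scaled dot-product attention for this head
        weights = softmax_rows(q @ k.T / np.sqrt(d_k))
        outputs.append(weights @ v)
    # Concatenate the head outputs: shape (n_tokens, n_heads * d_k)
    concatenated = np.concatenate(outputs, axis=-1)
    # Linear transformation back to (n_tokens, d_k)
    return concatenated @ W_o.T


rng = np.random.default_rng(0)
x = np.array([[1.0, 2.0, 1.0, 2.0], [3.0, 4.0, 3.0, 4.0]])
d_k = x.shape[-1]
n_heads = 3
heads = [tuple(rng.random((d_k, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.random((d_k, n_heads * d_k))

y = multi_head_self_attention(x, heads, W_o)
print({"y.shape": y.shape})  # same shape as the input x
```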

This and the transformer model will be implemented in the next notebook.