# Transformers

<hr style="border:2px solid gray">

# Index <a id='index'></a>
1. [Introduction](#intro)
1. [Tokenization and transformer data structures](#tokenization)
1. [seq2seq problems and the introduction of attention](#attention)
1. [What are transformers?](#transformers)
1. [Transformers in PyTorch](#pytorch-transformers)
1. [Exercises](#exercises)


<hr style="border:2px solid gray">

# Introduction [^](#index) <a id='intro'></a>

So far, we have described different types of neural networks in terms of how they define locality:

* Fully connected neural networks are global, as each neuron connects to all neurons in the next layer

* Convolutional neural networks are very local, as pixels are convolved with adjacent pixels

* Graph neural networks use the graph structure to define locality

Now, we will consider something that was briefly mentioned in the GNN exercises: **attention mechanisms**. In terms of locality, we can interpret these mechanisms as the model *learning its own definition of locality*. We will discuss this in more detail shortly.
<br></br>

The main parts for this notebook will be as follows:

* A brief discussion of data structures relevant for transformers and the process of **tokenization**

* The historical introduction of attention mechanisms for so-called **sequence to sequence** (**seq2seq**) problems and some basic principles as to how they work

* One of the most influential developments in machine learning architectures, which relies on attention mechanisms: the **transformer**. This is the development that underpins many of the modern AI models, including large language models like ChatGPT, image generation models like DALL-E, and AlphaFold, a model that predicts the structure of proteins. 

* An overview of how we can use transformers practically, including building the architecture ourselves using PyTorch and how we can use pre-trained models with the HuggingFace `transformers` library



<hr style="border:2px solid gray">

# Tokenization and transformer data structures [^](#index) <a id='tokenization'></a>

You will likely have heard modern ML models described in terms of tokens. Tokenization is how we break up a sequence into discrete units (tokens) and convert it to a numerical vector, so that a model can learn to solve sequence problems. There are often many ways that we can tokenize a given datatype. For example, for sentences, we could have the following tokenization schemes:

<!--could add schematics here if there is time-->

* Word-level tokenization: each individual word is a token, and the numerical vector we construct is the index of each word in our dictionary

* Character-level tokenization: each individual *character* is a token, and each character is assigned a numerical value

* Subword-level tokenization: each token is not necessarily a whole word or an individual character, but instead may be a small part of a word. Each subword may have specific meanings, e.g. we could see prefixes such as "pre" or "post" as tokens separate from words they may otherwise be a part of.

There are of course other options for tokenization in language processing problems, but these are a few examples.
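
As a rough illustration of the three schemes above, the sketch below tokenizes a short sentence in plain Python. The vocabularies are built on the fly or written by hand (the subword vocabulary in particular is hypothetical), and the greedy longest-match rule is just one simple choice, not how any particular library works.

```python
sentence = "the cat is black"

# Word-level: split on whitespace and map each distinct word to an index.
words = sentence.split()
word_vocab = {w: i for i, w in enumerate(sorted(set(words)))}
word_ids = [word_vocab[w] for w in words]
print(word_ids)  # [3, 1, 2, 0]

# Character-level: every character (including spaces) is a token.
char_vocab = {c: i for i, c in enumerate(sorted(set(sentence)))}
char_ids = [char_vocab[c] for c in sentence]

# Subword-level: greedy longest-match against a small hand-written
# (hypothetical) vocabulary, falling back to single characters.
subword_vocab = ["pre", "post", "process", "ing", "train"]


def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first; a single character always matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


print(subword_tokenize("preprocessing", subword_vocab))  # ['pre', 'process', 'ing']
```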

Often after tokenization, a sequence must be projected to the right number of dimensions to match the model. This is often done using linear layers, and is referred to as finding embeddings of the sequence, similar to the node, edge, and graph embeddings we discussed for GNNs.
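
To make the link between token indices and linear layers concrete, here is a minimal NumPy sketch. It assumes a bias-free linear layer with a randomly initialised weight matrix `W` (in a real model this would be learned) and illustrative sizes. Applying the layer to one-hot encoded tokens gives the same result as looking up rows of `W` directly, which is why embeddings are often stored as a lookup table.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 4, 8          # illustrative sizes (assumptions)
token_ids = np.array([3, 1, 2, 0])  # e.g. word indices for a 4-word sentence

# Weight matrix of a bias-free linear layer mapping vocab_size -> d_model.
W = rng.normal(size=(vocab_size, d_model))

# Route 1: one-hot encode each token, then apply the linear layer.
one_hot = np.eye(vocab_size)[token_ids]   # shape (4, vocab_size)
embeddings_linear = one_hot @ W           # shape (4, d_model)

# Route 2: look up rows of W directly, as an embedding table would.
embeddings_lookup = W[token_ids]

print(np.allclose(embeddings_linear, embeddings_lookup))  # True
```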

## Data structures relevant for transformers



<hr style="border:2px solid gray">

# seq2seq problems and the introduction of attention [^](#index) <a id='attention'></a>

One of the first notably successful applications of attention mechanisms was for natural
language processing (NLP) - in particular, a family of techniques called **seq2seq** where
NLP problems are understood as a process from one sequence into another sequence, via some
intermediate representation that contains all the info necessary to reconstruct the output
sequence. We call this intermediate representation the **context vector**.

In fact, we can consider this process as using an encoder to find an embedding of the input
sequence and using a decoder to go from the embedding to the target sequence, where in this
case the embedding is the context vector. We can consider this for an English-to-French
translation problem: 

* The input sequence is a sentence in English, which is represented by a numerical vector
  which may e.g. just be indices of words in a dictionary

* The encoder converts the input sequence into the context vector

* The context vector is then passed to the decoder, which decodes the context vector to a
  different numerical vector where each value corresponds to a word in the target language,
  in this case French

This is illustrated in the schematic below.

<center>
<img src='seq2seq-schematic.png' width=1000></img>
</center>  
<div style="text-align:center;">
<div style='width:950px;display:inline-block;vertical-align:top;margin-top:10px;line-height:1.2;'>
<div style="text-align: justify;">

*A schematic illustrating a seq2seq problem, for English-to-French translation. The input English phrase "the cat is black" is first represented as a numerical vector and then transformed into a context vector by the encoder. The context vector is passed to the decoder, which transforms it to some new numerical vector corresponding to the French translation of the original sentence, "le chat est noir".*
</div></div></div>

<!-- How much detail on RNNs and specifically RNN encoder-decoder is needed? e.g. can talk about how hidden states in encoder just depend on input and previous hidden state, while decoder both updates a hidden state and an output *separately*,  -->

Historically, both the encoder and decoder were so-called **recurrent neural networks** (**RNNs**), which are designed to handle sequential data where the previous point in the sequence is relevant for the next point in the sequence. This can include:

*  NLP: sentences are sequences, as earlier words in the sentence are important for understanding the context of later words in the sentence (and indeed,
   vice-versa); we can also see this as we read sentences as a sequence, one word after another.
   
* Time series data: values at previous times influence values in the future

In general, an individual RNN takes both the current input and the previous output as inputs to the model, e.g. to find the output at a time $t = 1$ we give the model both the input for time $t = 1$ and the output for $t = 0$. 

You can read more about the general RNN encoder-decoder structure in [this paper](https://arxiv.org/abs/1406.1078).

However, there is a key problem with this approach, related to how much information can be conveyed by the context vector:

* For any size of context vector, it must be able to capture *all* the information contained in the input sequence

* If we want to handle longer input sequences, the amount of information that has to be captured by the context vector increases

* The context vector length is constant regardless of the length of input sequence, so it is difficult to be able to summarise enough information for long sequences without wasting resources for short sequences

* For longer sequences, RNNs have difficulty giving words earlier in the sequence the same weight as those later in the sequence, so we lose information from the start of a long sentence

This is referred to as the **bottleneck** problem, as the decoder only sees the context vector. 
<br></br>

In order to overcome this, we need some way to pass more information from the encoder to the decoder. What information can we get? The general RNN encoder operation goes as follows:

* For the 1st step of the input sequence, find some hidden state by passing through the encoder model, like an embedding in GNNs

* For the 2nd step of the input sequence, input both the 2nd step input and the 1st step hidden state to find the 2nd step hidden state

* Repeat this process until the end of the sequence is reached; the final hidden state is the context vector
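
The encoder steps above can be sketched with a simple tanh RNN in NumPy. The update rule, the weight names `W_x`, `W_h`, `b`, and the sizes are assumptions chosen for illustration; practical encoders typically use gated units such as GRUs or LSTMs.

```python
import numpy as np


def rnn_encode(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over `inputs` of shape (T, d_in).

    Returns all hidden states, shape (T, d_hidden), and the final
    hidden state, which plays the role of the context vector.
    """
    h = np.zeros(W_h.shape[0])
    hidden_states = []
    for x_t in inputs:
        # The new hidden state depends on the current input
        # and on the previous hidden state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hidden_states.append(h)
    hidden_states = np.stack(hidden_states)
    return hidden_states, hidden_states[-1]


rng = np.random.default_rng(0)
T, d_in, d_hidden = 4, 3, 5
inputs = rng.normal(size=(T, d_in))
W_x = rng.normal(size=(d_hidden, d_in))
W_h = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

hidden_states, context = rnn_encode(inputs, W_x, W_h, b)
print(hidden_states.shape, context.shape)  # (4, 5) (5,)
```

Note that the context vector has the same size however long the input sequence is, which is the origin of the bottleneck described above.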

Rather than passing only the context vector to the decoder (alongside the previous decoder hidden states), we can use the information from each step of the encoding, i.e. use all the encoder hidden states rather than just the last one. To do this, we use **attention mechanisms**.

## Attention mechanisms

When we say "attention mechanisms", what we actually mean is some method of determining the relative importance of different parts of the input, and then influencing the model to **attend** to important parts of the input and disregard unimportant parts. In the context of machine translation, we can describe this as working out what words in the input sequence are most relevant to each word in the output sequence. 

How does this actually work? In general, we need these parts:

* Some representation of the output we want to predict

* Some representation of our input to our model

* An **alignment model** that scores how well a given single input value matches a single output value
<br></br>

To understand how we will use this, let us consider the RNN encoder-decoder model. We want to calculate some new quantity we can use to improve the performance of the decoder, based on the encoder hidden states and the alignment scores between the encoder hidden states and our decoder outputs. For decoder sequence step $t$, we do the following steps:

* Calculate the alignment score between the decoder hidden state $t - 1$ and all encoder hidden states

* Find the softmax of the alignment scores to get attention weights for decoder sequence step $t$

* Take the weighted sum of the encoder hidden states using the attention weights

The output of the weighted sum is used as an input to the decoder for sequence step $t$, alongside the decoder hidden state and the target (or predicted) sequence entry from step $t - 1$. In this example, the alignment calculated is between encoder hidden states and decoder hidden states, to find what encoder hidden states are most relevant to each decoder hidden state. This is illustrated in the schematic below.

<center>
<img src='attn-schematic.png' width=900></img>
</center>
<div style="text-align:center;">
<div style='width:900px;display:inline-block;vertical-align:top'>
<div style="text-align: justify;">

*The previous machine translation task, now with added attention. For each decoder step, the alignment scores $\alpha$
between encoder states and the previous decoder hidden state are found, and the attention-weighted sum of encoder hidden 
states is calculated as the attention mechanism output. The decoder output is then determined by its previous value 
and the attention mechanism output.*
</div></div></div>
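
The three steps for a single decoder step can be sketched as follows. For simplicity the alignment model here is assumed to be a plain dot product between hidden states of equal size, and the hidden states themselves are random placeholders rather than the outputs of a trained model.

```python
import numpy as np


def softmax(x):
    """Softmax over a 1D array, shifted by the max for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_step(prev_decoder_hidden, encoder_hiddens):
    """One attention step for decoder step t.

    prev_decoder_hidden: shape (d,), decoder hidden state from step t - 1
    encoder_hiddens:     shape (T, d), one hidden state per input position
    """
    # 1. Alignment score between the decoder state and every encoder state
    #    (dot product chosen here as an assumption).
    scores = encoder_hiddens @ prev_decoder_hidden   # shape (T,)
    # 2. Softmax of the scores gives the attention weights.
    weights = softmax(scores)
    # 3. Weighted sum of the encoder hidden states.
    attn_output = weights @ encoder_hiddens          # shape (d,)
    return weights, attn_output


rng = np.random.default_rng(0)
encoder_hiddens = rng.normal(size=(4, 5))   # e.g. one state per input word
prev_decoder_hidden = rng.normal(size=5)

weights, attn_output = attention_step(prev_decoder_hidden, encoder_hiddens)
print(weights.shape, attn_output.shape)  # (4,) (5,)
```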

What function we use to calculate the alignment scores can have a significant effect on the performance of this approach.
A couple of the early examples included:

* [Bahdanau attention](https://arxiv.org/abs/1409.0473): use a single-layer neural network, with independent learnable 
weight matrices for the encoder and decoder hidden states, a tanh activation, and another learnable weight vector to 
project to a single value

* [Luong attention](https://arxiv.org/abs/1508.04025): take a weighted dot product between the encoder and decoder 
hidden states, where the weights are a learnable matrix

Of course, there are many other types of attention we might consider, and we will discuss an important one later.
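
As a hedged sketch of the two score functions described above, the NumPy code below scores a single pair of decoder state `s` and encoder state `h`. The parameter names (`W_s`, `W_h`, `v`, `W`) and sizes are chosen for illustration and the weights are random rather than learned; the Luong variant shown corresponds to the weighted ("general") dot product described above.

```python
import numpy as np


def bahdanau_score(s, h, W_s, W_h, v):
    """Additive score: v . tanh(W_s s + W_h h)."""
    return v @ np.tanh(W_s @ s + W_h @ h)


def luong_general_score(s, h, W):
    """Weighted dot product score: s^T W h."""
    return s @ W @ h


rng = np.random.default_rng(0)
d_dec, d_enc, d_attn = 4, 5, 6       # illustrative sizes (assumptions)
s = rng.normal(size=d_dec)           # decoder hidden state
h = rng.normal(size=d_enc)           # encoder hidden state

W_s = rng.normal(size=(d_attn, d_dec))
W_h = rng.normal(size=(d_attn, d_enc))
v = rng.normal(size=d_attn)
W = rng.normal(size=(d_dec, d_enc))

print(bahdanau_score(s, h, W_s, W_h, v))
print(luong_general_score(s, h, W))
```

Both functions return a single number per pair of states; the softmax over these scores for all encoder states then gives the attention weights as before.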


<div style="background-color: #FFF8C6">

To give another example, we will consider how we might use attention in a GNN; you can read about this in more detail in
the [Graph Attention Networks paper](https://arxiv.org/abs/1710.10903).

You briefly saw this concept last time, when we used the `GATConv` layer to try and improve the performance of our GNN on the Cora dataset.
In essence, this approach includes attention in the neighbourhood aggregation:

* Rather than using a simple aggregation procedure, we can instead take some weighted sum (or other weighted aggregation method) of the neighbours

* Weights are derived according to some attention mechanism between the node of interest and each of its neighbours

* Effectively, we learn which of the neighbours are most important for prediction at the target node

In the case of this model, to find the alignment score between two nodes the two node embeddings are transformed by a 
single weight matrix (as is common in other GNN layers), concatenated, and passed through a single-layer neural network
with a LeakyReLU activation function that maps the concatenated vector to a single value. Attention weights are then 
computed as the softmax of these alignment scores over all nodes in the neighbourhood.

This is illustrated in the schematic below. 



<div style="display: flex; justify-content: center; gap: 80px; align-items: flex-start;">
<div style="display: flex; flex-direction: column; align-items: flex-start; width: 350px; margin: 0;">
<img src='gat-attn-mech.png' width=239 style="align-self: center;"/>
<div style="margin-top: 10px; text-align: justify; max-width: 350px; font-style: italic; line-height: 1.2;">

<strong>Left</strong>: the attention mechanism between a node $i$ and the node $j$, which is a node in the
neighbourhood of $i$. Node features are transformed according to the same weight matrix, concatenated, and projected
by a learnable vector $\mathbf{a}$. A LeakyReLU activation is applied, and then the softmax over all
nodes in the neighbourhood is found to get the attention weights.
</div>
</div>
<div style="display: flex; flex-direction: column; align-items: flex-start; width: 504px; margin: 0;">
<img src='gat-node-pred.png' width=480 style="align-self: center;"/>
<div style="margin-top: 10px; text-align: justify; max-width: 504px; font-style: italic; line-height: 1.2;">

<strong>Right</strong>: finding the next layer node embedding using an attention mechanism. Attention weights $\alpha_{ij}$ are calculated between the node of interest $i$ and each of its neighbours $j$ (as well as itself). All the weighted node embeddings are then aggregated to produce the next embedding for the node of interest. 
</div>
</div>
</div>

<center>

*Schematics illustrating a graph attention mechanism.  Adapted from [[source](https://arxiv.org/abs/1710.10903)].*
</center>
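
A minimal NumPy sketch of this graph attention mechanism for a single node is given below. It uses random, untrained parameters and a single attention head, takes the LeakyReLU negative slope to be 0.2, and omits the non-linearity that is usually applied after aggregation, so it should be read as an illustration of the steps rather than a reproduction of `GATConv`.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)


def gat_attention(h_i, h_neighbours, W, a):
    """Attention of node i over its neighbourhood.

    h_i:          shape (d_in,), features of the node of interest
    h_neighbours: shape (n, d_in), features of the neighbourhood
                  (should include node i itself)
    W:            shape (d_out, d_in), shared weight matrix
    a:            shape (2 * d_out,), learnable attention vector
    """
    Wh_i = W @ h_i                 # (d_out,)
    Wh_j = h_neighbours @ W.T      # (n, d_out)
    # Concatenate [W h_i || W h_j] for every neighbour j.
    concat = np.concatenate([np.tile(Wh_i, (len(Wh_j), 1)), Wh_j], axis=1)
    scores = leaky_relu(concat @ a)               # (n,)
    # Softmax over the neighbourhood gives the attention weights.
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    # Attention-weighted sum of the transformed neighbour features.
    return weights, weights @ Wh_j


rng = np.random.default_rng(0)
d_in, d_out, n_neighbours = 4, 3, 5
h_i = rng.normal(size=d_in)
# Neighbourhood = node i itself plus its neighbours.
h_neighbours = np.vstack([h_i, rng.normal(size=(n_neighbours, d_in))])
W = rng.normal(size=(d_out, d_in))
a = rng.normal(size=2 * d_out)

weights, new_embedding = gat_attention(h_i, h_neighbours, W, a)
print(weights.shape, new_embedding.shape)  # (6,) (3,)
```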

In fact, the original paper shows that this change to a GNN architecture leads to a significant improvement in 
performance on both transductive and inductive benchmark tasks relative to previous state-of-the-art GNN models, including:

* State-of-the-art performance or better for the three major publication network benchmark datasets: Cora, Citeseer, and Pubmed

* A 20% improvement in performance for predicting protein-protein interactions based on graph representations of proteins

For more details on this, please see the [corresponding paper](https://arxiv.org/abs/1710.10903).

## Summary

In this section, we have discussed the attention mechanisms and their historical introduction, including:

* seq2seq problems including machine translation, and the difficulty of traditional RNN encoder-decoder methods

* The introduction of attention to machine translation

* Graph attention networks

In the next section, we will use our newfound understanding of attention mechanisms to discuss one of the most influential developments in machine learning in recent years: **transformers**.



<hr style="border:2px solid gray">

# What are transformers? [^](#index) <a id='transformers'></a>


First proposed in the 2017 paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al., 
the transformer architecture (and its components) is a greatly influential model that completely changed the approach 
to sequence-based ML tasks, and indeed has found great success in many different applications. 

As opposed to the complex RNN (or sometimes CNN) models often used for seq2seq tasks, which are often supplemented by 
attention mechanisms, the transformer architecture instead relies solely on attention mechanisms combined with regular linear layers, rather than any recurrence or convolutions. We will first introduce some of the language used in the original paper and then we will discuss the design of the transformer architecture.

## Queries, keys, and values

As a way to describe attention mechanisms, the terms queries, keys, and values were borrowed from database terminology and in fact were popularised to describe attention mechanisms by the original transformer paper. We can break this down as follows:

* Our training dataset consists of key-value pairs, e.g. a list of words with a value assigned to each word

* When we predict, we pass a query to the model to get a value

* For a given query, an attention mechanism returns a weighted combination of values, based on how well their corresponding keys match the query

In other words, we find the **alignment score** between the query and our set of keys, and return a weighted sum of the values weighted by the softmax of the **alignment scores**. This is of course the same type of mechanism we discussed in the previous section; we can think of what each of these are in our previous examples:

* RNN encoder-decoder with attention: we find the alignment between the encoder hidden states and the decoder hidden states to find a weighted sum of our encoder hidden states; therefore, both the keys and the values are the encoder hidden states, and the decoder hidden states are the queries

* Graph attention networks: the weight vector $\mathbf{a}$ and the LeakyReLU function return the alignment scores between the projected embeddings for the node of interest and each of the neighbouring nodes (and itself), and the weighted sum is of the projected node embeddings for each neighbouring node. We can identify the keys and values as the projected node embeddings for neighbouring (and self) nodes, and the query as the projected embedding for the node of interest.

## The transformer architecture

While a transformer is built of an encoder and a decoder like previous RNN models, there are three key things that set it apart from earlier seq2seq models (besides the use of linear layers in place of recurrent or convolutional layers). These are:

* The choice of attention function: **scaled dot-product attention**

* **Multi-head attention**

* **Self attention**

It is the combination of these things, and where they are used in the model, that enabled the jump in performance seen using this model. We will discuss each of these in turn.

## Scaled dot-product attention

We briefly mentioned two attention functions earlier (Bahdanau and Luong). Transformers instead use **scaled dot-product attention**, which is generally regarded as computationally very efficient while performing similarly to the best attention mechanisms in the literature. It is defined as follows:

* Pack a set of queries into a single matrix $Q$

* Similarly pack the keys and values into matrices $K$ and $V$

* We denote the dimension of the queries and keys as $d_k$, and the dimension of the values as $d_v$

* The attention output is then given according to

$$\text{Attention(}Q, K, V\text{)} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The normalising factor $\frac{1}{\sqrt{d_k}}$ is important as when we have large $d_k$ the dot product can reach very large values, and thus push the softmax output into regions with very small gradients. This can subsequently cause problems with training (similar to the exploding and vanishing gradient problems we have discussed before). This is discussed in a little more detail in the [original paper](https://arxiv.org/abs/1706.03762).
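
The formula above translates almost directly into NumPy. The sketch below assumes 2D inputs (one row per query, key, or value) and random data; the second half gives a rough numerical check of the scaling argument, using unit-variance random vectors as an assumption.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) alignment scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output has shape (n_q, d_v)


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 16))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 16) (3, 5)

# Why scale? For unit-variance entries, the spread of the raw dot product
# grows like sqrt(d_k); dividing by sqrt(d_k) brings it back to order 1.
d_k = 512
q = rng.normal(size=(2000, d_k))
k = rng.normal(size=(2000, d_k))
dots = (q * k).sum(axis=1)
print(dots.std(), (dots / np.sqrt(d_k)).std())
```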



## Multi-head attention

So far, we have considered a single attention mechanism in a model. However, what would happen if we had several? How would we go about it, and would it even be useful? 

In fact, the original transformers paper proposed the idea of **multi-head attention**, a way of using multiple attention functions in parallel to incorporate information from different representations simultaneously. In the original paper, this goes as follows:

* Typical attention mechanisms would use a single attention function for keys, values and queries with $d_\text{model}$ dimensions

* Instead, do $h$ independent linear projections of the queries, keys, and values, to $d_k$, $d_k$, and $d_v$ dimensions respectively - we refer to each of these sets of projections as an attention head

* For each attention head, apply the attention function in parallel to produce a $d_v$-dimensional output for each head

* Concatenate the outputs from all attention heads and finally project them to the desired number of dimensions $d_\text{model}$

We can then learn the parameters of the linear projections for each attention head, allowing us to use not just one representation of the keys, queries, and values, but as many as we may want to. This allows us to incorporate more information than we might otherwise, including different ways of looking at the information. 

In an NLP task, this could be thought of as considering different possible meanings of a single word and then seeing how important other words are to understanding the sentence for each possible meaning of that word. 



The multi-head attention mechanism used in the original transformers paper is illustrated in the figure below.

<center>
<img src='mha-schematic.png' width=600></img>
</center>
<div style="text-align:center;">


<div style="display: flex; justify-content: center; gap: 80px; align-items: flex-start;">
<div style="margin-top: 10px; text-align: justify; max-width: 700px; font-style: italic; line-height: 1.2;">

*The structure of multi-head attention used in "Attention is All You Need". For a number of heads $h$, the input queries, keys, and values are projected $h$ times and the scaled dot-product attention is calculated for each projection. The attention outputs are then concatenated and projected once more to the desired dimensionality, producing the multi-head attention output. Adapted from the [original paper](https://arxiv.org/abs/1706.03762).*
</div></div></div>

<div style="background-color:#FFCCCB">

If we want to write out a mathematical expression for multi-head attention, it goes as follows:

$$\text{MultiHead}(Q,\,K,\,V) = \text{Concat}(\text{head}_1,\,\dots,\,\text{head}_h)\,\mathbf{W}^O,$$
$$\text{where head}_i = \text{Attention}\left(Q\,\mathbf{W}^Q_i, \,\, K\,\mathbf{W}^K_i, \,\, V\,\mathbf{W}^V_i\right).$$

Individual symbols are defined as follows:

* $\mathbf{W}^Q_i$ denotes the queries linear projection parameter matrix for attention head $i$, which is a $d_\text{model} \times d_k$ matrix

* $\mathbf{W}^K_i$ denotes the keys linear projection parameter matrix for attention head $i$, which is a $d_\text{model} \times d_k$ matrix

* $\mathbf{W}^V_i$ denotes the values linear projection parameter matrix for attention head $i$, which is a $d_\text{model} \times d_v$ matrix

* $\text{Attention}$ denotes an arbitrary attention function of queries, keys, and values; in the case of transformers, this is the scaled dot-product attention described before

* $\text{Concat}$ denotes a concatenation into a single vector

* $\mathbf{W}^O$ denotes the linear projection matrix from the concatenated individual attention head outputs to the final model dimension, which is a $h d_v \times d_\text{model}$ matrix

Because each attention head is independent, any calculations across different heads can be parallelised. This can help greatly with the computational cost of training multi-head attention.
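
A minimal NumPy sketch of multi-head attention is given below. With one query, key, or value per row, each projection is applied as `Q @ W_Q[i]`, which matches the $d_\text{model} \times d_k$ shapes listed above. The weights are random rather than learned, the heads are computed in a Python loop rather than in parallel, and the sizes follow the choice $d_k = d_v = d_\text{model} / h$ used in the original paper.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v);
    W_O: (h * d_v, d_model)."""
    heads = [
        attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])   # (n_q, d_v) each
        for i in range(len(W_Q))
    ]
    # Concatenate along the feature axis, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_O


rng = np.random.default_rng(0)
d_model, h = 8, 2
d_k = d_v = d_model // h
n_q, n_k = 3, 5

Q = rng.normal(size=(n_q, d_model))
K = rng.normal(size=(n_k, d_model))
V = rng.normal(size=(n_k, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))

out = multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O)
print(out.shape)  # (3, 8)
```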


## Self attention

So far, we have discussed attention between the output sequence and the input sequence. However, there is no reason why we couldn't find attention between any two things - in fact, we can consider so-called **self attention**, where we find attention weights between each element in the input sequence and all other elements in the same sequence. 

This way, we can find what words in the input sequence are most important to understanding each word in the same sequence. 

This is illustrated in the schematic below.

<center>
<img src='self-attn-schematic.png' width=800>
</center>
<div style="text-align:center">
<div style="display: flex; justify-content: center; gap: 80px; align-items: flex-start;">
<div style="margin-top: 10px; text-align: justify; max-width: 700px; font-style: italic; line-height: 1.2;">

*Schematic illustrating self-attention, for a single query. The input sequence acts as the queries, keys, and values to calculate the output.*
</div></div></div>


In fact, we have seen self-attention already in the Graph Attention Networks example - each node attends to all of the nodes in its neighbourhood, and itself, to learn which nodes are most important for finding an embedding for a given node. 

While similar ideas had been introduced previously for RNN models, it was the transformers paper that introduced the highly-parallelisable self-attention that has been such a success for modern models.


## Applications of attention in the model

The transformer architecture uses attention in several ways. These include:

* Encoder-decoder attention: just like in the RNN attention models, allow the decoder output to attend to all positions in the input sequence; in the context of transformers, this is sometimes referred to as **cross-attention**

* Encoder self-attention: each position in an encoder layer output sequence attends to all positions in the previous encoder layer's output sequence (for the first encoder layer, this is the input sequence)

* Decoder self-attention: same principle as for the encoder, but an additional constraint is needed to ensure an element in the output sequence is only determined by the previous elements in the sequence, not later ones. This is done by masking keys that are not allowed for a given query, i.e. masking out elements that come later in the sequence than the query
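
The masking used in decoder self-attention can be sketched as below. We assume, following the original paper, that a position may attend to itself and to earlier positions but not to later ones; disallowed scores are set to $-\infty$ so that they receive zero weight after the softmax. Learned projections are omitted and the data are random.

```python
import numpy as np


def causal_mask(n):
    """Boolean (n, n) matrix: True where query i may attend to key j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))


def masked_attention(Q, K, V, mask):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Disallowed keys get -inf, i.e. zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))   # decoder inputs act as queries, keys and values

out, weights = masked_attention(x, x, x, causal_mask(n))
print(np.round(weights, 2))   # upper triangle (later positions) is all zeros
```

Because the first position can only attend to itself, its output is simply its own value vector.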



## Layer normalisation

<!--may need to make more bullet point-y-->

Previously, we have discussed how the weights in one layer of a neural network are strongly dependent on the outputs of the neurons in the previous layer, and how we can handle this dependency with methods such as batch normalisation. 

The transformer architecture uses a related but different normalisation approach, called **layer normalisation**. Rather than normalising activations across the batch, this approach normalises activations across the layer, i.e. based on the statistics *within the layer*. 

In comparison with batch normalisation, which computes correction factors that are common to all samples in the batch but different for each hidden neuron, layer normalisation computes correction factors that are common to all hidden neurons in the layer but different for each sample passed through the layer. This means layer normalisation can be applied regardless of batch size. 

This is particularly relevant for sequence problems, as we generally want to consider each collection of tokens separately from other ones, i.e. to work one sentence at a time. Batch normalisation would instead normalise over different sequences and separately for each token, resulting in issues for test sequences longer than the training sequences. Because layer normalisation normalises over all dimensions apart from the batch dimension, it works the same irrespective of sequence length. 

<div style="background-color:#FFCCCB">

To express layer normalisation mathematically, start by defining the following:

* $a^l$: the input to the $l$-th hidden layer of a deep feed-forward neural network, referred to as the **activation** of the previous layer
* $H$: the number of hidden units in a layer
* $\gamma^l$ and $\beta^l$: learnable parameters for layer $l$, with the same dimensions as the desired output shape

Now we can define the mean and standard deviation in hidden layer $l$ according to

$$\mu^l = \frac{1}{H}\sum_{i = 1}^H a^l_i,\qquad\quad \sigma^l = \sqrt{\frac{1}{H}\sum_{i = 1}^H(a^l_i - \mu^l)^2},$$

where $a^l_i$ denotes the $i$-th element of $a^l$, and $\mu^l$ and $\sigma^l$ denote the layer mean and standard deviation respectively.

Finally, we can write the layer normalisation output according to

$$\text{LayerNorm}(a^l) = \bar{a}^l = \frac{a^l - \mu^l}{\sigma^l}\cdot\gamma^l + \beta^l$$

Note that all multiplication is element-wise, such that $\bar{a}^l_i = \frac{a^l_i - \mu^l}{\sigma^l} \cdot \gamma^l_i + \beta^l_i$.
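
The expressions above can be written in a few lines of NumPy. The small constant `eps` added under the square root is an assumption for numerical safety and is not part of the formula above; $\gamma^l$ and $\beta^l$ are set to ones and zeros here rather than learned.

```python
import numpy as np


def layer_norm(a, gamma, beta, eps=1e-5):
    """Normalise each sample over its H hidden units (the last axis)."""
    mu = a.mean(axis=-1, keepdims=True)    # one mean per sample
    var = a.var(axis=-1, keepdims=True)    # one variance per sample (1/H form)
    return (a - mu) / np.sqrt(var + eps) * gamma + beta


rng = np.random.default_rng(0)
H = 6
a = rng.normal(loc=2.0, scale=3.0, size=(4, H))   # batch of 4 samples
gamma, beta = np.ones(H), np.zeros(H)

normed = layer_norm(a, gamma, beta)
print(normed.mean(axis=-1))   # approximately 0 for every sample
print(normed.std(axis=-1))    # approximately 1 for every sample

# The statistics are per sample, so a batch of size 1 gives the same result.
single = layer_norm(a[:1], gamma, beta)
print(np.allclose(single, normed[:1]))  # True
```

Batch normalisation would instead take the mean and variance over `axis=0`, so its result for a given sample would depend on which other samples are in the batch.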

## Transformer encoder and decoder

Both the transformer encoder and decoder are built from a stack of $N = 6$ identical layers, where the sub-layers within each layer differ slightly between the encoder and decoder. These are structured as follows:

**Encoder layer**: 
    
* Contains two sub-layers:
    1. multi-head self-attention
    1. feed forward (i.e. non-recurrent) neural network

* Sums output of each sub-layer with the input to that sub-layer i.e. includes a **residual connection** (like skip connections we saw in GNNs)

* Passes sum of input and sub-layer output to layer normalisation, such that the final sub-layer output is given as $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ denotes the function of the sub-layer itself

<center>
<img src='enc-layer-schematic.png' height=230>
</center>
<div style="text-align:center">
<div style="display: flex; justify-content: center; gap: 80px; align-items: flex-start;">
<div style="margin-top: 10px; text-align: justify; max-width: 700px; font-style: italic; line-height: 1.2;">

*Schematic of a single transformer encoder layer. This is built of two sub-layers: multi-head self-attention and a feed-forward neural network. Residual connections are included around each sub-layer and layer normalisation is applied after each sub-layer. Adapted from the [original paper](https://arxiv.org/abs/1706.03762).*
</div></div></div>
<!-- encoder layer schematic -->
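
To show how the pattern $\text{LayerNorm}(x + \text{Sublayer}(x))$ fits together, here is a heavily simplified NumPy sketch of one encoder layer. As stated assumptions, the self-attention uses a single head with no learned projections, the layer normalisation has no learnable scale or shift, the feed-forward network is two linear maps with a ReLU in between, and all weights are random.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def self_attention(x):
    """Single-head self-attention: x acts as queries, keys and values."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ x


def feed_forward(x, W1, b1, W2, b2):
    """Position-wise network: linear -> ReLU -> linear."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def encoder_layer(x, W1, b1, W2, b2):
    # Sub-layer 1: self-attention, with residual connection and layer norm.
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: feed-forward, with residual connection and layer norm.
    x = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
    return x


rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = encoder_layer(x, W1, b1, W2, b2)
print(out.shape)  # (5, 8)
```

Since the output has the same shape as the input, layers of this form can be stacked one after another.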

**Decoder layer**:

* Contains three sub-layers, two similar to the encoder layers:
    1. masked multi-head self-attention, to prevent earlier sequence entries attending to later sequence entries
    1. multi-head attention over encoder output, i.e. the encoder output as the keys and values and decoder self-attention output as the queries
    1. feed forward neural network

* Like the encoder layers, uses residual connections around each sub-layer followed by layer normalisation

<center>
<img src='dec-layer-schematic.png' height=250>
</center>
<div style="text-align:center">
<div style="display: flex; justify-content: center; gap: 80px; align-items: flex-start;">
<div style="margin-top: 10px; text-align: justify; max-width: 700px; font-style: italic; line-height: 1.2;">

*Schematic of a single transformer decoder layer. This is built of three sub-layers: masked multi-head self-attention, multi-head attention over the encoder outputs, and a feed-forward neural network. Residual connections are included around each sub-layer and layer normalisation is applied after each sub-layer. Adapted from the [original paper](https://arxiv.org/abs/1706.03762).*
</div></div></div>

## Positional encoding

Because the transformer contains no recurrence or convolution, and attention treats its inputs as an unordered set, the model will lose information about the order of the sequence unless we do something about it, i.e. manually add some information about relative or absolute positions of tokens in the sequence. This is done with **positional encodings**. 

Applied to the embeddings inputted to the encoder and decoder, the positional encoding has the same dimension as the embeddings so they can be summed. 

For the transformer model, each dimension of the positional encoding is a sin or cos of the token position, where the wavelength of the sinusoid is determined by the index of the dimension and the total number of dimensions. Because this is periodic, the model can learn relative positions of tokens in the sequence.

<div style="background-color:#FFCCCB">


Explicitly, the positional encodings from the original paper are given as

\begin{align*}
\text{PE}_{(\text{pos},\,i)} &= \sin\left(\text{pos}/10000^{i/d_{\text{model}}}\right)\qquad &i \text{ even}\\
\text{PE}_{(\text{pos},\,i)} &= \cos\left(\text{pos}/10000^{(i-1)/d_{\text{model}}}\right)\qquad &i \text{ odd}
\end{align*}
where $\text{pos}$ denotes the position of the token, $i$ denotes the dimension, and $d_{\text{model}}$ is the number of dimensions of the embeddings.

The reason for proposing this positional encoding is that, for any fixed offset $k$, $\text{PE}_{\text{pos} + k}$ is a linear function of $\text{PE}_{\text{pos}}$: for each even $i$, both $\text{PE}_{(\text{pos}+k,\,i)}$ and $\text{PE}_{(\text{pos}+k,\,i+1)}$ are linear combinations of $\text{PE}_{(\text{pos},\,i)}$ and $\text{PE}_{(\text{pos},\,i+1)}$, with coefficients that depend only on $k$. This follows from the angle-addition formulas for sine and cosine, and can allow the model to learn relative positions of tokens in the sequence.

</div>
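As a numerical check of the formulas above, the following sketch computes the sinusoidal encoding in plain Python and verifies that the encoding at position $\text{pos}+k$ can be obtained from the encoding at $\text{pos}$ by a rotation of each (sin, cos) pair that depends only on $k$. The function names `positional_encoding` and `shift`, and the values of `d_model`, `pos` and `k`, are assumptions chosen for illustration; `d_model` is assumed to be even.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position, following the formulas above."""
    pe = []
    for i in range(d_model):
        if i % 2 == 0:
            pe.append(math.sin(pos / 10000 ** (i / d_model)))
        else:
            pe.append(math.cos(pos / 10000 ** ((i - 1) / d_model)))
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) using only k: a rotation of each (sin, cos) pair."""
    out = []
    for i in range(0, d_model, 2):
        omega = 1.0 / 10000 ** (i / d_model)  # angular frequency of this pair
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(omega * k) + c * math.sin(omega * k))  # sin(w (pos + k))
        out.append(c * math.cos(omega * k) - s * math.sin(omega * k))  # cos(w (pos + k))
    return out

d_model, pos, k = 8, 5, 3
direct = positional_encoding(pos + k, d_model)
shifted = shift(positional_encoding(pos, d_model), k, d_model)
max_diff = max(abs(a - b) for a, b in zip(direct, shifted))
print(max_diff)  # agreement up to floating-point rounding
```

Note that `shift` never uses `pos` itself, which is exactly the linearity property described above.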


## Putting it all together

Finally, we can look at the architecture diagram from the original paper:

<center>
<img src="transformer-architecture.png" width=600></img>
</center>
<div style="text-align:center">
<div style="display: flex; justify-content: center; gap: 80px; align-items: flex-start;">
<div style="margin-top: 10px; text-align: justify; max-width: 580px; font-style: italic; line-height: 1.2;">

*The architecture of the original transformer model, taken from the [original paper](https://arxiv.org/abs/1706.03762).*
</div></div></div>

Let's break down how each part of this model works. Starting with the encoder:

* Each token of the input sequence is converted to a numerical vector via learned embeddings

* Position encodings are added to the input embeddings

* The embeddings are passed through $N = 6$ encoder layers to produce the encoder outputs; each layer consists of multi-head self-attention and a feed-forward neural network, each wrapped in a residual connection and followed by layer normalisation

Note that unlike in RNNs, the entire input sequence is processed in one go.
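The "residual connection followed by layer normalisation" pattern that wraps every sub-layer can be sketched for a single token vector as follows. This is a simplified illustration in plain Python: the learned scale and shift parameters of layer normalisation are omitted, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance across its features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalisation: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer standing in for attention or the feed-forward network.
def toy_sublayer(x):
    return [2.0 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
mean_out = sum(out) / len(out)
var_out = sum((v - mean_out) ** 2 for v in out) / len(out)
print(mean_out, var_out)  # approximately 0 and 1
```

The normalisation acts on each token independently, which is why it can be applied to the whole sequence in one go.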


<br></br>
For the decoder:

* Each token of the output sequence (more on this in a moment) is converted to a numerical vector via learned embeddings

* Position encodings are added to these embeddings

* Output sequence embeddings are passed through $N = 6$ decoder layers which contain the following operations, all followed by adding the residuals and applying layer normalisation:

    * Masked multi-head self-attention, so each sequence element only attends to earlier sequence elements

    * Multi-head attention with the encoder outputs as keys and values, and the output of the masked self-attention sub-layer as the queries (i.e. the cross-attention)

    * Feed-forward neural network, with the same structure as in the encoder layers but its own learned weights

* After $N$ decoder layers, the output is passed through a final learned linear transformation and a softmax layer to produce a weight for every token in the vocabulary. These weights are non-negative and sum to one, so they are often interpreted as probabilities for the next token (although they are not necessarily well-calibrated probabilities)
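The final step of the decoder, a linear transformation followed by a softmax, can be sketched as below. The three-word vocabulary, the hidden vector, and the weight matrix are all made-up values for illustration; in a real model the weights are learned and the vocabulary is far larger.

```python
import math

def linear(x, weight, bias):
    """Project a hidden vector to one score (logit) per vocabulary entry."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    """Convert logits to non-negative weights that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary, decoder output and projection weights.
vocab = ["cat", "dog", "fish"]
hidden = [1.0, -1.0]
weight = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = linear(hidden, weight, bias)
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # pick the highest-weighted token
print(logits, [round(p, 3) for p in probs], next_token)
```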

<br></br>

**Note**: the operation of the decoder is slightly different during training and prediction, as follows:

* During training, the decoder input is the true output sequence with the so-called **start-of-sequence token** (sometimes referred to as `<SOS>`) prepended, i.e. the target sequence shifted right by one position (a technique often called *teacher forcing*)

    * The whole target sequence is passed through the decoder and the loss is computed between the prediction and the true sequence, often with loss functions like cross entropy

* During prediction, the initial decoder input contains only a start-of-sequence token

    * This initial sequence is passed through the decoder, and the token with the maximum value in the softmax output is assigned to be the next sequence entry $\hat{y}_1$

    * The updated sequence, containing the start-of-sequence token and our first predicted entry $\hat{y}_1$, is passed into the decoder to predict the next entry $\hat{y}_2$

    * This procedure is repeated until predictions have been made for every entry in the output sequence, which in practice means until an end-of-sequence token is predicted or a maximum length is reached
<br></br>

In training, the whole target sequence is used at once, so any mistakes made by the model in early sequence entries do not cause errors in later sequence entries. In contrast, during prediction we don't know what the output sequence should be, so we must feed the model's own output back in as input to predict later entries in the output sequence.
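The difference between the two modes can be sketched with a toy example. Everything here is hypothetical: `toy_decoder` is a hand-written stand-in for a trained model (it simply "translates" by reversing the input), and the end-of-sequence token `<EOS>` is an assumption of this sketch used to decide when to stop generating.

```python
SOS, EOS = "<SOS>", "<EOS>"

def toy_decoder(encoder_input, decoder_input):
    """Hypothetical stand-in for a trained decoder: returns next-token scores.

    It 'translates' by copying the encoder input in reverse, one token per step.
    """
    target = list(reversed(encoder_input)) + [EOS]
    step = len(decoder_input) - 1  # tokens generated so far (excluding <SOS>)
    next_token = target[min(step, len(target) - 1)]
    vocab = sorted(set(encoder_input)) + [EOS]
    return {tok: (1.0 if tok == next_token else 0.0) for tok in vocab}

def teacher_forcing_pair(target):
    """Training: the decoder input is the true target shifted right by one position."""
    decoder_input = [SOS] + target
    labels = target + [EOS]  # what the model should predict at each position
    return decoder_input, labels

def greedy_decode(encoder_input, max_len=10):
    """Prediction: feed the model's own outputs back in, one token at a time."""
    output = [SOS]
    for _ in range(max_len):
        scores = toy_decoder(encoder_input, output)
        next_token = max(scores, key=scores.get)  # token with the highest score
        if next_token == EOS:
            break
        output.append(next_token)
    return output[1:]  # drop the start-of-sequence token

dec_in, labels = teacher_forcing_pair(["c", "b", "a"])
prediction = greedy_decode(["a", "b", "c"])
print(dec_in, labels, prediction)
```

In training, `dec_in` and `labels` are available up front, so all positions can be processed in a single pass through the decoder (with the causal mask preventing any position from seeing its own label). In prediction, each call to the decoder depends on the tokens produced by the previous calls.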

<!--Could ideally use some schematic here? Maybe illustrating the building of the output sequence in training vs prediction-->

## Summary

In this section, we have covered the transformer architecture, including:

* design of the encoder and decoder, and the applications of attention

* layer normalisation

* positional encodings

* operation of the model during training and prediction

In the next section, we will discuss more practical elements of transformers including how to implement them in PyTorch, and considerations that are necessary during training.

<hr style="border:2px solid gray">

# Transformers in PyTorch <a id='pytorch-transformers'></a>

* Overview: manual implementation step-through

* Discussion of training practicalities

* Mention Transformer block but also ref to PyTorch article on building your own; step through building e.g. an encoder layer

* Get students to write encoder block using the demonstrated encoder layer?

* Get students to write decoder block based on encoder block? Slightly more complex but probably worthwhile?

* Transformer warmup: start learning rate at 0, gradually increase to desired value over first few iterations
    * Adam uses bias correction factor --> increases variance in adaptive learning rate early
    * Layer normalisation iteratively applied can create very high gradients initially (can solve with e.g. pre-layer normalisation, or other normalisation techniques)
    * Popular learning rate scheduler: cosine warm-up, i.e. linear warm-up followed by cosine decay

* HuggingFace & the HuggingFace transformers library (in brief)
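As a rough sketch of the warm-up schedule mentioned in the notes above, the function below increases the learning rate linearly from 0 over the first few iterations and then decays it with a cosine curve. The function name `warmup_cosine_lr` and all numerical settings are assumptions chosen for illustration, not recommended values.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warm-up from 0, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at zero if training runs past total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings chosen only for illustration.
schedule = [warmup_cosine_lr(s, base_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
print(schedule[0], schedule[50], schedule[100], schedule[1000])
```

In PyTorch, a variant of this function that returns a multiplicative factor (rather than the absolute learning rate) could be passed to `torch.optim.lr_scheduler.LambdaLR`.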

<hr style="border:2px solid gray">

# Exercises <a id='exercises'></a>

* Guided fill-in code blocks (could weave in previous section?):
    * multi-head attention
    * encoder layer
    * decoder layer
    * whole transformer model
<br></br>

* list reversing example from UVA? bit trivial but also maybe worthwhile
<br></br>

* example from Lauri - definitely a nice simple example, even simpler than standard transformer architecture:

    * Suggested target classes: check Lauri suggestion, thought he mentioned top/quark as an easy-ish option for a pair of classes but not 100% sure
    
    * Embed -> MHA -> FFN -> MHA -> FFN -> mean pool

    * Need to experiment with run-times

    * Can either distribute data on BB or via the Zenodo link