# Transformers overview

- What do they do?
- How do they work?
- How can they be applied?

Illustration credits: http://jalammar.github.io/illustrated-transformer/

## Introduction

<img src="figs/transformers-overview/attention-is-all-you-need-arxiv.png" width="50%"></img>

- At the time, people would use RNNs or CNNs together with attention mechanisms for complex sequence modeling.
- The attention mechanism provided shortcuts for dependencies between token positions that are far apart.
- With the Transformer architecture, the recurrence was dropped in favor of using only attention.
- Without the recurrence, the Transformer architecture allows for a higher degree of parallelization since it's not limited by the sequential computation of RNNs.

## Overview
- The "Attention Is All You Need" paper presents the Transformer architecture / layer in a sequence-to-sequence context (neural machine translation) with an encoder/decoder setup.
- We will go through the parts of this setup first.
- It is also possible to use just the decoder (GPT-{1, 2, 3}, et al.) or just the encoder (BERT, et al.).
- The Transformer building block is pretty versatile.

## Overview
<img src="figs/transformers-overview/transformers-1.png"></img>

## Overview
<img src="figs/transformers-overview/transformers-2.png"></img>

## Overview
<img src="figs/transformers-overview/transformers-3.png"></img>

## Encoder
<img src="figs/transformers-overview/transformers-4.png"></img>

## Encoder
<img src="figs/transformers-overview/transformers-5.png"></img>

## Self-attention
- When processing an embedding at a certain position, self-attention allows the model to consider the other positions in the input embedding sequence in order to produce a better output embedding at that position.
- The example below shows the self-attention scores on the other words when processing the word "it".
- Here the model seems to pay the most attention to the words "the animal", which is what "it" refers to in this context.

<img src="figs/transformers-overview/transformers-6.png"></img>

## Self-attention
- At every position, the embedding $x_i$ is passed through three linear transformations $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ to produce the query ($q_i$), key ($k_i$), and value ($v_i$) vectors respectively.

<img src="figs/transformers-overview/transformers-7.png" width="50%"></img>

## Self-attention
- At a certain position, the self-attention scores of all the positions in the same sequence are computed by taking the dot product of the **query** vector $q_i$ with all **key** vectors $k_j$, dividing by $\sqrt{d_k}$, and then passing through a softmax.
- The self-attention scores are then used as the weights in a weighted sum of the **value** vectors $v_j$.
- This weighted sum is the output of the self-attention layer.
- In matrix form, this looks like 
$$\mathbf{X} \times \mathbf{W}_Q = \mathbf{Q}$$
$$\mathbf{X} \times \mathbf{W}_K = \mathbf{K}$$
$$\mathbf{X} \times \mathbf{W}_V = \mathbf{V}$$
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
- $d_k$ is the dimensionality of the query and key vectors.
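
The matrix form above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the toy sizes, the random weights, and the function names `softmax` and `self_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row-wise max does not change the result but avoids overflow.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d_model); W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v)."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): query i against every key j
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 3, 5        # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

out, weights = self_attention(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)          # (4, 5) (4, 4)
```

Row $i$ of `weights` holds the attention scores of position $i$ over all positions, and row $i$ of `out` is the corresponding weighted sum of value vectors.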

## Multihead attention
- In practice, attention is implemented as multihead attention: the attention already described is computed $h$ times, each with its own set of linear transformations of the input queries, keys, and values.
- The $h$ different outputs are then concatenated and projected down to the final output.
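
A rough sketch of the concatenate-and-project step, assuming the self-attention case where queries, keys, and values all come from the same input `X`. The names `heads` and `W_O`, the choice $d_k = d_v = d_{model} / h$, and the random weights are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multihead_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head.
    W_O: (h * d_v, d_model) output projection."""
    # Run the same attention computation once per head, each with its own weights.
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    concat = np.concatenate(outputs, axis=-1)   # (n, h * d_v)
    return concat @ W_O                         # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_v = d_model // h                        # assumed split of the model dimension
X = rng.normal(size=(n, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
    for _ in range(h)
]
W_O = rng.normal(size=(h * d_v, d_model))

Y = multihead_attention(X, heads, W_O)
print(Y.shape)                                  # (5, 16)
```

Real implementations usually compute all heads with one batched matrix multiplication instead of a Python loop; the loop is used here only because it mirrors the description more directly.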

<img src="figs/transformers-overview/transformers-fig2.png" width="75%"></img>

## Decoder
- The decoder layers are the same as the encoder layers but with two differences:
- First, an additional attention layer is added that attends over the encoder outputs.
    - In this case the query vectors come from the decoder sequence, but the key and value vectors come from the encoder output sequence.
- Second, in the self-attention of the decoder layers, attending to future positions is not allowed, i.e. at position $i$ the model is only allowed to attend to positions $j \le i$.
    - This is implemented by "masking" the future positions: their scores are set to $-\infty$ before the softmax, so they end up with zero attention weight.
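
The masking of future positions can be sketched as follows. This is a toy NumPy example with random queries and keys; `causal_mask` and `masked_attention_weights` are names made up for the illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)                     # exp(-inf) evaluates to exactly 0
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    # True where attention is NOT allowed, i.e. column j > row i.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_attention_weights(Q, K):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Future positions get a score of -inf before the softmax.
    scores = np.where(causal_mask(len(Q)), -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
W = masked_attention_weights(Q, K)
print(np.round(W, 2))                 # lower-triangular: zeros above the diagonal
```

The first row can only attend to position 0, so its single weight is 1; every later row spreads its weight over the positions up to and including its own.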

<img src="figs/transformers-overview/transformers-8.png" width="50%"></img>

## Positional embeddings
- The transformer architecture doesn't have any built-in notion of the order of the embeddings, since attention looks at all positions at once and the scores don't depend on where a token sits in the sequence.
- In this work, the authors inject this information by adding positional embeddings to the input embeddings of both the encoder and the decoder.
- There are different choices for positional embeddings, but here they use fixed embeddings computed from many sinusoids at different frequencies.
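
For reference, the original paper defines the fixed embeddings as $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. A small NumPy sketch of that formula, assuming an even $d_{model}$ (the function name is made up for this example):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Returns an (n_positions, d_model) array; assumes d_model is even."""
    pos = np.arange(n_positions)[:, None]             # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)                                       # (50, 16)

# The positional embeddings are simply added to the token embeddings:
# X = token_embeddings + pe[:sequence_length]
```

Each pair of dimensions is a sinusoid with its own frequency, from fast-changing (low $i$) to slow-changing (high $i$).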

## Summarized architecture
<img src="figs/transformers-overview/transformers-fig1.png" width="50%"></img>

## Training (neural machine translation)
- The decoder has a linear+softmax output layer that predicts the next token from the output embedding at a certain position.
- The encoder and decoder are trained end-to-end by maximizing the likelihood of the ground truth word in the output distribution at every position.
- The ground truth words are passed in to the decoder but shifted one step to the right. This is called teacher forcing.
- Remember that the decoder uses masking to ensure that it cannot look at future steps when computing its output at any position.
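
A toy sketch of the shifted decoder input and the per-position likelihood objective. The start-of-sequence id `BOS`, the vocabulary size of 10, and the random `logits` standing in for the decoder output are all assumptions made for illustration.

```python
import numpy as np

BOS = 0  # hypothetical id of the start-of-sequence token

def shift_right(targets, bos_id=BOS):
    # Decoder input at step t is the ground-truth token from step t - 1.
    return np.concatenate([[bos_id], targets[:-1]])

def log_softmax(logits):
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def negative_log_likelihood(logits, targets):
    """logits: (T, vocab) decoder outputs; targets: (T,) ground-truth token ids."""
    log_probs = log_softmax(logits)
    # Pick the log-probability of the ground-truth token at every position.
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.array([5, 2, 7, 1])
decoder_input = shift_right(targets)
print(decoder_input)                              # [0 5 2 7]

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(targets), 10))      # stand-in for the decoder output
loss = negative_log_likelihood(logits, targets)
print(loss)
```

Minimizing this loss is the same as maximizing the likelihood of the ground-truth token at every position. Because the whole shifted sequence is fed in at once, all positions are trained in parallel.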

## Decoding
- When using the model to translate an input text, the input text is passed through the encoder in the same way.
- But the decoder is now decoding in an autoregressive manner, i.e. the predicted output at a step is used as the input to the following step.
- It's possible to cache intermediate calculations when doing this instead of recomputing everything with the "new" input sequence.
- There are different strategies for decoding/generating a sequence with the decoder.
- The simplest is **greedy decoding**, which just takes the top predicted word at every step.
- Another very simple strategy is **random sampling** from the output distribution.
- Another method is **beam search**, which considers multiple "paths" or hypotheses when picking the next token.
- Lately, people have been using other strategies like **top-k sampling** and **nucleus sampling**, but we can talk about those some other time.
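
To make the first three strategies concrete, here is a small sketch on a toy "model". The table `P`, the `EOS` id, and the function names are invented for the example; a real decoder would condition on the encoder output and the whole prefix, not only on the last token.

```python
import numpy as np

EOS = 0  # hypothetical end-of-sequence token id

# Toy stand-in for the decoder: row t holds P(next token | last token = t).
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # after EOS (never used)
    [0.0, 0.0, 0.6, 0.4],
    [0.4, 0.2, 0.2, 0.2],
    [0.9, 0.0, 0.0, 0.1],
])

def next_token_probs(prefix):
    return P[prefix[-1]]

def greedy_decode(start, max_len=10):
    seq = [start]
    while len(seq) < max_len and seq[-1] != EOS:
        seq.append(int(np.argmax(next_token_probs(seq))))
    return seq

def sample_decode(start, rng, max_len=10):
    seq = [start]
    while len(seq) < max_len and seq[-1] != EOS:
        probs = next_token_probs(seq)
        seq.append(int(rng.choice(len(probs), p=probs)))
    return seq

def beam_search(start, beam_width=2, max_len=10):
    beams = [([start], 0.0)]   # (sequence, summed log-probability)
    for _ in range(max_len - 1):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS:
                candidates.append((seq, score))   # finished hypotheses carry over
                continue
            probs = next_token_probs(seq)
            for tok in np.flatnonzero(probs > 0):
                candidates.append((seq + [int(tok)], score + float(np.log(probs[tok]))))
        # Keep only the beam_width most likely hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == EOS for seq, _ in beams):
            break
    return beams[0][0]

print(greedy_decode(1))                             # [1, 2, 0]
print(beam_search(1))                               # [1, 3, 0]
print(sample_decode(1, np.random.default_rng(0)))   # varies with the seed
```

In this toy setup greedy decoding picks the locally best token 2 and ends with a sequence of probability 0.6 × 0.4 = 0.24, while beam search keeps the second hypothesis alive and finds the sequence with probability 0.4 × 0.9 = 0.36.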

## Transformers since 2017
Some Transformer-based models that have come out since the original paper was released. It's by no means an exhaustive list and I've probably missed a bunch of interesting papers. I think we should revisit this section later.

<img src="figs/transformers-overview/transformer-models.png" width="30%"></img>
<img src="figs/transformers-overview/transformer-models2.png" width="30%"></img>

## GPT
- The GPT models (1, 2, 3) are large generative language models.
- Architecture-wise, it's "just" a scaled-up version of the decoder part of the previously described architecture, trained on massive amounts of text.
- It is trained like any other language model, i.e. to predict the next token.

## BERT
- BERT is a language understanding model.
- BERT is first pretrained on massive amounts of unlabeled text using a masked language modeling objective.
- BERT can then be finetuned on many different downstream tasks like sentence classification by attaching another head and finetuning it.
- The masked language modeling as a pretraining objective has since been used in a ton of other models in different variants.
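
A simplified sketch of how inputs and labels for masked language modeling can be built. The token ids, `MASK_ID`, and `IGNORE` are hypothetical. The 15% masking rate follows the BERT paper, but this sketch always substitutes the mask token, whereas BERT sometimes keeps the original token or swaps in a random one.

```python
import numpy as np

MASK_ID = 103   # hypothetical id of the [MASK] token
IGNORE = -1     # hypothetical label for positions that do not contribute to the loss

def mask_tokens(token_ids, rng, mask_prob=0.15):
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < mask_prob
    inputs = np.where(is_masked, MASK_ID, token_ids)   # what the model sees
    labels = np.where(is_masked, token_ids, IGNORE)    # what it must predict
    return inputs, labels

rng = np.random.default_rng(0)
tokens = np.arange(1000, 1020)    # stand-in for a tokenized sentence
inputs, labels = mask_tokens(tokens, rng)
print(inputs)
print(labels)
```

The model is trained to recover the original token at every masked position, using context from both the left and the right.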

## TabNet
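- Attention-based architecture for tabular data (it uses sequential attention for feature selection rather than the Transformer block described above).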

## VisionTransformer
- Transformer encoder applied to image classification, i.e. replacing convolutions.
- The image is split into patches, and the flattened patches are linearly projected into embeddings and fed through the encoder.
- It is trained with a classification head on image classification problems and achieved state-of-the-art results at the time.
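
The patch step can be sketched with plain array reshaping. The image size, patch size, embedding size, and the random projection `W` are arbitrary choices for the example. The actual model also adds a class token and position embeddings, which are left out here.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, W):
    """image: (H, W_img, C) with H and W_img divisible by patch_size.
    W: (patch_size * patch_size * C, d_model) projection matrix."""
    H, W_img, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W_img // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)     # (H/p, W_img/p, p, p, C)
    patches = patches.reshape(-1, p * p * C)       # one flattened patch per row
    return patches @ W                             # (num_patches, d_model)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
W = rng.normal(size=(8 * 8 * 3, 64))
emb = image_to_patch_embeddings(image, 8, W)
print(emb.shape)                                   # (16, 64)
```

The resulting sequence of 16 embeddings plays the same role as the sequence of word embeddings in the text case.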

## Visual BERT models
- There are a couple of models exploring adding a visual modality to BERT.
- VisualBERT, VL-BERT, and ViLBERT are some of them.
- They all build on the similar idea of treating patches of images as visual words.

## CLIP and DALL-E
- CLIP predicts the similarity of an image and a text description.
- DALL-E generates an image given a text prompt, as seen in the (possibly cherry-picked) example below.

<img src="figs/transformers-overview/dall-e.png" width="50%"></img>

## Research directions
TODO

<img src="figs/transformers-overview/long-range-transformers-taxonomy.png" width="50%"></img>