# A3: Transformers (text)

In today's assignment, we will take a look at the Transformer. Some of the material in this
lab comes from the following online sources:

- https://medium.com/the-dl/transformers-from-scratch-in-pytorch-8777e346ca51
- https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec


## Transformers and machine learning trends

Before the arrival of transformers, CNNs were most often used in the visual domain, while RNNs
like LSTMs were most often used in NLP. There were many attempts at crossover, without much real success.
Neither approach seemed capable of dealing effectively with very large, complex natural language datasets.

In 2017, the Transformer was introduced. The paper, "Attention is all you need," has now been
cited more than 70,000 times.

The main concept in a Transformer is self-attention, which replaces the sequential processing of
RNNs and the local processing of CNNs with the ability to adaptively extract arbitrary relationships
between different elements of its input, output, and memory state.

## Transformer architecture

We will use [Frank Odom's implementation of the Transformer in PyTorch](https://github.com/fkodom/transformer-from-scratch/tree/main/src).

The architecture of the transformer looks like this:

<img src="../img/Transformer.png" title="Transformer" style="width: 600px;" />

Here is a summary of the Transformer's details and mathematics:

<img src="../img/SummaryTransformer.PNG" title="Transformer Details" style="width: 1000px;" />

There are several processes that we need to implement in the model. We will go through them one by one.

## Attention

Before Transformers, the standard model for sequence-to-sequence learning was seq2seq, which combines an RNN for encoding with
an RNN for decoding. The encoder processes the input and retains important information in a sequence or block of memory,
while the decoder extracts the important information from the memory in order to produce an output.

One problem with seq2seq is that some information may be lost while processing a long sequence.
Attention allows us to focus on specific inputs directly.

An attention-based decoder, when we want to produce the output token at a target position, will calculate an attention score
with the encoder's memory at each input position. A high score for a particular encoder position indicates that it is more important
than another position. We essentially use the decoder's input to select which encoder output(s) should be used to calculate the
current decoder output. Given decoder input $q$ (the *query*) and encoder outputs $p_i$, the attention operation calculates dot
products between $q$ and each $p_i$. The dot products give the similarity of each pair. The dot products are softmaxed to get
positive weights summing to 1, and the weighted average $r$ is calculated as

$$r = \sum_i \frac{e^{p_i\cdot q}}{\sum_j e^{p_j\cdot q}}p_i .$$

We can think of $r$ as an adaptively selected combination of the inputs most relevant to producing an output.
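
As a small worked example of this formula, the sketch below computes the scores, the softmax weights, and the weighted average $r$ for three encoder outputs and one query. The numeric values are made up purely for illustration.

```python
import torch

# Three encoder outputs p_i (one per row) and one decoder query q.
# These values are invented for illustration only.
p = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
q = torch.tensor([2.0, 0.0])

scores = p @ q                          # dot products p_i . q, one per encoder position
weights = torch.softmax(scores, dim=0)  # positive weights that sum to 1
r = weights @ p                         # weighted average of the p_i

print(scores)   # tensor([2., 0., 2.])
print(weights)  # positions 0 and 2 receive equal, larger weights than position 1
print(r)
```

Positions 0 and 2 have the same dot product with $q$, so they receive the same weight, and the result $r$ is dominated by those two rows.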

### Multi-head self attention

Transformers use a specific type of attention mechanism, referred to as multi-head self attention.
This is the most important part of the model. An illustration from the paper is shown below.

<img src="img/MultiHeadAttention.png" title="Transformer" style="width: 600px;" />

The multi-head attention layer is described as:

$$\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$$

$Q$, $K$, and $V$ are batches of matrices, each with shape <code>(batch_size, seq_length, num_features)</code>.
When we are talking about *self* attention, each of the three matrices in
each batch is just a separate linear projection of the same input $\bar{h}_t^{l-1}$.

Multiplying the query array $Q$ by the transpose of the key array $K$ results in a <code>(batch_size, seq_length, seq_length)</code> array,
which tells us roughly how important each element in the sequence is to each other element in the sequence. These dot
products are divided by $\sqrt{d_k}$, where $d_k$ is the number of features in each key vector, and then converted to
normalized weights using a softmax across rows, so that each row of weights sums to one.
Finally, the attention weight matrix is applied to the value ($V$) array using matrix multiplication. We thus get,
for each token in the input sequence, a weighted average of the rows of $V$, each of which corresponds to one of the
elements in the input sequence.

Here is code for the scaled dot-product operation that is part of a multi-head attention layer:

In [1]:
import torch
import torch.nn.functional as f
from torch import Tensor, nn

def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # MatMul operations are translated to torch.bmm in PyTorch
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)
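
The shapes described above can be checked on small random tensors. This sketch repeats the same three steps inline (batched matrix multiply, scaled softmax, batched matrix multiply); the tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_length, num_features = 2, 5, 8  # arbitrary illustrative sizes

query = torch.rand(batch_size, seq_length, num_features)
key = torch.rand(batch_size, seq_length, num_features)
value = torch.rand(batch_size, seq_length, num_features)

# (batch_size, seq_length, seq_length): one score per pair of positions
scores = query.bmm(key.transpose(1, 2)) / num_features ** 0.5
# Softmax over the last axis, so each row of weights sums to one
weights = F.softmax(scores, dim=-1)
# (batch_size, seq_length, num_features): weighted averages of the rows of value
output = weights.bmm(value)

print(scores.shape)          # torch.Size([2, 5, 5])
print(weights.sum(dim=-1))   # every entry is (approximately) 1
print(output.shape)          # torch.Size([2, 5, 8])
```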

A multi-head attention module is composed of several identical
*attention head* modules.
Each attention head contains three linear transformations for $Q$, $K$, and $V$ and combines them using scaled dot-product attention.
Note that this attention head could be used for self attention or another type of attention such as decoder-to-encoder attention, since
we keep $Q$, $K$, and $V$ separate.

In [3]:
class AttentionHead(nn.Module):
    def __init__(self, dim_in: int, dim_q: int, dim_k: int):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_q)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_k)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

Multiple attention heads can be combined by concatenating their outputs and applying a linear transformation, which gives us a multi-head attention layer:

In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_q: int, dim_k: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [AttentionHead(dim_in, dim_q, dim_k) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_k, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )

Each attention head computes its own transformation of the query, key, and value arrays,
and then applies scaled dot-product attention. Conceptually, this means each head can attend to a different part of the input sequence,
independent of the others. Increasing the number of attention heads allows the model to pay attention to more parts of the sequence at
once, which generally makes the model more expressive.
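
To make the concatenation step concrete, here is a minimal sketch that uses random tensors as stand-ins for the per-head outputs. The sizes (`num_heads = 4`, `dim_k = 8`, `dim_in = 32`) are assumptions chosen only for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Illustrative sizes, not taken from the model defaults.
num_heads, dim_in, dim_k = 4, 32, 8
batch_size, seq_length = 2, 5

# Stand-ins for the outputs of each attention head: (batch, seq, dim_k) each.
head_outputs = [torch.rand(batch_size, seq_length, dim_k) for _ in range(num_heads)]

# Concatenate along the feature axis: (batch, seq, num_heads * dim_k)
concatenated = torch.cat(head_outputs, dim=-1)

# Project back to the model dimension: (batch, seq, dim_in)
linear = nn.Linear(num_heads * dim_k, dim_in)
combined = linear(concatenated)

print(concatenated.shape)  # torch.Size([2, 5, 32])
print(combined.shape)      # torch.Size([2, 5, 32])
```

Because the final linear layer maps back to `dim_in`, the layer's output has the same shape as its query input, which is what allows it to sit inside a residual connection.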

### Positional Encoding

To complete the transformer encoder, we need another component, the *position encoder*.
The <code>MultiHeadAttention</code> class we just wrote has no trainable components that depend on a token's position
in the sequence (axis 1 of the input tensor). All of the weight matrices we have seen so far
*perform the same calculation for every input position*; that is, we don't have any position-dependent weights.
All of the learned transformations so far operate over the *feature dimension* (axis 2). This is good in that the model is compatible with any sequence
length. But without *any* information about position, our model is going to be unable to differentiate between different orderings of
the input -- each token gets the same output regardless of where it appears in the input sequence.
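
This order-blindness can be checked directly. The sketch below is a single self-attention step with no position information, using freshly initialized projections and arbitrary sizes (all assumptions for illustration). Shuffling the input tokens shuffles the outputs in the same way but does not change any individual token's output.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 8  # illustrative feature size
q_proj, k_proj, v_proj = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    # Plain scaled dot-product self attention, with no positional encoding.
    q, k, v = q_proj(x), k_proj(x), v_proj(x)
    weights = torch.softmax(q.bmm(k.transpose(1, 2)) / dim ** 0.5, dim=-1)
    return weights.bmm(v)

x = torch.rand(1, 4, dim)            # one sequence of four tokens
perm = torch.tensor([2, 0, 3, 1])    # an arbitrary reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[:, perm])

# Each token's output is unchanged; only the order of the rows differs.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # True
```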

Since order matters ("Ridgemont was in the store" has a different
meaning from "The store was in Ridgemont"), we need some way to provide the model with information about tokens' positions in the input sequence.
Whatever strategy we use should provide information about the relative position of data points in the input sequences.
In the Transformer, positional information is encoded using trigonometric functions in a constant 2D matrix $PE$:

$$PE_{(pos,2i)}=\sin (\frac{pos}{10000^{2i/d_{model}}})$$
$$PE_{(pos,2i+1)}=\cos (\frac{pos}{10000^{2i/d_{model}}}),$$

where $pos$ refers to a position in the input sentence sequence and $i$ refers to the position along the embedding vector dimension.
This matrix is *added* to the matrix consisting of the embeddings of each of the input tokens:

<img src="../img/positionalencoder.png" title="Positional Encoder" style="width: 400px;" />

Position encoding can be implemented as follows (put this in `utils.py`):

In [6]:
def position_encoding(seq_len: int, dim_model: int, device: torch.device = torch.device("cpu")) -> Tensor:
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
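    # Columns 2i and 2i+1 share the exponent 2i / dim_model, as in the formula above.
    # (Floor division by dim_model would make every exponent zero.)
    phase = pos / (1e4 ** ((dim - dim % 2) / dim_model))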

    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))
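
As a reference point, here is a direct, loop-based transcription of the two $PE$ formulas. The function name `sinusoidal_table` is hypothetical and the sketch favors readability over speed; it can be used to compare against a vectorized implementation on small inputs.

```python
import math
import torch

def sinusoidal_table(seq_len: int, dim_model: int) -> torch.Tensor:
    """Loop-based transcription of the PE formulas (hypothetical helper)."""
    pe = torch.zeros(seq_len, dim_model)
    for pos in range(seq_len):
        # `col` plays the role of 2i in the formula: 0, 2, 4, ...
        for col in range(0, dim_model, 2):
            angle = pos / (10000 ** (col / dim_model))
            pe[pos, col] = math.sin(angle)          # PE(pos, 2i)
            if col + 1 < dim_model:
                pe[pos, col + 1] = math.cos(angle)  # PE(pos, 2i+1)
    return pe

table = sinusoidal_table(seq_len=3, dim_model=4)
print(table[0])  # tensor([0., 1., 0., 1.]), since sin(0) = 0 and cos(0) = 1
print(table[1])  # sin(1), cos(1), sin(1/100), cos(1/100)
```

Each pair of columns uses a different wavelength, so nearby positions differ mostly in the first columns while distant positions also differ in the later ones.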

These sinusoidal encodings allow us to work with arbitrary-length sequences because the sine and cosine functions are periodic and bounded in the range
$[-1, 1]$. One hope is that they help when, during inference, we are provided with an input sequence longer than any found during training.
The position encodings of the last elements in the sequence would be different from anything the model has seen before, but with the
periodic sine/cosine encoding, there will still be some similar structure, with the new encodings being very similar to neighboring
encodings the model has seen before. For this reason, despite the fact that learned embeddings appeared to perform equally well, the authors chose
this fixed sinusoidal encoding.

### The complete encoder

The transformer uses an encoder-decoder architecture. The encoder processes the input sequence and returns a sequence of
feature vectors or memory vectors, while the decoder outputs a prediction of the target sequence,
incorporating information from the encoder memory.

First, let's complete the transformer layer with the two-layer feed forward network. Put this in `utils.py`:

In [7]:
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )

Let's create a residual module to encapsulate the feed forward network or attention
model along with the common dropout and LayerNorm operations (also in `utils.py`):

In [8]:
class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "query" tensor is given first, so we can compute the
        # residual.  This matches the signature of 'MultiHeadAttention'.
        return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))
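
The effect of the add-then-normalize pattern can be seen on a toy example. This sketch applies the same three operations inline, with a single `nn.Linear` standing in for the sublayer; the sizes are arbitrary assumptions. The output keeps the input's shape, and each token's feature vector ends up with approximately zero mean and unit variance.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 16  # illustrative feature size

sublayer = nn.Linear(dim, dim)  # stand-in for attention or the feed forward network
norm = nn.LayerNorm(dim)
dropout = nn.Dropout(0.1)

x = torch.rand(2, 5, dim)
out = norm(x + dropout(sublayer(x)))  # residual connection, then LayerNorm

print(out.shape)                             # torch.Size([2, 5, 16])
print(out.mean(dim=-1).abs().max())          # close to 0
print(out.var(dim=-1, unbiased=False).mean())  # close to 1
```

Keeping the shape unchanged is what lets these blocks be stacked one after another.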

Now we can create the complete encoder! Put this in `encoder.py`. First, the encoder layer
module, which comprises a self attention residual block followed by a fully connected residual block:

In [10]:
class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_q = dim_k = max(dim_model // num_heads, 1)
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        src = self.attention(src, src, src)
        return self.feed_forward(src)
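
The per-head dimension computed in the constructor is worth checking by hand, since `dim_model` is not always divisible by `num_heads`. The helper `head_dims` below is hypothetical and simply repeats that arithmetic.

```python
def head_dims(dim_model: int, num_heads: int) -> tuple[int, int]:
    """Return (per-head dimension, width of the concatenated head outputs)."""
    dim_k = max(dim_model // num_heads, 1)
    return dim_k, num_heads * dim_k

print(head_dims(512, 6))  # (85, 510)
print(head_dims(512, 8))  # (64, 512)
```

With 6 heads, the concatenated width is 510 rather than 512; the final linear layer in the multi-head attention module maps it back to `dim_model`, so the residual connection still works. With 8 heads, the widths match exactly.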

Then the Transformer encoder just encapsulates several transformer encoder layers:

In [11]:
class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
                for _ in range(num_layers)
            ]
        )

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        # Out-of-place add, so the caller's tensor is not modified in place.
        src = src + position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)

        return src

### The decoder

The decoder module is quite similar to the encoder, with just a few small differences:
- The decoder accepts two inputs (the target sequence and the encoder memory), rather than one input.
- There are two multi-head attention modules per layer (the target sequence self-attention module and the decoder-encoder attention module) rather than just one.
- The second multi-head attention module, rather than strict self attention, expects the encoder memory as $K$ and $V$.
- Since accessing future elements of the target sequence would be "cheating," we need to mask out future elements of the input target sequence.

First, we have the decoder version of the transformer layer and the decoder module itself:

In [12]:
class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_q = dim_k = max(dim_model // num_heads, 1)
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        tgt = self.attention_1(tgt, tgt, tgt)
        tgt = self.attention_2(tgt, memory, memory)
        return self.feed_forward(tgt)

    
class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
                for _ in range(num_layers)
            ]
        )
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        # Out-of-place add, so the caller's tensor is not modified in place.
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)

        return torch.softmax(self.linear(tgt), dim=-1)


Note that there is not, as of yet, any masked attention implementation here!
Making this version of the Transformer work in practice would require at least that.
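
One possible starting point is sketched below. The function `masked_scaled_dot_product_attention` is a hypothetical variant of the scaled dot-product operation, not part of the implementation above. It sets the scores for future positions to negative infinity before the softmax, so that those positions receive zero weight. It assumes self attention over the target sequence, where query and key have the same length.

```python
import torch
import torch.nn.functional as F
from torch import Tensor

def masked_scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor):
    """Hypothetical causal variant: position t may only attend to positions <= t."""
    scores = query.bmm(key.transpose(1, 2)) / query.size(-1) ** 0.5
    tgt_len, src_len = scores.size(1), scores.size(2)
    # Boolean mask that is True strictly above the diagonal (the "future").
    future = torch.triu(
        torch.ones(tgt_len, src_len, dtype=torch.bool, device=scores.device),
        diagonal=1,
    )
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # masked entries become exactly 0
    return weights.bmm(value), weights

torch.manual_seed(0)
tgt = torch.rand(1, 4, 8)  # illustrative target sequence
out, weights = masked_scaled_dot_product_attention(tgt, tgt, tgt)
print(weights[0])  # lower-triangular: zeros above the diagonal
```

To use something like this in the decoder, the mask would need to be applied only in the first (target self attention) module of each decoder layer, not in the decoder-encoder attention.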

### Putting it together

Now we can put the encoder and decoder together:

In [13]:
class Transformer(nn.Module):
    def __init__(
        self, 
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 6, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        return self.decoder(tgt, self.encoder(src))

Let’s create a simple test, as a sanity check for our implementation. We can construct random tensors for the input and target sequences, check that our model executes without errors, and confirm that the output tensor has the correct shape:

In [14]:
src = torch.rand(64, 32, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)
# torch.Size([64, 16, 512])

torch.Size([64, 16, 512])


You could try implementing masked attention and training this Transformer model on a
sequence-to-sequence problem. However, to understand masking, you might first find
the [PyTorch Transformer tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
useful. Note that the model in that tutorial is only a Transformer encoder for language modeling, but it uses
masking in the encoder's self attention module.