# Transformer Architecture with PyTorch Examples

In [None]:
%pip install torch


## Transformer Overview

The Transformer model consists of an encoder and a decoder. Each of these components is made up of several layers of attention mechanisms and feed-forward networks.

![Transformer architecture](https://daleonai.com/images/screen-shot-2021-05-06-at-12.12.21-pm.png)

Now that we have reviewed **Self-Attention** and **Multi-Head Attention**, let's look at how to implement them in PyTorch.



### Self-Attention Calculation

The self-attention mechanism can be described with the following steps:

1. Compute Query, Key, and Value matrices from the input.
2. Calculate attention scores as the dot product of Query and Key, scaled by $\sqrt{d_k}$, where $d_k$ is the Key dimension.
3. Apply a Softmax function to the attention scores.
4. Compute the weighted sum of the Value vectors.



### PyTorch Implementation


In [None]:

import torch
import torch.nn.functional as F

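def self_attention(Q, K, V):
    # Scale by sqrt(d_k) so the dot products do not grow with the key dimension
    scale = K.shape[-1] ** 0.5
    scores = torch.matmul(Q, K.transpose(-2, -1)) / scale  # (batch_size, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    output = torch.matmul(weights, V)    # weighted sum of the value vectors
    return output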

# Example
Q = torch.randn(1, 3, 4)  # (batch_size, seq_len, embed_dim)
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

print(f'Q:{Q}\nK:{K}\nV:{V}')
attention_output = self_attention(Q, K, V)
print(f'Attention:{attention_output}')
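
The example above feeds random `Q`, `K`, and `V` directly, so step 1 of the list (computing them from the input) is not shown. Below is a minimal sketch of that step, assuming bias-free linear projections; the layer names `w_q`, `w_k`, and `w_v` are illustrative and do not come from the text above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

embed_dim = 4
x = torch.randn(1, 3, embed_dim)  # (batch_size, seq_len, embed_dim)

# Hypothetical projection layers (illustrative names, randomly initialised)
w_q = nn.Linear(embed_dim, embed_dim, bias=False)
w_k = nn.Linear(embed_dim, embed_dim, bias=False)
w_v = nn.Linear(embed_dim, embed_dim, bias=False)

# Step 1: Q, K and V are all computed from the same input x
Q, K, V = w_q(x), w_k(x), w_v(x)

# Steps 2-3: scaled dot-product scores, then softmax over the key axis
scores = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])
weights = F.softmax(scores, dim=-1)

# Step 4: weighted sum of the value vectors
output = weights @ V

print(weights.sum(dim=-1))  # every row of attention weights sums to 1
print(output.shape)         # same shape as the input x
```

Because the weights in each row sum to 1, every output vector is a convex combination of the value vectors.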



## Positional Encodings



In the original Transformer paper, the authors add to the embeddings a mechanism that encodes the position of each word.

They define the positional encoding $PE$ using sines and cosines of the word positions. These values do not depend
on the embeddings. Let $i$ range from $0$ to $d_{model}/2 - 1$ and let $pos$ be the position of the embedding vector in the sequence; the
$PE$ matrix is then given by:

$$
\begin{align*}
PE(pos, 2i) &= \sin\left(\frac{pos}{l^{2i/d_{model}}}\right) \\
PE(pos, 2i+1) &= \cos\left(\frac{pos}{l^{2i/d_{model}}}\right)
\end{align*}
$$

Here $l$ is a user-defined scalar; the paper _Attention Is All You Need_ uses $l = 10000$.
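
To make the formula concrete, here is a small sketch that evaluates it entry by entry in plain Python for $d_{model} = 4$. The helper name `positional_encoding_entry` is hypothetical and chosen only for this illustration.

```python
import math

def positional_encoding_entry(pos, k, d_model, l=10000):
    """Return PE(pos, k) computed directly from the sine/cosine formula."""
    i = k // 2  # index of the sine/cosine pair that column k belongs to
    angle = pos / (l ** (2 * i / d_model))
    # Even columns use the sine, odd columns use the cosine
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

d_model = 4
for pos in range(3):
    row = [positional_encoding_entry(pos, k, d_model) for k in range(d_model)]
    print(pos, [round(v, 4) for v in row])
```

At position 0 every sine entry is 0 and every cosine entry is 1. Later columns use a larger divisor, so they change more slowly as `pos` grows.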


## PyTorch Implementation


In [None]:
import torch
import math

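class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000, l=10000):

        super().__init__()

        encoding = torch.zeros(max_len, d_model)
        positions = torch.arange(0, max_len).unsqueeze(1)  # (max_len, 1)
        # 1 / l^(2i / d_model), computed in log space; assumes d_model is even
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(l) / d_model))
        encoding[:, 0::2] = torch.sin(positions * div_term)  # even columns
        encoding[:, 1::2] = torch.cos(positions * div_term)  # odd columns
        # A buffer is not trained, but it follows the module across .to(device) calls
        self.register_buffer("encoding", encoding.unsqueeze(0))  # (1, max_len, d_model)
    
    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add the first seq_len position vectors
        return x + self.encoding[:, :x.size(1)]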

# Example
pos_encoding = PositionalEncoding(d_model=4)
x = torch.randn(1, 3, 4)
encoded_x = pos_encoding(x)
print(encoded_x)



## Encoder Block


A Transformer encoder block consists of the following layers:

- Multi-Head Attention
- Add & Norm
- Feed-Forward Network
- Add & Norm
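
The "Add & Norm" step in this list can be sketched in isolation: a residual connection around a sublayer, followed by layer normalization over the feature axis. In the sketch below a single `nn.Linear` stands in for the attention or feed-forward sublayer; that substitution is an assumption made only to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8
x = torch.randn(1, 3, d_model)          # (batch_size, seq_len, d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the feed-forward network
norm = nn.LayerNorm(d_model)

# Add: residual connection around the sublayer. Norm: LayerNorm over the feature axis.
y = norm(x + sublayer(x))

print(y.shape)
print(y.mean(dim=-1))                 # close to 0 for every token
print(y.var(dim=-1, unbiased=False))  # close to 1 for every token
```

The shape is unchanged, which is what allows encoder blocks to be stacked one after another.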


## PyTorch Implementation


In [None]:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward):
        super(TransformerBlock, self).__init__()
        # batch_first=True so inputs are (batch_size, seq_len, d_model),
        # the same layout that PositionalEncoding expects
        self.attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x, memory=None):
        # Self-attention by default. If memory is given, x supplies the queries
        # and memory supplies the keys and values (cross-attention).
        kv = x if memory is None else memory
        attn_output, _ = self.attention(x, kv, kv)
        x = x + self.dropout(attn_output)  # Add
        x = self.norm1(x)                  # Norm
        ffn_output = self.ffn(x)
        x = x + self.dropout(ffn_output)   # Add
        x = self.norm2(x)                  # Norm
        return x

# Example
transformer_block = TransformerBlock(d_model=4, nhead=2, dim_feedforward=8)
x = torch.randn(1, 3, 4)  # (batch_size, seq_len, d_model)
output = transformer_block(x)
print(output)


In [None]:
import torch.nn as nn

class TransformerEncoder(nn.Module):
    def __init__(self, d_model, nhead, num_layers, dim_feedforward):
        super(TransformerEncoder, self).__init__()
        self.positional_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, nhead, dim_feedforward)
            for _ in range(num_layers)
        ])
    
    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x)
        return x

# Example
model = TransformerEncoder(d_model=4, nhead=2, num_layers=2, dim_feedforward=8)
x = torch.randn(1, 3, 4)  # (batch_size, seq_len, d_model)
output = model(x)
print(output)


## Decoder Block

For the decoder, the self-attention sublayer uses a **masking** mechanism. To prevent illegal connections, that is, to let each position in the
masked decoder block attend only to itself and to earlier positions of the sequence, we can use a lower triangular matrix:

$$
\operatorname{mask} = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
1 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1 
\end{bmatrix}
$$

We apply this mask to the scaled $QK^T$ scores, setting every entry where the mask is $0$ to $-\infty$, so that those entries receive zero weight after the Softmax.
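
The effect of the mask is easiest to see on a toy score matrix. The sketch below assumes all scores are zero (a stand-in for real $QK^T$ values), so the Softmax output shows only what the mask does.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(seq_len, seq_len)           # stand-in for the scaled QK^T scores
mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = allowed, 0 = future position

# Entries where the mask is 0 become -inf and therefore get zero weight
masked_scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)
print(weights)
```

Row $i$ spreads its weight evenly over positions $0$ to $i$ and assigns exactly zero to every later position.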

## PyTorch Implementation

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super(MaskedMultiHeadAttention, self).__init__()
        assert d_model % nhead == 0, "Embedding dimension must be divisible by number of heads"
        
        self.d_model = d_model
        self.nhead = nhead
        self.head_dim = d_model // nhead
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        def shape(x):
            # (batch_size, seq_length, d_model) -> (batch_size, nhead, seq_length, head_dim)
            x = x.view(batch_size, -1, self.nhead, self.head_dim)
            return x.permute(0, 2, 1, 3)
        
        # Linear projections
        q = shape(self.q_linear(query))
        k = shape(self.k_linear(key))
        v = shape(self.v_linear(value))
        
        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        
        if mask is not None:
            # Apply the mask to the attention scores
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        # Weighted sum of values
        output = torch.matmul(attn_weights, v)
        
        # Concat heads and pass through final linear layer
        output = output.permute(0, 2, 1, 3).contiguous()
        output = output.view(batch_size, -1, self.d_model)
        
        return self.out_linear(output)

# Example usage:
d_model = 64
nhead = 8
seq_length = 20
attention = MaskedMultiHeadAttention(d_model, nhead)

# Random input tensors
query = torch.randn(1, seq_length, d_model)  # (batch_size, seq_length, d_model)
key = torch.randn(1, seq_length, d_model)
value = torch.randn(1, seq_length, d_model)

# Masking: the mask has shape (1, 1, seq_length, seq_length) and broadcasts over batch_size and nhead
mask = torch.tril(torch.ones(seq_length, seq_length)).unsqueeze(0).unsqueeze(0)  # Lower triangular matrix for causal mask
mask = mask.to(dtype=torch.bool)  # Convert to boolean mask

# Forward pass
output = attention(query, key, value, mask)
print(output)
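
One way to check that a causal mask works is to change the last token and confirm that the outputs at earlier positions do not move. The sketch below runs that check on PyTorch's built-in `nn.MultiheadAttention` rather than the class above, so it is a sanity check of the idea under that assumption, not a test of this notebook's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, nhead, seq_length = 16, 4, 6
# dropout=0.0 keeps the two forward passes deterministic
mha = nn.MultiheadAttention(d_model, nhead, dropout=0.0, batch_first=True)

# For nn.MultiheadAttention, True in a boolean attn_mask means "may NOT attend",
# the opposite convention from the lower triangular mask used above
blocked = torch.triu(torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1)

x = torch.randn(1, seq_length, d_model)
x_changed = x.clone()
x_changed[:, -1] += 1.0  # perturb only the last token

with torch.no_grad():
    out, _ = mha(x, x, x, attn_mask=blocked)
    out_changed, _ = mha(x_changed, x_changed, x_changed, attn_mask=blocked)

# Earlier positions never attend to the last token, so their outputs are unchanged
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-6))
# The last position does see the change
print(torch.allclose(out[:, -1], out_changed[:, -1]))
```

Note the inverted mask convention: the built-in layer marks blocked positions with `True`, while the implementation above marks allowed positions with `1`.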



We can finally assemble the PyTorch implementation of the Transformer architecture.

## PyTorch Implementation

In [None]:
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, mask):
        super(DecoderBlock, self).__init__()
        self.mask = mask
        self.masked_attn = MaskedMultiHeadAttention(d_model, nhead)
        self.transformer_block = TransformerBlock(d_model, nhead, dim_feedforward)
        self.dropout = nn.Dropout(0.1)
        self.norm = nn.LayerNorm(d_model)
    
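    def forward(self, output, x):
        # output: decoder input/state, x: encoder output
        attn_output = self.masked_attn(output, output, output, self.mask)
        attn_output = output + self.dropout(attn_output)
        attn_output = self.norm(attn_output)
        # The decoder state is the first (query) input; the encoder output is the second input
        return self.transformer_block(attn_output, x)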


class Transformer(nn.Module):
    def __init__(self, d_model, nhead, num_layers, dim_feedforward, dim_out, mask):
        super(Transformer, self).__init__()
        self.encoder = TransformerEncoder(d_model, nhead, num_layers, dim_feedforward)
        self.positional_encoding = PositionalEncoding(d_model)
        self.decoder = nn.ModuleList([DecoderBlock(d_model, nhead, dim_feedforward, mask)
                                          for _ in range(num_layers)])
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_out),
            nn.ReLU()
        )
    
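    def forward(self, x):
        # Encode the input; the encoder applies its own positional encoding
        x = self.encoder(x)
        # This simplified example has no separate target sequence, so the encoder
        # output (plus positional encoding) also serves as the decoder input
        out = self.positional_encoding(x)
        for layer in self.decoder:
            out = layer(out, x)
        # Project the decoder output (not the encoder output) to dim_out
        return self.ffn(out)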

# Example
d_model = 64
nhead = 8
seq_length = 20

# Random input tensors
x = torch.randn(1, seq_length, d_model)  # (batch_size, seq_length, d_model)

# Masking: the mask has shape (1, 1, seq_length, seq_length) and broadcasts over batch_size and nhead
mask = torch.tril(torch.ones(seq_length, seq_length)).unsqueeze(0).unsqueeze(0)  # Lower triangular matrix for causal mask
mask = mask.to(dtype=torch.bool)  # Convert to boolean mask

model = Transformer(d_model=d_model, nhead=nhead, num_layers=2, dim_feedforward=8, dim_out=12, mask=mask)
output = model(x)
print(output)
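
For comparison, PyTorch ships a reference `nn.Transformer` module that takes separate source and target sequences. The sketch below uses arbitrary illustrative sizes and random tensors in place of embedded tokens. The built-in module covers only the encoder and decoder stacks, so embeddings, positional encoding, and the final output projection are left out here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, nhead = 16, 4
src_len, tgt_len = 7, 5

reference = nn.Transformer(
    d_model=d_model, nhead=nhead,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=32, dropout=0.0, batch_first=True,
)

src = torch.randn(1, src_len, d_model)  # encoder input (stand-in for embedded tokens)
tgt = torch.randn(1, tgt_len, d_model)  # decoder input (stand-in for embedded tokens)

# Boolean causal mask: True marks the future positions the decoder must not see
tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

out = reference(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # (batch_size, tgt_len, d_model)
```

The output has one vector per target position, which shows that the decoder, not the encoder, determines the output length.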

## Conclusion

In this notebook, we have explored the core components of the Transformer architecture, including self-attention, positional encoding, and the structure of Transformer blocks. We have also provided PyTorch implementations for each of these components.

Feel free to experiment with different configurations and datasets to better understand how Transformers work!

## Further reading

[Grant Sanderson on Transformers](https://www.youtube.com/watch?v=eMlx5fFNoYc)