# Transformers Architecture with PyTorch Examples

In [None]:
%pip install torch


## Transformer Overview

The Transformer model consists of an encoder and a decoder. Each of these components is made up of several layers of attention mechanisms and feed-forward networks.

![Transformer architecture](https://daleonai.com/images/screen-shot-2021-05-06-at-12.12.21-pm.png)

Having reviewed **Self-Attention** and **Multi-Head Attention**, let's look at how to implement them in PyTorch.



### Self-Attention Calculation

The self-attention mechanism can be described with the following steps:

1. Compute Query, Key, and Value matrices from the input.
2. Calculate attention scores using the dot product of Query and Key, scaled by the square root of the key dimension.
3. Apply a Softmax function to the attention scores.
4. Compute the weighted sum of the Value vectors.
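
In matrix form, these steps are usually written as scaled dot-product attention. The symbols follow the $Q$, $K$, $V$ names used in the steps above, and $d_k$ denotes the key dimension:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Dividing by $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would otherwise push the softmax toward near one-hot outputs.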



### PyTorch Implementation


In [71]:

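import math

import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    # Scale by sqrt(d_k); a plain Python float also works when the inputs live on a GPU.
    scale = math.sqrt(K.shape[-1])
    # Scores have shape (batch_size, seq_len_q, seq_len_k).
    scores = torch.matmul(Q, K.transpose(-2, -1)) / scale
    # Normalize over the key axis so each query's weights sum to 1.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors.
    output = torch.matmul(weights, V)
    return output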

# Example
Q = torch.randn(1, 3, 4)  # (batch_size, seq_len, embed_dim)
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

print(f'Q:{Q}\nK:{K}\nV:{V}')
attention_output = self_attention(Q, K, V)
print(f'Attention:{attention_output}')


Q:tensor([[[ 1.6851,  0.1665, -0.0917, -0.2914],
         [-0.8735,  0.8515,  1.1608,  0.1396],
         [-0.1546,  1.5666,  0.9541,  1.1648]]])
K:tensor([[[ 0.4788,  0.3367,  1.1149, -0.2138],
         [ 1.7596, -0.5329,  0.6616, -1.7132],
         [ 0.3233,  0.9728,  0.9566, -0.4169]]])
V:tensor([[[ 1.0031,  0.5750, -1.0804,  1.0226],
         [-2.3141,  0.3248, -0.1342, -0.5509],
         [-0.1072,  1.4614, -0.6232, -0.3182]]])
Attention:tensor([[[-1.3144,  0.5714, -0.3945, -0.2204],
         [ 0.0929,  0.9892, -0.7508,  0.1856],
         [ 0.1976,  1.0412, -0.7743,  0.1983]]])
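
To make the softmax step concrete, here is a small sketch that returns the attention weights as well as the output. The helper name `attention_with_weights` and the fixed seed are illustrative assumptions, not part of the cell above.

```python
import math

import torch
import torch.nn.functional as F


def attention_with_weights(Q, K, V):
    """Scaled dot-product attention that also returns the softmax weights."""
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(K.shape[-1])
    weights = F.softmax(scores, dim=-1)  # (batch_size, seq_len_q, seq_len_k)
    return torch.matmul(weights, V), weights


torch.manual_seed(0)  # arbitrary seed, only for reproducibility
Q = torch.randn(1, 3, 4)  # (batch_size, seq_len, embed_dim)
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

output, weights = attention_with_weights(Q, K, V)
print(weights.shape)        # one weight per (query, key) pair
print(weights.sum(dim=-1))  # each query's weights sum to 1
print(output.shape)         # same shape as V
```

Each row of `weights` is a probability distribution over the keys, so every output vector is a convex combination of the rows of `V`.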



## Positional Encodings


In the original Transformer model, the positional encoding for a given position $pos$ and dimension index $i$ is given by:

$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d}\right)$$

$$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d}\right)$$

where $d$ is the dimensionality of the embeddings.
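
As a quick worked example of the formula above, the sketch below evaluates the encoding for a single position using only Python's `math` module. The function name `positional_encoding_row` and the choice $d = 4$ are illustrative assumptions.

```python
import math


def positional_encoding_row(pos, d):
    """Return [PE(pos, 0), ..., PE(pos, d-1)] for an even embedding size d."""
    row = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        row.append(math.sin(angle))  # dimension 2i
        row.append(math.cos(angle))  # dimension 2i + 1
    return row


# For d = 4 the two divisors are 10000^0 = 1 and 10000^0.5 = 100.
print(positional_encoding_row(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding_row(1, 4))  # [sin(1), cos(1), sin(0.01), cos(0.01)]
```

Low dimension indices vary quickly with position, while higher ones vary slowly, which is what lets the encoding distinguish both nearby and distant positions.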


### PyTorch Implementation


In [None]:

import torch
import math

class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        encoding = torch.zeros(max_len, d_model)
        positions = torch.arange(0, max_len).unsqueeze(1).float()
        # 1 / 10000^(2i / d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        encoding[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
        encoding[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions (assumes even d_model)
        # Register as a buffer so it moves with the module across devices but is not trained.
        self.register_buffer("encoding", encoding.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); positions are indexed along dim 1
        return x + self.encoding[:, :x.size(1)]

# Example
pos_encoding = PositionalEncoding(d_model=4)
x = torch.randn(1, 3, 4)  # (batch_size, seq_len, d_model)
encoded_x = pos_encoding(x)
print(encoded_x)


## Transformer Block


A Transformer block consists of the following layers:

1. Multi-Head Attention
2. Add & Norm
3. Feed-Forward Network
4. Add & Norm
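
The "Add & Norm" step in the list above is a residual connection followed by layer normalization. A minimal sketch, with a plain `nn.Linear` standing in for the sublayer (an assumption made only to keep the example short):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
d_model = 4
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the feed-forward network
norm = nn.LayerNorm(d_model)

x = torch.randn(1, 3, d_model)  # (batch_size, seq_len, d_model)
y = norm(x + sublayer(x))       # "Add" (residual) then "Norm"

print(y.shape)         # shape is unchanged by Add & Norm
print(y.mean(dim=-1))  # close to 0 for every token at initialization
```

The residual path lets each block learn a correction to its input rather than a full replacement, and the normalization keeps activations on a comparable scale from block to block.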


### PyTorch Implementation


In [None]:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward):
        super(TransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        x = x + self.dropout(attn_output)
        x = self.norm1(x)
        ffn_output = self.ffn(x)
        x = x + self.dropout(ffn_output)
        x = self.norm2(x)
        return x

# Example
transformer_block = TransformerBlock(d_model=4, nhead=2, dim_feedforward=8)
x = torch.randn(3, 1, 4)  # (seq_len, batch_size, d_model)
output = transformer_block(x)
print(output)


## Multi-Head Attention


Multi-Head Attention lets the model attend to different parts of the sequence from several perspectives at once. It runs multiple self-attention mechanisms (heads) in parallel and combines their outputs.
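
To make the "in parallel" part concrete, the sketch below splits the embedding dimension into heads, runs scaled dot-product attention in each head, and concatenates the results. It deliberately omits the learned input and output projections that `nn.MultiheadAttention` applies, and the helper name `split_heads_attention` is hypothetical.

```python
import math

import torch
import torch.nn.functional as F


def split_heads_attention(x, nhead):
    """Self-attention over x with the embedding split into nhead heads (no learned weights)."""
    batch_size, seq_len, d_model = x.shape
    head_dim = d_model // nhead  # assumes d_model is divisible by nhead
    # (batch_size, seq_len, d_model) -> (batch_size, nhead, seq_len, head_dim)
    heads = x.view(batch_size, seq_len, nhead, head_dim).transpose(1, 2)
    scores = heads @ heads.transpose(-2, -1) / math.sqrt(head_dim)
    weights = F.softmax(scores, dim=-1)  # one (seq_len, seq_len) map per head
    out = weights @ heads
    # Concatenate the heads back into (batch_size, seq_len, d_model)
    return out.transpose(1, 2).reshape(batch_size, seq_len, d_model), weights


torch.manual_seed(0)  # arbitrary seed, only for reproducibility
x = torch.randn(1, 3, 4)  # (batch_size, seq_len, d_model)
output, weights = split_heads_attention(x, nhead=2)
print(output.shape)   # same shape as the input
print(weights.shape)  # (batch_size, nhead, seq_len, seq_len)
```

Each head sees only its own slice of the embedding, so the two attention maps can differ even though they come from the same input.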


### PyTorch Implementation


In [None]:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super(MultiHeadAttention, self).__init__()
        self.attention = nn.MultiheadAttention(d_model, nhead)
    
    def forward(self, query, key, value):
        attn_output, _ = self.attention(query, key, value)
        return attn_output

# Example
multi_head_attention = MultiHeadAttention(d_model=4, nhead=2)
x = torch.randn(3, 1, 4)  # (seq_len, batch_size, d_model)
output = multi_head_attention(x, x, x)
print(output)


## Feed-Forward Networks


Each Transformer block contains a feed-forward network that consists of two linear transformations with a ReLU activation in between.
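
Because `nn.Linear` acts on the last dimension only, such a network treats every position of the sequence independently. A small sketch that checks this property; the variable names and sizes are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
ffn = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))

x = torch.randn(3, 1, 4)  # (seq_len, batch_size, d_model)
full = ffn(x)  # whole sequence at once
# Apply the same network to one position at a time and restack the results.
per_position = torch.stack([ffn(x[t]) for t in range(x.size(0))])

# The two results agree up to floating-point rounding: positions do not interact.
print(torch.allclose(full, per_position, atol=1e-5))
```

Mixing information between positions is therefore left entirely to the attention layers.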


### PyTorch Implementation


In [None]:

class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model, dim_feedforward):
        super(FeedForwardNetwork, self).__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model)
        )
    
    def forward(self, x):
        return self.ffn(x)

# Example
feed_forward = FeedForwardNetwork(d_model=4, dim_feedforward=8)
x = torch.randn(3, 1, 4)  # (seq_len, batch_size, d_model)
output = feed_forward(x)
print(output)


## Full Transformer Model


A full Transformer model combines multiple Transformer blocks: the encoder and decoder stacks are each built by stacking such blocks. The model below stacks encoder-style blocks only.

### PyTorch Implementation


In [None]:

class TransformerModel(nn.Module):
    def __init__(self, d_model, nhead, num_layers, dim_feedforward):
        super(TransformerModel, self).__init__()
        self.positional_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, nhead, dim_feedforward)
            for _ in range(num_layers)
        ])
    
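    def forward(self, x):
        # x: (seq_len, batch_size, d_model), the default layout of nn.MultiheadAttention.
        # PositionalEncoding indexes positions along dim 1, so switch to
        # (batch_size, seq_len, d_model) for that step and back afterwards.
        x = self.positional_encoding(x.transpose(0, 1)).transpose(0, 1)
        for layer in self.layers:
            x = layer(x)
        return x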

# Example
model = TransformerModel(d_model=4, nhead=2, num_layers=2, dim_feedforward=8)
x = torch.randn(3, 1, 4)  # (seq_len, batch_size, d_model)
output = model(x)
print(output)
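
For comparison, PyTorch ships `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`, which follow the same stack-of-blocks pattern. The sketch below is not weight-for-weight equivalent to the model above and adds no positional encoding; the hyperparameters simply mirror the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
encoder_layer = nn.TransformerEncoderLayer(d_model=4, nhead=2, dim_feedforward=8, dropout=0.1)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
encoder.eval()  # disable dropout for inference

x = torch.randn(3, 1, 4)  # (seq_len, batch_size, d_model), the default layout
with torch.no_grad():
    output = encoder(x)
print(output.shape)  # same shape as the input
```

Writing the blocks by hand, as in this notebook, is useful for learning; the built-in modules are usually the more practical starting point for real models.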


## Conclusion

In this notebook, we have explored the core components of the Transformer architecture, including self-attention, positional encoding, and the structure of Transformer blocks. We have also provided PyTorch implementations for each of these components.

Feel free to experiment with different configurations and datasets to better understand how Transformers work!

## Further reading

[Grant Sanderson on Transformers](https://www.youtube.com/watch?v=eMlx5fFNoYc)