
# Transformer Architecture

## Introduction

The Transformer is a neural network architecture introduced in the paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. in 2017. It revolutionized natural language processing by eliminating recurrence and convolutions entirely, relying solely on attention mechanisms to draw global dependencies between input and output. This architecture became the foundation for models like BERT, GPT, and other state-of-the-art language models.

## Historical Context

Before Transformers, sequence modeling primarily relied on:
- **RNNs/LSTMs/GRUs**: Process tokens one at a time, so computation cannot be parallelized across positions within a sequence
- **CNNs for sequences**: Required many layers to capture long-range dependencies

The key innovation of Transformers was addressing these limitations by processing all positions simultaneously while maintaining the ability to model dependencies regardless of distance in the sequence.

## Architecture Details

The Transformer architecture consists of an encoder and a decoder, each composed of stacks of identical layers.

### Core Components

![Transformer Architecture](https://production-media.paperswithcode.com/methods/Screen_Shot_2020-07-08_at_12.17.05_AM_IaJn9v9.png)

#### 1. Multi-Head Attention

The key innovation in Transformers is the attention mechanism, specifically **self-attention**:

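- **Query (Q)**, **Key (K)**, and **Value (V)** matrices are derived from the input embeddings through learned linear projections
- The attention output is calculated as: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where the softmax term gives the attention weights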
- **Multi-head attention** performs this operation in parallel across different representation subspaces
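
To make the formula concrete, here is a minimal single-head sketch of scaled dot-product attention on random toy tensors. The function name `scaled_dot_product_attention` and the sizes `seq_len = 4`, `d_k = 8` are chosen for illustration only; masking and the multi-head split are left out.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_k = 4, 8  # toy sizes, assumed for illustration
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)         # torch.Size([4, 8])
print(weights.sum(dim=-1))  # each entry is 1 up to floating-point rounding
```

Each row of `weights` is a probability distribution over the key positions, so every output vector is a weighted average of the value vectors.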

#### 2. Position-wise Feed-Forward Networks

Each encoder and decoder layer contains a fully connected feed-forward network applied to each position separately:
- Two linear transformations with a ReLU activation in between
- $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$
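
The phrase "applied to each position separately" can be checked directly. The sketch below builds the FFN from `nn.Sequential` with toy sizes (an assumption for illustration; the original paper uses `d_model = 512` and `d_ff = 2048`) and compares the output for one position computed within the full sequence against the same position computed alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # toy sizes, assumed for illustration

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # x W_1 + b_1
    nn.ReLU(),                  # max(0, .)
    nn.Linear(d_ff, d_model),   # (.) W_2 + b_2
)

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)
full = ffn(x)                   # all positions processed together
single = ffn(x[:, 3, :])        # position 3 processed on its own

# Position-wise: the output at a position depends only on that position's input.
same = torch.allclose(full[:, 3, :], single, atol=1e-5)
print(same)
```

Because no information moves between positions here, mixing across the sequence happens only in the attention sublayers.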

#### 3. Positional Encoding

Since the model contains no recurrence or convolution, positional encodings are added to provide information about token positions:
- Sine and cosine functions of different frequencies
- $PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$
- $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$
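
As a small worked example, the two formulas can be evaluated in plain Python. The helper name `positional_encoding` and the toy size `d_model = 4` are assumptions for illustration; the loop is written for readability rather than speed.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for position `pos`."""
    pe = []
    for j in range(d_model):
        i = j // 2  # dimensions 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return pe

d_model = 4  # toy size, assumed for illustration
for pos in range(3):
    print(pos, [round(x, 4) for x in positional_encoding(pos, d_model)])
# Position 0 is always [0.0, 1.0, 0.0, 1.0]: sin(0) = 0 and cos(0) = 1.
```

Low dimension indices oscillate quickly with position, while high indices change slowly, so each position receives a distinct pattern across the dimensions.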

#### 4. Residual Connections and Layer Normalization

Each sublayer has:
- Residual connections (skip connections)
- Layer normalization
- $\text{LayerNorm}(x + \text{Sublayer}(x))$
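
The wrapper $\text{LayerNorm}(x + \text{Sublayer}(x))$ can be sketched in a few lines. Here a single `nn.Linear` is a hypothetical stand-in for the real sublayer (attention or the feed-forward network), and `d_model = 8` is a toy size chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8  # toy size, assumed for illustration

sublayer = nn.Linear(d_model, d_model)  # hypothetical stand-in for attention or FFN
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)
y = norm(x + sublayer(x))       # LayerNorm(x + Sublayer(x))

# LayerNorm normalizes over the last dimension, so with its default
# initialization each position's features have mean close to zero.
max_abs_mean = y.mean(dim=-1).abs().max()
print(y.shape)
print(max_abs_mean)
```

The residual term `x` gives gradients a direct path around the sublayer, and the sublayer must preserve the `d_model` dimension for the addition to be valid.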

## Key Innovations

1. **Parallelization**: Unlike RNNs, Transformers process all tokens simultaneously, enabling much faster training
2. **Attention Mechanism**: Can directly model dependencies between any positions regardless of distance
3. **Multi-Head Attention**: Allows the model to jointly attend to information from different representation subspaces
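4. **Improved Gradient Flow**: Direct attention connections between positions, together with residual connections, shorten gradient paths and mitigate (but do not eliminate) the vanishing-gradient problem of RNNs
5. **Constant Path Length**: Within a self-attention layer, the maximum path length between any two positions is O(1) instead of O(n) with RNNs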

## Applications

Transformers have become dominant in natural language processing and have since spread to other modalities:

1. **Machine Translation**: The original application shown in the paper
2. **Text Generation**: GPT family models use decoder-only Transformers
3. **Language Understanding**: BERT uses encoder-only Transformers 
4. **Summarization**: Models like BART and T5 use the full encoder-decoder architecture
5. **Speech Recognition**: Models like Whisper use Transformers for audio processing
6. **Computer Vision**: Vision Transformers (ViT) adapt the architecture for image tasks

The versatility of Transformers has led to their adoption across multiple domains beyond NLP.

## Implementation Example

### Basic Transformer Implementation in PyTorch

In [1]:
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)
        
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        
        # Apply linear projections
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Calculate attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Apply softmax and multiply with values
        attention = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention, v)
        
        # Reshape and apply final linear layer
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.wo(output)
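
The `mask` argument of `forward` above is easiest to understand with a causal mask, as used in decoder self-attention. The sketch below is self-contained and uses dummy all-zero scores (an assumption made so the result is easy to read) with the same `mask == 0` convention.

```python
import torch

seq_len = 4
# Lower-triangular mask: position i may attend to positions 0..i only.
# Shape (1, 1, seq_len, seq_len) broadcasts over (batch, heads, seq_len, seq_len).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)

scores = torch.zeros(1, 2, seq_len, seq_len)         # dummy scores: batch=1, heads=2
masked = scores.masked_fill(causal_mask == 0, -1e9)  # blocked positions get a large negative score
weights = torch.softmax(masked, dim=-1)

print(weights[0, 0])
# Row i is uniform over positions 0..i and zero for all later positions.
```

A mask of this shape can be passed as `mask` to the module above, since it broadcasts against the `(batch, heads, seq_len, seq_len)` score tensor.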

In [2]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [3]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Self-attention with residual connection and layer norm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x
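
A full encoder is a stack of identical layers like the one above. As a rough sketch of that stacking, the example below uses PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` rather than the hand-written class, with toy sizes chosen for illustration (the original paper stacks 6 layers).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, d_ff, num_layers = 16, 4, 64, 6  # toy sizes, assumed for illustration

layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=num_heads,
    dim_feedforward=d_ff,
    dropout=0.1,
    batch_first=True,  # inputs are (batch, seq_len, d_model)
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
encoder.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(2, 10, d_model)
with torch.no_grad():
    out = encoder(x)

print(out.shape)  # torch.Size([2, 10, 16]), the same shape as the input
```

Because every layer maps `(batch, seq_len, d_model)` to the same shape, layers can be stacked to any depth without further adapters.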

In [4]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register as buffer (not a parameter, but part of the module)
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        # Add positional encoding to input embeddings
        return x + self.pe[:, :x.size(1), :]

### Using Pre-trained Transformers with Hugging Face

In [None]:
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModel

# Load pre-trained BERT model (transformer encoder)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example usage
text = "Transformers are powerful neural network architectures."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for feature extraction
    outputs = model(**inputs)

# Get the embeddings from the last hidden state
# Shape: (batch_size, sequence_length, hidden_size); hidden_size is 768 for bert-base-uncased
last_hidden_states = outputs.last_hidden_state
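
The last hidden state holds one vector per token. One common way to reduce it to a single vector per sentence is mean pooling over the non-padding tokens. The helper `mean_pool` below is a hypothetical illustration, not part of the `transformers` API, and it runs on small synthetic tensors so that no model download is required.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real (non-padding) tokens.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Synthetic stand-ins for model outputs: one sentence, three tokens, hidden size 2.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attn_mask = torch.tensor([[1, 1, 0]])  # the last token is padding and is ignored

pooled = mean_pool(hidden, attn_mask)
print(pooled)  # tensor([[2., 3.]])
```

With the cell above, the equivalent call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.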

## Variants and Extensions

The original Transformer architecture has evolved into several variants:

1. **BERT (Bidirectional Encoder Representations from Transformers)**
   - Uses only the encoder portion of the Transformer
   - Pre-trained using masked language modeling and next sentence prediction
   
2. **GPT (Generative Pre-trained Transformer)**
   - Uses only the decoder portion, without the encoder-decoder cross-attention sublayer
   - Auto-regressive language model trained to predict next tokens
   
3. **T5 (Text-to-Text Transfer Transformer)**
   - Encoder-decoder architecture
   - Frames all NLP tasks as text-to-text problems
   
4. **Vision Transformer (ViT)**
   - Adapts Transformer architecture for computer vision
   - Splits images into patches and processes them as token sequences
   
5. **Efficient Transformers**
   - Reformer, Linformer, Performer, etc.
   - Address the quadratic complexity of self-attention through various approximations
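
To illustrate the patch-splitting step mentioned for the Vision Transformer, here is a minimal sketch that turns a batch of images into a sequence of flattened patches. The helper name `image_to_patches` is hypothetical, and the 224x224 input with 16x16 patches is one common configuration, assumed here for illustration.

```python
import torch

def image_to_patches(images, patch_size):
    """Split (batch, channels, H, W) images into flattened non-overlapping patches.

    Returns a tensor of shape (batch, num_patches, channels * patch_size**2).
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)  # (b, h/p, w/p, c, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

images = torch.randn(2, 3, 224, 224)
patches = image_to_patches(images, patch_size=16)
print(patches.shape)  # torch.Size([2, 196, 768])
```

In ViT each flattened patch is then linearly projected to the model dimension and combined with a position embedding, after which the sequence is processed like a sequence of token embeddings.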

## Advantages and Limitations

### Advantages

- **Parallelization**: Much faster training than sequential models
- **Long-range dependencies**: Can model relationships between distant tokens
- **Versatility**: Applicable to various tasks and domains
- **Scalability**: Performance continues to improve with larger models and more data

### Limitations

- **Quadratic complexity**: Self-attention requires O(n²) time and memory in the sequence length n
- **Fixed context window**: Standard Transformers have a limited input length
- **Positional encoding**: Token order is not built into the architecture and must be injected explicitly, unlike in RNNs, where it follows from sequential processing
- **Resource intensive**: Large Transformer models require significant computational resources
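
A quick back-of-the-envelope calculation shows what quadratic growth means in practice. The sketch below counts only the memory for the attention weight matrices of a single layer, assuming float32 storage and 8 heads; the helper name and these settings are illustrative assumptions, and real implementations may use lower precision or avoid materializing the full matrix.

```python
def attention_matrix_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_element=4):
    """Bytes needed to store the (seq_len x seq_len) attention weights."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for n in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(n, num_heads=8) / 2**20
    print(f"seq_len={n:5d}  attention weights ~ {mib:6.1f} MiB")
# Doubling the sequence length quadruples the memory: 8, 32, 128, 512 MiB.
```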

## References and Further Reading

- Vaswani, A., et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). NIPS.
- Devlin, J., et al. (2018). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). arXiv.
- Radford, A., et al. (2018). [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf). OpenAI.
- Brown, T., et al. (2020). [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165). NeurIPS.
- Dosovitskiy, A., et al. (2020). [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929). ICLR.
- Tay, Y., et al. (2022). [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732). ACM Computing Surveys.