
# Introduction to the Transformer

The Transformer model, introduced in the seminal paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. (2017), revolutionized sequence processing by completely replacing recurrent neural networks with attention mechanisms. This architecture forms the foundation of many modern NLP models including BERT, GPT, T5, and others.

### Key Innovations of Transformers

| Feature | Benefit |
|---------|---------|
| Self-Attention | Captures relationships between all positions in a sequence |
| Parallelization | Enables efficient training on large datasets |
| Long-range Dependencies | Effectively models relationships between distant elements |
| Position-aware | Maintains sequence order information without recurrence |
| Scalability | Scales effectively to very large models (billions of parameters) |

The Transformer has become the dominant architecture for natural language processing tasks, including:
- Machine translation
- Text summarization
- Question answering
- Text generation
- Document classification

## Transformer Architecture

![Transformer Architecture](https://miro.medium.com/max/700/1*BHzGVskWGS_3jEcYYi6miQ.png)

The Transformer architecture consists of an encoder (left) and a decoder (right), built from the following components:

- **Encoder**: Processes the input sequence through multiple identical layers of self-attention and feed-forward networks
- **Decoder**: Generates the output sequence, using both self-attention and encoder-decoder attention mechanisms
- **Multi-Head Attention**: Allows the model to focus on different parts of the input sequence simultaneously
- **Positional Encoding**: Adds information about the position of tokens in the sequence
- **Feed-Forward Networks**: Process the attention output through fully connected layers
- **Residual Connections & Layer Normalization**: Help with gradient flow and training stability

Because no layer depends on a hidden state carried over from the previous position, all positions of a sequence can be processed in parallel during training.

```mermaid
flowchart LR
    Input("Input Embeddings") --> AddPos("+ Positional Encoding")
    AddPos --> EncoderStack("Encoder Stack")
    Output("Output Embeddings") --> AddPosOut("+ Positional Encoding")
    AddPosOut --> DecoderStack("Decoder Stack")
    EncoderStack --> DecoderStack
    DecoderStack --> Linear("Linear Layer")
    Linear --> Softmax("Softmax")
    Softmax --> FinalOutput("Output Probabilities")
    
    subgraph "Encoder Block × N"
        EncIn("Input") --> MultiHead1("Multi-Head Self-Attention")
        MultiHead1 --> AddNorm1("Add & Norm")
        EncIn -.-> AddNorm1
        AddNorm1 --> FFN1("Feed Forward")
        FFN1 --> AddNorm2("Add & Norm")
        AddNorm1 -.-> AddNorm2
    end
    
    subgraph "Decoder Block × N"
        DecIn("Input") --> MaskedMultiHead("Masked Multi-Head Self-Attention")
        MaskedMultiHead --> AddNorm3("Add & Norm")
        DecIn -.-> AddNorm3
        AddNorm3 --> MultiHead2("Multi-Head Cross-Attention")
        MultiHead2 --> AddNorm4("Add & Norm")
        AddNorm3 -.-> AddNorm4
        AddNorm4 --> FFN2("Feed Forward")
        FFN2 --> AddNorm5("Add & Norm")
        AddNorm4 -.-> AddNorm5
    end
```

## Attention Mechanism

The core innovation of the Transformer is its attention mechanism, specifically the "Scaled Dot-Product Attention".

### Scaled Dot-Product Attention


The attention function maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed using a compatibility function of the query with the corresponding key.

Mathematically, this is expressed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ is the matrix of queries
- $K$ is the matrix of keys
- $V$ is the matrix of values
- $d_k$ is the dimension of the keys

The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large in magnitude, which would push the softmax function into regions with extremely small gradients.
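The formula above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration rather than a reference implementation: the function names, the random inputs, and the shapes are assumptions chosen for the example. The last two lines check the motivation for the scaling factor empirically: for inputs with unit variance, the unscaled dot products have variance close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility of each query with each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))   # 3 queries (illustrative sizes)
K = rng.standard_normal((5, 64))   # 5 keys
V = rng.standard_normal((5, 16))   # 5 values
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)    # (3, 16) (3, 5)

# Empirical check of the scaling argument with d_k = 64.
raw = (rng.standard_normal((10000, 64)) * rng.standard_normal((10000, 64))).sum(axis=1)
print(raw.var(), (raw / np.sqrt(64)).var())   # roughly 64 and roughly 1
```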

```mermaid
graph TB
    Q["Q: Queries"] --> MatMul1["MatMul"]
    K["K: Keys"] --> Transpose["Transpose"]
    Transpose --> MatMul1
    MatMul1 --> Scale["Scale by 1/√dk"]
    Scale --> Mask["Optional Mask<br/>(decoder only)"]
    Mask --> Softmax["Softmax"]
    Softmax --> MatMul2["MatMul"]
    V["V: Values"] --> MatMul2
    MatMul2 --> Output["Attention Output"]
```

### Multi-Head Attention

Rather than performing a single attention function, the Transformer uses multi-head attention:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h)W^O$$

Where each head is calculated as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, providing more diverse features for subsequent layers.
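As a rough sketch of the two equations above, the following NumPy code computes each head with its own projection matrices, concatenates the heads, and applies $W^O$. The sizes $d_{model} = 512$ and $h = 8$ (so $d_k = d_v = 64$) follow the original paper; the random weights, the `0.02` scale, and the function names are assumptions for illustration only. A practical implementation would batch the heads into a single tensor operation instead of looping.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v); W_O: (h * d_v, d_model)."""
    heads = [
        attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])   # head_i
        for i in range(W_Q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_O          # Concat(head_1..head_h) W^O

d_model, h = 512, 8
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((h, d_model, d_k)) * 0.02
W_K = rng.standard_normal((h, d_model, d_k)) * 0.02
W_V = rng.standard_normal((h, d_model, d_v)) * 0.02
W_O = rng.standard_normal((h * d_v, d_model)) * 0.02

X = rng.standard_normal((10, d_model))                   # 10 positions
Y = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)    # self-attention: Q = K = V = X
print(Y.shape)                                           # (10, 512)
```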

```mermaid
graph TD
    Input["Input"] --> SplitHeads["Linear Projections"]
    SplitHeads -->|"WQ1"| Q1["Q1"]
    SplitHeads -->|"WK1"| K1["K1"]
    SplitHeads -->|"WV1"| V1["V1"]
    Q1 --> Attn1["Attention Head 1"]
    K1 --> Attn1
    V1 --> Attn1
    
    SplitHeads -->|"WQ2"| Q2["Q2"]
    SplitHeads -->|"WK2"| K2["K2"]
    SplitHeads -->|"WV2"| V2["V2"]
    Q2 --> Attn2["Attention Head 2"]
    K2 --> Attn2
    V2 --> Attn2
    
    SplitHeads -->|"..."| Qn["..."]
    SplitHeads -->|"..."| Kn["..."]
    SplitHeads -->|"..."| Vn["..."]
    
    SplitHeads -->|"WQh"| Qh["Qh"]
    SplitHeads -->|"WKh"| Kh["Kh"]
    SplitHeads -->|"WVh"| Vh["Vh"]
    Qh --> Attnh["Attention Head h"]
    Kh --> Attnh
    Vh --> Attnh
    
    Attn1 --> Concat["Concatenate"]
    Attn2 --> Concat
    Attnh --> Concat
    Concat --> Linear["Linear Projection WO"]
    Linear --> Output["Multi-Head Output"]
```

## Positional Encoding

Since the Transformer contains no recurrence or convolution, it needs some way to incorporate the order of the sequence. This is achieved through positional encodings which are added to the input embeddings.

The positional encodings have the same dimension as the embeddings, allowing them to be summed. The formula used is:

$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$$
$$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$

Where $pos$ is the position and $i$ indexes the dimension pair $(2i, 2i+1)$. Each pair corresponds to a sinusoid with a different frequency.

The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This allows the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
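The two formulas can be turned into a small NumPy function. This is a sketch under the assumption that $d_{model}$ is even; the function name and the sizes are illustrative. The final check demonstrates the relative-position property for the first sine/cosine pair (whose frequency is 1): the encoding at position $p + k$ is obtained from the encoding at position $p$ by a rotation that depends only on the offset $k$.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array; d_model is assumed to be even."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # the values 2i = 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)     # (50, 16)
print(pe[0, :4])    # position 0 alternates sin(0) = 0 and cos(0) = 1

# For a fixed offset k, PE[p + k] is a linear function (a rotation) of PE[p].
p, k = 7, 3
rotation = np.array([[np.cos(k), np.sin(k)],
                     [-np.sin(k), np.cos(k)]])
print(np.allclose(rotation @ pe[p, :2], pe[p + k, :2]))   # True
```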

```mermaid
graph LR
    Words["Word Embeddings"] --> Add["+ Addition"]
    PositionalEnc["Positional Encodings"] --> Add
    Add --> Output["Input to Transformer"]
    
    subgraph "Positional Encoding Generation"
        Pos["Position index p"] --> SinCalc["sin(p/10000^(2i/d))"]
        Pos --> CosCalc["cos(p/10000^(2i/d))"]
        SinCalc --> EvenDim["Even dimensions"]
        CosCalc --> OddDim["Odd dimensions"]
        EvenDim --> PosEncVec["Position Encoding Vector"]
        OddDim --> PosEncVec
    end
```

When the encodings are plotted with position on the x-axis and dimension on the y-axis, they form bands of sinusoids whose frequency decreases as the dimension index grows. This pattern enables the model to determine the relative position of words in a sequence.

## Encoder Structure

The encoder consists of a stack of $N$ identical layers ($N = 6$ in the original paper). Each layer has two sub-layers:

1. **Multi-Head Self-Attention mechanism**
2. **Position-wise Fully Connected Feed-Forward Network**

Around each sub-layer is a residual connection, followed by layer normalization. Mathematically:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

Where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself.
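A minimal NumPy sketch of this "Add & Norm" step is shown below. The learnable scale `gamma` and shift `beta`, the `eps` constant, and the linear map used as a stand-in sub-layer are assumptions for illustration; in the actual model the sub-layer is the self-attention or feed-forward block.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position (row) over its feature dimension, then rescale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization.
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))                  # 4 positions
W = rng.standard_normal((d_model, d_model)) * 0.1      # hypothetical stand-in sub-layer

y = add_and_norm(x, lambda t: t @ W, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(y.shape)                                         # (4, 8)
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))    # each position has mean ~0
```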

```mermaid
graph TD
    InputEmb["Input"] --> SelfAttn["Multi-Head<br/>Self-Attention"]
    InputEmb -->|"Residual Connection"| Add1["Add"]
    SelfAttn --> Add1
    Add1 --> Norm1["Layer Norm"]
    
    Norm1 --> FFN["Position-wise<br/>Feed-Forward Network"]
    Norm1 -->|"Residual Connection"| Add2["Add"]
    FFN --> Add2
    Add2 --> Norm2["Layer Norm"]
    Norm2 --> Output["Output"]
    
    subgraph "Encoder Repeated N Times"
        Enc1["Encoder Layer 1"] --> Enc2["Encoder Layer 2"]
        Enc2 --> EllipseEnc["..."]
        EllipseEnc --> EncN["Encoder Layer N"]
    end
```

### Feed-Forward Network

The Feed-Forward Network (FFN) consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The same parameters are applied to every position separately and identically, which is why the network is called position-wise; the parameters differ only from layer to layer. The dimensionality typically follows the pattern:
- Input dimension: $d_{model}$ (e.g., 512)
- Inner-layer dimension: $d_{ff}$ (e.g., 2048)
- Output dimension: $d_{model}$ (e.g., 512)
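The FFN formula can be sketched in NumPy as follows, using the example dimensions listed above. The random weights and their `0.02` scale are assumptions for illustration. The last line checks that running one position on its own gives the same result as running it as part of the full sequence, since the same weights are applied to every row.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (n, d_model). ReLU(x W1 + b1) W2 + b2, applied to every row (position).
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((10, d_model))     # 10 positions
y = position_wise_ffn(x, W1, b1, W2, b2)
print(y.shape)                             # (10, 512)

# Position 3 processed alone matches position 3 processed within the sequence.
print(np.allclose(y[3], position_wise_ffn(x[3:4], W1, b1, W2, b2)[0]))   # True
```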

## Decoder Structure

The decoder also consists of a stack of $N$ identical layers, but each has three sub-layers:

1. **Masked Multi-Head Self-Attention**
2. **Multi-Head Encoder-Decoder Attention**
3. **Position-wise Feed-Forward Network**

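The masked attention in the first sub-layer prevents each position from attending to later positions. Combined with the fact that the decoder inputs are the target sequence shifted right by one position, this ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$. The mask is applied by adding a matrix $M$ to the attention scores before the softmax: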

$$\text{Mask}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V$$

Where $M$ is a matrix with:
$$M_{ij} = \begin{cases} 
0 & \text{if } i \geq j \\ 
-\infty & \text{if } i < j
\end{cases}$$
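The mask $M$ and the masked attention formula can be sketched in NumPy as below. This is a single-head illustration with random inputs; the function names are assumptions. After the softmax, every entry above the diagonal is exactly zero, so position $i$ places no weight on positions $j > i$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)                      # exp(-inf) evaluates to 0
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    # M[i, j] = 0 where i >= j (allowed), -inf where i < j (future position).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(i >= j, 0.0, -np.inf)

def masked_attention(Q, K, V):
    n, d_k = Q.shape
    scores = (Q @ K.T + causal_mask(n)) / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))        # 4 positions (illustrative sizes)
out, weights = masked_attention(X, X, X)
print(causal_mask(4))
print(weights.round(2))                # lower-triangular: zeros above the diagonal
```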

```mermaid
graph TD
    InputEmb["Input"] --> MaskedAttn["Masked Multi-Head<br/>Self-Attention"]
    InputEmb -->|"Residual Connection"| Add1["Add"]
    MaskedAttn --> Add1
    Add1 --> Norm1["Layer Norm"]
    
    Norm1 --> CrossAttn["Multi-Head<br/>Encoder-Decoder Attention"]
    EncoderOut["Encoder Output"] --> CrossAttn
    Norm1 -->|"Residual Connection"| Add2["Add"]
    CrossAttn --> Add2
    Add2 --> Norm2["Layer Norm"]
    
    Norm2 --> FFN["Position-wise<br/>Feed-Forward Network"]
    Norm2 -->|"Residual Connection"| Add3["Add"]
    FFN --> Add3
    Add3 --> Norm3["Layer Norm"]
    Norm3 --> Output["Output"]
    
    subgraph "Masked Self-Attention"
        Tokens["Output Tokens<br/>(so far)"] --> Mask["Apply Future Mask"]
        Mask --> SelfAttention["Self-Attention<br/>Mechanism"]
    end
```

The second attention layer performs multi-head attention where:
- Queries come from the previous decoder layer
- Keys and values come from the encoder output

This allows every position in the decoder to attend to all positions in the input sequence, implementing the encoder-decoder attention mechanism.
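The following single-head NumPy sketch shows where the queries, keys, and values come from in encoder-decoder attention. The sequence lengths, dimensions, and random projection matrices are assumptions for illustration. Note that the attention weights have one row per decoder position and one column per encoder position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V):
    Q = decoder_states @ W_Q     # queries come from the decoder
    K = encoder_output @ W_K     # keys come from the encoder output
    V = encoder_output @ W_V     # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 32, 16
encoder_output = rng.standard_normal((7, d_model))    # 7 source positions
decoder_states = rng.standard_normal((4, d_model))    # 4 target positions
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V)
print(out.shape, weights.shape)                       # (4, 16) (4, 7)
```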

## Comparison with RNNs and CNNs

The Transformer architecture offers several advantages over traditional sequence models:

| Architecture | Parallelization | Long-range Dependencies | Complexity per Layer (n = sequence length, d = representation dimension) | Memory Usage |
|--------------|-----------------|--------------------------|--------------------------|--------------|
| RNN | Sequential across time steps | Signal must pass through O(n) steps; prone to vanishing gradients | O(n·d²) | Low |
| CNN | Highly parallelizable | Limited by kernel size; distant positions need stacked layers | O(k·n·d²), where k is kernel size | Moderate |
| Transformer | Highly parallelizable | Direct connections between any positions | O(n²·d) due to attention | Higher |
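To make the trade-off concrete, the short calculation below compares per-layer operation counts using the asymptotic expressions from Table 1 of the original paper: $n^2 \cdot d$ for self-attention, $n \cdot d^2$ for a recurrent layer, and $k \cdot n \cdot d^2$ for a convolutional layer. Constant factors are ignored and the chosen sizes are illustrative, so the numbers indicate trends only. Self-attention is cheaper than recurrence while $n < d$ and more expensive once $n > d$.

```python
def self_attention_cost(n, d):
    return n * n * d          # O(n^2 * d)

def recurrent_cost(n, d):
    return n * d * d          # O(n * d^2)

def convolutional_cost(n, d, k):
    return k * n * d * d      # O(k * n * d^2)

d = 512
ratios = {}
for n in (128, 512, 4096):
    ratios[n] = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(n, ratios[n])       # the ratio equals n / d: 0.25, 1.0, 8.0
```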

### Visualization of Receptive Fields

```mermaid
graph LR
    subgraph "RNN Information Flow"
        R1["t₁"] --> R2["t₂"]
        R2 --> R3["t₃"]
        R3 --> R4["t₄"]
    end
    
    subgraph "CNN Information Flow"
        C1["t₁"] --> CF1["Feature 1"]
        C2["t₂"] --> CF1
        C3["t₃"] --> CF1
        
        C2 --> CF2["Feature 2"]
        C3 --> CF2
        C4["t₄"] --> CF2
    end
    
    subgraph "Transformer Information Flow"
        T1["t₁"] <--> T2["t₂"]
        T1 <--> T3["t₃"]
        T1 <--> T4["t₄"]
        T2 <--> T3
        T2 <--> T4
        T3 <--> T4
    end
```

- **RNN**: Information flows sequentially, creating a dependency chain
- **CNN**: Information from nearby words (defined by kernel size) is processed together
- **Transformer**: Every word directly connects to every other word, regardless of distance

The Transformer's ability to capture long-range dependencies in a single step, without the sequential bottleneck of RNNs, is a key factor in its superior performance on many NLP tasks.

## Limitations and Variants

Despite its strengths, the Transformer has some limitations:

1. **Quadratic Complexity**: The self-attention mechanism has O(n²) time and memory complexity with respect to sequence length, which limits its use on very long sequences

2. **Fixed Context Window**: Most implementations have a maximum sequence length and cannot process inputs longer than that

3. **Lack of Built-in Inductive Bias**: Unlike CNNs (locality) and RNNs (sequentiality), Transformers have minimal inductive bias about the structure of language or sequences
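The quadratic growth in the first limitation can be illustrated with a back-of-the-envelope calculation. The sketch assumes the full $n \times n$ attention weight matrix is stored in 32-bit floats for a single head in a single layer; the helper name and the sequence lengths are illustrative.

```python
def attention_matrix_bytes(n, bytes_per_value=4):
    # One n x n matrix of attention weights (one head, one layer, float32 assumed).
    return n * n * bytes_per_value

sizes_mib = {}
for n in (512, 4096, 32768):
    sizes_mib[n] = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {sizes_mib[n]:,.0f} MiB")   # 1, 64, and 4,096 MiB
```

Increasing the sequence length by a factor of 8 increases this memory by a factor of 64.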

### Notable Variants

```mermaid
graph TD
    Original["Original Transformer<br/>O(n²) complexity"] --> TXL["Transformer-XL<br/>Segment recurrence"]
    Original --> Reformer["Reformer<br/>LSH attention<br/>O(n log n)"]
    Original --> Longformer["Longformer<br/>Sparse attention<br/>O(n)"]
    Original --> Linformer["Linformer<br/>Projected attention<br/>O(n)"]
    Original --> Performer["Performer<br/>FAVOR+ kernel<br/>O(n)"]
    Reformer --> Routing["Routing Transformer<br/>Clustered attention"]
    Longformer --> BigBird["BigBird<br/>Global + local + random"]
    TXL --> Compressive["Compressive Transformer<br/>Memory compression"]
    Original --> Sparse["Sparse Transformer<br/>Sparse factorizations"]
```

Several variants have been proposed to address these limitations:

| Variant | Key Innovation | Complexity |
|---------|----------------|------------|
| [Transformer-XL](https://arxiv.org/abs/1901.02860) | Segment-level recurrence for longer contexts | O(n²) with cached states |
| [Reformer](https://arxiv.org/abs/2001.04451) | Locality-sensitive hashing for efficient attention | O(n log n) |
| [Longformer](https://arxiv.org/abs/2004.05150) | Sliding window attention with global tokens | O(n) |
| [Linformer](https://arxiv.org/abs/2006.04768) | Projected attention for linear complexity | O(n) |
| [Performer](https://arxiv.org/abs/2009.14794) | FAVOR+ approximation for efficient attention | O(n) |

These variants maintain the core principles of the Transformer while addressing specific limitations, further expanding the applicability of attention-based architectures.
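As a rough illustration of the sparse-attention idea behind Longformer-style models, the sketch below builds a boolean mask that combines a sliding window with a few global positions. It is not the Longformer implementation: the function name, the window convention (`window // 2` positions on each side), and the choice of position 0 as the global token are assumptions. An efficient implementation would never materialize the dense $n \times n$ mask; it is built here only to count how few query-key pairs remain.

```python
import numpy as np

def sliding_window_mask(n, window, global_positions=()):
    """Boolean (n, n) mask: True where attention is allowed."""
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    for g in global_positions:
        allowed[g, :] = True    # the global token attends to every position
        allowed[:, g] = True    # every position attends to the global token
    return allowed

mask = sliding_window_mask(n=4096, window=512, global_positions=(0,))
print(mask.sum() / mask.size)   # fraction of query-key pairs kept, well below 1
```

For a fixed window size, the number of allowed pairs grows linearly with $n$ instead of quadratically.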

## Recent Developments in Transformer Models

Since the original Transformer paper in 2017, numerous advances have expanded the capabilities and efficiency of these models:

### Scaling Laws and Large Language Models

Research has revealed predictable scaling laws for transformer performance:

- **Power Law Scaling**: Model performance improves as a power-law function of model size, dataset size, and compute budget
- **Emergent Abilities**: Capabilities like few-shot learning and instruction following have been reported to appear only once models pass certain scale thresholds
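One way to read "power-law scaling" is as a loss of the form $L(N) = (N_c / N)^{\alpha}$ for model size $N$. The sketch below evaluates this form with hypothetical constants: the values of `n_c` and `alpha` are chosen purely for illustration and are not fitted to any real model. It shows the characteristic behavior of a power law, where each tenfold increase in size multiplies the loss by the same factor.

```python
def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    # n_c and alpha are hypothetical constants, chosen for illustration only.
    return (n_c / n_params) ** alpha

sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law_loss(n) for n in sizes]
for n, loss in zip(sizes, losses):
    print(f"{n:.0e} parameters -> loss {loss:.3f}")

# Each 10x increase in size multiplies the loss by the same factor, 10 ** -alpha.
ratios = [b / a for a, b in zip(losses, losses[1:])]
print(ratios)
```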

```mermaid
graph TD
    subgraph "Evolution of Transformer Scale"
        BERT["BERT (2018)<br/>340M params"] --> GPT2["GPT-2 (2019)<br/>1.5B params"]
        GPT2 --> T5["T5 (2020)<br/>11B params"]
        T5 --> GPT3["GPT-3 (2020)<br/>175B params"]
        GPT3 --> PaLM["PaLM (2022)<br/>540B params"]
        PaLM --> GPT4["GPT-4 (2023)<br/>params not disclosed"]
    end
    
    subgraph "Emergent Capabilities"
        SmallModel["Small Models"] -->|"As scale increases"| FewShot["Few-shot Learning"]
        FewShot -->|"As scale increases"| Chain["Chain-of-Thought"]
        Chain -->|"As scale increases"| Tool["Tool Usage"]
        Tool -->|"As scale increases"| Alignment["Alignment"]
    end
```

This has led to increasingly large models:

| Model | Release Date | Parameters | Key Innovations |
|-------|--------------|------------|-----------------|
| GPT-3 | 2020 | 175B | Few-shot learning capabilities |
| PaLM | 2022 | 540B | Pathways training architecture |
| GPT-4 | 2023 | Not disclosed | Multimodal capabilities |
| LLaMA 2 | 2023 | 7B-70B | Open weights with commercial use license |

### Multimodal Transformers

Recent transformers have expanded beyond text to process multiple modalities:

- **Vision Transformers (ViT)**: Apply transformer architecture directly to image patches
- **CLIP**: Learns joint text-image representations through contrastive learning
- **Flamingo**: Connects language models with visual inputs for few-shot learning
- **GPT-4V**: Processes both images and text for multimodal reasoning

### Efficient Transformers

Research continues on making transformers more efficient:

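- **Parameter-Efficient Fine-Tuning**: Methods like LoRA, Adapters, and Prefix Tuning adapt a model by training only a small number of additional parameters while the pre-trained weights stay frozen
- **Quantization**: Reducing precision from 32/16-bit to 8/4/2-bit to cut memory use, at a cost in accuracy that tends to grow as the bit width shrinks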
- **Pruning & Distillation**: Creating smaller models that retain much of the capability of larger ones
- **Mixture of Experts (MoE)**: Using conditional computation to activate only relevant parts of a much larger model
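To give a feel for quantization, here is a minimal sketch of one simple scheme: symmetric per-tensor "absmax" quantization to 8-bit integers. It is an illustration under stated assumptions, not the method of any particular library: the function names are hypothetical, the weight matrix is random, and the tensor is assumed to contain at least one non-zero value.

```python
import numpy as np

def quantize_int8(w):
    # Map the range [-max|w|, max|w|] onto the integers [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)        # 0.25: int8 needs a quarter of the float32 storage
print(np.abs(w - w_hat).max())    # rounding error, at most about scale / 2
```

Practical schemes usually quantize per channel or per block and handle outliers separately, which keeps the rounding error smaller than this per-tensor version.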

These developments continue to expand the practical applications of transformer models while addressing computational constraints.

## Practical Applications and Implementation

### Real-World Applications

Transformers have revolutionized multiple domains:

| Domain | Applications | Example Models |
|--------|--------------|----------------|
| Language | Translation, summarization, Q&A | T5, BART, mT5 |
| Conversational AI | Chatbots, virtual assistants | LaMDA, Claude, ChatGPT |
| Content Creation | Code generation, writing assistance | Copilot, Anthropic Claude |
| Healthcare | Medical record analysis, diagnosis assistance | Med-PaLM, BioGPT |
| Scientific Discovery | Protein structure prediction | AlphaFold |

### Implementation Strategies

When implementing transformer models in production:

1. **Model Selection**:
   - For small datasets: Fine-tune existing pre-trained models
   - For specialized domains: Consider domain-specific pre-training
   - For resource constraints: Use smaller efficient variants

2. **Technical Considerations**:
   ```python
   # Load a pre-trained model with Hugging Face Transformers
   # (requires the `transformers`, `torch`, and `sentencepiece` packages)
   from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
   
   # Load model and tokenizer
   model_name = "t5-base"
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
   
   # T5 selects the task from a text prefix in the input
   input_text = "translate English to French: Hello, how are you?"
   inputs = tokenizer(input_text, return_tensors="pt")
   
   # Pass input_ids and attention_mask together, and cap the generated length
   outputs = model.generate(**inputs, max_new_tokens=40)
   decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
   print(decoded)  # A French translation; the exact wording depends on the model
   ```

3. **Deployment Challenges**:
   - **Latency**: Use techniques like knowledge distillation, quantization, or caching
   - **Cost**: Consider batch processing, model pruning, or API services
   - **Continuous Improvement**: Implement feedback loops for model improvement

4. **Ethical Considerations**:
   - Bias detection and mitigation in training data
   - Output filtering for harmful content
   - Privacy protection for user data
   - Transparent disclosure of AI-generated content

Transformer models continue to represent a significant investment in computational resources, but their versatility and performance across domains make them increasingly valuable for a wide range of applications.

## Conclusion

The Transformer architecture has fundamentally changed the landscape of sequence modeling and natural language processing. Its key innovations include:

- **Parallelization**: Enabling efficient training on massive datasets
- **Attention Mechanism**: Providing direct modeling of relationships between all elements in a sequence
- **Scalable Architecture**: Supporting models from millions to billions of parameters

These characteristics have made Transformers the foundation for most state-of-the-art NLP models since 2018, including BERT, GPT, T5, and others. The architecture continues to evolve, with ongoing research addressing its limitations and extending its capabilities to new domains beyond natural language processing, such as computer vision, speech recognition, and reinforcement learning.

The Transformer represents one of the most significant architectural innovations in deep learning, demonstrating that attention mechanisms alone can provide powerful sequence modeling capabilities without the need for recurrence or convolution.

## References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). *Neural Information Processing Systems (NeurIPS)*.

2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). *arXiv preprint*.

3. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165). *Neural Information Processing Systems (NeurIPS)*.

4. Alammar, J. (2018). [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/). *Blog Post*.

5. Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732). *arXiv preprint*.

6. Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). [A Structured Self-attentive Sentence Embedding](https://arxiv.org/abs/1703.03130). *International Conference on Learning Representations (ICLR)*.