# 🧪 Lesson 05: Full Graph Transformer

**Series**: Chemical Graph Machine Learning  
**Prerequisites**: Lessons 01-04 (graphs, PE, attention mechanisms)  
**Next Lesson**: [06 - Advanced Graph Models](./06_Advanced_Graph_Models.ipynb)  
**Estimated Time**: 90-105 minutes

---

## 📚 Learning Objectives

By the end of this lesson, you will:
1. ✅ Build a complete Graph Transformer from components
2. ✅ Integrate edge features into attention mechanisms
3. ✅ Implement multi-head self-attention for molecules
4. ✅ Use positional embeddings from Lesson 02
5. ✅ Add layer normalisation and residual connections
6. ✅ Stack transformer layers for deep architectures
7. ✅ Understand when to use transformers vs GATs

**Why this matters**: Graph Transformers are competitive with, and often lead, molecular property benchmarks, whilst their attention weights can be visualised to help interpret predictions.

---

## 🔄 Quick Recap: The Journey So Far

**Lesson 01**: Molecules as graphs  
**Lesson 02**: Positional encodings for global structure  
**Lesson 03**: Attention on local neighbourhoods (GAT)  
**Lesson 04**: Sparse patterns for efficiency  

**Today**: We combine everything into the full transformer architecture, adapted specifically for molecular graphs.

**The transformer's superpower**: Global attention lets every atom consider every other atom, whilst edge features and PE preserve chemical structure.

---

## 📖 Main Content Structure

### Part 1: Transformer Architecture Review
- Self-attention refresher (from Lesson 03)
- Queries, keys, values: what they represent for molecules
- Scaled dot-product attention
- Why transformers work well for molecules

**Conceptual**: Atoms as tokens, bonds as relationships
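
As a refresher, scaled dot-product attention over atoms could be sketched as below. This is a minimal NumPy illustration, not the lesson's reference implementation: the toy sizes, the random atom features, and the names `W_q`, `W_k`, `W_v` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n_atoms, d_k); V: (n_atoms, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_atoms, n_atoms) atom-atom compatibilities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_atoms, d_model = 5, 8                  # hypothetical toy molecule
X = rng.normal(size=(n_atoms, d_model))  # random stand-ins for atom features

# In a trained model these projections are learned; here they are random.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, attn.shape)
```

Row `i` of `attn` says how strongly atom `i` (the query) draws on every other atom (the keys) when building its updated representation from the values.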

### Part 2: Incorporating Edge Features
- Standard transformers ignore edges, but bonds carry information!
- Edge-conditioned attention: bond types modulate attention scores
- Implementing edge bias in attention computation
- Chemical example: single vs double vs aromatic bonds

**Code**: Extend attention to include edge features
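
One common way to make attention edge-aware is an additive bias per bond type, in the spirit of the attention bias used by Graphormer. The sketch below assumes that design; the `BOND_TYPES` vocabulary and the bias values are hypothetical placeholders (in a real model the biases would be learned, typically one set per head).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical bond-type vocabulary; index 0 means "no bond between this pair".
BOND_TYPES = {"none": 0, "single": 1, "double": 2, "aromatic": 3}

def edge_biased_attention(Q, K, V, bond_index, bond_bias):
    """bond_index: (n, n) integer bond types; bond_bias: (n_types,) scalars."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores + bond_bias[bond_index]   # look up one scalar per atom pair
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy 3-atom chain: atoms 0-1 joined by a double bond, atoms 1-2 by a single bond.
S, D = BOND_TYPES["single"], BOND_TYPES["double"]
bond_index = np.array([[0, D, 0],
                       [D, 0, S],
                       [0, S, 0]])
bond_bias = np.array([0.0, 0.5, 1.0, 1.5])    # placeholder values, not learned

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                   # random stand-in atom features
out, attn = edge_biased_attention(X, X, X, bond_index, bond_bias)
```

Because unbonded pairs only receive a zero bias rather than a mask, attention stays global: bonds nudge the scores without cutting off distant atoms.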

### Part 3: Positional Embeddings
- Integrating Laplacian eigenvectors from Lesson 02
- Adding positional embeddings to node features
- Sine-cosine encodings vs learned embeddings
- Graph-specific PE: what works for molecules?

**Code**: Embed positional information in transformer input
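
A minimal sketch of Laplacian eigenvector encodings added to node features might look like this. It assumes a connected graph and the symmetric normalised Laplacian; the six-atom ring, the projection `W_pe`, and the random features are illustrative assumptions only.

```python
import numpy as np

def laplacian_positional_encoding(adj, k):
    """Return the k eigenvectors after the trivial one (smallest eigenvalue).

    Uses L = I - D^{-1/2} A D^{-1/2}. Assumes a connected graph, so exactly
    one eigenvalue is zero and is skipped.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)    # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]

# Toy six-membered ring (the heavy-atom skeleton of a benzene-like molecule).
n_atoms, d_model, k = 6, 8, 2
adj = np.zeros((n_atoms, n_atoms))
for i in range(n_atoms):
    j = (i + 1) % n_atoms
    adj[i, j] = adj[j, i] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(n_atoms, d_model))       # random stand-in atom features
W_pe = rng.normal(size=(k, d_model))          # learned in practice, random here

pe = laplacian_positional_encoding(adj, k)    # (n_atoms, k)
H0 = X + pe @ W_pe                            # transformer input with PE added
```

Eigenvectors are only defined up to sign (and up to rotation when eigenvalues repeat, as they do on a symmetric ring), so training code often flips signs at random to stop the model relying on an arbitrary choice.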

### Part 4: Multi-Head Self-Attention
- Multiple attention heads capturing different patterns
- Head 1 → polar interactions, Head 2 → aromaticity, etc.
- Concatenation vs averaging of heads
- Choosing the number of heads

**Code**: Implement multi-head attention layer
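
The concatenation variant of multi-head attention could be sketched as follows. The weight names and toy sizes are assumptions for illustration, and the random weights stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n_atoms, d_model). Returns (output, per-head attention weights)."""
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                     # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_atoms, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n_atoms, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, attn = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, attn.shape)
```

Each head works in a `d_model // n_heads` subspace, so adding heads at fixed `d_model` trades per-head capacity for diversity of attention patterns.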

### Part 5: The Complete Transformer Block
- Attention sublayer
- Feed-forward sublayer (atom-wise MLPs)
- Layer normalisation (why it's critical)
- Residual connections (preventing vanishing gradients)

**Code**: Build a single transformer block
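
The four ingredients above can be assembled into one block roughly as follows. This sketch assumes a pre-norm layout (normalise, then sublayer, then add the residual), single-head attention for brevity, and layer norm without learnable scale and shift; all of these are simplifying choices for the example, and post-norm variants are equally common.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalise each atom's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p):
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["W_o"]

def feed_forward(x, p):
    """Atom-wise MLP: the same two-layer network applied to every atom."""
    hidden = np.maximum(x @ p["W_1"] + p["b_1"], 0.0)   # ReLU
    return hidden @ p["W_2"] + p["b_2"]

def transformer_block(x, p):
    x = x + self_attention(layer_norm(x), p)   # attention sublayer + residual
    x = x + feed_forward(layer_norm(x), p)     # feed-forward sublayer + residual
    return x

def init_params(d_model, d_ff, rng):
    """Random stand-ins for learned parameters (hypothetical helper)."""
    s = 1.0 / np.sqrt(d_model)
    p = {name: rng.normal(scale=s, size=(d_model, d_model))
         for name in ("W_q", "W_k", "W_v", "W_o")}
    p["W_1"] = rng.normal(scale=s, size=(d_model, d_ff))
    p["b_1"] = np.zeros(d_ff)
    p["W_2"] = rng.normal(scale=1.0 / np.sqrt(d_ff), size=(d_ff, d_model))
    p["b_2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 atoms, d_model = 8
params = init_params(d_model=8, d_ff=32, rng=rng)
H = transformer_block(X, params)
```

The residual additions keep an identity path from input to output, which is what lets gradients reach early layers when many blocks are stacked.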

### Part 6: Stacking Layers
- Deep transformers: 4-12 layers typical
- Information flow across layers
- Oversmoothing in deep GNNs vs transformers
- When to stop adding layers

**Code**: Stack multiple transformer blocks
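
Stacking is a loop over blocks. The sketch below uses a compact pre-norm, single-head block with random weights (assumptions for the example) and also tracks a simple, hypothetical oversmoothing diagnostic: the mean pairwise cosine similarity between atom embeddings after each layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def block(x, p):
    """Compact pre-norm transformer block (single head, no biases)."""
    h = layer_norm(x)
    q, k, v = h @ p["W_q"], h @ p["W_k"], h @ p["W_v"]
    x = x + softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ p["W_o"]
    h = layer_norm(x)
    return x + np.maximum(h @ p["W_1"], 0.0) @ p["W_2"]

def init_block(d_model, d_ff, rng):
    s = 1.0 / np.sqrt(d_model)
    p = {n: rng.normal(scale=s, size=(d_model, d_model))
         for n in ("W_q", "W_k", "W_v", "W_o")}
    p["W_1"] = rng.normal(scale=s, size=(d_model, d_ff))
    p["W_2"] = rng.normal(scale=1.0 / np.sqrt(d_ff), size=(d_ff, d_model))
    return p

def mean_pairwise_cosine(x):
    """Average cosine similarity between distinct atoms; near 1 hints at oversmoothing."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T
    n = len(x)
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

def graph_transformer_encoder(x, layers):
    history = []
    for p in layers:                  # each layer has its own parameters
        x = block(x, p)
        history.append(mean_pairwise_cosine(x))
    return x, history

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))           # 6 atoms, d_model = 8
layers = [init_block(8, 32, rng) for _ in range(4)]
H, history = graph_transformer_encoder(X, layers)
print([round(s, 3) for s in history])
```

If the similarity climbs towards 1 as depth increases on real data, the atom embeddings are collapsing together, which is one practical signal to stop adding layers.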

### Part 7: Global Pooling for Predictions
- Node-level embeddings → graph-level prediction
- Pooling strategies: mean, sum, attention-weighted
- The readout function
- Predicting molecular properties

**Code**: Add pooling and prediction head
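
The three pooling strategies and a linear prediction head could be sketched like this. The gate vector `w_gate` and head weights `w_out` are hypothetical names with random values standing in for learned parameters.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def readout(h, mode="mean", w_gate=None):
    """Collapse atom embeddings h of shape (n_atoms, d) into one graph vector of shape (d,)."""
    if mode == "mean":
        return h.mean(axis=0)
    if mode == "sum":
        return h.sum(axis=0)
    if mode == "attention":
        alpha = softmax(h @ w_gate)   # one weight per atom, summing to 1
        return alpha @ h
    raise ValueError(f"unknown readout mode: {mode}")

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))           # stand-in for final-layer atom embeddings
w_gate = rng.normal(size=8)           # learned in practice
w_out, b_out = rng.normal(size=8), 0.0

g = readout(h, "attention", w_gate)   # graph-level embedding
y_pred = float(g @ w_out + b_out)     # scalar property prediction
```

All three readouts are invariant to atom ordering. Sum pooling scales with molecule size, which can suit extensive properties, while mean pooling removes that size dependence.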

### Part 8: Training the Full Model
- Complete architecture assembly
- Loss functions for regression vs classification
- Optimiser choices (Adam, AdamW)
- Learning rate scheduling

**Code**: Training loop for molecular property prediction
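
To show the moving parts of a training loop without an autograd library, the sketch below trains only a linear prediction head on fixed graph embeddings, using a hand-written AdamW update and a warmup-plus-cosine learning rate schedule. The data are synthetic, the hyperparameters are arbitrary, and a real model would backpropagate through the whole transformer in a framework such as PyTorch.

```python
import numpy as np

def mse_loss(pred, target):
    """Regression loss."""
    return float(np.mean((pred - target) ** 2))

def bce_with_logits(logits, target):
    """Binary classification loss, written in a numerically stable form."""
    return float(np.mean(np.maximum(logits, 0) - logits * target
                         + np.log1p(np.exp(-np.abs(logits)))))

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    if step < warmup_steps:                       # linear warmup
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))

# Synthetic stand-in data: 100 "molecules" with 8-dimensional pooled embeddings.
rng = np.random.default_rng(0)
n_mols, d_model = 100, 8
G = rng.normal(size=(n_mols, d_model))
w_true = rng.normal(size=d_model)
y = G @ w_true + 0.1 * rng.normal(size=n_mols)

w, b = np.zeros(d_model), 0.0
m_w, v_w, m_b, v_b = np.zeros(d_model), np.zeros(d_model), 0.0, 0.0
beta1, beta2, eps, weight_decay = 0.9, 0.999, 1e-8, 0.01
total_steps, losses = 200, []

for step in range(total_steps):                   # full-batch for simplicity
    lr = warmup_cosine_lr(step, total_steps, base_lr=0.05, warmup_steps=20)
    pred = G @ w + b
    losses.append(mse_loss(pred, y))
    err = pred - y
    g_w, g_b = 2.0 * G.T @ err / n_mols, 2.0 * err.mean()   # MSE gradients

    t = step + 1                                  # AdamW moment updates
    m_w = beta1 * m_w + (1 - beta1) * g_w
    v_w = beta2 * v_w + (1 - beta2) * g_w ** 2
    m_b = beta1 * m_b + (1 - beta1) * g_b
    v_b = beta2 * v_b + (1 - beta2) * g_b ** 2
    m_hat_w, v_hat_w = m_w / (1 - beta1 ** t), v_w / (1 - beta2 ** t)
    m_hat_b, v_hat_b = m_b / (1 - beta1 ** t), v_b / (1 - beta2 ** t)

    # Decoupled weight decay is applied to the weights only, not the bias.
    w = w - lr * (m_hat_w / (np.sqrt(v_hat_w) + eps) + weight_decay * w)
    b = b - lr * m_hat_b / (np.sqrt(v_hat_b) + eps)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

For a classification target, the same loop would swap `mse_loss` for `bce_with_logits` and use the corresponding gradient.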

### Part 9: Attention Visualisation & Interpretation
- Extracting attention matrices from trained model
- Visualising attention on molecular structures
- Understanding what each layer focuses on
- Chemical validation of learned patterns

**Code**: Comprehensive attention analysis
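
Before plotting anything, it can help to reduce the attention tensors to a few numbers per layer. The sketch below assumes the model exposes attention as an array of shape `(n_layers, n_heads, n_atoms, n_atoms)`; the random attention, the atom labels, and the `summarise_attention` helper are all hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def summarise_attention(attn, atom_labels, top_k=2):
    """attn: (n_layers, n_heads, n_atoms, n_atoms) with rows summing to 1."""
    summary = []
    for layer, layer_attn in enumerate(attn):
        avg = layer_attn.mean(axis=0)           # average over heads
        received = avg.mean(axis=0)             # mean attention each atom receives
        top = np.argsort(received)[::-1][:top_k]
        # Row entropy: low means focused attention, high means diffuse attention.
        entropy = -(avg * np.log(avg + 1e-12)).sum(axis=-1).mean()
        summary.append({
            "layer": layer,
            "top_atoms": [atom_labels[i] for i in top],
            "mean_entropy": float(entropy),
        })
    return summary

# Random stand-in for attention extracted from a trained model.
rng = np.random.default_rng(0)
n_layers, n_heads, n_atoms = 4, 2, 3
attn = softmax(rng.normal(size=(n_layers, n_heads, n_atoms, n_atoms)), axis=-1)
atom_labels = ["C0", "C1", "O2"]                # e.g. the heavy atoms of ethanol

summary = summarise_attention(attn, atom_labels)
for row in summary:
    print(row)
```

The maximum possible entropy is `log(n_atoms)`, reached when every atom attends uniformly, so comparing each layer's value with that bound gives a quick sense of how selective the layer is.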

---

## 💡 Key Chemical Insights

### What makes Graph Transformers powerful for chemistry:

**Global context**:
- Functional groups interact across the molecule
- Acidity/basicity depends on distant substituents
- Conformational preferences are global properties

**Edge-aware attention**:
- Double bonds (rigidity) vs single bonds (flexibility)
- Aromatic systems (delocalisation)
- Conjugation pathways

**Interpretability**:
- Attention weights show which atoms influence predictions
- Validates chemical intuition or reveals new patterns
- Debugging: wrong predictions can still show sensible-looking attention, so inspect attention alongside the prediction error

**Example**: For melting point prediction, a transformer might attend to:
- Symmetry (via PE)
- Hydrogen bonding sites (via edge features)
- Molecular size (via global attention)

---

## ✅ Knowledge Checkpoint

Before moving to Lesson 06, ensure you can:

- [ ] Explain the Q, K, V matrices in molecular context
- [ ] Implement edge-conditioned attention
- [ ] Integrate positional embeddings correctly
- [ ] Build a complete transformer block with residuals and layer norm
- [ ] Stack multiple layers without oversmoothing
- [ ] Train on molecular data and interpret attention

**Self-test**: 
1. Build a 4-layer Graph Transformer
2. Train on a small dataset (e.g., 100 molecules with known properties)
3. Visualise attention from all layers for one molecule
4. Explain what each layer seems to capture

Can you connect the learned patterns to chemical knowledge?

---

## 🔮 Coming Up in Lesson 06: Advanced Graph Models

You now understand the core transformer architecture. Lesson 06 explores cutting-edge variants and hybrid models.

**What you'll learn**:
- **GraphGPS**: Combining message passing + transformers + PE (current SOTA)
- **Equivariant GNNs**: Respecting molecular symmetries (rotations, translations)
- **3D Transformers**: Using spatial coordinates explicitly
- **Hybrid architectures**: When to mix different approaches

**What you'll need from today**:
- Your transformer implementation as a baseline
- Understanding of attention mechanisms
- Appreciation for when global context matters

**The research frontier**: These models can approach DFT accuracy on quantum chemistry benchmarks whilst being roughly 1000× faster!

---

## 📖 Further Reading

**Transformer Foundations**:
- Vaswani et al. (2017). "Attention is All You Need." *NeurIPS 2017*. [Original transformer]
- Ying et al. (2021). "Do Transformers Really Perform Bad for Graph Representation?" *NeurIPS 2021*. [Graphormer]

**Molecular Transformers**:
- Maziarka et al. (2020). "Molecule Attention Transformer." *arXiv:2002.08264*
- Liao & Smidt (2023). "Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs." *ICLR 2023*

**Edge-Augmented Transformers**:
- Shirzad et al. (2023). "Exphormer: Sparse Transformers for Graphs." *ICML 2023*

**Benchmarks**:
- Dwivedi et al. (2022). "Long Range Graph Benchmark." *NeurIPS 2022 Datasets and Benchmarks*. [LRGB dataset]
- Hu et al. (2020). "Open Graph Benchmark: Datasets for Machine Learning on Graphs." *NeurIPS 2020*. [OGB molecular datasets]

---

**Navigation**: [← Lesson 04](./04_Sparse_Attention.ipynb) | [Lesson 06 →](./06_Advanced_Graph_Models.ipynb) | [Series Home](../README.md)