# Transformer Architecture

- 📺 **Video:** [https://youtu.be/sLsUD-RcDqg](https://youtu.be/sLsUD-RcDqg)

## Overview
- Walk through the encoder block structure: self-attention, residual connections, and feed-forward layers.
- Understand how normalization stabilizes deep stacks.

## Key ideas
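- **Multi-head attention:** runs several attention heads in parallel, each with its own learned projections, so different heads can capture different relations between tokens.
- **Residual pathways:** add each sublayer's input back to its output, which eases optimization of deep stacks.
- **Layer normalization:** normalizes each token's activations across the feature dimension to stabilize training.
- **Feed-forward layers:** position-wise MLPs, applied to every token independently, add nonlinearity after each attention sublayer.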
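
To make the four ideas above concrete, here is a minimal sketch of how they might fit together in one block. It assumes the post-norm ordering (normalize after each residual add) and a ReLU feed-forward layer. The class name `TinyEncoderBlock` and the sizes are hypothetical, chosen only for illustration, and this is not a drop-in replacement for `nn.TransformerEncoderLayer`.

```python
import torch
from torch import nn


class TinyEncoderBlock(nn.Module):
    """Hypothetical post-norm encoder block, written out for illustration."""

    def __init__(self, d_model: int, nhead: int, dim_feedforward: int):
        super().__init__()
        # Multi-head self-attention: nhead heads attend in parallel.
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Position-wise MLP: the same weights are applied to every token.
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention uses x as query, key, and value.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Residual pathway, then layer normalization.
        x = self.norm1(x + attn_out)
        # Second residual pathway around the feed-forward layer.
        x = self.norm2(x + self.ff(x))
        return x


torch.manual_seed(0)  # assumption: seeded only to make the sketch repeatable
block = TinyEncoderBlock(d_model=16, nhead=4, dim_feedforward=32)
tokens = torch.randn(2, 4, 16)  # (batch, seq_len, d_model)
out = block(tokens)
print(out.shape)
```

Each sublayer maps `(batch, seq_len, d_model)` to the same shape, which is what allows the residual additions and lets blocks be stacked.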

## Demo
Build a PyTorch transformer encoder block with `nn.TransformerEncoderLayer` and pass dummy embeddings through it, mirroring the layer diagram in the lecture (https://youtu.be/Ny3cQhLBH4k).

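```python
import torch
from torch import nn

# Dummy embeddings: 2 sequences, 4 tokens each, 16 features per token.
batch, seq_len, d_model = 2, 4, 16
x = torch.randn(batch, seq_len, d_model)

# One encoder block: 4-head self-attention plus a 32-unit feed-forward layer.
# batch_first=True sets the expected input layout to (batch, seq_len, d_model).
encoder_block = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
)
y = encoder_block(x)

print('Input shape:', x.shape)
print('Output shape:', y.shape)
print('First output token representation:')
print(y[0, 0])
```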


```text
Input shape: torch.Size([2, 4, 16])
Output shape: torch.Size([2, 4, 16])
First output token representation:
tensor([ 0.2259,  0.0155,  2.0566, -1.4476, -1.6418,  1.2125, -0.1456,  0.5383,
        -1.2074, -0.3071, -0.9084,  1.2255,  0.0471, -0.2232,  1.0076, -0.4478],
       grad_fn=<SelectBackward0>)
```
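
One way to see the normalization at work is to check the per-token statistics of the output. The sketch below assumes the default layout of `nn.TransformerEncoderLayer` (`norm_first=False`), where the last operation in the block is a `LayerNorm` whose scale and shift start at 1 and 0. Under that assumption, every output token should have a mean close to 0 and a variance close to 1 across its features. The seed is an addition for repeatability and is not part of the demo.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: seeded only so the check is repeatable
encoder_block = nn.TransformerEncoderLayer(
    d_model=16, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
)
y = encoder_block(torch.randn(2, 4, 16))

# Statistics over the feature dimension, one value per token.
token_mean = y.mean(dim=-1)
# LayerNorm uses the biased variance, so match that here.
token_var = y.var(dim=-1, unbiased=False)

print('Largest |mean| over tokens:', token_mean.abs().max().item())
print('Variance range over tokens:', token_var.min().item(), token_var.max().item())
```

After training, the learned scale and shift of the final `LayerNorm` generally move away from 1 and 0, so these statistics are only expected to hold for a freshly initialized block.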


## Try it
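- Modify the demo: change `nhead` or `dim_feedforward` and confirm that the output shape stays `(batch, seq_len, d_model)`.
- Add a tiny dataset or a counter-example, such as an `nhead` that does not divide `d_model`, and observe what happens.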
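
As one possible starting point for a counter-example, the sketch below tries to build a block whose head count does not divide the model width. Multi-head attention splits `d_model` evenly across heads, so construction is expected to fail. The values `16` and `3` are illustrative, and catching `AssertionError` or `ValueError` is an assumption about how current PyTorch releases report the problem.

```python
from torch import nn

error_message = None
try:
    # 16 features cannot be split evenly across 3 heads.
    nn.TransformerEncoderLayer(d_model=16, nhead=3, dim_feedforward=32)
    built = True
except (AssertionError, ValueError) as err:
    built = False
    error_message = str(err)

print('Block constructed:', built)
print('Error message:', error_message)
```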


## References
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
- [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794)
- [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
- [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751)


*Links only; we do not redistribute slides or papers.*