# 📘 Lesson 14 — Transformer Architecture: The Powerhouse of Modern AI

---

### 🎯 Why this lesson matters
Attention is powerful, but Transformers combine it with other blocks to create scalable models.  
Transformers revolutionized AI (GPT, BERT, Vision Transformers).  

👉 They process all positions of a sequence in parallel and handle long-range context better than RNNs/LSTMs.  
We’ll build a mini Transformer and see WHY it’s efficient for tasks like translation or generation.


In [1]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F  # F.softmax is used in MultiHeadAttention
import torch.optim as optim
import math
torch.manual_seed(42)


## 1) Multi-Head Attention — Review & Importance

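- Each head attends over its own slice of size `d_k = d_model // num_heads`, so different heads can capture different relationships between tokens.
- The head outputs are concatenated and passed through a final linear projection.

👉 WHY multi-head? Like an ensemble: each head gives a different view of the same sequence.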
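
To make the "different views" idea concrete, here is a minimal plain-Python sketch of splitting token vectors into heads, running scaled dot-product attention per head, and concatenating the results. It is an illustration under simplifying assumptions, not the lesson's module: the toy input values and the helper names (`softmax`, `attention`, `split_heads`) are made up for this example, and the learned linear projections are left out.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention for one head.
    # q, k, v: lists of token vectors, each of length d_k.
    d_k = len(q[0])
    out, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out, weights

def split_heads(x, num_heads):
    # Slice each d_model-sized token vector into num_heads chunks of size d_k.
    d_k = len(x[0]) // num_heads
    return [[tok[h * d_k:(h + 1) * d_k] for tok in x] for h in range(num_heads)]

# Toy input (hypothetical values): 3 tokens, d_model = 4, 2 heads -> d_k = 2.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
heads = split_heads(x, num_heads=2)

head_outputs, head_weights = [], []
for h in heads:
    out, w = attention(h, h, h)  # self-attention: q = k = v
    head_outputs.append(out)
    head_weights.append(w)

# Concatenate the per-head outputs back into one d_model-sized vector per token.
concat = [sum((head_outputs[h][t] for h in range(len(heads))), []) for t in range(len(x))]

for h, w in enumerate(head_weights):
    print(f"head {h} weights for last token:", [round(v, 3) for v in w[-1]])
```

The two heads produce different attention weights for the same token because each sees a different slice of the vector. The `MultiHeadAttention` class in the next cell follows the same split, attend, and concatenate pattern, but with learned projections before and after.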


In [2]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads  # per-head dimension
        self.num_heads = num_heads
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        bs = q.size(0)
        # Project, then split into heads: (bs, seq_len, d_model) -> (bs, heads, seq_len, d_k)
        q = self.q_linear(q).view(bs, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_linear(k).view(bs, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_linear(v).view(bs, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: (bs, heads, q_len, k_len)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 are blocked from receiving attention
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        context = torch.matmul(attn, v)
        # Merge the heads back: (bs, heads, seq_len, d_k) -> (bs, seq_len, d_model)
        context = context.transpose(1, 2).contiguous().view(bs, -1, self.num_heads * self.d_k)
        return self.out_linear(context)


## 2) Positional Encoding — Adding Order

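- Self-attention has no built-in notion of order: every position is processed in parallel, so nothing tells the model which token came first.
- Sine and cosine waves of different frequencies are added to the embeddings, giving each position its own pattern.

👉 WHY? It injects position information without recurrence, so the parallelism is preserved.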
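
The sinusoidal scheme can be worked out by hand for a few positions. The sketch below is plain Python and uses the same frequency rule as the `PositionalEncoding` class in the next cell: the angle for the dimension pair starting at even index `i` is `pos / 10000^(i / d_model)`. The function name `positional_encoding` and the tiny `d_model=4` are assumptions chosen for readability.

```python
import math

def positional_encoding(pos, d_model):
    # Even dimensions use sin, odd dimensions use cos.
    # Each (sin, cos) pair shares one frequency, which drops as i grows.
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe

for pos in range(3):
    print(pos, [round(v, 4) for v in positional_encoding(pos, d_model=4)])
```

Position 0 always encodes to alternating 0s and 1s (sin 0 = 0, cos 0 = 1). Later positions move along each wave at a different speed, which is what makes the patterns distinguishable.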


In [3]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x


## 3) Encoder-Decoder Structure

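- **Encoder**: a stack of layers, each with self-attention followed by a feed-forward network; it processes the input sequence.
- **Decoder**: the same blocks plus masked self-attention (no peeking at future tokens) and cross-attention over the encoder output; it generates the output sequence.

👉 WHY encoder-decoder? Seq2seq tasks map an input sequence to an output sequence, as in translation.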
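
The decoder's masked attention is easiest to see on a small score matrix. The plain-Python sketch below uses the same convention as `MultiHeadAttention` above (entries where the mask is 0 are set to `-inf` before the softmax). The helper names `causal_mask` and `masked_softmax` and the score values are assumptions made up for illustration.

```python
import math

def causal_mask(size):
    # 1 = may attend, 0 = blocked. Row i can see columns 0..i only.
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]

def masked_softmax(scores, mask_row):
    # Blocked positions become -inf, so exp(-inf) = 0 gives them zero weight.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(masked)  # finite, because the diagonal is always allowed
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [[0.5, 1.0, 0.2],
          [0.1, 0.3, 0.9],
          [0.7, 0.4, 0.6]]
mask = causal_mask(3)
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`. Every row still sums to 1, and all weight above the diagonal is exactly zero.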


In [4]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dim_feedforward=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, mask=None):
        # Post-norm residual blocks: LayerNorm(x + Sublayer(x))
        src2 = self.self_attn(src, src, src, mask)
        src = self.norm1(src + src2)
        src2 = self.feed_forward(src)
        src = self.norm2(src + src2)
        return src


## 4) Building a Mini Transformer

- Token embedding + positional encoding + a stack of encoder layers, followed by a linear output layer. The decoder stack is left as an exercise.


In [5]:
class MiniTransformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=128, num_heads=4, num_layers=2):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        self.encoder_layers = nn.ModuleList([TransformerEncoderLayer(d_model, num_heads) for _ in range(num_layers)])
        # Similar for decoder...
        self.fc_out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src):
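        # Scale embeddings by sqrt(d_model); src holds token ids, so src.size(-1) would be the sequence length
        src = self.src_embed(src) * math.sqrt(self.src_embed.embedding_dim)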
        src = self.pos_enc(src)
        for layer in self.encoder_layers:
            src = layer(src)
        return self.fc_out(src)


## 5) Practice: Mini Transformer for Sequence Tasks

- A simple copy task: the model must reproduce its input sequence, so the target equals the source.


In [6]:
# Dummy data: Copy sequence
vocab_size = 10
src = torch.randint(0, vocab_size, (32, 10))  # Batch=32, seq=10
tgt = src.clone()  # Target = input

model = MiniTransformer(vocab_size, vocab_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    output = model(src)
    loss = criterion(output.view(-1, vocab_size), tgt.view(-1))
    loss.backward()
    optimizer.step()
    if epoch == 0:
        print(f"Epoch {epoch+1}/10, Loss: {loss.item():.2f}")


Epoch 1/10, Loss: 2.30
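
A quick sanity check on that number: a model that spreads its probability uniformly over the 10 tokens would score a cross-entropy of ln(10). The sketch below only computes that baseline, on the assumption that an untrained model predicts roughly uniformly; it does not rerun the training loop.

```python
import math

vocab_size = 10  # same vocabulary size as the copy task above

# Cross-entropy of a uniform prediction: -log(1 / vocab_size) = log(vocab_size).
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.2f}")  # prints 2.30
```

A first-epoch loss near this value suggests the model starts out no better than guessing, which is expected before any weight updates.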


## 6) Practice Exercises

- Add a decoder layer (masked self-attention + cross-attention + feed-forward).
- Use the model for a toy translation task.


In [7]:
# Practice: Add mask
def generate_square_subsequent_mask(sz):
    # Lower-triangular 0/1 mask: row i may attend to columns 0..i.
    # MultiHeadAttention blocks positions where mask == 0, so it needs this
    # 0/1 form rather than an additive -inf/0 mask.
    return torch.tril(torch.ones(sz, sz))


## 📚 Summary

✅ What we learned:
- Multi-head attention.
- Positional encoding.
- Encoder-decoder.
- Mini Transformer build.

🚀 Next Lesson: **Generative Models Basics** — creating new data.
