# 📘 Lesson 13 — Attention Mechanism: Focusing on What Matters

---

### 🎯 Why this lesson matters
RNNs and LSTMs process sequences step by step, so they struggle with long-range dependencies.  
**Attention** lets models "focus" on relevant parts dynamically.  

👉 It’s the core of Transformers (BERT, GPT).  
Because all positions are handled at once rather than one step at a time, it enables parallel processing and copes better with long text or large images.  

We’ll build a simple attention layer from scratch and see WHY it changed ML.


In [1]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim  # needed by the training loop in section 5
torch.manual_seed(42)


## 1) What is Attention?

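- Assigns each input element a weight that sets how much it contributes to the output.
- Like human focus: ignore noise, attend to key info.

👉 WHY? Any position can attend to any other directly, so global dependencies (e.g., in translation) are easier to capture than with an RNN.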


## 2) Query, Key, Value Concept

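- **Query**: What we’re looking for.
- **Key**: The label each data item is matched against.
- **Value**: The actual data returned for a match.
- Score = Query • Key, divided by √d_k (d_k = key dimension); softmax turns scores into weights; Output = weights • Value.

👉 WHY QKV? Separating what is matched (keys) from what is returned (values) keeps the matching flexible.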
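
To make the score, softmax, weighted-sum recipe concrete, here is a minimal sketch with hand-picked toy vectors (the numbers are illustrative assumptions, not taken from a trained model). One query is compared against three keys, giving scores of 1, 0 and -1, so the output is pulled toward the value stored under the first key. Scaling is left out here to keep the arithmetic easy to follow.

```python
import torch
import torch.nn.functional as F

# Toy tensors chosen by hand for illustration (assumed values).
query = torch.tensor([[1.0, 0.0]])               # (1, 2): what we are looking for
keys = torch.tensor([[1.0, 0.0],                 # key 0 points the same way as the query
                     [0.0, 1.0],                 # key 1 is orthogonal to the query
                     [-1.0, 0.0]])               # key 2 points the opposite way
values = torch.tensor([[10.0], [20.0], [30.0]])  # (3, 1): data stored under each key

scores = query @ keys.T                          # (1, 3) dot-product similarity
weights = F.softmax(scores, dim=-1)              # non-negative, sums to 1
output = weights @ values                        # (1, 1) weighted average of the values

print("scores: ", scores)
print("weights:", weights)
print("output: ", output)
```

The largest weight lands on key 0, and the output sits between 10 and 30 because it is a weighted average of the values.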


In [2]:
# Simple attention demo
def scaled_dot_product_attention(q, k, v):
    # Similarity of every query with every key: (..., seq_q, seq_k)
    matmul_qk = torch.matmul(q, k.transpose(-2, -1))
    # Scale by sqrt(d_k) so scores do not grow with the key dimension
    dk = k.size(-1)
    scaled = matmul_qk / dk ** 0.5
    # Softmax over the keys: each row of weights sums to 1
    attn_weights = F.softmax(scaled, dim=-1)
    output = torch.matmul(attn_weights, v)
    return output

seq = torch.rand(1, 3, 2)  # Batch=1, seq_len=3, dim=2
print("Attention output:", scaled_dot_product_attention(seq, seq, seq))


Attention output: tensor([[[0.4330, 0.8487],
         [0.6920, 0.6245],
         [0.2837, 0.5950]]], grad_fn=<BmmBackward0>)
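
The division by the square root of the key dimension in the function above is easier to appreciate with a quick experiment. The sketch below uses random inputs and an assumed dimension of 512 (both chosen only for the demo) and compares the softmax weights with and without scaling. For large dimensions the raw dot products grow, so the unscaled softmax tends to put almost all weight on a single key.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512                                    # assumed key dimension for the demo
q = torch.randn(1, d_k)                      # one query
k = torch.randn(10, d_k)                     # ten keys

raw = q @ k.T                                # raw dot products, spread grows with d_k
unscaled = F.softmax(raw, dim=-1)
scaled = F.softmax(raw / d_k ** 0.5, dim=-1)

print("largest weight without scaling:", unscaled.max().item())
print("largest weight with scaling:   ", scaled.max().item())
```

A softmax that is close to one-hot passes very small gradients to the other positions, which is the usual motivation for the scaling step.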


## 3) Self-Attention — All from Input

- Q, K, V all come from the same input (via separate linear projections).
- Lets every position attend to every other position.

👉 WHY self? Captures internal relations (e.g., subject-verb).


In [3]:
class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        return scaled_dot_product_attention(q, k, v)
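
A quick shape check helps confirm what self-attention does to a batch. The sketch below repeats a compact version of the layer so it runs on its own (the name `TinySelfAttention` and the sizes are illustrative assumptions) and also returns the attention weights so they can be inspected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySelfAttention(nn.Module):
    """Compact stand-in for the SelfAttention layer above (hypothetical name)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)    # (batch, seq_len, seq_len)
        return weights @ v, weights

torch.manual_seed(0)
x = torch.rand(2, 5, 8)                        # batch=2, seq_len=5, embed_dim=8
out, weights = TinySelfAttention(8)(x)
print(out.shape)      # torch.Size([2, 5, 8])
print(weights.shape)  # torch.Size([2, 5, 5])
```

The output keeps the input shape, while the weights form one seq_len × seq_len table per sequence: row i says how much position i draws from each other position.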


## 4) Multi-Head Attention

- Run several attention heads in parallel, each on a lower-dimensional subspace.
- Concatenate the head outputs and project the result.

👉 WHY multi? Captures different relations (e.g., syntax vs semantics).


In [4]:
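class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.heads = nn.ModuleList([SelfAttention(embed_dim // num_heads) for _ in range(num_heads)])
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # Each head expects embed_dim // num_heads features, so split the
        # embedding into one slice per head (a simplification: standard
        # implementations project the full embedding for every head).
        chunks = x.chunk(len(self.heads), dim=-1)
        out = torch.cat([head(c) for head, c in zip(self.heads, chunks)], dim=-1)
        return self.proj(out)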
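
PyTorch also ships a built-in `nn.MultiheadAttention` layer, which can serve as a rough cross-check of the idea. The sketch below runs self-attention through it with assumed sizes. It is not a drop-in equivalent of the hand-written class above: the built-in layer projects the full embedding and then splits the result into heads.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 64, 4                   # assumed sizes; embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.rand(32, 10, embed_dim)              # batch=32, seq_len=10
# Self-attention: the same tensor is passed as query, key, and value.
out, attn_weights = mha(x, x, x)

print(out.shape)           # torch.Size([32, 10, 64])
print(attn_weights.shape)  # torch.Size([32, 10, 10]), averaged over heads by default
```

With `batch_first=True` the layer accepts the same (batch, seq, dim) layout used throughout this lesson.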


## 5) Practice: Simple Attention Model

- Use attention for sequence classification: embed each step, apply multi-head attention, average over the sequence, then classify.


In [5]:
class AttentionModel(nn.Module):
    def __init__(self, input_dim, embed_dim, num_heads, output_dim):
        super().__init__()
        self.embed = nn.Linear(input_dim, embed_dim)
        self.attn = MultiHeadAttention(embed_dim, num_heads)
        self.fc = nn.Linear(embed_dim, output_dim)

    def forward(self, x):
        x = self.embed(x)
        x = self.attn(x)
        x = x.mean(dim=1)  # Global avg pool
        return self.fc(x)

# Dummy training
X = torch.rand(32, 10, 5)  # Batch=32, seq=10, dim=5
y = torch.randint(0, 2, (32,))
model = AttentionModel(5, 64, 4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
    if epoch == 0:
        print(f"Epoch {epoch+1}/50, Loss: {loss.item():.2f}")


Epoch 1/50, Loss: 0.69
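
The first-epoch loss of 0.69 is close to ln 2 ≈ 0.693, the cross-entropy of a two-class classifier that gives both classes equal probability. That is a plausible reading of the output above, since an untrained model usually produces logits near zero. A minimal sketch with hand-made zero logits (an assumption standing in for the untrained model):

```python
import math
import torch
import torch.nn as nn

logits = torch.zeros(4, 2)                  # equal scores for both classes
targets = torch.tensor([0, 1, 1, 0])        # arbitrary labels for the demo
loss = nn.CrossEntropyLoss()(logits, targets)

print(loss.item(), math.log(2))             # both are about 0.693
```

A loss that stays near this value over many epochs would suggest the model is doing no better than guessing, which is expected here because the labels are random.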


## 6) Practice Exercises

- Add a mask to attention so padded positions are ignored.
- Build an encoder with attention.


In [6]:
# Practice: Masked attention
def masked_attention(q, k, v, mask=None):
    # mask: broadcastable to (..., seq_q, seq_k); 1 = attend, 0 = ignore
    scores = torch.matmul(q, k.transpose(-2, -1)) / k.size(-1) ** 0.5
    if mask is not None:
        # A score of -inf becomes exactly zero weight after softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
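
One way to try the masked version is to build a padding mask from sequence lengths. The sketch below repeats the function so it runs on its own (this copy also returns the weights) and uses assumed lengths of 4 and 2 for a batch padded to 4 steps. The `lengths` tensor and the mask layout are illustrative choices, not part of the lesson's code.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    # Same logic as the practice cell above; also returns the weights for inspection.
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.rand(2, 4, 8)                          # batch=2, seq_len=4, dim=8
lengths = torch.tensor([4, 2])                   # assumed true lengths; sequence 1 has 2 pad steps

# (batch, 1, seq_len): 1 for real tokens, 0 for padding; broadcasts over query positions.
mask = (torch.arange(4)[None, :] < lengths[:, None]).int().unsqueeze(1)

out, weights = masked_attention(x, x, x, mask)
print(weights[1])                                # the last two columns are exactly 0
```

If every key in a row is masked, the softmax sees only `-inf` and returns NaN, so each sequence needs at least one unmasked position.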


## 📚 Summary

✅ What we learned:
- Attention weights inputs by relevance.
- The QKV mechanism: scores from queries and keys, outputs from values.
- Self-attention and multi-head attention.

🚀 Next Lesson: **Transformer Architecture** — combining attention layers.
