# Week 7: Introduction to Transformers - Homework

**ML2: Advanced Machine Learning**

**Estimated Time**: 1 hour

---

This homework combines programming exercises and knowledge-based questions to reinforce this week's concepts.

## Setup

Run this cell to import the necessary libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print('✓ Libraries imported successfully')

---
## Part 1: Programming Exercises (60%)

Complete the following programming tasks. Read each description carefully and implement the requested functionality.

### Exercise 1: Attention Weights Visualization

**Time**: 10 min

See how attention dynamically weighs different words based on context.

In [None]:
import torch
import torch.nn.functional as F
import numpy as np

# Simplified attention example
sentence = ["The", "cat", "sat", "on", "the", "mat"]

# Random word embeddings (3-dim for readability); they are untrained,
# so the weights below reflect vector similarity, not grammar
embeddings = torch.randn(len(sentence), 3)  # 6 words, 3 dimensions

def scaled_dot_product_attention(query, keys, values):
    # query: (d_k,), keys: (seq_len, d_k), values: (seq_len, d_v)
    d_k = query.size(-1)
    scores = torch.matmul(keys, query) / np.sqrt(d_k)  # (seq_len,)
    attention_weights = F.softmax(scores, dim=0)  # (seq_len,)
    output = torch.matmul(attention_weights, values)  # (d_v,)
    return output, attention_weights

# For word "cat", what does it attend to?
query_word_idx = 1  # "cat"
query = embeddings[query_word_idx]
keys = embeddings
values = embeddings

output, weights = scaled_dot_product_attention(query, keys, values)

print(f"Attention weights when processing '{sentence[query_word_idx]}':")
for word, weight in zip(sentence, weights):
    print(f"  {word}: {weight:.3f}")

# TODO: What pattern do you see? Why does "cat" attend to certain words?
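The cell above computes attention for a single query word. As an optional extension, the sketch below computes the weights for every word at once, giving a 6×6 matrix in which row i holds the weights used when processing word i. It assumes random 3-dimensional embeddings like those above (regenerated here with an arbitrary seed of 0 so the block runs on its own), so the numbers carry no linguistic meaning.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, chosen only for repeatability
sentence = ["The", "cat", "sat", "on", "the", "mat"]
embeddings = torch.randn(len(sentence), 3)  # random stand-ins for real embeddings

# Every word acts as a query against every word as a key
d_k = embeddings.size(-1)
scores = embeddings @ embeddings.T / d_k ** 0.5   # (6, 6)
attention_matrix = F.softmax(scores, dim=-1)      # each row sums to 1

for word, row in zip(sentence, attention_matrix):
    print(f"{word:>4}: " + " ".join(f"{w:.2f}" for w in row.tolist()))
```

Passing `attention_matrix.numpy()` to `plt.imshow` is one way to turn this table into a heatmap.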

---
## Part 2: Knowledge Questions (40%)

Answer the following questions to test your conceptual understanding.

### Question 1 (Short Answer)

**Question 1 - Static vs Context-Dependent Embeddings**

- Word2Vec: "bank" always has the same embedding
- Transformer: "bank" has DIFFERENT embeddings in "river bank" vs "financial bank"

Explain:
1. How does the transformer create context-dependent embeddings?
2. What role does attention play in this?
3. Why is this a major breakthrough?

**Hint**: Attention looks at surrounding words to dynamically adjust the representation.
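To explore the hint, here is a minimal sketch in which the same static vector for "bank" produces different outputs depending on its neighbor. The three-word vocabulary, the random 4-dimensional vectors, and the `contextual` helper are all hypothetical, and the learned Query/Key/Value projections of a real transformer are left out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4
# Hypothetical static embeddings: one fixed random vector per word
vocab = {w: torch.randn(d) for w in ["river", "money", "bank"]}

def contextual(target, context):
    """Attention output for `target` when it attends over the words in `context`."""
    x = torch.stack([vocab[w] for w in context])   # (seq_len, d)
    q = vocab[target]
    weights = F.softmax(x @ q / d ** 0.5, dim=0)   # one weight per context word
    return weights @ x                             # weighted mix of context vectors

bank_river = contextual("bank", ["river", "bank"])
bank_money = contextual("bank", ["money", "bank"])

print("input vector for 'bank' is the same in both phrases")
print("outputs equal?", torch.allclose(bank_river, bank_money))
```

The input vector for "bank" never changes; only the mix of neighbors that attention blends into it does.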

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 2 (Short Answer)

**Question 2 - Attention as Dynamic Weighting**

In "The cat sat on the mat", when processing "sat", attention might heavily weight "cat" (subject) and "mat" (object).

Explain:
1. Why is this better than fixed-size context windows (like n-grams)?
2. What can attention capture that RNNs struggle with?
3. How does this help with long-range dependencies?

**Hint**: Attention can connect words that are far apart without sequential processing.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 3 (Multiple Choice)

**Question 3 - Query, Key, Value Intuition**

Attention uses three vectors: Query, Key, Value

Think of it like a search engine:
- Query = what you're looking for
- Keys = indexed items
- Values = the actual content you retrieve

Which is correct?

- A) Query comes from the word being processed, Keys come from all words
- B) All three are the same vector
- C) Keys and Values are always different
- D) Query is learned, Keys are fixed

**Hint**: Query = what this word is asking for. Keys = what other words offer.

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 4 (Short Answer)

**Question 4 - Why Scale by sqrt(d_k)?**

Scaled dot-product attention: scores = (Q·K^T) / sqrt(d_k)

Explain: Why divide by sqrt(d_k)? What problem does this solve?

**Hint**: Dot products grow large in high dimensions, pushing softmax into saturation.
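The hint can be checked numerically. The sketch below assumes query and key components are independent standard normal values (an idealization, not a property of trained models) and measures how the spread of dot products changes with `d_k`, with and without the scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 2_000
spread = {}  # d_k -> (std of raw dot products, std after dividing by sqrt(d_k))

for d_k in (4, 64, 1024):
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    dots = (q * k).sum(axis=1)                     # one dot product per (q, k) pair
    spread[d_k] = (dots.std(), (dots / np.sqrt(d_k)).std())
    print(f"d_k={d_k:5d}  std(raw)={spread[d_k][0]:6.2f}  std(scaled)={spread[d_k][1]:5.2f}")
```

Under this assumption the raw spread grows roughly like sqrt(d_k), while the scaled spread stays near 1 for every dimension.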

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 5 (Short Answer)

**Question 5 - Multi-Head Attention**

Transformers use MULTIPLE attention heads in parallel.

Explain:
1. Why use multiple heads instead of one big attention mechanism?
2. What might different heads learn to specialize in?
3. How does this relate to ensemble learning?

**Hint**: Different heads can capture different relationships (syntax, semantics, etc.).
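For a concrete look at what "multiple heads" means, the sketch below runs PyTorch's `nn.MultiheadAttention` on random input and asks for one weight matrix per head. The sizes (6 tokens, 8 dimensions, 2 heads) are arbitrary, and the layer is untrained, so no specialization will be visible; the point is only that each head produces its own attention pattern. It assumes a PyTorch version that supports the `average_attn_weights` argument.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, embed_dim, num_heads = 6, 8, 2       # arbitrary illustrative sizes
x = torch.randn(1, seq_len, embed_dim)        # (batch, seq_len, embed_dim)

mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
with torch.no_grad():
    # average_attn_weights=False keeps a separate weight matrix for each head
    output, head_weights = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(tuple(output.shape))        # (batch, seq_len, embed_dim)
print(tuple(head_weights.shape))  # (batch, num_heads, seq_len, seq_len)
```

Each head works in an `embed_dim / num_heads` dimensional subspace, so adding heads splits the representation rather than enlarging it.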

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 6 (Multiple Choice)

**Question 6 - Positional Encoding**

Transformers process all words in parallel (unlike RNNs which are sequential).

What problem does this create, and how do positional encodings solve it?

- A) Memory usage - solved by compression
- B) Loss of word order information - solved by adding position signals
- C) Slow training - solved by parallelization
- D) Overfitting - solved by regularization

**Hint**: Without sequential processing, how does the model know word order?
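One way to test the hint is to shuffle the input and see whether attention notices. The sketch below uses plain self-attention with no learned projections and, as one common choice of position signal, the sine/cosine encoding from the original Transformer paper. The helper names and the tiny sizes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    """Plain self-attention with Q = K = V = x (no learned projections)."""
    d_k = x.size(-1)
    weights = F.softmax(x @ x.T / d_k ** 0.5, dim=-1)
    return weights @ x

def sinusoidal_encoding(seq_len, d_model):
    """Sine/cosine position signals; d_model is assumed to be even."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    divisors = 10000 ** (torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions / divisors)  # even dimensions
    pe[:, 1::2] = torch.cos(positions / divisors)  # odd dimensions
    return pe

torch.manual_seed(0)
x = torch.randn(5, 4)                   # 5 tokens, 4 dimensions
perm = torch.tensor([4, 2, 0, 3, 1])    # an arbitrary reordering of the tokens

# Without position signals, shuffling the tokens just shuffles the outputs
order_blind = torch.allclose(self_attention(x)[perm], self_attention(x[perm]), atol=1e-5)

# With position signals added, the same shuffle changes each token's output
pe = sinusoidal_encoding(5, 4)
order_aware = not torch.allclose(
    self_attention(x + pe)[perm], self_attention(x[perm] + pe), atol=1e-5
)

print("blind to order without positions:", order_blind)
print("sensitive to order with positions:", order_aware)
```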

**Your Answer**: [Write your answer here - e.g., 'A']

**Explanation**: [Explain why this is correct]

### Question 7 (Short Answer)

**Question 7 - Self-Attention vs Cross-Attention**

- Self-attention: A sentence attends to itself
- Cross-attention: One sequence attends to another (e.g., translation)

Explain: In an English-to-French translator, where would you use each type?

**Hint**: Self-attention within the English input and within the French output. Cross-attention from the French output (queries) to the English input (keys and values).
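The difference shows up in the shape of the weight matrix. The sketch below uses random stand-ins for 5 English and 7 French token vectors (hypothetical sizes, no trained model) and the same scaled dot-product attention in both cases; only the source of the queries, keys, and values changes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 8
english = torch.randn(5, d_model)  # stand-in vectors for 5 English tokens
french = torch.randn(7, d_model)   # stand-in vectors for 7 French tokens

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = queries.size(-1)
    weights = F.softmax(queries @ keys.T / d_k ** 0.5, dim=-1)
    return weights @ values, weights

# Self-attention: queries, keys, and values all come from one sequence
_, self_weights = attention(english, english, english)

# Cross-attention: French queries look up English keys and values
cross_out, cross_weights = attention(french, english, english)

print(tuple(self_weights.shape))   # (English tokens, English tokens)
print(tuple(cross_weights.shape))  # (French tokens, English tokens)
```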

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 8 (Short Answer)

**Question 8 - Computational Complexity**

For a sequence of length n, self-attention has O(n²) complexity.

Explain:
1. Why O(n²)? (What computation causes this?)
2. Why is this a problem for long documents?
3. How might you reduce this for very long sequences?

**Hint**: Every word attends to every other word = n×n comparisons.
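To get a feel for the n×n growth, the sketch below counts the entries of a single attention score matrix and its size in memory. It assumes 4-byte (float32) scores and counts only one head, one layer, and one sequence, so real models need a multiple of these figures. The function name is hypothetical.

```python
def attention_matrix_bytes(n, bytes_per_score=4):
    """Memory for one n x n score matrix, assuming float32 scores by default."""
    return n * n * bytes_per_score

for n in (512, 4_096, 32_768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {n * n:>13,} scores, {mib:>6,.0f} MiB")
```

Doubling the sequence length quadruples the count, which is why the longest of these sequences needs 4,096 times the memory of the shortest.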

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 9 (Short Answer)

**Question 9 - Attention vs RNN Sequential Processing**

- RNNs: Process word 1, then 2, then 3... (sequential)
- Transformers: Process all words in parallel

Explain:
1. What's the training speed advantage of transformers?
2. What's the tradeoff in memory usage?
3. Why can't RNNs parallelize as easily?

**Hint**: RNNs need the previous hidden state before computing the next one. Transformers compute all positions in parallel.
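The contrast in the hint can be seen in the structure of the code itself. The sketch below puts a minimal tanh RNN next to projection-free self-attention; the weight matrices `W_x` and `W_h` and all sizes are hypothetical, and neither model is trained.

```python
import torch

torch.manual_seed(0)
seq_len, d = 6, 4
x = torch.randn(seq_len, d)
W_x = torch.randn(d, d) * 0.1   # hypothetical input-to-hidden weights
W_h = torch.randn(d, d) * 0.1   # hypothetical hidden-to-hidden weights

# RNN: step t cannot start until step t-1 has produced h
h = torch.zeros(d)
hidden_states = []
for t in range(seq_len):
    h = torch.tanh(x[t] @ W_x + h @ W_h)
    hidden_states.append(h)
rnn_out = torch.stack(hidden_states)               # (seq_len, d)

# Self-attention: all positions come out of the same matrix products
weights = torch.softmax(x @ x.T / d ** 0.5, dim=-1)  # (seq_len, seq_len)
attn_out = weights @ x                               # (seq_len, d)

print(tuple(rnn_out.shape), tuple(attn_out.shape))
```

The Python loop is the sequential bottleneck. The attention path has no loop over positions, but it materializes the full `seq_len × seq_len` weight matrix, which is the memory tradeoff the question asks about.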

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 10 (Short Answer)

**Question 10 - Real-World Application**

GPT, BERT, and T5 are all transformers, and the largest GPT and T5 models have billions of parameters.

Explain:
1. Why did transformers enable scaling to billions of parameters when RNNs couldn't?
2. What property of attention makes it more effective at scale?
3. What's the cost of this scalability?

**Hint**: Parallelization + long-range dependencies. Cost = computation and memory.

**Your Answer**:

[Write your answer here in 2-4 sentences]

---
## Submission

Before submitting:
1. Run all cells to ensure code executes without errors
2. Check that all questions are answered
3. Review your explanations for clarity

**To Submit**:
- File → Download → Download .ipynb
- Submit the notebook file to your course LMS

**Note**: Make sure your name is in the filename (e.g., homework_01_yourname.ipynb)