# Week 3, Day 1: The Dot Product — Similarity in Code

**Time:** ~1 hour

**Goal:** Understand the dot product as the foundation of attention, both mathematically and computationally.

## The Challenge

You have a query vector and 1000 key vectors. How do you find which keys are most "relevant" to the query?

The answer: **dot products**. The dot product measures how much two vectors "agree" — it's the mathematical backbone of attention.
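
As a rough sketch of the challenge above, the following scores 1000 random keys against one query with a single matrix-vector product and lists the five highest scores. The dimension (64), the random data, and the names `keys`, `scores`, and `top_k` are illustrative assumptions, not part of the lesson's later code.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the run is repeatable
d = 64                             # assumed embedding dimension
query = rng.standard_normal(d)
keys = rng.standard_normal((1000, d))

# One matrix-vector product yields all 1000 dot products at once.
scores = keys @ query              # shape: (1000,)

# Indices of the five highest-scoring keys, best first.
top_k = np.argsort(scores)[::-1][:5]
for idx in top_k:
    print(f"key {idx:4d}: score = {scores[idx]:.4f}")
```

Sorting all 1000 scores is fine at this size; attention never needs the ranking itself, only the scores, which is why the rest of the lesson focuses on computing them.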

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt

# Suppress scientific notation for readability
np.set_printoptions(precision=4, suppress=True)
torch.set_printoptions(precision=4, sci_mode=False)

---
## Step 1: The Challenge (5 min)

Imagine you're building an LLM. Your current token (the "query") needs to figure out which previous tokens (the "keys") are most relevant for predicting the next word.

**Concrete example:** In the sentence "The cat sat on the mat because it was tired", when processing "tired", the model needs to figure out that "it" refers to "cat", not "mat".

The dot product gives us a **similarity score** between vectors.

In [None]:
# Simple 4-dimensional vectors (like a tiny embedding)
query = np.array([1.0, 0.0, 0.5, -0.5])  # Current token's query

# Three candidate keys (previous tokens)
key_cat = np.array([0.9, 0.1, 0.4, -0.6])   # key for "cat"
key_mat = np.array([-0.5, 0.8, 0.2, 0.3])   # key for "mat"
key_the = np.array([0.1, -0.1, 0.1, 0.0])   # key for "the"

# Which key is most similar to our query?
print(f"Query · cat = {np.dot(query, key_cat):.4f}")
print(f"Query · mat = {np.dot(query, key_mat):.4f}")
print(f"Query · the = {np.dot(query, key_the):.4f}")

**Observation:** The higher the dot product, the more "similar" or "relevant" that key is to the query.

---
## Step 2: Explore — The Geometry (15 min)

### What does the dot product actually measure?

The dot product has two equivalent definitions:

**Algebraic:** Sum of element-wise products
$$a \cdot b = \sum_{i=1}^{n} a_i b_i$$

**Geometric:** Product of the magnitudes, scaled by the cosine of the angle between them
$$a \cdot b = \|a\| \|b\| \cos(\theta)$$

where $\theta$ is the angle between the two vectors and $\|a\|$ is the Euclidean length of $a$.
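
To see that the two definitions agree, here is a minimal numeric check on a hand-picked 2D pair. The vectors are arbitrary; the angle is measured from each vector's direction with `np.arctan2` rather than derived from the dot product, so the comparison is not circular.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Algebraic definition: sum of element-wise products.
algebraic = float(np.sum(a * b))

# Geometric definition: |a| |b| cos(theta), with theta measured
# independently from each vector's direction angle.
theta = np.arctan2(a[1], a[0]) - np.arctan2(b[1], b[0])
geometric = float(np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta))

print(f"algebraic = {algebraic:.4f}")
print(f"geometric = {geometric:.4f}")
```

Both lines should print the same value up to floating-point rounding.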

In [None]:
def visualize_dot_product_2d(a, b, title="Dot Product Visualization"):
    """Visualize two 2D vectors and their dot product relationship."""
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # Draw vectors from origin
    ax.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, 
              color='blue', width=0.02, label=f'a = {a}')
    ax.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, 
              color='red', width=0.02, label=f'b = {b}')
    
    # Calculate dot product and angle
    dot = np.dot(a, b)
    mag_a = np.linalg.norm(a)
    mag_b = np.linalg.norm(b)
    cos_theta = dot / (mag_a * mag_b) if mag_a * mag_b > 0 else 0
    theta = np.arccos(np.clip(cos_theta, -1, 1))
    
    # Draw projection of b onto a
    proj_scalar = dot / (mag_a ** 2) if mag_a > 0 else 0
    proj = proj_scalar * a
    ax.plot([b[0], proj[0]], [b[1], proj[1]], 'g--', linewidth=1.5, label='Projection')
    ax.scatter([proj[0]], [proj[1]], color='green', s=50, zorder=5)
    
    # Set limits and labels
    all_coords = np.array([a, b, [0, 0]])
    margin = 0.5
    ax.set_xlim(all_coords[:, 0].min() - margin, all_coords[:, 0].max() + margin)
    ax.set_ylim(all_coords[:, 1].min() - margin, all_coords[:, 1].max() + margin)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.legend()
    
    ax.set_title(f"{title}\na·b = {dot:.2f}, θ = {np.degrees(theta):.1f}°, cos(θ) = {cos_theta:.2f}")
    plt.tight_layout()
    plt.show()
    
    return dot, theta

# Vectors pointing in similar directions → positive dot product
a = np.array([2.0, 1.0])
b = np.array([1.5, 1.5])
visualize_dot_product_2d(a, b, "Similar Directions (Positive Dot Product)")

In [None]:
# Perpendicular vectors → zero dot product
a = np.array([2.0, 0.0])
b = np.array([0.0, 1.5])
visualize_dot_product_2d(a, b, "Perpendicular (Zero Dot Product)")

In [None]:
# Opposite directions → negative dot product
a = np.array([2.0, 0.5])
b = np.array([-1.5, 0.0])
visualize_dot_product_2d(a, b, "Opposite Directions (Negative Dot Product)")

### Key Insight

| Dot Product | Angle | Interpretation |
|-------------|-------|----------------|
| Large positive | 0° to 45° | Vectors point in similar directions |
| Small positive | 45° to 90° | Weak similarity |
| Zero | 90° | Perpendicular: no alignment |
| Negative | 90° to 180° | Vectors point in opposing directions |

The sign depends only on the angle. The size also depends on the magnitudes, so "large" and "small" here assume vectors of comparable length.

In attention, **positive = relevant**, **zero = unrelated**, **negative = anti-relevant**.
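
Because the dot product mixes direction and length, a long key that is only partly aligned can outscore a short key that is perfectly aligned. The sketch below contrasts the raw dot product with cosine similarity, which divides out both magnitudes. The helper `cosine_similarity` and the example vectors are hypothetical illustrations; standard attention uses the (scaled) dot product, not cosine similarity.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by both magnitudes: depends on the angle only."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
aligned_short = np.array([0.5, 0.0])   # same direction as query, small norm
skewed_long = np.array([3.0, 3.0])     # 45 degrees off, large norm

for name, key in [("aligned_short", aligned_short), ("skewed_long", skewed_long)]:
    print(f"{name}: dot = {np.dot(query, key):.2f}, "
          f"cos = {cosine_similarity(query, key):.2f}")
```

The skewed key wins on the dot product while the aligned key wins on cosine similarity, which is one reason the magnitudes of learned queries and keys matter in practice.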

---
## Step 3: The Concept — Dot Products in Attention (10 min)

In transformer attention, we compute dot products between **every query** and **every key**:

$$\text{scores} = QK^T$$

Where:
- $Q$ is `[batch, seq_len, d_model]` — queries (what we're looking for)
- $K$ is `[batch, seq_len, d_model]` — keys (what we have)
- $QK^T$ is `[batch, seq_len, seq_len]` — attention scores
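
To make the shapes concrete, here is a small sketch that uses NumPy arrays as stand-ins for the tensors (the sizes are arbitrary assumptions). It checks that entry `[b, i, j]` of the result is the dot product of query `i` and key `j` within batch `b`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 8          # small assumed sizes
Q = rng.standard_normal((batch, seq_len, d_model))
K = rng.standard_normal((batch, seq_len, d_model))

# Swap the last two axes of K, then batch-multiply: [b, i, d] @ [b, d, j].
scores = Q @ K.transpose(0, 2, 1)

# The same contraction with explicit index names.
scores_einsum = np.einsum("bid,bjd->bij", Q, K)

# Entry [b, i, j] is the dot product of query i and key j in batch b.
b, i, j = 1, 3, 2
print(scores.shape)
print(np.isclose(scores[b, i, j], np.dot(Q[b, i], K[b, j])))
```

The `einsum` form is slower to type but makes the contraction over the shared `d` axis explicit, which can help when head dimensions are added later.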

### The Quadratic Problem

For a sequence of length $N$ with embedding dimension $d$:
- We compute $N^2$ dot products
- Each dot product sums $d$ multiplications
- Total: $O(N^2 \cdot d)$ operations

For a 128K-token context window (the size advertised for GPT-4 Turbo), a full score matrix holds $128000^2 \approx 16.4$ billion entries per attention head, per layer!
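
The arithmetic above can be sketched as a small helper. The function name `score_matrix_cost`, the choice of `d=64`, and the 2-bytes-per-score storage figure are assumptions for illustration; the memory number only applies if the full matrix is actually materialized.

```python
def score_matrix_cost(seq_len, d, bytes_per_score=2):
    """Return (num_scores, num_multiplications, bytes) for one N x N score matrix.

    bytes_per_score=2 assumes 16-bit storage; this is an illustrative choice.
    """
    num_scores = seq_len * seq_len
    num_mults = num_scores * d
    return num_scores, num_mults, num_scores * bytes_per_score

for n in (1_024, 8_192, 128_000):
    scores, mults, nbytes = score_matrix_cost(n, d=64)
    print(f"N={n:>7,}: {scores:>18,} scores, {mults:.2e} mults, {nbytes / 1e9:.2f} GB")
```

Going from 1,024 to 8,192 tokens is an 8x longer sequence but a 64x larger score matrix, which is the quadratic growth in action.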

In [None]:
# Compute attention scores for a small example
batch_size = 1
seq_len = 8
d_model = 64

# Random Q and K matrices
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)

# Compute attention scores: QK^T
# Q: [batch, seq_len, d_model]
# K^T: [batch, d_model, seq_len]  (transposed last two dims)
# Result: [batch, seq_len, seq_len]
scores = torch.matmul(Q, K.transpose(-2, -1))

print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"Scores shape: {scores.shape}")
print(f"\nScore matrix (first batch):")
print(scores[0].numpy())

In [None]:
# Visualize the score matrix
plt.figure(figsize=(8, 6))
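# 'RdBu_r' maps high values to red and low values to blue;
# symmetric limits keep white centered on a score of 0
vmax = scores[0].abs().max().item()
plt.imshow(scores[0].numpy(), cmap='RdBu_r', vmin=-vmax, vmax=vmax, aspect='auto')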
plt.colorbar(label='Attention Score')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('QK^T Attention Scores\n(Red = positive/similar, Blue = negative/dissimilar)')
plt.tight_layout()
plt.show()

### Why Scale by √d?

When we sum $d$ random products, the variance grows with $d$. If the entries of $Q$ and $K$ are independent with zero mean and variance 1:

$$\text{Var}(Q \cdot K) = d \cdot \text{Var}(q_i) \cdot \text{Var}(k_i) = d$$

Dividing by $\sqrt{d}$ normalizes the variance back to 1, keeping scores in a reasonable range for softmax.
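
Filling in the steps behind that formula, assuming all entries $q_i$ and $k_i$ are mutually independent with zero mean and unit variance:

$$
\begin{aligned}
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\text{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \mathbb{E}[q_i k_i]^2 = 1 \cdot 1 - 0 = 1 \\
\text{Var}(Q \cdot K) &= \sum_{i=1}^{d} \text{Var}(q_i k_i) = d \\
\text{Var}\left(\frac{Q \cdot K}{\sqrt{d}}\right) &= \frac{1}{d}\,\text{Var}(Q \cdot K) = 1
\end{aligned}
$$

The third line uses the fact that variances of independent terms add. Trained queries and keys need not satisfy these assumptions exactly, so this is a justification for the scale of the correction rather than a guarantee.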

In [None]:
# Demonstrate variance scaling
d_values = [4, 16, 64, 256, 1024]

print("Effect of dimension on dot product variance:")
print("-" * 50)

for d in d_values:
    # Random unit-variance vectors
    q = torch.randn(10000, d)
    k = torch.randn(10000, d)
    
    # Raw dot products
    raw_dots = (q * k).sum(dim=1)
    
    # Scaled dot products
    scaled_dots = raw_dots / np.sqrt(d)
    
    print(f"d={d:4d}: raw variance={raw_dots.var():.2f}, scaled variance={scaled_dots.var():.2f}")

---
## Step 4: Code It — Efficient Dot Products (30 min)

Let's implement attention score computation efficiently.

In [None]:
def naive_attention_scores(Q, K):
    """
    Compute attention scores using explicit loops.
    Q: [batch, seq_len, d_model]
    K: [batch, seq_len, d_model]
    Returns: [batch, seq_len, seq_len]
    """
    batch_size, seq_len, d_model = Q.shape
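    # Allocate the output with Q's dtype and device so the function also works on GPU tensors
    scores = torch.zeros(batch_size, seq_len, seq_len, dtype=Q.dtype, device=Q.device)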
    
    for b in range(batch_size):
        for i in range(seq_len):  # query position
            for j in range(seq_len):  # key position
                # Dot product between query i and key j
                dot = 0.0
                for k in range(d_model):
                    dot += Q[b, i, k] * K[b, j, k]
                scores[b, i, j] = dot / np.sqrt(d_model)
    
    return scores

def batched_attention_scores(Q, K):
    """
    Compute attention scores using matrix multiplication.
    A single batched matmul replaces all four loops and runs in an optimized library kernel.
    """
    d_model = Q.shape[-1]
    # Q @ K^T computes all dot products at once
    return torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(d_model)

In [None]:
# Verify they produce the same result
Q_test = torch.randn(1, 4, 8)
K_test = torch.randn(1, 4, 8)

scores_naive = naive_attention_scores(Q_test, K_test)
scores_batched = batched_attention_scores(Q_test, K_test)

print("Naive scores:")
print(scores_naive[0].numpy())
print("\nBatched scores:")
print(scores_batched[0].numpy())
print(f"\nMax difference: {(scores_naive - scores_batched).abs().max():.2e}")

In [None]:
# Benchmark: loops vs batched
import time

# Small size because the naive version is slow; both versions are timed on the same inputs
Q_small = torch.randn(1, 16, 32)
K_small = torch.randn(1, 16, 32)

# Warm up so one-time setup costs are not included in the timing
_ = batched_attention_scores(Q_small, K_small)

start = time.perf_counter()
for _ in range(10):
    _ = naive_attention_scores(Q_small, K_small)
naive_time = (time.perf_counter() - start) / 10

start = time.perf_counter()
for _ in range(10):
    _ = batched_attention_scores(Q_small, K_small)
batched_time = (time.perf_counter() - start) / 10

print(f"Naive: {naive_time*1000:.3f} ms")
print(f"Batched: {batched_time*1000:.3f} ms")
print(f"Speedup: {naive_time/batched_time:.1f}x")

### Exercise: Implement Scaled Dot-Product Attention Scores

Complete the function below that:
1. Computes Q @ K^T
2. Scales by √d_k
3. Applies a causal mask (optional)
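
To make step 3 concrete, here is a minimal NumPy sketch of what a causal mask looks like for a length-4 sequence. The placeholder scores are just the integers 0 to 15 so that masked entries are easy to spot; the variable names are illustrative.

```python
import numpy as np

seq_len = 4
# True above the main diagonal: key position j > query position i (the future).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.arange(16, dtype=float).reshape(seq_len, seq_len)  # placeholder scores
masked = np.where(mask, -np.inf, scores)

print(mask.astype(int))
print(masked)
```

Row `i` keeps columns `0..i` and replaces everything to the right with `-inf`, so the last row is left untouched.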

In [None]:
def scaled_dot_product_scores(Q, K, causal=False):
    """
    Compute scaled dot-product attention scores.
    
    Args:
        Q: Query tensor [batch, seq_len, d_k]
        K: Key tensor [batch, seq_len, d_k]
        causal: If True, mask future positions with -inf
    
    Returns:
        scores: Attention scores [batch, seq_len, seq_len]
    """
    d_k = Q.shape[-1]
    seq_len = Q.shape[1]
    
    # TODO: Compute Q @ K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # TODO: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)
    
    # TODO: Apply causal mask if requested
    # Causal mask: position i can only attend to positions <= i
    if causal:
        # Create upper triangular mask (True for positions to mask)
        # Build the mask on the same device as the scores so this also works on GPU
        mask = torch.triu(torch.ones(seq_len, seq_len, device=Q.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
    
    return scores

# Test
Q_test = torch.randn(1, 4, 8)
K_test = torch.randn(1, 4, 8)

print("Without causal mask:")
print(scaled_dot_product_scores(Q_test, K_test, causal=False)[0].numpy())

print("\nWith causal mask:")
print(scaled_dot_product_scores(Q_test, K_test, causal=True)[0].numpy())

### GPU Implementation with Triton

For completeness, here's how you'd write a dot product kernel in Triton. We'll use this pattern extensively later.
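
Before reading the kernel, it may help to see its chunk-and-accumulate pattern in plain NumPy. The sketch below is a CPU analogy, not Triton code: `chunked_dot` is a hypothetical helper, and the `np.where` calls stand in for the masked loads that pad the ragged final block with zeros.

```python
import numpy as np

def chunked_dot(a, b, block_size=1024):
    """Dot product via block-wise accumulation, mirroring the kernel's loop."""
    n = a.shape[0]
    acc = np.zeros(block_size, dtype=np.float32)      # one partial sum per lane
    for start in range(0, n, block_size):
        offsets = start + np.arange(block_size)
        mask = offsets < n                             # guard the ragged last block
        safe = np.where(mask, offsets, 0)              # keep indices in bounds
        a_vals = np.where(mask, a[safe], 0.0)          # masked lanes contribute 0
        b_vals = np.where(mask, b[safe], 0.0)
        acc += (a_vals * b_vals).astype(np.float32)
    return float(acc.sum())                            # final reduction across lanes

rng = np.random.default_rng(0)
a = rng.standard_normal(2500).astype(np.float32)
b = rng.standard_normal(2500).astype(np.float32)
print(chunked_dot(a, b), float(np.dot(a, b)))
```

The two printed values should agree up to float32 rounding. Keeping a vector of per-lane partial sums and reducing once at the end is the same structure the kernel uses with `acc` and `tl.sum`.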

In [None]:
import triton
import triton.language as tl

@triton.jit
def dot_product_kernel(
    a_ptr, b_ptr, output_ptr,
    N,  # Vector length
    BLOCK_SIZE: tl.constexpr,
):
    """
    Compute dot product of two vectors a and b.
    This kernel uses a single program to sum all elements.
    """
    # Accumulator
    acc = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
    
    # Process in chunks of BLOCK_SIZE
    for start in range(0, N, BLOCK_SIZE):
        offsets = start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < N
        
        # Load elements
        a_vals = tl.load(a_ptr + offsets, mask=mask, other=0.0)
        b_vals = tl.load(b_ptr + offsets, mask=mask, other=0.0)
        
        # Accumulate products
        acc += a_vals * b_vals
    
    # Sum across the block and store
    result = tl.sum(acc)
    tl.store(output_ptr, result)

def triton_dot_product(a, b):
    """Wrapper to call the Triton dot product kernel."""
    assert a.shape == b.shape
    assert a.is_cuda and b.is_cuda
    
    N = a.numel()
    output = torch.zeros(1, device=a.device)
    
    BLOCK_SIZE = 1024
    dot_product_kernel[(1,)](
        a, b, output,
        N,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    
    return output[0]

In [None]:
# Test Triton implementation (requires GPU)
if torch.cuda.is_available():
    a_gpu = torch.randn(1024, device='cuda')
    b_gpu = torch.randn(1024, device='cuda')
    
    triton_result = triton_dot_product(a_gpu, b_gpu)
    torch_result = torch.dot(a_gpu, b_gpu)
    
    print(f"Triton result: {triton_result:.4f}")
    print(f"PyTorch result: {torch_result:.4f}")
    print(f"Difference: {abs(triton_result - torch_result):.2e}")
else:
    print("GPU not available. Triton kernel test skipped.")

---
## Step 5: Verify — Quiz & Reflection (10 min)

### Quiz

In [None]:
def check_answer(question, your_answer, correct_answer):
    if your_answer == correct_answer:
        print(f"✓ Correct! {question}")
        return True
    else:
        print(f"✗ Incorrect. {question}")
        print(f"  Your answer: {your_answer}")
        print(f"  Correct answer: {correct_answer}")
        return False

# Q1: What is the dot product of [1, 2, 3] and [4, 5, 6]?
q1_answer = 1*4 + 2*5 + 3*6  # Replace with your calculation
check_answer("[1,2,3] · [4,5,6]", q1_answer, 32)

In [None]:
# Q2: If Q has shape [2, 100, 64] and K has shape [2, 100, 64],
# what is the shape of QK^T?
q2_answer = (2, 100, 100)  # Replace with your answer as a tuple
check_answer("Shape of QK^T", q2_answer, (2, 100, 100))

In [None]:
# Q3: For a sequence length of 4096 and d_model=512, 
# how many scalar multiplications are needed to compute QK^T?
seq_len = 4096
d_model = 512
# Hint: We compute seq_len^2 dot products, each with d_model multiplications
q3_answer = seq_len * seq_len * d_model  # Replace with your calculation
check_answer("Scalar multiplications", q3_answer, 4096 * 4096 * 512)

In [None]:
# Q4: Why do we divide by sqrt(d_k)?
# a) To make computation faster
# b) To normalize variance of dot products
# c) To make values positive
# d) To reduce memory usage
q4_answer = 'b'  # Replace with your answer
check_answer("Why divide by sqrt(d_k)?", q4_answer, 'b')

### Reflection Questions

1. **Memory access pattern:** In the batched computation Q @ K^T, what's the memory access pattern? Is it efficient for GPU coalescing?

2. **The scaling factor:** What would happen to softmax (next lesson) if we didn't scale by √d?

3. **Causal masking:** Why do we mask with -∞ instead of 0?
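
If you want to experiment with questions 2 and 3 before the next lesson, here is a small NumPy sketch. The `softmax` helper is a hypothetical stand-in written for this example, and the score values are made up.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, shifted by the max before exponentiating."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Question 3: masking the last position with 0 versus -inf.
masked_with_zero = softmax(np.array([2.0, 1.0, 0.0]))
masked_with_neg_inf = softmax(np.array([2.0, 1.0, -np.inf]))
print("mask = 0:   ", masked_with_zero)      # last weight stays above zero
print("mask = -inf:", masked_with_neg_inf)   # last weight is exactly zero

# Question 2: the same score pattern at unscaled magnitude for d = 1024.
small = np.array([1.0, 0.5, -0.5])
print("scaled:  ", softmax(small))
print("unscaled:", softmax(small * np.sqrt(1024)))   # nearly one-hot
```

Try changing the values and the dimension to see how quickly the unscaled distribution collapses onto a single position.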

---

## Summary

| Concept | Key Takeaway |
|---------|-------------|
| Dot product | Measures similarity: positive = similar, zero = orthogonal, negative = opposite |
| QK^T | Computes all pairwise attention scores as a matrix multiplication |
| √d scaling | Normalizes variance to keep softmax inputs in reasonable range |
| Complexity | O(N² × d) — quadratic in sequence length |

**Next:** The scores from QK^T can be large (positive or negative). Tomorrow we'll see how softmax converts them to probabilities — and why naive softmax breaks spectacularly.

---

**Interactive Reference:** [attention-math.html](../attention-math.html) Section 1 — Dot Product Calculator