# Week 1, Day 5: Memory Hierarchy

**Time:** ~1 hour

**Goal:** Understand why memory access patterns determine GPU performance.

## The Challenge

A naive matmul kernel can be 10-50x slower than cuBLAS. Why? Compute is rarely the limit, because modern GPUs have massive arithmetic throughput. The bottleneck is **memory bandwidth**.

Today we'll learn:
1. The GPU memory hierarchy
2. Why coalescing matters (10-32x impact)
3. How to think about arithmetic intensity

In [None]:
import numpy as np
import time

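try:
    import torch
    import triton
    import triton.language as tl
    # The imports can succeed on a CPU-only machine, so also check for a CUDA device
    GPU_AVAILABLE = torch.cuda.is_available()
except ImportError:
    GPU_AVAILABLE = False

if not GPU_AVAILABLE:
    print("GPU libraries or CUDA device not available - conceptual content still applies")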

---
## Step 1: The Challenge (5 min)

**The latency gap is enormous:**

| Memory Level | Latency | Bandwidth |
|-------------|---------|----------|
| Registers | ~1 cycle | ~20 TB/s |
| Shared Memory (SMEM) | ~20 cycles | ~20 TB/s |
| L2 Cache | ~200 cycles | ~10 TB/s |
| HBM (Global) | ~400 cycles | ~2-8 TB/s |

**Key insight:** A read from HBM takes roughly 400x longer than a read from registers! The figures above are order-of-magnitude estimates and vary by GPU generation.

In [None]:
# Visualize the latency gap
latencies = {
    'Registers': 1,
    'Shared Memory': 20,
    'L2 Cache': 200,
    'HBM (Global)': 400,
}

print("Memory Latency Comparison (normalized to registers)")
print("=" * 60)
for name, latency in latencies.items():
    bar = "#" * max(1, latency // 5)  # at least one mark so registers stay visible
    print(f"{name:20} | {bar} {latency}x")

---
## Step 2: Explore - Memory Coalescing (15 min)

### What is Coalescing?

When threads in a warp access memory, the hardware tries to **combine** their requests into as few transactions as possible.

**Coalesced access:** Adjacent threads access adjacent memory addresses (addresses here are indices of 4-byte elements)
- Thread 0 → Address 0
- Thread 1 → Address 1  
- Thread 2 → Address 2
- ...

**Result:** One 128-byte transaction serves all 32 threads (4 bytes each).

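**Non-coalesced access:** Threads access scattered addresses (here, a stride of 32 elements, or 128 bytes)
- Thread 0 → Address 0
- Thread 1 → Address 32
- Thread 2 → Address 64
- ...

**Result:** Every thread lands in a different 128-byte line, so the warp needs up to 32 separate transactions!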

In [None]:
def visualize_memory_access(pattern_name, thread_to_address):
    """Visualize how threads access memory."""
    
    # Simulate 32 threads
    num_threads = 32
    
    # Calculate addresses each thread accesses
    addresses = [thread_to_address(tid) for tid in range(num_threads)]
    
    # Group by 128-byte cache lines (assuming 4-byte floats)
    cache_line_size = 32  # 32 floats = 128 bytes
    cache_lines = set(addr // cache_line_size for addr in addresses)
    
    print(f"\n{pattern_name}")
    print("=" * 50)
    print(f"Thread addresses (first 8): {addresses[:8]}")
    print(f"Cache lines touched: {len(cache_lines)}")
    print(f"Transactions needed: {len(cache_lines)}")
    
    if len(cache_lines) == 1:
        print("Status: OPTIMAL (fully coalesced)")
    elif len(cache_lines) <= 4:
        print("Status: OK (partially coalesced)")
    else:
        print("Status: BAD (poorly coalesced)")

# Good: Coalesced access
visualize_memory_access(
    "Coalesced: data[threadIdx]",
    lambda tid: tid  # Thread i accesses element i
)

# Bad: Strided access
visualize_memory_access(
    "Strided: data[threadIdx * 32]",
    lambda tid: tid * 32  # Thread i accesses element i*32
)

# Bad: Random access
np.random.seed(42)
random_addrs = np.random.permutation(1024)[:32]
visualize_memory_access(
    "Random: data[random[threadIdx]]",
    lambda tid: random_addrs[tid]
)
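
The transaction counts above generalize to any stride or starting offset. The following is a minimal sketch under the same simplified model used in this notebook (one transaction per 128-byte line, 4-byte elements, 32 threads per warp). The helper name `lines_touched` is hypothetical, and real hardware may split requests into finer-grained units.

```python
def lines_touched(stride=1, start_elem=0, elem_bytes=4, line_bytes=128, warp_size=32):
    """Count distinct 128-byte lines one warp touches (simplified model)."""
    byte_addrs = [(start_elem + tid * stride) * elem_bytes for tid in range(warp_size)]
    return len({addr // line_bytes for addr in byte_addrs})

for stride in (1, 2, 4, 8, 16, 32):
    n_lines = lines_touched(stride=stride)
    efficiency = 1 / n_lines  # share of fetched bytes that is actually used
    print(f"stride {stride:2d}: {n_lines:2d} lines, {efficiency:.0%} of fetched bytes used")

# A contiguous but misaligned access spills into a second line
print(f"stride 1, start at element 1: {lines_touched(start_elem=1)} lines")
```

Under this model, every doubling of the stride doubles the lines fetched for the same useful data, until each thread has a line of its own at stride 32.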

### Row-Major vs Column-Major Access

For a 2D matrix stored in row-major order:

```
Memory layout: [row0_col0, row0_col1, row0_col2, ..., row1_col0, row1_col1, ...]
```

**Good:** Threads access elements along a row (consecutive in memory)

**Bad:** Threads access elements down a column (strided in memory)

In [None]:
# Demonstrate row-major layout
def show_memory_layout(rows, cols):
    """Show how 2D matrix maps to linear memory."""
    
    print(f"Matrix shape: {rows}x{cols}")
    print(f"\n2D view:")
    
    matrix = np.arange(rows * cols).reshape(rows, cols)
    print(matrix)
    
    print(f"\nLinear memory (row-major):")
    print(matrix.flatten())
    
    print(f"\nAccessing row 0: elements {list(matrix[0, :])} (consecutive!)")
    print(f"Accessing col 0: elements {list(matrix[:, 0])} (strided by {cols})")

show_memory_layout(4, 8)
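
The stride difference can also be read directly off a NumPy array. A small sketch, assuming a C-contiguous (row-major) `float32` array; `linear_index` is a hypothetical helper used only for illustration:

```python
import numpy as np

def linear_index(row, col, num_cols):
    """Row-major offset (in elements) of matrix[row, col]."""
    return row * num_cols + col

rows, cols = 4, 8
a = np.zeros((rows, cols), dtype=np.float32)  # NumPy defaults to row-major

# a.strides gives the byte step for moving one index along each axis
row_step, col_step = a.strides
print(f"Step to next column (along a row): {col_step} bytes")
print(f"Step to next row (down a column):  {row_step} bytes")

# Walking a row hits consecutive elements; walking a column jumps by `cols`
along_row = [linear_index(0, c, cols) for c in range(4)]
down_col = [linear_index(r, 0, cols) for r in range(4)]
print(f"Along row 0: {along_row}")
print(f"Down col 0:  {down_col}")
```

If adjacent threads take adjacent columns, their addresses are 4 bytes apart and coalesce. If they take adjacent rows, their addresses are a full row apart.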

---
## Step 3: The Concept - Arithmetic Intensity (10 min)

### The Roofline Model

**Arithmetic Intensity** = FLOPs / Bytes transferred (floating-point operations performed per byte moved to or from memory)

A kernel is either:
- **Memory-bound:** Can't feed data fast enough (low arithmetic intensity)
- **Compute-bound:** Can't process fast enough (high arithmetic intensity)

The crossover point is the **ridge point**:
```
Ridge = Peak FLOPS / Peak Bandwidth
```
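
To make the units concrete, here is a worked sketch. The peak numbers are hypothetical round values chosen for illustration, not the specs of any particular GPU, and `ridge_point` and `attainable_flops` are illustrative helper names.

```python
def ridge_point(peak_flops_per_s, peak_bytes_per_s):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops_per_s / peak_bytes_per_s

def attainable_flops(ai, peak_flops_per_s, peak_bytes_per_s):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops_per_s, ai * peak_bytes_per_s)

# Hypothetical GPU: 100 TFLOP/s peak compute, 2 TB/s peak bandwidth
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

print(f"Ridge point: {ridge_point(PEAK_FLOPS, PEAK_BW):.0f} FLOPs/byte")
for ai in (0.1, 10, 50, 200):
    bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    print(f"AI = {ai:>5} FLOPs/byte -> at most {bound / 1e12:.1f} TFLOP/s")
```

Both peaks are given in base units (FLOP/s and bytes/s), so the ratio comes out directly in FLOPs/byte. Mixing scales, such as TFLOP/s with GB/s, would shift the ridge point by a factor of 1000.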

In [None]:
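def analyze_arithmetic_intensity(operation, flops, bytes_transferred, peak_flops=1000, peak_bw=2000):
    """Analyze whether an operation is memory or compute bound.

    peak_flops and peak_bw must use matching scales (e.g. GFLOP/s and GB/s)
    so that their ratio is in FLOPs/byte. The defaults are illustrative
    placeholders, not the specs of a real GPU.
    """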
    
    ai = flops / bytes_transferred
    ridge = peak_flops / peak_bw
    
    # Achieved performance
    if ai < ridge:
        # Memory bound: limited by bandwidth
        achieved_flops = ai * peak_bw
        bound = "MEMORY-BOUND"
    else:
        # Compute bound: limited by compute
        achieved_flops = peak_flops
        bound = "COMPUTE-BOUND"
    
    efficiency = achieved_flops / peak_flops * 100
    
    print(f"\n{operation}")
    print("=" * 50)
    print(f"FLOPS: {flops:,}")
    print(f"Bytes: {bytes_transferred:,}")
    print(f"Arithmetic Intensity: {ai:.2f} FLOPS/byte")
    print(f"Ridge point: {ridge:.2f} FLOPS/byte")
    print(f"Status: {bound}")
    print(f"Max achievable: {efficiency:.0f}% of peak FLOPS")

# Example 1: Vector addition (C = A + B)
# FLOPS: N additions
# Bytes: Read 2N floats + Write N floats = 3N * 4 bytes
N = 1_000_000
analyze_arithmetic_intensity(
    "Vector Addition (C = A + B)",
    flops=N,
    bytes_transferred=3 * N * 4
)

# Example 2: Matrix multiplication (C = A @ B), naive
# FLOPS: 2 * N^3 (for NxN matrices)
# Bytes (naive, no reuse): each of the N^2 outputs reads a row of A and a
# column of B (2N floats) and is written once = (2N^3 + N^2) * 4 bytes
N = 1024
analyze_arithmetic_intensity(
    "Naive MatMul (no data reuse)",
    flops=2 * N**3,
    bytes_transferred=(2 * N**3 + N**2) * 4  # Each A and B element read N times!
)

# Example 3: Matrix multiplication with tiling
# Bytes (ideal reuse): Each element moves to or from HBM once = 3N^2 * 4 bytes
analyze_arithmetic_intensity(
    "Tiled MatMul (with data reuse)",
    flops=2 * N**3,
    bytes_transferred=3 * N**2 * 4  # Best case: each element touched only once
)

### Key Insight: Data Reuse

The difference between naive and optimized matmul is **data reuse**:
- Naive: Load each input element N times from slow global memory
- Tiled: Load each element from global memory far fewer times (once in the ideal case) and reuse it from fast shared memory

This is why tiling/blocking is essential for matmul!
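
How much reuse tiling buys depends on the tile size. Below is a rough traffic model, assuming square N x N `float32` matrices, square T x T tiles with T dividing N, and no caching beyond the tile itself (`tiled_matmul_bytes` is a hypothetical helper). Each output tile needs N/T tile pairs from A and B, so global reads shrink by a factor of T compared with the naive kernel.

```python
def tiled_matmul_bytes(n, tile, elem_bytes=4):
    """Estimated global-memory traffic for C = A @ B with tile x tile blocks."""
    loads = 2 * n**3 // tile   # A and B elements, each fetched n / tile times
    stores = n**2              # each C element written once
    return (loads + stores) * elem_bytes

N = 1024
flops = 2 * N**3
for tile in (1, 16, 32, 64, 128):
    traffic = tiled_matmul_bytes(N, tile)
    print(f"tile {tile:3d}: {traffic / 1e9:7.2f} GB moved, AI = {flops / traffic:6.2f} FLOPs/byte")
```

A tile size of 1 reproduces the naive kernel. Real tile sizes are capped by shared memory and register capacity, so loading each element exactly once is a bound rather than something a kernel reaches for large N.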

---
## Step 4: Code It - Coalescing Demo (30 min)

Let's write kernels that demonstrate coalescing impact.

In [None]:
if GPU_AVAILABLE:
    @triton.jit
    def copy_coalesced_kernel(
        src_ptr, dst_ptr, n_elements,
        BLOCK_SIZE: tl.constexpr,
    ):
        """Copy with coalesced access pattern."""
        pid = tl.program_id(0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        
        # Coalesced: consecutive threads access consecutive addresses
        data = tl.load(src_ptr + offsets, mask=mask)
        tl.store(dst_ptr + offsets, data, mask=mask)
    
    @triton.jit
    def copy_strided_kernel(
        src_ptr, dst_ptr, n_elements, stride,
        BLOCK_SIZE: tl.constexpr,
    ):
        """Copy with strided access pattern (BAD)."""
        pid = tl.program_id(0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        
        # Strided: threads access elements `stride` apart (non-coalesced!)
        # `mask` already bounds the strided offsets, since offsets < n_elements
        # implies offsets * stride < n_elements * stride.
        strided_offsets = offsets * stride
        
        data = tl.load(src_ptr + strided_offsets, mask=mask)
        tl.store(dst_ptr + strided_offsets, data, mask=mask)
    
    print("Kernels defined! (Triton compiles them on first launch)")

In [None]:
if GPU_AVAILABLE:
    def benchmark_copy(kernel_fn, src, dst, n, stride=1, warmup=10, repeat=100):
        """Benchmark a copy kernel; returns (time in ms, effective bandwidth in GB/s)."""
        BLOCK_SIZE = 1024
        grid = (triton.cdiv(n, BLOCK_SIZE),)
        
        # Only the strided kernel takes a `stride` argument
        extra_args = () if stride == 1 else (stride,)
        
        def launch():
            kernel_fn[grid](src, dst, n, *extra_args, BLOCK_SIZE=BLOCK_SIZE)
        
        # Warmup (the first launch also triggers Triton's JIT compilation)
        for _ in range(warmup):
            launch()
        torch.cuda.synchronize()
        
        # Benchmark
        start = time.perf_counter()
        for _ in range(repeat):
            launch()
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / repeat
        
        # Effective bandwidth: counts only the useful bytes (read + write),
        # not the extra cache-line traffic caused by strided access
        bytes_transferred = 2 * n * 4
        bandwidth = bytes_transferred / elapsed / 1e9  # GB/s
        
        return elapsed * 1000, bandwidth
    
    # Test
    n = 10_000_000
    src = torch.randn(n * 32, device='cuda', dtype=torch.float32)  # 32x larger so stride-32 offsets stay in bounds (~1.3 GB per tensor)
    dst = torch.empty_like(src)
    
    print(f"Copy {n:,} elements ({n * 4 / 1e6:.1f} MB)")
    print()
    
    time_coal, bw_coal = benchmark_copy(copy_coalesced_kernel, src, dst, n)
    print(f"Coalesced:     {time_coal:.3f} ms, {bw_coal:.0f} GB/s")
    
    time_stride, bw_stride = benchmark_copy(copy_strided_kernel, src, dst, n, stride=32)
    print(f"Strided (32):  {time_stride:.3f} ms, {bw_stride:.0f} GB/s")
    
    print()
    print(f"Coalescing speedup: {time_stride / time_coal:.1f}x")

---
## Step 5: Verify - Understanding Check (10 min)

In [None]:
# Q1: Why is HBM access slow?
print("Q1: HBM is off-chip memory")
print("    - Far from compute units (~400 cycles latency)")
print("    - Limited bandwidth compared to on-chip memory")
print("    - Must be accessed efficiently (coalescing) for good throughput")

In [None]:
# Q2: What makes access coalesced?
print("Q2: Coalesced access requires:")
print("    1. Adjacent threads access adjacent addresses")
print("    2. Addresses should be aligned (128-byte boundary ideal)")
print("    3. Together, the warp's 32 accesses fall in as few 128-byte lines as possible")

In [None]:
# Q3: When is a kernel memory-bound vs compute-bound?
print("Q3: Arithmetic Intensity determines the bound:")
print("    AI = FLOPS / Bytes")
print("    If AI < Ridge point: MEMORY-BOUND")
print("    If AI > Ridge point: COMPUTE-BOUND")
print("")
print("    Examples:")
print("    - Vector add: AI ≈ 0.08 (memory-bound)")
print("    - Tiled matmul: AI ≈ 100+ (can be compute-bound)")

In [None]:
# Q4: How does tiling help matmul?
print("Q4: Tiling improves data reuse:")
print("    Without tiling: Each input element loaded N times from HBM")
print("    With tiling: Each element loaded far fewer times (ideally once), reused from SMEM")
print("")
print("    Result: Arithmetic intensity increases by up to a factor of ~N")
print("    This can convert matmul from memory-bound to compute-bound!")

---
## Summary

| Concept | Impact | Optimization |
|---------|--------|-------------|
| Memory Hierarchy | 400x latency gap | Keep data in fast memory |
| Coalescing | 10-32x bandwidth | Adjacent threads → adjacent addresses |
| Arithmetic Intensity | Determines bottleneck | Increase data reuse |
| Tiling | Enables compute-bound | Load once, use many times |

### Interactive Resources

For interactive visualizations of these concepts, see:
- [Memory Hierarchy Lesson](../lessons/memory-hierarchy.html) - Coalescing visualization, bank conflicts, bandwidth calculator

### Key Takeaways

1. **Many kernels are memory-bound** - optimize memory first
2. **Coalescing is critical** - can make 10-32x difference
3. **Data reuse is the key to performance** - that's why tiling matters

---
## Next: Day 6 - Tiling Basics

Tomorrow we'll learn how to implement tiling to achieve data reuse in shared memory.

[Continue to 06_tiling_basics.ipynb](./06_tiling_basics.ipynb)