# Week 2, Day 1: Profile Like a Pro

**Time:** ~1 hour

**Goal:** Learn to profile GPU kernels and identify bottlenecks using real metrics.

## The Challenge

Your Week 1 matmul hits 500 GFLOPS. The H100 can do 990 TFLOPS (FP16 Tensor Cores).
That's roughly a **2000x gap**. Where's the bottleneck?
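
As a quick sanity check on that figure, the sketch below simply divides the two numbers quoted above. Nothing here is measured.

```python
# Numbers quoted in the text above, not measurements
week1_gflops = 500        # Week 1 Triton matmul
h100_peak_tflops = 990    # H100 FP16 Tensor Core peak

# Convert both to FLOPS before dividing
gap = (h100_peak_tflops * 1e12) / (week1_gflops * 1e9)
print(f"Gap: {gap:.0f}x")  # Gap: 1980x
```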

Today we learn to answer: **Is my kernel compute-bound or memory-bound?**

In [None]:
import numpy as np
import triton
import triton.language as tl
import torch
import subprocess
import os

---
## Step 1: The Challenge (5 min)

Let's bring back our tiled matmul from Week 1 and measure its actual performance.

In [None]:
@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    """Tiled matrix multiplication kernel from Week 1."""
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    
    # Block starting positions
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    
    # Pointers to first block of A and B
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    
    # Accumulator
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    
    # Main loop over K dimension
    for k in range(0, K, BLOCK_K):
        # Load tiles with boundary checks
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        
        # Accumulate
        acc += tl.dot(a, b)
        
        # Advance pointers
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    
    # Store result
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=mask)

In [None]:
def matmul_triton(a, b, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    """Wrapper for Triton matmul kernel."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    
    matmul_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c

---
## Step 2: Explore (15 min)

### The Three Key Metrics

Every GPU kernel can be characterized by three numbers:

| Metric | What It Measures | Unit | How to Get It |
|--------|-----------------|------|---------------|
| **Achieved FLOPS** | Compute throughput | TFLOPS | `2*M*N*K / time` |
| **Memory Bandwidth** | Data movement rate | GB/s | `bytes_moved / time` |
| **Occupancy** | Share of an SM's warp slots in use | % | `active_warps / max_warps` |
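
To make the first two formulas concrete, here is a minimal sketch that turns a kernel time into both numbers. The helper name `metrics_from_time` and the 1 ms timing are illustrative assumptions, not measurements. The byte count assumes FP16 inputs and an FP32 output, each moved exactly once, so the bandwidth figure is a lower bound on real traffic.

```python
def metrics_from_time(M, N, K, time_ms, in_bytes=2, out_bytes=4):
    """Convert a matmul time into (TFLOPS, GB/s). Hypothetical helper."""
    seconds = time_ms * 1e-3
    # One multiply and one add per term of each dot product
    flops = 2 * M * N * K
    # A and B read once, C written once (minimum possible traffic)
    min_bytes = (M * K + K * N) * in_bytes + M * N * out_bytes
    return flops / seconds / 1e12, min_bytes / seconds / 1e9

# Assumed timing of 1 ms for a 4096^3 matmul, purely for illustration
tflops, gbps = metrics_from_time(4096, 4096, 4096, time_ms=1.0)
print(f"{tflops:.1f} TFLOPS, {gbps:.1f} GB/s")
```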

Let's measure each one.

In [None]:
def benchmark_matmul(M, N, K, num_runs=100, warmup=10):
    """Benchmark matmul and return performance metrics."""
    a = torch.randn(M, K, device='cuda', dtype=torch.float16)
    b = torch.randn(K, N, device='cuda', dtype=torch.float16)
    
    # Warmup
    for _ in range(warmup):
        c = matmul_triton(a, b)
    
    torch.cuda.synchronize()
    
    # Benchmark
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    
    start.record()
    for _ in range(num_runs):
        c = matmul_triton(a, b)
    end.record()
    
    torch.cuda.synchronize()
    time_ms = start.elapsed_time(end) / num_runs
    
    # Calculate metrics
    flops = 2 * M * N * K  # multiply-add = 2 ops
    tflops = flops / (time_ms * 1e-3) / 1e12
    
    # Memory (lower bound): read A and B in FP16, write C in FP32.
    # Real HBM traffic is higher whenever tiles are re-read.
    bytes_accessed = (M * K + K * N) * 2 + M * N * 4
    bandwidth_gb = bytes_accessed / (time_ms * 1e-3) / 1e9
    
    return {
        'time_ms': time_ms,
        'tflops': tflops,
        'bandwidth_gb': bandwidth_gb,
        'flops': flops,
        'bytes': bytes_accessed,
    }

In [None]:
# Benchmark our kernel
sizes = [(1024, 1024, 1024), (2048, 2048, 2048), (4096, 4096, 4096)]

print("Matmul Performance")
print("=" * 60)
print(f"{'Size':<20} {'Time (ms)':<12} {'TFLOPS':<12} {'BW (GB/s)':<12}")
print("-" * 60)

for M, N, K in sizes:
    metrics = benchmark_matmul(M, N, K)
    print(f"{f'{M}x{N}x{K}':<20} {metrics['time_ms']:<12.3f} {metrics['tflops']:<12.2f} {metrics['bandwidth_gb']:<12.1f}")

---
## Step 3: The Concept (10 min)

### The Roofline Model

The **roofline model** tells us whether a kernel is limited by compute or memory.

```
    TFLOPS
      │
      │                 ┌───────────────── Compute roof (peak TFLOPS)
      │                ╱:
      │              ╱  :
      │            ╱    :
      │          ╱      :      Sloped roof: slope = peak memory bandwidth
      │        ╱        :
      └─────────────────┴───────────────── Arithmetic Intensity (FLOPS/byte)
          memory-bound  :  compute-bound
                  (balance point)
```

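**Key insight:**
- If your kernel's arithmetic intensity is left of the balance point, it sits under the sloped line → **memory-bound** (move fewer bytes: better tiling and access patterns)
- If it is right of the balance point, it sits under the flat line → **compute-bound** (need faster math, e.g. Tensor Cores)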
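
The roofline itself is one line of arithmetic: attainable throughput is the smaller of the compute roof and arithmetic intensity times bandwidth. A minimal sketch, assuming the H100 figures used in this notebook (990 TFLOPS, 3.35 TB/s); `attainable_tflops` is a hypothetical helper name.

```python
def attainable_tflops(ai, peak_tflops=990.0, peak_bw_tbs=3.35):
    """Roofline ceiling for a kernel with arithmetic intensity `ai` (FLOPS/byte)."""
    # (FLOPS/byte) * (TB/s) = TFLOPS, so the units line up without conversion
    return min(peak_tflops, ai * peak_bw_tbs)

# Intensity at which the sloped roof meets the flat roof
balance_point = 990.0 / 3.35

for ai in (1, 32, 1000):
    regime = "memory-bound" if ai < balance_point else "compute-bound"
    print(f"AI = {ai:>4} FLOPS/byte -> at most {attainable_tflops(ai):6.1f} TFLOPS ({regime})")
```

Note that the ceiling is an upper bound: a kernel can fall well below it for reasons the model ignores, such as poor coalescing or low occupancy.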

### Arithmetic Intensity

$$\text{Arithmetic Intensity} = \frac{\text{FLOPS}}{\text{Bytes Accessed}}$$

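For matrix multiplication C = A × B:
- FLOPS = 2 × M × N × K
- Bytes (ideal, every element moved once) = (M×K + K×N + M×N) × sizeof(dtype)
- Bytes (naive, no reuse) ≈ 2 × M × N × K × sizeof(dtype), because every output element re-reads a full row of A and a full column of B
- Bytes (tiled) ≈ the naive figure divided by the tile size, because each loaded tile is reused across a whole block of outputs

**The tiling trick:** By keeping tiles in shared memory and reusing them, we cut HBM traffic and push arithmetic intensity up toward the ideal.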
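
A rough derivation shows where the gain comes from. Assume square $T \times T$ output tiles and $s$ bytes per element, and ignore the write of C and any L2 cache hits (all simplifying assumptions). To produce one output tile, a block streams a $T \times K$ strip of A and a $K \times T$ strip of B while performing $2T^2K$ operations:

$$\text{AI}_{\text{tiled}} \approx \frac{2\,T^2 K}{2\,T K\,s} = \frac{T}{s}$$

With $T = 64$ and FP16 ($s = 2$) that is about 32 FLOPS/byte, versus roughly $1/s = 0.5$ FLOPS/byte with no reuse ($T = 1$).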

In [None]:
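def calculate_arithmetic_intensity(M, N, K, dtype_bytes=2, tile=64):
    """Estimate HBM arithmetic intensity (FLOPS/byte) for matmul.

    Returns (ai_naive, ai_tiled, ai_ideal). L2 cache hits are ignored,
    so the naive and tiled numbers are pessimistic.
    """
    flops = 2 * M * N * K
    bytes_c = M * N * dtype_bytes  # C is written once in every scheme

    # Naive (no reuse): each output element re-reads a full row of A
    # and a full column of B, i.e. 2*K loads per output element
    bytes_naive = 2 * M * N * K * dtype_bytes + bytes_c
    ai_naive = flops / bytes_naive

    # Tiled: loads are shared within a tile of C, so A is read N/tile
    # times and B is read M/tile times in total
    bytes_tiled = (M * K * (N / tile) + K * N * (M / tile)) * dtype_bytes + bytes_c
    ai_tiled = flops / bytes_tiled

    # Ideal (perfect reuse): every element of A, B and C moves exactly once
    bytes_ideal = (M * K + K * N) * dtype_bytes + bytes_c
    ai_ideal = flops / bytes_ideal

    return ai_naive, ai_tiled, ai_ideal

# Example: 4096x4096 matmul
M = N = K = 4096
ai_naive, ai_tiled, ai_ideal = calculate_arithmetic_intensity(M, N, K)

print(f"Arithmetic Intensity for {M}x{N}x{K} matmul:")
print(f"  Naive (no reuse):  {ai_naive:.1f} FLOPS/byte")
print(f"  Tiled (64x64):     {ai_tiled:.1f} FLOPS/byte (approximate)")
print(f"  Ideal (read once): {ai_ideal:.1f} FLOPS/byte")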
print()
print("H100 specs:")
print(f"  Peak FP16 Tensor Core: 990 TFLOPS")
print(f"  HBM Bandwidth: 3.35 TB/s")
print(f"  Balance point: {990 / 3.35:.1f} FLOPS/byte")
print()
if ai_tiled > 990 / 3.35:
    print("→ Kernel should be COMPUTE-bound (good for Tensor Cores)")
else:
    print("→ Kernel is MEMORY-bound (need better tiling)")

---
## Step 4: Code It (30 min)

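### Using Triton's Built-in Benchmark Helper

Triton provides `triton.testing.do_bench` for benchmarking: it runs warmup iterations, then times repeated calls and returns a time in milliseconds. It only measures time; for hardware counters you need a profiler such as Nsight Compute.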

In [None]:
from triton.testing import do_bench

def profile_matmul_detailed(M, N, K):
    """Profile matmul with detailed metrics."""
    a = torch.randn(M, K, device='cuda', dtype=torch.float16)
    b = torch.randn(K, N, device='cuda', dtype=torch.float16)
    
    # Triton's benchmark function handles warmup and statistics
    ms = do_bench(lambda: matmul_triton(a, b))
    
    # Calculate metrics
    flops = 2 * M * N * K
    tflops = flops / (ms * 1e-3) / 1e12
    
    # Compare to cuBLAS
    cublas_ms = do_bench(lambda: torch.mm(a, b))
    cublas_tflops = flops / (cublas_ms * 1e-3) / 1e12
    
    return {
        'triton_ms': ms,
        'triton_tflops': tflops,
        'cublas_ms': cublas_ms,
        'cublas_tflops': cublas_tflops,
        'efficiency': tflops / cublas_tflops * 100,
    }

# Compare against cuBLAS
print("Triton vs cuBLAS")
print("=" * 70)
print(f"{'Size':<15} {'Triton (ms)':<12} {'cuBLAS (ms)':<12} {'Triton TFLOPS':<14} {'Efficiency':<10}")
print("-" * 70)

for size in [1024, 2048, 4096, 8192]:
    metrics = profile_matmul_detailed(size, size, size)
    # Format the percentage first so the % sign stays next to the number
    eff = f"{metrics['efficiency']:.1f}%"
    print(f"{f'{size}x{size}':<15} {metrics['triton_ms']:<12.3f} {metrics['cublas_ms']:<12.3f} "
          f"{metrics['triton_tflops']:<14.2f} {eff:<10}")

### Nsight Compute Profiling (Command Line)

For deep analysis, use NVIDIA's Nsight Compute profiler. Here's how to use it:

```bash
# Profile a Python script
ncu --set full -o profile_output python your_script.py

# Key metrics to look for:
# - sm__throughput.avg.pct_of_peak_sustained_elapsed    (SM utilization)
# - dram__throughput.avg.pct_of_peak_sustained_elapsed  (memory bandwidth)
# - sm__warps_active.avg.pct_of_peak_sustained_active   (achieved occupancy)
```
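
If you would rather assemble the profiler command from the notebook, it can be built as a plain argument list. This is a minimal sketch: `build_ncu_command` is a hypothetical helper and the script name is a placeholder. The flags used (`--set`, `--metrics`, `--kernel-name`, `-o`) are Nsight Compute command-line options. The kernel filter uses a regex because the compiled kernel name may carry a suffix depending on the Triton version.

```python
def build_ncu_command(script, metrics=None, kernel_name=None, output=None):
    """Assemble an ncu command line as a list of arguments (hypothetical helper)."""
    cmd = ["ncu"]
    if metrics:
        cmd += ["--metrics", ",".join(metrics)]   # collect only the named metrics
    else:
        cmd += ["--set", "full"]                  # collect the full section set
    if kernel_name:
        cmd += ["--kernel-name", kernel_name]     # profile only matching kernels
    if output:
        cmd += ["-o", output]                     # write a report file
    return cmd + ["python", script]

cmd = build_ncu_command(
    "your_script.py",
    metrics=[
        "sm__throughput.avg.pct_of_peak_sustained_elapsed",
        "dram__throughput.avg.pct_of_peak_sustained_elapsed",
    ],
    kernel_name="regex:matmul_kernel",
)
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run` would launch the profiler, provided `ncu` is installed and on your `PATH`.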

In [None]:
# Create a script we can profile with ncu
profile_script = '''
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

M = N = K = 4096
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)

grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
matmul_kernel[grid](
    a, b, c, M, N, K,
    a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
    BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
)
torch.cuda.synchronize()
print("Kernel executed successfully")
'''

with open('profile_matmul.py', 'w') as f:
    f.write(profile_script)

print("Created profile_matmul.py")
print("\nTo profile with Nsight Compute:")
print("  ncu --set full python profile_matmul.py")
print("\nOr for a quick summary:")
print("  ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed python profile_matmul.py")

### Understanding Occupancy

**Occupancy** = fraction of maximum warps that are active on an SM.

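Low occupancy gives each SM fewer warps to switch between while others wait on memory, so latency is harder to hide. Treat it as a hint rather than a verdict: kernels with large tiles can still run well at modest occupancy. Common causes: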
1. **Too many registers per thread** → fewer warps fit
2. **Too much shared memory per block** → fewer blocks fit
3. **Block size too large** → fewer blocks scheduled
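
In Triton you do not pick a thread-block size directly: each program runs with `num_warps` warps of 32 threads (4 warps unless you override it), and the compiler decides how much shared memory to allocate. The sketch below gives a rough estimate of both. `triton_launch_footprint` is a hypothetical helper, and the shared-memory formula (one A tile plus one B tile per pipeline stage) is an assumption; read the real figures from the Nsight Compute report.

```python
def triton_launch_footprint(block_m, block_n, block_k,
                            num_warps=4, num_stages=1, in_bytes=2):
    """Rough (threads, shared-memory bytes) per Triton program. Hypothetical helper."""
    threads = num_warps * 32                             # 32 threads per warp
    tile_elems = block_m * block_k + block_k * block_n   # one A tile + one B tile
    smem_bytes = tile_elems * in_bytes * num_stages      # assumed: one copy per stage
    return threads, smem_bytes

# Tile sizes used by matmul_triton in this notebook
threads, smem = triton_launch_footprint(64, 64, 32)
print(f"{threads} threads, {smem / 1024:.0f} KB shared memory per program")
# 128 threads, 8 KB shared memory per program
```

These two values correspond to the `block_size` and `smem_per_block` arguments of the `estimate_occupancy` function below.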

In [None]:
def estimate_occupancy(block_size, registers_per_thread=32, smem_per_block=0):
    """Estimate occupancy for a given configuration.
    
    Based on H100 specs:
    - 65536 registers per SM
    - 228 KB shared memory per SM
    - 2048 threads per SM (64 warps)
    """
    # H100 specs
    REGS_PER_SM = 65536
    SMEM_PER_SM = 228 * 1024  # bytes
    MAX_THREADS_PER_SM = 2048
    MAX_WARPS_PER_SM = 64
    MAX_BLOCKS_PER_SM = 32
    
    threads_per_block = block_size
    warps_per_block = (threads_per_block + 31) // 32
    
    # Limit by registers
    regs_per_block = threads_per_block * registers_per_thread
    blocks_by_regs = REGS_PER_SM // regs_per_block if regs_per_block > 0 else MAX_BLOCKS_PER_SM
    
    # Limit by shared memory
    blocks_by_smem = SMEM_PER_SM // smem_per_block if smem_per_block > 0 else MAX_BLOCKS_PER_SM
    
    # Limit by warps
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    
    # Limit by max blocks
    blocks_per_sm = min(blocks_by_regs, blocks_by_smem, blocks_by_warps, MAX_BLOCKS_PER_SM)
    
    active_warps = blocks_per_sm * warps_per_block
    occupancy = active_warps / MAX_WARPS_PER_SM * 100
    
    return {
        'blocks_per_sm': blocks_per_sm,
        'active_warps': active_warps,
        'occupancy_pct': occupancy,
        'limited_by': 'regs' if blocks_per_sm == blocks_by_regs else 
                      'smem' if blocks_per_sm == blocks_by_smem else
                      'warps' if blocks_per_sm == blocks_by_warps else 'blocks',
    }

# Test different configurations
print("Occupancy Analysis")
print("=" * 70)
print(f"{'Block Size':<12} {'Regs/Thread':<12} {'SMEM (KB)':<12} {'Occupancy':<12} {'Limited By':<12}")
print("-" * 70)

configs = [
    (256, 32, 0),      # Small block, few regs
    (256, 64, 0),      # Small block, more regs
    (256, 32, 48*1024), # With shared memory
    (512, 32, 0),      # Larger block
    (1024, 32, 0),     # Max block size
]

for block_size, regs, smem in configs:
    result = estimate_occupancy(block_size, regs, smem)
    # Format the percentage first so the % sign does not break column alignment
    occ = f"{result['occupancy_pct']:.1f}%"
    print(f"{block_size:<12} {regs:<12} {smem/1024:<12.0f} {occ:<12} {result['limited_by']:<12}")

---
## Step 5: Verify (10 min)

### Exercise: Diagnose This Kernel

Given these metrics from a matmul kernel:
- Achieved: 150 TFLOPS
- Memory bandwidth: 2.8 TB/s (out of 3.35 TB/s peak)
- Occupancy: 45%

**Questions:**
1. Is this kernel compute-bound or memory-bound?
2. What's the efficiency vs peak Tensor Core performance?
3. What optimizations would you try first?
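
One way to approach question 1 is to compute the arithmetic intensity the metrics imply and compare it with the H100 balance point. A short sketch using only the numbers given above; it treats the measured bandwidth as HBM traffic, which is an assumption.

```python
achieved_tflops = 150.0
achieved_bw_tbs = 2.8

# TFLOPS / (TB/s) = FLOPS per byte actually moved
implied_ai = achieved_tflops / achieved_bw_tbs
# H100 peak compute / peak bandwidth, using this notebook's figures
balance_point = 990.0 / 3.35

print(f"Implied intensity: {implied_ai:.1f} FLOPS/byte")
print(f"Balance point:     {balance_point:.1f} FLOPS/byte")
print("memory-bound" if implied_ai < balance_point else "compute-bound")
```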

In [None]:
# Your analysis here
achieved_tflops = 150
memory_bw_tbs = 2.8
occupancy_pct = 45

# H100 peaks
peak_tflops = 990  # FP16 Tensor Core
peak_bw_tbs = 3.35

# Calculate utilizations
compute_util = achieved_tflops / peak_tflops * 100
memory_util = memory_bw_tbs / peak_bw_tbs * 100

print("Kernel Analysis")
print("=" * 40)
print(f"Compute utilization: {compute_util:.1f}%")
print(f"Memory utilization:  {memory_util:.1f}%")
print(f"Occupancy:           {occupancy_pct:.1f}%")
print()

if memory_util > compute_util:
    print("Diagnosis: MEMORY-BOUND")
    print("\nRecommendations:")
    print("1. Increase tile sizes for better data reuse")
    print("2. Move fewer bytes (fuse neighboring ops, keep operands in FP16/BF16)")
    print("3. Check for coalescing issues")
else:
    print("Diagnosis: COMPUTE-BOUND")
    print("\nRecommendations:")
    print("1. Ensure Tensor Cores are being used")
    print("2. Check for warp divergence")
    print("3. Improve occupancy if possible")

### Quiz

**Q1:** A kernel achieves 90% memory bandwidth utilization but only 20% compute utilization. What does this indicate?

A) The kernel is compute-bound  
B) The kernel is memory-bound  
C) The kernel is perfectly balanced  
D) The kernel has a bug

**Q2:** What's the arithmetic intensity of a vector addition kernel (C[i] = A[i] + B[i])?

A) 0.5 FLOPS/byte  
B) 0.083 FLOPS/byte (1 FLOP / 12 bytes for FP32)  
C) 1 FLOP/byte  
D) Depends on vector length

In [None]:
# Answers
print("Q1: B) The kernel is memory-bound")
print("   High memory utilization + low compute utilization = memory-bound")
print()
print("Q2: B) 0.083 FLOPS/byte")
print("   For FP32: 1 FLOP / (4+4+4 bytes) = 1/12 = 0.083")
print("   Vector addition has very low arithmetic intensity - always memory-bound!")

---
## Summary

### Key Takeaways

1. **Three metrics matter:** TFLOPS, Memory Bandwidth, Occupancy
2. **Roofline model:** Compare arithmetic intensity to the balance point
3. **Memory-bound kernels:** Need better access patterns, more tiling, less data movement
4. **Compute-bound kernels:** Need Tensor Cores, less warp divergence
5. **Profile first, optimize second:** Never guess where the bottleneck is

### Tools Learned

| Tool | Use Case |
|------|----------|
| `triton.testing.do_bench` | Quick benchmarking |
| `ncu` (Nsight Compute) | Deep profiling |
| `nsys` (Nsight Systems) | Timeline analysis |

### Tomorrow: Coalescing Experiments

Now that we can measure performance, we'll experiment with memory access patterns and see exactly how coalescing affects bandwidth.

In [None]:
# Cleanup
import os
if os.path.exists('profile_matmul.py'):
    os.remove('profile_matmul.py')
    print("Cleaned up profile_matmul.py")