# Week 1, Day 2: CuPy - Instant GPU Speedup

**Time:** ~1 hour

**Goal:** Get a large speedup over NumPy (often 10-100x for large matrices, depending on your hardware) with minimal code changes.

## The Challenge

Yesterday we measured NumPy's CPU performance. Today we will:
1. Run the same matmul on GPU using CuPy
2. Understand data transfer overhead
3. Learn when GPU acceleration helps (and when it doesn't)

In [None]:
import numpy as np
import time

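try:
    import cupy as cp
    # Query the device inside the try block: CuPy can be installed on a machine with no usable GPU
    gpu_name = cp.cuda.runtime.getDeviceProperties(0)['name'].decode()
    GPU_AVAILABLE = True
    print(f"CuPy version: {cp.__version__}")
    print(f"GPU: {gpu_name}")
except ImportError:
    GPU_AVAILABLE = False
    print("CuPy not available. This notebook requires a GPU.")
    print("Try running on Google Colab with GPU runtime.")
except Exception as exc:  # e.g. a CUDA runtime error when no device or driver is present
    GPU_AVAILABLE = False
    print(f"CuPy is installed, but no usable GPU was found: {exc}")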

---
## Step 1: The Challenge (5 min)

CuPy implements most of the NumPy API on top of CUDA, so for many programs it works as a drop-in replacement that runs on the GPU.

**The magic:** Change `np` to `cp` and your code runs on GPU.
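
Because the two libraries share an API, one function can often serve both array types. The sketch below uses `cp.get_array_module` to pick the module that matches its input and falls back to NumPy when CuPy is not installed. `array_module` and `normalize_rows` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # Fall back to NumPy-only behavior

def array_module(x):
    """Return cupy for GPU arrays and numpy otherwise."""
    if cp is not None:
        return cp.get_array_module(x)
    return np

def normalize_rows(x):
    """Scale each row to unit L2 norm, on whichever device x lives."""
    xp = array_module(x)
    norms = xp.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

rows = np.array([[3.0, 4.0], [0.0, 2.0]], dtype=np.float32)
unit_rows = normalize_rows(rows)  # Pass a cp.ndarray instead to run on the GPU
```

The same call works unchanged on a `cp.ndarray`; only the array you pass in decides where the computation runs.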

In [None]:
if GPU_AVAILABLE:
    # NumPy on CPU
    A_cpu = np.random.randn(1024, 1024).astype(np.float32)
    B_cpu = np.random.randn(1024, 1024).astype(np.float32)
    
    # CuPy on GPU - same API!
    A_gpu = cp.random.randn(1024, 1024).astype(cp.float32)
    B_gpu = cp.random.randn(1024, 1024).astype(cp.float32)
    
    # Matmul - identical syntax
    C_cpu = A_cpu @ B_cpu
    C_gpu = A_gpu @ B_gpu
    
    print(f"CPU result shape: {C_cpu.shape}, type: {type(C_cpu)}")
    print(f"GPU result shape: {C_gpu.shape}, type: {type(C_gpu)}")

---
## Step 2: Explore - Speedup Measurement (15 min)

### Important: GPU timing requires synchronization

CuPy launches GPU kernels asynchronously: the Python call returns as soon as the work is queued. Without `cp.cuda.Stream.null.synchronize()`,
the timer may capture only the time to *launch* the kernel, not the time to *execute* it.
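
One way to keep this rule in a single place is a small timing helper that takes the synchronization call as a parameter. This is a minimal sketch, not part of CuPy: `time_call` and its `sync` argument are hypothetical names, and the default no-op `sync` makes it usable for CPU code too.

```python
import time

def time_call(fn, sync=lambda: None, warmup=2, repeat=5):
    """Time fn() in seconds, calling sync() so queued work is included."""
    for _ in range(warmup):
        fn()
    sync()  # Drain warmup work before the first measurement
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        sync()  # Block until the work launched by fn() has finished
        times.append(time.perf_counter() - start)
    return times

# CPU usage: the default no-op sync is enough.
cpu_times = time_call(lambda: sum(range(1000)))

# GPU usage (assumes CuPy is imported and A, B already live on the device):
# gpu_times = time_call(lambda: A @ B, sync=cp.cuda.Stream.null.synchronize)
```

Recent CuPy versions also provide `cupyx.profiler.benchmark`, which reports CPU and GPU times separately and may be a better fit for serious measurements.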

In [None]:
def calculate_gflops(M, N, K, time_seconds):
    flops = 2 * M * N * K
    return flops / (time_seconds * 1e9)

def benchmark_numpy(M, N, K, warmup=2, repeat=5):
    A = np.random.randn(M, K).astype(np.float32)
    B = np.random.randn(K, N).astype(np.float32)
    
    for _ in range(warmup):
        _ = A @ B
    
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        C = A @ B
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    
    mean_time = np.mean(times)
    return mean_time * 1000, calculate_gflops(M, N, K, mean_time)

def benchmark_cupy(M, N, K, warmup=2, repeat=5):
    A = cp.random.randn(M, K).astype(cp.float32)
    B = cp.random.randn(K, N).astype(cp.float32)
    
    # Warmup
    for _ in range(warmup):
        _ = A @ B
        cp.cuda.Stream.null.synchronize()  # Wait for GPU!
    
    times = []
    for _ in range(repeat):
        cp.cuda.Stream.null.synchronize()  # Ensure previous ops done
        start = time.perf_counter()
        C = A @ B
        cp.cuda.Stream.null.synchronize()  # Wait for this op
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    
    mean_time = np.mean(times)
    return mean_time * 1000, calculate_gflops(M, N, K, mean_time)

In [None]:
if GPU_AVAILABLE:
    sizes = [256, 512, 1024, 2048, 4096]
    
    print(f"{'Size':>6} {'NumPy (ms)':>12} {'CuPy (ms)':>12} {'Speedup':>10} {'GPU GFLOPS':>12}")
    print("-" * 60)
    
    results = []
    for size in sizes:
        np_time, np_gflops = benchmark_numpy(size, size, size)
        cp_time, cp_gflops = benchmark_cupy(size, size, size)
        speedup = np_time / cp_time
        results.append((size, np_time, cp_time, speedup, cp_gflops))
        print(f"{size:>6} {np_time:>12.2f} {cp_time:>12.2f} {speedup:>10.1f}x {cp_gflops:>12.0f}")

### Observation

Notice how:
1. Speedup increases with matrix size
2. Small matrices may not benefit much (kernel launch overhead)
3. GPU achieves hundreds or thousands of GFLOPS
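
To see where a GFLOPS figure comes from, it helps to work one case by hand. The sketch below applies the same `2 * M * N * K` formula as `calculate_gflops` to a 4096x4096 matmul; the 10 ms runtime is a made-up value for illustration, not a measurement.

```python
def matmul_gflops(M, N, K, seconds):
    """GFLOPS for an (M x K) @ (K x N) product: one multiply and one add per term."""
    return 2 * M * N * K / (seconds * 1e9)

flops_4096 = 2 * 4096**3  # 137,438,953,472 floating-point operations
gflops_at_10ms = matmul_gflops(4096, 4096, 4096, 0.010)  # hypothetical 10 ms runtime

print(f"{flops_4096:,} FLOPs -> {gflops_at_10ms:,.0f} GFLOPS at 10 ms")
```

Halving the runtime doubles the GFLOPS, which is why accurate timing matters so much for these numbers.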

---
## Step 3: The Concept - Why GPUs Are Fast (10 min)

### CPU vs GPU Architecture

| | CPU | GPU |
|-|----|----|
| Cores | 8-32 large cores | 1000s of small cores |
| Strategy | Latency-optimized | Throughput-optimized |
| Cache | Large (MB per core) | Small (KB per core) |
| Good for | Sequential, branchy code | Parallel, uniform code |

### Why Matmul is Perfect for GPUs

1. **Massive parallelism**: Each output element can be computed independently
2. **Regular memory access**: Predictable patterns enable coalescing
3. **High arithmetic intensity**: Many operations per byte loaded
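
Arithmetic intensity can be estimated with a few lines of arithmetic. The sketch below uses an idealized model that assumes each matrix moves between memory and the compute units exactly once; real kernels move more data, so treat the results as upper bounds. The function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4

def matmul_intensity(n):
    """FLOPs per byte for an n x n float32 matmul, counting each matrix once."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * BYTES_PER_FLOAT32  # read A and B, write C
    return flops / bytes_moved

def add_intensity():
    """FLOPs per byte for element-wise float32 addition."""
    return 1 / (3 * BYTES_PER_FLOAT32)  # one add; read two values, write one

for n in (256, 1024, 4096):
    print(f"n={n:>5}: matmul {matmul_intensity(n):7.1f} FLOP/byte, "
          f"add {add_intensity():.3f} FLOP/byte")
```

Under this model, matmul intensity grows linearly with `n` (it simplifies to `n / 6`), while element-wise addition stays fixed at 1/12 FLOP per byte regardless of size.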

In [None]:
if GPU_AVAILABLE:
    # How many independent computations in a 4096x4096 matmul?
    N = 4096
    output_elements = N * N
    print(f"Matrix size: {N}x{N}")
    print(f"Output elements (each computable independently): {output_elements:,}")
    print(f"That's {output_elements / 1e6:.1f} million independent tasks!")

---
## Step 4: Code It - Data Transfer Overhead (30 min)

### The Hidden Cost: CPU-GPU Data Transfer

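Data must cross the PCIe bus between host and device memory. PCIe bandwidth (tens of GB/s) is far below GPU memory bandwidth (hundreds of GB/s to several TB/s on recent GPUs), so transfers can easily cost more than the computation itself.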
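
A back-of-the-envelope estimate shows the scale of the gap. The bandwidth figures below are assumptions chosen for illustration, not measurements of any particular system, and `transfer_ms` is a hypothetical helper.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Milliseconds to move num_bytes at a sustained bandwidth in GB/s."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1000

N = 4096
matrix_bytes = N * N * 4        # one float32 matrix: 64 MiB
total_bytes = 3 * matrix_bytes  # copy A and B in, copy C out

PCIE_GB_PER_S = 16.0      # assumed effective PCIe bandwidth
GPU_MEM_GB_PER_S = 500.0  # assumed GPU memory bandwidth

print(f"Over PCIe:     {transfer_ms(total_bytes, PCIE_GB_PER_S):.2f} ms")
print(f"In GPU memory: {transfer_ms(total_bytes, GPU_MEM_GB_PER_S):.2f} ms")
```

Swap in the bandwidths of your own hardware to get a rough lower bound on transfer time before benchmarking.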

In [None]:
if GPU_AVAILABLE:
    def benchmark_with_transfer(M, N, K, repeat=5):
        """Benchmark including CPU->GPU transfer time."""
        A_cpu = np.random.randn(M, K).astype(np.float32)
        B_cpu = np.random.randn(K, N).astype(np.float32)
        
        times = []
        for _ in range(repeat):
            start = time.perf_counter()
            
            # Transfer to GPU
            A_gpu = cp.asarray(A_cpu)
            B_gpu = cp.asarray(B_cpu)
            
            # Compute
            C_gpu = A_gpu @ B_gpu
            
            # Transfer back
            C_cpu = cp.asnumpy(C_gpu)
            
            cp.cuda.Stream.null.synchronize()
            elapsed = time.perf_counter() - start
            times.append(elapsed)
        
        return np.mean(times) * 1000
    
    # Compare with and without transfer
    print(f"{'Size':>6} {'No Transfer':>14} {'With Transfer':>14} {'Overhead':>10}")
    print("-" * 50)
    
    for size in [1024, 2048, 4096]:
        time_no_transfer, _ = benchmark_cupy(size, size, size)
        time_with_transfer = benchmark_with_transfer(size, size, size)
        overhead = (time_with_transfer - time_no_transfer) / time_no_transfer * 100
        print(f"{size:>6} {time_no_transfer:>12.2f}ms {time_with_transfer:>12.2f}ms {overhead:>9.0f}%")

### Key Insight: Keep Data on GPU

For best performance:
1. Transfer data to GPU once
2. Do many operations on GPU
3. Transfer results back once

Avoid: Transfer -> Compute -> Transfer -> Compute -> Transfer...
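
A simple cost model makes the difference concrete. This is a sketch under stated assumptions: the per-round-trip transfer cost and per-matmul compute cost are hypothetical numbers, and the model ignores any overlap between transfers and computation.

```python
def bad_pattern_ms(iterations, xfer_ms, compute_ms):
    """Transfer -> compute -> transfer on every iteration."""
    return iterations * (xfer_ms + compute_ms)

def good_pattern_ms(iterations, xfer_ms, compute_ms):
    """Transfer once, compute many times on the device."""
    return xfer_ms + iterations * compute_ms

# Hypothetical costs: 2 ms of transfers per round trip, 0.5 ms per matmul
for n in (1, 10, 100):
    bad = bad_pattern_ms(n, 2.0, 0.5)
    good = good_pattern_ms(n, 2.0, 0.5)
    print(f"{n:>4} iterations: bad={bad:6.1f} ms, good={good:6.1f} ms, "
          f"ratio={bad / good:.1f}x")
```

With a single iteration the two patterns cost the same; the advantage of keeping data on the GPU grows with the number of operations performed between transfers.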

In [None]:
if GPU_AVAILABLE:
    # Bad pattern: Transfer every iteration
    def bad_pattern(A_cpu, B_cpu, iterations=10):
        for _ in range(iterations):
            A_gpu = cp.asarray(A_cpu)
            B_gpu = cp.asarray(B_cpu)
            C_gpu = A_gpu @ B_gpu
            C_cpu = cp.asnumpy(C_gpu)
        return C_cpu
    
    # Good pattern: Keep data on GPU
    def good_pattern(A_cpu, B_cpu, iterations=10):
        A_gpu = cp.asarray(A_cpu)
        B_gpu = cp.asarray(B_cpu)
        for _ in range(iterations):
            C_gpu = A_gpu @ B_gpu
        C_cpu = cp.asnumpy(C_gpu)
        return C_cpu
    
    # Benchmark
    size = 1024
    A = np.random.randn(size, size).astype(np.float32)
    B = np.random.randn(size, size).astype(np.float32)
    
    start = time.perf_counter()
    _ = bad_pattern(A, B)
    cp.cuda.Stream.null.synchronize()
    bad_time = time.perf_counter() - start
    
    start = time.perf_counter()
    _ = good_pattern(A, B)
    cp.cuda.Stream.null.synchronize()
    good_time = time.perf_counter() - start
    
    print(f"Bad pattern (transfer each iteration): {bad_time*1000:.1f}ms")
    print(f"Good pattern (keep on GPU): {good_time*1000:.1f}ms")
    print(f"Speedup: {bad_time/good_time:.1f}x")

---
## Step 5: Verify - When to Use GPU (10 min)

### Quiz: GPU Suitability

In [None]:
if GPU_AVAILABLE:
    # Q1: At what matrix size does GPU become faster?
    print("Q1: Testing crossover point...")
    
    for size in [32, 64, 128, 256, 512]:
        np_time, _ = benchmark_numpy(size, size, size)
        cp_time, _ = benchmark_cupy(size, size, size)
        winner = "GPU" if cp_time < np_time else "CPU"
        print(f"  {size:>4}x{size}: CPU={np_time:.2f}ms, GPU={cp_time:.2f}ms -> {winner} wins")

In [None]:
if GPU_AVAILABLE:
    # Q2: What about element-wise operations?
    print("\nQ2: Element-wise operations (add)...")
    
    for size in [1000, 10000, 100000, 1000000]:
        A_np = np.random.randn(size).astype(np.float32)
        B_np = np.random.randn(size).astype(np.float32)
        A_cp = cp.asarray(A_np)
        B_cp = cp.asarray(B_np)
        
        # NumPy
        _ = A_np + B_np  # Warmup
        start = time.perf_counter()
        for _ in range(100):
            C = A_np + B_np
        np_time = (time.perf_counter() - start) / 100 * 1000
        
        # CuPy
        _ = A_cp + B_cp  # Warmup: the first call compiles the element-wise kernel
        cp.cuda.Stream.null.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            C = A_cp + B_cp
        cp.cuda.Stream.null.synchronize()
        cp_time = (time.perf_counter() - start) / 100 * 1000
        
        winner = "GPU" if cp_time < np_time else "CPU"
        print(f"  {size:>8} elements: CPU={np_time:.3f}ms, GPU={cp_time:.3f}ms -> {winner}")

### Summary: When to Use GPU

| Scenario | Use GPU? | Why |
|----------|----------|-----|
| Large matmul (1000+ per dimension) | Yes | Massive parallelism |
| Small matmul (<100 per dimension) | No | Kernel launch overhead |
| Many operations in sequence | Yes | Amortize transfer cost |
| Single small operation | No | Transfer overhead dominates |
| Element-wise, large data | Maybe | Depends on operation complexity |

---
## Summary

| What We Learned | Key Point |
|-----------------|----------|
| CuPy API | Drop-in NumPy replacement |
| GPU speedup | 10-100x for large matrices |
| Synchronization | Must sync for accurate timing |
| Transfer overhead | Keep data on GPU when possible |
| Crossover point | GPU wins for large parallel tasks |

### What's Next?

CuPy's matmul is fast because it calls NVIDIA's cuBLAS library. But how does cuBLAS achieve such performance? To understand that, we need to learn how GPUs actually work.

---
## Next: Day 3 - GPU Architecture

Tomorrow we'll peek inside the GPU and understand SMs, warps, and the SIMT execution model.

[Continue to 03_gpu_architecture.ipynb](./03_gpu_architecture.ipynb)