# Week 1, Day 1: NumPy Baseline

**Time:** ~1 hour

**Goal:** Establish a CPU baseline for matrix multiplication and learn to measure performance.

## The Challenge

Matrix multiplication is THE operation in deep learning. Today we'll:
1. Implement matmul from scratch
2. Compare against NumPy's optimized version
3. Learn to measure GFLOPS (billions of floating-point operations per second)

In [None]:
import numpy as np
import time
from typing import Tuple

---
## Step 1: The Challenge (5 min)

**Question:** How fast can we multiply two 1024x1024 matrices?

Let's start with a naive Python implementation.

In [None]:
def matmul_naive(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Naive triple-loop matrix multiplication.
    
    C[i,j] = sum_k(A[i,k] * B[k,j])
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, f"Inner dimensions must match: {K} != {K2}"
    
    C = np.zeros((M, N), dtype=A.dtype)
    
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    
    return C

In [None]:
# Test correctness on small matrices
A_small = np.array([[1, 2], [3, 4]], dtype=np.float32)
B_small = np.array([[5, 6], [7, 8]], dtype=np.float32)

C_naive = matmul_naive(A_small, B_small)
C_numpy = A_small @ B_small

print("A:")
print(A_small)
print("\nB:")
print(B_small)
print("\nC = A @ B (naive):")
print(C_naive)
print("\nC = A @ B (numpy):")
print(C_numpy)
print(f"\nResults match: {np.allclose(C_naive, C_numpy)}")

In [None]:
# Time the naive implementation on a small matrix
# WARNING: This is SLOW. We use 64x64 to keep it reasonable.
N = 64
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)

start = time.perf_counter()
C = matmul_naive(A, B)
elapsed = time.perf_counter() - start

print(f"Matrix size: {N}x{N}")
print(f"Time: {elapsed*1000:.1f} ms")

---
## Step 2: Explore - Measuring Performance (15 min)

### What is GFLOPS?

**FLOPS** = Floating-point Operations Per Second

**GFLOPS** = Billions of FLOPS

For matrix multiplication C = A @ B where A is MxK and B is KxN:
- Each element of C requires K multiplications and K-1 additions
- Total operations ≈ 2 * M * N * K (we count each multiply-add as 2 ops)
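
The size of that approximation can be checked directly. The sketch below is only an illustration: `exact_matmul_flops` and `approx_matmul_flops` are hypothetical helper names, not part of this notebook. It compares the exact count (K multiplications and K-1 additions per output element) with the conventional 2 * M * N * K.

```python
def exact_matmul_flops(M: int, N: int, K: int) -> int:
    """Exact count: each of the M*N outputs needs K multiplies and K-1 adds."""
    return M * N * (2 * K - 1)


def approx_matmul_flops(M: int, N: int, K: int) -> int:
    """Conventional count: one multiply-add per (i, j, k) triple, counted as 2 ops."""
    return 2 * M * N * K


for size in (4, 64, 1024):
    exact = exact_matmul_flops(size, size, size)
    approx = approx_matmul_flops(size, size, size)
    overcount = (approx - exact) / exact  # equals 1 / (2K - 1)
    print(f"N={size:>5}: exact={exact:>13,} approx={approx:>13,} overcount={overcount:.3%}")
```

The two counts differ by exactly M * N operations, so the relative overcount is 1 / (2K - 1) and becomes negligible at the sizes benchmarked in this notebook.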

In [None]:
def calculate_gflops(M: int, N: int, K: int, time_seconds: float) -> float:
    """Calculate GFLOPS for matrix multiplication.
    
    FLOPs (operation count) = 2 * M * N * K (multiply-add counted as 2 ops)
    """
    flops = 2 * M * N * K
    gflops = flops / (time_seconds * 1e9)
    return gflops

# Calculate GFLOPS for our naive implementation
gflops = calculate_gflops(N, N, N, elapsed)
print(f"Naive Python matmul: {gflops:.3f} GFLOPS")
print("\nFor reference:")
print("  Modern CPU single core: ~50-100 GFLOPS")
print("  NVIDIA H100 GPU: ~1000 TFLOPS (FP16 Tensor Core, dense; ~2000 with sparsity)")
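
A single-core reference figure of this kind can be approximated as clock rate times SIMD lanes times FMA units times 2 operations per fused multiply-add. The sketch below is a back-of-the-envelope model, not a measurement: `peak_gflops` is a hypothetical helper, and every number passed to it (3.0 GHz, 8 float32 lanes as with 256-bit vectors, 2 FMA units) is an illustrative assumption about a generic core.

```python
def peak_gflops(clock_ghz: float, simd_lanes: int, fma_units: int, cores: int = 1) -> float:
    """Theoretical peak: each FMA unit retires `simd_lanes` fused multiply-adds
    per cycle, and each fused multiply-add counts as 2 floating-point ops."""
    return clock_ghz * simd_lanes * fma_units * 2 * cores


# Assumed example core: 3.0 GHz, 256-bit vectors (8 float32 lanes), 2 FMA units.
single_core = peak_gflops(clock_ghz=3.0, simd_lanes=8, fma_units=2)
print(f"Assumed single-core FP32 peak: {single_core:.0f} GFLOPS")
```

Real code rarely reaches this ceiling, so it is best read as an upper bound to compare measured GFLOPS against.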

### Why is it so slow?

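Python's interpreter overhead dominates. Each inner-loop iteration:
1. Executes several bytecode instructions, each fetched and dispatched by the interpreter
2. Indexes into NumPy arrays from Python, which creates a temporary scalar object for every element read
3. Performs only two floating-point operations (one multiply, one add)

For a 1024x1024 matmul, that's 1024^3 ≈ 1.07 billion inner-loop iterations!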
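
To put a rough number on that, the per-iteration cost can be measured on a tiny matrix and scaled up. This is a sketch under a loud assumption: it extrapolates linearly in the iteration count and ignores cache effects, so the projection is an order-of-magnitude estimate only. `matmul_naive_small` repeats the triple loop from Step 1 so the cell runs on its own; the `*_probe` names are hypothetical.

```python
import time

import numpy as np


def matmul_naive_small(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Same triple loop as in Step 1, repeated so this cell is self-contained."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C


n_probe = 32  # small on purpose so the measurement stays well under a second
A_probe = np.random.randn(n_probe, n_probe).astype(np.float32)
B_probe = np.random.randn(n_probe, n_probe).astype(np.float32)

start = time.perf_counter()
matmul_naive_small(A_probe, B_probe)
probe_elapsed = time.perf_counter() - start

per_iteration = probe_elapsed / n_probe**3   # seconds per inner-loop iteration
projected = per_iteration * 1024**3          # seconds for a 1024x1024 matmul

print(f"Measured: {per_iteration * 1e9:.0f} ns per inner-loop iteration")
print(f"Projected 1024x1024 naive matmul: {projected / 60:.1f} minutes")
```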

---
## Step 3: The Concept - NumPy's Secret (10 min)

NumPy calls highly optimized libraries written in C/Fortran:
- **BLAS** (Basic Linear Algebra Subprograms)
- Often linked to **Intel MKL** or **OpenBLAS**

These libraries:
1. Use SIMD instructions (process multiple values per instruction)
2. Optimize cache usage (tiling/blocking)
3. Use multiple CPU cores
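
The tiling idea from point 2 can be sketched in a few lines. This is an illustration of the loop structure only, not how a BLAS kernel is actually written: `matmul_tiled` is a hypothetical function, the tile size of 64 is an arbitrary assumption, and each block product is delegated to NumPy rather than to hand-written SIMD code.

```python
import numpy as np


def matmul_tiled(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Blocked matmul: accumulate C one (tile x tile) block at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, f"Inner dimensions must match: {K} != {K2}"

    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # Slices clamp at the array edge, so sizes need not divide evenly.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C


A_t = np.random.randn(100, 130).astype(np.float32)
B_t = np.random.randn(130, 70).astype(np.float32)
C_t = matmul_tiled(A_t, B_t, tile=32)
print("Tiled result matches NumPy:", np.allclose(C_t, A_t @ B_t, atol=1e-3))
```

In a real kernel the tile sizes are chosen so that the working blocks of A, B, and C stay resident in cache while they are reused.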

In [None]:
# Check which BLAS NumPy is using
np.show_config()

---
## Step 4: Code It - Benchmarking NumPy (30 min)

Let's create a proper benchmarking function and measure NumPy's performance.

In [None]:
def benchmark_matmul(
    matmul_fn,
    M: int,
    N: int,
    K: int,
    warmup: int = 2,
    repeat: int = 5,
    dtype=np.float32
) -> Tuple[float, float]:
    """Benchmark a matrix multiplication function.
    
    Returns:
        (mean_time_ms, gflops)
    """
    A = np.random.randn(M, K).astype(dtype)
    B = np.random.randn(K, N).astype(dtype)
    
    # Warmup runs (let caches, BLAS thread pools, and any lazy initialization or JIT settle)
    for _ in range(warmup):
        _ = matmul_fn(A, B)
    
    # Timed runs
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        C = matmul_fn(A, B)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    
    mean_time = np.mean(times)
    gflops = calculate_gflops(M, N, K, mean_time)
    
    return mean_time * 1000, gflops  # Convert to ms
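
The mean is sensitive to a single slow run (for example, when another process briefly takes the CPU). One possible variant, sketched here with the hypothetical name `benchmark_matmul_stats`, reports the best and median times instead; the choice of statistics is an assumption about what is useful, not part of the original benchmark.

```python
import time
from typing import Callable, Dict

import numpy as np


def benchmark_matmul_stats(
    matmul_fn: Callable[[np.ndarray, np.ndarray], np.ndarray],
    M: int,
    N: int,
    K: int,
    warmup: int = 2,
    repeat: int = 5,
    dtype=np.float32,
) -> Dict[str, float]:
    """Like benchmark_matmul, but reports best and median times in ms."""
    A = np.random.randn(M, K).astype(dtype)
    B = np.random.randn(K, N).astype(dtype)

    for _ in range(warmup):
        matmul_fn(A, B)

    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        matmul_fn(A, B)
        times.append(time.perf_counter() - start)

    best = min(times)
    median = float(np.median(times))
    flops = 2 * M * N * K  # operation count, multiply-add counted as 2 ops
    return {
        "best_ms": best * 1000,
        "median_ms": median * 1000,
        "gflops_at_best": flops / (best * 1e9),
        "gflops_at_median": flops / (median * 1e9),
    }


stats = benchmark_matmul_stats(np.matmul, 256, 256, 256)
print({name: round(value, 3) for name, value in stats.items()})
```

The best time approximates what the hardware can do when undisturbed, while the median describes a typical run.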

In [None]:
# Benchmark NumPy for various sizes
sizes = [128, 256, 512, 1024, 2048]

print(f"{'Size':>8} {'Time (ms)':>12} {'GFLOPS':>10}")
print("-" * 32)

numpy_results = []
for size in sizes:
    time_ms, gflops = benchmark_matmul(np.matmul, size, size, size)
    numpy_results.append((size, time_ms, gflops))
    print(f"{size:>8} {time_ms:>12.2f} {gflops:>10.1f}")

In [None]:
# Visualize the results
try:
    import matplotlib.pyplot as plt
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    sizes_arr = [r[0] for r in numpy_results]
    times_arr = [r[1] for r in numpy_results]
    gflops_arr = [r[2] for r in numpy_results]
    
    ax1.plot(sizes_arr, times_arr, 'o-', linewidth=2, markersize=8)
    ax1.set_xlabel('Matrix Size (N)')
    ax1.set_ylabel('Time (ms)')
    ax1.set_title('NumPy Matmul Time')
    ax1.grid(True, alpha=0.3)
    
    ax2.plot(sizes_arr, gflops_arr, 'o-', linewidth=2, markersize=8, color='green')
    ax2.set_xlabel('Matrix Size (N)')
    ax2.set_ylabel('GFLOPS')
    ax2.set_title('NumPy Matmul Throughput')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
except ImportError:
    print("matplotlib not available, skipping visualization")

### Observation

Notice that:
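1. GFLOPS generally increases with matrix size: fixed call and threading overhead is amortized, and larger matrices offer more data reuse per byte loaded
2. NumPy achieves tens to hundreds of GFLOPS on CPU, depending on core count and the BLAS backend
3. This is already orders of magnitude faster than naive Python, typically tens of thousands of times faster than the loop measured in Step 1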
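
The speedup can be measured rather than assumed. The sketch below times both approaches on the same 64x64 input; `time_once` and `matmul_naive_ref` are hypothetical names, and the triple loop is repeated so the cell is self-contained. At this small size NumPy is far from its peak, so the ratio likely understates the gap at 1024x1024.

```python
import time

import numpy as np


def time_once(fn, *args) -> float:
    """Wall-clock seconds for a single call of fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def matmul_naive_ref(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Triple-loop reference, same algorithm as matmul_naive in Step 1."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C


n_cmp = 64
A_cmp = np.random.randn(n_cmp, n_cmp).astype(np.float32)
B_cmp = np.random.randn(n_cmp, n_cmp).astype(np.float32)

np.matmul(A_cmp, B_cmp)  # warmup call before timing NumPy
naive_s = time_once(matmul_naive_ref, A_cmp, B_cmp)
numpy_s = min(time_once(np.matmul, A_cmp, B_cmp) for _ in range(20))

speedup = naive_s / numpy_s
print(f"Naive: {naive_s * 1000:.1f} ms, NumPy: {numpy_s * 1e6:.1f} us, speedup: {speedup:,.0f}x")
```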

But we can do much better with a GPU...

---
## Step 5: Verify - Understanding the Math (10 min)

### Quiz: Matrix Multiplication

Test your understanding before moving to GPUs.

In [None]:
# Q1: What's the output shape of (100, 200) @ (200, 50)?
A = np.random.randn(100, 200)
B = np.random.randn(200, 50)
C = A @ B
print(f"Q1: Shape of (100, 200) @ (200, 50) = {C.shape}")

In [None]:
# Q2: How many floating-point operations (FLOPs) in a 1000x1000 matmul?
M, N, K = 1000, 1000, 1000
flops = 2 * M * N * K
print(f"Q2: FLOPs in 1000x1000 matmul = {flops:,} = {flops/1e9:.1f} GFLOP")

In [None]:
# Q3: If NumPy achieves 100 GFLOPS, how long for 1000x1000?
gflops_target = 100
time_seconds = (flops / 1e9) / gflops_target
print(f"Q3: At 100 GFLOPS, 1000x1000 takes {time_seconds*1000:.1f} ms")

In [None]:
# Q4: Element C[i,j] is computed from which elements?
print("Q4: C[i,j] = sum over k of A[i,k] * B[k,j]")
print("    = dot product of row i of A with column j of B")

# Demonstrate
i, j = 1, 0
A_demo = np.array([[1, 2, 3], [4, 5, 6]])
B_demo = np.array([[7, 8], [9, 10], [11, 12]])
C_demo = A_demo @ B_demo

print(f"\nA = \n{A_demo}")
print(f"\nB = \n{B_demo}")
print(f"\nC[{i},{j}] = A[{i},:] dot B[:,{j}]")
print(f"        = {A_demo[i,:]} dot {B_demo[:,j]}")
print(f"        = {A_demo[i,0]}*{B_demo[0,j]} + {A_demo[i,1]}*{B_demo[1,j]} + {A_demo[i,2]}*{B_demo[2,j]}")
print(f"        = {A_demo[i,0]*B_demo[0,j]} + {A_demo[i,1]*B_demo[1,j]} + {A_demo[i,2]*B_demo[2,j]}")
print(f"        = {C_demo[i,j]}")

---
## Summary

| Implementation | Performance | Notes |
|---------------|-------------|-------|
| Naive Python | ~0.001 GFLOPS | Interpreter overhead |
| NumPy (BLAS) | ~50-200 GFLOPS | Optimized C, SIMD, multi-core |
| GPU (coming) | ~1000+ GFLOPS | Massive parallelism |

### Key Takeaways

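1. **Matmul FLOPs** (operation count) = 2 * M * N * K
2. **GFLOPS** (throughput) = FLOPs / (time in seconds * 10^9)
3. NumPy is typically tens of thousands of times faster than naive Python (~50-200 vs. ~0.001 GFLOPS in the table above)
4. GPUs can be another 10-100x faster than NumPy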

---
## Next: Day 2 - CuPy Introduction

Tomorrow we'll see how CuPy achieves 10-100x speedup over NumPy with almost zero code changes.

[Continue to 02_cupy_intro.ipynb](./02_cupy_intro.ipynb)