# Lab 0.3: Parallel Patterns

**Chapter 0: The Parallel Mindset**

Learn to recognize the four fundamental parallel patterns and how they map to GPU programming.

## Learning Objectives
- Identify embarrassingly parallel, reduction, stencil, and irregular patterns
- Understand why pattern recognition determines optimization strategy
- Practice classifying real-world operations

In [None]:
import numpy as np
import time

## Pattern 1: Embarrassingly Parallel

**Definition**: Each output element depends only on its corresponding input. No communication needed.

**GPU sweet spot**: Maps perfectly to SIMT (single instruction, multiple threads) - one thread per element.

**Examples**: Element-wise ops, activation functions, pixel processing

In [None]:
def embarrassingly_parallel_examples(x):
    """All of these are embarrassingly parallel."""
    
    # Element-wise math
    y1 = x * 2 + 1
    y2 = np.sin(x)
    y3 = np.exp(-x)
    
    # Activation functions
    relu = np.maximum(0, x)
    sigmoid = 1 / (1 + np.exp(-x))
    tanh = np.tanh(x)
    
    # Comparisons
    mask = x > 0
    
    # Return every result computed above so nothing is silently discarded
    return y1, y2, y3, relu, sigmoid, tanh, mask

x = np.random.randn(1000000)
results = embarrassingly_parallel_examples(x)

print("Embarrassingly parallel operations completed.")
print("Each output[i] depends ONLY on input[i] - perfect for GPU!")
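
As a rough illustration of the one-thread-per-element mapping, the sketch below emulates a GPU launch in plain Python. The names `relu_kernel` and `launch` are hypothetical and are not part of any real GPU API. The loop runs serially here; on a GPU each iteration would be a separate thread.

```python
def relu_kernel(thread_id, inputs, outputs):
    """Work done by ONE hypothetical GPU thread: read one element, write one element."""
    outputs[thread_id] = max(0.0, inputs[thread_id])


def launch(kernel, n_threads, *args):
    """Serial stand-in for a GPU launch (assumption: real hardware runs these concurrently)."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [0.0] * len(inputs)

# One "thread" per element; no thread reads anything another thread writes
launch(relu_kernel, len(inputs), inputs, outputs)
print(outputs)
```

Because no thread reads another thread's output, the iterations could run in any order, which is what makes the pattern embarrassingly parallel.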

## Pattern 2: Reduction

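**Definition**: Combine many values into one (or few) using an associative operation. Floating-point addition is only approximately associative, so a tree-ordered sum can differ from a sequential sum in the last few bits.

**GPU approach**: Parallel tree reduction - O(log N) parallel steps instead of O(N) sequential steps. Total work is still O(N).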

**Examples**: Sum, max, min, mean, dot product, softmax denominator

In [None]:
def reduction_examples(x):
    """All of these are reductions."""
    
    # Basic reductions
    total = np.sum(x)
    maximum = np.max(x)
    minimum = np.min(x)
    mean = np.mean(x)
    
    # More complex reductions
    norm = np.linalg.norm(x)  # sqrt(sum(x^2))
    argmax = np.argmax(x)     # index of max
    
    # First three positions are unchanged; the remaining results are appended
    return total, maximum, mean, norm, minimum, argmax

x = np.random.randn(1000000)
results = reduction_examples(x)

print(f"Sum: {results[0]:.4f}")
print(f"Max: {results[1]:.4f}")
print(f"Mean: {results[2]:.4f}")
print("\nReductions combine many values into one.")
print("GPU uses tree reduction: 1M elements -> about 20 parallel steps (log2(1M) ~ 20)")
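
The step count can be checked with a little arithmetic. The sketch below (with a hypothetical helper name, `tree_reduction_cost`) counts the parallel steps and the total number of pairwise combines for a tree reduction of `n` values. It only models the counting and does not reduce any data.

```python
def tree_reduction_cost(n):
    """Return (parallel_steps, total_combines) for a pairwise tree reduction of n values."""
    steps = 0
    combines = 0
    while n > 1:
        combines += n // 2   # pairs combined in this step (all could run in parallel)
        n = (n + 1) // 2     # survivors: one per pair, plus an unpaired leftover if n is odd
        steps += 1
    return steps, combines


for n in (8, 1_000, 1_000_000):
    steps, combines = tree_reduction_cost(n)
    print(f"N = {n:>9,}: {steps} parallel steps, {combines:,} combines in total")
```

The number of combines is always N - 1, the same as a sequential loop. The tree does not reduce the work; it reduces the length of the dependency chain.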

In [None]:
# Visualize tree reduction
def tree_reduction_demo(arr):
    """Demonstrate tree reduction for sum."""
    print(f"Input: {arr}")
    
    current = arr.copy()
    step = 0
    
    while len(current) > 1:
        step += 1
        # Pair up adjacent elements and sum
        new_len = (len(current) + 1) // 2
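        # Keep the input dtype; np.zeros defaults to float64 and would turn integer sums into floats
        new_arr = np.zeros(new_len, dtype=current.dtype)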
        
        for i in range(new_len):
            if 2*i + 1 < len(current):
                new_arr[i] = current[2*i] + current[2*i + 1]
            else:
                new_arr[i] = current[2*i]
        
        current = new_arr
        print(f"Step {step}: {current}")
    
    return current[0]

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
result = tree_reduction_demo(arr)
print(f"\nFinal sum: {result} (expected: {arr.sum()})")
print("\nOnly 3 steps for 8 elements (log2(8) = 3)")
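
The demo above pairs adjacent elements one at a time. A GPU-style variant, sketched below, adds the upper half of a buffer onto the lower half in each step, so every step is a single element-wise operation. It assumes the input can be padded with zeros to a power-of-two length, and `strided_tree_sum` is a hypothetical helper name.

```python
import numpy as np


def strided_tree_sum(values):
    """Sum values with a stride-halving tree; returns (total, number_of_steps)."""
    values = np.asarray(values)
    n = len(values)

    # Pad with zeros up to the next power of two (assumption made for simplicity)
    size = 1 << max(n - 1, 0).bit_length()
    buf = np.zeros(size, dtype=values.dtype)
    buf[:n] = values

    stride = size // 2
    steps = 0
    while stride > 0:
        # Element i absorbs element i + stride; all of these adds are independent
        buf[:stride] += buf[stride:2 * stride]
        stride //= 2
        steps += 1
    return buf[0], steps


total, steps = strided_tree_sum(np.arange(1, 9))
print(f"sum = {total}, steps = {steps}")
```

Each step is itself embarrassingly parallel. Only the boundary between steps requires the threads to synchronize.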

## Pattern 3: Stencil / Neighbor

**Definition**: Each output depends on nearby inputs in a fixed pattern.

**GPU approach**: Load each tile of the input, plus its border (halo) cells, into shared memory once, then compute every output in the tile from that fast copy.

**Examples**: Convolution, blur, edge detection, finite differences

In [None]:
def stencil_examples(x):
    """All of these are stencil operations."""
    
    # 1D stencil: moving average
    kernel = np.array([1, 1, 1]) / 3
    moving_avg = np.convolve(x, kernel, mode='same')
    
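    # 1D stencil: finite difference (derivative approximation)
    # Assumes unit grid spacing; the two endpoints have no full neighborhood and stay 0
    derivative = np.zeros_like(x)
    derivative[1:-1] = (x[2:] - x[:-2]) / 2  # Central difference
    
    # 1D stencil: Laplacian (second difference, unit spacing; endpoints stay 0)
    laplacian = np.zeros_like(x)
    laplacian[1:-1] = x[:-2] - 2*x[1:-1] + x[2:]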
    
    return moving_avg, derivative, laplacian

x = np.sin(np.linspace(0, 4*np.pi, 100))
results = stencil_examples(x)

print("Stencil operations completed.")
print("Each output[i] depends on input[i-1], input[i], input[i+1], etc.")
print("\nGPU optimization: load neighborhood into fast shared memory.")
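
To make the shared-memory idea concrete, the sketch below emulates it on the CPU for the 1D Laplacian. Each tile copies its own elements plus one halo cell on each side into a local buffer, then computes all of its outputs from that buffer. The name `laplacian_tiled` and the tile size of 32 are assumptions for illustration; the local copy only stands in for GPU shared memory.

```python
import numpy as np


def laplacian_tiled(x, tile=32):
    """1D Laplacian computed tile by tile, each tile working from a local copy."""
    n = len(x)
    out = np.zeros_like(x)
    # Interior points are 1 .. n-2; the endpoints stay 0
    for start in range(1, n - 1, tile):
        stop = min(start + tile, n - 1)
        # "Shared memory" stand-in: the tile plus one halo element on each side
        local = x[start - 1:stop + 1].copy()
        out[start:stop] = local[:-2] - 2 * local[1:-1] + local[2:]
    return out


x = np.sin(np.linspace(0, 4 * np.pi, 100))

# Reference: the untiled stencil
reference = np.zeros_like(x)
reference[1:-1] = x[:-2] - 2 * x[1:-1] + x[2:]

print(np.allclose(laplacian_tiled(x), reference))
```

Each input element is read from slow memory about once per tile that needs it, instead of once per output that uses it. For a 3-point stencil that is roughly a threefold reduction in global reads.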

In [None]:
# 2D stencil example: image blur
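def blur_2d(image, kernel_size=3):
    """Simple box blur - a 2D stencil operation.

    Assumes an odd kernel_size. Pixels within kernel_size // 2 of an edge
    have no full neighborhood and are left at 0.
    """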
    h, w = image.shape
    k = kernel_size // 2
    output = np.zeros_like(image)
    
    for i in range(k, h-k):
        for j in range(k, w-k):
            # Each output pixel is the average of its neighborhood
            neighborhood = image[i-k:i+k+1, j-k:j+k+1]
            output[i, j] = neighborhood.mean()
    
    return output

# Create a simple test image
image = np.random.rand(100, 100)
blurred = blur_2d(image)

print(f"Original image shape: {image.shape}")
print(f"Blurred image shape: {blurred.shape}")
print("\nEach output pixel depends on a 3x3 neighborhood of input pixels.")
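
The double loop above visits pixels one at a time, but each output pixel is independent of the others. One way to express that in NumPy, sketched below, is to sum shifted views of the image. The sketch assumes an odd `kernel_size` and an image larger than the kernel. `blur_2d_shifted` is a hypothetical name, and like the loop version it leaves border pixels at 0.

```python
import numpy as np


def blur_2d_shifted(image, kernel_size=3):
    """Box blur computed by summing shifted views of the image (odd kernel_size assumed)."""
    h, w = image.shape
    k = kernel_size // 2
    inner_h, inner_w = h - 2 * k, w - 2 * k

    # Accumulate one shifted view per kernel offset; every pixel is updated at once
    acc = np.zeros((inner_h, inner_w))
    for di in range(kernel_size):
        for dj in range(kernel_size):
            acc += image[di:di + inner_h, dj:dj + inner_w]

    output = np.zeros_like(image, dtype=float)
    output[k:h - k, k:w - k] = acc / kernel_size**2
    return output


image = np.arange(25, dtype=float).reshape(5, 5)
blurred = blur_2d_shifted(image)
print(blurred[2, 2])
```

The remaining loops run over the 9 kernel offsets rather than the pixels, which is closer to how a GPU thread would see the problem: a fixed, small amount of work per output.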

## Pattern 4: Irregular / Sparse

**Definition**: Memory accesses and control flow depend on the data itself, so they cannot be predicted ahead of time.

**GPU challenge**: Scattered accesses defeat memory coalescing, and data-dependent branches cause thread divergence.

**Examples**: Graph traversal, sparse matrix ops, tree algorithms

In [None]:
def irregular_examples():
    """Irregular access patterns - harder to parallelize efficiently."""
    
    # Sparse matrix-vector multiply
    # Access pattern depends on where non-zeros are
    from scipy import sparse
    
    # Create sparse matrix (90% zeros)
    density = 0.1
    A = sparse.random(1000, 1000, density=density, format='csr')
    x = np.random.rand(1000)
    
    # Sparse multiply - each row accesses different columns
    y = A.dot(x)
    
    return A, y

A, y = irregular_examples()
print(f"Sparse matrix: {A.shape}, {A.nnz} non-zeros ({A.nnz / (A.shape[0]*A.shape[1]) * 100:.1f}% density)")
print("\nIrregular because: each row accesses different column indices.")
print("GPU threads would access scattered memory locations - poor coalescing.")
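
To see where the scattered accesses come from, the sketch below writes out a CSR (compressed sparse row) matrix-vector product by hand in plain Python. The three arrays follow the usual CSR layout (`data`, `indices`, `indptr`). The small matrix and the function name `csr_matvec` are made up for illustration; this is not how SciPy implements the operation internally.

```python
def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        # The non-zeros of this row live in data[indptr[row]:indptr[row + 1]]
        for k in range(indptr[row], indptr[row + 1]):
            # Gather: x is read at a column index that depends on the data
            y[row] += data[k] * x[indices[k]]
    return y


# Hypothetical 3x4 example matrix:
# [[5, 0, 0, 2],
#  [0, 0, 0, 0],
#  [0, 3, 4, 0]]
data = [5.0, 2.0, 3.0, 4.0]   # non-zero values, row by row
indices = [0, 3, 1, 2]        # column index of each non-zero
indptr = [0, 2, 2, 4]         # where each row starts in data/indices

x = [1.0, 2.0, 3.0, 4.0]
print(csr_matvec(data, indices, indptr, x))
```

Two sources of irregularity are visible here. The inner loop length differs per row (2, 0, and 2 non-zeros), so threads assigned one row each would do unequal work, and the reads of `x` jump between unrelated positions.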

## Exercise: Classify These Operations

For each operation, identify the pattern:
- **E** = Embarrassingly Parallel
- **R** = Reduction
- **S** = Stencil
- **I** = Irregular

In [None]:
# Exercise: Classify each operation

operations = [
    "ReLU activation: max(0, x)",
    "Softmax denominator: sum(exp(x))",
    "2D Convolution",
    "Matrix multiplication C = A @ B",
    "LayerNorm: (x - mean) / std",
    "Attention scores: Q @ K.T",
    "Graph neural network message passing",
    "Histogram computation",
]

# Answers (try to figure them out first!)
answers = [
    "E - Each output element depends only on its own input",
    "R - Sum many values into one",
    "S - Each output depends on a fixed input neighborhood",
    "R + E - One dot product (reduction) per output element; outputs are independent (parallel)",
    "R + E - Compute mean/std (reduction), then normalize (parallel)",
    "R + E - One dot product (reduction) per query-key pair; pairs are independent (parallel)",
    "I - Neighbors depend on graph structure",
    "R + I - Bin index depends on the data (irregular); each bin's count is a reduction",
]

In [None]:
# Reveal answers
print("Pattern Classification:")
print("=" * 60)
for op, ans in zip(operations, answers):
    print(f"\n{op}")
    print(f"  -> {ans}")
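
Two of the composite answers can be made concrete in a few lines of NumPy. The sketch below is a minimal illustration, not a reference implementation. `layer_norm` omits the learnable scale and shift and adds a small `eps` (an assumption) to avoid dividing by zero. `histogram` is a hypothetical helper that uses equal-width bins and clips out-of-range values into the first or last bin.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis: two reductions, then element-wise math."""
    mean = x.mean(axis=-1, keepdims=True)    # R: many values -> one per row
    var = x.var(axis=-1, keepdims=True)      # R: many values -> one per row
    return (x - mean) / np.sqrt(var + eps)   # E: each element normalized independently


def histogram(values, n_bins, lo, hi):
    """Equal-width histogram: element-wise bin index, then a scatter-add."""
    # E: every value computes its own bin index
    bins = ((values - lo) / (hi - lo) * n_bins).astype(np.int64)
    bins = np.clip(bins, 0, n_bins - 1)
    # I + R: where each increment lands depends on the data; each bin sums its hits
    counts = np.zeros(n_bins, dtype=np.int64)
    np.add.at(counts, bins, 1)
    return counts


rng = np.random.default_rng(0)
normalized = layer_norm(rng.normal(size=(4, 16)))
print("Row means after LayerNorm:", np.round(normalized.mean(axis=-1), 6))

counts = histogram(np.array([0.1, 0.2, 0.6, 0.9, 1.0]), n_bins=2, lo=0.0, hi=1.0)
print("Histogram counts:", counts)
```

On a GPU the scatter-add would typically need atomic updates, because several threads can hit the same bin at once. That contention is one reason histograms are harder than plain reductions.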

## Key Takeaways

1. **Pattern recognition is the first step**: Before optimizing, identify the pattern
2. **Embarrassingly parallel = GPU heaven**: Maximum parallelism, minimal communication
3. **Reductions are well-understood**: Tree reduction gives O(log N) parallel steps
4. **Stencils use shared memory**: Load neighborhood once, compute multiple outputs
5. **Irregular patterns are hardest**: May need algorithmic changes, not just GPU porting