# Triton Vector Addition with Performance Benchmarking

**FreeCodeCamp CUDA Course - Module 8: Triton**

Original Course: [https://www.youtube.com/watch?v=86FAWCzIe_4](https://www.youtube.com/watch?v=86FAWCzIe_4)
Source File: `01_vec_add.py`

---

## Overview

This notebook introduces OpenAI's Triton language for GPU programming. It shows how Triton offers a higher-level abstraction than CUDA while staying competitive in performance on a simple memory-bound kernel: vector addition.

---

## Learning Objectives

By the end of this notebook, you will:

1. Understand Triton's programming model (blocked programs vs CUDA's scalar programs)
2. Write a Triton kernel for vector addition
3. Compare Triton performance with PyTorch native operations
4. Use Triton's benchmarking utilities
5. Understand memory masking and pointer arithmetic in Triton

---

## Setup: Install Triton

First, make sure a GPU runtime is enabled, then install Triton.

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Install Triton
!pip install triton -q

---

## Triton vs CUDA: Key Differences

### CUDA Programming Model
- **Scalar program, blocked threads**: Write code for one thread, launch many
- Manual shared memory management and memory coalescing
- Explicit synchronization (`__syncthreads()`)

### Triton Programming Model
- **Blocked program, scalar threads**: Write code for a block of elements
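- Automatic memory coalescing and shared memory management within a program
- Higher-level abstractions: loads, stores, and arithmetic operate on whole blocks
- Performance that is often close to hand-tuned CUDA, especially for memory-bound kernels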
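
The contrast between the two models can be emulated on the CPU. The following is a minimal pure-Python sketch, not real GPU code: the function names `cuda_style_add` and `triton_style_add` are hypothetical, and the loops stand in for threads and program instances that would run in parallel on a GPU.

```python
# Pure-Python emulation of the two programming models (no GPU required).
# All function names are hypothetical and exist only for illustration.

def cuda_style_add(x, y, block_dim):
    """Scalar program: the body is written for ONE thread; many are launched."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + block_dim - 1) // block_dim
    for block_idx in range(num_blocks):        # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            idx = block_idx * block_dim + thread_idx
            if idx < n:                        # per-thread bounds check
                out[idx] = x[idx] + y[idx]
    return out


def triton_style_add(x, y, block_size):
    """Blocked program: the body is written for a whole block of elements."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):            # plays the role of tl.program_id(axis=0)
        offsets = [pid * block_size + i for i in range(block_size)]
        mask = [off < n for off in offsets]    # block-wide bounds check
        # Masked load, add, and masked store over the whole block
        for off, keep in zip(offsets, mask):
            if keep:
                out[off] = x[off] + y[off]
    return out


x = [float(i) for i in range(10)]
y = [2.0 * i for i in range(10)]
assert cuda_style_add(x, y, block_dim=4) == triton_style_add(x, y, block_size=4)
```

Both functions compute the same result. The difference is the unit the programmer reasons about: a single index in the CUDA style, a list of offsets plus a mask in the Triton style.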

---

## Triton Vector Addition Kernel

In [None]:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr,  # Pointer to first input vector
               y_ptr,  # Pointer to second input vector
               output_ptr,  # Pointer to output vector
               n_elements,  # Size of the vector
               BLOCK_SIZE: tl.constexpr,  # Elements per program (compile-time constant; must be a power of two for tl.arange)
               ):
    """
    Triton kernel for element-wise vector addition.
    
    Equivalent CUDA:
    __global__ void add_kernel(float* x, float* y, float* output, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            output[idx] = x[idx] + y[idx];
        }
    }
    """
    # Get program ID (similar to blockIdx.x in CUDA)
    pid = tl.program_id(axis=0)

    # Calculate which elements this program will process
    # For 256-element vector with BLOCK_SIZE=64:
    # Program 0: elements [0:64]
    # Program 1: elements [64:128]
    # Program 2: elements [128:192]
    # Program 3: elements [192:256]
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)

    # Create mask for bounds checking (handles non-multiple of BLOCK_SIZE)
    mask = offsets < n_elements

    # Load x and y from DRAM with masking
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    
    # Perform addition
    output = x + y
    
    # Store result back to DRAM
    tl.store(output_ptr + offsets, output, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor):
    """Wrapper function to launch Triton kernel."""
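    # Preallocate output
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    # The kernel indexes both inputs with the same flat offsets
    assert x.shape == y.shape, "x and y must have the same shape"
    assert x.is_contiguous() and y.is_contiguous(), "inputs must be contiguous"
    n_elements = output.numel()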
    
    # Define grid (number of program instances to launch)
    # Similar to: num_blocks = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    
    # Launch kernel
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    
    return output


# Test the kernel
torch.manual_seed(0)
size = 2**20  # 1M elements
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')

# Triton result
triton_result = add(x, y)

# PyTorch result
torch_result = x + y

# Verify correctness
print(f"Max difference: {torch.max(torch.abs(triton_result - torch_result))}")
print(f"Results match: {torch.allclose(triton_result, torch_result)}")

---

## Performance Benchmarking

Now let's benchmark Triton against PyTorch's optimized operations.

In [None]:
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],  # X-axis: vector size
        x_vals=[2**i for i in range(12, 28, 1)],  # 4K to 128M elements
        x_log=True,  # Logarithmic x-axis
        line_arg='provider',  # Compare different implementations
        line_vals=['triton', 'torch'],
        line_names=['Triton', 'PyTorch'],
        styles=[('blue', '-'), ('green', '-')],
        ylabel='GB/s',  # Memory bandwidth achieved
        plot_name='vector-add-performance',
        args={},
    ))
def benchmark(size, provider):
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
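    # do_bench returns one time in ms per requested quantile, in the same order:
    # median, 20th percentile (faster runs), 80th percentile (slower runs)
    quantiles = [0.5, 0.2, 0.8]
    
    if provider == 'torch':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y, quantiles=quantiles)
    elif provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: add(x, y), quantiles=quantiles)
    else:
        raise ValueError(f"Unknown provider: {provider}")
    
    # Calculate bandwidth: 3 arrays (2 reads, 1 write) * size * bytes per element / time
    # bytes / ms * 1e-6 gives GB/s
    gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6
    # Return (value, lower bound, upper bound): the slowest time gives the lowest bandwidth
    return gbps(ms), gbps(max_ms), gbps(min_ms)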


# Run benchmark
benchmark.run(print_data=True, show_plots=True)

---

## Key Concepts Explained

### 1. Program ID (`tl.program_id`)
- Similar to `blockIdx.x` in CUDA
- Identifies which "block" of work this program handles
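
To make the mapping from program ID to elements concrete, here is a small pure-Python sketch. The helper `program_ranges` is hypothetical, and `cdiv` is a local reimplementation of the ceiling division that `triton.cdiv` performs in the launch grid above.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic the launch grid uses."""
    return (a + b - 1) // b


def program_ranges(n_elements, block_size):
    """Half-open element range [start, end) handled by each program ID (hypothetical helper)."""
    num_programs = cdiv(n_elements, block_size)
    return [
        (pid * block_size, min((pid + 1) * block_size, n_elements))
        for pid in range(num_programs)
    ]


# The 256-element example from the kernel comments, with BLOCK_SIZE=64
print(program_ranges(256, 64))  # [(0, 64), (64, 128), (128, 192), (192, 256)]

# The test vector used in this notebook: 2**20 elements, BLOCK_SIZE=1024
print(cdiv(2**20, 1024))        # 1024
```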

### 2. Memory Masking
```python
mask = offsets < n_elements
```
- Prevents out-of-bounds memory access
- Handles cases where vector size isn't a multiple of BLOCK_SIZE
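
The effect of the mask on the last block can be sketched in plain Python. The helpers `block_offsets_and_mask` and `masked_load` are hypothetical stand-ins for `tl.arange` plus the comparison, and for `tl.load` with `mask=`. The fill value `other=0.0` is an assumption made for this emulation only; the kernel above passes no `other` and never stores the masked-off lanes.

```python
def block_offsets_and_mask(pid, block_size, n_elements):
    """Offsets and bounds mask for one program instance (hypothetical helper)."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]
    return offsets, mask


def masked_load(buffer, offsets, mask, other=0.0):
    """Emulates a masked load on a Python list: masked-off lanes never touch the buffer."""
    return [buffer[off] if keep else other for off, keep in zip(offsets, mask)]


# 10 elements with BLOCK_SIZE=4 need 3 programs; the last one covers offsets 8..11
buffer = [float(i) for i in range(10)]
offsets, mask = block_offsets_and_mask(pid=2, block_size=4, n_elements=10)
print(offsets)                             # [8, 9, 10, 11]
print(mask)                                # [True, True, False, False]
print(masked_load(buffer, offsets, mask))  # [8.0, 9.0, 0.0, 0.0]
```

Without the mask, offsets 10 and 11 would index past the end of the 10-element buffer.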

### 3. Blocked Programming
- Each program instance processes `BLOCK_SIZE` elements at once
- The Triton compiler decides how to map those elements to threads and how to coalesce the memory accesses

### 4. Memory Bandwidth
- Vector add is memory-bound: it performs one addition per 12 bytes moved, so DRAM bandwidth is the limit
- Formula for float32: `(2 reads + 1 write) * size * 4 bytes / time`
- T4 GPU theoretical peak memory bandwidth: ~320 GB/s
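
As a worked example of the formula, the sketch below converts a timing into effective bandwidth and compares it with the T4 figure quoted above. The function name `effective_bandwidth_gbps` and the 1.0 ms timing are hypothetical, chosen only to illustrate the arithmetic; they are not measured results.

```python
T4_PEAK_GBPS = 320.0  # approximate theoretical peak quoted in this section


def effective_bandwidth_gbps(n_elements, bytes_per_element, time_ms):
    """Effective bandwidth of a vector add: 2 reads + 1 write per element."""
    total_bytes = 3 * n_elements * bytes_per_element
    seconds = time_ms * 1e-3
    return total_bytes / seconds / 1e9


# Hypothetical timing: 2**24 float32 elements processed in 1.0 ms
gbps = effective_bandwidth_gbps(2**24, 4, 1.0)
print(f"{gbps:.1f} GB/s")                       # 201.3 GB/s
print(f"{gbps / T4_PEAK_GBPS:.0%} of T4 peak")  # 63% of T4 peak
```

A longer time yields a lower bandwidth, which is why the benchmark pairs the slowest quantile with the lower bound of the plotted band.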

---

## Exercises

1. **Different Block Sizes**: Modify `BLOCK_SIZE` to 256, 512, 2048
   - How does it affect performance?
   - What's the optimal block size?

2. **Vector Multiplication**: Create a `multiply_kernel` for element-wise multiplication

3. **Fused Operations**: Implement `output = (x + y) * 2` in a single kernel
   - Compare with two separate operations

4. **Multiple Vectors**: Extend to `output = x + y + z`

---

## Key Takeaways

1. **Triton can reach performance close to CUDA while kernels are written in Python**
2. **Memory-bound operations achieve similar performance across implementations**
3. **Triton automatically handles memory coalescing and shared memory management within a program**
4. **Block-based programming model is more intuitive for many algorithms**
5. **Built-in benchmarking tools make performance analysis easy**

---

## Next Steps

Continue to **02_triton_softmax.ipynb** to learn more advanced Triton programming with the softmax operation!

---

## Notes

*Use this space for your learning notes:*


