# PyTorch CUDA Extensions - Custom Polynomial Activation

**FreeCodeCamp CUDA Course - Module 9: PyTorch Extensions**

Original Course: [https://www.youtube.com/watch?v=86FAWCzIe_4](https://www.youtube.com/watch?v=86FAWCzIe_4)
Source Files: `polynomial_cuda.cu`, `polynomial_activation.py`

---

## Overview

Learn how to integrate custom CUDA kernels with PyTorch. This lets you write high-performance operations while keeping PyTorch's ease of use. Automatic differentiation also works with custom kernels, provided you supply a backward pass.

---

## Learning Objectives

By the end of this notebook, you will:

1. Understand PyTorch's C++/CUDA extension mechanism
2. Write CUDA kernels that integrate with PyTorch tensors
3. Use PyBind11 to expose C++/CUDA functions to Python
4. Compare custom CUDA operations with PyTorch built-ins
5. Understand when to write custom CUDA extensions

---

## Why Custom CUDA Extensions?

### Use Cases
1. **Novel Operations**: Operations not available in PyTorch
2. **Fused Kernels**: Combine multiple operations for efficiency
3. **Specialized Algorithms**: Domain-specific optimizations
4. **Research**: Implementing cutting-edge algorithms

### Trade-offs
- ✅ Maximum performance control
- ✅ Can implement any GPU algorithm
- ❌ More complex than pure PyTorch
- ❌ Compilation required
- ❌ Platform-specific code

---

## Setup

In [None]:
# Check CUDA availability
!nvidia-smi

In [None]:
# Install required tools
!pip install torch ninja -q

---

## Example: Polynomial Activation Function

We'll implement a custom activation: $f(x) = x^2 + x + 1$

### CUDA Kernel Implementation

In [None]:
%%writefile polynomial_cuda_kernel.cu
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

// CUDA kernel for polynomial activation: f(x) = x^2 + x + 1
template <typename scalar_t>
__global__ void polynomial_activation_cuda_kernel(
    const scalar_t* __restrict__ input,
    scalar_t* __restrict__ output,
    size_t size) {
    
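    // Use size_t so the index matches `size` and cannot overflow a 32-bit int
    const size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    
    if (idx < size) {
        const scalar_t x = input[idx];
        // scalar_t(1) keeps float math in float (a 1.0 literal is a double)
        output[idx] = x * x + x + scalar_t(1);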
    }
}

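// C++ wrapper that dispatches to CUDA kernel
torch::Tensor polynomial_activation_cuda(torch::Tensor input) {
    // The kernel indexes raw memory, so require a contiguous CUDA tensor
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");

    // Create output tensor with same shape and type as input
    // (empty_like is enough because the kernel writes every element)
    auto output = torch::empty_like(input);
    if (input.numel() == 0) {
        return output;  // a launch with zero blocks is an invalid configuration
    }
    
    const int threads = 256;
    // Ceil-divide so that every element gets a thread
    const int blocks = (input.numel() + threads - 1) / threads;
    
    // Launch kernel with proper type dispatching (float and double)
    AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "polynomial_activation_cuda", ([&] {
        polynomial_activation_cuda_kernel<scalar_t><<<blocks, threads>>>(
            input.data_ptr<scalar_t>(),
            output.data_ptr<scalar_t>(),
            input.numel()
        );
    }));

    // Surface launch errors immediately instead of at a later CUDA call
    const cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, "kernel launch failed: ", cudaGetErrorString(err));
    
    return output;
}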

// PyBind11 bindings to expose to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("polynomial_activation", &polynomial_activation_cuda, 
          "Polynomial activation function (CUDA)");
}
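
The launch configuration in the wrapper above (256 threads per block, with the block count rounded up) and the `idx < size` guard in the kernel work together. The following is a minimal pure-Python sketch of that index arithmetic, not GPU code: the names `launch_config` and `simulate_kernel` are hypothetical, and the nested loops only imitate what the hardware runs in parallel.

```python
def launch_config(numel, threads_per_block=256):
    """Ceil-divide so that blocks * threads_per_block >= numel."""
    blocks = (numel + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def simulate_kernel(inputs, threads_per_block=256):
    """Sequentially imitate the grid: one loop iteration per GPU thread."""
    size = len(inputs)
    blocks, threads = launch_config(size, threads_per_block)
    output = [None] * size
    idle_threads = 0
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            # Same formula as blockIdx.x * blockDim.x + threadIdx.x
            idx = block_idx * threads + thread_idx
            if idx < size:
                x = inputs[idx]
                output[idx] = x * x + x + 1
            else:
                # Threads past the end of the tensor must do nothing
                idle_threads += 1
    return output, blocks, idle_threads


out, blocks, idle = simulate_kernel([float(i) for i in range(1000)])
print(f"blocks={blocks}, idle threads in last block={idle}")
```

With 1000 elements the grid has 4 blocks (1024 threads), so 24 threads fall past the end of the data. Without the bounds check those threads would read and write out of range.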

### Compile the Extension

`torch.utils.cpp_extension.load` compiles the extension just in time (using Ninja, installed above) and imports the result as a Python module:

In [None]:
from torch.utils.cpp_extension import load

# Compile the CUDA extension
# This may take a minute the first time
polynomial_cuda = load(
    name='polynomial_cuda',
    sources=['polynomial_cuda_kernel.cu'],
    extra_cuda_cflags=['-O2'],
    verbose=True
)

print("✓ CUDA extension compiled successfully!")

---

## PyTorch Integration

Now let's create a PyTorch module that uses our custom CUDA kernel:

In [None]:
import torch
import torch.nn as nn
import time

class CUDAPolynomialActivation(torch.autograd.Function):
    """PyTorch autograd Function wrapper for CUDA kernel."""
    
    @staticmethod
    def forward(ctx, x):
        return polynomial_cuda.polynomial_activation(x)

    @staticmethod
    def backward(ctx, grad_output):
        # For a complete implementation, you'd implement the derivative:
        # df/dx = 2x + 1
        # For this demo, we'll leave it unimplemented
        raise NotImplementedError("Backward pass not implemented")


class PolynomialActivation(nn.Module):
    """PyTorch Module supporting both PyTorch and CUDA implementations."""
    
    def __init__(self, implementation='pytorch'):
        super().__init__()
        self.implementation = implementation

    def forward(self, x):
        if self.implementation == 'pytorch':
            # Pure PyTorch implementation
            return x**2 + x + 1
        elif self.implementation == 'cuda':
            # Custom CUDA implementation
            return CUDAPolynomialActivation.apply(x)
        else:
            raise ValueError(f"Unknown implementation: {self.implementation}")


# Test correctness
torch.manual_seed(0)
x = torch.randn(1000, device='cuda')

pytorch_act = PolynomialActivation(implementation='pytorch').cuda()
cuda_act = PolynomialActivation(implementation='cuda').cuda()

pytorch_out = pytorch_act(x)
cuda_out = cuda_act(x)

# Verify results match
print("Correctness Check:")
print(f"Max difference: {torch.max(torch.abs(pytorch_out - cuda_out))}")
print(f"Results match: {torch.allclose(pytorch_out, cuda_out)}")

# Show sample outputs
print(f"\nSample input: {x[:5]}")
print(f"PyTorch output: {pytorch_out[:5]}")
print(f"CUDA output: {cuda_out[:5]}")
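
The `backward` method of `CUDAPolynomialActivation` is left unimplemented above, so the CUDA path cannot be used for training as written. Below is a minimal sketch of what the gradient logic could look like, assuming a pure-PyTorch stand-in for the forward pass so that it runs on CPU. The class name `PolynomialActivationFn` is hypothetical; a CUDA-backed version would call the extension in `forward` and could reuse the same `backward`.

```python
import torch


class PolynomialActivationFn(torch.autograd.Function):
    """Hypothetical stand-in: f(x) = x^2 + x + 1 with an explicit gradient."""

    @staticmethod
    def forward(ctx, x):
        # Save the input because the derivative 2x + 1 depends on it
        ctx.save_for_backward(x)
        return x * x + x + 1

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/df * df/dx, with df/dx = 2x + 1
        return grad_output * (2 * x + 1)


# gradcheck compares the analytic gradient against finite differences.
# It expects double-precision inputs that require grad.
x = torch.randn(8, dtype=torch.double, requires_grad=True)
gradcheck_passed = torch.autograd.gradcheck(PolynomialActivationFn.apply, (x,))
print(f"gradcheck passed: {gradcheck_passed}")
```

Saving `x` with `ctx.save_for_backward` is the key design choice: the gradient of this activation depends on the input, not on the output.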

---

## Performance Benchmarking

In [None]:
def benchmark(func, x, name, num_runs=1000):
    """Benchmark a function."""
    # Warmup
    for _ in range(10):
        func(x)
    torch.cuda.synchronize()
    
    # Benchmark (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    for _ in range(num_runs):
        func(x)
    torch.cuda.synchronize()
    end_time = time.perf_counter()
    
    avg_time = (end_time - start_time) / num_runs * 1000
    return f"{name}: {avg_time:.4f} ms"


# Benchmark with different sizes
for size in [10_000, 100_000, 1_000_000, 10_000_000]:
    print(f"\nBenchmark with {size:,} elements:")
    x = torch.randn(size, device='cuda')
    
    pytorch_time = benchmark(pytorch_act, x, "PyTorch")
    cuda_time = benchmark(cuda_act, x, "CUDA   ")
    
    print(pytorch_time)
    print(cuda_time)
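
Raw milliseconds are hard to compare across tensor sizes. For an element-wise kernel like this one, which is limited by memory traffic rather than arithmetic, effective bandwidth is often a more telling number. The sketch below is one way to compute it, assuming `float32` data and a kernel that reads and writes each element exactly once. The helper name `effective_bandwidth_gbs` is hypothetical, and it takes the average time as a number (the `benchmark` function above returns a formatted string).

```python
def effective_bandwidth_gbs(num_elements, avg_time_ms, bytes_per_element=4):
    """Effective bandwidth in GB/s for one read plus one write per element."""
    bytes_moved = 2 * num_elements * bytes_per_element
    seconds = avg_time_ms / 1000.0
    return bytes_moved / seconds / 1e9


# Hypothetical timing, for illustration only (not a measured result)
example_gbs = effective_bandwidth_gbs(10_000_000, avg_time_ms=0.1)
print(f"{example_gbs:.0f} GB/s")
```

Comparing this figure with the peak memory bandwidth of your GPU indicates how much headroom the kernel has left.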

---

## Understanding the Code

### 1. CUDA Kernel Template
```cpp
template <typename scalar_t>
__global__ void polynomial_activation_cuda_kernel(...) {
```
- Template allows support for `float`, `double`, etc.
- PyTorch dispatches to correct type at runtime

### 2. `__restrict__` Keyword
```cpp
const scalar_t* __restrict__ input
```
- Tells compiler: pointers don't alias (don't overlap)
- Enables more aggressive optimizations

### 3. Type Dispatching
```cpp
AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "name", ([&] {
    // Code that uses scalar_t
}));
```
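- PyTorch macro that instantiates the lambda once per supported type (here `float` and `double`) and binds `scalar_t` to the type that matches the tensor at runtime
- Ensures type safety: unsupported dtypes raise an error instead of being misread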
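
As a rough analogy only (this is not PyTorch API), the dispatch macro behaves like a lookup on the runtime dtype that selects a type-specialized body and rejects everything else. The sketch below imitates that in plain Python. All names in it (`dispatch_floating_types`, `polynomial`, the dtype strings) are hypothetical, and the `struct` round trip merely emulates storing results at the chosen precision.

```python
import struct

# Hypothetical table: runtime dtype name -> struct format code
_SPECIALIZATIONS = {
    "float32": "f",  # 4-byte float
    "float64": "d",  # 8-byte double
}


def dispatch_floating_types(dtype_name, op_name, body):
    """Run body(fmt) for a supported dtype, otherwise raise an error."""
    try:
        fmt = _SPECIALIZATIONS[dtype_name]
    except KeyError:
        raise TypeError(f'"{op_name}" not implemented for {dtype_name!r}') from None
    return body(fmt)


def polynomial(values, fmt):
    """Compute x^2 + x + 1 and store the results at the given precision."""
    out = [x * x + x + 1 for x in values]
    packed = struct.pack(f"{len(out)}{fmt}", *out)
    return list(struct.unpack(f"{len(out)}{fmt}", packed))


result = dispatch_floating_types(
    "float32", "polynomial", lambda fmt: polynomial([0.5, 2.0], fmt)
)
print(result)
```

In the real macro the selection happens in C++: each branch is a separately compiled template instantiation, so there is no per-element overhead.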

### 4. PyBind11 Binding
```cpp
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("polynomial_activation", &polynomial_activation_cuda, "...");
}
```
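- Exposes the C++ function to Python under the name `polynomial_activation`
- Converts arguments and return values automatically (e.g., a Python `torch.Tensor` to a C++ `torch::Tensor`)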

---

## When Should You Write CUDA Extensions?

### ✅ Good Use Cases
- Custom operations not in PyTorch
- Fusing multiple operations to reduce memory bandwidth
- Specialized algorithms (e.g., sparse operations, custom attention)
- Research implementations
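
To make the fusion argument concrete, the sketch below counts how many elements travel to and from GPU memory for `x**2 + x + 1`. It assumes that eager PyTorch runs the expression as three separate kernels (power, add tensor, add scalar) with a temporary tensor after each, and that caches absorb none of the traffic. Both are simplifications, and the function names are hypothetical.

```python
def elements_moved_unfused(n):
    """Assumed eager-mode breakdown of x**2 + x + 1 into three kernels."""
    pow_kernel = n + n          # read x, write t1 = x**2
    add_x_kernel = 2 * n + n    # read t1 and x, write t2 = t1 + x
    add_one_kernel = n + n      # read t2, write result = t2 + 1
    return pow_kernel + add_x_kernel + add_one_kernel


def elements_moved_fused(n):
    """The custom kernel reads x once and writes the result once."""
    return n + n


n = 10_000_000
ratio = elements_moved_unfused(n) / elements_moved_fused(n)
print(f"unfused/fused memory traffic: {ratio:.1f}x")
```

Under these assumptions the fused kernel moves 3.5x less data. For a memory-bound operation, that ratio is a rough upper bound on the speedup fusion can deliver.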

### ❌ Avoid If
- The operation already exists in PyTorch (the built-in is probably faster)
- Can be implemented efficiently with existing PyTorch ops
- Development time > performance gain
- Need cross-platform support

### 💡 Alternative: Triton
- Easier to write than CUDA
- Still high performance
- Often a better starting point for new kernels

---

## Exercises

1. **Implement Backward Pass**: Add gradient computation
   - Derivative: $\frac{df}{dx} = 2x + 1$
   - Test with `torch.autograd.gradcheck`

2. **Different Polynomial**: Implement $f(x) = x^3 - 2x^2 + x$

3. **Fused Operation**: Combine polynomial activation with another operation
   - Example: $f(x) = \text{ReLU}(x^2 + x + 1)$
   - Compare with unfused version

4. **Element-wise Binary Operation**: Implement $(x + y)^2$
   - Takes two inputs
   - Compare with PyTorch: `(x + y) ** 2`

5. **Profiling**: Use Nsight Compute to analyze your kernel
   - Check memory bandwidth utilization
   - Look for optimization opportunities
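
For Exercise 2, it helps to have a reference to check a kernel against. One option, sketched here in plain Python, is to evaluate $x^3 - 2x^2 + x$ in Horner form, $((x - 2)x + 1)x$, which needs three multiplications instead of the four the expanded expression takes with repeated multiplication. The function names are hypothetical.

```python
import math


def poly_expanded(x):
    """Direct evaluation of x^3 - 2x^2 + x."""
    return x**3 - 2 * x**2 + x


def poly_horner(x):
    """Horner form: ((x - 2) * x + 1) * x."""
    return ((x - 2) * x + 1) * x


# Both forms should agree up to floating-point rounding
for x in [-1.5, 0.0, 0.25, 1.0, 3.0]:
    assert math.isclose(poly_expanded(x), poly_horner(x), abs_tol=1e-12)
print("Horner form matches the expanded polynomial")
```

The same rearrangement carries over directly to the body of a CUDA kernel.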

---

## Key Takeaways

1. **PyTorch extensions bridge Python and CUDA** seamlessly
2. **Template programming** enables type-generic kernels
3. **Just-in-time compilation** makes development faster
4. **Custom extensions are powerful** but add complexity
5. **Consider Triton first** for new implementations

---

## Additional Resources

- [PyTorch Custom C++ and CUDA Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html)
- [PyTorch C++ API Documentation](https://pytorch.org/cppdocs/)
- [Examples: PyTorch Extension Examples](https://github.com/pytorch/extension-cpp)

---

## Course Complete!

Congratulations on completing the FreeCodeCamp CUDA Course notebooks! You now have:

- ✅ CUDA programming fundamentals
- ✅ Memory optimization techniques
- ✅ Experience with CUDA libraries (cuBLAS, cuDNN)
- ✅ Knowledge of Triton for high-level GPU programming
- ✅ Skills to create PyTorch CUDA extensions

**Next Steps:**
- Build your own CUDA projects
- Optimize existing deep learning code
- Explore advanced topics (multi-GPU, tensor cores)
- Join CUDA/GPU programming communities

---

## Notes

*Use this space for your learning notes:*


