Learning GPU programming with Mojo through parallel scan.
This project demonstrates how to wrap Mojo kernels and expose them to PyTorch through custom operations. It implements parallel prefix sum (scan) algorithms as an educational example.
Run tests: pixi run test-all-wrappers (this also installs dependencies if needed)
Features:

- Single-block and multi-block parallel prefix sum implementations in Mojo
- PyTorch wrapper functions using MAX's CustomOpLibrary
- CUDA and ROCm support through Pixi environments
- Test suite comparing results against NumPy reference implementation
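Since the tests compare the GPU kernels against a reference prefix sum, it helps to see the algorithm those kernels typically implement. Below is a hedged, pure-Python sketch of the Blelloch (work-efficient) exclusive scan, the classic up-sweep/down-sweep scheme used in GPU scan kernels; it illustrates the algorithm only and is not the project's Mojo code (function name and power-of-two restriction are assumptions for this sketch).

```python
def blelloch_exclusive_scan(data):
    """Exclusive prefix sum via up-sweep/down-sweep.

    Illustrative sketch: input length must be a power of two,
    mirroring the per-block restriction common in GPU scan kernels.
    """
    n = len(data)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    tree = list(data)

    # Up-sweep (reduce): build partial sums in place, doubling the stride.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            tree[i] += tree[i - step]
        step *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    tree[n - 1] = 0
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left = tree[i - step]
            tree[i - step] = tree[i]
            tree[i] += left
        step //= 2
    return tree

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# → [0, 3, 4, 11, 11, 15, 16, 22]
```

On a GPU, each `for` loop above becomes one parallel step over threads, which is why the algorithm does O(n) total work in O(log n) steps; a multi-block variant scans each block independently and then adds a scan of the per-block sums.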
Requirements:

- Python 3.12
- Mojo
- CUDA 12.x or ROCm 6.3
- PyTorch 2.7.1
Project structure:

- op/ : Mojo kernel implementations
- wrappers.py : PyTorch wrapper functions for Mojo kernels
- test_wrappers.py : Test suite for kernel implementations