
feat(native): Add C++ CUDA backend with pybind11 bindings #24

Merged
m96-chan merged 14 commits into main from feature/native-backend
Dec 11, 2025

Conversation

@m96-chan (Owner)

Summary

  • Add native C++ CUDA backend using CUDA Runtime/Driver API and NVRTC
  • Implement pybind11 bindings to expose C++ functionality to Python
  • Replace cuda-python dependency with native C++ module

Changes

New native/ directory

  • core/: Device management, GPUArray, Stream using CUDA Runtime API
  • jit/: NVRTC compiler and JITKernel for runtime CUDA kernel compilation
  • ops/: CUDA kernels for add, mul, matmul operations
  • bindings/: pybind11 bindings exposing C++ to Python
  • CMakeLists.txt: CMake build configuration

Python layer updates

  • backend.py: Replace CUDABackend with NativeBackend that loads native module
  • compiler.py: Use native JITKernel for NVRTC compilation
  • basic.py: Use native ops when native backend is available
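The "use native ops when the native backend is available" dispatch can be sketched as follows. This is an illustrative sketch, not the project's exact code: `_native` stands in for the `_pygpukit_native` pybind11 module and is left unloaded here to show the CPU-simulation fallback path.

```python
import numpy as np

# Sketch of the dispatch pattern described above. `_native` would hold the
# imported _pygpukit_native module once the C++ backend is built; set to None
# here so the NumPy fallback path runs.
_native = None

def add(a, b):
    """Elementwise add: native CUDA kernel when loaded, NumPy otherwise."""
    if _native is not None:
        return _native.add(a, b)
    return np.add(a, b)

print(add(np.array([1.0, 2.0]), np.array([3.0, 4.0])).tolist())
```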

Architecture

Python API (pygpukit)
    │
    ├── CPU Simulation Backend (testing/fallback)
    │
    └── Native Backend (pybind11)
            │
            └── C++ Module (_pygpukit_native)
                    │
                    ├── core/ - CUDA Runtime API
                    ├── jit/  - NVRTC
                    └── ops/  - CUDA Kernels

Test plan

  • All 73 Python tests pass with CPU simulation backend
  • Build native module with CMake + CUDA Toolkit
  • Test on GPU hardware

Build Requirements

  • CUDA Toolkit 11.0+
  • CMake 3.18+
  • pybind11

Build steps:

cd native
mkdir build && cd build
cmake ..
make
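After building, a quick smoke check can confirm which backend will be selected. The module name `_pygpukit_native` comes from the bindings described above; the selection logic here is a sketch, not the project's actual loader.

```python
# Minimal smoke check: try to import the compiled pybind11 module and report
# which backend would be used. Falls back to CPU simulation when the native
# module is absent or was built against a different Python/CUDA setup.
def detect_backend() -> str:
    try:
        import _pygpukit_native  # noqa: F401  (produced by the CMake build above)
        return "native"
    except ImportError:
        return "cpu-sim"

print(detect_backend())
```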

🤖 Generated with Claude Code

m96-chan and others added 14 commits December 11, 2025 23:45
- Add native/ directory with C++ CUDA backend implementation
- core/: device management, GPUArray, Stream (CUDA Runtime API)
- jit/: NVRTC compiler and JITKernel for runtime compilation
- ops/: CUDA kernels for add, mul, matmul operations
- bindings/: pybind11 bindings exposing C++ to Python
- CMakeLists.txt for building native module

Update Python layer to use native module:
- backend.py: Replace CUDABackend with NativeBackend
- compiler.py: Use native JITKernel for NVRTC compilation
- basic.py: Use native ops when available

All 73 Python tests pass (using CPU simulation fallback).
Native module requires CUDA Toolkit and pybind11 to build.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Build single pybind11 module instead of separate libraries
- Set CUDA architecture to SM 86 (RTX 3090 Ti)
- Disable native tests (use Python tests)
- Add GPU demo script

Tested on RTX 3090 Ti:
- CUDA 12.4, NVRTC 12.4
- All operations working (add, mul, matmul, JIT)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance optimizations:
- GPUArray now wraps native C++ GPUArray directly
- Operations (add, mul, matmul) use native arrays without H<->D copies
- Factory functions (zeros, ones, from_numpy) create native arrays
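The zero-copy flow above can be illustrated with a CPU stand-in that mirrors the described interface. `GPUArray`, `matmul`, and `to_numpy` are the names this PR uses; the internals here are NumPy rather than device memory, so treat this as a sketch of the call pattern only.

```python
import numpy as np

# CPU stand-in for the native GPUArray, illustrating the zero-copy flow:
# chained operations pass handles around, and to_numpy() is the single
# explicit device-to-host transfer. The real class holds a CUDA pointer.
class GPUArray:
    def __init__(self, data):
        self._data = np.asarray(data, dtype=np.float32)

    def matmul(self, other: "GPUArray") -> "GPUArray":
        # Real backend: launches the native matmul kernel; result stays on device.
        return GPUArray(self._data @ other._data)

    def to_numpy(self) -> np.ndarray:
        # Real backend: the one D->H copy at the end of the pipeline.
        return self._data.copy()

a = GPUArray(np.eye(2))
b = GPUArray([[1.0, 2.0], [3.0, 4.0]])
print(a.matmul(b).to_numpy().tolist())
```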

CI/CD improvements:
- Add scikit-build-core for CMake integration
- Update release.yml with CUDA wheel builds for Linux/Windows
- Support multiple CUDA architectures (SM 70-90)

Results on RTX 3090 Ti (2048x2048 matmul):
- GPU compute: 7.98ms (2152 GFLOPS)
- 2.55x speedup vs CPU (compute only)
- Transfer overhead: 57% (still room for improvement)
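As a sanity check on the figures above: an N x N matmul costs 2N^3 floating-point operations, so 7.98 ms at N = 2048 works out to roughly 2152 GFLOPS, matching the reported number.

```python
# Verify the reported throughput: 2*N^3 FLOPs for an N x N matmul,
# divided by the measured 7.98 ms of GPU compute time.
N = 2048
flops = 2 * N**3                 # 17_179_869_184 FLOPs
gflops = flops / 7.98e-3 / 1e9
print(int(gflops))
```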

All 73 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- demo_gpu.py: Native C++ backend demo
- demo_optimized.py: Zero-copy performance comparison
- demo_v01.py: Basic v0.1 feature demo
- README.md: Build and usage instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive rules for Claude Code generation:
- PyGPUkit does NOT depend on cuda-python
- GPU init via CUDA Driver/Runtime API only
- NVRTC JIT for all kernels (no precompiled binaries)
- PTX JIT for CUDA version compatibility
- Correct error messages (no "install cuda-python")
- CPU fallback as fully supported backend

This ensures AI assistants generate correct code for PyGPUkit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
scikit-build-core requires CUDA toolkit at install time, which breaks
CI on runners without CUDA. Switch to hatchling for pure-Python builds.
Native module can still be built manually or via cibuildwheel in release.yml.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused has_native_module imports
- Remove unused GPUArray import in compiler.py
- Add pre-commit config with ruff linter and formatter
- Add pre-commit to dev dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix native module import pattern to avoid no-redef error
- Fix CUDABackend -> NativeBackend reference in device.py
- Add explicit type annotations for Any returns
- Add mypy hook to pre-commit config with numpy/psutil deps
- Format code with ruff

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dedicated lint job (ruff + mypy) running once on Python 3.11
- Test job now only runs tests, depends on lint passing
- Reduces redundant lint/mypy runs from 8x to 1x

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add cmake-check job that runs on both Linux and Windows
- Install CUDA Toolkit via Jimver/cuda-toolkit action
- Configure and build native module to catch CMake breakages early
- Build job now depends on both test and cmake-check passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CUDA 12.4 sub-packages are not available on Ubuntu 24.04.
Use 12.6 with a full toolkit install instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Windows CUDA installer hangs in CI. A Linux-only cmake-check is
sufficient to catch CMake/C++ breakages; Windows builds are
verified in the release workflow with cibuildwheel.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Switch build-windows job to self-hosted runner with CUDA pre-installed
- Remove CUDA Toolkit installation step (already available on runner)
- Labels: [self-hosted, Windows, X64, cuda]

Only release.yml uses self-hosted (triggered by tag push only).
PRs use GitHub-hosted runners for safety.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fork PRs require maintainer approval before CI runs.
Configure in: Settings > Actions > General > Fork pull request workflows
Select: "Require approval for all outside collaborators"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan m96-chan merged commit b2fab0b into main Dec 11, 2025
1 of 11 checks passed