
feat(native): Add C++ CUDA backend with pybind11 bindings #24

Merged
m96-chan merged 14 commits into main from feature/native-backend
Dec 11, 2025

Conversation

@m96-chan (Owner)

Summary

  • Add native C++ CUDA backend using CUDA Runtime/Driver API and NVRTC
  • Implement pybind11 bindings to expose C++ functionality to Python
  • Replace cuda-python dependency with native C++ module

Changes

New native/ directory

  • core/: Device management, GPUArray, Stream using CUDA Runtime API
  • jit/: NVRTC compiler and JITKernel for runtime CUDA kernel compilation
  • ops/: CUDA kernels for add, mul, matmul operations
  • bindings/: pybind11 bindings exposing C++ to Python
  • CMakeLists.txt: CMake build configuration

Python layer updates

  • backend.py: Replace CUDABackend with NativeBackend that loads native module
  • compiler.py: Use native JITKernel for NVRTC compilation
  • basic.py: Use native ops when native backend is available
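The "use native ops when the native backend is available" dispatch can be sketched as follows. This is an illustrative sketch, not the project's exact code: `_native` stands in for the `_pygpukit_native` pybind11 module and is left unloaded here to show the CPU-simulation fallback path.

```python
import numpy as np

# Sketch of the dispatch pattern described above. `_native` would hold the
# imported _pygpukit_native module once the C++ backend is built; set to None
# here so the NumPy fallback path runs.
_native = None

def add(a, b):
    """Elementwise add: native CUDA kernel when loaded, NumPy otherwise."""
    if _native is not None:
        return _native.add(a, b)
    return np.add(a, b)

print(add(np.array([1.0, 2.0]), np.array([3.0, 4.0])).tolist())
```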

Architecture

Python API (pygpukit)
    │
    ├── CPU Simulation Backend (testing/fallback)
    │
    └── Native Backend (pybind11)
            │
            └── C++ Module (_pygpukit_native)
                    │
                    ├── core/ - CUDA Runtime API
                    ├── jit/  - NVRTC
                    └── ops/  - CUDA Kernels

Test plan

  • All 73 Python tests pass with CPU simulation backend
  • Build native module with CMake + CUDA Toolkit
  • Test on GPU hardware

Build Requirements

  • CUDA Toolkit 11.0+
  • CMake 3.18+
  • pybind11

Build steps:

cd native
mkdir build && cd build
cmake ..
make
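After building, a quick smoke check can confirm which backend will be selected. The module name `_pygpukit_native` comes from the bindings described above; the selection logic here is a sketch, not the project's actual loader.

```python
# Minimal smoke check: try to import the compiled pybind11 module and report
# which backend would be used. Falls back to CPU simulation when the native
# module is absent or was built against a different Python/CUDA setup.
def detect_backend() -> str:
    try:
        import _pygpukit_native  # noqa: F401  (produced by the CMake build above)
        return "native"
    except ImportError:
        return "cpu-sim"

print(detect_backend())
```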

🤖 Generated with Claude Code

m96-chan and others added 14 commits December 11, 2025 23:45
- Add native/ directory with C++ CUDA backend implementation
- core/: device management, GPUArray, Stream (CUDA Runtime API)
- jit/: NVRTC compiler and JITKernel for runtime compilation
- ops/: CUDA kernels for add, mul, matmul operations
- bindings/: pybind11 bindings exposing C++ to Python
- CMakeLists.txt for building native module

Update Python layer to use native module:
- backend.py: Replace CUDABackend with NativeBackend
- compiler.py: Use native JITKernel for NVRTC compilation
- basic.py: Use native ops when available

All 73 Python tests pass (using CPU simulation fallback).
Native module requires CUDA Toolkit and pybind11 to build.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Build single pybind11 module instead of separate libraries
- Set CUDA architecture to SM 86 (RTX 3090 Ti)
- Disable native tests (use Python tests)
- Add GPU demo script

Tested on RTX 3090 Ti:
- CUDA 12.4, NVRTC 12.4
- All operations working (add, mul, matmul, JIT)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance optimizations:
- GPUArray now wraps native C++ GPUArray directly
- Operations (add, mul, matmul) use native arrays without H<->D copies
- Factory functions (zeros, ones, from_numpy) create native arrays
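The zero-copy flow above can be illustrated with a CPU stand-in that mirrors the described interface. `GPUArray`, `matmul`, and `to_numpy` are the names this PR uses; the internals here are NumPy rather than device memory, so treat this as a sketch of the call pattern only.

```python
import numpy as np

# CPU stand-in for the native GPUArray, illustrating the zero-copy flow:
# chained operations pass handles around, and to_numpy() is the single
# explicit device-to-host transfer. The real class holds a CUDA pointer.
class GPUArray:
    def __init__(self, data):
        self._data = np.asarray(data, dtype=np.float32)

    def matmul(self, other: "GPUArray") -> "GPUArray":
        # Real backend: launches the native matmul kernel; result stays on device.
        return GPUArray(self._data @ other._data)

    def to_numpy(self) -> np.ndarray:
        # Real backend: the one D->H copy at the end of the pipeline.
        return self._data.copy()

a = GPUArray(np.eye(2))
b = GPUArray([[1.0, 2.0], [3.0, 4.0]])
print(a.matmul(b).to_numpy().tolist())
```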

CI/CD improvements:
- Add scikit-build-core for CMake integration
- Update release.yml with CUDA wheel builds for Linux/Windows
- Support multiple CUDA architectures (SM 70-90)

Results on RTX 3090 Ti (2048x2048 matmul):
- GPU compute: 7.98ms (2152 GFLOPS)
- 2.55x speedup vs CPU (compute only)
- Transfer overhead: 57% (still room for improvement)
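As a sanity check on the figures above: an N x N matmul costs 2N^3 floating-point operations, so 7.98 ms at N = 2048 works out to roughly 2152 GFLOPS, matching the reported number.

```python
# Verify the reported throughput: 2*N^3 FLOPs for an N x N matmul,
# divided by the measured 7.98 ms of GPU compute time.
N = 2048
flops = 2 * N**3                 # 17_179_869_184 FLOPs
gflops = flops / 7.98e-3 / 1e9
print(int(gflops))
```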

All 73 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- demo_gpu.py: Native C++ backend demo
- demo_optimized.py: Zero-copy performance comparison
- demo_v01.py: Basic v0.1 feature demo
- README.md: Build and usage instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive rules for Claude Code generation:
- PyGPUkit does NOT depend on cuda-python
- GPU init via CUDA Driver/Runtime API only
- NVRTC JIT for all kernels (no precompiled binaries)
- PTX JIT for CUDA version compatibility
- Correct error messages (no "install cuda-python")
- CPU fallback as fully supported backend

This ensures AI assistants generate correct code for PyGPUkit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
scikit-build-core requires CUDA toolkit at install time, which breaks
CI on runners without CUDA. Switch to hatchling for pure-Python builds.
Native module can still be built manually or via cibuildwheel in release.yml.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused has_native_module imports
- Remove unused GPUArray import in compiler.py
- Add pre-commit config with ruff linter and formatter
- Add pre-commit to dev dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix native module import pattern to avoid no-redef error
- Fix CUDABackend -> NativeBackend reference in device.py
- Add explicit type annotations for Any returns
- Add mypy hook to pre-commit config with numpy/psutil deps
- Format code with ruff

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dedicated lint job (ruff + mypy) running once on Python 3.11
- Test job now only runs tests, depends on lint passing
- Reduces redundant lint/mypy runs from 8x to 1x

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add cmake-check job that runs on both Linux and Windows
- Install CUDA Toolkit via Jimver/cuda-toolkit action
- Configure and build native module to catch CMake breakages early
- Build job now depends on both test and cmake-check passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CUDA 12.4 sub-packages are not available on Ubuntu 24.04.
Use 12.6 with a full toolkit install instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Windows CUDA installer hangs in CI. A Linux-only cmake-check is
sufficient to catch CMake/C++ breakages; Windows builds are
verified in the release workflow with cibuildwheel.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Switch build-windows job to self-hosted runner with CUDA pre-installed
- Remove CUDA Toolkit installation step (already available on runner)
- Labels: [self-hosted, Windows, X64, cuda]

Only release.yml uses self-hosted (triggered by tag push only).
PRs use GitHub-hosted runners for safety.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fork PRs require maintainer approval before CI runs.
Configure in: Settings > Actions > General > Fork pull request workflows
Select: "Require approval for all outside collaborators"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan m96-chan merged commit b2fab0b into main Dec 11, 2025
1 of 11 checks passed