feat(native): Add C++ CUDA backend with pybind11 bindings #24
Merged
- Add native/ directory with C++ CUDA backend implementation
  - core/: device management, GPUArray, Stream (CUDA Runtime API)
  - jit/: NVRTC compiler and JITKernel for runtime compilation
  - ops/: CUDA kernels for add, mul, matmul operations
  - bindings/: pybind11 bindings exposing C++ to Python
  - CMakeLists.txt for building native module

Update Python layer to use native module:
- backend.py: Replace CUDABackend with NativeBackend
- compiler.py: Use native JITKernel for NVRTC compilation
- basic.py: Use native ops when available

All 73 Python tests pass (using CPU simulation fallback).
Native module requires CUDA Toolkit and pybind11 to build.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

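The "use native ops when available" behaviour with a CPU fallback can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from this PR: `CPUFallbackOps`, `select_ops`, and `FakeNativeOps` are hypothetical names, and plain Python lists stand in for GPU arrays.

```python
from typing import Any, List, Optional


class CPUFallbackOps:
    """Hypothetical CPU-simulation ops used when no native module is loaded."""

    def add(self, a: List[float], b: List[float]) -> List[float]:
        if len(a) != len(b):
            raise ValueError("operands must have the same length")
        return [x + y for x, y in zip(a, b)]

    def mul(self, a: List[float], b: List[float]) -> List[float]:
        if len(a) != len(b):
            raise ValueError("operands must have the same length")
        return [x * y for x, y in zip(a, b)]


def select_ops(native_ops: Optional[Any]) -> Any:
    """Prefer the native ops object if one was loaded, else fall back to CPU."""
    return native_ops if native_ops is not None else CPUFallbackOps()


class FakeNativeOps:
    """Stand-in for a compiled extension, used only to demonstrate dispatch."""

    def add(self, a: List[float], b: List[float]) -> str:
        return "native add called"


# No native module available: the CPU fallback handles the call.
cpu_result = select_ops(None).add([1.0, 2.0], [3.0, 4.0])

# Native module available: calls are forwarded to it unchanged.
native_result = select_ops(FakeNativeOps()).add([1.0], [2.0])
```

Keeping the selection in one place means the operation functions themselves do not need to know which backend is active.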
- Build single pybind11 module instead of separate libraries
- Set CUDA architecture to SM 86 (RTX 3090 Ti)
- Disable native tests (use Python tests)
- Add GPU demo script

Tested on RTX 3090 Ti:
- CUDA 12.4, NVRTC 12.4
- All operations working (add, mul, matmul, JIT)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Performance optimizations:
- GPUArray now wraps native C++ GPUArray directly
- Operations (add, mul, matmul) use native arrays without H<->D copies
- Factory functions (zeros, ones, from_numpy) create native arrays

CI/CD improvements:
- Add scikit-build-core for CMake integration
- Update release.yml with CUDA wheel builds for Linux/Windows
- Support multiple CUDA architectures (SM 70-90)

Results on RTX 3090 Ti (2048x2048 matmul):
- GPU compute: 7.98ms (2152 GFLOPS)
- 2.55x speedup vs CPU (compute only)
- Transfer overhead: 57% (still room for improvement)

All 73 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

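The reported throughput can be checked with a short calculation. The sketch below assumes the usual convention of 2·n³ floating-point operations for an n×n matmul (one multiply and one add per inner-product term); the commit does not state which convention was used. The CPU time is only what the 2.55x figure implies, not a number reported in the commit.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n matmul, assuming 2 * n**3 floating-point operations."""
    flops = 2 * n**3
    return flops / seconds / 1e9


# 2048x2048 matmul in 7.98 ms, as reported for the RTX 3090 Ti.
gflops = matmul_gflops(2048, 7.98e-3)

# CPU compute time implied by the reported 2.55x compute-only speedup.
implied_cpu_ms = 7.98 * 2.55

print(f"{gflops:.1f} GFLOPS, implied CPU time about {implied_cpu_ms:.1f} ms")
```

Under that convention the result is about 2152.9 GFLOPS, consistent with the reported 2152 GFLOPS.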
- demo_gpu.py: Native C++ backend demo
- demo_optimized.py: Zero-copy performance comparison
- demo_v01.py: Basic v0.1 feature demo
- README.md: Build and usage instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add comprehensive rules for Claude Code generation:
- PyGPUkit does NOT depend on cuda-python
- GPU init via CUDA Driver/Runtime API only
- NVRTC JIT for all kernels (no precompiled binaries)
- PTX JIT for CUDA version compatibility
- Correct error messages (no "install cuda-python")
- CPU fallback as fully supported backend

This ensures AI assistants generate correct code for PyGPUkit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

scikit-build-core requires CUDA toolkit at install time, which breaks CI on runners without CUDA. Switch to hatchling for pure-Python builds.

Native module can still be built manually or via cibuildwheel in release.yml.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused has_native_module imports
- Remove unused GPUArray import in compiler.py
- Add pre-commit config with ruff linter and formatter
- Add pre-commit to dev dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix native module import pattern to avoid no-redef error
- Fix CUDABackend -> NativeBackend reference in device.py
- Add explicit type annotations for Any returns
- Add mypy hook to pre-commit config with numpy/psutil deps
- Format code with ruff

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add dedicated lint job (ruff + mypy) running once on Python 3.11
- Test job now only runs tests, depends on lint passing
- Reduces redundant lint/mypy runs from 8x to 1x

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add cmake-check job that runs on both Linux and Windows
- Install CUDA Toolkit via Jimver/cuda-toolkit action
- Configure and build native module to catch CMake breakages early
- Build job now depends on both test and cmake-check passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

CUDA 12.4 sub-packages not available on Ubuntu 24.04. Use 12.6 with full toolkit install instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Windows CUDA installer hangs in CI. Linux-only cmake-check is sufficient to catch CMake/C++ breakages. Windows builds are verified in release workflow with cibuildwheel.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Switch build-windows job to self-hosted runner with CUDA pre-installed
- Remove CUDA Toolkit installation step (already available on runner)
- Labels: [self-hosted, Windows, X64, cuda]

Only release.yml uses self-hosted (triggered by tag push only). PRs use GitHub-hosted runners for safety.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fork PRs require maintainer approval before CI runs.

Configure in: Settings > Actions > General > Fork pull request workflows
Select: "Require approval for all outside collaborators"

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Summary
Changes
New native/ directory
Python layer updates
- backend.py: Replace CUDABackend with NativeBackend that loads native module
- compiler.py: Use native JITKernel for NVRTC compilation
- basic.py: Use native ops when native backend is available
Architecture
Test plan
Build Requirements
🤖 Generated with Claude Code