feat(v0.2): Rust memory pool, scheduler, and tiled matmul #31
Merged
Optimize matrix multiplication using CUDA shared memory tiling:
- Load tiles of A and B into shared memory
- Reduce global memory accesses by TILE_SIZE factor
- Support non-tile-aligned matrix sizes with boundary checks
- Use __restrict__ and #pragma unroll for compiler optimization

Tile size: 16x16 (matches thread block size)
Expected performance improvement: 5-10x for large matrices

TDD: 21 new tests for various matrix sizes (aligned, non-aligned, rectangular, tall, wide, edge cases)

Closes #26

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
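The tiling scheme in this commit can be illustrated on the CPU. The sketch below is not the CUDA kernel from the PR; it is a plain-Python stand-in that walks the output in 16x16 tiles, copies the matching tiles of A and B into local lists (playing the role of shared memory), and clamps tile edges so non-tile-aligned sizes work. The function name `tiled_matmul` and its signature are hypothetical.

```python
TILE_SIZE = 16  # matches the 16x16 tile size stated in the commit message


def tiled_matmul(a, b, tile=TILE_SIZE):
    """Multiply a (M x K) by b (K x N) one tile at a time.

    CPU-side illustration only: on the GPU each (i0, j0) output tile
    would map to one thread block, and the tile copies to shared memory.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # output tile row
        for j0 in range(0, n, tile):      # output tile column
            for k0 in range(0, k, tile):  # step along the shared K dimension
                # Clamping with min() is the boundary check that lets
                # non-tile-aligned matrix sizes work.
                i1 = min(i0 + tile, m)
                j1 = min(j0 + tile, n)
                k1 = min(k0 + tile, k)
                # "Load" one tile of A and one of B (shared-memory stand-in).
                a_tile = [row[k0:k1] for row in a[i0:i1]]
                b_tile = [row[j0:j1] for row in b[k0:k1]]
                # Each loaded element is reused across the tile, which is
                # where the reduction in global memory traffic comes from.
                for i in range(i1 - i0):
                    for j in range(j1 - j0):
                        acc = 0.0
                        for kk in range(k1 - k0):
                            acc += a_tile[i][kk] * b_tile[kk][j]
                        c[i0 + i][j0 + j] += acc
    return c
```

Calling `tiled_matmul(a, b, tile=4)` on a 17x5 by 5x19 input exercises the partial tiles on every edge, similar in spirit to the non-aligned, tall, and wide cases the tests cover.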
Benchmark results on RTX 3090 Ti:
- Naive: ~2091 GFLOPS at 4096x4096
- Tiled: ~1471 GFLOPS at 1024x1024 (SLOWER)

The naive kernel outperforms tiled on modern GPUs because:
1. Large L2 cache (6MB) provides efficient global memory access
2. __syncthreads() overhead hurts tiled performance
3. Shared memory management doesn't pay off

Tiled kernels kept for educational purposes and f64 support.
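For readers checking the GFLOPS figures, here is a minimal sketch of the usual conversion between kernel time and throughput. It assumes the conventional count of 2·M·N·K floating-point operations per matmul; the PR does not state which formula its benchmark script uses, and the helper names are hypothetical.

```python
def matmul_flops(m, n, k):
    """Floating-point operations in an (m x k) @ (k x n) product.

    Conventionally counted as one multiply and one add per term,
    hence 2*m*n*k.
    """
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Throughput in GFLOPS for one matmul that took `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e9


def seconds_at(m, n, k, rate_gflops):
    """Kernel time implied by a reported GFLOPS figure."""
    return matmul_flops(m, n, k) / (rate_gflops * 1e9)


# Times implied by the two figures quoted above (derived, not measured here).
naive_4096 = seconds_at(4096, 4096, 4096, 2091)
tiled_1024 = seconds_at(1024, 1024, 1024, 1471)
print(f"naive 4096: {naive_4096 * 1e3:.1f} ms, tiled 1024: {tiled_1024 * 1e3:.2f} ms")
```

Since the two quoted figures come from different matrix sizes, converting both to times or re-measuring at a common size makes the comparison easier to read.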
Clarify that performance tuning targets RTX 30XX series (Ampere) and above. Older GPUs are not tuned and may have suboptimal performance.
Breaking Changes:
- Require SM >= 80 (Ampere, Ada, Hopper)
- Remove legacy tiled kernels (Pascal/Turing not supported)

Optimizations:
- L2-optimized kernel with __ldg() for read-only cache
- Use __restrict__ for compiler optimization
- Add --use_fast_math and --expt-relaxed-constexpr flags
- Remove unnecessary __syncthreads()

Performance (RTX 3090 Ti):
- 1024x1024: 1497 GFLOPS (+6% vs previous)
- 2048x2048: 2077 GFLOPS
- 4096x4096: 2228 GFLOPS (target: 2.1-2.3 TFLOPS)

New APIs:
- get_sm_version(): Get SM version as integer
- validate_compute_capability(): Check SM >= 80
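The SM >= 80 gate can be sketched as follows. This is an illustration only: the real `get_sm_version()` and `validate_compute_capability()` signatures are not shown in this PR page, so the helpers below use different, hypothetical names and take the compute capability as explicit arguments. The integer encoding (major * 10 + minor, so 8.6 becomes 86) is also an assumption.

```python
MIN_SM = 80  # Ampere and newer, per the breaking change above


def sm_from_capability(major, minor):
    """Encode compute capability major.minor as an integer (assumed encoding)."""
    return major * 10 + minor


def check_compute_capability(major, minor, min_sm=MIN_SM):
    """Return the SM integer, or raise if the device is below the minimum."""
    sm = sm_from_capability(major, minor)
    if sm < min_sm:
        raise RuntimeError(f"SM {sm} is not supported; SM >= {min_sm} is required")
    return sm
```

Under this encoding an RTX 3090 Ti (compute capability 8.6) passes as SM 86, while a Turing device (7.5) is rejected.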
CRITICAL: Rust components MUST NOT be replaced with Python.

Required Rust components:
1. Rust memory pool (LRU eviction)
2. Rust GPU scheduler state machine
3. Rust-side async memory transfer engine
4. Rust-side kernel dispatch controller

Architecture: Python → Rust → C++/CUDA
Python is ONLY for high-level orchestration.
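To make "scheduler state machine" concrete, here is a small sketch of a task lifecycle with memory tracking. It is written in Python purely for brevity and is not the Rust implementation this PR requires. The state names, the FIFO admission policy, and the `SketchScheduler` class are all hypothetical.

```python
from collections import deque
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions; anything else is rejected.
TRANSITIONS = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


class SketchScheduler:
    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.memory_in_use = 0
        self.tasks = {}       # task_id -> [state, memory_bytes]
        self.queue = deque()  # pending task ids in FIFO order
        self._next_id = 0

    def submit(self, memory_bytes):
        task_id = self._next_id
        self._next_id += 1
        self.tasks[task_id] = [TaskState.PENDING, memory_bytes]
        self.queue.append(task_id)
        return task_id

    def _transition(self, task_id, new_state):
        state = self.tasks[task_id][0]
        if new_state not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state.name} -> {new_state.name}")
        self.tasks[task_id][0] = new_state

    def start_next(self):
        """Start the oldest pending task if its memory fits; return its id or None."""
        if not self.queue:
            return None
        task_id = self.queue[0]
        need = self.tasks[task_id][1]
        if self.memory_in_use + need > self.memory_budget:
            return None  # stays PENDING until memory is released
        self.queue.popleft()
        self._transition(task_id, TaskState.RUNNING)
        self.memory_in_use += need
        return task_id

    def finish(self, task_id, ok=True):
        self._transition(task_id, TaskState.COMPLETED if ok else TaskState.FAILED)
        self.memory_in_use -= self.tasks[task_id][1]

    def state(self, task_id):
        return self.tasks[task_id][0]
```

Keeping the transition table explicit means an out-of-order call, such as finishing a task twice, fails loudly instead of corrupting the memory accounting.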
## Rust Implementation (rust/)
- pygpukit-core: Pure Rust memory pool with LRU eviction and size-class allocator
- pygpukit-core: Pure Rust scheduler with task state machine and memory tracking
- pygpukit-python: PyO3 bindings exposing MemoryPool, Scheduler, TaskMeta, etc.

## Python Thin Wrappers
- MemoryPool delegates to Rust backend (13x faster allocation)
- Scheduler delegates to Rust backend (5x faster task submission)
- PYGPUKIT_USE_RUST environment variable (default: enabled)
- Full backward compatibility with existing API

## CI/CD Integration
- release.yml builds Rust module with maturin
- Single wheel contains both C++ and Rust extensions
- pip install pygpukit installs everything

## Benchmarks
- Rust Memory Pool: 10K allocations in 3.05ms (vs Python 40.97ms)
- Rust Scheduler: 10K task submissions in 9.27ms (vs Python 44.56ms)

All 124 tests pass.
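The combination of a size-class allocator and LRU eviction can be sketched as below. This is a Python illustration of the general technique, not the pygpukit-core code. It assumes power-of-two size classes with a 256-byte minimum and assumes eviction applies to cached free blocks once a byte budget is exceeded; the PR page confirms neither. `SketchPool` and its methods are hypothetical names, and block ids stand in for device pointers.

```python
from collections import OrderedDict
from itertools import count


def size_class(nbytes, min_class=256):
    """Round a request up to the next power-of-two size class."""
    cls = min_class
    while cls < nbytes:
        cls *= 2
    return cls


class SketchPool:
    def __init__(self, max_cached_bytes):
        self.max_cached_bytes = max_cached_bytes
        self.cached_bytes = 0
        self._ids = count()
        self._free = OrderedDict()  # block_id -> size class, least recently freed first
        self.live = {}              # block_id -> size class for blocks handed out
        self.evicted = []           # ids released back to the (pretend) device

    def allocate(self, nbytes):
        cls = size_class(nbytes)
        # Reuse the most recently freed block of the right class, if any.
        for block_id in reversed(self._free):
            if self._free[block_id] == cls:
                del self._free[block_id]
                self.cached_bytes -= cls
                self.live[block_id] = cls
                return block_id
        block_id = next(self._ids)  # stand-in for a real device allocation
        self.live[block_id] = cls
        return block_id

    def free(self, block_id):
        cls = self.live.pop(block_id)
        self._free[block_id] = cls  # appended at the most-recent end
        self.cached_bytes += cls
        # Evict least recently freed blocks until the cache fits its budget.
        while self.cached_bytes > self.max_cached_bytes:
            old_id, old_cls = self._free.popitem(last=False)
            self.cached_bytes -= old_cls
            self.evicted.append(old_id)
```

Rounding to size classes lets a freed block satisfy any later request in the same class, which is typically why pooled allocation is much cheaper than going to the device allocator each time.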
- Fix unused loop variables (B007) in benchmark_rust.py, demo_scheduler_log.py
- Fix unused variables (F841) in demo_scheduler_log.py
- Add noqa comments for intentional E402 violations in examples
- Add noqa comments for dynamic import redefinitions (F811)
- Configure mypy to ignore union-attr, no-redef, no-any-return for Rust backend
- Auto-format with ruff-format
- Fix trailing whitespace and end-of-file issues

All pre-commit hooks pass.
Add --disable-error-code=attr-defined to both CI workflow and pre-commit to handle dynamic Rust module imports consistently.
Summary
PyGPUkit v0.2 major update with Rust backend for high-performance memory management and task scheduling.
New Features
PYGPUKIT_USE_RUST env var

Benchmark Results
Rust vs Python Backend
Matmul Performance (RTX 3090 Ti)
Architecture
Changes
Rust Implementation (rust/)
- pygpukit-core: Memory pool with LRU eviction and size-class allocator
- pygpukit-core: Scheduler with task state machine and memory tracking
- pygpukit-python: PyO3 bindings (MemoryPool, Scheduler, TaskMeta, etc.)

Python Thin Wrappers
- MemoryPool delegates to Rust backend
- Scheduler delegates to Rust backend
- PYGPUKIT_USE_RUST=0 to disable Rust (fallback to pure Python)

CI/CD Integration
- release.yml builds Rust module with maturin
- pip install pygpukit installs everything

Test plan
Usage
Closes #26
🤖 Generated with Claude Code