# PyGPUkit – Core Operator Coverage

## 1. Elementwise Operations
| Operation | FP32 | TF32 | FP16 | BF16 | Notes |
|-----------|------|------|------|------|-------|
| add | ✅ | — | 🔜 | 🔜 | Implemented |
| sub | 🔜 | — | 🔜 | 🔜 | Planned |
| mul | ✅ | — | 🔜 | 🔜 | Implemented |
| div | 🔜 | — | 🔜 | 🔜 | Planned |
| exp | 🔜 | — | 🔜 | 🔜 | SFU-bound |
| log | 🔜 | — | 🔜 | 🔜 | SFU-bound |
| relu | 🔜 | — | 🔜 | 🔜 | Can be fused later |
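Since several of these ops are still planned, a CPU reference for their semantics may help when validating kernels. A minimal sketch using NumPy — not the PyGPUkit API, which is not shown here; only the math each kernel is expected to reproduce:

```python
# CPU reference semantics for the elementwise ops in the table above.
# GPU kernel outputs can be checked against these NumPy results.
import numpy as np

x = np.array([-1.5, 0.0, 2.0], dtype=np.float32)
y = np.array([2.0, 4.0, 8.0], dtype=np.float32)

add = x + y             # add  (implemented)
sub = x - y             # sub  (planned)
mul = x * y             # mul  (implemented)
div = x / y             # div  (planned)
ex = np.exp(x)          # exp  (SFU-bound: hits the GPU's special function unit)
lg = np.log(y)          # log  (SFU-bound)
r = np.maximum(x, 0)    # relu (max(x, 0); a natural fusion target)
```

A fused `relu(add(x, y))` would compute `np.maximum(x + y, 0)` in a single kernel, saving one round trip through global memory.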
## 2. GEMM Operations
| Operation | FP32 | TF32 | FP16 | BF16 | Notes |
|-----------|------|------|------|------|-------|
| matmul | ✅ | ✅ | 🔜 | 🔜 | TensorCore on Ampere+ |
**Current Performance (v0.2.3):**

- FP32: 18 TFLOPS (RTX 3090 Ti)
- TF32: 27.38 TFLOPS (RTX 3090 Ti)
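For context on how figures like these are derived: GEMM throughput is conventionally computed as 2·M·N·K floating-point operations divided by elapsed time. A sketch (the 7.6 ms timing below is an illustrative number, not a PyGPUkit measurement):

```python
# GEMM FLOP accounting: an (M x K) @ (K x N) matmul performs 2*M*N*K
# floating-point operations (one multiply + one add per product term).
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an m*k @ k*n matmul that took `seconds`."""
    return (2.0 * m * n * k) / seconds / 1e12

# Illustrative: a 4096^3 matmul finishing in ~7.6 ms lands near 18 TFLOPS.
tflops = gemm_tflops(4096, 4096, 4096, 7.6e-3)
```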
## 3. Reduction Operations
| Operation | FP32 | TF32 | FP16 | BF16 | Notes |
|-----------|------|------|------|------|-------|
| sum | 🔜 | — | 🔜 | 🔜 | Tree-based reduction |
| mean | 🔜 | — | 🔜 | 🔜 | sum + scale |
| max | 🔜 | — | 🔜 | 🔜 | Warp + block reduction |
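The "tree-based" strategy in the table can be sketched on the CPU: halve the active range each step and add pairs, the same stride pattern a warp/block reduction follows on the GPU; `mean` then falls out as a sum followed by a scale. A sketch of the idea, not the PyGPUkit kernel itself:

```python
# Tree-based reduction sketch: pair elements and halve the active range
# each step, mirroring the stride pattern of a warp/block reduction.
def tree_sum(values):
    vals = [float(v) for v in values]
    while len(vals) > 1:
        if len(vals) % 2:       # odd length: pad so every element has a partner
            vals.append(0.0)
        half = len(vals) // 2
        # "Thread" i adds its partner at i + half, then the range halves.
        vals = [vals[i] + vals[i + half] for i in range(half)]
    return vals[0]

def tree_mean(values):
    # mean = sum + scale, as noted in the table
    return tree_sum(values) / len(values)
```

`max` follows the same tree shape with `max(a, b)` in place of `a + b` (and padding with `-inf` instead of `0.0`).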
## 4. Memory Operations
| Operation | Status | Notes |
|-----------|--------|-------|
| copy | ✅ | Device↔Device, Host↔Device |
| reshape / view | ✅ | Zero-copy metadata only |
| contiguous | 🔜 | Layout-aware kernel required |
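The reshape/view vs. contiguous split can be illustrated with NumPy, which draws the same line: a view only rewrites shape/stride metadata over the existing buffer, while materializing a non-contiguous layout (e.g. a transpose) forces a real copy — which is what the planned layout-aware kernel would do on the GPU:

```python
# Views rewrite shape/stride metadata only; `contiguous` must copy.
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)

v = a.reshape(3, 2)        # zero-copy view: same buffer, new metadata
t = a.T                    # transposed view: strides change, data unmoved
                           # -> t is no longer C-contiguous

c = np.ascontiguousarray(t)  # the "contiguous" op: a layout-aware copy
```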
## Recommended Milestones
| Version | Focus |
|---------|-------|
| v0.2.4 | Driver-only runtime (no CUDA Toolkit) ✅ |
| v0.2.5 | JIT stabilization, cache persistence |
| v0.2.6 | Elementwise ops (sub, div, exp, log, relu) |
| v0.2.7 | Reductions + contiguous |
| v0.3.0 | FP16 / BF16 mixed precision |
## Design Principle
GEMM proves performance.
Elementwise + memory prove usability.
Scheduler proves value.