Summary
Implement SM-specific CUTLASS kernel variants with runtime dispatch for optimal performance across GPU architectures.
Motivation
- SM 80 (A100), SM 86 (RTX 30xx), SM 89 (RTX 40xx), SM 90 (H100) have different optimal configurations
- A single kernel compiled for the lowest supported SM leaves performance on the table on newer architectures
- Runtime dispatch allows shipping one wheel that works optimally on all GPUs
Proposed Architecture
Wheel Structure
pygpukit/
├─ core/ # Rust core
├─ native/
│ ├─ ops/
│ │ ├─ matmul_cutlass_sm80.cu # Ampere (A100)
│ │ ├─ matmul_cutlass_sm86.cu # Ampere (RTX 30xx)
│ │ ├─ matmul_cutlass_sm90.cu # Hopper (H100)
│ │ └─ matmul_fallback.cu # SIMT fallback
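Each of the `.cu` files above is compiled once per target architecture. As a sketch, a build helper could assemble the per-variant `nvcc` invocation like this (the `nvcc_args` helper and output naming are hypothetical; only the `-arch=sm_XX` flag comes from this proposal):

```rust
// Sketch: assemble the nvcc arguments for one SM-specific variant.
// Helper name and output path convention are illustrative, not part of the design.
fn nvcc_args(sm: u32, src: &str, out: &str) -> Vec<String> {
    vec![
        format!("-arch=sm_{sm}"), // per-variant arch flag
        "-c".to_string(),
        src.to_string(),
        "-o".to_string(),
        out.to_string(),
    ]
}

fn main() {
    let args = nvcc_args(
        90,
        "native/ops/matmul_cutlass_sm90.cu",
        "matmul_cutlass_sm90.o",
    );
    println!("nvcc {}", args.join(" "));
}
```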
Runtime Selector (Rust)
match device.sm {
90.. => kernel_sm90, // Hopper+
86.. => kernel_sm86, // RTX 30xx, RTX 40xx
80.. => kernel_sm80, // A100
_ => fallback, // Pre-Ampere (not officially supported)
}
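A self-contained version of this selector might look as follows; `KernelVariant` and `select_kernel` are illustrative names, and the SM value is assumed to be `major * 10 + minor` from the device's compute capability:

```rust
// Illustrative kernel-variant enum; names are hypothetical.
#[derive(Debug, PartialEq)]
enum KernelVariant {
    Sm90,     // Hopper+
    Sm86,     // RTX 30xx / RTX 40xx
    Sm80,     // A100
    Fallback, // pre-Ampere SIMT path
}

// Map a compute capability (major * 10 + minor) to the best available variant.
// Arms are ordered newest-first, so e.g. SM 89 (RTX 40xx) falls to Sm86.
fn select_kernel(sm: u32) -> KernelVariant {
    match sm {
        90.. => KernelVariant::Sm90,
        86.. => KernelVariant::Sm86,
        80.. => KernelVariant::Sm80,
        _ => KernelVariant::Fallback,
    }
}

fn main() {
    println!("{:?}", select_kernel(89)); // Sm86
}
```

Ordering the match arms newest-first means an SM with no dedicated variant (such as SM 89) automatically gets the closest older one.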
Benefits
- Optimal performance - Each SM gets tuned tile sizes, pipeline depth
- Single wheel - No need for separate builds per architecture
- Future-proof - Easy to add SM 100+ (Blackwell) when available
Implementation Notes
- Each SM variant is compiled with `-arch=sm_XX`
- Fallback uses SIMT (no Tensor Cores) for unsupported architectures
- Runtime detection via `cudaGetDeviceProperties`
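The detection step reduces to turning the major/minor compute capability into the SM number the selector matches on. A minimal sketch of that mapping (in real code the major/minor pair would come from `cudaGetDeviceProperties` via FFI, elided here; the function name is hypothetical):

```rust
// Derive the SM number used for dispatch from the compute capability
// (major, minor) pair. The actual values are queried at runtime with
// cudaGetDeviceProperties through an FFI binding, not shown here.
fn sm_version(major: i32, minor: i32) -> u32 {
    (major * 10 + minor) as u32
}

fn main() {
    // H100 reports compute capability 9.0 -> SM 90
    println!("{}", sm_version(9, 0));
}
```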