Summary
Implement SM-specific CUTLASS kernel variants with runtime dispatch for optimal performance across GPU architectures.
Motivation
- SM 80 (A100), SM 86 (RTX 30xx), SM 89 (RTX 40xx), SM 90 (H100) have different optimal configurations
- A single kernel compiled for the lowest supported SM leaves performance on the table on newer architectures
- Runtime dispatch allows shipping one wheel that works optimally on all GPUs
Proposed Architecture
Wheel Structure
pygpukit/
├─ core/ # Rust core
├─ native/
│ ├─ ops/
│ │ ├─ matmul_cutlass_sm80.cu # Ampere (A100)
│ │ ├─ matmul_cutlass_sm86.cu # Ampere (RTX 30xx)
│ │ ├─ matmul_cutlass_sm90.cu # Hopper (H100)
│ │ └─ matmul_fallback.cu # SIMT fallback
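Each of the `.cu` files above is compiled once per target architecture. As a sketch, a build helper could assemble the per-variant `nvcc` invocation like this (the `nvcc_args` helper and output naming are hypothetical; only the `-arch=sm_XX` flag comes from this proposal):

```rust
// Sketch: assemble the nvcc arguments for one SM-specific variant.
// Helper name and output path convention are illustrative, not part of the design.
fn nvcc_args(sm: u32, src: &str, out: &str) -> Vec<String> {
    vec![
        format!("-arch=sm_{sm}"), // per-variant arch flag
        "-c".to_string(),
        src.to_string(),
        "-o".to_string(),
        out.to_string(),
    ]
}

fn main() {
    let args = nvcc_args(
        90,
        "native/ops/matmul_cutlass_sm90.cu",
        "matmul_cutlass_sm90.o",
    );
    println!("nvcc {}", args.join(" "));
}
```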
Runtime Selector (Rust)
match device.sm {
90.. => kernel_sm90, // Hopper+
86.. => kernel_sm86, // RTX 30xx, RTX 40xx
80.. => kernel_sm80, // A100
_ => fallback, // Pre-Ampere (not officially supported)
}
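A self-contained version of this selector might look as follows; `KernelVariant` and `select_kernel` are illustrative names, and the SM value is assumed to be `major * 10 + minor` from the device's compute capability:

```rust
// Illustrative kernel-variant enum; names are hypothetical.
#[derive(Debug, PartialEq)]
enum KernelVariant {
    Sm90,     // Hopper+
    Sm86,     // RTX 30xx / RTX 40xx
    Sm80,     // A100
    Fallback, // pre-Ampere SIMT path
}

// Map a compute capability (major * 10 + minor) to the best available variant.
// Arms are ordered newest-first, so e.g. SM 89 (RTX 40xx) falls to Sm86.
fn select_kernel(sm: u32) -> KernelVariant {
    match sm {
        90.. => KernelVariant::Sm90,
        86.. => KernelVariant::Sm86,
        80.. => KernelVariant::Sm80,
        _ => KernelVariant::Fallback,
    }
}

fn main() {
    println!("{:?}", select_kernel(89)); // Sm86
}
```

Ordering the match arms newest-first means an SM with no dedicated variant (such as SM 89) automatically gets the closest older one.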
Benefits
- Optimal performance - Each SM gets tuned tile sizes, pipeline depth
- Single wheel - No need for separate builds per architecture
- Future-proof - Easy to add SM 100+ (Blackwell) when available
Implementation Notes
- Each SM variant is compiled with `-arch=sm_XX`
- Fallback uses SIMT (no Tensor Cores) for unsupported architectures
- Runtime detection via `cudaGetDeviceProperties`
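The detection step reduces to turning the major/minor compute capability into the SM number the selector matches on. A minimal sketch of that mapping (in real code the major/minor pair would come from `cudaGetDeviceProperties` via FFI, elided here; the function name is hypothetical):

```rust
// Derive the SM number used for dispatch from the compute capability
// (major, minor) pair. The actual values are queried at runtime with
// cudaGetDeviceProperties through an FFI binding, not shown here.
fn sm_version(major: i32, minor: i32) -> u32 {
    (major * 10 + minor) as u32
}

fn main() {
    // H100 reports compute capability 9.0 -> SM 90
    println!("{}", sm_version(9, 0));
}
```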