v0.4.2: Warp-Shuffle Reductions, __nanosleep, libcu++ Atomics
What's New
This release upgrades the CUDA codegen with practical findings from CUDA hardware research, targeting CC 6.0+ GPUs with the existing cudarc 0.18.2 runtime.
Warp-Shuffle Block Reductions
- Two-phase warp-shuffle reduction replaces tree reduction in all generated CUDA reduction code
- Phase 1: Intra-warp reduction with `__shfl_down_sync(0xFFFFFFFF, val, offset)`, using zero `__syncthreads()` calls
- Phase 2: Cross-warp reduction via shared memory, using one `__syncthreads()` call
- Reduces barrier count from O(log N) to 1 per block reduction (e.g., 9 → 1 for 512-thread blocks)
- Applied to: persistent FDTD energy reduction, standalone block/grid reduce helpers, and all inline reduction generators
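
The two-phase pattern described in this section can be sketched roughly as follows. This is an illustrative, hand-written example and not the code the generator emits: the names `warp_reduce_sum`, `block_reduce_sum`, `sum_kernel`, and `gpu_sum` are hypothetical, and the sketch assumes a block size that is a multiple of 32.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: intra-warp sum via shuffles. No __syncthreads() is needed because
// the full-mask *_sync shuffle synchronizes the participating lanes itself.
__device__ __forceinline__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xFFFFFFFFu, val, offset);
    }
    return val;  // lane 0 now holds the warp total
}

// Phase 2: cross-warp sum through shared memory with a single barrier.
// Assumes blockDim.x is a multiple of 32 (at most 1024) and that every
// thread of the block calls this function exactly once per kernel.
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];  // one slot per possible warp
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();  // the only block-wide barrier in the reduction

    if (warp == 0) {
        const int num_warps = blockDim.x / warpSize;
        val = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        val = warp_reduce_sum(val);
    }
    return val;  // meaningful in thread 0 only
}

__global__ void sum_kernel(const float* in, float* out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return for out-of-range threads: all lanes must take part
    // in the full-mask shuffles, so they contribute 0 instead.
    const float v = (i < n) ? in[i] : 0.0f;
    const float total = block_reduce_sum(v);
    if (threadIdx.x == 0) atomicAdd(out, total);
}

// Hypothetical host helper: sums a non-empty vector with 512-thread blocks.
float gpu_sum(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());
    float *d_in = nullptr, *d_out = nullptr, result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    const int threads = 512;
    sum_kernel<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With 512 threads per block there are 16 warps, so the second phase fits inside a single warp and the whole reduction needs only the one barrier between writing and reading `warp_sums`. If the helper were called more than once in the same kernel, an additional barrier before reusing the shared array would be needed.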
__nanosleep() Power Efficiency
- Persistent FDTD idle spin-wait now uses `__nanosleep()` instead of a volatile counter loop
- Software grid barrier spin-loop uses `__nanosleep(100)` to reduce power consumption
- Configurable via `PersistentFdtdConfig::with_idle_sleep(ns)` (default: 1000 ns)
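
A spin-wait that sleeps between polls might look roughly like the sketch below. It is not the generated kernel: `idle_backoff`, `wait_for_flag`, `handshake`, and `run_handshake` are hypothetical names. `__nanosleep()` is only available on compute capability 7.0 and newer, so the architecture guard and the busy-wait fallback are assumptions of this sketch rather than documented behavior of the release.

```cuda
#include <cuda_runtime.h>

// Sleep for roughly `ns` nanoseconds where the hardware supports it.
// The fallback branch for pre-7.0 targets is an assumption of this sketch.
__device__ __forceinline__ void idle_backoff(unsigned ns) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
    __nanosleep(ns);
#else
    (void)ns;  // plain busy-wait on older architectures
#endif
}

// Poll a flag, backing off between polls instead of spinning flat out.
__device__ void wait_for_flag(volatile int* flag, int expected, unsigned sleep_ns) {
    while (*flag != expected) {
        idle_backoff(sleep_ns);
    }
}

// Block 1 raises the flag; block 0 sleeps until it observes it.
// Assumes both blocks are resident at the same time, which is the usual
// case for a two-block launch.
__global__ void handshake(volatile int* flag, int* result, unsigned sleep_ns) {
    if (threadIdx.x != 0) return;
    if (blockIdx.x == 0) {
        wait_for_flag(flag, 1, sleep_ns);
        *result = 42;
    } else {
        *flag = 1;
    }
}

// Hypothetical host helper: runs the handshake and returns block 0's result.
int run_handshake(unsigned sleep_ns) {
    int *d_flag = nullptr, *d_result = nullptr, result = 0;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));
    handshake<<<2, 32>>>(d_flag, d_result, sleep_ns);
    cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    cudaFree(d_result);
    return result;
}
```

Calling `run_handshake(1000)` mirrors the 1000 ns idle default mentioned above, while a shorter value such as 100 trades some power savings for lower wake-up latency.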
libcu++ Ordered Atomics (opt-in)
- Opt-in `cuda::atomic_ref` support for H2K/K2H queue operations and software barriers
- Uses `memory_order_acquire`/`memory_order_release` instead of `__threadfence_system()` pairs
- Software barrier uses `cuda::thread_scope_device` (narrower scope) with `memory_order_acq_rel`
- Compile-time CUDA 11.0+ version guard
- Enable via `PersistentFdtdConfig::with_libcupp_atomics(true)`
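
As a rough illustration of the acquire/release pairing, the sketch below hands a value from one block to another through a `cuda::atomic_ref` flag. The `Slot` layout and the `publish`, `consume`, `handoff`, and `run_handoff` names are hypothetical and are not taken from the generated code. It uses device scope because both sides run on the GPU; a host-facing queue would presumably need `cuda::thread_scope_system` on host-visible memory, which is an assumption here. It also assumes a toolkit whose libcu++ ships `cuda::atomic_ref`.

```cuda
#include <cuda/atomic>
#include <cuda_runtime.h>

// Hypothetical one-entry mailbox: a payload plus a ready flag.
struct Slot {
    int payload;
    unsigned ready;
};

using DeviceFlag = cuda::atomic_ref<unsigned, cuda::thread_scope_device>;

// Producer: write the payload, then publish it with a release store.
// The release ordering makes the payload write visible to any thread
// that later observes ready == 1 with an acquire load.
__device__ void publish(Slot* slot, int value) {
    slot->payload = value;
    DeviceFlag ready(slot->ready);
    ready.store(1u, cuda::std::memory_order_release);
}

// Consumer: spin on an acquire load, then read the payload.
__device__ int consume(Slot* slot) {
    DeviceFlag ready(slot->ready);
    while (ready.load(cuda::std::memory_order_acquire) == 0u) {
        // spin; a real kernel might back off here
    }
    return slot->payload;
}

// Block 1 produces, block 0 consumes. Assumes both blocks are resident
// at the same time, which is the usual case for a two-block launch.
__global__ void handoff(Slot* slot, int value, int* out) {
    if (threadIdx.x != 0) return;
    if (blockIdx.x == 0) {
        *out = consume(slot);
    } else {
        publish(slot, value);
    }
}

// Hypothetical host helper: returns the value the consumer block received.
int run_handoff(int value) {
    Slot* d_slot = nullptr;
    int* d_out = nullptr;
    int result = -1;
    cudaMalloc(&d_slot, sizeof(Slot));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_slot, 0, sizeof(Slot));
    cudaMemset(d_out, 0, sizeof(int));
    handoff<<<2, 32>>>(d_slot, value, d_out);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    cudaFree(d_out);
    return result;
}
```

Compared with a plain store followed by a fence, the ordering is attached to the flag access itself, so the required scope and direction are visible at the call site.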
Files Changed
- `crates/ringkernel-cuda-codegen/src/persistent_fdtd.rs`: config fields, nanosleep, warp-shuffle reduction, libcu++ atomics
- `crates/ringkernel-cuda-codegen/src/reduction_intrinsics.rs`: warp-shuffle upgrade for all reduction helpers
Test Results
- 215 codegen unit tests + 12 integration tests — all passing
- 6 CUDA GPU execution tests — verified on RTX 2000 Ada (CC 8.9)
- Full workspace — zero failures
Full Changelog: v0.4.1...v0.4.2