v0.4.2: Warp-Shuffle Reductions, __nanosleep, libcu++ Atomics

@mivertowski mivertowski released this 06 Feb 22:13
· 119 commits to main since this release

What's New

This release upgrades the CUDA codegen with practical findings from CUDA hardware research, targeting CC 6.0+ GPUs with the existing cudarc 0.18.2 runtime.

Warp-Shuffle Block Reductions

  • Two-phase warp-shuffle reduction replaces tree reduction in all generated CUDA reduction code
  • Phase 1: Intra-warp __shfl_down_sync(0xFFFFFFFF, val, offset) — zero __syncthreads() calls
  • Phase 2: Cross-warp reduction via shared memory — one __syncthreads() call
  • Reduces the barrier count per block reduction from O(log N), where N is the number of threads in the block, to 1 (e.g., log2(512) = 9 → 1 for 512-thread blocks)
  • Applied to: persistent FDTD energy reduction, standalone block/grid reduce helpers, and all inline reduction generators
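
The two phases above can be sketched as follows. This is a minimal illustration, not the code the generator emits: the names warp_reduce_sum, block_reduce_sum, and sum_kernel are hypothetical, and the sketch assumes a block size that is a multiple of 32 and at most 1024 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Phase 1: sum within one warp using register shuffles. No barrier is needed
// because all 32 lanes take part in each __shfl_down_sync call.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;  // lane 0 now holds the warp's sum
}

// Phase 2: combine the per-warp sums through shared memory.
// Assumes blockDim.x is a multiple of 32 and at most 1024 (hypothetical helper).
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[32];  // one slot per possible warp
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;
    const int num_warps = blockDim.x / warpSize;

    val = warp_reduce_sum(val);
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();  // the single barrier of the whole reduction

    if (warp == 0) {
        // Lanes beyond the number of warps contribute the identity element.
        val = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        val = warp_reduce_sum(val);
    }
    return val;  // meaningful in thread 0 only
}

__global__ void sum_kernel(const float* in, float* out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float v = (i < n) ? in[i] : 0.0f;
    const float total = block_reduce_sum(v);
    if (threadIdx.x == 0) atomicAdd(out, total);
}

int main() {
    const int n = 512;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    sum_kernel<<<1, 512>>>(in, out, n);  // one 512-thread block = 16 warps
    cudaDeviceSynchronize();

    printf("sum = %.1f\n", *out);  // 512 ones, so the sum should be 512.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A classic shared-memory tree would need a barrier after each of the 9 halving steps for this block size; here the five shuffle steps inside each warp need none. If the helper were called twice in a row, a second barrier before reusing warp_sums would be required.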

__nanosleep() Power Efficiency

  • The persistent FDTD idle spin-wait now uses __nanosleep() instead of a volatile counter loop
  • The software grid barrier spin loop uses __nanosleep(100) to reduce power consumption
  • Configurable via PersistentFdtdConfig::with_idle_sleep(ns) (default: 1000ns)
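
A spin-wait with a sleep back-off might look like the sketch below. The helper idle_backoff and the kernel handshake are hypothetical names used for illustration only. Note that __nanosleep() is only available on compute capability 7.0 and later, so the sketch falls back to a plain busy-wait on older targets; that fallback is an assumption, not something stated in this release.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Back off between polls. __nanosleep() requires compute capability 7.0+,
// so older architectures simply keep polling (assumed fallback).
__device__ __forceinline__ void idle_backoff(unsigned int ns) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
    __nanosleep(ns);
#else
    (void)ns;
#endif
}

// Block 1 raises a flag; block 0 polls it, sleeping between polls.
// Both single-thread blocks are expected to be resident at the same time.
__global__ void handshake(volatile int* flag, int* done) {
    if (blockIdx.x == 1) {
        *flag = 1;
    } else {
        while (*flag == 0) {
            idle_backoff(1000);  // 1000 ns, matching the documented default
        }
        *done = 1;
    }
}

int main() {
    int *flag = nullptr, *done = nullptr;
    cudaMallocManaged(&flag, sizeof(int));
    cudaMallocManaged(&done, sizeof(int));
    *flag = 0;
    *done = 0;

    handshake<<<2, 1>>>(flag, done);
    cudaDeviceSynchronize();

    printf("done = %d\n", *done);
    cudaFree(flag);
    cudaFree(done);
    return 0;
}
```

The sleep duration is a suggestion to the hardware rather than an exact delay, so a longer value trades wake-up latency for fewer polls of the flag.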

libcu++ Ordered Atomics (opt-in)

  • Opt-in cuda::atomic_ref support for H2K/K2H queue operations and software barriers
  • Uses memory_order_acquire/memory_order_release instead of __threadfence_system() pairs
  • Software barrier uses cuda::thread_scope_device (narrower scope) with memory_order_acq_rel
  • Compile-time CUDA 11.0+ version guard
  • Enable via PersistentFdtdConfig::with_libcupp_atomics(true)
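
The acquire/release pattern can be illustrated with a small publish/consume handshake. This is a sketch under assumptions: the kernel name publish_consume is hypothetical, it uses device scope between two blocks rather than the host-to-kernel queues, and it requires a toolkit whose libcu++ provides cuda::atomic_ref.

```cuda
#include <cstdio>
#include <cuda/atomic>
#include <cuda_runtime.h>

// Block 1 publishes a payload, block 0 consumes it. The release store and
// acquire load order the payload accesses without explicit __threadfence calls.
__global__ void publish_consume(int* data, int* ready, int* out) {
    cuda::atomic_ref<int, cuda::thread_scope_device> flag(*ready);
    if (blockIdx.x == 1) {
        *data = 42;                                      // write the payload
        flag.store(1, cuda::std::memory_order_release);  // then publish it
    } else {
        // Spin until the flag is observed; acquire makes the payload visible.
        while (flag.load(cuda::std::memory_order_acquire) == 0) {
        }
        *out = *data;
    }
}

int main() {
    int *data = nullptr, *ready = nullptr, *out = nullptr;
    cudaMallocManaged(&data, sizeof(int));
    cudaMallocManaged(&ready, sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    *data = 0;
    *ready = 0;
    *out = 0;

    // Two single-thread blocks, expected to be resident at the same time.
    publish_consume<<<2, 1>>>(data, ready, out);
    cudaDeviceSynchronize();

    printf("out = %d\n", *out);
    cudaFree(data);
    cudaFree(ready);
    cudaFree(out);
    return 0;
}
```

Communication that crosses the host/device boundary would need cuda::thread_scope_system instead; device scope is sufficient only when every participant runs on the same GPU, as in the software barrier case.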

Files Changed

  • crates/ringkernel-cuda-codegen/src/persistent_fdtd.rs — config fields, nanosleep, warp-shuffle reduction, libcu++ atomics
  • crates/ringkernel-cuda-codegen/src/reduction_intrinsics.rs — warp-shuffle upgrade for all reduction helpers

Test Results

  • 215 codegen unit tests + 12 integration tests — all passing
  • 6 CUDA GPU execution tests — verified on RTX 2000 Ada (CC 8.9)
  • Full workspace — zero failures

Full Changelog: v0.4.1...v0.4.2