A type system that prevents shuffle-from-inactive-lane bugs in GPU warp programming by tracking active lane masks at compile time.
Status: Research prototype with real GPU execution. 272 unit + 50 example + 23 doc tests (345 total). Zero runtime overhead verified at Rust MIR, LLVM IR, and NVIDIA PTX levels. Cargo-integrated GPU compilation pipeline.
GPU warp primitives like shuffle enable fast intra-warp communication, but reading from an inactive lane is undefined behavior:

```cuda
if (participate) {
    int partner = __shfl_xor_sync(0xFFFFFFFF, data, 1); // BUG
    // Reads from inactive lanes — undefined result
}
```

This compiles without warnings, may appear to work, and fails silently. NVIDIA's own cuda-samples contains this bug (#398). Their core library CUB contains a variant (CCCL#854). The PIConGPU plasma physics simulation ran for months with undefined behavior in a divergent branch, undetected because pre-Volta hardware masked the violation (#2514). NVIDIA deprecated the entire `__shfl` API family in CUDA 9.0 to address this bug class.
We survey 21 documented bugs across 16 real-world projects. State-of-the-art persistent thread programs (e.g., the Hazy megakernel) avoid the problem by maintaining warp-uniform execution.
Track which lanes are active in the type system. Shuffle only exists on fully active warps:

```rust
use warp_types::*;

let warp: Warp<All> = Warp::kernel_entry();

// After diverge, shuffle is gone from the type:
let (evens, odds) = warp.diverge_even_odd();
// evens.shuffle_xor(data, 1); // COMPILE ERROR — method not found

let merged: Warp<All> = merge(evens, odds);
let partner = merged.shuffle_xor(data, 1); // OK
```

The type system is strictly more permissive than current best practice (which prohibits divergence entirely) while being strictly safer than the CUDA status quo (which allows divergence unchecked).
- **Warps carry active-set types** — `Warp<S>` is parameterized by its active set; `Warp<All>` means all lanes are active.
- **Diverge produces complements** — `diverge` splits a warp into two sub-warps with disjoint active sets.
- **Merge requires complements** — reconvergence is verified at compile time via `ComplementOf<T>` trait bounds.
- **Method availability = safety** — `shuffle_xor` only exists on `Warp<All>`. Unsafe calls are not checked — they are absent.
- **Zero overhead** — `Warp<S>` contains only `PhantomData<S>`. Types are erased completely.
- **Data-dependent divergence** — `diverge_dynamic(mask)` handles runtime predicates. The mask is dynamic, but the complement is structural — both branches must merge before shuffle.
- **Cross-function inference** — generic functions take `Warp<S>` with `S: ActiveSet`; Rust infers `S` at call sites.
cuda-samples #398 on the same GPU (RTX 4000 Ada):
```text
Untyped (buggy): sum = 1    ← silent wrong answer
Typed (fixed):   sum = 32   ← correct, AND the bug is a compile error
```

The buggy pattern literally cannot be expressed: `Warp<Lane0>` has no `shfl_down` method.
```sh
bash reproduce/demo.sh                        # The entire pitch in one terminal
cargo test                                    # 272 unit + 23 doc tests
cargo test --examples                         # 50 tests across 8 real-bug examples
cargo test --example nvidia_cuda_samples_398  # Real NVIDIA bug, caught by types
```

```rust
// my-kernels/src/lib.rs
use warp_types::*;

#[warp_kernel]
pub fn butterfly_reduce(data: *mut i32) {
    let warp: Warp<All> = Warp::kernel_entry();
    let tid = warp_types::gpu::thread_id_x();
    let mut val = unsafe { *data.add(tid as usize) };
    // Type system guarantees: warp is Warp<All>, so shuffle is available
    let d = data::PerLane::new(val);
    val += warp.shuffle_xor(d, 16).get();
    // ... butterfly stages ...
    unsafe { *data.add(tid as usize) = val; }
}
```

```rust
// build.rs
fn main() {
    warp_types_builder::WarpBuilder::new("my-kernels")
        .build()
        .expect("Failed to compile GPU kernels");
}
```

```rust
// src/main.rs
mod kernels {
    include!(concat!(env!("OUT_DIR"), "/kernels.rs"));
}

fn main() {
    let ctx = cudarc::driver::CudaContext::new(0).unwrap();
    let k = kernels::Kernels::load(&ctx).unwrap();
    // k.butterfly_reduce — ready to launch
}
```

| Claim | Evidence | Command |
|---|---|---|
| Shuffle safety (diverged warp can't shuffle) | 8 compile-fail doctests | `cargo test --doc` |
| Real bug caught at compile time | 8 worked bug examples (21 bugs surveyed) | `cargo test --examples` |
| Hardware reproduction | Deterministic wrong result on RTX 4000 Ada | `bash reproduce/demo.sh` |
| Real GPU execution | 4 kernels PASS on RTX 4000 Ada via cudarc | `cd examples/gpu-project && cargo run` |
| Cargo integration | `#[warp_kernel]` + `WarpBuilder` + `Kernels` struct | `cd examples/gpu-project && cargo run` |
| Zero overhead | Verified at MIR, LLVM IR, and PTX levels | `cargo rustc --release --lib -- --emit=llvm-ir` |
| Soundness (progress + preservation) | Full Lean 4 mechanization (28 theorems), zero `sorry`, zero axioms | `cd lean && lake build` |
| CUB-equivalent primitives | Typed reduce, scan, broadcast (8 tests) | `cargo test cub` |
| Fence-divergence safety | Type-state write tracking (3 tests) | `cargo test fence` |
| Platform portability (32/64 lanes) | u64 masks, AMD stubs, `GpuTarget` enum | `cargo test warp_size` |
| Gradual typing (`DynWarp` ↔ `Warp<S>`) | Runtime/compile-time bridge (18 tests) | `cargo test gradual` |
| All claims | Full test suite (345 tests) | `cargo test && cargo test --examples` |
```text
warp-types/
├── src/
│   ├── lib.rs           # Core exports + GpuValue trait + warp_kernel re-export
│   ├── active_set.rs    # Marker types: All, Even, Odd, LowHalf, ... (u64 masks)
│   ├── warp.rs          # Warp<S> — the core parameterized type
│   ├── data.rs          # PerLane<T>, Uniform<T>, SingleLane<T, N>
│   ├── diverge.rs       # Split warps by predicate
│   ├── merge.rs         # Rejoin complementary sub-warps
│   ├── shuffle.rs       # Shuffle/ballot/reduce (Warp<All> only) + permutation algebra
│   ├── cub.rs           # CUB-equivalent typed warp primitives
│   ├── sort.rs          # Bitonic sort (typed warp primitives)
│   ├── tile.rs          # Tile-level warp partitioning
│   ├── dynamic.rs       # Data-dependent divergence (DynDiverge)
│   ├── gpu.rs           # PTX/AMDGPU intrinsics + GpuShuffle trait
│   ├── fence.rs         # Fence-divergence type-state machine
│   ├── block.rs         # Block-level shared memory + reductions
│   ├── proof.rs         # Executable soundness sketch (9 lemmas)
│   ├── platform.rs      # CpuSimd<N> / GpuWarp32 / GpuWarp64 dual-mode
│   ├── warp_size.rs     # Const-generic warp size portability
│   ├── gradual.rs       # DynWarp ↔ Warp<S> gradual typing bridge
│   └── research/        # 24 research exploration modules
├── warp-types-macros/   # warp_sets! proc macro (active set hierarchy generation)
├── warp-types-kernel/   # #[warp_kernel] proc macro (GPU kernel entry points)
├── warp-types-builder/  # WarpBuilder (build.rs cross-compilation to PTX)
├── examples/
│   ├── nvidia_cuda_samples_398.rs   # Real NVIDIA bug, caught by types
│   ├── cub_cccl_854.rs              # CUB/CCCL sub-warp shuffle bug
│   ├── picongpu_2514.rs             # PIConGPU months of silent UB
│   ├── llvm_155682.rs               # LLVM lane-0 conditional shuffle
│   ├── opencv_12320.rs              # OpenCV warpScanInclusive deadlock
│   ├── pytorch_98157.rs             # PyTorch __activemask() misuse
│   ├── tvm_17307.rs                 # TVM LowerThreadAllreduce H100 crash
│   ├── demo_bug_that_types_catch.rs # Synthetic demonstration
│   └── gpu-project/                 # End-to-end cargo→GPU example
├── reproduce/
│   ├── demo.sh          # Full demonstration script
│   ├── host/            # cudarc host runner for real GPU execution
│   └── *.rs, *.cu       # PTX comparison + hardware reproduction
├── lean/                # Lean 4 formalization (28 theorems, zero sorry)
├── paper/               # Preprint (markdown)
├── tutorial/            # Step-by-step tutorial
├── blog/                # Blog post draft
└── INTEGRATION.md       # Sol integration guide
```
- **AMD untested.** u64 masks and `amdgpu` target stubs are in place but require AMD hardware for validation.
- **Nightly required for GPU kernels.** `#[warp_kernel]` requires the `abi_ptx` and `asm_experimental_arch` features. The type system itself works on stable Rust.
MIT