A type system that prevents shuffle-from-inactive-lane bugs in GPU warp programming by tracking active lane masks at compile time.
Status: Research prototype with real GPU execution. 272 unit + 50 example + 23 doc tests (345 total). Zero runtime overhead verified at Rust MIR, LLVM IR, and NVIDIA PTX levels. Cargo-integrated GPU compilation pipeline.
GPU warp primitives like shuffle enable fast intra-warp communication, but reading from an inactive lane is undefined behavior:

```cuda
if (participate) {
    int partner = __shfl_xor_sync(0xFFFFFFFF, data, 1); // BUG
    // Reads from inactive lanes — undefined result
}
```

This compiles without warnings, may appear to work, and fails silently. NVIDIA's own cuda-samples contains this bug (#398). Their core library CUB contains a variant (CCCL#854). The PIConGPU plasma physics simulation ran for months with undefined behavior in a divergent branch, undetected because pre-Volta hardware masked the violation (#2514). NVIDIA deprecated the entire `__shfl` API family in CUDA 9.0 to address this bug class.
We survey 21 documented bugs across 16 real-world projects. State-of-the-art persistent thread programs (e.g., the Hazy megakernel) avoid the problem by maintaining warp-uniform execution.
Track which lanes are active in the type system. Shuffle only exists on fully active warps:

```rust
use warp_types::*;

let warp: Warp<All> = Warp::kernel_entry();

// After diverge, shuffle is gone from the type:
let (evens, odds) = warp.diverge_even_odd();
// evens.shuffle_xor(data, 1); // COMPILE ERROR — method not found

let merged: Warp<All> = merge(evens, odds);
let partner = merged.shuffle_xor(data, 1); // OK
```

The type system is strictly more permissive than current best practice (which prohibits divergence entirely) while being strictly safer than the CUDA status quo (which allows divergence unchecked).
- **Warps carry active-set types** — `Warp<S>` is parameterized by its active set; `Warp<All>` means all lanes are active.
- **Diverge produces complements** — `diverge` splits a warp into two sub-warps with disjoint active sets.
- **Merge requires complements** — reconvergence is verified at compile time via `ComplementOf<T>` trait bounds.
- **Method availability = safety** — `shuffle_xor` only exists on `Warp<All>`. Unsafe calls are not checked — they are absent.
- **Zero overhead** — `Warp<S>` contains only `PhantomData<S>`. Types are erased completely.
- **Data-dependent divergence** — `diverge_dynamic(mask)` handles runtime predicates. The mask is dynamic, but the complement is structural — both branches must merge before shuffle.
- **Cross-function inference** — generic functions take `Warp<S>` with `S: ActiveSet`; Rust infers `S` at call sites.
cuda-samples #398 on the same GPU (RTX 4000 Ada):
```text
Untyped (buggy): sum = 1    ← silent wrong answer
Typed (fixed):   sum = 32   ← correct, AND the bug is a compile error
```

The buggy pattern literally cannot be expressed: `Warp<Lane0>` has no `shfl_down` method.
```sh
bash reproduce/demo.sh                        # The entire pitch in one terminal
cargo test                                    # 272 unit + 23 doc tests
cargo test --examples                         # 50 tests across 8 real-bug examples
cargo test --example nvidia_cuda_samples_398  # Real NVIDIA bug, caught by types
```

```rust
// my-kernels/src/lib.rs
use warp_types::*;

#[warp_kernel]
pub fn butterfly_reduce(data: *mut i32) {
    let warp: Warp<All> = Warp::kernel_entry();
    let tid = warp_types::gpu::thread_id_x();
    let mut val = unsafe { *data.add(tid as usize) };
    // Type system guarantees: warp is Warp<All>, so shuffle is available
    let d = data::PerLane::new(val);
    val += warp.shuffle_xor(d, 16).get();
    // ... butterfly stages ...
    unsafe { *data.add(tid as usize) = val; }
}
```

```rust
// build.rs
fn main() {
    warp_types_builder::WarpBuilder::new("my-kernels")
        .build()
        .expect("Failed to compile GPU kernels");
}
```

```rust
// src/main.rs
mod kernels {
    include!(concat!(env!("OUT_DIR"), "/kernels.rs"));
}

fn main() {
    let ctx = cudarc::driver::CudaContext::new(0).unwrap();
    let k = kernels::Kernels::load(&ctx).unwrap();
    // k.butterfly_reduce — ready to launch
}
```

| Claim | Evidence | Command |
|---|---|---|
| Shuffle safety (diverged warp can't shuffle) | 8 compile-fail doctests | `cargo test --doc` |
| Real bug caught at compile time | 8 worked bug examples (21 bugs surveyed) | `cargo test --examples` |
| Hardware reproduction | Deterministic wrong result on RTX 4000 Ada | `bash reproduce/demo.sh` |
| Real GPU execution | 4 kernels PASS on RTX 4000 Ada via cudarc | `cd examples/gpu-project && cargo run` |
| Cargo integration | `#[warp_kernel]` + `WarpBuilder` + `Kernels` struct | `cd examples/gpu-project && cargo run` |
| Zero overhead | Verified at MIR, LLVM IR, and PTX levels | `cargo rustc --release --lib -- --emit=llvm-ir` |
| Soundness (progress + preservation) | Full Lean 4 mechanization (28 theorems), zero `sorry`, zero axioms | `cd lean && lake build` |
| CUB-equivalent primitives | Typed reduce, scan, broadcast (8 tests) | `cargo test cub` |
| Fence-divergence safety | Type-state write tracking (3 tests) | `cargo test fence` |
| Platform portability (32/64 lanes) | u64 masks, AMD stubs, `GpuTarget` enum | `cargo test warp_size` |
| Gradual typing (`DynWarp` ↔ `Warp<S>`) | Runtime/compile-time bridge (18 tests) | `cargo test gradual` |
| All claims | Full test suite (345 tests) | `cargo test && cargo test --examples` |
```text
warp-types/
├── src/
│   ├── lib.rs           # Core exports + GpuValue trait + warp_kernel re-export
│   ├── active_set.rs    # Marker types: All, Even, Odd, LowHalf, ... (u64 masks)
│   ├── warp.rs          # Warp<S> — the core parameterized type
│   ├── data.rs          # PerLane<T>, Uniform<T>, SingleLane<T, N>
│   ├── diverge.rs       # Split warps by predicate
│   ├── merge.rs         # Rejoin complementary sub-warps
│   ├── shuffle.rs       # Shuffle/ballot/reduce (Warp<All> only) + permutation algebra
│   ├── cub.rs           # CUB-equivalent typed warp primitives
│   ├── sort.rs          # Bitonic sort (typed warp primitives)
│   ├── tile.rs          # Tile-level warp partitioning
│   ├── dynamic.rs       # Data-dependent divergence (DynDiverge)
│   ├── gpu.rs           # PTX/AMDGPU intrinsics + GpuShuffle trait
│   ├── fence.rs         # Fence-divergence type-state machine
│   ├── block.rs         # Block-level shared memory + reductions
│   ├── proof.rs         # Executable soundness sketch (9 lemmas)
│   ├── platform.rs      # CpuSimd<N> / GpuWarp32 / GpuWarp64 dual-mode
│   ├── warp_size.rs     # Const-generic warp size portability
│   ├── gradual.rs       # DynWarp ↔ Warp<S> gradual typing bridge
│   └── research/        # 24 research exploration modules
├── warp-types-macros/   # warp_sets! proc macro (active set hierarchy generation)
├── warp-types-kernel/   # #[warp_kernel] proc macro (GPU kernel entry points)
├── warp-types-builder/  # WarpBuilder (build.rs cross-compilation to PTX)
├── examples/
│   ├── nvidia_cuda_samples_398.rs   # Real NVIDIA bug, caught by types
│   ├── cub_cccl_854.rs              # CUB/CCCL sub-warp shuffle bug
│   ├── picongpu_2514.rs             # PIConGPU months of silent UB
│   ├── llvm_155682.rs               # LLVM lane-0 conditional shuffle
│   ├── opencv_12320.rs              # OpenCV warpScanInclusive deadlock
│   ├── pytorch_98157.rs             # PyTorch __activemask() misuse
│   ├── tvm_17307.rs                 # TVM LowerThreadAllreduce H100 crash
│   ├── demo_bug_that_types_catch.rs # Synthetic demonstration
│   └── gpu-project/                 # End-to-end cargo→GPU example
├── reproduce/
│   ├── demo.sh          # Full demonstration script
│   ├── host/            # cudarc host runner for real GPU execution
│   └── *.rs, *.cu       # PTX comparison + hardware reproduction
├── lean/                # Lean 4 formalization (28 theorems, zero sorry)
├── paper/               # Preprint (markdown)
├── tutorial/            # Step-by-step tutorial
├── blog/                # Blog post draft
└── INTEGRATION.md       # Sol integration guide
```
- **AMD untested.** u64 masks and `amdgpu` target stubs are in place but require AMD hardware for validation.
- **Nightly required for GPU kernels.** `#[warp_kernel]` requires the `abi_ptx` and `asm_experimental_arch` features. The type system itself works on stable Rust.
MIT