feat(cuda-oxide): RoPE (adjacent-pair) pure-Rust #[kernel] port + gx10 sm_121 A/B vs hand-PTX (PMAT-921)#2215

Merged: noahgift merged 1 commit into main from feat/cuda-oxide-rope on Jun 24, 2026

@noahgift

PMAT-921 — cuda-oxide RoPE (adjacent-pair) port: GB10 sm_121 A/B vs hand-PTX

Advances the cuda-oxide marquee track (pure-Rust #[kernel] → PTX on Blackwell). Ports the standard adjacent-pair RoPE apply kernel (hand-PTX RopeKernel, entry rope, in crates/aprender-gpu/src/kernels/elementwise/rope/standard.rs) to a pure-Rust cuda-oxide #[kernel] and runs a true on-device A/B vs the hand-PTX baseline on the GB10 (sm_121 / compute_cap 12.1). Mirrors the established PMAT-882/893/894 port pattern (attention / RMSNorm / SwiGLU — all GO).

Why RoPE (per the rule)

RoPE is decode-hot (applied to Q and K every layer, every token) and is pure f32 FMA + sin/cos/ex2 transcendentals (ZERO DP4A) — squarely the established GO class. The rule from the prior verdicts: port FMA/softmax/elementwise/transcendental kernels; keep DP4A/INT8-bound Q4K GEMV/FFN (PMAT-881, the lone NO-GO) on hand-PTX. RoPE was the next-best un-ported candidate.
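For readers unfamiliar with the operation, the adjacent-pair rotation can be sketched on the CPU as below. This is a minimal illustration of the standard RoPE formula in f64, not the code from this PR: the function name `rope_adjacent_pair_f64` and its signature are hypothetical, and the actual `#[kernel]` and the f64 reference in `src/main.rs` may be organized differently.

```rust
/// Hypothetical CPU sketch of adjacent-pair RoPE for a single head.
/// Rotates each pair (x[2i], x[2i+1]) in place by the angle pos * theta^(-2i/head_dim).
/// `pos` is the token position and `theta` the RoPE base (e.g. 10000.0).
fn rope_adjacent_pair_f64(x: &mut [f64], pos: usize, theta: f64) {
    let head_dim = x.len();
    assert!(head_dim % 2 == 0, "head_dim must be even");
    for i in 0..head_dim / 2 {
        // Standard RoPE frequency for pair i.
        let freq = theta.powf(-2.0 * i as f64 / head_dim as f64);
        let (sin, cos) = (pos as f64 * freq).sin_cos();
        let (x0, x1) = (x[2 * i], x[2 * i + 1]);
        // 2-D rotation of the pair.
        x[2 * i] = x0 * cos - x1 * sin;
        x[2 * i + 1] = x0 * sin + x1 * cos;
    }
}
```

Each pair is read once and written once with a handful of multiplies and one sin/cos in between, which is why the kernel has no DP4A or integer dot-product work.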

VERDICT: GO (statistical tie) — F-OXIDE-ROPE-PARITY-001

PARITY vs an f64 CPU reference — PASS at every shape (incl. Qwen2.5 high-freq theta=1M):

| heads | head_dim | theta | cos | maxdiff (A / B) |
|---|---|---|---|---|
| 32 | 128 | 10000 | 1.0000000 | 1.07e-6 / 6.86e-7 |
| 14 | 128 | 1000000 | 1.0000000 | 2.83e-6 / 2.83e-6 |
| 128 | 128 | 10000 | 1.0000000 | 1.65e-5 / 7.47e-6 |
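The two parity columns (cosine similarity and maximum absolute difference against the f64 reference) can be computed along these lines. This is a sketch under assumptions: the helper names `cosine_similarity` and `max_abs_diff` are hypothetical, and the exact comparison code and tolerances used by the spike are not shown in this PR description.

```rust
/// Hypothetical helper: cosine similarity between f32 device output and an f64 reference,
/// accumulated in f64.
fn cosine_similarity(got: &[f32], want: &[f64]) -> f64 {
    assert_eq!(got.len(), want.len());
    let (mut dot, mut gg, mut ww) = (0.0f64, 0.0f64, 0.0f64);
    for (&g, &w) in got.iter().zip(want) {
        let g = g as f64;
        dot += g * w;
        gg += g * g;
        ww += w * w;
    }
    dot / (gg.sqrt() * ww.sqrt())
}

/// Hypothetical helper: largest element-wise absolute difference.
fn max_abs_diff(got: &[f32], want: &[f64]) -> f64 {
    got.iter()
        .zip(want)
        .map(|(&g, &w)| (g as f64 - w).abs())
        .fold(0.0, f64::max)
}
```

Reporting both numbers is useful because cosine similarity is insensitive to a few large per-element errors that the max difference would expose.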

PERF — FAIR matched single-launch A/B vs the real hand-PTX rope (RopeKernel::emit_ptx_for_target("sm_121"), committed in baseline-ptx/), same data + same grid (num_heads × head_dim/2) + GPU-event median 5×100. Gate oxide_us/handptx_us ≤ 1.2 = GO:

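| heads | head_dim | theta | oxA µs | oxB µs | hand-PTX µs | best | ratio | verdict |
|---|---|---|---|---|---|---|---|---|
| 32 | 128 | 10000 | 2.070 | 2.073 | 1.863 | A | 1.111× | GO |
| 14 | 128 | 1000000 | 2.032 | 2.052 | 2.030 | A | 1.001× | GO |
| 128 | 128 | 10000 | 2.215 | 2.159 | 2.041 | B | 1.058× | GO |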

Best ratio swings 0.98×–1.12× over 4 runs, centered ~1.00–1.05× — a clean statistical tie at the DRAM-bandwidth roofline (~2 µs both sides), exactly as predicted for an elementwise trig kernel. PASS at every shape.
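The gate arithmetic behind the table can be sketched as follows. The helpers `median` and `gate` are hypothetical names introduced for illustration; how the spike actually aggregates its 5×100 GPU-event samples is not shown here, so the median helper is a generic one.

```rust
/// Hypothetical helper: median of a non-empty set of timing samples (sorts in place).
fn median(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty());
    samples.sort_by(|a, b| a.total_cmp(b));
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        0.5 * (samples[n / 2 - 1] + samples[n / 2])
    }
}

/// Hypothetical helper: pick the faster oxide variant and apply the
/// oxide_us / handptx_us <= limit gate. Returns (best variant, ratio, GO?).
fn gate(ox_a_us: f64, ox_b_us: f64, hand_us: f64, limit: f64) -> (char, f64, bool) {
    let (best, best_us) = if ox_a_us <= ox_b_us { ('A', ox_a_us) } else { ('B', ox_b_us) };
    let ratio = best_us / hand_us;
    (best, ratio, ratio <= limit)
}
```

For the first row, `gate(2.070, 2.073, 1.863, 1.2)` selects variant A with a ratio of about 1.111, which is under the 1.2 limit and therefore GO.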

Honest framing

This is a tie, not a speedup, and is reported as one. Adjacent-pair RoPE is memory-bandwidth-bound (1 load + 1 store per element), so neither hand-PTX nor oxide can beat the wall. The GO value: the pure-Rust #[kernel] matches the hand-PTX with no perf loss, so this decode-hot kernel can be migrated off hand-PTX (and off the GH-480 Blackwell-JIT workaround) for free — another north-star datapoint where pure-Rust→PTX replaces hand-PTX. Production migration of the live trueno RopeKernel is a follow-up, not this PR.

What's in the PR (no CI impact)

The spike is an isolated crate with its own [workspace] and the cuda-oxide toolchain (nightly-2026-04-03 + LLVM-21 + cargo-oxide 0.2.1) — it builds ONLY on gx10 and never affects the normal cargo build / CI.

  • experiments/cuda-oxide/rope/ — spike crate (src/main.rs, 2 variants) + 2 committed sm_121 hand-PTX baselines + RESULTS.md
  • contracts/cuda-oxide-rope-parity-v1.yaml: F-OXIDE-ROPE-PARITY-001..005, pv validate clean, full-dir pv lint PASS (0 errors)
  • evidence/pmat-921-cuda-oxide-rope-2026-06-24/findings.json + ab_run.txt (GPU-evidence archive, per convention)
  • experiments/cuda-oxide/README.md — adds the rope/ row (+ backfills the missing swiglu/ PMAT-894 row)

Reproduce on gx10

ssh gx10
export PATH="$HOME/.cargo/bin:/usr/lib/llvm-21/bin:$PATH"
export LLVM_SYS_211_PREFIX=/usr/lib/llvm-21
cd experiments/cuda-oxide/rope
cargo oxide run   # parity (2 variants × 3 shapes) + fair matched-launch A/B

🤖 Generated with Claude Code

…0 sm_121 A/B vs hand-PTX (PMAT-921)

Port the standard adjacent-pair RoPE apply kernel (hand-PTX RopeKernel, entry
`rope`, in crates/aprender-gpu/src/kernels/elementwise/rope/standard.rs) to a
pure-Rust cuda-oxide #[kernel] and on-device A/B it vs the hand-PTX baseline on
the GB10 Blackwell (sm_121 / compute_cap 12.1). Mirrors the PMAT-882/893/894 port
pattern (attention / RMSNorm / SwiGLU, all GO).

WHY RoPE: it is decode-hot (applied to Q and K every layer, every token) and is
pure f32 FMA + sin/cos/ex2 transcendentals (ZERO DP4A) — squarely the established
GO class. Only DP4A/INT8-bound Q4K GEMV/FFN (PMAT-881) is the NO-GO class.

VERDICT: GO (statistical tie). Falsifier F-OXIDE-ROPE-PARITY-001:
- PARITY: cos=1.0000000 vs an f64 CPU reference at every shape, incl. the Qwen2.5
  high-frequency theta=1_000_000 case (maxdiff 6.9e-7 .. 1.6e-5). PASS.
- PERF: FAIR matched single-launch A/B vs the REAL hand-PTX `rope`
  (RopeKernel::emit_ptx_for_target("sm_121"), committed in baseline-ptx/), same
  data + same grid (num_heads x head_dim/2) + GPU-event median 5x100. best ratio
  0.98x-1.12x across heads 14/32/128, centered ~1.00-1.05x = a clean tie at the
  DRAM-bandwidth roofline (~2us both sides). <=1.2x gate PASS at every shape.

This is a TIE, not a speedup, and is reported as one: adjacent-pair RoPE is
memory-bandwidth-bound so neither side beats the wall. The GO value is that the
pure-Rust #[kernel] MATCHES the hand-PTX with no perf loss, so this decode-hot
kernel can be migrated off hand-PTX (and off the GH-480 Blackwell-JIT workaround)
for free — another north-star datapoint where pure-Rust->PTX replaces hand-PTX.

Isolated spike crate (own [workspace], builds ONLY on gx10 with the cuda-oxide
toolchain), so it never affects the normal cargo build / CI. Adds:
- experiments/cuda-oxide/rope/ (spike crate + 2 sm_121 hand-PTX baselines)
- contracts/cuda-oxide-rope-parity-v1.yaml (F-OXIDE-ROPE-PARITY-001..005, pv-valid)
- evidence/pmat-921-cuda-oxide-rope-2026-06-24/ (findings.json + ab_run.txt)
- README row for rope (+ backfilled the missing swiglu/PMAT-894 row)

Production migration of the live trueno RopeKernel is a follow-up, not this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge June 24, 2026 10:36
@noahgift noahgift added this pull request to the merge queue Jun 24, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 24, 2026
@noahgift noahgift added this pull request to the merge queue Jun 24, 2026
Merged via the queue into main with commit 167023a Jun 24, 2026
11 checks passed
@noahgift noahgift deleted the feat/cuda-oxide-rope branch June 24, 2026 12:30