feat(cuda-oxide): RoPE (adjacent-pair) pure-Rust #[kernel] port + gx10 sm_121 A/B vs hand-PTX (PMAT-921) #2215
Merged
Port the standard adjacent-pair RoPE apply kernel (hand-PTX RopeKernel, entry
`rope`, in crates/aprender-gpu/src/kernels/elementwise/rope/standard.rs) to a
pure-Rust cuda-oxide #[kernel] and on-device A/B it vs the hand-PTX baseline on
the GB10 Blackwell (sm_121 / compute_cap 12.1). Mirrors the PMAT-882/893/894 port
pattern (attention / RMSNorm / SwiGLU, all GO).
WHY RoPE: it is decode-hot (applied to Q and K every layer, every token) and is
pure f32 FMA + sin/cos/ex2 transcendentals (ZERO DP4A) — squarely the established
GO class. Only DP4A/INT8-bound Q4K GEMV/FFN (PMAT-881) is the NO-GO class.
VERDICT: GO (statistical tie). Falsifier F-OXIDE-ROPE-PARITY-001:
- PARITY: cos=1.0000000 vs an f64 CPU reference at every shape, incl. the Qwen2.5
high-frequency theta=1_000_000 case (maxdiff 6.9e-7 .. 1.6e-5). PASS.
- PERF: FAIR matched single-launch A/B vs the REAL hand-PTX `rope`
(RopeKernel::emit_ptx_for_target("sm_121"), committed in baseline-ptx/), same
data + same grid (num_heads x head_dim/2) + GPU-event median 5x100. best ratio
0.98x-1.12x across heads 14/32/128, centered ~1.00-1.05x = a clean tie at the
DRAM-bandwidth roofline (~2us both sides). <=1.2x gate PASS at every shape.
This is a TIE, not a speedup, and is reported as one: adjacent-pair RoPE is
memory-bandwidth-bound so neither side beats the wall. The GO value is that the
pure-Rust #[kernel] MATCHES the hand-PTX with no perf loss, so this decode-hot
kernel can be migrated off hand-PTX (and off the GH-480 Blackwell-JIT workaround)
for free — another north-star datapoint where pure-Rust->PTX replaces hand-PTX.
Isolated spike crate (own [workspace], builds ONLY on gx10 with the cuda-oxide
toolchain), so it never affects the normal cargo build / CI. Adds:
- experiments/cuda-oxide/rope/ (spike crate + 2 sm_121 hand-PTX baselines)
- contracts/cuda-oxide-rope-parity-v1.yaml (F-OXIDE-ROPE-PARITY-001..005, pv-valid)
- evidence/pmat-921-cuda-oxide-rope-2026-06-24/ (findings.json + ab_run.txt)
- README row for rope (+ backfilled the missing swiglu/PMAT-894 row)
Production migration of the live trueno RopeKernel is a follow-up, not this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PMAT-921 — cuda-oxide RoPE (adjacent-pair) port: GB10 sm_121 A/B vs hand-PTX
Advances the cuda-oxide marquee track (pure-Rust `#[kernel]` → PTX on Blackwell). Ports the standard adjacent-pair RoPE apply kernel (hand-PTX `RopeKernel`, entry `rope`, in `crates/aprender-gpu/src/kernels/elementwise/rope/standard.rs`) to a pure-Rust cuda-oxide `#[kernel]` and runs a true on-device A/B vs the hand-PTX baseline on the GB10 (sm_121 / compute_cap 12.1). Mirrors the established PMAT-882/893/894 port pattern (attention / RMSNorm / SwiGLU, all GO).
Why RoPE (per the rule)
RoPE is decode-hot (applied to Q and K every layer, every token) and is pure f32 FMA + sin/cos/ex2 transcendentals (ZERO DP4A) — squarely the established GO class. The rule from the prior verdicts: port FMA/softmax/elementwise/transcendental kernels; keep DP4A/INT8-bound Q4K GEMV/FFN (PMAT-881, the lone NO-GO) on hand-PTX. RoPE was the next-best un-ported candidate.
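For readers unfamiliar with the adjacent-pair layout, a minimal f64 CPU reference might look like the sketch below. It is not the code in this PR: the function name, argument order, and row-major `[num_heads][head_dim]` layout are assumptions made for illustration. Each pair `(2i, 2i+1)` is rotated by the angle `pos * theta^(-2i/head_dim)`, which corresponds to one GPU thread per pair on a `num_heads × head_dim/2` grid.

```rust
/// Hypothetical f64 CPU reference for adjacent-pair RoPE (illustrative only;
/// the name, signature, and memory layout are assumptions, not the PR's code).
/// `x` holds `num_heads * head_dim` values, row-major by head.
fn rope_adjacent_pair_ref(
    x: &mut [f64],
    num_heads: usize,
    head_dim: usize,
    pos: usize,
    theta: f64,
) {
    assert_eq!(x.len(), num_heads * head_dim);
    assert!(head_dim % 2 == 0, "head_dim must be even");
    let half = head_dim / 2;
    for h in 0..num_heads {
        for i in 0..half {
            // Per-pair inverse frequency: theta^(-2i / head_dim).
            let inv_freq = theta.powf(-(2.0 * i as f64) / head_dim as f64);
            let (sin, cos) = (pos as f64 * inv_freq).sin_cos();
            // Adjacent-pair layout: elements 2i and 2i+1 form one rotated pair.
            let base = h * head_dim + 2 * i;
            let (a, b) = (x[base], x[base + 1]);
            x[base] = a * cos - b * sin;
            x[base + 1] = a * sin + b * cos;
        }
    }
}
```

At `pos = 0` every angle is zero, so the input comes back unchanged, which makes a convenient sanity check before comparing against GPU output.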
VERDICT: GO (statistical tie), `F-OXIDE-ROPE-PARITY-001`
PARITY vs an f64 CPU reference: PASS at every shape (incl. Qwen2.5 high-freq theta=1M), with cos=1.0000000 and maxdiff 6.9e-7 to 1.6e-5.
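The two parity figures reported for this falsifier (cosine similarity and maximum absolute difference against the f64 reference) can be computed roughly as in the sketch below. The helper name `parity_metrics` is hypothetical, and the pass thresholds are deliberately left out because they are defined in the contract, not here.

```rust
/// Hypothetical helper (not from the PR): compares f32 kernel output against
/// an f64 reference and returns (cosine similarity, max absolute difference).
fn parity_metrics(gpu: &[f32], reference: &[f64]) -> (f64, f64) {
    assert_eq!(gpu.len(), reference.len());
    let (mut dot, mut norm_g, mut norm_r, mut maxdiff) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    for (&g, &r) in gpu.iter().zip(reference) {
        let g = g as f64; // widen before accumulating to avoid f32 round-off
        dot += g * r;
        norm_g += g * g;
        norm_r += r * r;
        maxdiff = maxdiff.max((g - r).abs());
    }
    (dot / (norm_g.sqrt() * norm_r.sqrt()), maxdiff)
}
```

Accumulating in f64 keeps the metric itself from adding error on top of the f32 kernel output being measured.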
PERF: FAIR matched single-launch A/B vs the real hand-PTX `rope` (`RopeKernel::emit_ptx_for_target("sm_121")`, committed in `baseline-ptx/`), same data + same grid (num_heads × head_dim/2) + GPU-event median 5×100. Gate: `oxide_us/handptx_us` ≤ 1.2 = GO.
Best ratio swings 0.98×–1.12× over 4 runs, centered ~1.00–1.05×: a clean statistical tie at the DRAM-bandwidth roofline (~2 µs both sides), exactly as predicted for an elementwise trig kernel. PASS at every shape.
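The gate arithmetic can be sketched as follows. This assumes "median 5×100" means five timing samples per side, each averaged over 100 launches; that reading, the function names, and any sample values passed in are illustrative rather than taken from `ab_run.txt`.

```rust
/// Median of a non-empty slice of timings in microseconds.
fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty());
    let mut s = samples.to_vec();
    s.sort_by(|a, b| a.total_cmp(b));
    let n = s.len();
    if n % 2 == 1 {
        s[n / 2]
    } else {
        0.5 * (s[n / 2 - 1] + s[n / 2])
    }
}

/// Hypothetical gate check: returns (ratio, is_go), where the ratio is
/// median(oxide_us) / median(handptx_us) and GO means ratio <= 1.2.
fn perf_gate(oxide_us: &[f64], handptx_us: &[f64]) -> (f64, bool) {
    let ratio = median(oxide_us) / median(handptx_us);
    (ratio, ratio <= 1.2)
}
```

Using the median rather than the mean keeps a single slow outlier run from flipping the verdict, which matters when both sides sit near 2 µs.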
Honest framing
This is a tie, not a speedup, and is reported as one. Adjacent-pair RoPE is memory-bandwidth-bound (1 load + 1 store per element), so neither hand-PTX nor oxide can beat the wall. The GO value: the pure-Rust `#[kernel]` matches the hand-PTX with no perf loss, so this decode-hot kernel can be migrated off hand-PTX (and off the GH-480 Blackwell-JIT workaround) for free. It is another north-star datapoint where pure-Rust→PTX replaces hand-PTX. Production migration of the live `trueno` RopeKernel is a follow-up, not this PR.
What's in the PR (no CI impact)
The spike is an isolated crate with its own `[workspace]` and the cuda-oxide toolchain (nightly-2026-04-03 + LLVM-21 + cargo-oxide 0.2.1). It builds ONLY on gx10 and never affects the normal `cargo build` / CI.
- `experiments/cuda-oxide/rope/`: spike crate (`src/main.rs`, 2 variants) + 2 committed sm_121 hand-PTX baselines + `RESULTS.md`
- `contracts/cuda-oxide-rope-parity-v1.yaml`: `F-OXIDE-ROPE-PARITY-001..005`, `pv validate` clean, full-dir `pv lint` PASS (0 errors)
- `evidence/pmat-921-cuda-oxide-rope-2026-06-24/`: `findings.json` + `ab_run.txt` (GPU-evidence archive, per convention)
- `experiments/cuda-oxide/README.md`: adds the `rope/` row (+ backfills the missing `swiglu`/PMAT-894 row)
Reproduce on gx10
🤖 Generated with Claude Code