feat(cuda-oxide): RoPE (adjacent-pair) pure-Rust #[kernel] port + gx10 sm_121 A/B vs hand-PTX (PMAT-921) by noahgift · Pull Request #2215 · paiml/aprender

noahgift · 2026-06-24T10:35:52Z

PMAT-921 — cuda-oxide RoPE (adjacent-pair) port: GB10 sm_121 A/B vs hand-PTX

Advances the cuda-oxide marquee track (pure-Rust #[kernel] → PTX on Blackwell). Ports the standard adjacent-pair RoPE apply kernel (hand-PTX RopeKernel, entry rope, in crates/aprender-gpu/src/kernels/elementwise/rope/standard.rs) to a pure-Rust cuda-oxide #[kernel] and runs a true on-device A/B vs the hand-PTX baseline on the GB10 (sm_121 / compute_cap 12.1). Mirrors the established PMAT-882/893/894 port pattern (attention / RMSNorm / SwiGLU — all GO).

Why RoPE (per the rule)

RoPE is decode-hot (applied to Q and K every layer, every token) and is pure f32 FMA + sin/cos/ex2 transcendentals (ZERO DP4A) — squarely the established GO class. The rule from the prior verdicts: port FMA/softmax/elementwise/transcendental kernels; keep DP4A/INT8-bound Q4K GEMV/FFN (PMAT-881, the lone NO-GO) on hand-PTX. RoPE was the next-best un-ported candidate.
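For readers who want the arithmetic spelled out, here is a minimal CPU sketch of an adjacent-pair RoPE apply in f64, the kind of reference the parity check below compares against. It is not the code from the spike crate. The function name `rope_adjacent_pair_ref`, the head-major layout, the single `pos` argument, and the frequency formula `theta^(-2i/head_dim)` (the usual RoPE convention) are assumptions for illustration.

```rust
/// Hypothetical f64 reference for adjacent-pair RoPE at one token position.
/// Assumes `x` holds `num_heads * head_dim` values, head-major, and that
/// pair `i` of each head is rotated by `pos * theta^(-2i/head_dim)`.
fn rope_adjacent_pair_ref(
    x: &[f32],
    num_heads: usize,
    head_dim: usize,
    pos: u32,
    theta: f64,
) -> Vec<f64> {
    assert_eq!(x.len(), num_heads * head_dim);
    assert!(head_dim % 2 == 0, "head_dim must be even");
    let half = head_dim / 2;
    let mut out = vec![0.0f64; x.len()];
    // One (head, pair) iteration corresponds to one thread of a
    // num_heads x head_dim/2 grid.
    for h in 0..num_heads {
        for i in 0..half {
            let inv_freq = theta.powf(-(2.0 * i as f64) / head_dim as f64);
            let (s, c) = (pos as f64 * inv_freq).sin_cos();
            // Adjacent-pair layout: elements 2i and 2i+1 form one 2D vector.
            let base = h * head_dim + 2 * i;
            let (x0, x1) = (x[base] as f64, x[base + 1] as f64);
            out[base] = x0 * c - x1 * s;
            out[base + 1] = x0 * s + x1 * c;
        }
    }
    out
}
```

Each pair costs one sin/cos evaluation plus four multiplies and two adds, which is why the kernel has no integer dot-product work at all.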

VERDICT: GO (statistical tie) — `F-OXIDE-ROPE-PARITY-001`

PARITY vs an f64 CPU reference — PASS at every shape (incl. Qwen2.5 high-freq theta=1M):

| heads | head_dim | theta | cos | maxdiff (A / B) |
|---|---|---|---|---|
| 32 | 128 | 10000 | 1.0000000 | 1.07e-6 / 6.86e-7 |
| 14 | 128 | 1000000 | 1.0000000 | 2.83e-6 / 2.83e-6 |
| 128 | 128 | 10000 | 1.0000000 | 1.65e-5 / 7.47e-6 |
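The two parity columns can be computed along these lines. This is a sketch only: it assumes `cos` means cosine similarity between the GPU output and the f64 reference, and `maxdiff` means the largest absolute element-wise difference. The helper names are hypothetical and not taken from the spike crate.

```rust
/// Cosine similarity between an f32 GPU output and an f64 reference.
/// (Assumed meaning of the `cos` column.)
fn cosine_similarity(got: &[f32], want: &[f64]) -> f64 {
    assert_eq!(got.len(), want.len());
    let (mut dot, mut gg, mut ww) = (0.0f64, 0.0f64, 0.0f64);
    for (&g, &w) in got.iter().zip(want) {
        let g = g as f64;
        dot += g * w;
        gg += g * g;
        ww += w * w;
    }
    dot / (gg.sqrt() * ww.sqrt())
}

/// Largest absolute element-wise difference.
/// (Assumed meaning of the `maxdiff` column.)
fn max_abs_diff(got: &[f32], want: &[f64]) -> f64 {
    got.iter()
        .zip(want)
        .map(|(&g, &w)| (g as f64 - w).abs())
        .fold(0.0, f64::max)
}
```

Accumulating in f64 keeps the metric itself from adding rounding error on the order of the differences being measured.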

PERF: FAIR matched single-launch A/B vs the real hand-PTX rope (RopeKernel::emit_ptx_for_target("sm_121"), committed in baseline-ptx/), using the same data, the same grid (num_heads × head_dim/2), and GPU-event timing (median, 5×100). Gate: oxide_us / handptx_us ≤ 1.2 counts as GO:

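| heads | head_dim | theta | oxA µs | oxB µs | hand-PTX µs | best ratio | verdict |
|---|---|---|---|---|---|---|---|
| 32 | 128 | 10000 | 2.070 | 2.073 | 1.863 | A 1.111× | GO |
| 14 | 128 | 1000000 | 2.032 | 2.052 | 2.030 | A 1.001× | GO |
| 128 | 128 | 10000 | 2.215 | 2.159 | 2.041 | B 1.058× | GO |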

Best ratio swings 0.98×–1.12× over 4 runs, centered ~1.00–1.05× — a clean statistical tie at the DRAM-bandwidth roofline (~2 µs both sides), exactly as predicted for an elementwise trig kernel. PASS at every shape.
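The gate arithmetic can be sketched as follows. The helper names are hypothetical, and reading "best ratio" as the faster of the two oxide variants divided by the hand-PTX time is an assumption inferred from the table. How the 5×100 samples are grouped before taking the median is not specified here, so `median` simply takes whatever samples it is given.

```rust
/// Median of a set of timing samples in microseconds.
fn median(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty());
    samples.sort_by(|a, b| a.partial_cmp(b).expect("timings must not be NaN"));
    let mid = samples.len() / 2;
    if samples.len() % 2 == 1 {
        samples[mid]
    } else {
        0.5 * (samples[mid - 1] + samples[mid])
    }
}

/// Faster oxide variant relative to the hand-PTX baseline.
fn best_ratio(ox_a_us: f64, ox_b_us: f64, hand_ptx_us: f64) -> f64 {
    ox_a_us.min(ox_b_us) / hand_ptx_us
}

/// GO when the oxide kernel is at most 1.2x the hand-PTX time.
fn is_go(ratio: f64) -> bool {
    ratio <= 1.2
}
```

For the first row, `best_ratio(2.070, 2.073, 1.863)` is about 1.111, which is under the 1.2 gate.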

Honest framing

This is a tie, not a speedup, and is reported as one. Adjacent-pair RoPE is memory-bandwidth-bound (1 load + 1 store per element), so neither hand-PTX nor oxide can beat the wall. The GO value: the pure-Rust #[kernel] matches the hand-PTX with no perf loss, so this decode-hot kernel can be migrated off hand-PTX (and off the GH-480 Blackwell-JIT workaround) for free — another north-star datapoint where pure-Rust→PTX replaces hand-PTX. Production migration of the live trueno RopeKernel is a follow-up, not this PR.

What's in the PR (no CI impact)

The spike is an isolated crate with its own [workspace] and the cuda-oxide toolchain (nightly-2026-04-03 + LLVM-21 + cargo-oxide 0.2.1) — it builds ONLY on gx10 and never affects the normal cargo build / CI.

experiments/cuda-oxide/rope/ — spike crate (src/main.rs, 2 variants) + 2 committed sm_121 hand-PTX baselines + RESULTS.md
contracts/cuda-oxide-rope-parity-v1.yaml — F-OXIDE-ROPE-PARITY-001..005, pv validate clean, full-dir pv lint PASS (0 errors)
evidence/pmat-921-cuda-oxide-rope-2026-06-24/ — findings.json + ab_run.txt (GPU-evidence archive, per convention)
experiments/cuda-oxide/README.md — adds the rope/ row (+ backfills the missing swiglu/ PMAT-894 row)

Reproduce on gx10

ssh gx10
export PATH="$HOME/.cargo/bin:/usr/lib/llvm-21/bin:$PATH"
export LLVM_SYS_211_PREFIX=/usr/lib/llvm-21
cd experiments/cuda-oxide/rope
cargo oxide run   # parity (2 variants × 3 shapes) + fair matched-launch A/B

🤖 Generated with Claude Code

…0 sm_121 A/B vs hand-PTX (PMAT-921)

Port the standard adjacent-pair RoPE apply kernel (hand-PTX RopeKernel, entry `rope`, in crates/aprender-gpu/src/kernels/elementwise/rope/standard.rs) to a pure-Rust cuda-oxide #[kernel] and on-device A/B it vs the hand-PTX baseline on the GB10 Blackwell (sm_121 / compute_cap 12.1). Mirrors the PMAT-882/893/894 port pattern (attention / RMSNorm / SwiGLU, all GO).

WHY RoPE: it is decode-hot (applied to Q and K every layer, every token) and is pure f32 FMA + sin/cos/ex2 transcendentals (ZERO DP4A), squarely the established GO class. Only DP4A/INT8-bound Q4K GEMV/FFN (PMAT-881) is the NO-GO class.

VERDICT: GO (statistical tie). Falsifier F-OXIDE-ROPE-PARITY-001:
- PARITY: cos=1.0000000 vs an f64 CPU reference at every shape, incl. the Qwen2.5 high-frequency theta=1_000_000 case (maxdiff 6.9e-7 .. 1.6e-5). PASS.
- PERF: FAIR matched single-launch A/B vs the REAL hand-PTX `rope` (RopeKernel::emit_ptx_for_target("sm_121"), committed in baseline-ptx/), same data + same grid (num_heads x head_dim/2) + GPU-event median 5x100. best ratio 0.98x-1.12x across heads 14/32/128, centered ~1.00-1.05x = a clean tie at the DRAM-bandwidth roofline (~2us both sides). <=1.2x gate PASS at every shape.

This is a TIE, not a speedup, and is reported as one: adjacent-pair RoPE is memory-bandwidth-bound so neither side beats the wall. The GO value is that the pure-Rust #[kernel] MATCHES the hand-PTX with no perf loss, so this decode-hot kernel can be migrated off hand-PTX (and off the GH-480 Blackwell-JIT workaround) for free, another north-star datapoint where pure-Rust->PTX replaces hand-PTX.

Isolated spike crate (own [workspace], builds ONLY on gx10 with the cuda-oxide toolchain), so it never affects the normal cargo build / CI.

Adds:
- experiments/cuda-oxide/rope/ (spike crate + 2 sm_121 hand-PTX baselines)
- contracts/cuda-oxide-rope-parity-v1.yaml (F-OXIDE-ROPE-PARITY-001..005, pv-valid)
- evidence/pmat-921-cuda-oxide-rope-2026-06-24/ (findings.json + ab_run.txt)
- README row for rope (+ backfilled the missing swiglu/PMAT-894 row)

Production migration of the live trueno RopeKernel is a follow-up, not this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

noahgift enabled auto-merge June 24, 2026 10:36

noahgift added this pull request to the merge queue Jun 24, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 24, 2026

noahgift added this pull request to the merge queue Jun 24, 2026

Merged via the queue into main with commit 167023a Jun 24, 2026
11 checks passed

noahgift deleted the feat/cuda-oxide-rope branch June 24, 2026 12:30

noahgift merged 1 commit into main from feat/cuda-oxide-rope
