spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) by noahgift · Pull Request #1882 · paiml/aprender

noahgift · 2026-05-22T13:34:03Z

Summary

After a 1.5-hour bisect + layer-trace investigation on 2026-05-22, the cuBLAS FP8 7B Q4K `<|im_start|>` gibberish is confirmed to be a pre-existing fragile path, not a recent regression: v0.31.2, v0.33.0, and v0.34.0 all reproduce it on the same RTX 4090.

This PR adds docs/specifications/SPEC-CUBLAS-FP8-7B-FIX-001.md — a 6-stage falsifier cascade epic to root-cause it, per memory/feedback_falsifier_cascade_decomposes_magnitude.md.

Per-user decision (2026-05-22)

"Open a multi-day epic for the cuBLAS fix; ship v0.35.0 later"

v0.35.0 release tag + crates.io publish cascade are held until this epic discharges. The 8 individual fix PRs already in flight (#1867 #1868 #1870 #1872 #1873 #1875 #1876 #1878) continue to land on main as net-positive improvements.

Why bisection was invalid

`git bisect run` named `8bd4ce5a` (the monorepo consolidation commit) as the first bad commit. Re-running the oracle at v0.31.2, the supposed "good" baseline, showed the same `<|im_start|>` gibberish, so the bisect result is invalid. The Golden Output gate signal is non-deterministic: CUDA context poisoning recovery masks the symptom on some runs. See memory/feedback_test_methodology_can_fake_bugs.md.
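Before trusting a bisect, the oracle can be run several times on a single commit to check that its verdict is stable. The following is a minimal sketch of that check, not code from this repository: `classify_oracle` and `OracleVerdict` are hypothetical names, and the closure stands in for one invocation of the Golden Output gate.

```rust
/// Outcome of running the same oracle several times on one commit.
#[derive(Debug, PartialEq)]
enum OracleVerdict {
    AlwaysPass,
    AlwaysFail,
    /// Mixed results: the signal cannot drive `git bisect run`.
    Flaky { passes: usize, fails: usize },
}

/// Run `oracle` `runs` times and classify the signal.
/// `oracle` is a hypothetical stand-in for one gate invocation and
/// returns `true` when the gate passes.
fn classify_oracle<F: FnMut() -> bool>(mut oracle: F, runs: usize) -> OracleVerdict {
    let passes = (0..runs).filter(|_| oracle()).count();
    let fails = runs - passes;
    match (passes, fails) {
        (_, 0) => OracleVerdict::AlwaysPass, // also covers runs == 0
        (0, _) => OracleVerdict::AlwaysFail,
        _ => OracleVerdict::Flaky { passes, fails },
    }
}
```

Only an `AlwaysPass` verdict on the good endpoint and an `AlwaysFail` verdict on the bad endpoint would justify a bisect; a `Flaky` verdict on either one means the reproducer needs fixing first.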

What we know

| Datum | Value |
|---|---|
| Reproducer | `apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf` |
| Symptom | `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish` |
| Hardware | RTX 4090 (sm_89) |
| Backend | cuBLASLt FP8 (197 weight matrices, 6.7 GB cache) |
| Layer-0 trace | CPU and cuBLAS Q/K inputs already differ at layer 0 |
| Logit correlation | 0.987 (high): GPU and CPU logits are approximately linearly related |
| Linear fit | `GPU ≈ 0.96 × CPU + 0.12` |
| CUDA failfast | Context poisoned during executor lifetime; recovery silently produces garbage |
| `trueno-gpu v0.4.36` | bit-identical between v0.31.2 and `8bd4ce5` (same crates.io checksum) |
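The logit correlation and linear fit in the table can be computed from paired CPU and GPU logit vectors with an ordinary least-squares fit. This is a sketch of that calculation under the assumption that both vectors are available as `f64` slices; `linear_fit` is a hypothetical helper, not an API from aprender or trueno-gpu.

```rust
/// Least-squares fit `y ≈ slope * x + intercept` plus Pearson correlation `r`.
/// Returns `(slope, intercept, r)`, or `None` for mismatched lengths,
/// fewer than two points, or zero variance in either input.
fn linear_fit(x: &[f64], y: &[f64]) -> Option<(f64, f64, f64)> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    // Accumulate centered sums of squares and cross products.
    let (mut sxx, mut syy, mut sxy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&xi, &yi) in x.iter().zip(y) {
        let (dx, dy) = (xi - mean_x, yi - mean_y);
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }

    let slope = sxy / sxx;
    let intercept = mean_y - slope * mean_x;
    let r = sxy / (sxx * syy).sqrt();
    Some((slope, intercept, r))
}
```

Calling `linear_fit(cpu_logits, gpu_logits)` with the CPU logits as `x` yields a slope and intercept in the same form as the `GPU ≈ 0.96 × CPU + 0.12` row. A slope near 1 and an intercept near 0 would indicate parity.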

The 6 stages

- **Stage A**: Deterministic reproducer (`cublas_fp8_7b_reproducer.rs`)
- **Stage B**: Per-layer parity instrumentation (`APR_PER_LAYER_PARITY_DUMP=1`)
- **Stage C**: Embed lookup parity (token → embedding vector)
- **Stage D**: Pre-attention RMSNorm parity (eps-mismatch class)
- **Stage E**: Q/K/V FP8 matmul parity at (3584, batch, 3584)
- **Stage F**: Root-cause fix + contract amendment + v0.35.0 unblock

Each PR ~50-200 LOC. Each falsifier LIVE-DISCHARGED with empirical evidence on the 7B Q4K teacher.
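The idea behind the Stage B parity check is to compare CPU and GPU activations layer by layer and report the first layer whose cosine similarity drops below a threshold. The sketch below illustrates that comparison on in-memory vectors; the function names and the threshold parameter are assumptions for illustration, and the actual dump format written by `APR_PER_LAYER_PARITY_DUMP=1` is not modeled here.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns `None` for empty, mismatched, or all-zero inputs.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some((dot / (na.sqrt() * nb.sqrt())) as f32)
}

/// Index of the first layer whose CPU/GPU cosine falls below `min_cosine`.
/// A layer whose cosine cannot be computed is treated as divergent.
fn first_divergent_layer(cpu: &[Vec<f32>], gpu: &[Vec<f32>], min_cosine: f32) -> Option<usize> {
    cpu.iter().zip(gpu).position(|(c, g)| match cosine(c, g) {
        Some(cos) => cos < min_cosine,
        None => true,
    })
}
```

Treating an uncomputable cosine as divergent is a deliberate choice in this sketch: a zero or mis-sized activation vector is itself a parity failure worth surfacing.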

Test plan

- Spec parses cleanly (markdown only)
- CI: workspace-test, contracts-lib
- Kickoff: open the 6 sub-issues + assign owners

🤖 Generated with Claude Code

…rish

Authored 2026-05-22 after a 1.5-hour bisect + layer-trace investigation on noah-Lambda-Vector (RTX 4090) surfaced that the cuBLAS FP8 7B Q4K gibberish (`<|im_start|>` repeats) is NOT a recent regression but a pre-existing fragile path that's been broken across multiple releases (v0.31.2, v0.33.0, v0.34.0 all reproduce).

The initial `git bisect run` identified `8bd4ce5a` as the first bad commit, but a retest at v0.31.2 showed the same gibberish. Bisection was invalid because the test signal is non-deterministic (cuBLAS context poisoning recovery masks the symptom across some runs).

Per `feedback_release_only_after_bug_hunt`, v0.35.0 release tag holds until this epic discharges. Individual fix PRs (#1867 #1868 #1870 #1872 #1873 #1875 #1876 #1878) continue to land on main as net-positive improvements.

## 6-stage falsifier cascade (per feedback_falsifier_cascade_decomposes_magnitude)

- **Stage A**: deterministic reproducer (`cublas_fp8_7b_reproducer.rs`): make the bug visible on 5/5 consecutive runs
- **Stage B**: per-layer parity instrumentation: `APR_PER_LAYER_PARITY_DUMP=1` writes 28 layer JSONs with CPU vs GPU checksums + cosine, so the first divergent layer becomes greppable
- **Stage C**: embed lookup parity (token_id → embedding vector)
- **Stage D**: pre-attention RMSNorm parity (eps mismatch is a known class, PMAT-698n)
- **Stage E**: Q/K/V FP8 matmul parity at shape (3584, batch, 3584); most likely root cause space per the layer-0 trace
- **Stage F**: actual fix + contract amendment + v0.35.0 unblock

Each PR ships ~50-200 LOC. Each falsifier discharges with empirical evidence on the 7B Q4K teacher. Estimate 2-5 days of focused work.

## Why this is an epic, not a one-PR fix

Tried in the initial investigation (all unsuccessful):

- `git bisect run` (invalid; bug is intermittent)
- Layer-by-layer trace (showed Layer 0 already diverges, but not which op)
- Per-file reverts: cublas.rs math mode, cublas_prefill/attention.rs, weights.rs, flash_decoding_graphed.rs, rms_norm.rs (backward)
- Dependency archaeology: trueno-gpu v0.4.36 + aprender-compute re-export shim; code itself didn't change

The bug is pre-existing AND deep, likely cuBLASLt FP8 GEMM algorithm selection or FP8 scale calibration for hidden_dim=3584. Per the 5-whys for #1864, it's invisible until `apr qa` Golden Output is wired into CI (currently only fires on manual `/dogfood`).

## Out of scope

- 30B-MoE inference (#1583, separate epic)
- wgpu kernel-level fix (covered by #1864 sub-issue post-#1876)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
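Stage D names eps mismatch as a known failure class for RMSNorm parity. As an illustration of why that class is worth a dedicated falsifier, the sketch below implements a reference RMSNorm and compares its output under two different eps values. The eps values and helper names are illustrative assumptions and are not taken from the aprender kernels.

```rust
/// Reference RMSNorm: `x / sqrt(mean(x^2) + eps) * weight`, element-wise.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty(), "input must be non-empty");
    assert_eq!(x.len(), weight.len(), "input and weight must match");
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Largest absolute element-wise difference between two vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

With small activations (for example every element equal to `1e-3`), switching eps from `1e-6` to `1e-5` changes the normalized output substantially, while activations of magnitude 10 are barely affected. A parity test that only uses large-magnitude inputs could therefore miss an eps mismatch.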

noahgift · 2026-05-22T14:24:33Z

Closing as SUPERSEDED by PR #1890. The 5-line stop_tokens fix in #1890 resolves #1864 — Stages C-F of this spec are unnecessary. Stage A (PR #1884) and Stage B (PR #1887) remain useful as general-purpose cuBLAS FP8 diagnostics and merge independently. See memory/feedback_falsify_simple_before_deep.md for the methodology lesson.

noahgift enabled auto-merge (squash) May 22, 2026 13:34

noahgift mentioned this pull request May 22, 2026

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Open

Merge branch 'main' into spec/cublas-fp8-7b-fix-001

589d246

noahgift closed this May 22, 2026

auto-merge was automatically disabled May 22, 2026 14:24
Pull request was closed

noahgift wants to merge 2 commits into main from spec/cublas-fp8-7b-fix-001
