spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882

Closed
noahgift wants to merge 2 commits into main from spec/cublas-fp8-7b-fix-001

@noahgift
Contributor

Summary

After a 1.5-hour bisect + layer-trace investigation on 2026-05-22, the cuBLAS FP8 7B Q4K `<|im_start|>` gibberish is confirmed to be a pre-existing fragile path rather than a recent regression: v0.31.2, v0.33.0, and v0.34.0 all reproduce it on the same RTX 4090.

This PR adds `docs/specifications/SPEC-CUBLAS-FP8-7B-FIX-001.md`, a 6-stage falsifier cascade epic to root-cause the failure, per `memory/feedback_falsifier_cascade_decomposes_magnitude.md`.

Per-user decision (2026-05-22)

"Open a multi-day epic for the cuBLAS fix; ship v0.35.0 later"

v0.35.0 release tag + crates.io publish cascade are held until this epic discharges. The 8 individual fix PRs already in flight (#1867 #1868 #1870 #1872 #1873 #1875 #1876 #1878) continue to land on main as net-positive improvements.

Why bisection was invalid

`git bisect run` named `8bd4ce5a` (the monorepo consolidation commit) as the first bad commit. Re-running the oracle at v0.31.2, the supposed "good" baseline, showed the same `<|im_start|>` gibberish, so the bisect result cannot be trusted. The Golden Output gate signal is non-deterministic: CUDA context poisoning recovery masks the symptom on some runs. See `memory/feedback_test_methodology_can_fake_bugs.md`.
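One way to keep a non-deterministic oracle from misleading `git bisect run` is to run it several times per commit and skip any commit with mixed results. The sketch below is illustrative only: `classify`, `Verdict`, and `bisect_exit_code` are hypothetical names, not part of this repository, and the oracle is passed in as a closure rather than invoking `apr qa`. The exit codes follow the `git bisect run` convention (0 = good, 1 = bad, 125 = skip).

```rust
/// Verdict for one commit after running the oracle several times.
#[derive(Debug, PartialEq)]
enum Verdict {
    Good,  // every run passed
    Bad,   // every run failed
    Flaky, // mixed results: not safe to feed to `git bisect`
}

/// Runs `oracle` `runs` times and classifies the commit.
/// `oracle` returns true when the golden-output check passes.
fn classify(runs: usize, mut oracle: impl FnMut() -> bool) -> Verdict {
    assert!(runs > 0, "need at least one run to classify a commit");
    let passes = (0..runs).filter(|_| oracle()).count();
    if passes == runs {
        Verdict::Good
    } else if passes == 0 {
        Verdict::Bad
    } else {
        Verdict::Flaky
    }
}

/// Maps a verdict to a `git bisect run` exit code:
/// 0 marks the commit good, 1 marks it bad, 125 skips it.
fn bisect_exit_code(verdict: &Verdict) -> i32 {
    match verdict {
        Verdict::Good => 0,
        Verdict::Bad => 1,
        Verdict::Flaky => 125,
    }
}
```

With a wrapper like this, a commit that passes on some runs and fails on others is skipped instead of being recorded as good or bad, which is the failure mode that invalidated the bisect here.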

What we know

| Datum | Value |
| --- | --- |
| Reproducer | `apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf` |
| Symptom | `✗ FAIL Golden Output GPU output failed (CPU passed): gibberish` |
| Hardware | RTX 4090 (sm_89) |
| Backend | cuBLASLt FP8 (197 weight matrices, 6.7 GB cache) |
| Layer-0 trace | CPU vs cuBLAS Q/K inputs differ already at layer 0 |
| Logit correlation | 0.987 (high); outputs are approximately linearly related |
| Linear fit | GPU ≈ 0.96 × CPU + 0.12 |
| CUDA failfast | Context poisoned during executor lifetime; recovery silently produces garbage |
| trueno-gpu | v0.4.36, bit-identical between v0.31.2 and 8bd4ce5 (same crates.io checksum) |
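The logit correlation and linear fit in the table can be computed with an ordinary least-squares fit over paired CPU and GPU logits. The following is a minimal sketch of that calculation, not the code used in the investigation; `linear_fit` is a hypothetical helper and the inputs are assumed to be two equal-length logit vectors for the same token position.

```rust
/// Fits gpu ≈ slope * cpu + intercept by ordinary least squares and
/// returns (slope, intercept, pearson_r). Returns None when the inputs
/// differ in length, have fewer than two points, or have zero variance.
fn linear_fit(cpu: &[f32], gpu: &[f32]) -> Option<(f64, f64, f64)> {
    if cpu.len() != gpu.len() || cpu.len() < 2 {
        return None;
    }
    let n = cpu.len() as f64;
    let mean_x = cpu.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let mean_y = gpu.iter().map(|&v| f64::from(v)).sum::<f64>() / n;

    // Accumulate centered sums of squares and cross products in f64.
    let (mut sxx, mut syy, mut sxy) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in cpu.iter().zip(gpu) {
        let dx = f64::from(x) - mean_x;
        let dy = f64::from(y) - mean_y;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }

    let slope = sxy / sxx;
    let intercept = mean_y - slope * mean_x;
    let pearson_r = sxy / (sxx * syy).sqrt();
    Some((slope, intercept, pearson_r))
}
```

A high correlation together with a slope and intercept that differ from 1 and 0 indicates a systematic affine distortion rather than random noise, which is the pattern the table reports.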

The 6 stages

  1. Stage A: Deterministic reproducer (`cublas_fp8_7b_reproducer.rs`)
  2. Stage B: Per-layer parity instrumentation (`APR_PER_LAYER_PARITY_DUMP=1`)
  3. Stage C: Embed lookup parity (token → embedding vector)
  4. Stage D: Pre-attention RMSNorm parity (eps-mismatch class)
  5. Stage E: Q/K/V FP8 matmul parity at (3584, batch, 3584)
  6. Stage F: Root-cause fix + contract amendment + v0.35.0 unblock
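The per-layer parity check in Stage B comes down to comparing CPU and GPU activations layer by layer and reporting the first one whose similarity drops below a threshold. The sketch below assumes the activations are already loaded into memory as one `Vec<f32>` per layer; the function names and the threshold parameter are hypothetical, and the actual dump format behind `APR_PER_LAYER_PARITY_DUMP=1` is not reproduced here.

```rust
/// Cosine similarity between two activation vectors, computed in f64.
/// Two all-zero vectors count as identical; one zero vector as orthogonal.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let dot: f64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| f64::from(x) * f64::from(y))
        .sum();
    let norm_a = a.iter().map(|&x| f64::from(x).powi(2)).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|&x| f64::from(x).powi(2)).sum::<f64>().sqrt();
    if norm_a == 0.0 && norm_b == 0.0 {
        return 1.0;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Returns the index of the first layer whose CPU and GPU activations
/// differ in length or fall below `min_cosine`, or None if all agree.
fn first_divergent_layer(
    cpu: &[Vec<f32>],
    gpu: &[Vec<f32>],
    min_cosine: f64,
) -> Option<usize> {
    cpu.iter()
        .zip(gpu)
        .position(|(c, g)| c.len() != g.len() || cosine(c, g) < min_cosine)
}
```

Because errors compound through the network, only the first divergent index is informative; later layers will diverge as a consequence.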

Each PR is ~50-200 LOC. Each falsifier is LIVE-DISCHARGED with empirical evidence on the 7B Q4K teacher.

Test plan

  • Spec parses cleanly (markdown only)
  • CI: workspace-test, contracts-lib
  • Kickoff: open the 6 sub-issues + assign owners

🤖 Generated with Claude Code

…rish

Authored 2026-05-22 after a 1.5-hour bisect + layer-trace investigation
on noah-Lambda-Vector (RTX 4090) surfaced that the cuBLAS FP8 7B Q4K
gibberish (`<|im_start|>` repeats) is NOT a recent regression but a
pre-existing fragile path that's been broken across multiple releases
(v0.31.2, v0.33.0, v0.34.0 all reproduce). The initial `git bisect run`
identified `8bd4ce5a` as the first bad commit, but a retest at v0.31.2
showed the same gibberish — bisection was invalid because the test
signal is non-deterministic (cuBLAS context poisoning recovery masks
the symptom across some runs).

Per `feedback_release_only_after_bug_hunt`, v0.35.0 release tag holds
until this epic discharges. Individual fix PRs (#1867 #1868 #1870 #1872
#1873 #1875 #1876 #1878) continue to land on main as net-positive
improvements.

## 6-stage falsifier cascade (per feedback_falsifier_cascade_decomposes_magnitude)

- **Stage A**: deterministic reproducer (`cublas_fp8_7b_reproducer.rs`)
  — make the bug visible on 5/5 consecutive runs
- **Stage B**: per-layer parity instrumentation
  — `APR_PER_LAYER_PARITY_DUMP=1` writes 28 layer JSONs with CPU vs GPU
    checksums + cosine, so the first divergent layer becomes greppable
- **Stage C**: embed lookup parity (token_id → embedding vector)
- **Stage D**: pre-attention RMSNorm parity (eps mismatch is a known
  class — PMAT-698n)
- **Stage E**: Q/K/V FP8 matmul parity at shape (3584, batch, 3584)
  — most likely root cause space per the layer-0 trace
- **Stage F**: actual fix + contract amendment + v0.35.0 unblock

Each PR ships ~50-200 LOC. Each falsifier discharges with empirical
evidence on the 7B Q4K teacher. Estimate 2-5 days of focused work.
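To illustrate the eps-mismatch class that Stage D targets, the sketch below implements a plain RMSNorm and compares its output under two epsilon values. The two values used in the usage note (1e-6 and 1e-5) are hypothetical and chosen only to show the effect; they are not taken from this codebase or from the model config.

```rust
/// Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Largest element-wise absolute difference between two vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}
```

Calling `max_abs_diff(&rms_norm(x, w, 1e-6), &rms_norm(x, w, 1e-5))` shows why this class is easy to miss: the difference is negligible when `mean(x^2)` is much larger than either eps, and becomes large only for small-magnitude activations.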

## Why this is an epic, not a one-PR fix

Tried in the initial investigation (all unsuccessful):
- `git bisect run` (invalid — bug is intermittent)
- Layer-by-layer trace (showed Layer 0 already diverges, but not which op)
- Per-file reverts: cublas.rs math mode, cublas_prefill/attention.rs,
  weights.rs, flash_decoding_graphed.rs, rms_norm.rs (backward)
- Dependency archaeology: trueno-gpu v0.4.36 + aprender-compute
  re-export shim — code itself didn't change

The bug is pre-existing AND deep — likely cuBLASLt FP8 GEMM algorithm
selection or FP8 scale calibration for hidden_dim=3584. Per the
5-whys for #1864, it's invisible until `apr qa` Golden Output is wired
into CI (currently only fires on manual `/dogfood`).
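The FP8 scale calibration hypothesis can be made concrete with a small CPU-side simulation. This sketch rounds values to an approximate FP8 E4M3 grid (3 mantissa bits, saturation at ±448) and compares the round-trip error for a per-tensor scale derived from the true absolute maximum against a deliberately miscalibrated one. It is an illustration of the hypothesis class only: it does not call cuBLASLt, all names are hypothetical, and tie-breaking uses `f32::round` rather than hardware round-to-nearest-even.

```rust
/// Largest finite magnitude representable in FP8 E4M3.
const E4M3_MAX: f32 = 448.0;

/// Rounds `x` to the nearest value on a simulated E4M3 grid:
/// 3 mantissa bits, minimum normal exponent -6, saturating at ±448.
fn round_e4m3(x: f32) -> f32 {
    if x == 0.0 || x.is_nan() {
        return x;
    }
    let clamped = x.clamp(-E4M3_MAX, E4M3_MAX);
    // Exponents below -6 share the subnormal spacing of 2^-9.
    let exp = clamped.abs().log2().floor().max(-6.0);
    let step = (exp - 3.0).exp2();
    (clamped / step).round() * step
}

/// Per-tensor scale that maps the largest magnitude onto E4M3_MAX.
fn calibrate_scale(values: &[f32]) -> f32 {
    let amax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if amax == 0.0 { 1.0 } else { amax / E4M3_MAX }
}

/// Quantizes to the simulated FP8 grid and immediately dequantizes.
fn quant_dequant(values: &[f32], scale: f32) -> Vec<f32> {
    values.iter().map(|&v| round_e4m3(v / scale) * scale).collect()
}

/// Largest element-wise absolute error between two vectors.
fn max_abs_error(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}
```

With a scale from `calibrate_scale`, the round-trip error stays within the E4M3 rounding error; halving that scale makes the largest values saturate at 448 and roughly halves them on dequantization. A parity test for Stage E could compare such a simulated path against the GPU output for the same inputs.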

## Out of scope

- 30B-MoE inference (#1583, separate epic)
- wgpu kernel-level fix (covered by #1864 sub-issue post-#1876)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Contributor Author

Closing as SUPERSEDED by PR #1890. The 5-line stop_tokens fix in #1890 resolves #1864 — Stages C-F of this spec are unnecessary. Stage A (PR #1884) and Stage B (PR #1887) remain useful as general-purpose cuBLAS FP8 diagnostics and merge independently. See memory/feedback_falsify_simple_before_deep.md for the methodology lesson.

@noahgift noahgift closed this May 22, 2026
auto-merge was automatically disabled May 22, 2026 14:24

Pull request was closed
