docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn #1880

Open
noahgift wants to merge 1 commit into main from docs/spec-distill-postmortem-pmat-704

Conversation

@noahgift
Contributor

Summary

Documents the root cause of the PMAT-704 cascade fix (PR #1879). Adds §87 to `docs/specifications/aprender-train/distillation-epic-spec.md` explaining that PR #1869 (Bug B / `RealizarQ4KTeacher`) was a wrong turn — the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10.

What the §87 amendment covers

Spec versioning

Test plan

  • Markdown renders cleanly
  • Section ordering matches existing § convention
  • CI: `ci / gate` + `workspace-test` green (docs-only PR; should be fast)

🤖 Generated with Claude Code

…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of
the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a
wrong turn — the realizar `_cuda` forward path is CPU-bound and
unusable as a distillation teacher on Grace Blackwell GB10. The 7B
vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer
  SIGKILL on the explicit-managed path), with file/line citations
  pointing to the CPU-heavy ops in
  crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: two distinct failures were conflated, and the cheap
  dispatch-flip experiment that would have rejected Bug B's hypothesis
  in 5 minutes was skipped.
* Fix references: PR #1879 (PMAT-704) — cuBLAS default,
  RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k
  opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`,
  `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877,
  #1879.
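
The backend selection described in the fix references (cuBLAS by default, `realizar-q4k` only when `APR_DISTILL_TEACHER_BACKEND` asks for it) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from PR #1879: the `TeacherBackend` enum and both function names are hypothetical, and rejecting unrecognized values is an assumed policy rather than something the spec confirms.

```rust
/// Hypothetical enum for illustration; the real type names in PR #1879 may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeacherBackend {
    /// Default teacher path after PMAT-704.
    Cublas,
    /// Opt-in fallback (RealizarQ4KTeacher).
    RealizarQ4k,
}

/// Maps the raw value of APR_DISTILL_TEACHER_BACKEND to a backend.
/// An unset or empty value selects the cuBLAS default.
/// Assumption: any unrecognized value is an error rather than a silent fallback.
pub fn select_teacher_backend(raw: Option<&str>) -> Result<TeacherBackend, String> {
    match raw.map(str::trim) {
        None | Some("") => Ok(TeacherBackend::Cublas),
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4k),
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other:?}"
        )),
    }
}

/// Reads the environment variable and delegates to the pure function above,
/// which keeps the selection logic testable without touching process state.
pub fn teacher_backend_from_env() -> Result<TeacherBackend, String> {
    let raw = std::env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    select_teacher_backend(raw.as_deref())
}
```

Keeping the parsing separate from the environment read means the default can be checked in a unit test without setting or clearing process-wide variables.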

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86
(via PR #1871, also pending merge) and §87 (this PR). The amendment
notes the §86 cross-reference and explains the order-of-operations
in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/spec-distill-postmortem-pmat-704 branch from e26ac1e to 4560521 Compare May 22, 2026 14:12