Skip to content

docs(blackwell): cascade post-mortem — 8 PRs / 7 defects / 1 root cause #1824

Merged
noahgift merged 3 commits into main from docs/blackwell-cascade-postmortem on May 19, 2026

@noahgift
Contributor

Summary

Institutional-knowledge capture from the PMAT-698e..m + PMAT-700-B cascade that unblocked Phase 3 distillation training on Blackwell GB10 (sm_121). One session, 9 PRs, ~7h debugging.

Highlights

5 lessons:

  1. Symptom-similarity is a SIGNAL — when iteration 2-3-4 of "fix one missing kernel" surface the same downstream error, stop adding kernels; instrument the cache.
  2. Macros with cache-key arguments deserve audit — $key must be substituted into the call, not just passed in.
  3. Pre-warm contracts have two halves — keys + bodies — both must stay synchronized with runtime.
  4. Blackwell sm_121 surfaces OLD bugs as NEW failures — JIT-on-demand "worked" on sm_89, fails on sm_121.
  5. Smoke contracts test the smoke, not just the pipeline — degenerate batch can invalidate F-DISTILL-SMOKE-001 even when machinery is correct.
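
Lesson 2 can be illustrated with a minimal sketch. Everything here is hypothetical (`KernelCache`, `warm_buggy!`, `warm_fixed!`, the key strings); it is not the project's actual `warm!` macro, only one plausible shape of a macro that accepts a key argument but never substitutes it, so every pre-warm lands under the same literal key.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a kernel cache: key -> compiled artifact.
#[derive(Default)]
pub struct KernelCache {
    entries: HashMap<String, String>,
}

impl KernelCache {
    pub fn insert(&mut self, key: &str, body: &str) {
        self.entries.insert(key.to_string(), body.to_string());
    }

    pub fn contains(&self, key: &str) -> bool {
        self.entries.contains_key(key)
    }
}

// Buggy shape (illustrative): `$key` is accepted but never used, so the
// literal string "key" becomes the cache key for every call.
macro_rules! warm_buggy {
    ($cache:expr, $key:expr, $body:expr) => {
        $cache.insert("key", $body)
    };
}

// Fixed shape: the caller's key is substituted into the call.
macro_rules! warm_fixed {
    ($cache:expr, $key:expr, $body:expr) => {
        $cache.insert($key, $body)
    };
}

/// Returns (lookup hit after buggy warm, lookup hit after fixed warm).
pub fn demo() -> (bool, bool) {
    let mut buggy = KernelCache::default();
    warm_buggy!(buggy, "rmsnorm_fwd", "artifact-a");
    warm_buggy!(buggy, "rope_fwd", "artifact-b");

    let mut fixed = KernelCache::default();
    warm_fixed!(fixed, "rmsnorm_fwd", "artifact-a");
    warm_fixed!(fixed, "rope_fwd", "artifact-b");

    (buggy.contains("rmsnorm_fwd"), fixed.contains("rmsnorm_fwd"))
}
```

Note that `macro_rules!` does not warn about an unused metavariable, which is why this class of bug compiles cleanly and only shows up as a runtime cache miss.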

3 interventions that would have caught the root cause in 1 PR instead of 7:

  • Diagnostic logging in get_or_compile from day one (one eprintln)
  • Property test: pre_warm_keys(config) ⊇ runtime_keys(config) for any forward pass
  • Differential architecture testing (sm_89 + sm_121) in CI nightly
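
The first two interventions can be sketched together. This is an assumption-laden illustration, not the project's code: the `Config` field, the key format, the `[FWD-CACHE]` log line, and the bodies of `pre_warm_keys`, `runtime_keys`, and `get_or_compile` are all hypothetical. The idea is that a cache miss is logged loudly, and that a test can assert the pre-warmed key set is a superset of the keys a forward pass requests.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical model configuration; the field is illustrative only.
pub struct Config {
    pub rmsnorm_eps: &'static str,
}

/// Keys the pre-warm step compiles ahead of time (hypothetical key format).
pub fn pre_warm_keys(_cfg: &Config) -> BTreeSet<String> {
    let mut keys = BTreeSet::new();
    keys.insert("rope_fwd".to_string());
    // Warm every eps variant we expect, not only a single default.
    for eps in ["1e-6", "1e-5"] {
        keys.insert(format!("rmsnorm_fwd_eps{}", eps));
    }
    keys
}

/// Keys a forward pass requests at runtime under this config.
pub fn runtime_keys(cfg: &Config) -> BTreeSet<String> {
    let mut keys = BTreeSet::new();
    keys.insert("rope_fwd".to_string());
    keys.insert(format!("rmsnorm_fwd_eps{}", cfg.rmsnorm_eps));
    keys
}

/// Runtime keys that pre-warm does not cover; empty means the superset holds.
pub fn uncovered_keys(cfg: &Config) -> Vec<String> {
    let runtime = runtime_keys(cfg);
    let warmed = pre_warm_keys(cfg);
    runtime.difference(&warmed).cloned().collect()
}

#[derive(Default)]
pub struct KernelCache {
    compiled: HashMap<String, String>,
    /// Keys that had to be compiled on demand, recorded for diagnostics.
    pub misses: Vec<String>,
}

impl KernelCache {
    pub fn pre_warm(&mut self, keys: &BTreeSet<String>) {
        for key in keys {
            self.compiled.insert(key.clone(), format!("compiled:{}", key));
        }
    }

    /// Returns the cached artifact, logging and recording any miss.
    pub fn get_or_compile(&mut self, key: &str) -> &str {
        if !self.compiled.contains_key(key) {
            eprintln!("[FWD-CACHE] MISS key={} (compiling on demand)", key);
            self.misses.push(key.to_string());
            self.compiled.insert(key.to_string(), format!("compiled:{}", key));
        }
        &self.compiled[key]
    }
}
```

In a real suite the loop over configurations would likely be driven by a property-testing crate; a hand-written loop over a few representative configs checks the same invariant on a smaller sample.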

Effort: ~50 lines net production code; ~250 lines comments + spec; 8 PRs.

Cascade table

| # | PR | Class | Effort |
|---|----|-------|--------|
| 1 | #1804 PMAT-700-B | Independent | small |
| 2 | #1808 PMAT-698e | Independent | small |
| 3 | #1809 PMAT-698f | Independent | small |
| 4 | #1810 PMAT-698g | Defense-in-depth | medium |
| 5 | #1813 PMAT-698h | Defense-in-depth | small |
| 6 | #1815 PMAT-698i | Diagnostic infrastructure | small |
| 7 | #1817 PMAT-698j | Root cause | 1 char |
| 8 | #1820 PMAT-698k | Hygiene | small |
| 9 | #1823 PMAT-698m | Contract semantics | small |

Recommended distribution

This post-mortem is upstream-shareable — share with the trueno team to inform trueno#200 (the official Blackwell JIT fix) plus any future cross-architecture validation work.

Test plan

Docs-only PR; no behavioral changes.

🤖 Generated with Claude Code

Capture the institutional knowledge from the PMAT-698e..m + PMAT-700-B
cascade that unblocked Phase 3 distillation training on Blackwell GB10.

Highlights:
- 5 lessons (symptom-similarity signal, macro audit, pre-warm contract
  halves, Blackwell exposes latent fragility, smoke-contract validity)
- 3 interventions that would have caught the root cause in 1 PR:
  diagnostic logging from start; property test for pre-warm coverage;
  cross-architecture (sm_89 + sm_121) CI gate
- Effort accounting: ~50 lines of net production code, ~250 lines of
  comments + spec across 8 PRs

This post-mortem is upstream-shareable — recommend forwarding to the
trueno team to inform trueno#200 (the official Blackwell JIT fix) plus
any future cross-architecture work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 19, 2026 20:49
@noahgift noahgift merged commit 510aa7b into main May 19, 2026
10 checks passed
@noahgift noahgift deleted the docs/blackwell-cascade-postmortem branch May 19, 2026 22:37
noahgift added a commit that referenced this pull request May 20, 2026
2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

  initial_loss = 7.6746
  final_loss   = 7.2036   ← LESS THAN initial
  62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.
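
A minimal sketch of how such a contract check might guard against a degenerate setup, assuming "degenerate" means a batch with fewer than two distinct target tokens. That definition, along with `SmokeVerdict`, `check_smoke`, and the `distinct_targets` parameter, is hypothetical and not taken from the contract itself:

```rust
#[derive(Debug, PartialEq)]
pub enum SmokeVerdict {
    /// final_loss < initial_loss on a valid setup.
    Discharged,
    /// Valid setup, but the loss did not decrease.
    Failed,
    /// The smoke setup itself cannot support a verdict.
    Invalid(&'static str),
}

pub fn check_smoke(initial_loss: f64, final_loss: f64, distinct_targets: usize) -> SmokeVerdict {
    // Assumed degeneracy rule: a batch with a single distinct target
    // gives the loss nothing meaningful to improve on.
    if distinct_targets < 2 {
        return SmokeVerdict::Invalid("degenerate batch");
    }
    if !initial_loss.is_finite() || !final_loss.is_finite() {
        return SmokeVerdict::Invalid("non-finite loss");
    }
    if final_loss < initial_loss {
        SmokeVerdict::Discharged
    } else {
        SmokeVerdict::Failed
    }
}
```

With the losses reported above, `check_smoke(7.6746, 7.2036, n)` yields `Discharged` for any `n >= 2`, while the same losses on a single-target batch are reported as `Invalid` instead of being counted as a pass or a failure.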

Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B  (cuBLAS prewarm skip)
- #1808 PMAT-698e   (workspace cap)
- #1809 PMAT-698f   (APR magic in weights loader)
- #1810 PMAT-698g   (non-LoRA backward pre-warm)
- #1813 PMAT-698h   (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i   (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j   (THE root cause — warm! macro key)
- #1820 PMAT-698k   (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m   (smoke setup: non-degenerate batch)
- #1824             (post-mortem doc)
- #1827 PMAT-698n   (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>