Skip to content

fix(distill): pre-warm rmsnorm at both 1e-6 (Qwen2) and 1e-5 (Llama) eps (PMAT-698n)#1827

Merged
noahgift merged 1 commit into
mainfrom
fix/distill-rmsnorm-eps-1e6-pmat-698n
May 20, 2026
Merged

fix(distill): pre-warm rmsnorm at both 1e-6 (Qwen2) and 1e-5 (Llama) eps (PMAT-698n)#1827
noahgift merged 1 commit into
mainfrom
fix/distill-rmsnorm-eps-1e6-pmat-698n

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

PMAT-698k added the _eps{:08x} suffix to the rmsnorm pre-warm key but used the wrong default value:

  • Pre-warm: 1.0e-5_f32.to_bits() = 0x3727c5ac (Llama/Mistral default)
  • Runtime (Qwen2.5-Coder-0.5B): rms_norm_eps = 1e-6 = 0x358637bd

Different bits → different cache key → still cache-misses on Qwen2 family. Live Phase 3 dispatch v11 confirmed batched_rmsnorm_fwd_896_eps358637bd still JIT'd.

Fix

Pre-warm BOTH eps values. Cost: ~30 KB extra cache. Benefit: zero rmsnorm cache misses for either Qwen2 family OR Llama family without per-model dispatch logic.

Test plan

  • cargo check --features cuda clean
  • Live gx10 dispatch: zero [FWD-CACHE] Compiling events for batched_rmsnorm_fwd_*

Cascade stage 10

Final forward-cache JIT hygiene fix. Phase 3 pipeline already works end-to-end (#1817 PMAT-698j unblocked it); this just eliminates the last cache miss.

🤖 Generated with Claude Code

…eps (PMAT-698n)

PMAT-698k added the eps-suffix to the rmsnorm pre-warm key but used the
wrong default value: 1.0e-5_f32.to_bits() = 0x3727c5ac, whereas the
Qwen2 / Qwen2.5 family uses rms_norm_eps = 1e-6 = 0x358637bd.

Live Phase 3 dispatch v11 confirmed the runtime key was
batched_rmsnorm_fwd_896_eps358637bd — still cache-missing on the
PMAT-698k pre-warm entry.

Fix: pre-warm BOTH eps values (1e-6 for Qwen2/Qwen2.5, 1e-5 for
Llama/Mistral). Cost: ~30 KB extra cache headroom. Benefit: zero
cache misses for either model family without per-model dispatch
logic.

Test plan:
- [x] cargo check --features cuda — clean build
- [ ] Live gx10 dispatch: no [FWD-CACHE] Compiling for batched_rmsnorm_fwd_*

Stage 10 of cascade hygiene — last forward-cache JIT event eliminated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 20, 2026 05:03
@noahgift noahgift merged commit c3c68b5 into main May 20, 2026
11 checks passed
@noahgift noahgift deleted the fix/distill-rmsnorm-eps-1e6-pmat-698n branch May 20, 2026 05:27
noahgift added a commit that referenced this pull request May 20, 2026
2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

  initial_loss = 7.6746
  final_loss   = 7.2036   ← LESS THAN initial
  62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.

Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B  (cuBLAS prewarm skip)
- #1808 PMAT-698e   (workspace cap)
- #1809 PMAT-698f   (APR magic in weights loader)
- #1810 PMAT-698g   (non-LoRA backward pre-warm)
- #1813 PMAT-698h   (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i   (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j   (THE root cause — warm! macro key)
- #1820 PMAT-698k   (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m   (smoke setup: non-degenerate batch)
- #1824             (post-mortem doc)
- #1827 PMAT-698n   (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant