
diag(distill): per-kernel JIT logging in forward cache (PMAT-698i) #1815

Merged
noahgift merged 1 commit into main from diag/fwd-cache-jit-logging-pmat-698i on May 19, 2026


@noahgift
Contributor

Summary

The Phase 3 cuda dispatch defect cascade (PMAT-698e..h) has been chasing missing pre-warm entries ONE KERNEL AT A TIME by inferring kernel names from CUDA error messages. After 5 iterations:

| Iter | PR | Kernel discovered | Result |
|---|---|---|---|
| 1 | #1808 PMAT-698e | (workspace cap, not a kernel) | unblocks load |
| 2 | #1809 PMAT-698f | (format compat, not a kernel) | unblocks pipeline |
| 3 | #1810 PMAT-698g | silu_backward, softmax_backward, rms_norm_backward | unblocks 3 backwards |
| 4 | #1813 PMAT-698h | rms_norm_gamma_reduce | unblocks 1 more |
| 5 | | Still 4 GH-480 patches post-pre-warm with NO kernel name in log | unknown |

This diagnostic PR adds eprintln! calls to ForwardKernelCache::get_or_compile so that every forward JIT event prints its kernel key. A single dispatch then yields the full list of kernels still missing from pre-warm, in one pass.

Mirrors the existing [BWD-CACHE] log on the backward path.
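
To make the shape of the change concrete, here is a minimal sketch of a cache whose get_or_compile logs on every miss. It is not the repository's code: the real ForwardKernelCache signature is not shown in this PR, so the struct fields, the CompiledKernel type, the compile_ptx placeholder, and the jit_events list are all hypothetical stand-ins. Only the log format is taken from the sample output in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a loaded GPU module.
#[derive(Debug, Clone, PartialEq)]
struct CompiledKernel {
    key: String,
    ptx_len: usize,
}

/// Sketch of a forward-kernel cache (assumed layout, not the real type).
struct ForwardKernelCache {
    kernels: HashMap<String, CompiledKernel>,
    /// Keys that triggered a JIT compile, in order (illustration only).
    jit_events: Vec<String>,
}

/// Placeholder for the real PTX JIT step; rejects empty PTX.
fn compile_ptx(key: &str, ptx: &str) -> Result<CompiledKernel, String> {
    if ptx.is_empty() {
        return Err(format!("empty PTX for '{key}'"));
    }
    Ok(CompiledKernel { key: key.to_string(), ptx_len: ptx.len() })
}

impl ForwardKernelCache {
    fn new() -> Self {
        Self { kernels: HashMap::new(), jit_events: Vec::new() }
    }

    /// Return the cached kernel, compiling and logging on a cache miss.
    fn get_or_compile(&mut self, key: &str, ptx: &str) -> Result<&CompiledKernel, String> {
        if !self.kernels.contains_key(key) {
            // Log before compiling so the key is visible even if the JIT fails.
            eprintln!("[FWD-CACHE] Compiling '{}' (ptx_len={})", key, ptx.len());
            let compiled = compile_ptx(key, ptx)?;
            eprintln!("[FWD-CACHE] OK '{}'", key);
            self.jit_events.push(key.to_string());
            self.kernels.insert(key.to_string(), compiled);
        }
        Ok(&self.kernels[key])
    }
}
```

In this sketch a cache hit is silent, so any [FWD-CACHE] line that appears after pre-warm names a kernel that pre-warm did not cover.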

Sample expected output (post-merge dispatch)

[CUDA] Pre-warmed 23 forward kernels (JIT compiled before block upload)
...
✓ 24 transformer blocks uploaded to GPU
[FWD-CACHE] Compiling 'batched_rope_neox_forward_896' (ptx_len=4231)
[FWD-CACHE] OK 'batched_rope_neox_forward_896'
[FWD-CACHE] Compiling 'bf16_to_f32_cast' (ptx_len=1108)
...

Then PMAT-698j adds those kernel names to pre_warm_for_model.
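
One way to turn a captured dispatch log into that list is sketched below. It assumes the log lines match the sample output above exactly ("[FWD-CACHE] Compiling '<key>' (ptx_len=N)"); the function names kernel_from_line and missing_prewarm_kernels are hypothetical helpers, not part of the repository.

```rust
use std::collections::BTreeSet;

/// Extract the kernel key from a line such as
/// "[FWD-CACHE] Compiling 'bf16_to_f32_cast' (ptx_len=1108)".
/// Returns None for any other line, including "[FWD-CACHE] OK ..." lines.
fn kernel_from_line(line: &str) -> Option<&str> {
    let rest = line.trim().strip_prefix("[FWD-CACHE] Compiling '")?;
    let end = rest.find('\'')?;
    Some(&rest[..end])
}

/// Collect the unique kernel keys that were JIT-compiled, sorted by name.
fn missing_prewarm_kernels(log: &str) -> Vec<String> {
    let unique: BTreeSet<&str> = log.lines().filter_map(kernel_from_line).collect();
    unique.into_iter().map(str::to_owned).collect()
}
```

Each returned key would then correspond to one candidate entry for pre_warm_for_model.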

Test plan

  • cargo check --features cuda: clean build
  • Live gx10 dispatch logs every forward JIT event (post-merge)
  • Each kernel name becomes a one-line addition to pre_warm_for_model

Defect cascade — stage 6 (DIAGNOSTIC)

| # | PR | What |
|---|---|---|
| 1 | #1804 PMAT-700-B | Skip PTX GEMM pre-warm (JIT cache pressure) |
| 2 | #1808 PMAT-698e | Cap max_seq_len at 2048 |
| 3 | #1809 PMAT-698f | Accept APR magic in weights loader |
| 4 | #1810 PMAT-698g | Pre-warm non-LoRA backward kernels |
| 5 | #1813 PMAT-698h | Pre-warm rms_norm_gamma_reduce |
| 6 | THIS PMAT-698i | Diagnostic: log every forward JIT |
| 7 | (next) PMAT-698j | Consolidated pre-warm sweep from diagnostic output |

🤖 Generated with Claude Code

The Phase 3 cuda dispatch defect cascade (PMAT-698e..h) has been
chasing missing pre-warm entries ONE KERNEL AT A TIME by inferring kernel
names from CUDA error messages. After 5 iterations the dispatch still
emits 4 GH-480 patches (= JIT events) post-pre-warm but the kernel
names are not logged.

This diagnostic adds eprintln to ForwardKernelCache::get_or_compile so
every JIT event prints its kernel key. After one dispatch we get the
full list of missing pre-warms in one pass — pre-warm sweep becomes O(1)
instead of O(N) iterations.

Mirrors the existing [BWD-CACHE] logging on the cuda_backward path.

Test plan:
- [x] cargo check --features cuda — clean build
- [ ] Live gx10 dispatch logs every forward-cache JIT event with kernel name
  → identifies all kernels still missing from pre_warm_for_model in one pass

Stage 6 of the Phase 3 cuda dispatch defect cascade — diagnostic, not fix:
  PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g → PMAT-698h → PMAT-698i (DIAG)
  Next: PMAT-698j (consolidated pre-warm sweep based on diagnostic output)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 19, 2026 16:06
@noahgift noahgift merged commit 8614047 into main May 19, 2026
11 checks passed
@noahgift noahgift deleted the diag/fwd-cache-jit-logging-pmat-698i branch May 19, 2026 16:31
noahgift added a commit that referenced this pull request May 20, 2026
2026-05-20 — real distillation 1.5B teacher → 0.5B student on
Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

  initial_loss = 7.6746
  final_loss   = 7.2036   ← LESS THAN initial
  62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged.

Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json — dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B  (cuBLAS prewarm skip)
- #1808 PMAT-698e   (workspace cap)
- #1809 PMAT-698f   (APR magic in weights loader)
- #1810 PMAT-698g   (non-LoRA backward pre-warm)
- #1813 PMAT-698h   (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i   (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j   (THE root cause — warm! macro key)
- #1820 PMAT-698k   (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m   (smoke setup: non-degenerate batch)
- #1824             (post-mortem doc)
- #1827 PMAT-698n   (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>