executor: extend registry shim + parity diagnostic (Phase 4 step 2)#40

Merged: kekzl merged 1 commit into main from refactor/weight-storage-phase4-step2-parity-diag on Apr 23, 2026

@kekzl commented Apr 23, 2026

Summary

  • Extends the Phase-2 WeightRegistry shim in pre_dequant_weights to register the shared-expert FFN weights (w_{gate,up,down}_shared) and the token embedding, matching the StoragePlanner enumeration that landed in #38 (storage-planner: enumerate shared-expert FFN + top-level embeddings/LM head).
  • Adds TensorID fields to TransformerLayer (w_{gate,up,down}_shared_id) and Model (tok_emb_id). They default to kInvalidTensorID; no consumer reads them yet.
  • Adds a Phase-4 parity diagnostic that compares registry_.size() against the count of plan entries at wcache-capable tiers. FP32-only kinds (norms, rope_freqs) are filtered out so the comparison is apples-to-apples.
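
The parity count described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Tier`, `PlanEntry`, `Parity`, `is_wcache_tier`, `is_fp32_only_kind`, `count_wcache_entries`, `check_parity`, and `log_parity` are hypothetical names, and the `TensorKind` enumerators `NORM` and `ROPE_FREQS` are assumed spellings. Only the tier list, the filtered kinds, and the log format are taken from this PR.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for the planner's types (not shown in this PR).
enum class Tier { FP32, FP16, FP8, NVFP4, CUTLASS_NVFP4, MXFP4 };
enum class TensorKind { NORM, ROPE_FREQS, W_GATE, W_UP, W_DOWN, TOK_EMBED };

struct PlanEntry {
    TensorKind kind;
    Tier tier;
};

// True for the wcache-capable tiers listed in the commit message.
inline bool is_wcache_tier(Tier t) {
    switch (t) {
        case Tier::FP16:
        case Tier::FP8:
        case Tier::NVFP4:
        case Tier::CUTLASS_NVFP4:
        case Tier::MXFP4:
            return true;
        case Tier::FP32:
            return false;
    }
    return false;
}

// Kinds that stay FP32 and are therefore never registered as handles.
inline bool is_fp32_only_kind(TensorKind k) {
    return k == TensorKind::NORM || k == TensorKind::ROPE_FREQS;
}

// Count the plan entries the registry is expected to mirror.
inline std::size_t count_wcache_entries(const std::vector<PlanEntry>& plan) {
    std::size_t n = 0;
    for (const PlanEntry& e : plan) {
        if (is_wcache_tier(e.tier) && !is_fp32_only_kind(e.kind)) ++n;
    }
    return n;
}

struct Parity {
    std::size_t registry;  // handles currently in the registry
    std::size_t plan;      // filtered plan entries
    long long diff;        // plan - registry; 0 means parity
};

inline Parity check_parity(std::size_t registry_size,
                           const std::vector<PlanEntry>& plan) {
    const std::size_t p = count_wcache_entries(plan);
    return {registry_size, p,
            static_cast<long long>(p) - static_cast<long long>(registry_size)};
}

// Emit the diagnostic in the same shape as the log line quoted in this PR.
inline void log_parity(const Parity& p) {
    std::fprintf(stderr,
                 "Phase-4 parity: registry=%zu handles, "
                 "plan=%zu wcache-tier entries (diff=%lld)\n",
                 p.registry, p.plan, p.diff);
}
```

In this sketch a caller would run `log_parity(check_parity(registry_size, plan))` once at the end of weight loading. A nonzero `diff` is only reported, never acted on, which matches the "no behavior change" intent of the diagnostic.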

Observed signal

On Qwen3-4B Q4_K_M today the diagnostic prints:

StoragePlanner (diagnostic): 254 entries, projected VRAM 3432.81 MiB
WeightRegistry populated with 38 handles (phase-2 shim)
Phase-4 parity: registry=38 handles, plan=254 wcache-tier entries (diff=216)

This is a real, actionable gap: the plan lists 216 more wcache-tier entries than the registry has handles. Subsequent steps extend the registry until the diff converges to 0, which is the exit condition for the Phase-4 storage flip.

Test plan

  • imp-tests full unit filter: 160 / 160 pass (2 model-dependent skipped)
  • Real-model smoke on Qwen3-4B-Instruct Q4_K_M: coherent output, ~128 tok/s decode (no regression)
  • Parity diagnostic prints WARN at load with diff count; no crash, no behavior change

…ity log

Phase 4 incremental step 2 — align the Phase-2 WeightRegistry shim with
the StoragePlanner enumeration landed in #38.

New registry handles:
- Per-layer w_{gate,up,down}_shared (TensorKind W_GATE/W_UP/W_DOWN) —
  Nemotron / DeepSeek / Qwen3.5-MoE shared-expert FFN
- Top-level tok_emb_ (TOK_EMBED)

New TensorID fields: TransformerLayer::{w_gate,w_up,w_down}_shared_id,
Model::tok_emb_id. Default kInvalidTensorID, no consumer uses them yet.

Phase-4 parity diagnostic: at end of pre_dequant_weights, compare the
registry handle count against the number of plan entries at wcache-
capable tiers (FP16/FP8/NVFP4/CUTLASS_NVFP4/MXFP4). FP32-only kinds
(norms, rope_freqs) are excluded to make the comparison apples-to-apples.

Observed on Qwen3-4B Q4_K_M: registry=38 handles vs plan=254 wcache-tier
entries (diff=216). This is expected — the shim still only registers the
short canonical field list. The diagnostic is what drives the next step:
extending registration until the diff converges to 0.

No behavioral change. 160/160 unit tests pass. Real-model smoke test on
Qwen3-4B: coherent output, ~128 tok/s decode (no regression).