executor: extend registry shim + parity diagnostic (Phase 4 step 2) by kekzl · Pull Request #40 · kekzl/imp

kekzl · 2026-04-23T23:24:47Z

Summary

Extends the Phase-2 WeightRegistry shim in pre_dequant_weights to register the shared-expert FFN weights (w_{gate,up,down}_shared) and the token embedding, matching the StoragePlanner enumeration that landed in #38 (storage-planner: enumerate shared-expert FFN + top-level embeddings/LM head).
Adds TensorID fields to TransformerLayer (w_{gate,up,down}_shared_id) and Model (tok_emb_id). Default kInvalidTensorID; no consumer reads them yet.
Adds a Phase-4 parity diagnostic that compares registry_.size() against the count of plan entries at wcache-capable tiers. FP32-only kinds (norms, rope_freqs) are filtered out so the comparison is apples-to-apples.
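The parity check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PlanEntry`, `StorageTier`, `is_wcache_tier`, `is_fp32_only_kind`, `count_wcache_entries`, and `parity_message` are hypothetical names, the `NORM` / `ROPE_FREQS` enumerators are assumed spellings for the FP32-only kinds, and the diff is assumed to be plan count minus registry count. The wcache-capable tier list is the one named in the commit message.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the project's enums. Only the tier names and
// the W_GATE/W_UP/W_DOWN/TOK_EMBED kinds appear in the PR text.
enum class StorageTier { FP32, FP16, FP8, NVFP4, CUTLASS_NVFP4, MXFP4 };
enum class TensorKind { NORM, ROPE_FREQS, W_GATE, W_UP, W_DOWN, TOK_EMBED };

// Hypothetical shape of one StoragePlanner entry.
struct PlanEntry {
    TensorKind kind;
    StorageTier tier;
};

// Tiers the commit message lists as wcache-capable.
bool is_wcache_tier(StorageTier t) {
    switch (t) {
        case StorageTier::FP16:
        case StorageTier::FP8:
        case StorageTier::NVFP4:
        case StorageTier::CUTLASS_NVFP4:
        case StorageTier::MXFP4:
            return true;
        case StorageTier::FP32:
            return false;
    }
    return false;
}

// Kinds that stay FP32 and therefore never get a registry handle.
bool is_fp32_only_kind(TensorKind k) {
    return k == TensorKind::NORM || k == TensorKind::ROPE_FREQS;
}

// Count plan entries that the registry is expected to mirror.
std::size_t count_wcache_entries(const std::vector<PlanEntry>& plan) {
    std::size_t n = 0;
    for (const PlanEntry& e : plan) {
        if (is_wcache_tier(e.tier) && !is_fp32_only_kind(e.kind)) ++n;
    }
    return n;
}

// Build the diagnostic line; diff is assumed to be plan minus registry.
std::string parity_message(std::size_t registry_handles,
                           const std::vector<PlanEntry>& plan) {
    const std::size_t plan_entries = count_wcache_entries(plan);
    const long long diff = static_cast<long long>(plan_entries) -
                           static_cast<long long>(registry_handles);
    return "Phase-4 parity: registry=" + std::to_string(registry_handles) +
           " handles, plan=" + std::to_string(plan_entries) +
           " wcache-tier entries (diff=" + std::to_string(diff) + ")";
}
```

Filtering on both tier and kind keeps the two counts comparable: an entry only counts against the registry if the registry could plausibly hold a handle for it.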

Observed signal

On Qwen3-4B Q4_K_M today the diagnostic prints:

StoragePlanner (diagnostic): 254 entries, projected VRAM 3432.81 MiB
WeightRegistry populated with 38 handles (phase-2 shim)
Phase-4 parity: registry=38 handles, plan=254 wcache-tier entries (diff=216)

This is a real, actionable gap: the registry covers only 38 of the 254 plan entries. Subsequent steps extend the registry until the diff converges to 0, which is the exit condition for the Phase-4 storage flip.

Test plan

imp-tests full unit filter: 160 / 160 pass (2 model-dependent skipped)
Real-model smoke on Qwen3-4B-Instruct Q4_K_M: coherent output, ~128 tok/s decode (no regression)
Parity diagnostic prints WARN at load with diff count; no crash, no behavior change

…ity log

Phase 4 incremental step 2: align the Phase-2 WeightRegistry shim with the StoragePlanner enumeration landed in #38.

New registry handles:
- Per-layer w_{gate,up,down}_shared (TensorKind W_GATE/W_UP/W_DOWN): Nemotron / DeepSeek / Qwen3.5-MoE shared-expert FFN
- Top-level tok_emb_ (TOK_EMBED)

New TensorID fields: TransformerLayer::{w_gate,w_up,w_down}_shared_id, Model::tok_emb_id. Default kInvalidTensorID, no consumer uses them yet.

Phase-4 parity diagnostic: at end of pre_dequant_weights, compare the registry handle count against the number of plan entries at wcache-capable tiers (FP16/FP8/NVFP4/CUTLASS_NVFP4/MXFP4). FP32-only kinds (norms, rope_freqs) are excluded to make the comparison apples-to-apples.

Observed on Qwen3-4B Q4_K_M: registry=38 handles vs plan=254 wcache-tier entries (diff=216). This is expected: the shim still only registers the short canonical field list. The diagnostic is what drives the next step: extending registration until the diff converges to 0.

No behavioral change. 160/160 unit tests pass. Real-model smoke test on Qwen3-4B: coherent output, ~128 tok/s decode (no regression).
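To make the new handles and ID fields concrete, here is a small sketch of how such registration could look. It is an illustration under stated assumptions, not the project's implementation: the toy `WeightRegistry::add` method, the `const void*` weight pointers, the `tok_emb` member name, the sentinel value chosen for `kInvalidTensorID`, and the `register_shared_and_embedding` helper are all hypothetical. Only the ID field names, the `TensorKind` names, and the "default kInvalidTensorID" behavior come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using TensorID = std::uint32_t;
// Assumed sentinel value; the real constant's value is not given in the PR.
constexpr TensorID kInvalidTensorID = 0xFFFFFFFFu;

enum class TensorKind { W_GATE, W_UP, W_DOWN, TOK_EMBED };

// Toy registry: hands out sequential IDs and skips absent tensors.
class WeightRegistry {
public:
    TensorID add(TensorKind kind, const void* data) {
        if (data == nullptr) return kInvalidTensorID;  // nothing to register
        handles_.push_back(Handle{kind, data});
        return static_cast<TensorID>(handles_.size() - 1);
    }
    std::size_t size() const { return handles_.size(); }

private:
    struct Handle {
        TensorKind kind;
        const void* data;
    };
    std::vector<Handle> handles_;
};

struct TransformerLayer {
    // Placeholder weight pointers; null when the model has no shared expert.
    const void* w_gate_shared = nullptr;
    const void* w_up_shared = nullptr;
    const void* w_down_shared = nullptr;
    // New ID fields default to invalid, so untouched layers stay inert.
    TensorID w_gate_shared_id = kInvalidTensorID;
    TensorID w_up_shared_id = kInvalidTensorID;
    TensorID w_down_shared_id = kInvalidTensorID;
};

struct Model {
    std::vector<TransformerLayer> layers;
    const void* tok_emb = nullptr;  // hypothetical member name
    TensorID tok_emb_id = kInvalidTensorID;
};

// Register shared-expert FFN weights per layer, then the token embedding.
void register_shared_and_embedding(Model& m, WeightRegistry& reg) {
    for (TransformerLayer& layer : m.layers) {
        layer.w_gate_shared_id = reg.add(TensorKind::W_GATE, layer.w_gate_shared);
        layer.w_up_shared_id = reg.add(TensorKind::W_UP, layer.w_up_shared);
        layer.w_down_shared_id = reg.add(TensorKind::W_DOWN, layer.w_down_shared);
    }
    m.tok_emb_id = reg.add(TensorKind::TOK_EMBED, m.tok_emb);
}
```

Because every ID starts as kInvalidTensorID and absent weights are skipped, dense models without shared experts would see no new handles from the per-layer loop in this sketch.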


kekzl enabled auto-merge (squash) April 23, 2026 23:28

kekzl merged commit d29c9ea into main Apr 23, 2026
2 checks passed

kekzl added a commit that referenced this pull request Apr 30, 2026

executor: extend registry shim with shared-expert FFN + tok_emb + par…

3c25d33

…ity log (#40)

kekzl merged 1 commit into main from refactor/weight-storage-phase4-step2-parity-diag
