refactor(nvfp4): collapse load-time scratch into a single Model map #73
Merged
Conversation
The 13 NvFP4PreQuantWeight slots on TransformerLayer (nvfp4_q/k/v/o,
nvfp4_gate/up/down, nvfp4_w_*_shared), the 3
std::vector<NvFP4PreQuantWeight> per-expert slots, and
Model::nvfp4_out_proj_ existed only to stage scale tensors between the
SafeTensors loader and Phase 0 of executor_pre_dequant.cu. The runtime
hot path stopped reading them in Stage E (the promote() step copies
their device pointers onto Tensor.scales / Tensor.tensor_scale, and
dispatch reads from there).
This commit removes the scaffolding:
* NvFP4PreQuantWeight is now a free struct in model_config.h. The
nested-in-TransformerLayer location was a side-effect of the old
per-slot layout.
* Single load-time Model::nvfp4_scratch_ —
std::unordered_map<std::string, NvFP4PreQuantWeight> — keyed by
canonical slot name:
"L{idx}.{slot}" per-layer dense / shared
(e.g. "L5.wq", "L5.w_gate_shared")
"L{idx}.expert_w_{kind}.{e}" per-expert
(e.g. "L5.expert_w_gate.7")
"out_proj" LM head
* weight_map.cpp routes every NVFP4 scale tensor (weight_scale /
weight_scale_2 / input_scale) into the map via a new
`route_nvfp4_scale()` helper that takes (key, kind, t).
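A minimal sketch of what such a routing helper might look like. Tensor, ScaleKind, and the struct fields are stand-ins for the real types, and the map parameter stands in for Model::nvfp4_scratch_:

```cpp
#include <string>
#include <unordered_map>

struct Tensor { const void* data = nullptr; };  // stand-in for the real Tensor
enum class ScaleKind { WeightScale, WeightScale2, InputScale };

struct NvFP4PreQuantWeight {
    Tensor weight_scale, weight_scale_2, input_scale;
    // Per the commit text: validity is keyed off weight_scale.data alone.
    bool valid() const { return weight_scale.data != nullptr; }
};

using Scratch = std::unordered_map<std::string, NvFP4PreQuantWeight>;

// Assumed shape of the helper: operator[] default-constructs the slot
// entry on first touch, so the three scale tensors can arrive in any
// order from the SafeTensors stream.
void route_nvfp4_scale(Scratch& scratch, const std::string& key,
                       ScaleKind kind, const Tensor& t) {
    NvFP4PreQuantWeight& w = scratch[key];
    switch (kind) {
        case ScaleKind::WeightScale:  w.weight_scale   = t; break;
        case ScaleKind::WeightScale2: w.weight_scale_2 = t; break;
        case ScaleKind::InputScale:   w.input_scale    = t; break;
    }
}
```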
* weight_upload.cu iterates the map for the host→device upload step
instead of walking 16 per-layer slots × 3 fields.
* executor_pre_dequant.cu Phase 0 walks the map, resolves each key
back to the corresponding main weight tensor (`resolve` lambda
handles dense / shared / per-expert / LM-head paths), promotes
sidecars, and clears the scratch.
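The key-to-tensor resolution can be illustrated by a small classifier. This is not the actual resolve lambda (which returns the main weight tensor); it only shows how the canonical key grammar parses back apart. Shared slots take the dense branch here for brevity:

```cpp
#include <string>

enum class SlotPath { Dense, Expert, OutProj };

// Illustrative parser for the canonical keys:
//   "out_proj"                     -> OutProj
//   "L{idx}.{slot}"                -> Dense (covers shared slots too)
//   "L{idx}.expert_w_{kind}.{e}"   -> Expert
SlotPath classify(const std::string& key,
                  int* layer = nullptr, int* expert = nullptr) {
    if (key == "out_proj") return SlotPath::OutProj;
    const size_t dot = key.find('.');
    if (layer) *layer = std::stoi(key.substr(1, dot - 1));  // skip 'L'
    const std::string rest = key.substr(dot + 1);
    if (rest.rfind("expert_w_", 0) == 0) {
        const size_t last = rest.rfind('.');
        if (expert) *expert = std::stoi(rest.substr(last + 1));
        return SlotPath::Expert;
    }
    return SlotPath::Dense;
}
```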
* safetensors_loader.cpp drops the old "link" step entirely. It
served only to set NvFP4PreQuantWeight.weight (so .valid() could
return true) — that field is gone now; .valid() just checks
weight_scale.data instead.
The TransformerLayer struct loses ~13 fields and becomes a plain
description of the runtime weights with no load-time scaffolding
mixed in. Net diff: 6 files, +202/-213 (-11 LoC, and much less
surface area for the next loader change to misroute).
Verification
------------
- test-core 109, test-text 140, test-compute 115, test-attention 67,
test-kv 31, test-moe-gdn 35: all PASSED.
- test-quant 70 PASSED, 1 pre-existing FAILED
(WeightDispatchTest.NVFP4_GemvMatchesDirect, also fails on main).
- LlmCompressorE2E.MistralSmall_LoadsAndGeneratesCoherent: PASSED
(llm-compressor reciprocal scaling path).
- LlmCompressorE2E.Modelopt_QwenCoder30B_StillWorks: PASSED
(Modelopt multiplicative scaling path).
- imp-cli smoke against Qwen3-4B-Q8_0 GGUF: 76 tok/s coherent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request · Apr 30, 2026
Summary
Final cleanup of the NVFP4-prequant load path. Stage E (PR #72)
moved the runtime hot path off the per-layer NvFP4PreQuantWeight
slots — dispatch reads from Tensor.scales / Tensor.tensor_scale
directly. The slots themselves remained as load-time scratch only
and were the last residual scaffolding in the type system.
This PR removes them.
Changes
* NvFP4PreQuantWeight slots on TransformerLayer deleted
  (nvfp4_q/k/v/o, nvfp4_gate/up/down, nvfp4_w_*_shared)
* std::vector<NvFP4PreQuantWeight> per-expert slots deleted
  (expert_nvfp4_gate/up/down)
* Model::nvfp4_out_proj_ deleted
* NvFP4PreQuantWeight moved out of TransformerLayer to a free struct
  in model_config.h (loses the now-unused weight field; valid()
  checks weight_scale.data only)
* Model::nvfp4_scratch_ — std::unordered_map<std::string,
  NvFP4PreQuantWeight> keyed by canonical slot name:
  * "L{idx}.{slot}" for per-layer ("L5.wq", "L5.w_gate_shared")
  * "L{idx}.expert_w_{kind}.{e}" for per-expert
  * "out_proj" for LM head
* weight_map.cpp routes every NVFP4 scale into the map via a new
  route_nvfp4_scale(model, key, kind, t) helper. Replaces 4 assign
  lambdas that wrote into the per-slot structs.
* weight_upload.cu iterates the map (single loop, ~30 LoC) instead
  of walking 16 per-layer slots × 3 fields each.
* executor_pre_dequant.cu Phase 0 walks the map, resolve() parses
  the key back to the corresponding main weight tensor, promote()
  copies the device pointers + the FP32 tensor scalar onto its
  sidecars. Scratch is clear()ed at the end.
* safetensors_loader.cpp drops the old "link" step entirely — it
  existed only to set NvFP4PreQuantWeight.weight so .valid() could
  return true; that field is gone.
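The promote step can be sketched as follows. Tensor, NvFP4PreQuantWeight, and the field names here are stand-ins modeled on the sidecars named above (Tensor.scales, Tensor.tensor_scale), not the actual definitions:

```cpp
struct Tensor {
    const void* data = nullptr;
    const void* scales = nullptr;  // sidecar: device pointer to scale tensor
    float tensor_scale = 1.0f;     // sidecar: FP32 per-tensor scalar
};

struct NvFP4PreQuantWeight {
    Tensor weight_scale;               // staged (already-uploaded) scales
    float tensor_scale_value = 1.0f;   // hypothetical staged FP32 scalar
};

// promote(): copy the staged device pointer and scalar onto the main
// weight's sidecars. After this the scratch entry is dead, which is why
// Phase 0 can clear() the whole map when the walk finishes.
static void promote(Tensor& main_weight, const NvFP4PreQuantWeight& w) {
    main_weight.scales = w.weight_scale.data;
    main_weight.tensor_scale = w.tensor_scale_value;
}
```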
Architecture impact
TransformerLayer is now a plain description of the runtime weights
with no load-time scaffolding mixed in. Adding a new NVFP4 quant
variant (W4A16, future llm-compressor revisions, etc.) only touches
weight_map.cpp (routing) and executor_pre_dequant.cu (resolve
key → tensor). No more per-projection per-slot field churn.
Net diff: 6 files, +202 / -213 (~11 LoC reduction, -16 fields,
-3 vectors, -1 nested struct).
Test plan
* make test-gpu — 575 PASSED across 7 suites (1 pre-existing fail
  unchanged).
* LlmCompressorE2E.MistralSmall_LoadsAndGeneratesCoherent
  (llm-compressor reciprocal scaling path) — PASSED
* LlmCompressorE2E.Modelopt_QwenCoder30B_StillWorks
  (Modelopt multiplicative scaling path) — PASSED
* imp-cli smoke against Qwen3-4B-Q8_0 GGUF: 76 tok/s coherent (no
  regression on the non-NVFP4 path).
🤖 Generated with Claude Code