refactor(nvfp4): collapse load-time scratch into a single Model map #73
Merged
Conversation
The 13 NvFP4PreQuantWeight slots on TransformerLayer (nvfp4_q/k/v/o,
nvfp4_gate/up/down, nvfp4_w_*_shared), the 3
std::vector<NvFP4PreQuantWeight> per-expert slots, and
Model::nvfp4_out_proj_ existed only to stage scale tensors between the
SafeTensors loader and Phase 0 of executor_pre_dequant.cu. The runtime
hot path stopped reading them in Stage E (the promote() step copies
their device pointers onto Tensor.scales / Tensor.tensor_scale, and
dispatch reads from there).
This commit removes the scaffolding:
* NvFP4PreQuantWeight is now a free struct in model_config.h. The
nested-in-TransformerLayer location was a side-effect of the old
per-slot layout.
* Single load-time Model::nvfp4_scratch_ —
std::unordered_map<std::string, NvFP4PreQuantWeight> — keyed by
canonical slot name:
"L{idx}.{slot}" per-layer dense / shared
(e.g. "L5.wq", "L5.w_gate_shared")
"L{idx}.expert_w_{kind}.{e}" per-expert
(e.g. "L5.expert_w_gate.7")
"out_proj" LM head
* weight_map.cpp routes every NVFP4 scale tensor (weight_scale /
weight_scale_2 / input_scale) into the map via a new
`route_nvfp4_scale()` helper that takes (key, kind, t).
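A minimal sketch of what such a routing helper might look like. Tensor, ScaleKind, and the struct fields are stand-ins for the real types, and the map parameter stands in for Model::nvfp4_scratch_:

```cpp
#include <string>
#include <unordered_map>

struct Tensor { const void* data = nullptr; };  // stand-in for the real Tensor
enum class ScaleKind { WeightScale, WeightScale2, InputScale };

struct NvFP4PreQuantWeight {
    Tensor weight_scale, weight_scale_2, input_scale;
    // Per the commit text: validity is keyed off weight_scale.data alone.
    bool valid() const { return weight_scale.data != nullptr; }
};

using Scratch = std::unordered_map<std::string, NvFP4PreQuantWeight>;

// Assumed shape of the helper: operator[] default-constructs the slot
// entry on first touch, so the three scale tensors can arrive in any
// order from the SafeTensors stream.
void route_nvfp4_scale(Scratch& scratch, const std::string& key,
                       ScaleKind kind, const Tensor& t) {
    NvFP4PreQuantWeight& w = scratch[key];
    switch (kind) {
        case ScaleKind::WeightScale:  w.weight_scale   = t; break;
        case ScaleKind::WeightScale2: w.weight_scale_2 = t; break;
        case ScaleKind::InputScale:   w.input_scale    = t; break;
    }
}
```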
* weight_upload.cu iterates the map for the host→device upload step
instead of walking 16 per-layer slots × 3 fields.
* executor_pre_dequant.cu Phase 0 walks the map, resolves each key
back to the corresponding main weight tensor (`resolve` lambda
handles dense / shared / per-expert / LM-head paths), promotes
sidecars, and clears the scratch.
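The key-to-tensor resolution can be illustrated by a small classifier. This is not the actual resolve lambda (which returns the main weight tensor); it only shows how the canonical key grammar parses back apart. Shared slots take the dense branch here for brevity:

```cpp
#include <string>

enum class SlotPath { Dense, Expert, OutProj };

// Illustrative parser for the canonical keys:
//   "out_proj"                     -> OutProj
//   "L{idx}.{slot}"                -> Dense (covers shared slots too)
//   "L{idx}.expert_w_{kind}.{e}"   -> Expert
SlotPath classify(const std::string& key,
                  int* layer = nullptr, int* expert = nullptr) {
    if (key == "out_proj") return SlotPath::OutProj;
    const size_t dot = key.find('.');
    if (layer) *layer = std::stoi(key.substr(1, dot - 1));  // skip 'L'
    const std::string rest = key.substr(dot + 1);
    if (rest.rfind("expert_w_", 0) == 0) {
        const size_t last = rest.rfind('.');
        if (expert) *expert = std::stoi(rest.substr(last + 1));
        return SlotPath::Expert;
    }
    return SlotPath::Dense;
}
```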
* safetensors_loader.cpp drops the old "link" step entirely. It
served only to set NvFP4PreQuantWeight.weight (so .valid() could
return true) — that field is gone now; .valid() just checks
weight_scale.data instead.
The TransformerLayer struct loses ~13 fields and becomes a plain
description of the runtime weights with no load-time scaffolding
mixed in. Net diff: 6 files, +202/-213 (-11 LoC, and much less
surface area for the next loader change to misroute).
Verification
------------
- test-core 109, test-text 140, test-compute 115, test-attention 67,
test-kv 31, test-moe-gdn 35: all PASSED.
- test-quant 70 PASSED, 1 pre-existing FAILED
(WeightDispatchTest.NVFP4_GemvMatchesDirect, also fails on main).
- LlmCompressorE2E.MistralSmall_LoadsAndGeneratesCoherent: PASSED
(llm-compressor reciprocal scaling path).
- LlmCompressorE2E.Modelopt_QwenCoder30B_StillWorks: PASSED
(Modelopt multiplicative scaling path).
- imp-cli smoke against Qwen3-4B-Q8_0 GGUF: 76 tok/s coherent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request · Apr 30, 2026
Summary
Final cleanup of the NVFP4-prequant load path. Stage E (PR #72)
moved the runtime hot path off the per-layer NvFP4PreQuantWeight
slots — dispatch reads from Tensor.scales / Tensor.tensor_scale
directly. The slots themselves remained as load-time scratch only
and were the last residual scaffolding in the type system.
This PR removes them.
Changes
* NvFP4PreQuantWeight slots on TransformerLayer deleted
  (nvfp4_q/k/v/o, nvfp4_gate/up/down, nvfp4_w_*_shared)
* std::vector<NvFP4PreQuantWeight> per-expert slots deleted
  (expert_nvfp4_gate/up/down)
* Model::nvfp4_out_proj_ deleted
* NvFP4PreQuantWeight moved out of TransformerLayer to a free struct
  in model_config.h (loses the now-unused weight field; valid()
  checks weight_scale.data only)
* Model::nvfp4_scratch_ — std::unordered_map<std::string,
  NvFP4PreQuantWeight> keyed by canonical slot name:
  * "L{idx}.{slot}" for per-layer ("L5.wq", "L5.w_gate_shared")
  * "L{idx}.expert_w_{kind}.{e}" for per-expert
  * "out_proj" for LM head
* weight_map.cpp routes every NVFP4 scale into the map via a new
  route_nvfp4_scale(model, key, kind, t) helper. Replaces 4 assign
  lambdas that wrote into the per-slot structs.
* weight_upload.cu iterates the map (single loop, ~30 LoC) instead
  of walking 16 per-layer slots × 3 fields each.
* executor_pre_dequant.cu Phase 0 walks the map, resolve() parses
  the key back to the corresponding main weight tensor, promote()
  copies the device pointers + the FP32 tensor scalar onto its
  sidecars. Scratch is clear()ed at the end.
* safetensors_loader.cpp drops the old "link" step entirely — it
  existed only to set NvFP4PreQuantWeight.weight so .valid() could
  return true; that field is gone.
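The promote step can be sketched as follows. Tensor, NvFP4PreQuantWeight, and the field names here are stand-ins modeled on the sidecars named above (Tensor.scales, Tensor.tensor_scale), not the actual definitions:

```cpp
struct Tensor {
    const void* data = nullptr;
    const void* scales = nullptr;  // sidecar: device pointer to scale tensor
    float tensor_scale = 1.0f;     // sidecar: FP32 per-tensor scalar
};

struct NvFP4PreQuantWeight {
    Tensor weight_scale;               // staged (already-uploaded) scales
    float tensor_scale_value = 1.0f;   // hypothetical staged FP32 scalar
};

// promote(): copy the staged device pointer and scalar onto the main
// weight's sidecars. After this the scratch entry is dead, which is why
// Phase 0 can clear() the whole map when the walk finishes.
static void promote(Tensor& main_weight, const NvFP4PreQuantWeight& w) {
    main_weight.scales = w.weight_scale.data;
    main_weight.tensor_scale = w.tensor_scale_value;
}
```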
Architecture impact
TransformerLayer is now a plain description of the runtime weights
with no load-time scaffolding mixed in. Adding a new NVFP4 quant
variant (W4A16, future llm-compressor revisions, etc.) only touches
weight_map.cpp (routing) and executor_pre_dequant.cu (resolve
key → tensor). No more per-projection per-slot field churn.
Net diff: 6 files, +202 / -213 (~11 LoC reduction, -16 fields,
-3 vectors, -1 nested struct).
Test plan
* make test-gpu — 575 PASSED across 7 suites (1 pre-existing fail
  unchanged).
* LlmCompressorE2E.MistralSmall_LoadsAndGeneratesCoherent
  (llm-compressor reciprocal scaling path) — PASSED
* LlmCompressorE2E.Modelopt_QwenCoder30B_StillWorks
  (Modelopt multiplicative scaling path) — PASSED
* imp-cli smoke against Qwen3-4B-Q8_0 GGUF: 76 tok/s coherent (no
  regression on the non-NVFP4 path).
🤖 Generated with Claude Code