
refactor(nvfp4): collapse load-time scratch into a single Model map#73

Merged
kekzl merged 1 commit into main from refactor/nvfp4-slot-cleanup
Apr 28, 2026

Conversation

@kekzl (Owner) commented Apr 28, 2026

Summary

Final cleanup of the NVFP4-prequant load path. Stage E (PR #72)
moved the runtime hot-path off the per-layer NvFP4PreQuantWeight
slots — dispatch reads from Tensor.scales / Tensor.tensor_scale
directly. The slots themselves remained as load-time scratch only
and were the last residual scaffolding in the type system.

This PR removes them.

Changes

  • 13 NvFP4PreQuantWeight slots on TransformerLayer deleted
    (nvfp4_q/k/v/o, nvfp4_gate/up/down, nvfp4_w_*_shared)
  • 3 std::vector<NvFP4PreQuantWeight> per-expert slots deleted
    (expert_nvfp4_gate/up/down)
  • Model::nvfp4_out_proj_ deleted
  • NvFP4PreQuantWeight moved out of TransformerLayer to a free
    struct in model_config.h (loses now-unused weight field;
    valid() checks weight_scale.data only)
  • New single Model::nvfp4_scratch_
    std::unordered_map<std::string, NvFP4PreQuantWeight> keyed by
    canonical slot name:
    • "L{idx}.{slot}" for per-layer ("L5.wq", "L5.w_gate_shared")
    • "L{idx}.expert_w_{kind}.{e}" for per-expert
    • "out_proj" for LM head
  • weight_map.cpp routes every NVFP4 scale into the map via a new
    route_nvfp4_scale(model, key, kind, t) helper. Replaces 4
    assign lambdas that wrote into the per-slot structs.
  • weight_upload.cu iterates the map (single loop, ~30 LoC) instead
    of walking 16 per-layer slots × 3 fields each
  • executor_pre_dequant.cu Phase 0 walks the map, resolve() parses
    the key back to the corresponding main weight tensor, promote()
    copies the device pointers + the FP32 tensor scalar onto its
    sidecars. Scratch is clear()ed at end.
  • safetensors_loader.cpp drops the old "link" step entirely — it
    existed only to set NvFP4PreQuantWeight.weight so .valid()
    could return true; that field is gone

Architecture impact

TransformerLayer is now a plain description of the runtime weights
with no load-time scaffolding mixed in. Adding a new NVFP4 quant
variant (W4A16, future llm-compressor revisions, etc.) only touches
weight_map.cpp (routing) and executor_pre_dequant.cu (resolve
key → tensor). No more per-projection per-slot field churn.

Net diff: 6 files, +202 / -213 (11 LoC reduction, -16 fields,
-3 vectors, -1 nested struct).

Test plan

  • make test-gpu — 575 PASSED across 7 suites (1 pre-existing
    fail unchanged).
  • LlmCompressorE2E.MistralSmall_LoadsAndGeneratesCoherent
    (llm-compressor reciprocal scaling path) — PASSED
  • LlmCompressorE2E.Modelopt_QwenCoder30B_StillWorks
    (Modelopt multiplicative scaling path) — PASSED
  • Qwen3-4B-Q8_0 GGUF smoke — 76 tok/s coherent (proves no
    regression on the non-NVFP4 path)

🤖 Generated with Claude Code

The 13 NvFP4PreQuantWeight slots on TransformerLayer (nvfp4_q/k/v/o,
nvfp4_gate/up/down, nvfp4_w_*_shared), the 3
std::vector<NvFP4PreQuantWeight> per-expert slots, and
Model::nvfp4_out_proj_ existed only to stage scale tensors between the
SafeTensors loader and Phase 0 of executor_pre_dequant.cu. The runtime
hot path stopped reading them in Stage E (the promote() step copies
their device pointers onto Tensor.scales / Tensor.tensor_scale and
dispatch reads from there).

This commit removes the scaffolding:

  * NvFP4PreQuantWeight is now a free struct in model_config.h. The
    nested-in-TransformerLayer location was a side-effect of the old
    per-slot layout.
  * Single load-time Model::nvfp4_scratch_ —
    std::unordered_map<std::string, NvFP4PreQuantWeight> — keyed by
    canonical slot name:
      "L{idx}.{slot}"             per-layer dense / shared
                                  (e.g. "L5.wq", "L5.w_gate_shared")
      "L{idx}.expert_w_{kind}.{e}"  per-expert
                                  (e.g. "L5.expert_w_gate.7")
      "out_proj"                   LM head
  * weight_map.cpp routes every NVFP4 scale tensor (weight_scale /
    weight_scale_2 / input_scale) into the map via a new
    `route_nvfp4_scale()` helper that takes (key, kind, t).
  * weight_upload.cu iterates the map for the host→device upload step
    instead of walking 16 per-layer slots × 3 fields.
  * executor_pre_dequant.cu Phase 0 walks the map, resolves each key
    back to the corresponding main weight tensor (`resolve` lambda
    handles dense / shared / per-expert / LM-head paths), promotes
    sidecars, and clears the scratch.
  * safetensors_loader.cpp drops the old "link" step entirely. It
    served only to set NvFP4PreQuantWeight.weight (so .valid() could
    return true) — that field is gone now; .valid() just checks
    weight_scale.data instead.

The TransformerLayer struct loses ~13 fields and becomes a plain
description of the runtime weights with no load-time scaffolding
mixed in. Net diff: 6 files, +202/-213 (-11 LoC, and much less
surface area to misroute in the next loader change).

Verification
------------
- test-core 109, test-text 140, test-compute 115, test-attention 67,
  test-kv 31, test-moe-gdn 35: all PASSED.
- test-quant 70 PASSED, 1 pre-existing FAILED
  (WeightDispatchTest.NVFP4_GemvMatchesDirect, also fails on main).
- LlmCompressorE2E.MistralSmall_LoadsAndGeneratesCoherent: PASSED
  (llm-compressor reciprocal scaling path).
- LlmCompressorE2E.Modelopt_QwenCoder30B_StillWorks: PASSED
  (Modelopt multiplicative scaling path).
- imp-cli smoke against Qwen3-4B-Q8_0 GGUF: 76 tok/s coherent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 6267fce into main Apr 28, 2026
2 checks passed
@kekzl kekzl deleted the refactor/nvfp4-slot-cleanup branch April 28, 2026 08:19