feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial) #65
Merged
Conversation
kekzl added a commit that referenced this pull request on Apr 26, 2026
Two zero-overhead diagnostic flags added during the Phase-2-Item-2 audit
of Gemma-4 NVFP4 non-determinism. Off by default; no behavior change
unless explicitly enabled.
- IMP_NO_PDL=1 — pdl::is_enabled() returns false for every kernel, so
pdl::launch falls back to standard <<<>>> launches. Used to confirm
whether a non-deterministic regression is caused by Programmatic
Dependent Launch tail-overlap races.
- IMP_MOE_ZERO_WORKSPACE=1 — cudaMemsetAsync the MoE workspace buffers
(expert_{gate,up,swiglu,down}, gathered) at the start of each
run_moe_ffn(). Used to confirm whether non-determinism is caused by
uninitialized reads from MoE staging slots that the legacy serial
fallback didn't write (e.g. inactive expert positions).
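As a rough sketch of how such zero-overhead flags are typically wired (the helper functions and the MoeWorkspace struct below are hypothetical; only the env var names and the cudaMemsetAsync behavior come from this commit):

```cpp
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Sketch only: env-var-gated diagnostics as described above.
static bool flag_enabled(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Cached once, so the check costs nothing on the hot path.
static bool pdl_disabled_by_env() {
    static const bool disabled = flag_enabled("IMP_NO_PDL");
    return disabled;  // when true, pdl::is_enabled() would report false everywhere
}

// Hypothetical workspace description; zeroed only when the flag is set.
struct MoeWorkspace {
    void*  expert_gate; size_t expert_gate_bytes;
    void*  gathered;    size_t gathered_bytes;
    // expert_up / expert_swiglu / expert_down elided for brevity
};

static void maybe_zero_moe_workspace(const MoeWorkspace& ws, cudaStream_t stream) {
    static const bool zero = flag_enabled("IMP_MOE_ZERO_WORKSPACE");
    if (!zero) return;  // default: no behavior change
    cudaMemsetAsync(ws.expert_gate, 0, ws.expert_gate_bytes, stream);
    cudaMemsetAsync(ws.gathered,    0, ws.gathered_bytes,    stream);
}
```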
Both are useful as next-session investigation tools for the Gemma-4 NVFP4
incoherence (PR #65 still open). Audit findings:
- LAUNCH_BLOCKING=1 → output deterministic but still incoherent
("{\"get_2_plus_2"); confirms the underlying bug is computational,
non-determinism is sampling-noise in the wrong-logits regime.
- IMP_NO_PDL=1 → still non-deterministic; rules out PDL races.
- IMP_MOE_ZERO_WORKSPACE=1 → still non-deterministic; rules out
uninit reads from MoE staging.
- Mistral-Small NVFP4 (dense) is deterministic + coherent on the
same setup; isolates the bug to the MoE-specific code paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on Apr 26, 2026
fix(hf_loader): set scalar head_dim/n_kv_heads to max(per-layer) for Gemma-4

Gemma-4 has dual head geometry: SWA layers use head_dim=256, full-attention layers use head_dim=512 (and 2 KV heads vs 8). The HF config loader was building the per-layer arrays correctly but leaving cfg.head_dim at the SWA-only value (256). The KV cache + attention workspace are sized from the SCALAR cfg.head_dim — which means with head_dim=256 the buffers are too small for the full-attention layers' 8192-wide Q output → writes past the allocated stride into adjacent layer slots → corrupted state.

The GGUF loader already does the same `max_hd = max(per-layer)` fixup (gguf_loader.cpp:940-946); this commit mirrors it on the HF side. Without it, Gemma-4 NVFP4 SafeTensors models had silent buffer-overflow risk on the full-attention layers (every 6th layer in the 5+1 SWA/global pattern). Discovered while bisecting Gemma-4 NVFP4 incoherence (PR #65). The KV-cache log line `head_dim=256` in NVFP4 vs `head_dim=512` in GGUF was the tell.

Note: this fix removes the buffer-overflow risk but does NOT fully restore Gemma-4 NVFP4 coherence — there's a separate architectural issue where attention_cublas_prefill computes ld_q/ld_k from per-layer values while the buffer stride is sized from cfg.head_dim (now correctly max). For mixed-head_dim layers the leading dimensions don't match the buffer stride. That's a follow-up fix (probably: pass the actual stride from Tensor.stride[0] instead of recomputing).

No regression on Mistral-Small NVFP4 (still coherent: "Paris. It is the capital of France...").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
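A minimal sketch of the scalar fixup described in the commit above, with illustrative config field names (the real code mirrors the gguf_loader.cpp fixup it cites):

```cpp
#include <algorithm>
#include <vector>

// Sketch only: derive the scalar head_dim / n_kv_heads used for KV-cache and
// workspace sizing from the per-layer arrays, so mixed-geometry models
// (256 SWA / 512 global) are sized for the widest layer.
struct ConfigSketch {
    std::vector<int> head_dim_per_layer;
    std::vector<int> n_kv_heads_per_layer;
    int head_dim   = 0;  // scalar the KV cache / attention workspace are sized from
    int n_kv_heads = 0;
};

void fixup_scalar_geometry(ConfigSketch& cfg) {
    if (!cfg.head_dim_per_layer.empty())
        cfg.head_dim = *std::max_element(cfg.head_dim_per_layer.begin(),
                                         cfg.head_dim_per_layer.end());
    if (!cfg.n_kv_heads_per_layer.empty())
        cfg.n_kv_heads = *std::max_element(cfg.n_kv_heads_per_layer.begin(),
                                           cfg.n_kv_heads_per_layer.end());
}
```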
feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial)
Phase 1 skipped the 90 extra scaling tensors that ship with llm-compressor
NVFP4 Gemma-4 (`.layer_scalar`, `router.per_expert_scale`, `router.scale`)
because the runtime application was untested. Validation showed Gemma-4 NVFP4
emits "Pac<unused5>" without them — they are load-bearing.
This change flips the SKIP rules to pass-through emission. The names already
match `weight_map.cpp`'s Gemma-4 routing (router.scale → ffn_gate_inp_scale,
router.per_expert_scale → expert_down_scale, layer_scalar → layer_out_scale),
which the Gemma-4 GGUF path uses correctly today. weight_upload.cu already
includes these fields in the unquantized-norm upload list (BF16→FP16
conversion handled). executor_forward.cu / executor_forward_moe.cu already
apply layer_out_scale, ffn_gate_inp_scale (as RMSNorm weight) and
expert_down_scale (as routing weight multiplier) on the GEMMA4 dispatch.
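For orientation, the routing that the pass-through relies on amounts to a suffix-to-field mapping along these lines (the three tensor names and their targets are from this commit; the enum and function are illustrative stand-ins, the real logic lives in weight_map.cpp):

```cpp
#include <optional>
#include <string>

// Sketch only: the suffix -> field routing the pass-through emission depends on.
enum class Gemma4Extra { FfnGateInpScale, ExpertDownScale, LayerOutScale };

std::optional<Gemma4Extra> route_gemma4_extra(const std::string& name) {
    auto ends_with = [&](const std::string& suf) {
        return name.size() >= suf.size() &&
               name.compare(name.size() - suf.size(), suf.size(), suf) == 0;
    };
    if (ends_with("router.scale"))            return Gemma4Extra::FfnGateInpScale;
    if (ends_with("router.per_expert_scale")) return Gemma4Extra::ExpertDownScale;
    if (ends_with(".layer_scalar"))           return Gemma4Extra::LayerOutScale;
    return std::nullopt;  // not one of the Phase-2 extras
}
```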
What works:
- 26 unit tests pass (3 new EmitsLayerScalar / EmitsPerExpertScale /
EmitsRouterScale, replacing the SKIP assertions; ProjScaleIsNotGemma4Extra
retained as a defensive guard).
- All four LlmCompressorE2E tests still pass: Gemma4_LoadsWithoutIMA,
Gemma4_GeneratesNonEmptyOutput, MistralSmall_LoadsAndGeneratesCoherent,
Modelopt_QwenCoder30B_StillWorks.
- Gemma-4 NVFP4 output is no longer the immediate degenerate stop
("Pac<unused5>"). The scales now reach the forward pass.
What's left (deferred to a follow-up PR):
- Gemma-4 NVFP4 output is still incoherent — varying tokens (e.g. "What is
the capital of France?" → " way world ات set" with chat-template gemma)
but not factually correct. The GGUF Gemma-4 Q8_0 path with the SAME
executor scaling code is coherent ("Paris is the capital of France"), so
the application math is right — but llm-compressor likely stores one or
more of these scales with different semantics (e.g. divisor vs multiplier,
similar to the Phase-1 weight_global_scale flip), or there's an extra
reconstruction step in the vLLM reference (vllm-project/vllm#39045) that
imp doesn't replicate. Needs a side-by-side reference trace; estimated
another ~1-2 days.
Mistral-Small remains the actual coherence gate for the dense loader; this
PR doesn't regress it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(loader): route Gemma-4 base FFN norms (was silently dropped)

Discovery from comparing the GGUF Q8_0 baseline (which works coherently with the same executor) against the SafeTensors NVFP4 path: the per-layer scale *values* are bit-identical between the two formats (verified via raw byte diff on layer_scalar / router.scale / router.per_expert_scale across layers 0/5/15). So the executor math is correct. What's *not* identical is the layer-norm tensor inventory the loader sees.

Each Gemma-4 layer has six FFN-related layernorms:
- pre_feedforward_layernorm (base)
- pre_feedforward_layernorm_2 (parallel branch)
- post_feedforward_layernorm (base)
- post_feedforward_layernorm_1 (parallel branch 1)
- post_feedforward_layernorm_2 (parallel branch 2)
- post_attention_layernorm (sandwich)

The Phase-1 weight_map only routed the _1 / _2 parallel-branch variants and post_attention. The two *base* variants (pre_feedforward_layernorm, post_feedforward_layernorm) hit no branch and were silently counted as "matched" via the catch-all but never assigned to a layer field. Result: layer.ffn_norm and layer.post_ffn_norm stayed null for every Gemma-4 layer and the FFN computation ran without input/output normalization.

GGUF stores these as `ffn_norm` and `post_ffw_norm` and the GGUF loader already routes them. This commit mirrors that routing on the SafeTensors side: pre_feedforward_layernorm → layer.ffn_norm, post_feedforward_layernorm → layer.post_ffn_norm.

Status:
- 26 LlmCompressor* unit tests still pass.
- Gemma-4 NVFP4 generation: output is still degenerate (further work needed — likely an additional missing piece in the per-expert NVFP4 path, since the prefill takes the legacy serial fallback rather than the NVFP4 grouped path; the nvfp4_moe wcache count stays 0 because expert_gate_packed is null for the per-expert llm-compressor layout). But the norm fix is a strict improvement that the loader should ship regardless.
- Mistral-Small dense path unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(nvfp4): two latent bugs in weight_dispatch.cu NVFP4 path + diag

The Phase-1 NVFP4 audit (commit f5f4a1a, "K must be packed*2") fixed the WeightHandle-based dispatch in five executor files but missed weight_dispatch.cu's TWO `case StorageTier::NVFP4:` arms (line 82 prefill, line 298 decode). Both had identical bugs:
1. tmp.K = w.shape[1] — uses the PACKED column count instead of logical K. NVFP4 stores K/2 nibbles per byte, so the kernel consumed only half the input → wrong dot product.
2. cudaMemcpyAsync(&tmp.tensor_scale, w.payload.nvfp4.tensor_scale, sizeof(float), cudaMemcpyDeviceToHost, stream) — but `tensor_scale` is a borrowed HOST float pointer (per the Phase-1 wcache_.nvfp4 binding), not a device pointer. The Memcpy silently corrupts the scale.

Mistral-Small NVFP4 worked anyway because the dense q/k/v/o paths take the specialized fused QKV / gate-up GEMVs (executor_attention.cu line 345 + executor_ffn.cu line 173/289/359), which already had the correct `* 2` + direct host-pointer read. The buggy weight_dispatch.cu path is only reached by callers that use gemm_dispatch(WeightHandle&, …) directly — rare today but used by some Gemma-4 paths.

Fix both arms to mirror the working callers: read tensor_scale directly from the host pointer, and multiply shape[1] by 2 for the logical K.

Side effect verified on Mistral-Small-3.2-24B-NVFP4: greedy "The capital of France is" output improved from "Paris. It is the largest city in France and one of" → "Paris. It is the capital of France". The earlier output had been mis-scaled too, just not visibly enough to flip argmax.

Also adds the IMP_NO_SHARED_MLP=1 diagnostic env var (skips the Gemma-4 shared-MLP branch in run_moe_ffn — used to bisect shared-MLP vs MoE-expert contributions to the still-broken Gemma-4 NVFP4 output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
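A sketch of the corrected NVFP4 arm from the commit above, using stand-in types (only the two changes themselves, `shape[1] * 2` and the direct host read of tensor_scale, are from the commit text):

```cpp
// Sketch only: the corrected StorageTier::NVFP4 arm.
//   (1) logical K = packed columns * 2 (two FP4 nibbles per byte),
//   (2) tensor_scale is a borrowed HOST float pointer, so read it directly
//       instead of cudaMemcpyAsync'ing it device-to-host.
struct WeightHandleSketch {
    int shape[2];  // [N, K_packed] for NVFP4
    struct { struct { const float* tensor_scale; /* host pointer */ } nvfp4; } payload;
};

struct GemvArgsSketch { int N; int K; float tensor_scale; };

GemvArgsSketch make_nvfp4_args(const WeightHandleSketch& w) {
    GemvArgsSketch tmp{};
    tmp.N            = w.shape[0];
    tmp.K            = w.shape[1] * 2;                 // (1) logical K, not packed
    tmp.tensor_scale = *w.payload.nvfp4.tensor_scale;  // (2) plain host read
    return tmp;
}
```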
Gemma-4-26B-A4B NVFP4 (llm-compressor format) was producing token-stuck
output (' \n \n ...') and IMA'ing under default flags during CUDA-graph
capture of decode step 1. Bisect (IMP_AUDIT_NVFP4_SCALES=1 +
IMP_EXPERT_NVFP4_DEQUANT[_MR]=1) pinned the cause to the per-row
gemv_nvfp4_kpar loop in expert_gemm's NVFP4 branch, used only for the
LEGACY SERIAL MoE prefill path (which fires when nvfp4_moe_*_ptr is null
because llm-compressor stores experts as separate per-tensor NVFP4 weights
instead of a fused 3D packed buffer, and CUTLASS 3.x grouped is reserved
for StorageTier::CUTLASS_NVFP4 / Modelopt).
The single-call M=1 decode path on the same kernel works (Mistral dense
decode hits gemv_nvfp4_kpar at N=4096 K=5120 and is coherent), but the M>1
per-row loop at Gemma-4 expert dimensions (N=704 K=2816 / N=2816 K=704)
produces wrong output. The dense-path mirror — gemm_nvfp4 (NVFP4 → FP16
dequant + cuBLAS gemm), which Mistral dense prefill already uses — is
correct on Gemma-4, so route the M>1 expert path through it. ~22 LoC
delta. CUTLASS 3.x grouped (Modelopt) and the nvfp4_moe_*_ptr batch path
are unaffected and remain on their fast paths.
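The routing decision reduces to an M-based branch; a sketch (the path names are illustrative, `gemv_nvfp4_kpar` / `gemm_nvfp4` are the kernels named above):

```cpp
// Sketch only: the M-based routing after this fix. The real call sites pass
// weights and activations; only the choice of path is shown here.
enum class ExpertGemmPath { NativeNvfp4Gemv, DequantCublasGemm };

ExpertGemmPath choose_expert_gemm_path(int M) {
    if (M == 1) {
        // decode: single-token gemv_nvfp4_kpar, known-good at dense shapes
        return ExpertGemmPath::NativeNvfp4Gemv;
    }
    // M > 1 legacy-serial MoE prefill: NVFP4 -> FP16 dequant + cuBLAS GEMM
    // (gemm_nvfp4), mirroring the coherent dense prefill path
    return ExpertGemmPath::DequantCublasGemm;
}
```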
Verified post-fix on 2026-04-27:
- Gemma-4-26B-A4B-it-NVFP4 → "The capital of France is Paris."
(tg≈34 tok/s, CUDA Graphs auto-enable; IMA-with-graphs gone).
- Mistral-Small-3.2-24B-NVFP4 → "Paris is renowned for"
(tg≈76 tok/s, no regression).
- Qwen3-Coder-30B-A3B-Instruct-FP4 → "of the Fibonacci function"
(Modelopt CUTLASS 3.x grouped path, no regression).
- Gemma-4-26B-A4B-it-Q8_0 GGUF → "The capital of France"
(no regression).
- LlmCompressorE2E unit tests: 3/4 pass (Modelopt_QwenCoder30B fails
because /models/Qwen3-Coder-30B-A3B-FP4/ is empty on this machine —
pre-existing test-data issue, not a regression).
Follow-up: pinpoint the actual gemv_nvfp4_kpar per-row-loop pathology at
small N/K. Likely a launch-storm interaction with PDL or the kpar reduce
path; needs a synthetic NxK=704x2816, M=11 test against an FP32 reference.
Once root cause is known, restore native NVFP4 GEMV for legacy MoE prefill
to avoid the dequant → FP16 overhead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(server,gemma4): coherent multi-turn for Gemma-4 NVFP4 + reasoning split
Four discoveries during multi-turn server testing on Gemma-4-26B-A4B-it-NVFP4
(llm-compressor format), all gated on the imp-server path which had not been
exercised against this model before:
1. **chat_template.jinja was never loaded.** HFConfigLoader::load_chat_template
only checked tokenizer_config.json's chat_template field, which is empty
on Gemma-4 NVFP4 — the actual 16448-byte Jinja template ships as a
standalone chat_template.jinja next to the safetensors. Added that as a
fallback. Without it the engine fell back to the hardcoded `apply_gemma`
family template, which doesn't match the model's trained channel layout
(`<|channel>thought<channel|>...<|channel>final<channel|>...`).
2. **Greedy fast-path bypassed banned-token suppression.**
`step_prefill_one`'s use_event_sync branch (temperature ≤ 0 / top_k=1)
goes through `forward_logits` + `sample_greedy_device`, neither of which
applied `state.banned_tokens`. Result on Gemma-4 NVFP4: argmax of raw
logits picks `<|channel>` (id 100, banned) as the natural first token —
then `is_stop_token` (which treats banned tokens as stop) finishes the
request immediately with 0 completion tokens. CLI's tested path uses
`forward()` (executor.cu:88) which DOES apply bans. Fix: apply the same
`cudaMemcpyAsync(-inf)` ban-mask inline before sample_greedy_device.
Also closed the same gap in `apply_pre_sample` (sample_from_logits) which
handled penalties / dry / logit_bias / constraints / min_p / typical_p
but had silently dropped banned-token handling that
`sample_single_from_logits` does have.
3. **Channel markers were over-banned.** With the greedy ban-mask now
actually firing, the model could no longer emit `<|channel>` even though
it's TRAINED to wrap its turn in
`<|channel>thought<channel|>...<|channel>final<channel|>...`. The
off-distribution sample produced token-stuck repetition loops
(e.g. "러쉬로 러쉬로 러쉬로 ..."). Solution: add `<|channel>` and
`<channel|>` to keep_ids alongside EOS / stop / think tokens for any
tokenizer that has them (covers Gemma-4 + future channel-trained models).
Server's existing `strip_channel_headers` / new `split_channel_segments`
handle the markers in the user-facing output.
4. **Thought channel leaked into content.** `strip_channel_headers` simply
removed the `<|channel>NAME` headers and `<channel|>` markers but left
the bodies of *all* channels concatenated, so the chain-of-thought
showed up in `content` ahead of the final answer. Added
`split_channel_segments(text)` which routes "thought"/"analysis" channel
bodies to a separate `reasoning_content` string and "final" / pre-channel
bodies to `content`. Wired into the chat-completions handler (gated by
`reasoning_format != "none"` so callers asking for a flat response still
get the legacy strip-only behaviour). Subtlety: in the observed Gemma-4
emission the *standalone* `<channel|>` (without a preceding header)
marks the END of the current channel and hands off to the default
(= "final" content) — earlier strip_channel_headers documented it as
just a marker drop. Documented the difference in split_channel_segments.
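To make the channel handling in items 3-4 concrete, here is a minimal sketch of the splitting logic (the marker strings and the standalone-`<channel|>`-ends-the-channel rule are from the description above; the parsing approach itself is an assumption, not the actual `split_channel_segments`):

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Sketch only: route "<|channel>NAME<channel|> body ..." segments into
// reasoning vs. content. "thought"/"analysis" bodies go to reasoning,
// "final" and pre-channel text go to content. A standalone "<channel|>"
// (no preceding header) ends the current channel and falls back to "final".
std::pair<std::string, std::string>  // {content, reasoning_content}
split_channels_sketch(const std::string& text) {
    const std::string open = "<|channel>", close = "<channel|>";
    std::string content, reasoning, channel = "final";
    size_t pos = 0;
    while (pos < text.size()) {
        const size_t o = text.find(open, pos);
        const size_t c = text.find(close, pos);
        const size_t next = std::min(o, c);
        // Everything before the next marker belongs to the current channel.
        const std::string body =
            text.substr(pos, next == std::string::npos ? std::string::npos : next - pos);
        ((channel == "thought" || channel == "analysis") ? reasoning : content) += body;
        if (next == std::string::npos) break;
        if (next == o) {  // header "<|channel>NAME<channel|>": switch channel to NAME
            const size_t name_end = text.find(close, o + open.size());
            if (name_end == std::string::npos) break;
            channel = text.substr(o + open.size(), name_end - (o + open.size()));
            pos = name_end + close.size();
        } else {          // standalone "<channel|>": hand off to the default channel
            channel = "final";
            pos = c + close.size();
        }
    }
    return {content, reasoning};
}
```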
Verified with /v1/chat/completions multi-turn on Gemma-4-26B-A4B-it-NVFP4:
- T1: content = "Nice to meet you, Raphael! The capital of France is **Paris**."
reasoning_content = "* Name: Raphael / * Favorite color: Teal / ..."
- T2: content = "Your name is **Raphael** and the color you mentioned is **teal**."
- T3: content = "23 * 17 is **391**, and the country you asked about was **France**."
(model correctly recalls history + computes 23×17 in reasoning trace)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c4844f2 to 526db1a
kekzl added a commit that referenced this pull request on Apr 27, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split

Two coupled fixes from PR #62 (569af55) that apply cleanly to current main. The third change in that commit — dropping the GDN→FP16 fallback in executor_pre_dequant.cu — is *intentionally skipped* here: PR #60 added a diagnostic comment on main explaining the cuBLAS-INTERNAL_ERROR (status 14) cascade for GDN-shape MXFP4 weights is still unrooted, so the `has_gdn` guard stays. Qwen3.5-27B-mxfp4 remains incoherent for an unrelated reason per the original PR #62 author's notes, so dropping the guard wouldn't unblock working usage today.

1. **attention_cublas.cu** — Use the FP32 S-matrix whenever the scratch buffer fits, not just when scale==1.0 (Gemma-4). The FP16-S path can accumulate enough round-off across deep layers to NaN at attention layers in the second half of the model. The scale==1.0 gate only triggered on Gemma-4 by accident; Qwen3.5-27B-mxfp4 (head_dim=512, scale=1/sqrt(512)) hits the same FP16 round-off → all-NaN attention output at L59 → garbage logits.

2. **executor_attention.cu (a)** — Pass the full `attn_scores_` tensor (capacity = max seq_len^2) to `attention_cublas_prefill` instead of constructing a sub-view sized for the current n. The FP32-fits check inside attention_cublas reads the passed S tensor's shape, not the underlying buffer's capacity, so the sub-view always made FP32-fits evaluate false even when the real allocation was 200× larger. With the full buffer passed, the FP32-S path actually fires for prefill shapes that fit in the buffer.

3. **executor_attention.cu (b)** — Detect attn_q_norm/attn_k_norm with norm_dim < head_dim and reshape Q/K accordingly. Qwen3.5-27B-mxfp4 ships with head_dim=512 for Q, head_dim=256 for K, and a single (256,) norm weight intended to be applied per 256-dim chunk along the head. The previous code reshaped Q to [n*nh, hd=512] and called rmsnorm with the (256,) weight — reading 256 elements past the end of the weight buffer and producing the L59 attention NaN that poisoned the rest of the forward pass. New `split_norm_dim` lambda computes `norm_dim` from the weight's shape and reshapes the Q/K view to `[n*nh*(hd/norm_dim), norm_dim]`. No-op when norm_dim == hd (the common case).

Verified clean cherry-pick with no overlap against the recently-merged PR #65 (server reasoning_content split + Gemma-4 NVFP4 multi-turn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools: drop unreferenced create_test_mxfp4.py

Last MXFP4 converter remnant. The production convert_mxfp4.py was already removed in PR #8 (e93deff). create_test_mxfp4.py only generated a tiny synthetic GGUF for ad-hoc loader testing and is referenced by no test, build target, doc, or script. Drop it to finish the converter cleanup. MXFP4 input format is now exclusively third-party GGUFs (Unsloth, bartowski, gpt-oss); own quantization pipeline is NVFP4 via NVIDIA Model Optimizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
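The per-chunk QK-norm reshape in change 3 of the commit above reduces to a small shape computation. A minimal sketch with stand-in types (only the `[n*nh*(hd/norm_dim), norm_dim]` reshape itself is from the commit message):

```cpp
#include <cassert>

// Sketch only: when the QK-norm weight is shorter than head_dim, view Q/K as
// [n * nh * (hd / norm_dim), norm_dim] so rmsnorm rows match the weight length
// instead of reading past the end of the weight buffer.
struct View2D { float* data; int rows; int cols; };

View2D qk_norm_view(float* qk, int n_tokens, int n_heads, int head_dim, int norm_len) {
    const int norm_dim = (norm_len > 0 && norm_len < head_dim) ? norm_len : head_dim;
    assert(head_dim % norm_dim == 0);
    // norm_dim == head_dim: the common [n*nh, hd] case, a no-op reshape.
    return View2D{qk, n_tokens * n_heads * (head_dim / norm_dim), norm_dim};
}
```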
kekzl added a commit that referenced this pull request on Apr 30, 2026
feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial) (#65)

Constituent commits (full per-commit messages are reproduced in the entries above):
* feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial)
* fix(loader): route Gemma-4 base FFN norms (was silently dropped)
* diag: add IMP_NO_PDL + IMP_MOE_ZERO_WORKSPACE env vars
* fix(nvfp4): two latent bugs in weight_dispatch.cu NVFP4 path + diag
* fix(hf_loader): set scalar head_dim/n_kv_heads to max(per-layer) for Gemma-4
* fix(moe): route per-expert NVFP4 prefill through gemm_nvfp4 (dequant)
* fix(server,gemma4): coherent multi-turn for Gemma-4 NVFP4 + reasoning split

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on Apr 30, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split
* tools: drop unreferenced create_test_mxfp4.py

(Message body as in the Apr 27 entry above.)
kekzl added a commit that referenced this pull request on May 1, 2026
… llm-compressor are shipped

Two stale "open bugs" listed in TODO.md were already resolved on main:
- Qwen3.6-NVFP4 partial coherence: shipped via PR #81 ("six Qwen3.5/3.6-NVFP4 SafeTensors loader bugs"). Per-layer correlation vs GGUF Q4_K_M is ≥0.997 across all 40 layers; output matches the GGUF oracle. Verified that the arch_norm_offset / gdn_grouped_head_layout / partial_rotary_factor / nested rope_parameters / A_log -exp transform / fp32_scan y_buf fixes are all on main.
- llm-compressor Phase 2 Item 2 (Gemma-4 NVFP4): shipped via PR #65. Root cause was per-row gemv_nvfp4_kpar at Gemma-4 expert dimensions (small N=704 K=2816); routed through gemm_nvfp4 dequant→cuBLAS for M>1, decode M=1 path unchanged. Coherent end-to-end at ~34 tok/s.

CHANGELOG.md entries for both PRs expanded to enumerate the actual fixes shipped instead of restating the (now-obsolete) "still incoherent" claim. Memos `qwen36_nvfp4_decode_partial_2026_04_30.md` and `llm_compressor_phase2_item2_2026_04_26.md` already carry SHIPPED / RESOLVED markers in their own headers — TODO.md was the lagging artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 2 Item 2 of the llm-compressor NVFP4 loader backlog — emits the 90 extra Gemma-4 scaling tensors and fixes a missing-norm bug, two latent NVFP4 dispatch bugs, and a Gemma-4 head_dim buffer-sizing bug.
Depends on #64 (Phase 2 Item 1 — Mistral3). PR base is set to that branch; will rebase to `main` once #64 lands.
Commits
1. `a791e95` — emit Gemma-4 NVFP4 scaling extras
Phase 1 skipped the 90 extras (`.layer_scalar`, `router.per_expert_scale`, `router.scale`) — output was `Pac` (immediate degenerate stop). Flips SKIP rules to pass-through emission. Names already match `weight_map.cpp`'s Gemma-4 routing.
2. `79c7659` — route Gemma-4 base FFN norms
Each Gemma-4 layer has 6 FFN-related layernorms. Phase-1 weight_map only routed the _1 / _2 parallel-branch variants; the base `pre_feedforward_layernorm` / `post_feedforward_layernorm` (= GGUF `ffn_norm` / `post_ffw_norm`) were dropped → FFN ran without normalization.
3. `8982d5a` — diagnostic env vars
`IMP_NO_PDL=1`, `IMP_MOE_ZERO_WORKSPACE=1` — used to rule out PDL races and uninit reads from MoE staging. Both showed bug is computational, not concurrency.
4. `ca05a45` — fix two latent NVFP4 bugs in weight_dispatch.cu
Phase-1 NVFP4 audit (commit f5f4a1a, "K must be packed*2") missed `weight_dispatch.cu`'s two `case StorageTier::NVFP4:` arms. Both had identical bugs: `tmp.K = w.shape[1]` (packed not logical), and `cudaMemcpyDeviceToHost` on a HOST tensor_scale pointer. Side effect on Mistral: argmax for "The capital of France is" flipped from "the largest city in France" → "the capital of France".
5. `57cda8e` — Gemma-4 scalar head_dim = max(per-layer)
Layer-diff bisect of GGUF Q8_0 (works) vs SafeTensors NVFP4 (broken):
Q,K,V essentially identical going into FMHA, but the attention compute amplifies noise. Two compounding causes:
- `cfg.head_dim` = 256, not 512 (FIXED here). Gemma-4 has dual head_dim (256 SWA / 512 global). The HF config loader was leaving `cfg.head_dim` at the SWA-only value while building the per-layer array correctly. KV cache + attention workspace were sized for hd=256, so the global layers' 8192-wide Q output overflows the 4096-wide buffer into adjacent slots. The GGUF loader has the same `max_hd` fixup; HF was missing it.
- `ld_q`/`ld_k` mismatch with buffer stride (NOT fixed — separate follow-up). `attention_cublas.cu` computes `int ld_q = n_heads * head_dim;` from per-layer values (e.g. 16*256 = 4096 for L0, an SWA layer) but the buffer stride is `n_heads * cfg.head_dim` = 16*512 = 8192. cuBLAS strided batched GEMM reads token N's row from offset `N * ld_q`, not `N * stride` → token 1+ rows are read from inside token 0's padding. GGUF survives because its padding happens to be zero (luck); NVFP4 has stale workspace data there → garbage attention output. The fix needs `ld_q = Q.stride[0]` in `attention_cublas_prefill` plus the same change in `naive_attention_prefill_kernel`. Estimated 2-4 hours.
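A minimal sketch of that follow-up, assuming an illustrative Tensor stand-in (only `Q.stride[0]` as the source of the leading dimension comes from the description above):

```cpp
// Sketch only: take the cuBLAS leading dimension from the buffer's actual row
// stride (padded to n_heads * max head_dim) instead of recomputing it from the
// per-layer geometry. TensorSketch stands in for the real Tensor type.
struct TensorSketch { long stride[2]; /* [row stride, element stride] */ };

inline int leading_dim_for_q(const TensorSketch& Q, int n_heads, int head_dim) {
    // Buggy form, for contrast:  ld_q = n_heads * head_dim  (16*256 = 4096 on an SWA layer)
    // Correct form: the allocation's row stride, e.g. 16*512 = 8192 once cfg.head_dim = max.
    (void)n_heads; (void)head_dim;
    return static_cast<int>(Q.stride[0]);
}
```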
Status
Loading + dispatch: works. test-core 26/26, test-e2e 4/4 (LoadsWithoutIMA, GeneratesNonEmptyOutput, MistralSmall, Modelopt regression).
Mistral-Small NVFP4: improved. "Paris. It is the capital of France..." (was "the largest city in France"). No regression.
Gemma-4 NVFP4 coherence: NOT recovered. Strictly more correct than the Phase-1 baseline (output is no longer immediate-stop `Pac`, scales+norms applied, dispatch math fixed, head_dim correct) — but still incoherent due to the open ld/stride mismatch. The bisect localizes the remaining bug to FMHA cuBLAS stride handling.
Phase 2 backlog status
🤖 Generated with Claude Code