feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial) #65
Merged
Conversation
kekzl added a commit that referenced this pull request on Apr 26, 2026
Two zero-overhead diagnostic flags added during the Phase-2-Item-2 audit
of Gemma-4 NVFP4 non-determinism. Off by default; no behavior change
unless explicitly enabled.
- IMP_NO_PDL=1 — pdl::is_enabled() returns false for every kernel, so
pdl::launch falls back to standard <<<>>> launches. Used to confirm
whether a non-deterministic regression is caused by Programmatic
Dependent Launch tail-overlap races.
- IMP_MOE_ZERO_WORKSPACE=1 — cudaMemsetAsync the MoE workspace buffers
(expert_{gate,up,swiglu,down}, gathered) at the start of each
run_moe_ffn(). Used to confirm whether non-determinism is caused by
uninitialized reads from MoE staging slots that the legacy serial
fallback didn't write (e.g. inactive expert positions).
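As a rough sketch of how such zero-overhead flags are typically wired (the helper functions and the MoeWorkspace struct below are hypothetical; only the env var names and the cudaMemsetAsync behavior come from this commit):

```cpp
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Sketch only: env-var-gated diagnostics as described above.
static bool flag_enabled(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Cached once, so the check costs nothing on the hot path.
static bool pdl_disabled_by_env() {
    static const bool disabled = flag_enabled("IMP_NO_PDL");
    return disabled;  // when true, pdl::is_enabled() would report false everywhere
}

// Hypothetical workspace description; zeroed only when the flag is set.
struct MoeWorkspace {
    void*  expert_gate; size_t expert_gate_bytes;
    void*  gathered;    size_t gathered_bytes;
    // expert_up / expert_swiglu / expert_down elided for brevity
};

static void maybe_zero_moe_workspace(const MoeWorkspace& ws, cudaStream_t stream) {
    static const bool zero = flag_enabled("IMP_MOE_ZERO_WORKSPACE");
    if (!zero) return;  // default: no behavior change
    cudaMemsetAsync(ws.expert_gate, 0, ws.expert_gate_bytes, stream);
    cudaMemsetAsync(ws.gathered,    0, ws.gathered_bytes,    stream);
}
```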
Both are useful as next-session investigation tools for the Gemma-4 NVFP4
incoherence (PR #65 still open). Audit findings:
- LAUNCH_BLOCKING=1 → output deterministic but still incoherent
("{\"get_2_plus_2"); confirms the underlying bug is computational,
non-determinism is sampling-noise in the wrong-logits regime.
- IMP_NO_PDL=1 → still non-deterministic; rules out PDL races.
- IMP_MOE_ZERO_WORKSPACE=1 → still non-deterministic; rules out
uninit reads from MoE staging.
- Mistral-Small NVFP4 (dense) is deterministic + coherent on the
same setup; isolates the bug to the MoE-specific code paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on Apr 26, 2026
fix(hf_loader): set scalar head_dim/n_kv_heads to max(per-layer) for Gemma-4

Gemma-4 has dual head geometry: SWA layers use head_dim=256, full-attention layers use head_dim=512 (and 2 KV heads vs 8). The HF config loader was building the per-layer arrays correctly but leaving cfg.head_dim at the SWA-only value (256). The KV cache + attention workspace are sized from the SCALAR cfg.head_dim — which means with head_dim=256 the buffers are too small for the full-attention layers' 8192-wide Q output → writes past the allocated stride into adjacent layer slots → corrupted state.

The GGUF loader already does the same `max_hd = max(per-layer)` fixup (gguf_loader.cpp:940-946); this commit mirrors it on the HF side. Without it, Gemma-4 NVFP4 SafeTensors models had silent buffer-overflow risk on the full-attention layers (every 6th layer in the 5+1 SWA/global pattern). Discovered while bisecting Gemma-4 NVFP4 incoherence (PR #65). The KV-cache log line `head_dim=256` in NVFP4 vs `head_dim=512` in GGUF was the tell.

Note: this fix removes the buffer-overflow risk but does NOT fully restore Gemma-4 NVFP4 coherence — there's a separate architectural issue where attention_cublas_prefill computes ld_q/ld_k from per-layer values while the buffer stride is sized from cfg.head_dim (now correctly max). For mixed-head_dim layers the leading dimensions don't match the buffer stride. That's a follow-up fix (probably: pass the actual stride from Tensor.stride[0] instead of recomputing).

No regression on Mistral-Small NVFP4 (still coherent: "Paris. It is the capital of France...").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
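A minimal sketch of the scalar fixup described in the commit above, with illustrative config field names (the real code mirrors the gguf_loader.cpp fixup it cites):

```cpp
#include <algorithm>
#include <vector>

// Sketch only: derive the scalar head_dim / n_kv_heads used for KV-cache and
// workspace sizing from the per-layer arrays, so mixed-geometry models
// (256 SWA / 512 global) are sized for the widest layer.
struct ConfigSketch {
    std::vector<int> head_dim_per_layer;
    std::vector<int> n_kv_heads_per_layer;
    int head_dim   = 0;  // scalar the KV cache / attention workspace are sized from
    int n_kv_heads = 0;
};

void fixup_scalar_geometry(ConfigSketch& cfg) {
    if (!cfg.head_dim_per_layer.empty())
        cfg.head_dim = *std::max_element(cfg.head_dim_per_layer.begin(),
                                         cfg.head_dim_per_layer.end());
    if (!cfg.n_kv_heads_per_layer.empty())
        cfg.n_kv_heads = *std::max_element(cfg.n_kv_heads_per_layer.begin(),
                                           cfg.n_kv_heads_per_layer.end());
}
```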
feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial)
Phase 1 skipped the 90 extra scaling tensors that ship with llm-compressor
NVFP4 Gemma-4 (`.layer_scalar`, `router.per_expert_scale`, `router.scale`)
because the runtime application was untested. Validation showed Gemma-4 NVFP4
emits "Pac<unused5>" without them — they are load-bearing.
This change flips the SKIP rules to pass-through emission. The names already
match `weight_map.cpp`'s Gemma-4 routing (router.scale → ffn_gate_inp_scale,
router.per_expert_scale → expert_down_scale, layer_scalar → layer_out_scale),
which the Gemma-4 GGUF path uses correctly today. weight_upload.cu already
includes these fields in the unquantized-norm upload list (BF16→FP16
conversion handled). executor_forward.cu / executor_forward_moe.cu already
apply layer_out_scale, ffn_gate_inp_scale (as RMSNorm weight) and
expert_down_scale (as routing weight multiplier) on the GEMMA4 dispatch.
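For orientation, the routing that the pass-through relies on amounts to a suffix-to-field mapping along these lines (the three tensor names and their targets are from this commit; the enum and function are illustrative stand-ins, the real logic lives in weight_map.cpp):

```cpp
#include <optional>
#include <string>

// Sketch only: the suffix -> field routing the pass-through emission depends on.
enum class Gemma4Extra { FfnGateInpScale, ExpertDownScale, LayerOutScale };

std::optional<Gemma4Extra> route_gemma4_extra(const std::string& name) {
    auto ends_with = [&](const std::string& suf) {
        return name.size() >= suf.size() &&
               name.compare(name.size() - suf.size(), suf.size(), suf) == 0;
    };
    if (ends_with("router.scale"))            return Gemma4Extra::FfnGateInpScale;
    if (ends_with("router.per_expert_scale")) return Gemma4Extra::ExpertDownScale;
    if (ends_with(".layer_scalar"))           return Gemma4Extra::LayerOutScale;
    return std::nullopt;  // not one of the Phase-2 extras
}
```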
What works:
- 26 unit tests pass (3 new EmitsLayerScalar / EmitsPerExpertScale /
EmitsRouterScale, replacing the SKIP assertions; ProjScaleIsNotGemma4Extra
retained as a defensive guard).
- All four LlmCompressorE2E tests still pass: Gemma4_LoadsWithoutIMA,
Gemma4_GeneratesNonEmptyOutput, MistralSmall_LoadsAndGeneratesCoherent,
Modelopt_QwenCoder30B_StillWorks.
- Gemma-4 NVFP4 output is no longer the immediate degenerate stop
("Pac<unused5>"). The scales now reach the forward pass.
What's left (deferred to a follow-up PR):
- Gemma-4 NVFP4 output is still incoherent — varying tokens (e.g. "What is
the capital of France?" → " way world ات set" with chat-template gemma)
but not factually correct. The GGUF Gemma-4 Q8_0 path with the SAME
executor scaling code is coherent ("Paris is the capital of France"), so
the application math is right — but llm-compressor likely stores one or
more of these scales with different semantics (e.g. divisor vs multiplier,
similar to the Phase-1 weight_global_scale flip), or there's an extra
reconstruction step in the vLLM reference (vllm-project/vllm#39045) that
imp doesn't replicate. Needs a side-by-side reference trace; estimated
another ~1-2 days.
Mistral-Small remains the actual coherence gate for the dense loader; this
PR doesn't regress it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(loader): route Gemma-4 base FFN norms (was silently dropped)

Discovery from comparing the GGUF Q8_0 baseline (which works coherently with the same executor) against the SafeTensors NVFP4 path: the per-layer scale *values* are bit-identical between the two formats (verified via raw byte diff on layer_scalar / router.scale / router.per_expert_scale across layers 0/5/15). So the executor math is correct. What's *not* identical is the layer-norm tensor inventory the loader sees.

Each Gemma-4 layer has six FFN-related layernorms:
- pre_feedforward_layernorm (base)
- pre_feedforward_layernorm_2 (parallel branch)
- post_feedforward_layernorm (base)
- post_feedforward_layernorm_1 (parallel branch 1)
- post_feedforward_layernorm_2 (parallel branch 2)
- post_attention_layernorm (sandwich)

The Phase-1 weight_map only routed the _1 / _2 parallel-branch variants and post_attention. The two *base* variants (pre_feedforward_layernorm, post_feedforward_layernorm) hit no branch and were silently counted as "matched" via the catch-all but never assigned to a layer field. Result: layer.ffn_norm and layer.post_ffn_norm stayed null for every Gemma-4 layer and the FFN computation ran without input/output normalization.

GGUF stores these as `ffn_norm` and `post_ffw_norm` and the GGUF loader already routes them. This commit mirrors that routing on the SafeTensors side: pre_feedforward_layernorm → layer.ffn_norm, post_feedforward_layernorm → layer.post_ffn_norm.

Status:
- 26 LlmCompressor* unit tests still pass.
- Gemma-4 NVFP4 generation: output is still degenerate (further work needed — likely an additional missing piece in the per-expert NVFP4 path, since the prefill takes the legacy serial fallback rather than the NVFP4 grouped path; the nvfp4_moe wcache count stays 0 because expert_gate_packed is null for the per-expert llm-compressor layout). But the norm fix is a strict improvement that the loader should ship regardless.
- Mistral-Small dense path unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(nvfp4): two latent bugs in weight_dispatch.cu NVFP4 path + diag

The Phase-1 NVFP4 audit (commit f5f4a1a, "K must be packed*2") fixed the WeightHandle-based dispatch in five executor files but missed weight_dispatch.cu's TWO `case StorageTier::NVFP4:` arms (line 82 prefill, line 298 decode). Both had identical bugs:
1. tmp.K = w.shape[1] — uses the PACKED column count instead of logical K. NVFP4 stores K/2 nibbles per byte, so the kernel consumed only half the input → wrong dot product.
2. cudaMemcpyAsync(&tmp.tensor_scale, w.payload.nvfp4.tensor_scale, sizeof(float), cudaMemcpyDeviceToHost, stream) — but `tensor_scale` is a borrowed HOST float pointer (per the Phase-1 wcache_.nvfp4 binding), not a device pointer. The Memcpy silently corrupts the scale.

Mistral-Small NVFP4 worked anyway because the dense q/k/v/o paths take the specialized fused QKV / gate-up GEMVs (executor_attention.cu line 345 + executor_ffn.cu line 173/289/359), which already had the correct `* 2` + direct host-pointer read. The buggy weight_dispatch.cu path is only reached by callers that use gemm_dispatch(WeightHandle&, …) directly — rare today but used by some Gemma-4 paths.

Fix both arms to mirror the working callers: read tensor_scale directly from the host pointer, and multiply shape[1] by 2 for the logical K.

Side effect verified on Mistral-Small-3.2-24B-NVFP4: greedy "The capital of France is" output improved from "Paris. It is the largest city in France and one of" → "Paris. It is the capital of France". The earlier output had been mis-scaled too, just not visibly enough to flip argmax.

Also adds the IMP_NO_SHARED_MLP=1 diagnostic env var (skips the Gemma-4 shared-MLP branch in run_moe_ffn — used to bisect shared-MLP vs MoE-expert contributions to the still-broken Gemma-4 NVFP4 output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
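A sketch of the corrected NVFP4 arm from the commit above, using stand-in types (only the two changes themselves, `shape[1] * 2` and the direct host read of tensor_scale, are from the commit text):

```cpp
// Sketch only: the corrected StorageTier::NVFP4 arm.
//   (1) logical K = packed columns * 2 (two FP4 nibbles per byte),
//   (2) tensor_scale is a borrowed HOST float pointer, so read it directly
//       instead of cudaMemcpyAsync'ing it device-to-host.
struct WeightHandleSketch {
    int shape[2];  // [N, K_packed] for NVFP4
    struct { struct { const float* tensor_scale; /* host pointer */ } nvfp4; } payload;
};

struct GemvArgsSketch { int N; int K; float tensor_scale; };

GemvArgsSketch make_nvfp4_args(const WeightHandleSketch& w) {
    GemvArgsSketch tmp{};
    tmp.N            = w.shape[0];
    tmp.K            = w.shape[1] * 2;                 // (1) logical K, not packed
    tmp.tensor_scale = *w.payload.nvfp4.tensor_scale;  // (2) plain host read
    return tmp;
}
```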
Gemma-4-26B-A4B NVFP4 (llm-compressor format) was producing token-stuck
output (' \n \n ...') and IMA'ing under default flags during CUDA-graph
capture of decode step 1. Bisect (IMP_AUDIT_NVFP4_SCALES=1 +
IMP_EXPERT_NVFP4_DEQUANT[_MR]=1) pinned the cause to the per-row
gemv_nvfp4_kpar loop in expert_gemm's NVFP4 branch, used only for the
LEGACY SERIAL MoE prefill path (which fires when nvfp4_moe_*_ptr is null
because llm-compressor stores experts as separate per-tensor NVFP4 weights
instead of a fused 3D packed buffer, and CUTLASS 3.x grouped is reserved
for StorageTier::CUTLASS_NVFP4 / Modelopt).
The single-call M=1 decode path on the same kernel works (Mistral dense
decode hits gemv_nvfp4_kpar at N=4096 K=5120 and is coherent), but the M>1
per-row loop at Gemma-4 expert dimensions (N=704 K=2816 / N=2816 K=704)
produces wrong output. The dense-path mirror — gemm_nvfp4 (NVFP4 → FP16
dequant + cuBLAS gemm), which Mistral dense prefill already uses — is
correct on Gemma-4, so route the M>1 expert path through it. ~22 LoC
delta. CUTLASS 3.x grouped (Modelopt) and the nvfp4_moe_*_ptr batch path
are unaffected and remain on their fast paths.
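The routing decision reduces to an M-based branch; a sketch (the path names are illustrative, `gemv_nvfp4_kpar` / `gemm_nvfp4` are the kernels named above):

```cpp
// Sketch only: the M-based routing after this fix. The real call sites pass
// weights and activations; only the choice of path is shown here.
enum class ExpertGemmPath { NativeNvfp4Gemv, DequantCublasGemm };

ExpertGemmPath choose_expert_gemm_path(int M) {
    if (M == 1) {
        // decode: single-token gemv_nvfp4_kpar, known-good at dense shapes
        return ExpertGemmPath::NativeNvfp4Gemv;
    }
    // M > 1 legacy-serial MoE prefill: NVFP4 -> FP16 dequant + cuBLAS GEMM
    // (gemm_nvfp4), mirroring the coherent dense prefill path
    return ExpertGemmPath::DequantCublasGemm;
}
```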
Verified post-fix on 2026-04-27:
- Gemma-4-26B-A4B-it-NVFP4 → "The capital of France is Paris."
(tg≈34 tok/s, CUDA Graphs auto-enable; IMA-with-graphs gone).
- Mistral-Small-3.2-24B-NVFP4 → "Paris is renowned for"
(tg≈76 tok/s, no regression).
- Qwen3-Coder-30B-A3B-Instruct-FP4 → "of the Fibonacci function"
(Modelopt CUTLASS 3.x grouped path, no regression).
- Gemma-4-26B-A4B-it-Q8_0 GGUF → "The capital of France"
(no regression).
- LlmCompressorE2E unit tests: 3/4 pass (Modelopt_QwenCoder30B fails
because /models/Qwen3-Coder-30B-A3B-FP4/ is empty on this machine —
pre-existing test-data issue, not a regression).
Follow-up: pinpoint the actual gemv_nvfp4_kpar per-row-loop pathology at
small N/K. Likely a launch-storm interaction with PDL or the kpar reduce
path; needs a synthetic NxK=704x2816, M=11 test against an FP32 reference.
Once root cause is known, restore native NVFP4 GEMV for legacy MoE prefill
to avoid the dequant → FP16 overhead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(server,gemma4): coherent multi-turn for Gemma-4 NVFP4 + reasoning split
Four discoveries during multi-turn server testing on Gemma-4-26B-A4B-it-NVFP4
(llm-compressor format), all gated on the imp-server path which had not been
exercised against this model before:
1. **chat_template.jinja was never loaded.** HFConfigLoader::load_chat_template
only checked tokenizer_config.json's chat_template field, which is empty
on Gemma-4 NVFP4 — the actual 16448-byte Jinja template ships as a
standalone chat_template.jinja next to the safetensors. Added that as a
fallback. Without it the engine fell back to the hardcoded `apply_gemma`
family template, which doesn't match the model's trained channel layout
(`<|channel>thought<channel|>...<|channel>final<channel|>...`).
2. **Greedy fast-path bypassed banned-token suppression.**
`step_prefill_one`'s use_event_sync branch (temperature ≤ 0 / top_k=1)
goes through `forward_logits` + `sample_greedy_device`, neither of which
applied `state.banned_tokens`. Result on Gemma-4 NVFP4: argmax of raw
logits picks `<|channel>` (id 100, banned) as the natural first token —
then `is_stop_token` (which treats banned tokens as stop) finishes the
request immediately with 0 completion tokens. CLI's tested path uses
`forward()` (executor.cu:88) which DOES apply bans. Fix: apply the same
`cudaMemcpyAsync(-inf)` ban-mask inline before sample_greedy_device.
Also closed the same gap in `apply_pre_sample` (sample_from_logits) which
handled penalties / dry / logit_bias / constraints / min_p / typical_p
but had silently dropped banned-token handling that
`sample_single_from_logits` does have.
3. **Channel markers were over-banned.** With the greedy ban-mask now
actually firing, the model could no longer emit `<|channel>` even though
it's TRAINED to wrap its turn in
`<|channel>thought<channel|>...<|channel>final<channel|>...`. The
off-distribution sample produced token-stuck repetition loops
(e.g. "러쉬로 러쉬로 러쉬로 ..."). Solution: add `<|channel>` and
`<channel|>` to keep_ids alongside EOS / stop / think tokens for any
tokenizer that has them (covers Gemma-4 + future channel-trained models).
Server's existing `strip_channel_headers` / new `split_channel_segments`
handle the markers in the user-facing output.
4. **Thought channel leaked into content.** `strip_channel_headers` simply
removed the `<|channel>NAME` headers and `<channel|>` markers but left
the bodies of *all* channels concatenated, so the chain-of-thought
showed up in `content` ahead of the final answer. Added
`split_channel_segments(text)` which routes "thought"/"analysis" channel
bodies to a separate `reasoning_content` string and "final" / pre-channel
bodies to `content`. Wired into the chat-completions handler (gated by
`reasoning_format != "none"` so callers asking for a flat response still
get the legacy strip-only behaviour). Subtlety: in the observed Gemma-4
emission the *standalone* `<channel|>` (without a preceding header)
marks the END of the current channel and hands off to the default
(= "final" content) — earlier strip_channel_headers documented it as
just a marker drop. Documented the difference in split_channel_segments.
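To make the channel handling in items 3-4 concrete, here is a minimal sketch of the splitting logic (the marker strings and the standalone-`<channel|>`-ends-the-channel rule are from the description above; the parsing approach itself is an assumption, not the actual `split_channel_segments`):

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Sketch only: route "<|channel>NAME<channel|> body ..." segments into
// reasoning vs. content. "thought"/"analysis" bodies go to reasoning,
// "final" and pre-channel text go to content. A standalone "<channel|>"
// (no preceding header) ends the current channel and falls back to "final".
std::pair<std::string, std::string>  // {content, reasoning_content}
split_channels_sketch(const std::string& text) {
    const std::string open = "<|channel>", close = "<channel|>";
    std::string content, reasoning, channel = "final";
    size_t pos = 0;
    while (pos < text.size()) {
        const size_t o = text.find(open, pos);
        const size_t c = text.find(close, pos);
        const size_t next = std::min(o, c);
        // Everything before the next marker belongs to the current channel.
        const std::string body =
            text.substr(pos, next == std::string::npos ? std::string::npos : next - pos);
        ((channel == "thought" || channel == "analysis") ? reasoning : content) += body;
        if (next == std::string::npos) break;
        if (next == o) {  // header "<|channel>NAME<channel|>": switch channel to NAME
            const size_t name_end = text.find(close, o + open.size());
            if (name_end == std::string::npos) break;
            channel = text.substr(o + open.size(), name_end - (o + open.size()));
            pos = name_end + close.size();
        } else {          // standalone "<channel|>": hand off to the default channel
            channel = "final";
            pos = c + close.size();
        }
    }
    return {content, reasoning};
}
```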
Verified with /v1/chat/completions multi-turn on Gemma-4-26B-A4B-it-NVFP4:
- T1: content = "Nice to meet you, Raphael! The capital of France is **Paris**."
reasoning_content = "* Name: Raphael / * Favorite color: Teal / ..."
- T2: content = "Your name is **Raphael** and the color you mentioned is **teal**."
- T3: content = "23 * 17 is **391**, and the country you asked about was **France**."
(model correctly recalls history + computes 23×17 in reasoning trace)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c4844f2 to 526db1a
kekzl added a commit that referenced this pull request on Apr 27, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split

Two coupled fixes from PR #62 (569af55) that apply cleanly to current main. The third change in that commit — dropping the GDN→FP16 fallback in executor_pre_dequant.cu — is *intentionally skipped* here: PR #60 added a diagnostic comment on main explaining the cuBLAS-INTERNAL_ERROR (status 14) cascade for GDN-shape MXFP4 weights is still unrooted, so the `has_gdn` guard stays. Qwen3.5-27B-mxfp4 remains incoherent for an unrelated reason per the original PR #62 author's notes, so dropping the guard wouldn't unblock working usage today.

1. **attention_cublas.cu** — Use the FP32 S-matrix whenever the scratch buffer fits, not just when scale==1.0 (Gemma-4). The FP16-S path can accumulate enough round-off across deep layers to NaN at attention layers in the second half of the model. The scale==1.0 gate only triggered on Gemma-4 by accident; Qwen3.5-27B-mxfp4 (head_dim=512, scale=1/sqrt(512)) hits the same FP16 round-off → all-NaN attention output at L59 → garbage logits.

2. **executor_attention.cu (a)** — Pass the full `attn_scores_` tensor (capacity = max seq_len^2) to `attention_cublas_prefill` instead of constructing a sub-view sized for the current n. The FP32-fits check inside attention_cublas reads the passed S tensor's shape, not the underlying buffer's capacity, so the sub-view always made FP32-fits evaluate false even when the real allocation was 200× larger. With the full buffer passed, the FP32-S path actually fires for prefill shapes that fit in the buffer.

3. **executor_attention.cu (b)** — Detect attn_q_norm/attn_k_norm with norm_dim < head_dim and reshape Q/K accordingly. Qwen3.5-27B-mxfp4 ships with head_dim=512 for Q, head_dim=256 for K, and a single (256,) norm weight intended to be applied per 256-dim chunk along the head. The previous code reshaped Q to [n*nh, hd=512] and called rmsnorm with the (256,) weight — reading 256 elements past the end of the weight buffer and producing the L59 attention NaN that poisoned the rest of the forward pass. New `split_norm_dim` lambda computes `norm_dim` from the weight's shape and reshapes the Q/K view to `[n*nh*(hd/norm_dim), norm_dim]`. No-op when norm_dim == hd (the common case).

Verified clean cherry-pick with no overlap against the recently-merged PR #65 (server reasoning_content split + Gemma-4 NVFP4 multi-turn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools: drop unreferenced create_test_mxfp4.py

Last MXFP4 converter remnant. The production convert_mxfp4.py was already removed in PR #8 (e93deff). create_test_mxfp4.py only generated a tiny synthetic GGUF for ad-hoc loader testing and is referenced by no test, build target, doc, or script. Drop it to finish the converter cleanup. MXFP4 input format is now exclusively third-party GGUFs (Unsloth, bartowski, gpt-oss); own quantization pipeline is NVFP4 via NVIDIA Model Optimizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
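The per-chunk QK-norm reshape in change 3 of the commit above reduces to a small shape computation. A minimal sketch with stand-in types (only the `[n*nh*(hd/norm_dim), norm_dim]` reshape itself is from the commit message):

```cpp
#include <cassert>

// Sketch only: when the QK-norm weight is shorter than head_dim, view Q/K as
// [n * nh * (hd / norm_dim), norm_dim] so rmsnorm rows match the weight length
// instead of reading past the end of the weight buffer.
struct View2D { float* data; int rows; int cols; };

View2D qk_norm_view(float* qk, int n_tokens, int n_heads, int head_dim, int norm_len) {
    const int norm_dim = (norm_len > 0 && norm_len < head_dim) ? norm_len : head_dim;
    assert(head_dim % norm_dim == 0);
    // norm_dim == head_dim: the common [n*nh, hd] case, a no-op reshape.
    return View2D{qk, n_tokens * n_heads * (head_dim / norm_dim), norm_dim};
}
```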
kekzl added a commit that referenced this pull request on Apr 30, 2026
feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial) (#65)

Constituent commits (full per-commit messages are reproduced in the entries above):
* feat(loader): emit Gemma-4 NVFP4 scaling extras (Phase 2 Item 2, partial)
* fix(loader): route Gemma-4 base FFN norms (was silently dropped)
* diag: add IMP_NO_PDL + IMP_MOE_ZERO_WORKSPACE env vars
* fix(nvfp4): two latent bugs in weight_dispatch.cu NVFP4 path + diag
* fix(hf_loader): set scalar head_dim/n_kv_heads to max(per-layer) for Gemma-4
* fix(moe): route per-expert NVFP4 prefill through gemm_nvfp4 (dequant)
* fix(server,gemma4): coherent multi-turn for Gemma-4 NVFP4 + reasoning split

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on Apr 30, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split
* tools: drop unreferenced create_test_mxfp4.py

(Message body as in the Apr 27 entry above.)
kekzl added a commit that referenced this pull request on May 1, 2026
… llm-compressor are shipped

Two stale "open bugs" listed in TODO.md were already resolved on main:
- Qwen3.6-NVFP4 partial coherence: shipped via PR #81 ("six Qwen3.5/3.6-NVFP4 SafeTensors loader bugs"). Per-layer correlation vs GGUF Q4_K_M is ≥0.997 across all 40 layers; output matches the GGUF oracle. Verified that the arch_norm_offset / gdn_grouped_head_layout / partial_rotary_factor / nested rope_parameters / A_log -exp transform / fp32_scan y_buf fixes are all on main.
- llm-compressor Phase 2 Item 2 (Gemma-4 NVFP4): shipped via PR #65. Root cause was per-row gemv_nvfp4_kpar at Gemma-4 expert dimensions (small N=704 K=2816); routed through gemm_nvfp4 dequant→cuBLAS for M>1, decode M=1 path unchanged. Coherent end-to-end at ~34 tok/s.

CHANGELOG.md entries for both PRs expanded to enumerate the actual fixes shipped instead of restating the (now-obsolete) "still incoherent" claim. Memos `qwen36_nvfp4_decode_partial_2026_04_30.md` and `llm_compressor_phase2_item2_2026_04_26.md` already carry SHIPPED / RESOLVED markers in their own headers — TODO.md was the lagging artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 2 Item 2 of the llm-compressor NVFP4 loader backlog — emits the 90 extra Gemma-4 scaling tensors and fixes a missing-norm bug, two latent NVFP4 dispatch bugs, and a Gemma-4 head_dim buffer-sizing bug.
Depends on #64 (Phase 2 Item 1 — Mistral3). PR base is set to that branch; will rebase to `main` once #64 lands.
Commits
1. `a791e95` — emit Gemma-4 NVFP4 scaling extras
Phase 1 skipped the 90 extras (`.layer_scalar`, `router.per_expert_scale`, `router.scale`) — output was `Pac` (immediate degenerate stop). Flips SKIP rules to pass-through emission. Names already match `weight_map.cpp`'s Gemma-4 routing.
2. `79c7659` — route Gemma-4 base FFN norms
Each Gemma-4 layer has 6 FFN-related layernorms. Phase-1 weight_map only routed the _1 / _2 parallel-branch variants; the base `pre_feedforward_layernorm` / `post_feedforward_layernorm` (= GGUF `ffn_norm` / `post_ffw_norm`) were dropped → FFN ran without normalization.
3. `8982d5a` — diagnostic env vars
`IMP_NO_PDL=1`, `IMP_MOE_ZERO_WORKSPACE=1` — used to rule out PDL races and uninit reads from MoE staging. Both showed bug is computational, not concurrency.
4. `ca05a45` — fix two latent NVFP4 bugs in weight_dispatch.cu
Phase-1 NVFP4 audit (commit f5f4a1a, "K must be packed*2") missed `weight_dispatch.cu`'s two `case StorageTier::NVFP4:` arms. Both had identical bugs: `tmp.K = w.shape[1]` (packed not logical), and `cudaMemcpyDeviceToHost` on a HOST tensor_scale pointer. Side effect on Mistral: argmax for "The capital of France is" flipped from "the largest city in France" → "the capital of France".
5. `57cda8e` — Gemma-4 scalar head_dim = max(per-layer)
Layer-diff bisect of GGUF Q8_0 (works) vs SafeTensors NVFP4 (broken):
Q,K,V essentially identical going into FMHA, but the attention compute amplifies noise. Two compounding causes:
- `cfg.head_dim` = 256, not 512 (FIXED here). Gemma-4 has dual head_dim (256 SWA / 512 global). The HF config loader was leaving `cfg.head_dim` at the SWA-only value while building the per-layer array correctly. KV cache + attention workspace were sized for hd=256, so the global layers' 8192-wide Q output overflows the 4096-wide buffer into adjacent slots. The GGUF loader has the same `max_hd` fixup; HF was missing it.
- `ld_q`/`ld_k` mismatch with buffer stride (NOT fixed — separate follow-up). `attention_cublas.cu` computes `int ld_q = n_heads * head_dim;` from per-layer values (e.g. 16*256 = 4096 for L0, an SWA layer) but the buffer stride is `n_heads * cfg.head_dim` = 16*512 = 8192. cuBLAS strided batched GEMM reads token N's row from offset `N * ld_q`, not `N * stride` → token 1+ rows are read from inside token 0's padding. GGUF survives because its padding happens to be zero (luck); NVFP4 has stale workspace data there → garbage attention output. The fix needs `ld_q = Q.stride[0]` in `attention_cublas_prefill` plus the same change in `naive_attention_prefill_kernel`. Estimated 2-4 hours.
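A minimal sketch of that follow-up, assuming an illustrative Tensor stand-in (only `Q.stride[0]` as the source of the leading dimension comes from the description above):

```cpp
// Sketch only: take the cuBLAS leading dimension from the buffer's actual row
// stride (padded to n_heads * max head_dim) instead of recomputing it from the
// per-layer geometry. TensorSketch stands in for the real Tensor type.
struct TensorSketch { long stride[2]; /* [row stride, element stride] */ };

inline int leading_dim_for_q(const TensorSketch& Q, int n_heads, int head_dim) {
    // Buggy form, for contrast:  ld_q = n_heads * head_dim  (16*256 = 4096 on an SWA layer)
    // Correct form: the allocation's row stride, e.g. 16*512 = 8192 once cfg.head_dim = max.
    (void)n_heads; (void)head_dim;
    return static_cast<int>(Q.stride[0]);
}
```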
Status
Loading + dispatch: works. test-core 26/26, test-e2e 4/4 (LoadsWithoutIMA, GeneratesNonEmptyOutput, MistralSmall, Modelopt regression).
Mistral-Small NVFP4: improved. "Paris. It is the capital of France..." (was "the largest city in France"). No regression.
Gemma-4 NVFP4 coherence: NOT recovered. Strictly more correct than the Phase-1 baseline (output is no longer immediate-stop `Pac`, scales+norms applied, dispatch math fixed, head_dim correct) — but still incoherent due to the open ld/stride mismatch. The bisect localizes the remaining bug to FMHA cuBLAS stride handling.
Phase 2 backlog status
🤖 Generated with Claude Code