runtime: clear diagnostic for MXFP4 FP16-fallback VRAM oversubscription #60
Merged
Conversation
upload_weight() for gdn_alpha/gdn_beta ran with raw_quant=false, which host-dequants Q8_0 → FP16 and overwrites weight.data, but never updates L.gdn_*_qtype. gemm_dispatch then saw qtype=Q8_0 and re-interpreted the FP16 bytes as Q8_0 blocks → ~80× too-large α/β projection → g_t collapses to 0/1, the recurrent state decays within 2-3 tokens, and output degenerates to " the the the …".

Fix: drop the _RAW(false) carve-out so α/β upload like every other Q8_0 weight.

Verified: Qwen3.5-4B Q8_0 (pp=13676, tg=220 tok/s, coherent), Qwen3.5-9B Q8_0 (pp=9483, tg=140 tok/s, coherent), Qwen3.6-35B Q4_K_M (no regression). 79/94 tests pass, 4/4 DegenerationTest pass, 2/2 GDNModelTest pass with IMP_TEST_MODEL_GDN=Qwen3.5-4B-Q8_0.gguf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
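A minimal sketch of the invariant this commit restores; the type and helper names below are hypothetical stand-ins (the real upload_weight takes more arguments), but the point is that a host dequant must update the recorded qtype in the same step it rewrites the bytes:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the runtime's real types and helpers.
enum class QType { Q8_0, F16 };

struct Weight {
    std::vector<uint8_t> data;  // stands in for the device buffer
    QType qtype;                // format gemm_dispatch will assume at dispatch time
};

// Placeholder for the host Q8_0 -> FP16 dequant (real conversion omitted).
static std::vector<uint8_t> dequant_q8_0_to_fp16(const std::vector<uint8_t>& blocks) {
    return std::vector<uint8_t>(blocks.size() * 2);
}

void upload_weight(Weight& w, const std::vector<uint8_t>& host_blocks, bool raw_quant) {
    if (!raw_quant && w.qtype == QType::Q8_0) {
        w.data  = dequant_q8_0_to_fp16(host_blocks);
        w.qtype = QType::F16;  // the missing update: metadata must follow the bytes,
                               // or the FP16 payload is later re-read as Q8_0 blocks
    } else {
        w.data = host_blocks;  // raw path: bytes and recorded qtype stay in agreement
    }
}
```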
upload_embeddings_and_output() already updated tok_emb_qtype after a host
Q8_0/Q6_K → FP16 dequant, but out_proj_qtype was passed by value and the
matching update was missing. For models with separate (untied) out_proj
and a quant type outside {Q6_K, Q8_0, Q4_0} (e.g. Q4_K, Q5_K, Q4_1, Q2_K,
Q3_K, MXFP4), the host dequant path would silently leave qtype pointing
at the original block format, mirroring the GDN α/β bug fixed in the
parent commit.
Take out_proj_qtype by reference and apply the same F16 update branch.
Most common GGUF schemes (Q4_K_M, Q5_K_M) use Q6_K for the output tensor
and weight-tied models bypass this path entirely, so no real-world model
in the repo currently triggers the issue — the change is defensive.
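A minimal sketch of the by-value pitfall and the fix, with hypothetical signatures standing in for the real upload path:

```cpp
// Passing the qtype by value meant the F16 update mutated a local copy
// that was discarded on return. Taking it by reference makes the update
// visible to the caller, matching what tok_emb_qtype already did.
enum class QType { Q6_K, Q8_0, Q4_0, Q4_K, F16 };

void upload_output(QType& out_proj_qtype, bool host_dequanted_to_f16) {
    if (host_dequanted_to_f16) {
        // Same branch tok_emb_qtype already had: the bytes on device are now
        // FP16, so the recorded format must follow.
        out_proj_qtype = QType::F16;
    }
}
```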
Tests: full GDN + Degeneration + EndToEnd + Embedding suites pass
(14/19 with model-dependent skips); Qwen3.5-4B/9B Q8_0 still coherent;
Qwen3.6-35B Q4_K_M still emits "Paris."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5-27B-mxfp4 hit a confusing illegal-memory-access at the first decode forward on a 32 GiB RTX 5090. Root cause: the MXFP4 → FP16 decode fallback (needed for GDN models, where forward kernels read weights directly without going through the MXFP4 dispatcher) tried to allocate ~48 GiB of FP16 weights while ~12 GiB of MXFP4 raw was already resident, total 60 GiB > 32 GiB VRAM. On WSL2/WDDM cudaMalloc happily pages over the device boundary, but cuBLASLt then fails to allocate its internal workspace at runtime → status 14 (INTERNAL_ERROR) followed by an IMA in the next cudaMemcpyAsync.

Two defensive changes (a sketch of the first follows this message):

1. Pre-flight VRAM check before the bulk MXFP4→FP16 alloc. If the request would leave less than 2 GiB headroom for cuBLAS workspaces and KV writes, log a clear "model too large for this GPU" error and skip the alloc. Set IMP_MXFP4_FP16_FALLBACK=force to bypass for debugging (still IMAs, but with an informed user).

2. Check the cublasGemmEx fallback return code in gemm_cublaslt_generic. When both cublasLt and the GemmEx fallback fail (now observed on the same shape that triggered the original IMA), surface that as an ERROR log including dtypes and dimensions instead of silently leaving garbage in the output and IMAing in the next kernel.

Tests: full suite 79/94 pass (15 model-dependent skips, same as before). Qwen3.5-4B Q8_0 GDN, Qwen3-4B MXFP4, Qwen3.6-35B Q4_K_M still emit coherent output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
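A minimal sketch of the pre-flight check in change 1. Only cudaMemGetInfo and the IMP_MXFP4_FP16_FALLBACK flag are taken from this commit; the function name and surrounding plumbing are hypothetical:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Returns false (and logs) if allocating fp16_bytes would leave less than
// 2 GiB of headroom for cuBLAS workspaces and KV writes.
bool fp16_fallback_fits(size_t fp16_bytes) {
    const char* force = std::getenv("IMP_MXFP4_FP16_FALLBACK");
    if (force && std::strcmp(force, "force") == 0)
        return true;  // debugging override: still IMAs, but the user is informed

    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess)
        return false;

    const size_t headroom = 2ull << 30;  // 2 GiB runtime headroom
    if (fp16_bytes + headroom > free_b) {
        std::fprintf(stderr,
            "[ERROR] MXFP4 FP16 fallback would oversubscribe VRAM "
            "(need %.1f GiB + 2.0 GiB runtime headroom, %.1f GiB free)\n",
            fp16_bytes / double(1ull << 30), free_b / double(1ull << 30));
        return false;  // skip the alloc instead of paging past the device boundary
    }
    return true;
}
```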
The comment claimed "GDN forward reads weights directly, not through gemm_dispatch" — true at PR #8 time, but no longer accurate after the GDN refactor moved every linear projection (ssm_in, ssm_out, gdn_gate, gdn_alpha, gdn_beta) through gemm_dispatch. The scan kernel itself only touches FP32 tensors (A_log, dt_bias, conv1d, ssm_norm), which are never quantized to MXFP4.

The actual reason the GDN fallback is preserved is more painful: when `mxfp4_gemv_available` is forced on for GDN (skipping the 48 GiB FP16 fallback that doesn't fit on 32 GiB GPUs), cuBLASLt returns status 14 (INTERNAL_ERROR) on the MXFP4 prefill dispatch for a number of GDN-shape weights — notably Qwen3.5-27B's ssm_out at K=6144 N=5120 and FFN at K=17408 N=5120 — and an IMA cascades through every subsequent cudaMemcpyAsync. Tracking + suspect list in qwen35_27b_mxfp4_ima_2026_04_25.md.

Until that's fixed, leave the historical fallback path active and update the comment so future readers don't waste time chasing the wrong mental model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on Apr 27, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split

Two coupled fixes from PR #62 (569af55) that apply cleanly to current main. The third change in that commit — dropping the GDN→FP16 fallback in executor_pre_dequant.cu — is *intentionally skipped* here: PR #60 added a diagnostic comment on main explaining that the cuBLAS-INTERNAL_ERROR (status 14) cascade for GDN-shape MXFP4 weights is still unrooted, so the `has_gdn` guard stays. Qwen3.5-27B-mxfp4 remains incoherent for an unrelated reason per the original PR #62 author's notes, so dropping the guard wouldn't unblock working usage today.

1. **attention_cublas.cu** — Use the FP32 S-matrix whenever the scratch buffer fits, not just when scale==1.0 (Gemma-4). The FP16-S path can accumulate enough round-off across deep layers to NaN at attention layers in the second half of the model. The scale==1.0 gate only triggered on Gemma-4 by accident; Qwen3.5-27B-mxfp4 (head_dim=512, scale=1/sqrt(512)) hits the same FP16 round-off → all-NaN attention output at L59 → garbage logits.

2. **executor_attention.cu (a)** — Pass the full `attn_scores_` tensor (capacity = max seq_len^2) to `attention_cublas_prefill` instead of constructing a sub-view sized for the current n. The FP32-fits check inside attention_cublas reads the passed S tensor's shape, not the underlying buffer's capacity, so the sub-view always made FP32-fits evaluate false even when the real allocation was 200× larger. With the full buffer passed, the FP32-S path actually fires for prefill shapes that fit in the buffer.

3. **executor_attention.cu (b)** — Detect attn_q_norm/attn_k_norm with norm_dim < head_dim and reshape Q/K accordingly (see the sketch after this message). Qwen3.5-27B-mxfp4 ships with head_dim=512 for Q, head_dim=256 for K, and a single (256,) norm weight intended to be applied per 256-dim chunk along the head. The previous code reshaped Q to [n*nh, hd=512] and called rmsnorm with the (256,) weight — reading 256 elements past the end of the weight buffer and producing the L59 attention NaN that poisoned the rest of the forward pass. New `split_norm_dim` lambda computes `norm_dim` from the weight's shape and reshapes the Q/K view to `[n*nh*(hd/norm_dim), norm_dim]`. No-op when norm_dim == hd (the common case).

Verified clean cherry-pick with no overlap against the recently-merged PR #65 (server reasoning_content split + Gemma-4 NVFP4 multi-turn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools: drop unreferenced create_test_mxfp4.py

Last MXFP4 converter remnant. The production convert_mxfp4.py was already removed in PR #8 (e93deff). create_test_mxfp4.py only generated a tiny synthetic GGUF for ad-hoc loader testing and is referenced by no test, build target, doc, or script. Drop it to finish the converter cleanup. MXFP4 input format is now exclusively third-party GGUFs (Unsloth, bartowski, gpt-oss); own quantization pipeline is NVFP4 via NVIDIA Model Optimizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
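A minimal sketch of the norm_dim split from point 3, with hypothetical names (the real `split_norm_dim` is a lambda over tensor views rather than a function returning a shape struct):

```cpp
#include <cassert>
#include <cstddef>

// Q is laid out [n_tokens * n_heads, head_dim]; if the RMS-norm weight covers
// only a head_dim/k chunk, each chunk must be normalized independently.
struct View2D { size_t rows, cols; };

View2D split_norm_view(size_t n_tokens, size_t n_heads,
                       size_t head_dim, size_t norm_weight_len) {
    assert(head_dim % norm_weight_len == 0);  // e.g. 512 % 256 == 0
    size_t norm_dim = norm_weight_len;        // taken from the weight's own shape
    // No-op when norm_dim == head_dim (the common case). For a (256,) weight
    // and head_dim=512 this doubles the row count instead of letting rmsnorm
    // read 256 elements past the end of the weight buffer.
    return View2D{ n_tokens * n_heads * (head_dim / norm_dim), norm_dim };
}
```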
kekzl added a commit that referenced this pull request on Apr 30, 2026
kekzl added a commit that referenced this pull request on Apr 30, 2026
Summary
Qwen3.5-27B-mxfp4 GDN hit a confusing illegal-memory-access at the first decode forward on a 32 GiB GPU. Root cause: the MXFP4 → FP16 decode fallback (needed because GDN forward kernels read weights directly, bypassing the MXFP4 dispatcher) tried to allocate ~48 GiB of FP16 weights while ~12 GiB of MXFP4 raw was already resident → total 60 GiB > 32 GiB VRAM. On WSL2/WDDM `cudaMalloc` happily pages over the device boundary, but cuBLASLt then fails to allocate its internal workspace at runtime → status 14 (INTERNAL_ERROR) followed by an IMA in the next `cudaMemcpyAsync`.
This PR doesn't make a 27B MXFP4 model fit on 32 GiB — that needs structural changes (host-side dequant during upload, freeing MXFP4 raw after replace). It surfaces the failure clearly so users can diagnose it in seconds instead of minutes.
Changes
- **Pre-flight VRAM check** before the bulk MXFP4→FP16 alloc in `pre_dequant_weights`. If the request would leave less than 2 GiB headroom, log a "model too large" error and skip the alloc. `IMP_MXFP4_FP16_FALLBACK=force` bypasses for debugging.
- **cublasGemmEx fallback now checks its return status** in `gemm_cublaslt_generic`. When both cublasLt and the GemmEx fallback fail with status 14 on the same shape, log an ERROR with dtypes and dimensions instead of silently leaving garbage in the output buffer and IMAing in the next kernel.
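A minimal sketch of the checked fallback from the second change. cublasGemmEx and the status enum are the real cuBLAS API; the function name, fixed FP16 dtypes, and layout are illustrative, and the cublasLtMatmul attempt is elided:

```cpp
#include <cstdio>
#include <cublas_v2.h>

// Column-major GEMM: A is MxK, B is KxN, C is MxN, all FP16, FP32 accumulate.
// Returns false instead of letting a garbage C buffer flow into the next kernel.
bool gemm_with_checked_fallback(cublasHandle_t h, int M, int N, int K,
                                const void* A, const void* B, void* C) {
    // ... cublasLtMatmul attempt elided; assume it failed and we fell back ...
    const float alpha = 1.f, beta = 0.f;
    cublasStatus_t st = cublasGemmEx(
        h, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
        &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
        &beta,  C, CUDA_R_16F, M,
        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    if (st != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr,
            "[ERROR] gemm: cublasGemmEx fallback also failed (status %d) "
            "M=%d K=%d N=%d. Output buffer is garbage; expect downstream IMA.\n",
            int(st), M, K, N);
        return false;  // caller can abort the forward pass with a clear cause
    }
    return true;
}
```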
Diagnostic output (Qwen3.5-27B-mxfp4 on RTX 5090)
Before:
```
[ERROR] cudaMemcpyAsync(...) — illegal memory access was encountered
```
After:
```
[ERROR] MXFP4 FP16 fallback would oversubscribe VRAM
(need 47.7 GiB + 2.0 GiB runtime headroom, 9.4 GiB free).
Model is too large for this GPU with the FP16 decode fallback.
Use a smaller quant or a smaller model.
[WARN] gemm: cublasLtMatmul failed (status 14) M=5 K=6144 N=5120 ...
[ERROR] gemm: cublasGemmEx fallback also failed (status 14) M=5 K=6144 N=5120
dtA=2 dtB=2 dtC=2. Output buffer is garbage; expect downstream IMA.
[ERROR] cudaMemcpyAsync(...) — illegal memory access was encountered
```
The user now sees the failure chain in order: oversubscribed VRAM → cuBLAS rejection → IMA.
Test plan
🤖 Generated with Claude Code