
runtime: clear diagnostic for MXFP4 FP16-fallback VRAM oversubscription#60

Merged
kekzl merged 4 commits into main from fix/mxfp4-vram-oversubscribe-diag on Apr 25, 2026
Conversation

@kekzl (Owner) commented Apr 25, 2026

Summary

Qwen3.5-27B-mxfp4 GDN hit a confusing illegal-memory-access at the first decode forward on a 32 GiB GPU. Root cause: the MXFP4 → FP16 decode fallback (needed because GDN forward kernels read weights directly, bypassing the MXFP4 dispatcher) tried to allocate ~48 GiB of FP16 weights while ~12 GiB of raw MXFP4 data was already resident, for a total of ~60 GiB against 32 GiB of VRAM. On WSL2/WDDM `cudaMalloc` happily pages over the device boundary, but cuBLASLt then fails to obtain its internal workspace at runtime → status 14 (CUBLAS_STATUS_INTERNAL_ERROR), followed by an IMA in the next `cudaMemcpyAsync`.

This PR doesn't make a 27B MXFP4 model fit on 32 GiB — that needs structural changes (host-side dequant during upload, freeing MXFP4 raw after replace). It surfaces the failure clearly so users can diagnose it in seconds instead of minutes.

Changes

  1. Pre-flight VRAM check before the bulk MXFP4→FP16 alloc in `pre_dequant_weights`. If the request would leave less than 2 GiB headroom, log a "model too large" error and skip the alloc. `IMP_MXFP4_FP16_FALLBACK=force` bypasses for debugging.

  2. cublasGemmEx fallback now checks return status in `gemm_cublaslt_generic`. When both cublasLt and the GemmEx fallback fail with status 14 on the same shape, log an ERROR with dtypes and dimensions instead of silently leaving garbage in the output buffer and IMAing in the next kernel.

Diagnostic output (Qwen3.5-27B-mxfp4 on RTX 5090)

Before:
```
[ERROR] cudaMemcpyAsync(...) — illegal memory access was encountered
```

After:
```
[ERROR] MXFP4 FP16 fallback would oversubscribe VRAM
(need 47.7 GiB + 2.0 GiB runtime headroom, 9.4 GiB free).
Model is too large for this GPU with the FP16 decode fallback.
Use a smaller quant or a smaller model.
[WARN] gemm: cublasLtMatmul failed (status 14) M=5 K=6144 N=5120 ...
[ERROR] gemm: cublasGemmEx fallback also failed (status 14) M=5 K=6144 N=5120
dtA=2 dtB=2 dtC=2. Output buffer is garbage; expect downstream IMA.
[ERROR] cudaMemcpyAsync(...) — illegal memory access was encountered
```

The user now sees the causal chain in order: oversubscribed VRAM → cuBLAS rejection → IMA.

Test plan

  • `imp-tests --gtest_brief=1`: 79/94 pass (15 skipped, model-dependent — same as main)
  • Qwen3.5-4B Q8_0 GDN: coherent output, tg=170 tok/s
  • Qwen3-4B MXFP4: coherent ("Paris.")
  • Qwen3.6-35B Q4_K_M: coherent (no FP16 fallback path triggered)
  • Qwen3.5-27B-mxfp4: now logs the precise oversubscription error before the IMA

🤖 Generated with Claude Code

kekzl and others added 4 commits April 25, 2026 17:46
upload_weight() for gdn_alpha/gdn_beta ran with raw_quant=false, which
host-dequants Q8_0 → FP16 and overwrites weight.data, but never updates
L.gdn_*_qtype. gemm_dispatch then saw qtype=Q8_0 and re-interpreted the
FP16 bytes as Q8_0 blocks → ~80× too-large α/β projection → g_t collapses
to 0/1, recurrent state decays in 2-3 tokens, output degenerates to
" the the the …".

Drop the _RAW(false) carve-out so α/β upload like every other Q8_0 weight.

Verified Qwen3.5-4B Q8_0 (pp=13676, tg=220 tok/s, coherent), Qwen3.5-9B
Q8_0 (pp=9483, tg=140 tok/s, coherent), Qwen3.6-35B Q4_K_M (no regression).
79/94 tests pass, 4/4 DegenerationTest pass, 2/2 GDNModelTest pass with
IMP_TEST_MODEL_GDN=Qwen3.5-4B-Q8_0.gguf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
upload_embeddings_and_output() already updated tok_emb_qtype after a host
Q8_0/Q6_K → FP16 dequant, but out_proj_qtype was passed by value and the
matching update was missing. For models with separate (untied) out_proj
and a quant type outside {Q6_K, Q8_0, Q4_0} (e.g. Q4_K, Q5_K, Q4_1, Q2_K,
Q3_K, MXFP4), the host dequant path would silently leave qtype pointing
at the original block format, mirroring the GDN α/β bug fixed in the
parent commit.

Take out_proj_qtype by reference and apply the same F16 update branch.
Most common GGUF schemes (Q4_K_M, Q5_K_M) use Q6_K for the output tensor
and weight-tied models bypass this path entirely, so no real-world model
in the repo currently triggers the issue — the change is defensive.

Tests: full GDN + Degeneration + EndToEnd + Embedding suites pass
(14/19 with model-dependent skips); Qwen3.5-4B/9B Q8_0 still coherent;
Qwen3.6-35B Q4_K_M still emits "Paris."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5-27B-mxfp4 hit a confusing illegal-memory-access at the first
decode forward on a 32 GiB RTX 5090. Root cause: the MXFP4 → FP16 decode
fallback (needed for GDN models, where forward kernels read weights
directly without going through the MXFP4 dispatcher) tried to allocate
~48 GiB of FP16 weights while ~12 GiB of MXFP4 raw was already resident,
total 60 GiB > 32 GiB VRAM. On WSL2/WDDM cudaMalloc happily pages over
the device boundary, but cuBLASLt then fails to allocate its internal
workspace at runtime → status 14 (INTERNAL_ERROR) followed by an IMA in
the next cudaMemcpyAsync.

Two defensive changes:

1. Pre-flight VRAM check before the bulk MXFP4→FP16 alloc. If the
   request would leave less than 2 GiB headroom for cuBLAS workspaces
   and KV writes, log a clear "model too large for this GPU" error and
   skip the alloc. Set IMP_MXFP4_FP16_FALLBACK=force to bypass for
   debugging (still IMAs, but with an informed user).

2. Check the cublasGemmEx fallback return code in gemm_cublaslt_generic.
   When both cublasLt and the GemmEx fallback fail (now observed on the
   same shape that triggered the original IMA), surface that as an
   ERROR log including dtypes and dimensions instead of silently
   leaving garbage in the output and IMAing in the next kernel.

Tests: full suite 79/94 pass (15 model-dependent skips, same as before).
Qwen3.5-4B Q8_0 GDN, Qwen3-4B MXFP4, Qwen3.6-35B Q4_K_M still emit
coherent output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The comment claimed "GDN forward reads weights directly, not through
gemm_dispatch" — true at PR #8 time, but no longer accurate after the
GDN refactor moved every linear projection (ssm_in, ssm_out, gdn_gate,
gdn_alpha, gdn_beta) through gemm_dispatch. The scan kernel itself only
touches FP32 tensors (A_log, dt_bias, conv1d, ssm_norm) which are never
quantized to MXFP4.

The actual reason the GDN fallback is preserved is more painful: when
`mxfp4_gemv_available` is forced on for GDN (skipping the 48 GiB FP16
fallback that doesn't fit on 32 GiB GPUs), cuBLASLt returns status 14
(INTERNAL_ERROR) on the MXFP4 prefill dispatch for a number of GDN-shape
weights — notably Qwen3.5-27B's ssm_out at K=6144 N=5120 and FFN at
K=17408 N=5120 — and an IMA cascades through every subsequent
cudaMemcpyAsync. Tracking + suspect list in
qwen35_27b_mxfp4_ima_2026_04_25.md. Until that's fixed, leave the
historical fallback path active and update the comment so future
readers don't waste time chasing the wrong mental model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 0596ee4 into main Apr 25, 2026
2 checks passed
@kekzl kekzl deleted the fix/mxfp4-vram-oversubscribe-diag branch April 25, 2026 22:54
kekzl added a commit that referenced this pull request Apr 27, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split

Two coupled fixes from PR #62 (569af55) that apply cleanly to current main.
The third change in that commit — dropping the GDN→FP16 fallback in
executor_pre_dequant.cu — is *intentionally skipped* here: PR #60 added a
diagnostic comment on main explaining that the cuBLAS INTERNAL_ERROR (status 14)
cascade for GDN-shape MXFP4 weights is still not root-caused, so the `has_gdn`
guard stays. Qwen3.5-27B-mxfp4 remains incoherent for an unrelated reason
per the original PR #62 author's notes, so dropping the guard wouldn't
unblock working usage today.

1. **attention_cublas.cu** — Use the FP32 S-matrix whenever the scratch
   buffer fits, not just when scale==1.0 (Gemma-4). The FP16-S path can
   accumulate enough round-off across deep layers to NaN at attention
   layers in the second half of the model. The scale==1.0 gate only
   triggered on Gemma-4 by accident; Qwen3.5-27B-mxfp4 (head_dim=512,
   scale=1/sqrt(512)) hits the same FP16 round-off → all-NaN attention
   output at L59 → garbage logits.

2. **executor_attention.cu (a)** — Pass the full `attn_scores_` tensor
   (capacity = max seq_len^2) to `attention_cublas_prefill` instead of
   constructing a sub-view sized for the current n. The FP32-fits check
   inside attention_cublas reads the passed S tensor's shape, not the
   underlying buffer's capacity, so the sub-view always made FP32-fits
   evaluate false even when the real allocation was 200× larger. With
   the full buffer passed, the FP32-S path actually fires for prefill
   shapes that fit in the buffer.

3. **executor_attention.cu (b)** — Detect attn_q_norm/attn_k_norm with
   norm_dim < head_dim and reshape Q/K accordingly. Qwen3.5-27B-mxfp4
   ships with head_dim=512 for Q, head_dim=256 for K, and a single
   (256,) norm weight intended to be applied per 256-dim chunk along the
   head. The previous code reshaped Q to [n*nh, hd=512] and called
   rmsnorm with the (256,) weight — reading 256 elements past the end
   of the weight buffer and producing the L59 attention NaN that
   poisoned the rest of the forward pass. New `split_norm_dim` lambda
   computes `norm_dim` from the weight's shape and reshapes the Q/K
   view to `[n*nh*(hd/norm_dim), norm_dim]`. No-op when
   norm_dim == hd (the common case).

Verified clean cherry-pick with no overlap against the recently-merged
PR #65 (server reasoning_content split + Gemma-4 NVFP4 multi-turn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools: drop unreferenced create_test_mxfp4.py

Last MXFP4 converter remnant. The production convert_mxfp4.py was already
removed in PR #8 (e93deff). create_test_mxfp4.py only generated a tiny
synthetic GGUF for ad-hoc loader testing and is referenced by no test,
build target, doc, or script. Drop it to finish the converter cleanup.

MXFP4 input format is now exclusively third-party GGUFs (Unsloth,
bartowski, gpt-oss); own quantization pipeline is NVFP4 via NVIDIA
Model Optimizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request Apr 30, 2026

* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split
* tools: drop unreferenced create_test_mxfp4.py

Same two commit messages as the Apr 27 entry above, re-applied with commit hashes 8f8d2b3 and 5b18d2e in place of 569af55 and e93deff.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>