feat(nemotron): support NemotronHForCausalLM hybrid Mamba2+MoE+Attention NVFP4 models #104
Merged
Wires up the full load+dispatch chain for Nemotron-3-Nano-30B-A3B-NVFP4 and
similar nemotron_h_moe-arch checkpoints. Without these changes the model crashes
at the first decode step with a prefill memcpy IMA (illegal memory access) plus an
lm_head cuBLAS status=13 failure.
Three pieces:
1. hf_config_loader.cpp: register NemotronHForCausalLM/nemotron_h in arch
maps; parse mamba/MoE-specific config fields (mamba_head_dim,
mamba_num_heads, n_groups, ssm_state_size, conv_kernel,
n_routed_experts, n_shared_experts, moe_shared_expert_intermediate_size,
   routed_scaling_factor, norm_topk_prob); decode hybrid_override_pattern
   ("MEMEM*EM..." with M=Mamba2/E=MoE/*=Attn) into n_kv_heads_per_layer
   (decode sketch after this list).
2. weight_map.cpp: Nemotron-H name normalizer translating
backbone.embeddings/norm_f → model.embed_tokens/norm and
backbone.layers.N.mixer.<sub>.* → model.layers.N.{self_attn|mamba|mlp}.*
(dispatched by sub: q/k/v/o_proj→self_attn, in/out_proj+conv1d+
   A_log+D+dt_bias+norm→mamba, experts/gate/shared_experts→mlp; normalizer
   sketch after this list). Also adds NVFP4 prequant scale routing
   (weight_scale/weight_scale_2/input_scale) for mamba.in_proj/out_proj.
3. executor_pre_dequant.cu: extend resolve() in promote() to map
"L<i>.ssm_in"/"L<i>.ssm_out" scratch keys back to L.ssm_in/ssm_out
so SSM NVFP4 scales actually attach to the tensor sidecars.
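For piece 1, a minimal sketch of the hybrid_override_pattern decode; LayerKind, decode_hybrid_pattern, and kv_heads_per_layer are illustrative names, not the actual hf_config_loader.cpp symbols:

```cpp
// Illustrative only: maps each character of hybrid_override_pattern to a
// per-layer kind, mirroring M=Mamba2, E=MoE, *=Attention described above.
#include <stdexcept>
#include <string>
#include <vector>

enum class LayerKind { Mamba2, MoE, Attention };

std::vector<LayerKind> decode_hybrid_pattern(const std::string& pattern) {
    std::vector<LayerKind> kinds;
    kinds.reserve(pattern.size());
    for (char c : pattern) {
        switch (c) {
            case 'M': kinds.push_back(LayerKind::Mamba2); break;
            case 'E': kinds.push_back(LayerKind::MoE); break;
            case '*': kinds.push_back(LayerKind::Attention); break;
            default:  throw std::runtime_error("unknown layer code in hybrid_override_pattern");
        }
    }
    return kinds;
}

// Only attention layers carry KV heads; Mamba2/MoE layers get 0, which is the
// kind of per-layer table n_kv_heads_per_layer describes (assumed convention).
std::vector<int> kv_heads_per_layer(const std::vector<LayerKind>& kinds, int n_kv_heads) {
    std::vector<int> out;
    out.reserve(kinds.size());
    for (LayerKind k : kinds)
        out.push_back(k == LayerKind::Attention ? n_kv_heads : 0);
    return out;
}
```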
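For piece 2, a sketch of the name-normalization rule; normalize_nemotron_h_name is a hypothetical helper and the real weight_map.cpp dispatch may differ in detail:

```cpp
// Illustrative translation of Nemotron-H checkpoint names to the internal layout:
//   backbone.embeddings        -> model.embed_tokens
//   backbone.norm_f            -> model.norm
//   backbone.layers.N.mixer.X  -> model.layers.N.{self_attn|mamba|mlp}.X
#include <regex>
#include <string>

std::string normalize_nemotron_h_name(const std::string& name) {
    if (name.rfind("backbone.embeddings", 0) == 0)
        return "model.embed_tokens" + name.substr(std::string("backbone.embeddings").size());
    if (name.rfind("backbone.norm_f", 0) == 0)
        return "model.norm" + name.substr(std::string("backbone.norm_f").size());

    static const std::regex layer_re(R"(^backbone\.layers\.(\d+)\.mixer\.(.+)$)");
    std::smatch m;
    if (std::regex_match(name, m, layer_re)) {
        const std::string idx = m[1], sub = m[2];
        std::string group = "mamba";  // default: in/out_proj, conv1d, A_log, D, dt_bias, norm
        if (sub.rfind("q_proj", 0) == 0 || sub.rfind("k_proj", 0) == 0 ||
            sub.rfind("v_proj", 0) == 0 || sub.rfind("o_proj", 0) == 0)
            group = "self_attn";
        else if (sub.rfind("experts", 0) == 0 || sub.rfind("gate", 0) == 0 ||
                 sub.rfind("shared_experts", 0) == 0)
            group = "mlp";
        return "model.layers." + idx + "." + group + "." + sub;
    }
    return name;  // already in canonical form
}
```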
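And for piece 3, the scratch-key remap in resolve() amounts to collapsing the layer index out of the key; resolve_scratch_key is an illustrative stand-in for the actual code:

```cpp
// Illustrative remap: scratch keys "L<i>.ssm_in"/"L<i>.ssm_out" collapse back
// to the canonical names so the NVFP4 scale sidecars attach to the tensors.
#include <regex>
#include <string>

std::string resolve_scratch_key(const std::string& key) {
    static const std::regex ssm_re(R"(^L\d+\.(ssm_in|ssm_out)$)");
    std::smatch m;
    if (std::regex_match(key, m, ssm_re))
        return "L." + m[1].str();   // e.g. "L17.ssm_in" -> "L.ssm_in"
    return key;
}
```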
Validation: scripts/validate_safetensors.py gains a Nemotron-3-Nano entry.
Smoke test passes phases 0/3/5/6 (load + 32x byte-identical graph replay +
logit health + determinism + decode at 37-70 tok/s).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build: switch sm_120f → sm_120a target for full RTX 5090 feature set
The original CMakeLists used sm_120f as a workaround for a ptxas C7600 error on
sm_120a. On CUDA 13.2.1 that bug no longer reproduces: clean build, zero ptxas errors.
sm_120a unlocks the full SM120 feature set:
- mma.sync.aligned.kind::mxf4nvf4.block_scale (block-scaled FP4 MMA)
- extended cp.async.bulk.tensor modes (TMA multicast)
- full dynamic SMEM carveout per CTA (the 228-KiB figure is SM100-only; the SM120 per-CTA limit is lower)
- cluster launch with CLC
- extended mbarrier phases
- sparse mxf4nvf4 K=128 MMAs (per ptx_mma_survey)
Note: tcgen05.* / TMEM are SM100-only (server B200) — NOT present on
SM120 (consumer RTX 5090) regardless of arch suffix. Earlier confusion
in code comments has been corrected via memory cross-refs.
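A minimal sketch of how the arch-specific paths can be gated at compile time; ENABLE_SM120A_FEATURES is a hypothetical build-defined flag (set only when targeting sm_120a), not an NVCC-provided macro:

```cpp
// Sketch only: gate kernels that use arch-specific instructions (e.g. the
// block-scaled FP4 MMA) behind the sm_120a build, with a portable fallback.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1200) && defined(ENABLE_SM120A_FEATURES)
#  define USE_BLOCK_SCALED_FP4_MMA 1   // mxf4nvf4 fast path available
#else
#  define USE_BLOCK_SCALED_FP4_MMA 0   // dequant + generic MMA fallback
#endif
```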
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(nvfp4): route SSM in_proj/out_proj through CUTLASS fast-path
Add L.ssm_in / L.ssm_out to the cutlass_nvfp4 cache registration loop in
pre_dequant_weights(). Previously, NVFP4-quantized SSM (Mamba2/GDN)
projections were excluded from this cache and fell back to the dequant-to-FP16
+ cuBLAS slow path in nvfp4_gemm.cu, which added a ~52 MiB scratch allocation
per layer plus a full-weight FP16 round-trip on every GEMM call.
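The shape of the change, assuming a simple per-layer key list drives the registration loop; every key name except ssm_in/ssm_out is hypothetical:

```cpp
#include <string>
#include <vector>

// Illustrative: per-layer NVFP4 tensor keys that get a CUTLASS fast-path cache
// entry in pre_dequant_weights(). The fix is the last two entries; anything not
// listed here falls back to the dequant-to-FP16 + cuBLAS slow path.
static const std::vector<std::string> kNvfp4FastPathKeys = {
    "attn_qkv", "attn_out",   // hypothetical existing attention entries
    "moe_up",   "moe_down",   // hypothetical existing MoE entries
    "ssm_in",   "ssm_out",    // newly registered SSM (Mamba2) projections
};

std::string layer_key(int layer, const std::string& key) {
    return "L" + std::to_string(layer) + "." + key;  // e.g. "L12.ssm_in"
}
```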
Effect for Nemotron-3-Nano-30B-A3B-NVFP4:
CUTLASS NVFP4 cache: 46 → 80 tensors (+34 SSM projections, 66.75 MiB)
300-token prompt: 300s timeout → 2s coherent answer
slow-fallback warn: N× per layer per chunk → 0× per request
Math equivalence: both paths use the same NVFP4-quantized weights with the
same per-block FP8(ue4m3) + per-tensor FP32 scaling. The original exclusion
("4-bit degrades quality on 9B+ models") was about NVFP4-quantizing the
weights at all, not about using CUTLASS vs cuBLAS to compute with them.
Both paths produce numerically equivalent results.
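For reference, a minimal sketch of the shared NVFP4 scaling scheme (per-block FP8 ue4m3 scale times per-tensor FP32 scale); the 16-element block size and the helper signature are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Illustrative NVFP4 dequant: each 4-bit element is scaled by its block's FP8
// (ue4m3) scale and a single per-tensor FP32 scale. CUTLASS consumes the
// quantized form directly; the fallback materializes this result in FP16/FP32.
std::vector<float> dequant_nvfp4(const std::vector<float>& fp4_values,    // already-decoded FP4 codes
                                 const std::vector<float>& block_scales,  // decoded ue4m3 scales, one per block
                                 float tensor_scale,
                                 std::size_t block_size = 16) {           // assumed block size
    std::vector<float> out(fp4_values.size());
    for (std::size_t i = 0; i < fp4_values.size(); ++i)
        out[i] = fp4_values[i] * block_scales[i / block_size] * tensor_scale;
    return out;
}
```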
Known remaining issue: multi-chunk-prefill (>256 token prompts) still hangs
on Nemotron-H — separate SSM-state-handoff bug between prefill chunks,
unrelated to this dispatch fix. Tracked in
docs/sm120-real-perf-plan.md as Lever 1b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- NemotronHForCausalLM / nemotron_h_moe arch in config loader + weight map (mamba_head_dim, mamba_num_heads, n_groups, ssm_state_size, conv_kernel, etc.)
- hf_quant_config.json exclude_modules list for selective BF16 preservation (conv1d, attn, 6 specific Mamba2 in/out_proj)
- Nemotron-3-Nano-30B-A3B-NVFP4 added to scripts/validate_safetensors.py

Status
Single-chunk prefill works: verified 2026-05-04 against
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4. "Capital of France?" → "Paris" in 1.4s, finish=stop, coherent including the <think> block.
Multi-chunk prefill (≥3 chunks, ≥~470 prompt tokens) silently hangs: a separate SSM-state-handoff bug across chunk boundaries, tracked as Lever 1b. Workaround: keep prompts ≤2 chunks (chunk_size=256 → ≤~470 tok).
Decode + simple chat work coherently within those bounds. Without this PR the model crashes at first decode step with IMA + cuBLAS status=13.
Test plan
- imp:verify
- build
- /v1/models lists model

🤖 Generated with Claude Code