
feat(nemotron): support NemotronHForCausalLM hybrid Mamba2+MoE+Attention NVFP4 models #104

Merged
kekzl merged 2 commits into main from feat/nemotron-h-arch on May 4, 2026

Conversation

kekzl (Owner) commented May 4, 2026

Summary

  • Registers NemotronHForCausalLM / nemotron_h_moe arch in config loader + weight map
  • Parses Mamba/MoE config fields (mamba_head_dim, mamba_num_heads, n_groups, ssm_state_size, conv_kernel, etc.)
  • Wires hf_quant_config.json exclude_modules list for selective BF16 preservation (conv1d, attn, 6 specific Mamba2 in/out_proj)
  • Adds Nemotron-3-Nano-30B-A3B-NVFP4 to scripts/validate_safetensors.py
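The exclude_modules wiring above can be sketched as a simple membership test. This is a minimal illustrative sketch, not the C++ implementation; the pattern entries and fnmatch-style wildcard matching are assumptions for illustration:

```python
# Sketch: decide whether a weight stays BF16 because its module is listed in
# hf_quant_config.json's exclude_modules. Pattern names here are hypothetical.
from fnmatch import fnmatch

def keep_bf16(name: str, exclude_modules: list[str]) -> bool:
    """True if the weight should be preserved in BF16 rather than read as NVFP4."""
    return any(fnmatch(name, pat) for pat in exclude_modules)

exclude = ["backbone.layers.*.mixer.conv1d", "backbone.layers.*.mixer.q_proj"]
assert keep_bf16("backbone.layers.3.mixer.conv1d", exclude)
assert not keep_bf16("backbone.layers.3.mixer.in_proj", exclude)
```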

Status

Single-chunk prefill works — verified 2026-05-04 against nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4. "Capital of France?" → "Paris" in 1.4s, finish=stop, coherent including <think> block.

Multi-chunk prefill (≥3 chunks, roughly ≥470 prompt tokens) silently hangs. This is a separate SSM-state-handoff bug across chunk boundaries, tracked as Lever 1b. Workaround: keep prompts to ≤2 chunks (chunk_size=256 → roughly ≤470 tokens).

Decode and simple chat work coherently within those bounds. Without this PR the model crashes at the first decode step with an IMA plus cuBLAS status=13.

Test plan

  • Boot Nemotron-3-Nano-30B-A3B-NVFP4 via imp:verify build
  • /v1/models lists model
  • Sanity prompt ("Capital of France?") returns "Paris" coherently
  • 351 + 461 token prompts complete in 1.5–1.7s
  • Multi-chunk prefill (Lever 1b — separate follow-up)

🤖 Generated with Claude Code

* feat(nemotron): support NemotronHForCausalLM hybrid Mamba2+MoE+Attention NVFP4 models

Wires up the full load+dispatch chain for Nemotron-3-Nano-30B-A3B-NVFP4 and
similar nemotron_h_moe-arch checkpoints. Without these changes the model
crashes at the first decode step with a prefill memcpy IMA plus an lm_head
cuBLAS status=13 error.

Three pieces:

1. hf_config_loader.cpp: register NemotronHForCausalLM/nemotron_h in arch
   maps; parse mamba/MoE-specific config fields (mamba_head_dim,
   mamba_num_heads, n_groups, ssm_state_size, conv_kernel,
   n_routed_experts, n_shared_experts, moe_shared_expert_intermediate_size,
   routed_scaling_factor, norm_topk_prob); decode hybrid_override_pattern
   ("MEMEM*EM..." with M=Mamba2/E=MoE/*=Attn) into n_kv_heads_per_layer.

2. weight_map.cpp: Nemotron-H name normalizer translating
   backbone.embeddings/norm_f → model.embed_tokens/norm and
   backbone.layers.N.mixer.<sub>.* → model.layers.N.{self_attn|mamba|mlp}.*
   (dispatched by sub: q/k/v/o_proj→self_attn, in/out_proj+conv1d+
   A_log+D+dt_bias+norm→mamba, experts/gate/shared_experts→mlp). Also
   adds NVFP4 prequant scale routing (weight_scale/weight_scale_2/
   input_scale) for mamba.in_proj/out_proj.
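The normalizer in step 2 amounts to a prefix rewrite plus a dispatch on the mixer sub-module name. A minimal Python sketch of that mapping (the real code is C++ in weight_map.cpp; sub-module sets are taken from the description above):

```python
import re

ATTN = {"q_proj", "k_proj", "v_proj", "o_proj"}
MAMBA = {"in_proj", "out_proj", "conv1d", "A_log", "D", "dt_bias", "norm"}

def normalize(name: str) -> str:
    # backbone.embeddings / backbone.norm_f -> model.embed_tokens / model.norm
    name = name.replace("backbone.embeddings", "model.embed_tokens")
    name = name.replace("backbone.norm_f", "model.norm")
    m = re.match(r"backbone\.layers\.(\d+)\.mixer\.([A-Za-z0-9_]+)\.?(.*)", name)
    if m:
        layer, sub, rest = m.groups()
        if sub in ATTN:
            group = "self_attn"
        elif sub in MAMBA:
            group = "mamba"
        else:  # experts / gate / shared_experts
            group = "mlp"
        tail = f"{sub}.{rest}" if rest else sub
        return f"model.layers.{layer}.{group}.{tail}"
    return name

assert normalize("backbone.layers.4.mixer.q_proj.weight") == \
    "model.layers.4.self_attn.q_proj.weight"
assert normalize("backbone.layers.4.mixer.conv1d.weight") == \
    "model.layers.4.mamba.conv1d.weight"
assert normalize("backbone.embeddings.weight") == "model.embed_tokens.weight"
```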

3. executor_pre_dequant.cu: extend resolve() in promote() to map
   "L<i>.ssm_in"/"L<i>.ssm_out" scratch keys back to L.ssm_in/ssm_out
   so SSM NVFP4 scales actually attach to the tensor sidecars.
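The resolve() extension in step 3 reduces to stripping the layer index from the scratch key. A hedged sketch (key shapes are taken from the description above; the real function lives in executor_pre_dequant.cu and its signature will differ):

```python
import re

def resolve(key: str) -> str:
    # Map per-layer scratch keys like "L12.ssm_in" back to the canonical
    # "L.ssm_in"/"L.ssm_out" family so NVFP4 scales attach to the sidecars.
    m = re.match(r"L\d+\.(ssm_in|ssm_out)$", key)
    return f"L.{m.group(1)}" if m else key

assert resolve("L12.ssm_in") == "L.ssm_in"
assert resolve("L0.ssm_out") == "L.ssm_out"
assert resolve("L3.q_proj") == "L3.q_proj"  # non-SSM keys pass through
```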

Validate: scripts/validate_safetensors.py adds Nemotron-3-Nano entry.
Smoke test passes phase 0/3/5/6 (load + 32x graph replay byte-identical
+ logit health + determinism + decode 37-70 tok/s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build: switch sm_120f → sm_120a target for full RTX 5090 feature set

The original CMakeLists used sm_120f as a workaround for a ptxas C7600 error
on sm_120a. On CUDA 13.2.1 that bug no longer reproduces: the build is clean
with zero ptxas errors.

sm_120a unlocks the full SM120 feature set:
  - mma.sync.aligned.kind::mxf4nvf4.block_scale (block-scaled FP4 MMA)
  - extended cp.async.bulk.tensor modes (TMA multicast)
  - full 228-KiB SMEM-carveout per CTA
  - cluster launch with CLC
  - extended mbarrier phases
  - sparse mxf4nvf4 K=128 MMAs (per ptx_mma_survey)

Note: tcgen05.* / TMEM are SM100-only (server B200) — NOT present on
SM120 (consumer RTX 5090) regardless of arch suffix. Earlier confusion
in code comments has been corrected via memory cross-refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(nvfp4): route SSM in_proj/out_proj through CUTLASS fast-path

Add L.ssm_in / L.ssm_out to the cutlass_nvfp4 cache registration loop in
pre_dequant_weights(). Previously, NVFP4-quantized SSM (Mamba2/GDN)
projections were excluded from this cache and fell back to the
dequant-to-FP16 + cuBLAS slow path in nvfp4_gemm.cu, adding ~52 MiB of
scratch allocation per layer plus a full-weight FP16 round-trip per GEMM call.

Effect for Nemotron-3-Nano-30B-A3B-NVFP4:
  CUTLASS NVFP4 cache: 46 → 80 tensors (+34 SSM projections, 66.75 MiB)
  300-token prompt:    300s timeout → 2s coherent answer
  slow-fallback warn:  N× per layer per chunk → 0× per request

Math equivalence: both paths use the same NVFP4-quantized weights with the
same per-block FP8(ue4m3) + per-tensor FP32 scaling. The original exclusion
("4-bit degrades quality on 9B+ models") was about NVFP4-quantizing the
weights at all, not about using CUTLASS vs cuBLAS to compute with them.
Both paths produce numerically equivalent results.
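The two-level scaling both paths share can be sketched numerically. A minimal pure-Python illustration with stand-in values (real FP4 codes and ue4m3 scales are quantized; the arithmetic shape is what matters here):

```python
# Sketch: NVFP4-style dequant. 4-bit codes are scaled per 16-element block by
# an FP8(ue4m3)-style scale, then by one per-tensor FP32 scale. The CUTLASS
# fast path and the dequant-to-FP16 + cuBLAS fallback consume the same three
# arrays, so their GEMM inputs match up to rounding.
BLOCK = 16

def dequant(codes, block_scales, tensor_scale):
    return [c * block_scales[i // BLOCK] * tensor_scale
            for i, c in enumerate(codes)]

codes = [(i % 6) - 3 for i in range(32)]   # stand-in for decoded FP4 values
w = dequant(codes, [0.5, 2.0], 1.5)
assert w[0] == -2.25   # -3 * 0.5 * 1.5
assert w[16] == 3.0    #  1 * 2.0 * 1.5
```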

Known remaining issue: multi-chunk-prefill (>256 token prompts) still hangs
on Nemotron-H — separate SSM-state-handoff bug between prefill chunks,
unrelated to this dispatch fix. Tracked in
docs/sm120-real-perf-plan.md as Lever 1b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 8a4407f into main May 4, 2026
2 checks passed
@kekzl kekzl deleted the feat/nemotron-h-arch branch May 4, 2026 21:58
@kekzl kekzl mentioned this pull request May 10, 2026