
feat(nemotron): support NemotronHForCausalLM hybrid Mamba2+MoE+Attention NVFP4 models #104

Merged
kekzl merged 2 commits into main from feat/nemotron-h-arch on May 4, 2026

Conversation

kekzl (Owner) commented May 4, 2026

Summary

  • Registers NemotronHForCausalLM / nemotron_h_moe arch in config loader + weight map
  • Parses Mamba/MoE config fields (mamba_head_dim, mamba_num_heads, n_groups, ssm_state_size, conv_kernel, etc.)
  • Wires hf_quant_config.json exclude_modules list for selective BF16 preservation (conv1d, attn, 6 specific Mamba2 in/out_proj)
  • Adds Nemotron-3-Nano-30B-A3B-NVFP4 to scripts/validate_safetensors.py
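The exclude_modules wiring above can be sketched as a simple membership test. This is a minimal illustrative sketch, not the C++ implementation; the pattern entries and fnmatch-style wildcard matching are assumptions for illustration:

```python
# Sketch: decide whether a weight stays BF16 because its module is listed in
# hf_quant_config.json's exclude_modules. Pattern names here are hypothetical.
from fnmatch import fnmatch

def keep_bf16(name: str, exclude_modules: list[str]) -> bool:
    """True if the weight should be preserved in BF16 rather than read as NVFP4."""
    return any(fnmatch(name, pat) for pat in exclude_modules)

exclude = ["backbone.layers.*.mixer.conv1d", "backbone.layers.*.mixer.q_proj"]
assert keep_bf16("backbone.layers.3.mixer.conv1d", exclude)
assert not keep_bf16("backbone.layers.3.mixer.in_proj", exclude)
```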

Status

Single-chunk prefill works — verified 2026-05-04 against nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4. "Capital of France?" → "Paris" in 1.4s, finish=stop, coherent including <think> block.

Multi-chunk prefill (≥3 chunks, roughly ≥470 prompt tokens) silently hangs. This is a separate SSM-state-handoff bug across chunk boundaries, tracked as Lever 1b. Workaround: keep prompts to ≤2 chunks (chunk_size=256 → roughly ≤470 tokens).

Decode and simple chat work coherently within those bounds. Without this PR the model crashes at the first decode step with an IMA plus cuBLAS status=13.

Test plan

  • Boot Nemotron-3-Nano-30B-A3B-NVFP4 via imp:verify build
  • /v1/models lists model
  • Sanity prompt ("Capital of France?") returns "Paris" coherently
  • 351 + 461 token prompts complete in 1.5–1.7s
  • Multi-chunk prefill (Lever 1b — separate follow-up)

🤖 Generated with Claude Code

* feat(nemotron): support NemotronHForCausalLM hybrid Mamba2+MoE+Attention NVFP4 models

Wires up the full load+dispatch chain for Nemotron-3-Nano-30B-A3B-NVFP4 and
similar nemotron_h_moe-arch checkpoints. Without these changes the model
crashes at the first decode step with a prefill memcpy IMA plus an lm_head
cuBLAS status=13 error.

Three pieces:

1. hf_config_loader.cpp: register NemotronHForCausalLM/nemotron_h in arch
   maps; parse mamba/MoE-specific config fields (mamba_head_dim,
   mamba_num_heads, n_groups, ssm_state_size, conv_kernel,
   n_routed_experts, n_shared_experts, moe_shared_expert_intermediate_size,
   routed_scaling_factor, norm_topk_prob); decode hybrid_override_pattern
   ("MEMEM*EM..." with M=Mamba2/E=MoE/*=Attn) into n_kv_heads_per_layer.

2. weight_map.cpp: Nemotron-H name normalizer translating
   backbone.embeddings/norm_f → model.embed_tokens/norm and
   backbone.layers.N.mixer.<sub>.* → model.layers.N.{self_attn|mamba|mlp}.*
   (dispatched by sub: q/k/v/o_proj→self_attn, in/out_proj+conv1d+
   A_log+D+dt_bias+norm→mamba, experts/gate/shared_experts→mlp). Also
   adds NVFP4 prequant scale routing (weight_scale/weight_scale_2/
   input_scale) for mamba.in_proj/out_proj.
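The normalizer in step 2 amounts to a prefix rewrite plus a dispatch on the mixer sub-module name. A minimal Python sketch of that mapping (the real code is C++ in weight_map.cpp; sub-module sets are taken from the description above):

```python
import re

ATTN = {"q_proj", "k_proj", "v_proj", "o_proj"}
MAMBA = {"in_proj", "out_proj", "conv1d", "A_log", "D", "dt_bias", "norm"}

def normalize(name: str) -> str:
    # backbone.embeddings / backbone.norm_f -> model.embed_tokens / model.norm
    name = name.replace("backbone.embeddings", "model.embed_tokens")
    name = name.replace("backbone.norm_f", "model.norm")
    m = re.match(r"backbone\.layers\.(\d+)\.mixer\.([A-Za-z0-9_]+)\.?(.*)", name)
    if m:
        layer, sub, rest = m.groups()
        if sub in ATTN:
            group = "self_attn"
        elif sub in MAMBA:
            group = "mamba"
        else:  # experts / gate / shared_experts
            group = "mlp"
        tail = f"{sub}.{rest}" if rest else sub
        return f"model.layers.{layer}.{group}.{tail}"
    return name

assert normalize("backbone.layers.4.mixer.q_proj.weight") == \
    "model.layers.4.self_attn.q_proj.weight"
assert normalize("backbone.layers.4.mixer.conv1d.weight") == \
    "model.layers.4.mamba.conv1d.weight"
assert normalize("backbone.embeddings.weight") == "model.embed_tokens.weight"
```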

3. executor_pre_dequant.cu: extend resolve() in promote() to map
   "L<i>.ssm_in"/"L<i>.ssm_out" scratch keys back to L.ssm_in/ssm_out
   so SSM NVFP4 scales actually attach to the tensor sidecars.
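The resolve() extension in step 3 reduces to stripping the layer index from the scratch key. A hedged sketch (key shapes are taken from the description above; the real function lives in executor_pre_dequant.cu and its signature will differ):

```python
import re

def resolve(key: str) -> str:
    # Map per-layer scratch keys like "L12.ssm_in" back to the canonical
    # "L.ssm_in"/"L.ssm_out" family so NVFP4 scales attach to the sidecars.
    m = re.match(r"L\d+\.(ssm_in|ssm_out)$", key)
    return f"L.{m.group(1)}" if m else key

assert resolve("L12.ssm_in") == "L.ssm_in"
assert resolve("L0.ssm_out") == "L.ssm_out"
assert resolve("L3.q_proj") == "L3.q_proj"  # non-SSM keys pass through
```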

Validate: scripts/validate_safetensors.py adds Nemotron-3-Nano entry.
Smoke test passes phase 0/3/5/6 (load + 32x graph replay byte-identical
+ logit health + determinism + decode 37-70 tok/s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build: switch sm_120f → sm_120a target for full RTX 5090 feature set

The original CMakeLists used sm_120f as a workaround for a ptxas C7600 error
on sm_120a. On CUDA 13.2.1 that bug no longer reproduces: the build is clean
with zero ptxas errors.

sm_120a unlocks the full SM120 feature set:
  - mma.sync.aligned.kind::mxf4nvf4.block_scale (block-scaled FP4 MMA)
  - extended cp.async.bulk.tensor modes (TMA multicast)
  - full 228-KiB SMEM-carveout per CTA
  - cluster launch with CLC
  - extended mbarrier phases
  - sparse mxf4nvf4 K=128 MMAs (per ptx_mma_survey)

Note: tcgen05.* / TMEM are SM100-only (server B200) — NOT present on
SM120 (consumer RTX 5090) regardless of arch suffix. Earlier confusion
in code comments has been corrected via memory cross-refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(nvfp4): route SSM in_proj/out_proj through CUTLASS fast-path

Add L.ssm_in / L.ssm_out to the cutlass_nvfp4 cache registration loop in
pre_dequant_weights(). Previously, NVFP4-quantized SSM (Mamba2/GDN)
projections were excluded from this cache and fell back to the
dequant-to-FP16 + cuBLAS slow path in nvfp4_gemm.cu, adding ~52 MiB of
scratch allocation per layer plus a full-weight FP16 round-trip per GEMM call.

Effect for Nemotron-3-Nano-30B-A3B-NVFP4:
  CUTLASS NVFP4 cache: 46 → 80 tensors (+34 SSM projections, 66.75 MiB)
  300-token prompt:    300s timeout → 2s coherent answer
  slow-fallback warn:  N× per layer per chunk → 0× per request

Math equivalence: both paths use the same NVFP4-quantized weights with the
same per-block FP8(ue4m3) + per-tensor FP32 scaling. The original exclusion
("4-bit degrades quality on 9B+ models") was about NVFP4-quantizing the
weights at all, not about using CUTLASS vs cuBLAS to compute with them.
Both paths produce numerically equivalent results.
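The two-level scaling both paths share can be sketched numerically. A minimal pure-Python illustration with stand-in values (real FP4 codes and ue4m3 scales are quantized; the arithmetic shape is what matters here):

```python
# Sketch: NVFP4-style dequant. 4-bit codes are scaled per 16-element block by
# an FP8(ue4m3)-style scale, then by one per-tensor FP32 scale. The CUTLASS
# fast path and the dequant-to-FP16 + cuBLAS fallback consume the same three
# arrays, so their GEMM inputs match up to rounding.
BLOCK = 16

def dequant(codes, block_scales, tensor_scale):
    return [c * block_scales[i // BLOCK] * tensor_scale
            for i, c in enumerate(codes)]

codes = [(i % 6) - 3 for i in range(32)]   # stand-in for decoded FP4 values
w = dequant(codes, [0.5, 2.0], 1.5)
assert w[0] == -2.25   # -3 * 0.5 * 1.5
assert w[16] == 3.0    #  1 * 2.0 * 1.5
```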

Known remaining issue: multi-chunk-prefill (>256 token prompts) still hangs
on Nemotron-H — separate SSM-state-handoff bug between prefill chunks,
unrelated to this dispatch fix. Tracked in
docs/sm120-real-perf-plan.md as Lever 1b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 8a4407f into main May 4, 2026
2 checks passed
@kekzl kekzl deleted the feat/nemotron-h-arch branch May 4, 2026 21:58
@kekzl kekzl mentioned this pull request May 10, 2026