
feat: support Nemotron-H / Nemotron-Cascade-2 (#1711) #1712

Draft
michael-rabe wants to merge 7 commits into intel:main from michael-rabe:feat/nemotron-cascade2-1711

Conversation

@michael-rabe

Adds initial support for hybrid Mamba2 + Attention + MoE models via the unfused_moe registry. A bare AutoRound("nvidia/...") call now produces a coherent INT4 checkpoint without launcher-side workarounds.
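
For illustration, the end-to-end call would look roughly like the sketch below; the model id is a placeholder and the 4-bit weight-only default is assumed:

```python
from auto_round import AutoRound

# Placeholder model id for a Nemotron-H / Nemotron-Cascade-2 checkpoint.
ar = AutoRound("nvidia/<nemotron-h-checkpoint>")  # 4-bit weight-only by default
ar.quantize_and_save(output_dir="nemotron-h-int4", format="auto_round")
```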

Core additions:

  • unfused_moe/nemotron_h.py: linear-discoverable MoE block
  • unfused_moe/nemotron_h_setup.py: post-load fixups (Zamba2 group_size, SSM/router FP32 restore)
  • utils/source_tensor_overrides.py: generic source-checkpoint tensor reload utility
  • MODEL_CONFIG["nemotron_h"]: dispatch + upstream-rename preservation
  • compressors/base.py: apply_post_load_fixups hook (non-NH no-op)
  • Export: norm_dtype and scale_dtype per-layer overrides

Documentation:

  • New .claude/skills/adapt-unfused-moe skill covering the pipeline
  • adapt-new-llm slimmed to point at the dedicated skill
  • README / README_CN / step_by_step notes updated

Tests: 89 CPU tests across registration, post-load, export dtype controls, source-tensor overrides, and missing-tensors symmetry.

Closes #1711


Michael Rabe and others added 4 commits April 21, 2026 12:33
Adds initial support for hybrid Mamba2 + Attention + MoE models via the
unfused_moe registry. A bare `AutoRound("nvidia/...")` call now produces
a coherent INT4 checkpoint without launcher-side workarounds.

Core additions:
- `unfused_moe/nemotron_h.py`: linear-discoverable MoE block
- `unfused_moe/nemotron_h_setup.py`: post-load fixups (Zamba2 group_size,
  SSM/router FP32 restore)
- `utils/source_tensor_overrides.py`: generic source-checkpoint tensor
  reload utility
- `MODEL_CONFIG["nemotron_h"]`: dispatch + upstream-rename preservation
- `compressors/base.py`: `apply_post_load_fixups` hook (non-NH no-op)
- Export: `norm_dtype` and `scale_dtype` per-layer overrides

Documentation:
- New `.claude/skills/adapt-unfused-moe` skill covering the pipeline
- `adapt-new-llm` slimmed to point at the dedicated skill
- README / README_CN / step_by_step notes updated

Tests: 89 CPU tests across registration, post-load, export dtype
controls, source-tensor overrides, and missing-tensors symmetry.

Closes intel#1711

Signed-off-by: Michael Rabe <michaelrabe1896@gmail.com>
Unit tests now cover CPU as well as GPU devices (XPU tested). A new test was added for the dtype=FP32 recommendation for the residual opt-in.
Documentation: a recommendation for residual-stream precision (norm_dtype) was added as well.
Signed-off-by: Rabe, Michael (michaelrabe1896@gmail.com)
@michael-rabe michael-rabe force-pushed the feat/nemotron-cascade2-1711 branch from 615e3fe to 69f04b4 Compare April 21, 2026 10:33
@wenhuach21
Contributor

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

Review comment thread on docs/step_by_step.md (Outdated)

- **Residual-stream precision (`norm_dtype`):**

Opt-in kwarg on `quantize_and_save` / `save_quantized`. In deep residual architectures — especially hybrid SSM + MoE models such as Nemotron-H — accumulating residuals through BF16 norm outputs can lose precision layer over layer. Passing `norm_dtype="fp32"` exports norm weights in FP32 without touching quantized linears; disk/VRAM cost is <0.1% of a 30B model. Recommended for such hybrid architectures; optional elsewhere. Accepts `"fp16" | "bf16" | "fp32"` (string aliases) or a raw `torch.dtype`. Default (kwarg omitted): norm dtype follows the compute `amp_dtype` (FP16 on GPU/XPU with an FP16 checkpoint, BF16 on CPU/HPU).
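
For illustration, a minimal sketch of the opt-in as this bullet describes it; the model id is a placeholder:

```python
from auto_round import AutoRound

ar = AutoRound("nvidia/<nemotron-h-checkpoint>")  # placeholder model id
# Export norm weights in FP32; quantized linears are untouched. Accepts string
# aliases ("fp16" | "bf16" | "fp32") or a raw torch.dtype.
ar.quantize_and_save("nemotron-h-int4", norm_dtype="fp32")
```
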
Contributor

For this case, we should provide a better solution, similar to the handling of e_score_correction_bias in DeepSeek.

@xin3he is this a regression? I just found that the dtype of https://huggingface.co/Intel/DeepSeek-V3.2-int4-AutoRound/blob/main/model-00059-of-00071.safetensors is bf16.

@michael-rabe
Author

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

I'm afraid they are, but I haven't optimized (reduced) much since I got coherent and stable output.
Looking forward to your feedback ;).

It would be very welcome if we could reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low dtypes.
I had coherence problems without these changes; the post-load "patches" became necessary to bypass the "standard" AutoRound behaviours without changing too much of the AutoRound core.

@wenhuach21
Contributor

wenhuach21 commented Apr 21, 2026

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

I'm afraid they are, but I haven't optimized (reduced) much since I got coherent and stable output. Looking forward to your feedback ;).

It would be very welcome if we could reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low dtypes. I had coherence problems without these changes; the post-load "patches" became necessary to bypass the "standard" AutoRound behaviours without changing too much of the AutoRound core.

Thanks for the prompt reply. Could you let me know what indicator led you to change the norm dtype to FP32? I checked the model and noticed that the original norm weights are in BF16. I’m wondering whether modifying the norm dtype in the weights might cause issues with vLLM or SGLang, and whether they would still behave as expected.

For MoE models, I would generally suggest using higher precision for certain parts, for example, 8-bit for critical components (non-moe modules) and 4-bit for others (experts). This is a more common approach when pure 4-bit quantization leads to a significant accuracy drop and avoids the API change.

Please feel free to correct me if I’m mistaken, as I’m not very familiar with this model.

Thanks again for your pr.

@wenhuach21
Contributor

For mixed bits, you could try scheme="int4_mixed".
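
As a rough sketch of that suggestion (model id is a placeholder; the scheme name comes from the comment above, intended to keep critical non-expert modules at higher precision while experts stay at 4 bit):

```python
from auto_round import AutoRound

# Placeholder model id; scheme per the mixed-bits suggestion above.
ar = AutoRound("nvidia/<nemotron-h-checkpoint>", scheme="int4_mixed")
ar.quantize_and_save(output_dir="nemotron-h-int4-mixed")
```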

@michael-rabe
Author

michael-rabe commented Apr 21, 2026

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

I'm afraid they are, but I haven't optimized (reduced) much since I got coherent and stable output. Looking forward to your feedback ;).
It would be very welcome if we could reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low dtypes. I had coherence problems without these changes; the post-load "patches" became necessary to bypass the "standard" AutoRound behaviours without changing too much of the AutoRound core.

Thanks for the prompt reply. Could you let me know what indicator led you to change the norm dtype to FP32? I checked the model and noticed that the original norm weights are in BF16. I’m wondering whether modifying the norm dtype in the weights might cause issues with vLLM or SGLang, and whether they would still behave as expected.

For MoE models, I would generally suggest using higher precision for certain parts, for example, 8-bit for critical components (non-moe modules) and 4-bit for others (experts). This is a more common approach when pure 4-bit quantization leads to a significant accuracy drop and avoids the API change.

Please feel free to correct me if I’m mistaken, as I’m not very familiar with this model.

Thanks again for your pr.

Ah, I missed the proper documentation (I'll fix that tonight).
Thanks for the valuable suggestions; I'm following up on them and have set the PR to "draft" status.

You're right that the original norm weights are BF16, and BF16 is numerically fine for the norm weights themselves — upcasting them to FP32 is lossless but gains nothing at the weight level.

The purpose of norm_dtype="fp32" is different: it's a lever for the residual stream, not for the norm weights. In HuggingFace-style RMSNorm the norm output's dtype follows the weight's dtype, so storing norm weights as FP32 pulls the post-norm tensor — and via the residual add, the residual stream itself — into FP32 at runtime. Over 50+ layers the residual is a sum of many block outputs, and keeping that accumulation in FP32 reduces BF16 rounding drift.
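
To make the mechanism concrete, here is a minimal sketch with plain tensors rather than the actual HF module, assuming the usual HF RMSNorm forward (variance in FP32, result cast back to the input dtype, then multiplied by the weight):

```python
import torch

hidden = torch.randn(4, 8, dtype=torch.bfloat16)   # residual-stream activation (BF16)
weight = torch.ones(8, dtype=torch.float32)        # norm weight exported as FP32

# HF-style RMSNorm: variance computed in FP32, normed result cast back to the
# input dtype, then multiplied by the weight.
variance = hidden.to(torch.float32).pow(2).mean(-1, keepdim=True)
normed = hidden.to(torch.float32) * torch.rsqrt(variance + 1e-6)
out = weight * normed.to(hidden.dtype)             # BF16 * FP32 promotes to FP32

print(out.dtype)             # torch.float32 -> norm output follows the weight dtype
print((hidden + out).dtype)  # torch.float32 -> the residual add is pulled up as well
```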

Two caveats: this only helps on engines that honor the stored norm-weight dtype (HF transformers does, fused-kernel runtimes like vLLM/SGLang often don't). And the Nemotron-H coherence fix itself was not norm_dtype — it was the always-on post-load restore of the SSM core tensors and the router correction bias to FP32, which are the tensors with genuine FP32 requirements. norm_dtype is orthogonal and optional.
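
For reference, the post-load restore boils down to something like the sketch below; the tensor-name suffixes correspond to the SSM core tensors and router correction bias mentioned above, while the traversal itself is only illustrative, not the actual unfused_moe hook:

```python
import torch

# Suffixes of the tensors with genuine FP32 requirements in Nemotron-H.
_FP32_SUFFIXES = ("A_log", "dt_bias", "e_score_correction_bias")

def restore_fp32_sensitive_tensors(model: torch.nn.Module) -> None:
    """Upcast SSM core and router-bias tensors back to FP32 after loading."""
    for name, param in model.named_parameters():
        if name.endswith(_FP32_SUFFIXES):
            param.data = param.data.to(torch.float32)
```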

So: FP32 norm weights aren't about upgrading the norm weights themselves — they're an export-side lever for residual-stream precision on compliant inference stacks.

Please correct me if I'm wrong. I'm new to the club and still in the learning phase ;).
I appreciate your questions.
Sorry for the wall of text.

@michael-rabe michael-rabe marked this pull request as draft April 21, 2026 17:21
@michael-rabe
Author

Making progress; the comments were helpful and I can reduce the number of changes.
Still investigating, as I'm not yet happy with the results (accuracy) and want to make sure the issue isn't in the quantization itself.

Michael Rabe and others added 3 commits April 24, 2026 18:36
Schema conformance — nemotron_h.py now matches the deepseek_v3 / glm_moe
pattern used by the other unfused-MoE siblings:
- remove module docstring and `from __future__ import annotations`
- remove unused `layer_idx` constructor parameter
- extract per-expert loop into `experts_forward()` helper
- fold public API (`apply_nemotron_h_post_load`,
  `nemotron_h_default_layer_config_patterns`) from `nemotron_h_setup.py`
  into `nemotron_h.py`; setup module now holds only private helpers
- update MODEL_CONFIG paths and test imports accordingly

Consolidation:
- extract `_resolve_registered_fn()` in `unfused_moe/__init__.py` to
  DRY up `apply_post_load_fixups` and `get_default_layer_config_patterns`
  (−76 lines, identical external API)
- remove `auto_round/utils/source_tensor_overrides.py`; the sole
  consumer (`nemotron_h_setup.py`) inlines it as a private helper
- consolidate three NH test files (`test_nemotron_h_post_load.py`,
  `test_nemotron_h_registration.py`, `test_source_tensor_overrides.py`)
  into a single `test/test_cpu/models/test_nemotron_h.py` covering
  all 20 scenarios

Docs:
- compress `norm_dtype` hyperparameter bullet in `step_by_step.md` to
  match the surrounding bullet style (AdamW, quantized lm-head, etc.)
- remove per-model "Verifying a quantized Nemotron-H checkpoint"
  section — no other model has troubleshooting content in this guide
- trim the Known-Issues reference to an internal dev-only skill

Misc:
- `.gitignore`: exclude local evaluation scripts

All 90 tests green: test_nemotron_h, test_norm_dtype, test_scale_dtype,
test_missing_tensors.

Signed-off-by: Rabe, Michael <michaelrabe1896@gmail.com>
…simplicity.

norm_dtype (quantize_and_save/save_quantized), _cast_norm_modules,
and the ShardWriter upcast path are removed — NH residual precision
is handled automatically by apply_post_load_fixups (A_log, dt_bias,
e_score_correction_bias → FP32 via the unfused_moe post-load hook).

Also adds layer_idx: int | None = None to LinearNemotronHMoE.__init__
for compatibility with transformers 5.3.0, which now passes layer_idx
to all MIXER_TYPES constructors.

Signed-off-by: Rabe, Michael (michaelrabe1896@gmail.com)
