
Support Nemotron-H / Nemotron-Cascade-2 (hybrid Mamba2 + Attention + MoE) #1711

@michael-rabe

Description


Goal

Enable AutoRound to quantize Nemotron-H family models — in particular nvidia/Nemotron-Cascade-2-30B-A3B — and produce a checkpoint whose decoded output is coherent without requiring launcher-side workarounds.

Target user flow:

from auto_round import AutoRound
AutoRound("nvidia/Nemotron-Cascade-2-30B-A3B", bits=4, group_size=128, sym=True) \
    .quantize_and_save(output_dir=..., format="auto_round")
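If individual hybrid submodules turn out to be quantization-sensitive, a per-layer override could keep them at 16-bit while the rest goes to W4A16. The sketch below is an assumption about the shape such an override might take (the key patterns and the exact layer_config schema are hypothetical, not confirmed AutoRound API):

```python
# Hypothetical per-layer overrides: keep Mamba2 mixer projections at
# 16-bit while quantizing everything else to W4A16. Both the module
# paths and the layer_config schema are assumptions for illustration.
layer_config = {
    "model.layers.*.mixer.in_proj": {"bits": 16},
    "model.layers.*.mixer.out_proj": {"bits": 16},
}

# Would be passed alongside the bare call, e.g.:
# AutoRound("nvidia/Nemotron-Cascade-2-30B-A3B", bits=4,
#           group_size=128, sym=True, layer_config=layer_config)
```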

Scope

  • Nemotron-H architecture (hybrid Mamba2 SSM + Attention + routed MoE + shared expert)
  • auto_round export format, W4A16 weight-only
  • CPU + CUDA/XPU calibration paths
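Because the architecture mixes four block types, the implementation will likely need to bucket linear layers by which block they belong to before deciding how to quantize them. A minimal sketch of that classification, where the name patterns (mixer, self_attn, experts, shared_expert) are assumptions rather than the actual Nemotron-H module paths:

```python
import re

# Hypothetical name patterns for a hybrid Mamba2 + Attention + MoE
# stack; the real Nemotron-H module paths may differ.
PATTERNS = {
    "mamba2": re.compile(r"\.mixer\.(in_proj|out_proj)$"),
    "attention": re.compile(r"\.self_attn\.(q|k|v|o)_proj$"),
    "moe_expert": re.compile(r"\.experts\.\d+\.(up|gate|down)_proj$"),
    "shared_expert": re.compile(r"\.shared_expert\.(up|gate|down)_proj$"),
}

def classify_layer(name: str) -> str:
    """Bucket a linear-layer name into one of the hybrid block types."""
    for kind, pattern in PATTERNS.items():
        if pattern.search(name):
            return kind
    return "other"
```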

Acceptance

  • A bare AutoRound("nvidia/...") call produces a checkpoint that loads and generates coherent text in a downstream inference engine.
  • No regressions for other model_type values (non-NH models must be unaffected).
  • CPU test suite covers registration, post-load behaviour, and export-time precision controls.
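The "coherent text" criterion can be smoke-tested mechanically: a broken quantized checkpoint typically degenerates into looping output, which shows up as a high repeated-n-gram ratio. A hypothetical heuristic (not part of AutoRound) the CPU suite could use:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that are duplicates: near 0 for
    coherent text, near 1 for degenerate looping output."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def looks_coherent(text: str, threshold: float = 0.5) -> bool:
    """Crude acceptance check on decoded output."""
    return repetition_ratio(text) < threshold
```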

Out of scope (follow-ups)

  • Pure-Mamba architectures (no attention layers)
  • Other Nemotron-H variants beyond the Cascade-2 checkpoint family
