
Support Nemotron-H / Nemotron-Cascade-2 (hybrid Mamba2 + Attention + MoE) #1711

@michael-rabe

Description


Goal

Enable AutoRound to quantize Nemotron-H family models — in particular nvidia/Nemotron-Cascade-2-30B-A3B — and produce a checkpoint whose decoded output is coherent without requiring launcher-side workarounds.

Target user flow:

from auto_round import AutoRound
AutoRound("nvidia/Nemotron-Cascade-2-30B-A3B", bits=4, group_size=128, sym=True) \
    .quantize_and_save(output_dir=..., format="auto_round")
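If individual hybrid submodules turn out to be quantization-sensitive, a per-layer override could keep them at 16-bit while the rest goes to W4A16. The sketch below is an assumption about the shape such an override might take (the key patterns and the exact layer_config schema are hypothetical, not confirmed AutoRound API):

```python
# Hypothetical per-layer overrides: keep Mamba2 mixer projections at
# 16-bit while quantizing everything else to W4A16. Both the module
# paths and the layer_config schema are assumptions for illustration.
layer_config = {
    "model.layers.*.mixer.in_proj": {"bits": 16},
    "model.layers.*.mixer.out_proj": {"bits": 16},
}

# Would be passed alongside the bare call, e.g.:
# AutoRound("nvidia/Nemotron-Cascade-2-30B-A3B", bits=4,
#           group_size=128, sym=True, layer_config=layer_config)
```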

Scope

  • Nemotron-H architecture (hybrid Mamba2 SSM + Attention + routed MoE + shared expert)
  • auto_round export format, W4A16 weight-only
  • CPU + CUDA/XPU calibration paths
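Because the architecture mixes four block types, the implementation will likely need to bucket linear layers by which block they belong to before deciding how to quantize them. A minimal sketch of that classification, where the name patterns (mixer, self_attn, experts, shared_expert) are assumptions rather than the actual Nemotron-H module paths:

```python
import re

# Hypothetical name patterns for a hybrid Mamba2 + Attention + MoE
# stack; the real Nemotron-H module paths may differ.
PATTERNS = {
    "mamba2": re.compile(r"\.mixer\.(in_proj|out_proj)$"),
    "attention": re.compile(r"\.self_attn\.(q|k|v|o)_proj$"),
    "moe_expert": re.compile(r"\.experts\.\d+\.(up|gate|down)_proj$"),
    "shared_expert": re.compile(r"\.shared_expert\.(up|gate|down)_proj$"),
}

def classify_layer(name: str) -> str:
    """Bucket a linear-layer name into one of the hybrid block types."""
    for kind, pattern in PATTERNS.items():
        if pattern.search(name):
            return kind
    return "other"
```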

Acceptance

  • A bare AutoRound("nvidia/...") call produces a checkpoint that loads and generates coherent text in a downstream inference engine.
  • No regressions for other model_type values (non-NH models must be unaffected).
  • CPU test suite covers registration, post-load behaviour, and export-time precision controls.
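The "coherent text" criterion can be smoke-tested mechanically: a broken quantized checkpoint typically degenerates into looping output, which shows up as a high repeated-n-gram ratio. A hypothetical heuristic (not part of AutoRound) the CPU suite could use:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that are duplicates: near 0 for
    coherent text, near 1 for degenerate looping output."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def looks_coherent(text: str, threshold: float = 0.5) -> bool:
    """Crude acceptance check on decoded output."""
    return repetition_ratio(text) < threshold
```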

Out of scope (follow-ups)

  • Pure-Mamba architectures (no attention layers)
  • Other Nemotron-H variants beyond the Cascade-2 checkpoint family
