Add SarvamMLA model (sarvamai/sarvam-105b)#44569

Open
aashay-sarvam wants to merge 8 commits into huggingface:main from aashay-sarvam:add-sarvam-mla-model

Conversation

@aashay-sarvam

What does this PR do?

Adds native support for the sarvam_mla model type (sarvamai/sarvam-105b) to HuggingFace Transformers using the modular pattern, inheriting from DeepSeek V3.

Model Architecture

SarvamMLA is a 105B parameter Mixture of Experts (MoE) language model developed by Sarvam AI. It uses:

  • Multi-head Latent Attention (MLA): Low-rank KV compression with decoupled RoPE
  • Sparse MoE: 128 routed experts, 8 active per token, plus 1 shared expert
  • First layer dense: first_k_dense_replace=1 (Layer 0 = dense MLP, Layer 1+ = MoE)
  • DeepSeek YaRN RoPE: Extended context up to 131K tokens
  • Sigmoid routing with group-based top-k
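
As a rough illustration of the routing described above (sigmoid expert scores with group-based top-k selection), here is a minimal sketch. The function name, grouping scheme, and dimensions are assumptions for illustration, not the PR's actual code:

```python
import torch

def group_topk_route(hidden, gate_weight, n_groups=8, topk_groups=4, top_k=8):
    """Toy sigmoid router with group-based top-k (illustrative only).

    Scores every expert with a sigmoid, keeps only the best-scoring
    expert groups, then picks the top-k experts inside those groups.
    """
    # (tokens, n_experts) affinity scores via sigmoid, DeepSeek V3-style
    scores = torch.sigmoid(hidden @ gate_weight)            # (T, E)
    T, E = scores.shape
    grouped = scores.view(T, n_groups, E // n_groups)       # (T, G, E/G)
    # rank groups by their best expert score, keep topk_groups of them
    group_scores = grouped.max(dim=-1).values               # (T, G)
    keep = group_scores.topk(topk_groups, dim=-1).indices   # (T, kept_G)
    mask = torch.zeros_like(group_scores).scatter(1, keep, 1.0)
    masked = (grouped * mask.unsqueeze(-1)).view(T, E)
    # final top-k experts per token; weights renormalized over the chosen k
    topk_vals, topk_idx = masked.topk(top_k, dim=-1)
    weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
    return topk_idx, weights

# Example: 2 tokens, hidden size 16, 128 experts (as in the config above)
hidden = torch.randn(2, 16)
gate = torch.randn(16, 128)
idx, w = group_topk_route(hidden, gate)
```

With 128 experts and top_k=8 this mirrors the "8 active per token" setup; the shared expert would be applied to every token outside the router.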

Files Added

  • src/transformers/models/sarvam_mla/__init__.py — Lazy loading module
  • src/transformers/models/sarvam_mla/configuration_sarvam_mla.py — Config with Hub compatibility (head_dim, rope_type normalization)
  • src/transformers/models/sarvam_mla/modular_sarvam_mla.py — 48-line modular file inheriting DeepSeek V3
  • src/transformers/models/sarvam_mla/modeling_sarvam_mla.py — Auto-generated from modular (736 lines)
  • tests/models/sarvam_mla/test_modeling_sarvam_mla.py — Unit tests
  • docs/source/en/model_doc/sarvam_mla.md — Documentation with usage examples

Files Modified

  • src/transformers/models/auto/configuration_auto.py — CONFIG_MAPPING_NAMES, MODEL_NAMES_MAPPING
  • src/transformers/models/auto/modeling_auto.py — MODEL_MAPPING_NAMES, MODEL_FOR_CAUSAL_LM, SEQUENCE_CLASSIFICATION, TOKEN_CLASSIFICATION
  • src/transformers/models/__init__.py — import
  • src/transformers/conversion_mapping.py — "sarvam_mla": "qwen2_moe" (per-expert → batched weight conversion)
  • docs/source/en/_toctree.yml — docs index
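
The per-expert → batched weight conversion mentioned above can be pictured like this; the key names and tensor shapes are illustrative, not the exact Hub checkpoint layout or the transformers conversion-mapping machinery:

```python
import torch

def batch_expert_weights(state_dict, n_experts, prefix="model.layers.1.mlp"):
    """Stack per-expert ModuleList weights into one batched tensor.

    Illustrative sketch of the qwen2_moe-style conversion: real checkpoints
    go through transformers' conversion mapping, and key names may differ.
    """
    out = dict(state_dict)
    for proj in ("gate_proj", "up_proj", "down_proj"):
        keys = [f"{prefix}.experts.{i}.{proj}.weight" for i in range(n_experts)]
        # (n_experts, out_features, in_features) batched weight
        out[f"{prefix}.experts.{proj}"] = torch.stack([out.pop(k) for k in keys])
    return out

# Example with 4 toy experts and tiny 8x4 projection weights
sd = {
    f"model.layers.1.mlp.experts.{i}.{p}.weight": torch.randn(8, 4)
    for i in range(4)
    for p in ("gate_proj", "up_proj", "down_proj")
}
converted = batch_expert_weights(sd, n_experts=4)
```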

Hub Compatibility Fixes (in config)

  1. head_dim override: Hub config has head_dim: 576 (for vLLM MLA compat), but internally the model uses qk_rope_head_dim = 64 for RoPE. Popped from kwargs.
  2. deepseek_yarn rope type: Hub config uses "type": "deepseek_yarn" but ROPE_INIT_FUNCTIONS only has "yarn". Normalized in config __init__.
  3. Weight conversion: Per-expert ModuleList weights on Hub need conversion to batched format. Handled via qwen2_moe conversion pattern.
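
A minimal sketch of the kind of config-side normalization points 1–2 describe, operating on a raw config.json dict. The function and the exact dict layout are assumptions for illustration, not the PR's final code:

```python
def normalize_hub_config(cfg: dict) -> dict:
    """Illustrative cleanup of a Hub config.json dict before building the
    transformers config. Mirrors the fixes above; not the PR's actual code."""
    cfg = dict(cfg)
    # 1. head_dim on the Hub is kv_lora_rank + qk_rope_head_dim (for vLLM
    #    MLA compat); transformers derives the RoPE head dim itself, so drop it.
    cfg.pop("head_dim", None)
    cfg.pop("q_head_dim", None)
    # 2. ROPE_INIT_FUNCTIONS only knows "yarn", so map deepseek_yarn onto it.
    if "rope_scaling" in cfg:
        rope = dict(cfg["rope_scaling"])
        if rope.get("type") == "deepseek_yarn":
            rope["type"] = "yarn"
        cfg["rope_scaling"] = rope
    return cfg

# Toy input echoing the Hub values mentioned above
hub_cfg = {
    "head_dim": 576,
    "qk_rope_head_dim": 64,
    "rope_scaling": {"type": "deepseek_yarn", "factor": 4.0},
}
clean = normalize_hub_config(hub_cfg)
```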

Test Results

  • Unit tests: 140 passed, 92 skipped (on GPU node)
  • End-to-end test: Full 105B model loaded in bf16 across 8× H100 80GB GPUs
    • Config loads as SarvamMLAConfig with model_type=sarvam_mla
    • Layer 0 MLP = SarvamMLAMLP (dense), Layer 1+ = SarvamMLAMoE
    • Generation produces coherent text ✓
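
The end-to-end check above corresponds roughly to this usage pattern (the repo id is from the PR title; the dtype and device settings are illustrative, and loading the full 105B checkpoint needs multi-GPU sharding, e.g. 8× H100):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16, as in the end-to-end test
    device_map="auto",           # shard across available GPUs
)
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```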

Who can review?

@ArthurZucker

@Rocketknight1
Member

Hi @aashay-sarvam, it looks like the architecture is identical to Deepseek V3! Can you just upload your checkpoints with that model type instead?

@vasqu (Contributor) left a comment

Hey there 👋

I have left a few smaller comments, but essentially we can already use the existing code. I assume that you want to keep the model type, which is what my comments expect. It could also be updated to work with DeepSeek V3 directly (like Matt mentioned before me), but then it loses its identity, I guess (no model type, no default config).

@vasqu (Contributor) commented Mar 13, 2026

#44569 (comment) might've gotten lost

@aashay-sarvam
Author

#44569 (comment) might've gotten lost

Missed this - will make the changes

@vasqu (Contributor) left a comment

Some new comments, because things changed a bit on main - sorry 😓

For the CI, running make fix-repo should fix most of the smaller things

Comment on lines +83 to +90
# Hub config.json uses num_experts/num_shared_experts; map to parent names
n_routed_experts = kwargs.pop("num_experts", n_routed_experts)
n_shared_experts = kwargs.pop("num_shared_experts", n_shared_experts)

# head_dim in Hub config.json is kv_lora_rank + qk_rope_head_dim (for vLLM
# MLA compat), but DeepseekV3Config computes it as qk_rope_head_dim.
kwargs.pop("head_dim", None)
kwargs.pop("q_head_dim", None)
Contributor

Would it be possible to change the remote config instead of adding workarounds?

**kwargs,
)

def convert_rope_params_to_dict(self, ignore_keys_at_rope_validation: set | None = None, **kwargs):
Contributor

Same here, we could properly change the remote config instead to have a proper attribute for this

Author

Have made the changes, though I still need to push the updated config to the model repo (I have tested locally)

Author

Also, a question: sglang uses SarvamMLA; will that break with this change?

Contributor

Hmm, not too familiar with sglang's integration here tbh. Imo, it can and should be able to use the (native) deepseek architecture; might need a nudge to respect the architecture 🤔

Contributor

cc @adarshxs if you have any insights re sglang

Add native support for the sarvam_mla model type using the modular
pattern, inheriting from DeepSeek V3. The model uses Multi-head Latent
Attention (MLA) with Mixture of Experts (MoE), supporting 105B parameters
with 128 routed experts and 8 active per token.

New files:
- configuration_sarvam_mla.py: Config with attribute mapping, rope
  normalization, and head_dim handling for Hub compatibility
- modular_sarvam_mla.py: 48-line modular file inheriting DeepSeek V3
- modeling_sarvam_mla.py: Auto-generated from modular (736 lines)
- test_modeling_sarvam_mla.py: 140 passing unit tests
- sarvam_mla.md: Documentation with usage examples

Modified files:
- Auto-registration in configuration_auto.py, modeling_auto.py
- Model import in models/__init__.py
- Weight conversion mapping (qwen2_moe pattern) in conversion_mapping.py
- Documentation index in _toctree.yml

Made-with: Cursor
Per vasqu's review:
- Remove modular_sarvam_mla.py and modeling_sarvam_mla.py (no need
  to re-implement identical DeepSeek V3 architecture)
- Point auto mappings directly to DeepseekV3 model classes
- Move rope type normalization (deepseek_yarn -> yarn) to
  convert_rope_params_to_dict override
- Remove test file (DeepseekV3 tests cover the architecture)
- Slim down docs to config-only autodoc

Made-with: Cursor
Move SarvamMLAConfig definition into modular_sarvam_mla.py and
auto-generate configuration_sarvam_mla.py from it, following the
canonical transformers modular pattern.

Made-with: Cursor
- Remove torch_dtype="auto" from docs (now default)
- Simplify modular_sarvam_mla.py to only override defaults that differ
  from DeepseekV3Config (no __init__, no workarounds)
- Add @strict(accept_kwargs=True) for config validation (huggingface#41250)
- Regenerate configuration_sarvam_mla.py with dataclass fields and
  __post_init__ pattern
- Hub config.json changes needed: remove head_dim/q_head_dim, change
  rope_scaling.type to "yarn", update architectures

Made-with: Cursor
@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@github-actions

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44569&sha=3d9969
