Add SarvamMLA model (sarvamai/sarvam-105b) #44569
aashay-sarvam wants to merge 8 commits into huggingface:main
Conversation
Hi @aashay-sarvam, it looks like the architecture is identical to Deepseek V3! Can you just upload your checkpoints with that model type instead?
Hey there 👋
I have left a few smaller comments, but essentially we can already use the existing code. I assume that you want to keep the model type, which is what my comments expect. It could also be updated to work with DeepSeek V3 directly (like Matt mentioned before me), but then it loses its identity I guess (no model type, no default config).
#44569 (comment) might've gotten lost
Missed this - will make the changes
vasqu left a comment
Some new comments, because things changed a bit on main - sorry 😓
For the CI, `make fix-repo` should fix most smaller things.
```python
# Hub config.json uses num_experts/num_shared_experts; map to parent names
n_routed_experts = kwargs.pop("num_experts", n_routed_experts)
n_shared_experts = kwargs.pop("num_shared_experts", n_shared_experts)

# head_dim in Hub config.json is kv_lora_rank + qk_rope_head_dim (for vLLM
# MLA compat), but DeepseekV3Config computes it as qk_rope_head_dim.
kwargs.pop("head_dim", None)
kwargs.pop("q_head_dim", None)
```
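Pulled out of the diff context, the kwargs-mapping pattern in this hunk can be sketched standalone. `ParentConfig`/`ChildConfig` below are hypothetical stand-ins for `DeepseekV3Config` and its subclass, not the real transformers classes:

```python
class ParentConfig:
    def __init__(self, n_routed_experts=64, n_shared_experts=2, **kwargs):
        self.n_routed_experts = n_routed_experts
        self.n_shared_experts = n_shared_experts


class ChildConfig(ParentConfig):
    def __init__(self, n_routed_experts=128, n_shared_experts=1, **kwargs):
        # Hub config.json uses num_experts/num_shared_experts; map to parent names
        n_routed_experts = kwargs.pop("num_experts", n_routed_experts)
        n_shared_experts = kwargs.pop("num_shared_experts", n_shared_experts)
        # head_dim on the Hub is kv_lora_rank + qk_rope_head_dim (vLLM MLA
        # compat); drop it so the parent recomputes it from qk_rope_head_dim.
        kwargs.pop("head_dim", None)
        kwargs.pop("q_head_dim", None)
        super().__init__(
            n_routed_experts=n_routed_experts,
            n_shared_experts=n_shared_experts,
            **kwargs,
        )


# Simulate loading a Hub config.json that uses the Hub-side names
cfg = ChildConfig(num_experts=128, num_shared_experts=8, head_dim=576)
print(cfg.n_routed_experts, cfg.n_shared_experts)
```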
Would it be possible to change the remote config instead of adding workarounds?
```python
    **kwargs,
)

def convert_rope_params_to_dict(self, ignore_keys_at_rope_validation: set | None = None, **kwargs):
```
Same here, we could change the remote config instead to have a proper attribute for this
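As a hedged, standalone sketch of the normalization a `convert_rope_params_to_dict` override could perform (function name and dict shape here are illustrative, not the actual transformers API): the Hub config says `"deepseek_yarn"`, but only `"yarn"` is a known rope type.

```python
def normalize_rope_scaling(rope_scaling):
    """Map the Hub-specific 'deepseek_yarn' rope type to the supported 'yarn'."""
    if rope_scaling is not None and rope_scaling.get("type") == "deepseek_yarn":
        rope_scaling = {**rope_scaling, "type": "yarn"}
    return rope_scaling


print(normalize_rope_scaling({"type": "deepseek_yarn", "factor": 40.0}))
```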
Have made the changes, though I still need to push the model config to the model repo (I have tested it locally)
Also, question - sglang uses sarvamMLA, will that break with this change?
Hmm, not too familiar with sglang's integration here tbh. Imo, it can and should be able to use the (native) deepseek architecture - might need a nudge to respect the architecture 🤔
cc @adarshxs if you have any insights re sglang
Add native support for the sarvam_mla model type using the modular pattern, inheriting from DeepSeek V3. The model uses Multi-head Latent Attention (MLA) with Mixture of Experts (MoE), supporting 105B parameters with 128 routed experts and 8 active per token.

New files:
- configuration_sarvam_mla.py: Config with attribute mapping, rope normalization, and head_dim handling for Hub compatibility
- modular_sarvam_mla.py: 48-line modular file inheriting DeepSeek V3
- modeling_sarvam_mla.py: Auto-generated from modular (736 lines)
- test_modeling_sarvam_mla.py: 140 passing unit tests
- sarvam_mla.md: Documentation with usage examples

Modified files:
- Auto-registration in configuration_auto.py, modeling_auto.py
- Model import in models/__init__.py
- Weight conversion mapping (qwen2_moe pattern) in conversion_mapping.py
- Documentation index in _toctree.yml

Made-with: Cursor
Per vasqu's review:
- Remove modular_sarvam_mla.py and modeling_sarvam_mla.py (no need to re-implement the identical DeepSeek V3 architecture)
- Point auto mappings directly to DeepseekV3 model classes
- Move rope type normalization (deepseek_yarn -> yarn) to a convert_rope_params_to_dict override
- Remove test file (DeepseekV3 tests cover the architecture)
- Slim down docs to config-only autodoc

Made-with: Cursor
Move SarvamMLAConfig definition into modular_sarvam_mla.py and auto-generate configuration_sarvam_mla.py from it, following the canonical transformers modular pattern.

Made-with: Cursor
- Remove torch_dtype="auto" from docs (now default)
- Simplify modular_sarvam_mla.py to only override defaults that differ from DeepseekV3Config (no __init__, no workarounds)
- Add @strict(accept_kwargs=True) for config validation (huggingface#41250)
- Regenerate configuration_sarvam_mla.py with dataclass fields and __post_init__ pattern
- Hub config.json changes needed: remove head_dim/q_head_dim, change rope_scaling.type to "yarn", update architectures

Made-with: Cursor
Force-pushed from ee9a64f to 0f5d73d
[For maintainers] Suggested jobs to run (before merge): run-slow: auto
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44569&sha=3d9969
What does this PR do?
Adds native support for the `sarvam_mla` model type (sarvamai/sarvam-105b) to HuggingFace Transformers using the modular pattern, inheriting from DeepSeek V3.

Model Architecture
SarvamMLA is a 105B parameter Mixture of Experts (MoE) language model developed by Sarvam AI. It uses:
- `first_k_dense_replace=1` (Layer 0 = dense MLP, Layer 1+ = MoE)

Files Added
- `src/transformers/models/sarvam_mla/__init__.py`
- `src/transformers/models/sarvam_mla/configuration_sarvam_mla.py`
- `src/transformers/models/sarvam_mla/modular_sarvam_mla.py`
- `src/transformers/models/sarvam_mla/modeling_sarvam_mla.py`
- `tests/models/sarvam_mla/test_modeling_sarvam_mla.py`
- `docs/source/en/model_doc/sarvam_mla.md`

Files Modified
- `src/transformers/models/auto/configuration_auto.py` — CONFIG_MAPPING_NAMES, MODEL_NAMES_MAPPING
- `src/transformers/models/auto/modeling_auto.py` — MODEL_MAPPING_NAMES, MODEL_FOR_CAUSAL_LM, SEQUENCE_CLASSIFICATION, TOKEN_CLASSIFICATION
- `src/transformers/models/__init__.py` — import
- `src/transformers/conversion_mapping.py` — `"sarvam_mla": "qwen2_moe"` (per-expert → batched weight conversion)
- `docs/source/en/_toctree.yml` — docs index

Hub Compatibility Fixes (in config)
- `head_dim` override: Hub config has `head_dim: 576` (for vLLM MLA compat), but internally the model uses `qk_rope_head_dim = 64` for RoPE. Popped from kwargs.
- `deepseek_yarn` rope type: Hub config uses `"type": "deepseek_yarn"` but `ROPE_INIT_FUNCTIONS` only has `"yarn"`. Normalized in config `__init__`.
- `ModuleList` weights on Hub need conversion to batched format. Handled via the `qwen2_moe` conversion pattern.

Test Results
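The arithmetic behind the `head_dim` mismatch described above can be checked directly; the values come from this PR description and should be treated as illustrative:

```python
# The Hub config publishes head_dim = kv_lora_rank + qk_rope_head_dim (for
# vLLM MLA compatibility), while the model internally uses qk_rope_head_dim
# alone as the RoPE head dimension.
qk_rope_head_dim = 64      # RoPE head dim actually used by the model
hub_head_dim = 576         # head_dim as published in the Hub config.json
kv_lora_rank = hub_head_dim - qk_rope_head_dim

print(kv_lora_rank)  # 512
```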
- `SarvamMLAConfig` with `model_type=sarvam_mla`
- Layer 0 = `SarvamMLAMLP` (dense), Layer 1+ = `SarvamMLAMoE` ✓

Who can review?
@ArthurZucker
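As a tiny illustration of the `first_k_dense_replace=1` layer pattern from the architecture description above (Layer 0 = dense MLP, Layer 1+ = MoE); the layer count and the "dense"/"moe" labels are hypothetical, not a real config field:

```python
first_k_dense_replace = 1
num_hidden_layers = 4  # illustrative; the real 105B model has many more layers

# Layers with index < first_k_dense_replace use a dense MLP; the rest use MoE.
layer_types = [
    "dense" if i < first_k_dense_replace else "moe"
    for i in range(num_hidden_layers)
]
print(layer_types)  # ['dense', 'moe', 'moe', 'moe']
```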