Conversation
Hey! Thank you for the PR, but can you convert to modular style? It'll make it a lot easier to review, and should cut down on the amount of code you need too!
vasqu left a comment:
Some initial comments, but as @Rocketknight1 said, we should change the implementation to a modular one.
Also we are missing docs and tests!
head_dim (`int`, *optional*, defaults to 192):
    The attention head dimension for Q and K.
v_head_dim (`int`, *optional*, defaults to 128):
    The attention head dimension for V. This is specific to the MiMo-V2 architecture.
If we have this explicit difference between the head dims, I'd prefer we follow an existing notation like in DeepSeek, e.g. `qk_head_dim` (meaning the shared Q/K head dimension).
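As a rough sketch of the DeepSeek-style naming the reviewer suggests (the attribute names and the head count below are assumptions for illustration, not the final API):

```python
# Hypothetical config attributes following DeepSeek-style naming (assumption):
config_kwargs = {
    "qk_head_dim": 192,  # shared head dimension for queries and keys
    "v_head_dim": 128,   # smaller head dimension for values
}

# The attention projections would then be sized from the respective dims,
# e.g. with an assumed num_attention_heads of 32:
num_attention_heads = 32
q_proj_out = num_attention_heads * config_kwargs["qk_head_dim"]
v_proj_out = num_attention_heads * config_kwargs["v_head_dim"]
print(q_proj_out, v_proj_out)  # → 6144 4096
```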
rope_theta (`float`, *optional*, defaults to 5000000.0):
    The base period of the RoPE embeddings.
This does not exist anymore; you can set `default_theta` instead.
partial_rotary_factor (`float`, *optional*, defaults to 0.334):
    Percentage of the hidden dimension to apply RoPE to.
We incorporate RoPE-related parameters into a separate dict-like object. You can customize the initialization as in `transformers/src/transformers/models/minimax_m2/modular_minimax_m2.py`, lines 209 to 228 at 2aa7b65. Meaning `partial_rotary_factor` in this case, mostly.
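As a minimal sketch of grouping RoPE settings into one dict-like object as described (the key names here are assumptions based on the pattern, not the exact transformers schema):

```python
# Hypothetical dict-like grouping of RoPE parameters (key names assumed):
rope_parameters = {
    "rope_type": "default",
    "rope_theta": 5_000_000.0,       # instead of a top-level rope_theta attribute
    "partial_rotary_factor": 0.334,  # fraction of the head dim that gets RoPE
}

# Partial RoPE means only the first rotary_dim channels of each head are rotated:
head_dim = 192
rotary_dim = int(head_dim * rope_parameters["partial_rotary_factor"])
print(rotary_dim)  # → 64
```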
hybrid_layer_pattern (`List[int]`, *optional*):
    Pattern defining which layers use full attention (0) and which use sliding window attention (1).
We use a list of strings nowadays; see `transformers/src/transformers/models/gemma3/configuration_gemma3.py`, lines 190 to 195 at 2aa7b65. This is more explicit and allows for more layer types (they grew over time for other models).
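A small sketch of converting the integer pattern into the list-of-strings convention (the string values follow the gemma3-style `layer_types` naming; the example pattern below is illustrative, not the real one):

```python
# Convert an integer hybrid_layer_pattern into explicit string layer types
# (gemma3-style convention; example values are illustrative):
hybrid_layer_pattern = [0, 1, 1, 0, 1, 1]
layer_types = [
    "full_attention" if kind == 0 else "sliding_attention"
    for kind in hybrid_layer_pattern
]
print(layer_types)
# → ['full_attention', 'sliding_attention', 'sliding_attention',
#    'full_attention', 'sliding_attention', 'sliding_attention']
```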
scoring_func (`str`, *optional*, defaults to `"sigmoid"`):
    The scoring function used for the MoE router.
If it's always sigmoid, we can simply leave it out.
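For context, a minimal pure-Python sketch of what sigmoid-based MoE routing does (the function name and top-k value are illustrative, not the model's actual implementation):

```python
import math

def route(logits, top_k=2):
    """Score each expert's logit with a sigmoid and pick the top_k experts.

    Illustrative sketch only: a fixed sigmoid scoring function like this is
    why a configurable scoring_func knob would be unnecessary.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores

experts, scores = route([0.5, -1.0, 2.0, 0.0])
print(experts)  # → [2, 0]  (indices of the two highest-scoring experts)
```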
return outputs

class MiMoV2FlashModel(MiMoV2FlashPreTrainedModel):
In general this might be completely inheritable.
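To illustrate what "completely inheritable" means in the modular style: the modular file would just subclass an existing model and the converter generates the full modeling file. A stand-in base class is used below so the snippet is self-contained (the real parent, e.g. Qwen2Model, is an assumption):

```python
# Illustrative sketch of the modular pattern (stand-in base class; in a real
# modular_*.py you would inherit from an existing transformers model such as
# Qwen2Model -- that parent choice is an assumption here):
class Qwen2Model:  # stand-in for transformers' Qwen2Model
    def __init__(self, config=None):
        self.config = config

class MiMoV2FlashModel(Qwen2Model):
    pass  # no overrides needed when the architecture matches the parent
```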
)

class MiMoV2FlashForCausalLM(MiMoV2FlashPreTrainedModel):
This should inherit from another model as well.
attentions=outputs.attentions,
)

def prepare_inputs_for_generation(
This should not be needed; it looks like very outdated code that is no longer used on our side.
return model_inputs

def _prepare_4d_causal_attention_mask(
See my comment about the masks
We might need to define the tokenizer in auto as well; that would need a double check. It depends on whether it works with the tokenizers backend or really needs the Qwen2 tokenizer 👀
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, mimo_v2_flash
Thank you @Rocketknight1 and @vasqu for the thorough review, and I appreciate your efforts to give me proper guidance! 🫡 I have completed the full refactor to the modular style and addressed all architectural feedback. Here's the detailed breakdown of the changes:
@Aznix07 It still doesn't use modular at all, could you take the comments into account? LLM agents are powerful, but it seems like it isn't working here. Furthermore, some details seem to be missing, like sink attention (looking at #42995), cc @Aaraviitkgp (sorry about noticing it just now); it's hard to keep track of multiple PRs at times. I'd like to give this PR another chance, but please properly work on this; don't blindly trust the agent to "just" work. There are other issues as well, like the tests not even running, wrong import structure, very old outdated patterns we no longer use, etc.
@Aznix07 If possible we can work together on this, if you are OK with it?

@Aaraviitkgp, yeah we can do it if you don't mind.

@Aznix07 Add me as a collaborator to your fork.

@Aznix07 Any update?
What does this PR do?
This PR adds support for the MiMo-V2-Flash architecture from Xiaomi (reference: XiaomiMiMo/MiMo-V2-Flash).
MiMo-V2-Flash is a large-scale Mixture-of-Experts (MoE) model (309B params / 15B active) that introduces several architectural innovations:
A smaller Value (V) head dimension (`v_head_dim=128`) than the Query/Key (Q, K) heads (`head_dim=192`).

Implementation Details
- `MiMoV2FlashConfig`.
- `MiMoV2FlashModel` and `MiMoV2FlashForCausalLM`.
- `MiMoV2FlashAttention`: handles the dimension mismatch and partial RoPE.
- `MiMoV2FlashMoE`: implements the sigmoid-based routing logic.
- Registered with `AutoConfig`, `AutoModel`, and `AutoModelForCausalLM`.
- `convert_mimo_v2_flash_weights_to_hf.py` to handle sharded weights and key remapping from the original repo.
- Tests under `tests/models/mimo_v2_flash/`.

Fixes #42954
Before Submitting
Who Can Review?
Models: @ArthurZucker @Cyrilvallez @SunMarc