Allow loading Qwen Thinker 'base' models without generative head #45457

tomaarsen merged 3 commits into huggingface:main
Conversation
Currently, for qwen2_5_omni and qwen3_omni_moe, you can only load the 'Talker' variant, i.e. with the audio output. Now, you should also be able to load the 'base' models to get the token embeddings, etc. The glmasr_encoder, audioflamingo3_encoder, voxtral_encoder, etc. work similarly.
```python
    ("pvt", "PvtModel"),
    ("pvt_v2", "PvtV2Model"),
    ("qwen2", "Qwen2Model"),
    ("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration"),
```
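For context, pairs like the ones in the diff above live in an `OrderedDict` that the auto classes consult by a config's `model_type`. A toy sketch of that lookup (illustrative only, not the real `transformers` internals):

```python
from collections import OrderedDict

# Toy stand-in for an auto-class name mapping like the one in the diff
# above; the real transformers registry is considerably more involved.
MODEL_MAPPING_NAMES = OrderedDict(
    [
        ("pvt", "PvtModel"),
        ("pvt_v2", "PvtV2Model"),
        ("qwen2", "Qwen2Model"),
        ("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration"),
    ]
)


def resolve_class_name(model_type: str) -> str:
    """Return the class name registered for a config's model_type."""
    if model_type not in MODEL_MAPPING_NAMES:
        raise ValueError(f"Unrecognized model type: {model_type}")
    return MODEL_MAPPING_NAMES[model_type]


print(resolve_class_name("qwen2_5_omni_thinker"))
# Qwen2_5OmniThinkerForConditionalGeneration
```

The point of the PR is simply that, without such an entry, the Thinker variant has no way to resolve through the auto machinery at all.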
This has to be Qwen2_5OmniThinkerModel if we add the key, i.e. the one without an LM head. Then Qwen2_5OmniThinkerForConditionalGeneration can be mapped in the image-text-to-text dict.
I don't think it makes sense to hold a generative model in the base classes' mapping.
Fair enough. Damn, I thought the class was improperly named; I missed that it inherits from GenerationMixin and only noticed that it doesn't implement anything generate-related itself. Sorry about that.
Perhaps it fits well in MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES? That's where I originally placed them before I thought they were headless.
> Models that accept text and optionally multimodal data in inputs and can generate text and optionally multimodal data.
The latest Sentence Transformers can also work well with this.
- Tom Aarsen
Yeah, it's a pure VLM part of the model, so we can just put it in image-text-to-text. MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES copies from it, so it will also have access.
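The "copies from it" relationship can be sketched as one mapping seeded from another; the constant names mirror the real ones, but the wiring here is purely illustrative:

```python
from collections import OrderedDict

# Sketch: the image-text-to-text mapping holds the VLM classes...
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
    [("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration")]
)

# ...and the multimodal-LM mapping copies those entries, so anything added
# to the first is reachable through the second as well.
MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES = OrderedDict(
    MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
)

print("qwen2_5_omni_thinker" in MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES)
# True
```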
Done, this now runs instead, which makes way more sense:

```python
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

from transformers import AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>
```
zucchini-nlp left a comment:
Nice! Don't we need the base model without a head in AutoModel?
There's not really a good [...]. The closest would be to export [...]. I think perhaps we can leave that be and merge this as-is?
Yes, this is the way if we need [...].
ST (Sentence Transformers) used to require [...].
[For maintainers] Suggested jobs to run (before merge): run-slow: auto
#45018 was quite a big change, but I think the idea is that I don't have to specify anything in the main configuration files anymore. The script still seems to run, but can you double-check that I should indeed be good to just not specify anything in the main configuration files, @zucchini-nlp?

```python
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

from transformers import AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>
```
Oops, it got merged first. Yes, it should just click now for all possible configs in the codebase, so we only need to manually add the correct model-mapping type :)
What does this PR do?
Currently, for qwen2_5_omni and qwen3_omni_moe, you can only load the 'Talker' variant, i.e. with the audio output. This is a bit like only being able to load a checkpoint with `AutoModelForCausalLM` while `AutoModel` can't be used. This is a bottleneck for these models. Especially the former is affected, as there are actual models here, like https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-3B, https://huggingface.co/Haon-Chen/e5-omni-3B, etc. These are embedding models that don't need the talker; they just need the token embeddings, which is why they rely on the base model, i.e. the thinker. I'd like to extend the change to qwen3_omni_moe_thinker too, so that embedding models can be trained with qwen3_omni_moe.

Currently, this LCO model requires `trust_remote_code=True`, or can be loaded only with `Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(...)`. This should not be needed to load a multimodal base model, in my opinion. With this PR, you can load this model without `trust_remote_code`.

Code Agent Policy
I did generate the changes, but checked them over, and they match what I would have done.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp