
Allow loading Qwen Thinker 'base' models without generative head #45457

Merged
tomaarsen merged 3 commits into huggingface:main from tomaarsen:feat/load_thinker_base
Apr 16, 2026

Conversation

@tomaarsen
Member

What does this PR do?

Currently, for qwen2_5_omni and qwen3_omni_moe, you can only load the 'Talker' variant, i.e. with the audio output. This is a bit like only being able to load a checkpoint with AutoModelForCausalLM while AutoModel can't be used. This is a bottleneck for these models:

Especially the former is affected, as there are actual models here, like https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-3B, https://huggingface.co/Haon-Chen/e5-omni-3B, etc. These are embedding models that don't need the talker; they just need the token embeddings, which is why they rely on the base model, i.e. the thinker. I'd like to extend the change to qwen3_omni_moe_thinker too, so that embedding models can be trained with qwen3_omni_moe.

Currently, this LCO model requires trust_remote_code=True with:

  "auto_map": {
    "AutoConfig": "modeling_lco_omni.Qwen2_5OmniThinkerConfig",
    "AutoModel": "modeling_lco_omni.Qwen2_5OmniThinkerForConditionalGeneration"
  },

and

# Re-exported so `auto_map` in config.json can resolve the Thinker classes;
# `qwen2_5_omni_thinker` is shipped by transformers but not in `AutoConfig`.
from transformers import Qwen2_5OmniThinkerConfig, Qwen2_5OmniThinkerForConditionalGeneration

__all__ = [
    "Qwen2_5OmniThinkerConfig",
    "Qwen2_5OmniThinkerForConditionalGeneration",
]

Failing that, the model can only be loaded explicitly with Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(...).

This should not be needed to load a multimodal base model, in my opinion. With this PR, you can load this model without trust_remote_code:

from transformers import AutoModel

model = AutoModel.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

I did generate the changes, but checked them over, and they match what I would have done.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp


Currently, for qwen2_5_omni and qwen3_omni_moe, you can only load the 'Talker' variant, i.e. with the audio output. Now, you should also be able to load the 'base' models to get the token embeddings, etc.

The glmasr_encoder, audioflamingo3_encoder, voxtral_encoder, etc. work similarly.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

("pvt", "PvtModel"),
("pvt_v2", "PvtV2Model"),
("qwen2", "Qwen2Model"),
("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration"),
Member

This has to be Qwen2_5OmniThinkerModel if we add the key, i.e. the one without an lm head. Then Qwen2_5OmniThinkerForConditionalGeneration can be mapped in the image-text-to-text dict.

Member

I don't think it makes sense to hold a generative model in base classes' mapping

Member Author

@tomaarsen tomaarsen Apr 15, 2026

Fair enough. Damn, I thought the class was improperly named because I noticed it doesn't implement anything generate-related itself, but I missed that it inherits from GenerationMixin. Sorry about that.

Perhaps it fits well in MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES? That's where I originally placed them before I thought they were headless.

Models that accept text and optionally multimodal data in inputs
and can generate text and optionally multimodal data.

The latest Sentence Transformers can also work well with this.


Member

Yeah, it's the pure VLM part of the model, so we can just put it in image-text-to-text. MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES copies from it, so it will also have access.
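Concretely, the outcome of this review thread is that the Thinker entry moves out of the headless base-model mapping and into the image-text-to-text one. A sketch of the resulting entry (the file path and surrounding entries are assumptions based on the diff above):

```python
# Sketch (assumption): how the entry would land in
# src/transformers/models/auto/modeling_auto.py after this review.
# The generative Thinker class belongs in the image-text-to-text
# mapping rather than the headless base-model mapping:
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = [
    # ...
    ("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration"),
    # ...
]
```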

@tomaarsen tomaarsen marked this pull request as draft April 15, 2026 12:55
@tomaarsen
Member Author

tomaarsen commented Apr 15, 2026

Done. This now runs instead, which makes way more sense:

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

from transformers import AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

@tomaarsen tomaarsen marked this pull request as ready for review April 15, 2026 13:27
Member

@zucchini-nlp zucchini-nlp left a comment

Nice! We don't need the base model without head in AutoModel?

@tomaarsen
Member Author

There's not really a good AutoModel target per se, I think, i.e. no single class that accepts multimodal inputs but doesn't have a language modeling head. E.g. the Qwen2_5OmniThinkerForConditionalGeneration class initializes the Qwen2_5OmniAudioEncoder, Qwen2_5OmniVisionEncoder, and Qwen2_5OmniThinkerTextModel alongside an lm_head Linear.

The closest would be to export qwen2_5_omni_audio_encoder with Qwen2_5OmniAudioEncoderConfig and Qwen2_5OmniAudioEncoder (and idem for visual/text), this would likely be viable for AutoModel.

I think perhaps we can leave that be and merge this as-is?
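For intuition on what AutoModel support for the sub-encoders would involve: the Auto-class machinery is at heart a registry from config types to model classes, so exporting the per-modality encoder configs would give AutoModel keys to dispatch on. A minimal stdlib-only sketch of that pattern (every name below is illustrative, not a real transformers API):

```python
# Minimal sketch of the registry pattern behind the Auto classes.
# Illustrative only: none of these names are real transformers APIs.
class AutoRegistry:
    """Maps a config type to the model class that should load it."""

    def __init__(self):
        self._mapping = {}

    def register(self, config_cls, model_cls):
        # transformers' AutoModel.register(ConfigClass, ModelClass)
        # follows an analogous shape.
        self._mapping[config_cls] = model_cls

    def from_config(self, config):
        # Dispatch on the type of the config instance.
        try:
            model_cls = self._mapping[type(config)]
        except KeyError:
            raise ValueError(f"No model registered for {type(config).__name__}")
        return model_cls(config)


class ThinkerConfig:
    pass


class ThinkerModel:
    def __init__(self, config):
        self.config = config


registry = AutoRegistry()
registry.register(ThinkerConfig, ThinkerModel)
model = registry.from_config(ThinkerConfig())
print(type(model).__name__)  # ThinkerModel
```

Exporting per-modality encoder configs, as suggested above, would amount to adding one such config-to-model entry per encoder.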


@zucchini-nlp
Member

The closest would be to export qwen2_5_omni_audio_encoder with Qwen2_5OmniAudioEncoderConfig and Qwen2_5OmniAudioEncoder (and idem for visual/text), this would likely be viable for AutoModel.

Yes, this is the way if we need AutoModel support. Just making sure ST doesn't need to load a base model. If it works for ST, feel free to merge as is

@tomaarsen
Member Author

ST used to require AutoModel support, but as of last week's release, it can also wrap other Auto-classes 🤗


@tomaarsen tomaarsen requested a review from zucchini-nlp April 16, 2026 11:51
@zucchini-nlp zucchini-nlp enabled auto-merge April 16, 2026 11:51
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@tomaarsen
Member Author

#45018 was quite a big change, but I think the idea is that I don't have to specify anything in the main configuration files anymore. The script still seems to run, but can you double-check that I'm indeed good to leave the main configuration files untouched, @zucchini-nlp?

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

from transformers import AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

@zucchini-nlp zucchini-nlp disabled auto-merge April 16, 2026 11:54
@zucchini-nlp
Member

Oops, it got merged first. Yes, it should just click now for all possible configs in the codebase, so we only need to manually add the correct model-mapping type :)

@tomaarsen tomaarsen added this pull request to the merge queue Apr 16, 2026
Merged via the queue into huggingface:main with commit bc7ee23 Apr 16, 2026
28 checks passed
@tomaarsen tomaarsen deleted the feat/load_thinker_base branch April 16, 2026 12:24