
Allow loading Qwen Thinker 'base' models without generative head #45457

Merged
tomaarsen merged 3 commits into huggingface:main from tomaarsen:feat/load_thinker_base
Apr 16, 2026

Conversation

@tomaarsen
Member

What does this PR do?

Currently, for qwen2_5_omni and qwen3_omni_moe, you can only load the 'Talker' variant, i.e. with the audio output. This is a bit like only being able to load a checkpoint with AutoModelForCausalLM while AutoModel can't be used. This is a bottleneck for these models:

Especially the former is affected, as there are actual models here, like https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-3B, https://huggingface.co/Haon-Chen/e5-omni-3B, etc. These are embedding models that don't need the talker; they just need the token embeddings, which is why they rely on the base model, i.e. the thinker. I'd like to extend the change to qwen3_omni_moe_thinker too, so that embedding models can be trained with qwen3_omni_moe.

Currently, this LCO model requires trust_remote_code=True with:

  "auto_map": {
    "AutoConfig": "modeling_lco_omni.Qwen2_5OmniThinkerConfig",
    "AutoModel": "modeling_lco_omni.Qwen2_5OmniThinkerForConditionalGeneration"
  },

and

# Re-exported so `auto_map` in config.json can resolve the Thinker classes;
# `qwen2_5_omni_thinker` is shipped by transformers but not in `AutoConfig`.
from transformers import Qwen2_5OmniThinkerConfig, Qwen2_5OmniThinkerForConditionalGeneration

__all__ = [
    "Qwen2_5OmniThinkerConfig",
    "Qwen2_5OmniThinkerForConditionalGeneration",
]

Failing that, the model can only be loaded explicitly with Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(...).

This should not be needed to load a multimodal base model, in my opinion. With this PR, you can load this model without trust_remote_code:

from transformers import AutoModel

model = AutoModel.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

I did generate the changes, but checked them over, and they match what I would have done.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp


Currently, for qwen2_5_omni and qwen3_omni_moe, you can only load the 'Talker' variant, i.e. with the audio output. Now, you should also be able to load the 'base' models to get the token embeddings, etc.

The glmasr_encoder, audioflamingo3_encoder, voxtral_encoder, etc. work similarly.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

("pvt", "PvtModel"),
("pvt_v2", "PvtV2Model"),
("qwen2", "Qwen2Model"),
("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration"),
Member

This has to be Qwen2_5OmniThinkerModel if we add the key, i.e. the one without an lm head. Then Qwen2_5OmniThinkerForConditionalGeneration can be mapped in the image-text-to-text dict.

Member

I don't think it makes sense to hold a generative model in base classes' mapping

Member Author

@tomaarsen tomaarsen Apr 15, 2026

Fair enough. Damn, I thought the class was improperly named because I noticed it doesn't implement anything generate-related itself, but I missed that it inherits from GenerationMixin. Sorry about that.

Perhaps it fits well in MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES? That's where I originally placed them before I thought they were headless.

Models that accept text and optionally multimodal data in inputs
and can generate text and optionally multimodal data.

The latest Sentence Transformers can also work well with this.


Member

Yeah, it's the pure VLM part of the model, so we can just put it in image-text-to-text. MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES copies from it, so it will also have access.
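Concretely, the outcome of this review thread is that the Thinker entry moves out of the headless base-model mapping and into the image-text-to-text one. A sketch of the resulting entry (the file path and surrounding entries are assumptions based on the diff above):

```python
# Sketch (assumption): how the entry would land in
# src/transformers/models/auto/modeling_auto.py after this review.
# The generative Thinker class belongs in the image-text-to-text
# mapping rather than the headless base-model mapping:
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = [
    # ...
    ("qwen2_5_omni_thinker", "Qwen2_5OmniThinkerForConditionalGeneration"),
    # ...
]
```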

@tomaarsen tomaarsen marked this pull request as draft April 15, 2026 12:55
@tomaarsen
Member Author

tomaarsen commented Apr 15, 2026

Done. This now runs instead, which makes way more sense:

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

from transformers import AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

@tomaarsen tomaarsen marked this pull request as ready for review April 15, 2026 13:27
Member

@zucchini-nlp zucchini-nlp left a comment

Nice! We don't need the base model without head in AutoModel?

@tomaarsen
Member Author

There's not really a good AutoModel target per se, I think, i.e. no single class that accepts multimodal inputs but doesn't have a language modeling head. E.g. the Qwen2_5OmniThinkerForConditionalGeneration class initializes the Qwen2_5OmniAudioEncoder, Qwen2_5OmniVisionEncoder, and Qwen2_5OmniThinkerTextModel alongside an lm_head Linear.

The closest would be to export qwen2_5_omni_audio_encoder with Qwen2_5OmniAudioEncoderConfig and Qwen2_5OmniAudioEncoder (and idem for visual/text), this would likely be viable for AutoModel.

I think perhaps we can leave that be and merge this as-is?
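For intuition on what AutoModel support for the sub-encoders would involve: the Auto-class machinery is at heart a registry from config types to model classes, so exporting the per-modality encoder configs would give AutoModel keys to dispatch on. A minimal stdlib-only sketch of that pattern (every name below is illustrative, not a real transformers API):

```python
# Minimal sketch of the registry pattern behind the Auto classes.
# Illustrative only: none of these names are real transformers APIs.
class AutoRegistry:
    """Maps a config type to the model class that should load it."""

    def __init__(self):
        self._mapping = {}

    def register(self, config_cls, model_cls):
        # transformers' AutoModel.register(ConfigClass, ModelClass)
        # follows an analogous shape.
        self._mapping[config_cls] = model_cls

    def from_config(self, config):
        # Dispatch on the type of the config instance.
        try:
            model_cls = self._mapping[type(config)]
        except KeyError:
            raise ValueError(f"No model registered for {type(config).__name__}")
        return model_cls(config)


class ThinkerConfig:
    pass


class ThinkerModel:
    def __init__(self, config):
        self.config = config


registry = AutoRegistry()
registry.register(ThinkerConfig, ThinkerModel)
model = registry.from_config(ThinkerConfig())
print(type(model).__name__)  # ThinkerModel
```

Exporting per-modality encoder configs, as suggested above, would amount to adding one such config-to-model entry per encoder.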


@zucchini-nlp
Member

The closest would be to export qwen2_5_omni_audio_encoder with Qwen2_5OmniAudioEncoderConfig and Qwen2_5OmniAudioEncoder (and idem for visual/text), this would likely be viable for AutoModel.

Yes, this is the way if we need AutoModel support. Just making sure ST doesn't need to load a base model. If it works for ST, feel free to merge as is

@tomaarsen
Member Author

ST used to require AutoModel support, but as of last week's release, it can also wrap other Auto-classes 🤗


@tomaarsen tomaarsen requested a review from zucchini-nlp April 16, 2026 11:51
@zucchini-nlp zucchini-nlp enabled auto-merge April 16, 2026 11:51
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@tomaarsen
Member Author

#45018 was quite a big change, but I think the idea is that I don't have to specify anything in the main configuration files anymore. The script still seems to run, but can you double-check that I'm indeed good to leave the main configuration files untouched, @zucchini-nlp?

from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

from transformers import AutoModelForMultimodalLM

model = AutoModelForMultimodalLM.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-3B")
print(type(model))
# <class 'transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration'>

@zucchini-nlp zucchini-nlp disabled auto-merge April 16, 2026 11:54
@zucchini-nlp
Member

Oops, it got merged first. Yes, it should just click now for all possible configs in the codebase, so we only need to manually add the correct model-mapping type :)

@tomaarsen tomaarsen added this pull request to the merge queue Apr 16, 2026
Merged via the queue into huggingface:main with commit bc7ee23 Apr 16, 2026
28 checks passed
@tomaarsen tomaarsen deleted the feat/load_thinker_base branch April 16, 2026 12:24