Conversation

zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Aug 21, 2025

What does this PR do?

Unblocks #39722 so we can use check_model_inputs in VLMs copied from llava. Otherwise #39722 would need to re-define model classes to change the forward pass and delete the output_xxx kwargs.
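For context, a rough sketch of what a check_model_inputs-style decorator does (this is illustrative only, not the actual transformers implementation): it resolves the output_attentions / output_hidden_states / return_dict flags against the model config before calling forward, so each forward signature no longer has to declare and handle them itself.

```python
from functools import wraps


def check_model_inputs(forward):
    """Illustrative sketch: resolve output_* flags from the config so the
    wrapped forward does not need to declare them explicitly."""

    @wraps(forward)
    def wrapper(self, *args, **kwargs):
        config = self.config
        # Fall back to config defaults when the caller did not pass a flag
        for flag in ("output_attentions", "output_hidden_states"):
            if kwargs.get(flag) is None:
                kwargs[flag] = getattr(config, flag, False)
        if kwargs.get("return_dict") is None:
            kwargs["return_dict"] = getattr(config, "use_return_dict", True)
        return forward(self, *args, **kwargs)

    return wrapper
```

With such a decorator in place, a model's forward can drop the output_xxx parameters entirely, which is what lets #39722 avoid re-defining the forward pass in every copied VLM class.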

@zucchini-nlp zucchini-nlp requested a review from qubvel August 21, 2025 10:05
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@qubvel qubvel left a comment


Thanks, looks very nice! Did you check if that's breaking in terms of actually returned hidden_states? They might be different, especially for the last item in the list. Still worth unbloating, but we might need to add 🚨

@zucchini-nlp zucchini-nlp changed the title Allow check_model_inputs in core VLMs 🚨 Allow check_model_inputs in core VLMs Aug 21, 2025
@zucchini-nlp
Member Author

Did you check if that's breaking in terms of actually returned hidden_states

I checked test_hidden_states and test_attentions, which only check sizes and lengths. I will also run a check with an actual model to verify the values, to be sure it works.

@zucchini-nlp
Member Author

run-slow: aya_vision, got_ocr2, internvl, llava, mistral3, perception_lm, vipllava

@zucchini-nlp zucchini-nlp requested a review from qubvel August 22, 2025 11:26
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/aya_vision', 'models/got_ocr2', 'models/internvl', 'models/llava', 'models/mistral3', 'models/perception_lm', 'models/vipllava']
quantizations: [] ...

Contributor

@qubvel qubvel left a comment


Thanks! I still see output_attentions/output_hidden_states/return_dict in modeling files, but I suppose they should be gone entirely once you enable check_model_inputs. Am I missing something?

Comment on lines -1072 to +1082
collected_outputs[key] = collected_outputs[key][:-1]
if hasattr(outputs, "vision_hidden_states"):
collected_outputs[key] = collected_outputs[key][:-1]
collected_outputs[key] += (outputs.vision_hidden_states,)
elif hasattr(outputs, "last_hidden_state"):
collected_outputs[key] = collected_outputs[key][:-1]
Member Author


Not all models have a last_hidden_state. Some multimodal models return only vision_tower.hidden_states or similar, so cropping the last hidden state is not correct there. We want to crop it only if it is immediately replaced with the hidden state produced after the final layer norm (e.g. llama).
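The cropping rule described above can be sketched as follows (function and attribute names are illustrative, not the actual transformers test utility): the last entry of the collected hidden states is dropped only when the output carries a replacement for it, either a vision_hidden_states tuple or a final post-layer-norm last_hidden_state.

```python
def merge_hidden_states(collected, outputs):
    """Illustrative sketch: substitute the raw last-layer hidden state only
    when the output provides a replacement for it."""
    collected = tuple(collected)
    vision_hs = getattr(outputs, "vision_hidden_states", None)
    last_hs = getattr(outputs, "last_hidden_state", None)
    if vision_hs is not None:
        # The raw last-layer entry is superseded by the vision tower's states
        collected = collected[:-1] + (vision_hs,)
    elif last_hs is not None:
        # Replace the pre-norm last-layer entry with the post-layer-norm one
        collected = collected[:-1] + (last_hs,)
    # Otherwise: no replacement available, so cropping would lose a state
    return collected
```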

@zucchini-nlp
Member Author

run-slow: aimv2, aya_vision, blip, blip_2, got_ocr2, idefics, idefics2, idefics3, instructblip, instructblipvideo, internvl, janus, llava, mistral3, ovis2, phi4_multimodal

Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/aimv2', 'models/aya_vision', 'models/blip', 'models/blip_2', 'models/got_ocr2', 'models/idefics', 'models/idefics2', 'models/idefics3', 'models/instructblip', 'models/instructblipvideo', 'models/internvl', 'models/janus', 'models/llava', 'models/mistral3', 'models/ovis2', 'models/phi4_multimodal']
quantizations: [] ...

@zucchini-nlp zucchini-nlp requested a review from qubvel August 29, 2025 11:16
@zucchini-nlp
Member Author

zucchini-nlp commented Aug 29, 2025

Should be ready now. I updated all VLMs (and had to update Siglip, from which some models copy). In base models that consist of backbones, I didn't add the check_model_inputs decorator, since outputs are handled by each backbone and we just need to pass along what the backbone returned.

Tests are very flaky, so CI will be red

@zucchini-nlp
Member Author

run-slow: aimv2, aya_vision, blip, blip_2, got_ocr2, idefics, idefics2, idefics3, instructblip, instructblipvideo, internvl, janus, llava, mistral3, ovis2, smolvlm

Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/aimv2', 'models/aya_vision', 'models/blip', 'models/blip_2', 'models/got_ocr2', 'models/idefics', 'models/idefics2', 'models/idefics3', 'models/instructblip', 'models/instructblipvideo', 'models/internvl', 'models/janus', 'models/llava', 'models/mistral3', 'models/ovis2', 'models/smolvlm']
quantizations: [] ...

Contributor

@qubvel qubvel left a comment


Very nice! I did not review every line, but the main idea is to double-check the return type hints to make sure they match the changed outputs.

@zucchini-nlp zucchini-nlp requested a review from qubvel September 1, 2025 15:13
Contributor

@qubvel qubvel left a comment


Thanks for the update, looks good to me!

This one might be relevant to merge to make sure users are aware if they request output_attentions=True

Also, for Siglip (and later for CLIP), it might be worthwhile to keep hidden_states collection explicit because that's something natural for vision models to output for many downstream tasks, such as segmentation. As far as I understand, CLIP hidden_states are used in diffusers as well. I did it partially in

I can rebase on your version if it gets merged earlier.

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) September 2, 2025 11:06
@zucchini-nlp
Member Author

Okay, I guess all loading issues are resolved in main, so a rebase will fix the flakiness once and for all

@zucchini-nlp
Member Author

Why is it flaky, it downloads with hf hub 😭

Contributor

github-actions bot commented Sep 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, altclip, aya_vision, blip, blip_2, clip, clipseg, git, got_ocr2, idefics, idefics2, idefics3, instructblip, instructblipvideo, internvl, janus

@zucchini-nlp zucchini-nlp merged commit 4e195f1 into huggingface:main Sep 5, 2025
23 checks passed