🚨 Allow check_model_inputs in core VLMs #40342
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks, looks very nice! Did you check whether that's breaking in terms of the actually returned hidden_states? They might be different, especially for the last item in the list. Still worth unbloating, but we might need to add 🚨 to the title.
Checked the
run-slow: aya_vision, got_ocr2, internvl, llava, mistral3, perception_lm, vipllava
This comment contains run-slow, running the specified jobs: models: ['models/aya_vision', 'models/got_ocr2', 'models/internvl', 'models/llava', 'models/mistral3', 'models/perception_lm', 'models/vipllava']
Thanks! I still see output_attentions/output_hidden_states/return_dict in modeling files, but I suppose they should be gone entirely once you enable check_model_inputs. Am I missing something?
```python
collected_outputs[key] = collected_outputs[key][:-1]
if hasattr(outputs, "vision_hidden_states"):
    collected_outputs[key] = collected_outputs[key][:-1]
    collected_outputs[key] += (outputs.vision_hidden_states,)
elif hasattr(outputs, "last_hidden_state"):
    collected_outputs[key] = collected_outputs[key][:-1]
```
Not all models have a last_hidden_state. In some multimodals we return only vision_tower.hidden_states or similar, and thus cropping the last hidden state is not correct. We want to crop it only if it is being replaced right away with the hidden state after the final layer_norm (e.g. llama).
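That rule — crop the last collected hidden state only when a replacement actually exists — can be sketched as a small guard. This is a hypothetical helper (the names `collected_outputs` and `outputs` follow the snippet above, but the function itself is illustrative, not the merged code):

```python
from types import SimpleNamespace


def merge_final_hidden_state(collected_outputs, outputs, key="hidden_states"):
    """Swap the last collected hidden state for its post-layernorm
    replacement only when such a replacement exists; otherwise keep
    everything, since cropping would drop a real layer output."""
    if hasattr(outputs, "vision_hidden_states"):
        # vision towers: the replacement is the vision hidden states
        collected_outputs[key] = collected_outputs[key][:-1] + (outputs.vision_hidden_states,)
    elif getattr(outputs, "last_hidden_state", None) is not None:
        # text decoders (e.g. llama): the last entry is superseded by the
        # hidden state after the final layer_norm
        collected_outputs[key] = collected_outputs[key][:-1] + (outputs.last_hidden_state,)
    return collected_outputs


# toy demo with placeholder hidden states
merged = merge_final_hidden_state(
    {"hidden_states": ("h0", "h1", "h2")},
    SimpleNamespace(last_hidden_state="post_norm"),
)
untouched = merge_final_hidden_state(
    {"hidden_states": ("h0", "h1")},
    SimpleNamespace(),  # neither attribute: nothing is cropped
)
```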
run-slow: aimv2, aya_vision, blip, blip_2, got_ocr2, idefics, idefics2, idefics3, instructblip, instructblipvideo, internvl, janus, llava, mistral3, ovis2, phi4_multimodal
This comment contains run-slow, running the specified jobs: models: ['models/aimv2', 'models/aya_vision', 'models/blip', 'models/blip_2', 'models/got_ocr2', 'models/idefics', 'models/idefics2', 'models/idefics3', 'models/instructblip', 'models/instructblipvideo', 'models/internvl', 'models/janus', 'models/llava', 'models/mistral3', 'models/ovis2', 'models/phi4_multimodal']
Should be ready now, updated all VLMs (+ had to update Siglip, from which some models copy). In base models which consist of backbones, I didn't add any. Tests are very flaky, so CI will be red
run-slow: aimv2, aya_vision, blip, blip_2, got_ocr2, idefics, idefics2, idefics3, instructblip, instructblipvideo, internvl, janus, llava, mistral3, ovis2, smolvlm
This comment contains run-slow, running the specified jobs: models: ['models/aimv2', 'models/aya_vision', 'models/blip', 'models/blip_2', 'models/got_ocr2', 'models/idefics', 'models/idefics2', 'models/idefics3', 'models/instructblip', 'models/instructblipvideo', 'models/internvl', 'models/janus', 'models/llava', 'models/mistral3', 'models/ovis2', 'models/smolvlm']
Very nice! I did not review every line, but the main idea is to double-check return type hints to make sure they match with the changed outputs
Thanks for the update, looks good to me!
This one might be relevant to merge to make sure users are aware if they request output_attentions=True
Also, for Siglip (and later for CLIP), it might be worthwhile to keep hidden_states collection explicit because that's something natural for vision models to output for many downstream tasks, such as segmentation. As far as I understand, CLIP hidden_states are used in diffusers as well. I did it partially in
can rebase on your version if it would be merged earlier
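The suggestion to keep hidden_states collection explicit in vision backbones might look like this minimal sketch (a toy encoder loop under assumed names, not Siglip's actual module code):

```python
def encode(layers, hidden_state, output_hidden_states=False):
    """Run an encoder stack while optionally collecting every
    intermediate hidden state, including the input embedding."""
    all_hidden_states = (hidden_state,) if output_hidden_states else None
    for layer in layers:
        hidden_state = layer(hidden_state)
        if output_hidden_states:
            all_hidden_states += (hidden_state,)
    return hidden_state, all_hidden_states


# toy demo with stand-in "layers"
last, collected = encode(
    [lambda x: x + 1, lambda x: x * 2], 1, output_hidden_states=True
)
```

Keeping the collection in the loop (rather than behind a capture hook) makes the per-layer outputs directly available to downstream tasks such as segmentation.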
Oke, ig all loading issues are resolved in main, so rebase will help with flakiness once and for all
Why is it flaky, it downloads with hf hub 😭
[For maintainers] Suggested jobs to run (before merge) run-slow: aimv2, altclip, aya_vision, blip, blip_2, clip, clipseg, git, got_ocr2, idefics, idefics2, idefics3, instructblip, instructblipvideo, internvl, janus
What does this PR do?
Unblocks #39722 so we can use check_model_inputs in VLMs copied from llava. Otherwise #39722 would need to re-define model classes to change the forward pass and delete the output_xxx kwargs.
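For context, the general pattern behind a decorator like check_model_inputs can be sketched as follows. This is a toy illustration of kwargs-defaulting under assumed names (DummyConfig, DummyModel), not the actual transformers implementation:

```python
import functools


def check_model_inputs(forward):
    """Toy decorator: pull flags like output_hidden_states out of **kwargs,
    falling back to config defaults, so the forward body no longer needs
    explicit output_xxx parameters in its signature."""
    @functools.wraps(forward)
    def wrapper(self, *args, **kwargs):
        for flag in ("output_attentions", "output_hidden_states", "return_dict"):
            kwargs.setdefault(flag, getattr(self.config, flag, False))
        return forward(self, *args, **kwargs)
    return wrapper


class DummyConfig:
    output_hidden_states = True


class DummyModel:
    config = DummyConfig()

    @check_model_inputs
    def forward(self, x, **kwargs):
        # the decorator guarantees the flag is present in kwargs
        return x, kwargs["output_hidden_states"]
```

This is why centralizing the check lets PRs like this one drop the repeated output_attentions/output_hidden_states/return_dict boilerplate from every forward pass.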