Qwen2-VL: clean-up and add more tests #33354
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding this!
```
    Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`, *optional*):
    Activates truncation to cut input sequences longer than `max_length` to `max_length`.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
```
Let's keep `return_tensors` in the docstring for now. Although we're removing it as a defined kwarg, users need to set it to be able to properly use the processor for training.
```python
all_kwargs = {
    "common_kwargs": {"return_tensors": "pt"},
    "images_kwargs": {"size": {"height": 214, "width": 214}},
    "videos_kwargs": {"size": {"height": 214, "width": 214}},
```
I'm guessing the processing doesn't work if the video and image have different sizes set, as we can't batch?
Could we test passing one and not the other?
Yes, passing only one worked for me, but I'll add better tests as well.
OK, I found that the size wasn't being used at all, my bad for missing that when reviewing. Since Qwen2-VL can't resize to an exact given size (it has to smart-resize, keeping as much resolution as possible), the size dict will be different from other models'. Added that to the docstring.
Also modified all tests to follow the new size format and to actually use it if users pass it with the correct dict keys. WDYT? Should we also find a way to support tuple-size? I'll also update the model doc page soon.
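For context, the smart-resize behaviour mentioned above can be sketched roughly like this. This is a hedged approximation, not the exact library code: the function name, the factor of 28 (patch size × merge size), and the default pixel bounds are assumptions here. The idea is to round both sides to multiples of the patch grid, then rescale (keeping the aspect ratio) so the total pixel count lands inside `[min_pixels, max_pixels]`.

```python
import math


def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Sketch: snap (height, width) to multiples of `factor`, then rescale so
    the total pixel count stays within [min_pixels, max_pixels]."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Shrink proportionally; floor keeps the product under max_pixels.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Grow proportionally; ceil keeps the product above min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

This is why a `{"height": ..., "width": ...}` size dict can't be honoured exactly: the output dimensions are derived from the pixel budget, not requested directly.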
Updated the docs and added an example showing how users can change the max/min resolution at the processor's call time.
> Should we also find a way to support tuple-size?

No. Previously `size` was a tuple, and this caused issues:
- some models stored it as `(width, height)`
- some models just stored an int; within that set of models, some would then resize to `(int, int)`, others to `(int, int * height / width)`, etc.
- some models used a tuple to denote the longest and shortest edges

There was a huge level of ambiguity and inconsistency across the configs, which made image processor behaviour difficult to predict without having to dig into the code.
Technically, tuples are supported via `get_size_dict`, but this is for backwards compatibility rather than an encouraged way to pass in inputs.
Adding `min_pixels` and `max_pixels` isn't a light decision, as it sets in stone how this will be expressed for all configs going forward. I think this change is OK, but we should be cautious about just adding keys to the size dict. You'll also need to update some other code which verifies the keys in the size dicts -- namely `VALID_SIZE_DICT_KEYS`.
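The key-validation idea being discussed could look roughly like this. A minimal sketch: the constant name mirrors the `VALID_SIZE_DICT_KEYS` mentioned above, but the exact set of accepted key combinations here is an assumption, with the new Qwen2-VL-style `{"min_pixels", "max_pixels"}` combination added.

```python
# Whitelist of accepted key combinations for a size dict (illustrative).
VALID_SIZE_DICT_KEYS = (
    {"height", "width"},
    {"shortest_edge"},
    {"shortest_edge", "longest_edge"},
    {"longest_edge"},
    {"min_pixels", "max_pixels"},  # the new Qwen2-VL-style combination
)


def is_valid_size_dict(size_dict):
    """Return True if the dict's keys exactly match one allowed combination."""
    if not isinstance(size_dict, dict):
        return False
    return set(size_dict.keys()) in VALID_SIZE_DICT_KEYS
```

Checking for an exact key-set match (rather than a subset) is what catches "weird keys" sneaking into the dict.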
> No. Previously `size` was a tuple and this caused issues as:

I see. That actually makes it easier to use max/min pixels now, because I didn't want to add the new possible keys to `VALID_SIZE_DICT_KEYS` -- Qwen is, and will probably remain, the only model accepting such kwargs. But then we couldn't call `get_size_dict` to validate that the size dict is passed correctly.
OK, now I added it to `VALID_SIZE_DICT_KEYS` and added one more test with incorrect sizes passed as kwargs, in which case the defaults from `self.min_pixels` have to be used.
Ready for re-review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Left a comment re the size dict -- just some updates needed in `image_processing_utils.py` to reflect this addition.
```diff
  self.temporal_patch_size = temporal_patch_size
  self.merge_size = merge_size
- self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels}
+ self.size = size if size is not None else {"min_pixels": min_pixels, "max_pixels": max_pixels}
```
One last thing - if `size` can now be set in the input, i.e. the values of `min_pixels` and `max_pixels` in the size dict take precedence over the `min_pixels` and `max_pixels` input arguments, then we should pop `max_pixels` and `min_pixels` in the `to_dict` method so that only `size` is saved out into the config.
That is, if someone were to look at the config file, it wouldn't be clear which value takes precedence; you'd have to look at the code to understand, which we should avoid.
Hmm, that's a good point. But I think priority should be given to `max_pixels` and `min_pixels` rather than the size dict, since that is the default setting for Qwen2-VL.
OK - what I'd recommend is just not accepting the `size` argument and sticking with `min_pixels` and `max_pixels` then.
```python
min_pixels = size.get("min_pixels", self.min_pixels)
max_pixels = size.get("max_pixels", self.max_pixels)
```
This is funny - it implies `size` can be a dict without the `min_pixels` and `max_pixels` values, but that isn't compatible with this image processor, e.g. `"height"` and `"width"` from the size dict wouldn't be used. It would be better to raise an error if these keys are missing, rather than default to `self.min_pixels` or `self.max_pixels`, which creates ambiguity in the behaviour and can mask improper initialization in the init.
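The stricter behaviour suggested here could be sketched as follows. The function name and error message are hypothetical; the point is to fail loudly when the required keys are absent instead of silently falling back to instance defaults.

```python
def validate_qwen_size(size):
    """Raise if the size dict lacks the keys this processor actually uses."""
    missing = {"min_pixels", "max_pixels"} - set(size)
    if missing:
        raise ValueError(
            "size must contain 'min_pixels' and 'max_pixels' keys for this "
            f"image processor, got keys {sorted(size)}"
        )
    return size["min_pixels"], size["max_pixels"]
```

A dict like `{"height": 224, "width": 224}` then fails immediately rather than being quietly ignored.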
Yeah, this was the first thing I thought of, but I didn't want to raise errors on usage that ran smoothly earlier. We can raise an error and explain how to pass `size`, and that is also the place where we can say that passing a tuple isn't good either (related to the comment below). Since we added min/max pixels to `VALID_SIZE_DICT_KEYS`, I think we did it to be able to run `get_size_dict` and thereby validate that no weird keys are used.
I am pro checking the size dict, but maybe we could also check it within the Qwen2 code only and not add it to `VALID_SIZE_DICT_KEYS`.
| """ | ||
| do_resize = do_resize if do_resize is not None else self.do_resize | ||
| size = size if size is not None else self.size | ||
| size = get_size_dict(size) |
This won't work - it will produce a dictionary with `"height"`/`"width"`, `"shortest_edge"`, or `"shortest_edge"`/`"longest_edge"` keys, which are not compatible with this image processor.
Thanks for iterating on this and the clean up!
* clean-up on qwen2-vl and add generation tests
* add video tests
* Update tests/models/qwen2_vl/test_processing_qwen2_vl.py
* fix and add better tests
* Update src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py
* update docs and address comments
* Update docs/source/en/model_doc/qwen2_vl.md
* Update docs/source/en/model_doc/qwen2_vl.md
* update
* remove size at all

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR standardizes the Qwen2-VL processors and adds tests for processing and generation. One thing to note: the generation tests currently operate on the text-only modality. I will add support for multimodal tests very soon.
TODO: