Add LLaVa 1.6 #29012
Conversation
Thanks for iterating on adding this model!
The image processor still needs a bit of work. Namely, numpy equivalents of PIL methods should be written instead of converting back and forth between numpy and PIL, which is very inefficient and also breaks assumptions about the data format of the images as they go through the processor.
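As a minimal sketch of what "numpy equivalents of PIL methods" could look like, here is a pad-to-square helper written directly against numpy arrays, with no PIL round-trip. The function name and signature are illustrative, not the PR's actual code:

```python
import numpy as np

def expand_to_square(image: np.ndarray, background_color) -> np.ndarray:
    """Pad an (H, W, C) numpy image to a square canvas.

    Hypothetical numpy equivalent of a PIL-based padding step: the image
    stays a numpy array throughout, so dtype and channel layout are
    preserved and no PIL conversion is needed.
    """
    height, width, num_channels = image.shape
    if height == width:
        return image
    size = max(height, width)
    # Fill the canvas with the background color, then paste the image centered.
    result = np.full((size, size, num_channels), background_color, dtype=image.dtype)
    y_offset = (size - height) // 2
    x_offset = (size - width) // 2
    result[y_offset : y_offset + height, x_offset : x_offset + width] = image
    return result
```

Keeping everything in numpy also makes the intermediate data format explicit, which is the assumption-breaking issue mentioned above.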
Please make sure to look over the PR before asking for review - there's a file which should be included in the PR.
The modified forward pass looks OK to me in terms of logic, but it does make things more complex than before. Would be good to have @ArthurZucker's input on whether that's OK.
dim=0,
)

device = "cuda:3"
to be resolved
# verify inputs
if model_id == "liuhaotian/llava-v1.6-mistral-7b":
    # replace -200 by 32000 (ask Arthur)
The comment should reflect this
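The remapping the quoted diff refers to can be sketched as follows. The original LLaVa code marks image positions in `input_ids` with the placeholder index -200; during checkpoint conversion this placeholder has to be replaced with the id of the `<image>` token actually added to the tokenizer (assumed here to be 32000 for this checkpoint; the surrounding token ids are made up for illustration):

```python
import torch

# Hypothetical sketch of the -200 -> 32000 replacement from the diff above.
# -200 is the placeholder image-token index used by the original LLaVa code;
# 32000 is assumed to be the id of the <image> token added to the tokenizer.
input_ids = torch.tensor([[1, 733, -200, 28705, 2]])  # made-up example sequence
image_token_id = 32000
input_ids[input_ids == -200] = image_token_id
```

After this step, the sequence contains only valid vocabulary ids, which is what the converted model expects.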
Thanks, I've addressed all comments! Will update the conversion script to use cuda:0 when all checkpoints are uploaded (the 34b models still need to be uploaded).
- 🔴 changes to the LlavaForCausalLM are just too big to say it's still Llava. We can either have a new forward, or do what we always do and add a new "Architecture" to properly separate both. I don't mind having it in the same file, but the expected input is different from normal Llava.
- 🔴 the changes are not explained in the slightest, so coming from the Llava code I wrote I have no idea what is happening, why we need to iterate over the images, etc. This needs to be explained and needs a separate code path.
- 🟢 on pretty much all the other changes (processor tests etc.) and I'll leave those to @amyeroberts!
# NOTE we only support multimodal_patch_merge_type == "spatial_unpad"
new_image_features = []
for image_idx, image_feature in enumerate(image_features):
The goal is not to follow the original implementation. That means more work for us, of course, but we did not follow the original implementation either and still successfully vectorized the forward pass 😉
Ok, thanks for your review.
I feel like adding a whole new architecture is a bit of an overkill, as the only change in LLaVa 1.6 is that pixel values are now 5-dimensional instead of 4-dimensional: (batch_size, num_patches, num_channels, height, width). LLaVa 1.6 is also still part of the same paper; they just extend it to work on arbitrary image resolutions. Could you clarify what you mean by a new forward? Since LlavaForConditionalGeneration can only have a single forward method defined, I assume you mean defining a whole new class as well?
I'll add some more comments to explain what's happening.
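The shape difference discussed here can be sketched as follows. This is a minimal illustration with made-up sizes and a stubbed-out vision tower, not the PR's actual implementation:

```python
import torch

# LLaVa 1.6 pixel values carry an extra num_patches dimension compared to
# LLaVa 1.5's (batch_size, num_channels, height, width). Sizes are made up.
batch_size, num_patches, num_channels, height, width = 2, 5, 3, 336, 336
pixel_values = torch.randn(batch_size, num_patches, num_channels, height, width)

# One way to reuse a vision tower that expects 4-dim input is to fold the
# patch dimension into the batch dimension and unfold it afterwards.
flat = pixel_values.flatten(0, 1)  # (batch_size * num_patches, C, H, W)
features = flat.mean(dim=(2, 3))   # stand-in for the real vision tower output
image_features = features.view(batch_size, num_patches, -1)
```

Folding the patch dimension like this keeps the forward pass vectorized rather than iterating over images one by one.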
Closing this PR in favor of #29586
What does this PR do?
This PR adds the new LLaVa 1.6 model.
To do:
- check whether `image_sizes` are the same for all images
- make `image_sizes` a tensor instead of a list