Add FastVLM #41112
Conversation
@kamila-chay is this ready for review?
@jackzhxng sorry for the delay, I had trouble with GPU access, today/tomorrow I should be able to wrap everything up and it will be ready :)
Force-pushed from 57adb18 to 15ceb09
Hi @zucchini-nlp, we can start the review process. Everything is ready except for the tests; I'm writing them now, but I can do that while responding to any comments from your end :)
zucchini-nlp
left a comment
Thanks @kamila-chay, looks really nice and clean. Let's standardize a bit and then we can request a core maintainer's review.
docs/source/en/model_doc/fast_vlm.md
Outdated
```md
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with image-to-text transformers (here using Llava as an example).

<PipelineTag pipeline="image-to-text"/>

- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference.
- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎
```
not needed unless it is directly related to FastVLM
Deleted
```python
# only this value makes sense in FastVLM (we can't have a CLS token in conv layers)
if vision_feature_select_strategy != "full":
    raise ValueError(
        f"Unexpected select feature strategy: {vision_feature_select_strategy}, Only 'full' is supported in FastVLM."
    )

if any(
    layer >= 0
    for layer in (
        vision_feature_layer if isinstance(vision_feature_layer, Iterable) else [vision_feature_layer]
    )
):
    raise ValueError(f"Only negative vision feature layer values are supported. Got {vision_feature_layer}")
```
let's delete this. Having in config init is enough for now
Deleted
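For context, a minimal sketch of what keeping this check only in the config init could look like; the class name, signature, and defaults below are illustrative assumptions, not the actual FastVLM config code.

```python
# Sketch only: moving the validation into the config __init__ instead of the modeling
# code, as suggested above. Names and defaults are assumptions for illustration.
class FastVLMConfig:
    def __init__(self, vision_feature_select_strategy="full", vision_feature_layer=-1, **kwargs):
        if vision_feature_select_strategy != "full":
            raise ValueError(
                f"Unexpected select feature strategy: {vision_feature_select_strategy}. "
                "Only 'full' is supported in FastVLM (there is no CLS token in conv layers)."
            )
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
```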
```python
>>> prompt = "<|im_start|>user\n<image>\nWhat's the content of the image?<|im_end|>\n<|im_start|>assistant\n"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
```
Let's use the chat template for the code snippet.
Changed :)
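For reference, a chat-template-based snippet along the lines the review asked for might look as follows; the checkpoint id is a placeholder and the generation settings are arbitrary, so treat this as a sketch rather than the exact snippet that was committed.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder checkpoint id -- substitute the actual converted FastVLM weights.
model_id = "apple/fastvlm-checkpoint"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What's the content of the image?"},
        ],
    }
]

# apply_chat_template builds the prompt (including the image placeholder) and tokenizes it.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```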
Force-pushed from ed83be0 to ef4bce6
@ArthurZucker @Cyrilvallez it's been over a month, pinging you in case you forgot 😊
I did, I am sorry
ArthurZucker
left a comment
LGTM perfect 🤗 Thanks @zucchini-nlp for the review!
run-slow: auto, fast_vlm
This comment contains models: ["models/auto", "models/fast_vlm"]
CI Results / Model CI Report: ❌ Failed tests
@kamila-chay we can merge when the CI turns green. The weight-tying failures aren't related to the model specifically; I will rebase and see if it was fixed on main already. There was a huge refactor recently and the main branch is a bit unstable. upd: We need to run
@bot /style
Style fix runs successfully without any file modified.
oh wow, style bot doesn't run fix copies, didn't know about that 🥲
Ok, let me look at this issue and fix it, I've been away for a few days and didn't see
A simple fix-copies might fix it all :)
run-slow: fast_vlm
This comment contains models: ["models/fast_vlm"]
CI Results / Model CI Report: ❌ Failed tests
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, fast_vlm
Force-pushed from c53f1c5 to e73894c
Force-pushed from e73894c to 8e8c12a
Everything's green now, slow tests should be ok too :) @zucchini-nlp
run-slow: fast_vlm
This comment contains models: ["models/fast_vlm"]
CI Results: ✅ No failing test specific to this PR 🎉!
Thanks, great work @kamila-chay! Let's merge 🚀
What does this PR do?
This PR adds FastVLM from Apple. The model's architecture is very similar to LLaVA; the main difference is that it uses a very fast hybrid vision encoder called FastViTHD. Timm's FastViT implementation is used for the vision tower, and the LLaVA modality connector has been slightly modified.
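To make that description concrete, here is a rough sketch of the layout it refers to; the timm variant, hidden sizes, and class names are illustrative assumptions, not the code added by this PR.

```python
import timm
import torch
from torch import nn

# Rough sketch: a timm FastViT backbone as the vision tower plus a LLaVA-style MLP
# projector into the language model's embedding space. Variant name is illustrative.
vision_tower = timm.create_model("fastvit_t8", pretrained=False, num_classes=0)


class LlavaStyleProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LM embedding dimension."""

    def __init__(self, vision_hidden_size: int, text_hidden_size: int):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.linear_2(self.act(self.linear_1(image_features)))


# Example pass: the hybrid encoder produces a low-resolution feature map, which is
# flattened into tokens and projected (text hidden size of 1024 is a placeholder).
pixel_values = torch.randn(1, 3, 256, 256)
feat_map = vision_tower.forward_features(pixel_values)   # (1, C, H, W)
tokens = feat_map.flatten(2).transpose(1, 2)              # (1, H*W, C)
projector = LlavaStyleProjector(vision_hidden_size=tokens.shape[-1], text_hidden_size=1024)
image_embeds = projector(tokens)
```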
Addresses #38765.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp @ariG23498