
Conversation

@kamila-chay (Contributor) commented Sep 23, 2025

What does this PR do?

This PR adds FastVLM from Apple. The model's architecture is very similar to LLaVA's; the main difference is that it uses a very fast hybrid encoder called FastViTHD. timm's FastViT implementation is reused, and the LLaVA modality connector has been slightly modified.
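To make that description concrete, here is a minimal sketch of the LLaVA-style composition being described. The module names, hidden sizes, and feature-map reshape below are assumptions for illustration, not the PR's actual code:

from torch import nn

class FastVlmSketch(nn.Module):
    # Illustrative only: a conv-based FastViTHD tower, a LLaVA-style MLP
    # connector, and a language model, composed as the description says.
    def __init__(self, vision_tower, language_model, vision_hidden_size=3072, text_hidden_size=896):
        super().__init__()
        self.vision_tower = vision_tower  # e.g. a timm FastViT backbone (assumed)
        # LLaVA-style two-layer MLP connector mapping vision features into the LM's embedding space
        self.multi_modal_projector = nn.Sequential(
            nn.Linear(vision_hidden_size, text_hidden_size),
            nn.GELU(),
            nn.Linear(text_hidden_size, text_hidden_size),
        )
        self.language_model = language_model

    def encode_images(self, pixel_values):
        # Conv encoders emit a (batch, channels, h, w) feature map rather than
        # a CLS token; flatten it into a (batch, h*w, channels) token sequence.
        features = self.vision_tower(pixel_values)
        features = features.flatten(2).transpose(1, 2)
        return self.multi_modal_projector(features)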

Addresses #38765

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @ariG23498

@kamila-chay changed the title from [WIP] Add fast vlm to [WIP] Add FastVLM on Sep 23, 2025
@Rocketknight1 (Member)

cc @zucchini-nlp

@jackzhxng (Contributor)

@kamila-chay is this ready for review?

@kamila-chay (Contributor, Author)

@jackzhxng sorry for the delay, I had trouble with GPU access. I should be able to wrap everything up today or tomorrow, and then it will be ready :)

@kamila-chay force-pushed the add_FastVLM branch 3 times, most recently from 57adb18 to 15ceb09 on October 8, 2025
@kamila-chay changed the title from [WIP] Add FastVLM to Add FastVLM on Oct 8, 2025
@kamila-chay (Contributor, Author)

Hi @zucchini-nlp, we can start the review process; everything is ready except for the tests. I'm writing them now, but I can do that while responding to any comments from your end :)

@zucchini-nlp (Member) left a comment


Thanks @kamila-chay, looks really nice and clean. Let's standardize a bit, and then we can request a core maintainer's review.

Comment on lines 216 to 224
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with image-to-text transformers (here using Llava as an example).

<PipelineTag pipeline="image-to-text"/>

- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference.
- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎

@zucchini-nlp (Member)

not needed unless it is directly related to FastVLM

@kamila-chay (Contributor, Author)

Deleted

Comment on lines 218 to 231
# only this value makes sense in FastVLM (we can't have a CLS token in conv layers)
if vision_feature_select_strategy != "full":
    raise ValueError(
        f"Unexpected select feature strategy: {vision_feature_select_strategy}, Only 'full' is supported in FastVLM."
    )

if any(
    layer >= 0
    for layer in (
        vision_feature_layer if isinstance(vision_feature_layer, Iterable) else [vision_feature_layer]
    )
):
    raise ValueError(f"Only negative vision feature layer values are supported. Got {vision_feature_layer}")

@zucchini-nlp (Member)

Let's delete this; having it in the config init is enough for now.

@kamila-chay (Contributor, Author)

Deleted
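
For readers following along, a minimal sketch of what that validation could look like inside the config's __init__. The class skeleton and defaults here are assumptions; only the checks mirror the deleted snippet above:

from collections.abc import Iterable

class FastVlmConfigSketch:
    # Illustrative skeleton only; the real FastVlmConfig has many more options.
    def __init__(self, vision_feature_select_strategy="full", vision_feature_layer=-1):
        # The conv-based encoder has no CLS token, so only "full" makes sense.
        if vision_feature_select_strategy != "full":
            raise ValueError(
                f"Unexpected select feature strategy: {vision_feature_select_strategy}. "
                "Only 'full' is supported in FastVLM."
            )
        # Accept a single layer index or an iterable of indices; all must be negative.
        layers = vision_feature_layer if isinstance(vision_feature_layer, Iterable) else [vision_feature_layer]
        if any(layer >= 0 for layer in layers):
            raise ValueError(f"Only negative vision feature layer values are supported. Got {vision_feature_layer}")
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer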

Comment on lines 307 to 311
>>> prompt = "<|im_start|>user\n<image>\nWhat's the content of the image?<|im_end|>\n<|im_start|>assistant\n"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
@zucchini-nlp (Member)

Let's use a chat template for the code snippet.

@kamila-chay (Contributor, Author)

Changed :)
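
For context, a minimal sketch of what the chat-template version of the snippet could look like; the checkpoint ID is illustrative, not necessarily the final Hub repo:

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "apple/FastVLM-0.5B"  # illustrative checkpoint ID (assumed)
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What's the content of the image?"},
        ],
    }
]

# apply_chat_template builds the prompt (including the image placeholder) and
# preprocesses the image in a single call.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.float16)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))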

@kamila-chay force-pushed the add_FastVLM branch 2 times, most recently from ed83be0 to ef4bce6 on October 21, 2025
@kamila-chay (Contributor, Author)

@ArthurZucker @Cyrilvallez it's been over a month, pinging you in case you forgot 😊

@ArthurZucker (Collaborator)

I did, I am sorry!

@ArthurZucker (Collaborator) left a comment

LGTM perfect 🤗 Thanks @zucchini-nlp for the review!

@ArthurZucker (Collaborator)

run-slow: auto, fast_vlm

@github-actions (Contributor)

This comment contains run-slow, running the specified jobs:

models: ["models/auto", "models/fast_vlm"]
quantizations: []

@github-actions (Contributor)

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • fast_vlm:
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_can_load_with_global_device_set
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_attn_implementation_composite_models
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_bc_torch_dtype
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_can_load_with_device_context_manager
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_can_use_safetensors
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_cannot_load_with_meta_device_context_manager
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_08_fp32_pad_left_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_09_fp32_pad_left
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_10_fp32_pad_left_no_attn_mask_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_11_fp32_pad_left_no_attn_mask
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_12_fp32_pad_right_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_13_fp32_pad_right
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_14_fp32_pad_right_no_attn_mask_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_15_fp32_pad_right_no_attn_mask
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_16_bf16_pad_left_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_17_bf16_pad_left
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_18_bf16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_19_bf16_pad_left_no_attn_mask
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_20_bf16_pad_right_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_21_bf16_pad_right
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_22_bf16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_23_bf16_pad_right_no_attn_mask
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_24_fp32_pad_left_output_attentions
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_from_pretrained_no_checkpoint
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_load_save_without_tied_weights
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_model_base_model_prefix
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_model_weights_reload_no_missing_tied_weights
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_save_load
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_sdpa_can_dispatch_composite_models
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationModelTest::test_tied_weights_keys
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationIntegrationTest::test_generation_no_images
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationIntegrationTest::test_small_model_integration_test
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationIntegrationTest::test_small_model_integration_test_batch

@zucchini-nlp (Member) commented Nov 24, 2025

@kamila-chay we can merge when the CI turns green. The weight-tying failures aren't related to the model specifically; I will rebase and see if it was already fixed on main. There was a huge refactor recently, and the main branch is a bit unstable.

Update: we need to run make fix-copies to update the model first.

@zucchini-nlp (Member)

@bot /style

@github-actions (Contributor) commented Nov 24, 2025

Style fix ran successfully without modifying any files.

@zucchini-nlp (Member)

Oh wow, the style bot doesn't run fix-copies; I didn't know that 🥲

@kamila-chay (Contributor, Author)

OK, let me look at this issue and fix it. I've been away for a few days and didn't see it.

@zucchini-nlp (Member)

A simple fix-copies might fix it all :)

@zucchini-nlp (Member)

run-slow: fast_vlm

@github-actions (Contributor)

This comment contains run-slow, running the specified jobs:

models: ["models/fast_vlm"]
quantizations: []

@github-actions (Contributor)

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • fast_vlm:
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationIntegrationTest::test_generation_no_images
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationIntegrationTest::test_small_model_integration_test
    tests/models/fast_vlm/test_modeling_fast_vlm.py::FastVlmForConditionalGenerationIntegrationTest::test_small_model_integration_test_batch

@github-actions (Contributor) commented Dec 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, fast_vlm

@kamila-chay (Contributor, Author) commented Dec 1, 2025

Everything's green now; slow tests should be OK too :) @zucchini-nlp

@zucchini-nlp (Member)

run-slow: fast_vlm

@github-actions (Contributor) commented Dec 2, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/fast_vlm"]
quantizations: []

@github-actions (Contributor) commented Dec 2, 2025

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@zucchini-nlp (Member)

Thanks, great work @kamila-chay ! Let's merge 🚀

@zucchini-nlp merged commit a649767 into huggingface:main on Dec 2, 2025
24 checks passed