Add FastVLM #41112
Conversation
@kamila-chay is this ready for review?
@jackzhxng sorry for the delay, I had trouble with GPU access, today/tomorrow I should be able to wrap everything up and it will be ready :)
Force-pushed from 57adb18 to 15ceb09
Hi @zucchini-nlp, we can start the review process. Everything is ready except for the tests; I'm writing them now, but I can do that while responding to any comments from your end :)
zucchini-nlp
left a comment
Thanks @kamila-chay, looks really nice and clean. Let's standardize a bit and then we can request a core maintainer's review.
docs/source/en/model_doc/fast_vlm.md
Outdated
```md
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with image-to-text transformers (here using Llava as an example).

<PipelineTag pipeline="image-to-text"/>

- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference.
- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎
```
not needed unless it is directly related to FastVLM
Deleted
```python
# only this value makes sense in FastVLM (we can't have a CLS token in conv layers)
if vision_feature_select_strategy != "full":
    raise ValueError(
        f"Unexpected select feature strategy: {vision_feature_select_strategy}, Only 'full' is supported in FastVLM."
    )

if any(
    layer >= 0
    for layer in (
        vision_feature_layer if isinstance(vision_feature_layer, Iterable) else [vision_feature_layer]
    )
):
    raise ValueError(f"Only negative vision feature layer values are supported. Got {vision_feature_layer}")
```
let's delete this. Having in config init is enough for now
Deleted
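For context, a minimal sketch of what keeping this check only in the config init could look like; the class name, signature, and defaults below are illustrative assumptions, not the actual FastVLM config code.

```python
# Sketch only: moving the validation into the config __init__ instead of the modeling
# code, as suggested above. Names and defaults are assumptions for illustration.
class FastVLMConfig:
    def __init__(self, vision_feature_select_strategy="full", vision_feature_layer=-1, **kwargs):
        if vision_feature_select_strategy != "full":
            raise ValueError(
                f"Unexpected select feature strategy: {vision_feature_select_strategy}. "
                "Only 'full' is supported in FastVLM (there is no CLS token in conv layers)."
            )
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
```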
```python
>>> prompt = "<|im_start|>user\n<image>\nWhat's the content of the image?<|im_end|>\n<|im_start|>assistant\n"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
```
Let's use the chat template for the code snippet.
Changed :)
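For reference, a chat-template-based snippet along the lines the review asked for might look as follows; the checkpoint id is a placeholder and the generation settings are arbitrary, so treat this as a sketch rather than the exact snippet that was committed.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder checkpoint id -- substitute the actual converted FastVLM weights.
model_id = "apple/fastvlm-checkpoint"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What's the content of the image?"},
        ],
    }
]

# apply_chat_template builds the prompt (including the image placeholder) and tokenizes it.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```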
Force-pushed from ed83be0 to ef4bce6
@ArthurZucker @Cyrilvallez it's been over a month, pinging you in case you forgot 😊
I did, I am sorry
ArthurZucker
left a comment
LGTM perfect 🤗 Thanks @zucchini-nlp for the review!
run-slow: auto, fast_vlm
This comment contains models: ["models/auto", "models/fast_vlm"]
CI Results / Model CI Report: ❌ Failed tests
@kamila-chay we can merge when the CI turns green. The weight-tying failures aren't related to the model specifically; I will rebase and see if it was fixed on main already. There was a huge refactor recently and the main branch is a bit unstable. upd: We need to run
@bot /style
Style fix runs successfully without any file modified.
oh wow, style bot doesn't run fix copies, didn't know about that 🥲
Ok, let me look at this issue and fix it, I've been away for a few days and didn't see
A simple fix-copies might fix it all :)
run-slow: fast_vlm
This comment contains models: ["models/fast_vlm"]
CI Results / Model CI Report: ❌ Failed tests
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, fast_vlm
Force-pushed from c53f1c5 to e73894c
Force-pushed from e73894c to 8e8c12a
Everything's green now, slow tests should be ok too :) @zucchini-nlp
run-slow: fast_vlm
This comment contains models: ["models/fast_vlm"]
CI Results: ✅ No failing test specific to this PR 🎉!
Thanks, great work @kamila-chay! Let's merge 🚀
What does this PR do?
This PR adds FastVLM from Apple. The model's architecture is very similar to LLaVA; the main difference is that it uses a very fast hybrid vision encoder called FastViTHD. Timm's FastViT implementation is used for the vision tower, and the LLaVA modality connector has been slightly modified.
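To make that description concrete, here is a rough sketch of the layout it refers to; the timm variant, hidden sizes, and class names are illustrative assumptions, not the code added by this PR.

```python
import timm
import torch
from torch import nn

# Rough sketch: a timm FastViT backbone as the vision tower plus a LLaVA-style MLP
# projector into the language model's embedding space. Variant name is illustrative.
vision_tower = timm.create_model("fastvit_t8", pretrained=False, num_classes=0)


class LlavaStyleProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LM embedding dimension."""

    def __init__(self, vision_hidden_size: int, text_hidden_size: int):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.linear_2(self.act(self.linear_1(image_features)))


# Example pass: the hybrid encoder produces a low-resolution feature map, which is
# flattened into tokens and projected (text hidden size of 1024 is a placeholder).
pixel_values = torch.randn(1, 3, 256, 256)
feat_map = vision_tower.forward_features(pixel_values)   # (1, C, H, W)
tokens = feat_map.flatten(2).transpose(1, 2)              # (1, H*W, C)
projector = LlavaStyleProjector(vision_hidden_size=tokens.shape[-1], text_hidden_size=1024)
image_embeds = projector(tokens)
```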
Addresses #38765.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp @ariG23498