Merged
33 commits
dbe7ec2
Added an initial conversion script
kamila-chay Sep 23, 2025
977d05e
Added a modular where FastVLM is different from LlaVA
kamila-chay Sep 23, 2025
4e3679f
Improved the conversion script
kamila-chay Sep 24, 2025
dd2da9a
Adjusted the conversion script
kamila-chay Oct 6, 2025
9715630
Removed redundant labels from FastViT & improved the template
kamila-chay Oct 7, 2025
a75c141
Added docs and changed default config
kamila-chay Oct 7, 2025
030ad24
Fix default config
kamila-chay Oct 8, 2025
af251d2
Fix default config
kamila-chay Oct 8, 2025
17b9e89
Fixed layer feature handling and more docs
kamila-chay Oct 8, 2025
51010f5
Fixed documentation
kamila-chay Oct 8, 2025
1e92007
Style fixed
kamila-chay Oct 8, 2025
dc5e83e
Some small fixes
kamila-chay Oct 8, 2025
64e24ae
Improved the example script to be more inclusive
kamila-chay Oct 8, 2025
cf6336a
Fixes after the rebase
kamila-chay Oct 8, 2025
d428d60
Made the code and docs more readable and consistent
kamila-chay Oct 8, 2025
adcea05
Some fixes from the review
kamila-chay Oct 16, 2025
d8664ec
Reverted back to last layer only
kamila-chay Oct 16, 2025
065b79d
Typos fixed
kamila-chay Oct 17, 2025
3b9d907
added initial tests - some still failing
kamila-chay Oct 17, 2025
4ec2d23
Style and quality fixes
kamila-chay Oct 17, 2025
6204dc2
Updated modular according to the review
kamila-chay Oct 17, 2025
c15f4d7
Tests passing and some suggested generic improvements
kamila-chay Oct 20, 2025
8d7ebfa
Docs updated with another usage tip and an auto model
kamila-chay Oct 20, 2025
b3140d4
Reversed changes to test_can_intialize_on_meta becuase it's not fully…
kamila-chay Oct 21, 2025
dd30f1f
Some tweaks
kamila-chay Oct 21, 2025
d5b6329
Typo fix
kamila-chay Oct 21, 2025
3ee84e9
Consistency fixed
kamila-chay Oct 21, 2025
46d401d
Review comment
kamila-chay Oct 22, 2025
c637077
Redundant config attr deleted
kamila-chay Oct 22, 2025
591f134
Merge branch 'main' into add_FastVLM
ArthurZucker Nov 21, 2025
d1a52a7
Merge branch 'main' into add_FastVLM
zucchini-nlp Nov 24, 2025
631553d
Consistency fixed
kamila-chay Nov 24, 2025
8e8c12a
Fixed integration tests after rebase
kamila-chay Dec 1, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
Original file line number Diff line number Diff line change
@@ -1020,6 +1020,8 @@
title: Emu3
- local: model_doc/evolla
title: Evolla
- local: model_doc/fast_vlm
title: FastVLM
- local: model_doc/flava
title: FLAVA
- local: model_doc/florence2
175 changes: 175 additions & 0 deletions docs/source/en/model_doc/fast_vlm.md
@@ -0,0 +1,175 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

*This model was released on 2025-05-06 and added to Hugging Face Transformers on 2025-10-07.*

# FastVLM

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging reparameterizable convolutional layers, scaled input resolution, and a reduced number of visual tokens, FastVLM delivers high accuracy with exceptional efficiency. Its optimized architecture enables deployment even on edge devices, achieving ultra-low TTFT (time to first token) without sacrificing performance.

The model was proposed in [FastVLM: Efficient Vision Encoding for Vision Language Models](https://huggingface.co/papers/2412.13303) by Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel and Hadi Pouransari.

The abstract from the paper is the following:

*Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM—a model that achieves an optimized trade-off between resolution, latency, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVa-OneVision at the highest resolution (1152×1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU and DocVQA, using the same 0.5B LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller.*

This model was contributed by [Kamila](https://github.com/kamila-chay).
The original code can be found [here](https://github.com/apple/ml-fastvlm).

## Usage tips

- We advise users to set `padding_side="left"` for batched generation, as it leads to more accurate results. Simply call `processor.tokenizer.padding_side = "left"` before generating.

- Note that the model has not been explicitly trained to process multiple images in the same prompt. Although this is technically possible, you may experience inaccurate results.
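To see why left padding matters for batched generation, here is a minimal, library-free sketch of padding token-id sequences on either side (the `pad_id` of `0` and the helper name are illustrative, not part of the FastVLM API):

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad variable-length token-id lists to a common length."""
    width = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        pad = [pad_id] * (width - len(s))
        padded.append(pad + s if side == "left" else s + pad)
    return padded

batch = [[5, 6, 7], [8, 9]]
# Left padding keeps every sequence's last real token in the final
# position, which is where generation appends new tokens.
print(pad_batch(batch, side="left"))   # [[5, 6, 7], [0, 8, 9]]
print(pad_batch(batch, side="right"))  # [[5, 6, 7], [8, 9, 0]]
```

With right padding, the shorter prompt would end in pad tokens and the model would start generating after them, which is why left padding gives more accurate results.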

**Important:**

Hugging Face models use SDPA by default; however, this model’s visual backbone supports only eager attention, so it automatically falls back to `"eager"`.

If you want to use a different attention implementation in the language decoder, make sure to set it explicitly, for example:

`model = FastVlmForConditionalGeneration.from_pretrained("KamilaMila/FastVLM-0.5B", attn_implementation={"text_config": "flash_attention_2"})`

Setting it for the entire model, e.g.

`model = FastVlmForConditionalGeneration.from_pretrained("KamilaMila/FastVLM-0.5B", attn_implementation="flash_attention_2")`

will result in an error.

### Formatting Prompts with Chat Templates

Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.

**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

## Usage examples

### Single input inference


```python
import torch
from transformers import AutoProcessor, FastVlmForConditionalGeneration

# Load the model in half-precision
model = FastVlmForConditionalGeneration.from_pretrained("KamilaMila/FastVLM-0.5B", dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("KamilaMila/FastVLM-0.5B")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.bfloat16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
```


### Batched inference

FastVLM also supports batched inference. Here is how you can do it:

```python
import torch
from transformers import AutoProcessor, FastVlmForConditionalGeneration

# Load the model in half-precision
model = FastVlmForConditionalGeneration.from_pretrained("KamilaMila/FastVLM-0.5B", dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("KamilaMila/FastVLM-0.5B")


# Prepare a batch of two prompts
conversation_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    [conversation_1, conversation_2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    padding=True,
    return_tensors="pt"
).to(model.device, torch.bfloat16)


# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
```


## Note on reproducing the original implementation

To match the logits of the [original implementation](https://github.com/apple/ml-fastvlm), use float32. In half precision the logit difference is larger, due to tiny differences in how some ops are implemented in timm.
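The effect can be seen without loading the model at all. Below is a small, self-contained sketch (pure Python, half precision emulated via `struct`; the 3072-term width mirroring the vision hidden size is only illustrative) of how per-op rounding alone makes an accumulated logit drift away from its float32 value:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Accumulate a 3072-term dot product in full vs. half precision,
# mimicking one output logit.
vals = [(i % 7 - 3) * 0.01 for i in range(3072)]
full = sum(v * v for v in vals)

half = 0.0
for v in vals:
    # Round the product and the running sum at every step,
    # as a half-precision kernel effectively does.
    half = to_fp16(half + to_fp16(to_fp16(v) * to_fp16(v)))

print(abs(full - half))  # small but clearly nonzero drift
```

The same accumulation, rounded at every step, lands on a measurably different value; slightly different op implementations (as in timm) shift where the rounding happens and hence the final logits, which is why float32 is needed for an exact match.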

### Using Flash Attention 2

Flash Attention 2 is an even faster attention implementation than SDPA; please refer to the [Flash Attention 2 section of the performance docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one).

## FastVlmConfig

[[autodoc]] FastVlmConfig

## FastVlmModel

[[autodoc]] FastVlmModel

## FastVlmForConditionalGeneration

[[autodoc]] FastVlmForConditionalGeneration
- forward
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -125,6 +125,7 @@
from .falcon import *
from .falcon_h1 import *
from .falcon_mamba import *
from .fast_vlm import *
from .fastspeech2_conformer import *
from .flaubert import *
from .flava import *
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -147,6 +147,7 @@
("falcon", "FalconConfig"),
("falcon_h1", "FalconH1Config"),
("falcon_mamba", "FalconMambaConfig"),
("fast_vlm", "FastVlmConfig"),
("fastspeech2_conformer", "FastSpeech2ConformerConfig"),
("fastspeech2_conformer_with_hifigan", "FastSpeech2ConformerWithHifiGanConfig"),
("flaubert", "FlaubertConfig"),
@@ -580,6 +581,7 @@
("falcon3", "Falcon3"),
("falcon_h1", "FalconH1"),
("falcon_mamba", "FalconMamba"),
("fast_vlm", "FastVlm"),
("fastspeech2_conformer", "FastSpeech2Conformer"),
("fastspeech2_conformer_with_hifigan", "FastSpeech2ConformerWithHifiGan"),
("flan-t5", "FLAN-T5"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -150,6 +150,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("falcon", "FalconModel"),
("falcon_h1", "FalconH1Model"),
("falcon_mamba", "FalconMambaModel"),
("fast_vlm", "FastVlmModel"),
("fastspeech2_conformer", "FastSpeech2ConformerModel"),
("fastspeech2_conformer_with_hifigan", "FastSpeech2ConformerWithHifiGan"),
("flaubert", "FlaubertModel"),
@@ -986,6 +987,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("deepseek_vl_hybrid", "DeepseekVLHybridForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("evolla", "EvollaForProteinText2Text"),
("fast_vlm", "FastVlmForConditionalGeneration"),
("florence2", "Florence2ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
("gemma3", "Gemma3ForConditionalGeneration"),
27 changes: 27 additions & 0 deletions src/transformers/models/fast_vlm/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_fast_vlm import *
    from .modeling_fast_vlm import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
137 changes: 137 additions & 0 deletions src/transformers/models/fast_vlm/configuration_fast_vlm.py
@@ -0,0 +1,137 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/fast_vlm/modular_fast_vlm.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_fast_vlm.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ...configuration_utils import PreTrainedConfig
from ..auto import CONFIG_MAPPING, AutoConfig


class FastVlmConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`FastVlmForConditionalGeneration`]. It is used to instantiate a
    FastVLM model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield the same configuration as the one of FastVLM-7B.

    e.g. [KamilaMila/FastVLM-7B](https://huggingface.co/KamilaMila/FastVLM-7B)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `TimmWrapperConfig` for `fastvit_mci3`):
            The config object or dictionary of the vision backbone.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `Qwen2Config`):
            The config object or dictionary of the text backbone.
        image_token_id (`int`, *optional*, defaults to 151646):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        vision_feature_select_strategy (`str`, *optional*, defaults to `"full"`):
            The feature selection strategy used to select the vision feature from the vision backbone.
            Only `"full"` is supported.
        vision_feature_layer (`Union[int, list[int]]`, *optional*, defaults to -1):
            The index of the layer to select the vision feature. If multiple indices are provided,
            the vision features of the corresponding indices will be concatenated to form the
            vision features. Only `-1` is supported.
        multimodal_projector_bias (`bool`, *optional*, defaults to `True`):
            Whether to use bias in the multimodal projector.

    Example:

    ```python
    >>> from transformers import FastVlmForConditionalGeneration, FastVlmConfig

    >>> # Initializing a FastVLM-7B style configuration
    >>> configuration = FastVlmConfig()

    >>> # Initializing a model from the FastVLM-7B style configuration
    >>> model = FastVlmForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "fast_vlm"
    attribute_map = {
        "image_token_id": "image_token_index",
    }
    sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig}

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        image_token_id=151646,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="full",
        vision_feature_layer=-1,
        multimodal_projector_bias=True,
        **kwargs,
    ):
        self.image_token_id = image_token_id
        self.projector_hidden_act = projector_hidden_act

        if vision_feature_select_strategy != "full":
            raise ValueError(
                f"Unexpected select feature strategy: {vision_feature_select_strategy}. Only 'full' is supported in FastVLM."
            )

        if vision_feature_layer != -1:
            raise ValueError(
                f"Unexpected vision feature layer: {vision_feature_layer}. Only -1 is supported in FastVLM."
            )

        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer

        if isinstance(vision_config, dict):
            vision_config["model_type"] = vision_config.get("model_type", "timm_wrapper")
            vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
        elif vision_config is None:
            vision_config = CONFIG_MAPPING["timm_wrapper"](
                architecture="fastvit_mci3",
                do_pooling=True,
                global_pool="avg",
                hidden_size=3072,
                initializer_range=0.02,
                model_args={"inference_mode": True},
            )

        self.vision_config = vision_config

        if isinstance(text_config, dict):
            text_config["model_type"] = text_config.get("model_type", "qwen2")
            text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["qwen2"](
                hidden_size=3584,
                vocab_size=152128,
                intermediate_size=18944,
                num_attention_heads=28,
                num_key_value_heads=4,
                num_hidden_layers=28,
            )

        self.text_config = text_config
        self.multimodal_projector_bias = multimodal_projector_bias

        super().__init__(**kwargs)


__all__ = ["FastVlmConfig"]