Add Granite 4.1 Vision (granite4_vision) by artem-spector · Pull Request #45597 · huggingface/transformers

artem-spector · 2026-04-23T06:51:56Z

What does this PR do?

Adds built-in support for Granite 4.1 Vision (granite4_vision), IBM's multimodal vision-language model for enterprise document understanding.

Architecture highlights

Vision encoder: SigLIP2 (google/siglip2-so400m-patch16-384), tiled 384×384 patches
Window Q-Former projector: 4×4 patch windows compressed to 2×2 query tokens via cross-attention (downsample_rate="4/8")
DeepStack feature injection: 8 vision-to-LLM injection points across two mechanisms:
- LayerDeepstack: features from 4 vision encoder depths injected at 4 LLM layers (reversed order — deepest vision → earliest LLM)
- SpatialDeepstack: deepest features split into 4 spatial offset groups (TL/TR/BL/BR), injected at 4 later LLM layers
Language model: GraniteForCausalLM (3.5B) with a rank-256 LoRA adapter (same-repo, LM-only)

Files added

File	Purpose
`modular_granite4_vision.py`	Source of truth — inherits from LLaVA-Next, overrides novel components
`configuration_granite4_vision.py`	Config (generated)
`modeling_granite4_vision.py`	Model (generated)
`processing_granite4_vision.py`	Unified processor (generated)
`image_processing_granite4_vision.py`	Torchvision-based image processor
`image_processing_pil_granite4_vision.py`	PIL/NumPy image processor
`tests/models/granite4_vision/`	Modeling, image processing, and processor tests
`docs/source/en/model_doc/granite4_vision.md`	Model documentation

Auto-registration

Config: auto-generated via configuration_granite4_vision.py model_type
Modeling: MODEL_MAPPING_NAMES + MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
Processing + image processing: registered in respective auto files

Tests

Unit tests pass locally (pytest tests/models/granite4_vision/ -x -q)
@slow integration tests load real checkpoint and assert outputs within tolerance
make style and make check-repo pass (3 remaining failures are pre-existing upstream issues: mlinter version mismatch and Sam3Lite incomplete model)

Before submitting

This PR is not a duplicate
I have read the contributor guidelines
The documentation reflects the changes
The tests pass

Full implementation of IBM Granite 4.1 Vision as a built-in HF model: - Modular implementation (modular_granite4_vision.py) - Generated files: config, modeling, image processing, processing - Auto-registration: config, modeling, processing, image processing - Tests: modeling (unit + @slow), image processor, processor - Documentation (docs/source/en/model_doc/granite4_vision.md) - WeightRenaming to handle SiglipVisionModel vision_model. nesting Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Upstream moved CONFIG_MAPPING_NAMES to auto_mappings.py. Add granite4_vision entry there; resolve leftover conflict markers in configuration_auto.py (granite4_vision is already in modeling_auto.py and processing_auto.py). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mappings duplicate Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove granite4_vision from MISSING_IMAGE_PROCESSOR_MAPPING_NAMES (auto-discovered via TorchvisionBackend/PilBackend) - Add granite4-vision to HARDCODED_CONFIG_FOR_MODELS in auto_docstring.py - Add granite4_vision to DOC_MODEL_NAMES_NOT_IN_AUTO in check_repo.py - Fix import sort in models/__init__.py and test file - Regenerate auto_mappings.py via check_auto.py --fix_and_overwrite - Add dates to granite4_vision.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

- Fix processing_auto.py sort order (sort_auto_mappings) - Add hy-v3, openai-privacy-filter, slanet to HARDCODED_CONFIG_FOR_MODELS - Add hy_v3, openai_privacy_filter, slanet to DOC_MODEL_NAMES_NOT_IN_AUTO (new upstream models missing from these registries) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

sam3_vision_model and sam3_vit_model were incorrectly mapped to Sam3LiteTextVisionConfig/Sam3LiteTextViTConfig instead of Sam3VisionConfig/Sam3ViTConfig (and sam3_lite_text module instead of sam3). These are unrelated to granite4_vision; restoring upstream/main values. Signed-off-by: artemspector <artems@il.ibm.com>

…ebase regeneration These three upstream model entries were accidentally removed from CONFIG_MAPPING_NAMES in auto_mappings.py by a previous run of check_auto.py --fix_and_overwrite during an incomplete rebase state. Restoring verbatim from upstream/main. Signed-off-by: artemspector <artems@il.ibm.com>

Signed-off-by: artemspector <artems@il.ibm.com>

zucchini-nlp · 2026-04-23T14:22:51Z

LMK when ready for review, and ig this PR supersedes #45350?

artem-spector · 2026-04-25T08:06:01Z

@zucchini-nlp - yes, this PR supersedes #45350. Its our team that is responsible for producing/release IBM vision models.
This PR is ready for review from my side.

HuggingFaceDocBuilderDev · 2026-04-27T09:24:07Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp

@artem-spector great usage of modular!

Seems like the model uses granite llm as backbone with deepstack features. We will need to add an llm class in that case, since calling each backbone layer manually doesn't align well with our API. We can use modular to copy everything except for a single forward

As per adapters, can you explain how the weights are released? I am not really sure we have to manually add merge_adapters, prob I can suggest a cleaner way

zucchini-nlp · 2026-04-27T10:03:07Z

+```bibtex
+@misc{granite-vision-4.1-4b,
+  title={Granite Vision 4.1},
+  author={IBM Granite Vision Team},
+  year={2026},
+  url={https://huggingface.co/ibm-granite/granite-vision-4.1-4b}
+}
+```


nit: I think we dont need a bibtext entry and as long as there is a link to HF papers/arxiv, that is enought

Done — removed.

zucchini-nlp · 2026-04-27T10:03:47Z

+    device=0,
+    torch_dtype=torch.bfloat16,
+)


nit: these two are by default "auto" so we dont need to manually set

Done — removed.

zucchini-nlp · 2026-04-27T10:04:07Z

+
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"


same here, torch_dtype is "auto" by default and can be deleted

Done — removed.

zucchini-nlp · 2026-04-27T10:04:52Z

+## Notes
+
+- The model includes LoRA adapters. Call `model.merge_lora_adapters()` after loading to merge them into base weights for faster inference.
+
+- Set `padding_side="left"` during batched generation for more accurate results.
+
+```py
+processor.tokenizer.padding_side = "left"
+```
+
+- The model supports specialized task tags for document extraction: `<chart2csv>`, `<chart2summary>`, `<chart2code>`, `<tables_html>`, `<tables_otsl>`, `<tables_json>`. Pass these as the text prompt along with a document image.
+
+- For key-value pair extraction, provide a JSON schema describing the fields to extract. The model returns structured JSON matching the schema.


lets move this block as Usage Tips section, before the usage example code snippets

Done — moved to a "Usage Tips" section before the code examples.

zucchini-nlp · 2026-04-27T10:05:56Z

@@ -0,0 +1,155 @@
+import math


following "one-model - one-file" philosophy, it is better put inside modular/modeling files

Done — downsampling_granite4_vision.py deleted, all contents inlined into the modular.

zucchini-nlp · 2026-04-27T10:54:26Z

+    "openai-privacy-filter": "OpenAIPrivacyFilterConfig",
    "lasr": "LasrCTCConfig",
    "wav2vec2-with-lm": "Wav2Vec2Config",
+    "granite4-vision": "Granite4VisionConfig",
+    "hy-v3": "HYV3Config",
+    "slanet": "SLANetConfig",
 }


a few bad rebases :)

Done — removed the stale entries introduced by bad rebases.

zucchini-nlp · 2026-04-27T10:55:03Z

+            WeightRenaming(
+                source_patterns=r"(vision_tower\.)vision_model\.",
+                target_patterns=r"\1",
+            ),


I think it is not needed anymore, we added PrefixWeights recently and fixed all llava models

Done — removed the granite4_vision entry from conversion_mapping.py.

zucchini-nlp · 2026-04-27T10:55:11Z

@@ -0,0 +1,253 @@
+# Copyright 2025 IBM. All rights reserved.


zucchini-nlp · 2026-04-27T10:55:51Z

+class Granite4VisionModelTester(VLMModelTester):
+    base_model_class = Granite4VisionModel
+    config_class = Granite4VisionConfig
+    conditional_generation_class = Granite4VisionForConditionalGeneration
+    text_config_class = GraniteConfig
+    vision_config_class = CLIPVisionConfig
+
+    def __init__(self, parent, **kwargs):


we need only this tester, since processing is identical to llava-next. Thanks for using VLMTester 🤩

Done — removed test_image_processing_granite4_vision.py entirely (processing is identical to LlavaNext, no re-definition needed).

zucchini-nlp · 2026-04-27T10:56:20Z

+    "granite4_vision",
    "falcon3",
    "megatron_gpt2",
    "code_llama",
+    "hy_v3",
+    "openai_privacy_filter",
+    "slanet",


also bad rebase

…m auto_docstring and check_repo These entries belong to other upstream PRs and were accidentally included during a previous rebase. Our PR only owns the granite4_vision entries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

The hub checkpoint ships with pre-merged weights; PEFT-style merging doesn't fit the HF API. Regenerated modeling file from modular via converter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

…layer loop Instead of iterating self.language_model.layers from the VLM model's forward, introduce Granite4VisionTextModel(GraniteModel) that owns the layer loop and accepts deepstack_features (dict[layer_idx -> tensor]) and vision_mask. Granite4VisionModel.forward() now calls self.language_model(...) cleanly. Pattern follows Qwen3VL. Regenerated modeling file from modular. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

…ted file The modular converter generates a TextConfig subclass for the text model's sub-layers. Define Granite4VisionTextConfig(GraniteConfig) explicitly in modular so the converter resolves it correctly instead of creating an undefined reference. Regenerated config and modeling files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

… import Inheriting GraniteConfig caused the converter to drop the import in the generated config file. Align with Qwen3VL pattern: TextConfig inherits PreTrainedConfig directly. Also add PreTrainedConfig import to modular. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

…odel The converter respects source order; TextModel must come after PreTrainedModel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

…rom_pretrained When loading with device_map, HF's _move_missing_keys_from_meta_to_device replaces all non-persistent buffers with torch.empty_like() (garbage memory). Add a _init_weights handler for Granite4VisionTextRotaryEmbedding that recomputes inv_freq and original_inv_freq from config, so _initialize_missing_keys restores correct values after the corruption. Also adds Granite4VisionTextRotaryEmbedding as an explicit subclass in the modular file so the isinstance check resolves correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ert to pure functions - Delete downsampling_granite4_vision.py; move WindowQFormerDownsampler, interpolate_downsample, and spatial_offset_downsample into modular - Replace stateless InterpolateDownsampler/SpatialOffsetDownsampler classes with plain functions (items 2 and 4 from reviewer feedback) - Add config.qformer_config (Blip2QFormerConfig) as a proper sub-config field on Granite4VisionConfig following the Blip2Config pattern; remove inline Blip2QFormerConfig construction from WindowQFormerDownsampler.__init__ (item 3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the raw list-of-tuples return from get_image_features with a proper @DataClass ModelOutput subclass (Granite4VisionImageFeaturesOutput), following the Qwen3-VL BaseModelOutputWithDeepstackFeatures pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Next The image processors are identical to LlavaNextImageProcessor and LlavaNextImageProcessorPil; no need to re-define them. Map 'granite4_vision' to the LlavaNext processors in image_processing_auto.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Item 8: move query/image_positions init to _init_weights (embed_std pattern) - Item 9: rename _win/_unwin to _windowed_raster/_unwindowed_raster, replace single-letter vars with descriptive names - Item 10: add deepstack_features field to Granite4VisionModelOutputWithPast and Granite4VisionCausalLMOutputWithPast instead of reusing image_hidden_states - Item 11: use TransformersKwargs instead of FlashAttentionKwargs in Granite4VisionModel.forward; remove unused FlashAttentionKwargs import - Item 12: raise ValueError instead of warning_once for patch shape mismatch; remove now-unused logger - Item 13: drop use_image_newline_parameter (not used in released checkpoint) - Item 14: read pad_token_id from config.text_config instead of top-level config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Item 15: fix copyright to "2026 IBM and The HuggingFace Team" - Item 16: remove bibtex entry from docs - Item 17: remove torch_dtype/device_map from docs examples - Item 18: move Notes to "Usage Tips" section before code examples - Item 19: remove model_type from Granite4VisionProcessor - Item 20: revert AttributeError() (converter incompatible); keep del self. - Item 21: remove granite4_vision from conversion_mapping (PrefixWeights handles it) - Item 22: remove granite4_vision from check_repo DOC_MODEL_NAMES_NOT_IN_AUTO and HARDCODED_CONFIG_FOR_MODELS in auto_docstring (bad rebase entries) - Item 23: update test copyright, remove use_image_newline_parameter from tester, update skip reasons for get_image_features tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Item 20: drop get_image_token_mask override, use parent's get_placeholder_mask - Item 29: delete test_image_processing_granite4_vision.py (identical to LlavaNext) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Pass output_attentions/output_hidden_states explicitly to language_model in Granite4VisionModel.forward (were swallowed as explicit params, not forwarded via **kwargs) - Collect all_hidden_states and all_self_attns in Granite4VisionTextModel layer loop; add output_attentions/output_hidden_states params - Fix qformer_config dict→object conversion to run before super().__post_init__() so _attn_implementation.setter doesn't hit a raw dict during sub_configs iteration - Use Blip2QFormerConfig directly in sub_configs (instead of AutoConfig) so save/load round-trip resolves the type correctly; add missing import to generated configuration_granite4_vision.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…es AutoConfig blip_2_qformer is registered in CONFIG_MAPPING so AutoConfig resolves it correctly. Moving the Blip2QFormerConfig import inside __post_init__ avoids a cross-model top-level import that the modular converter drops from the generated file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ing public classes - IGNORE_NON_TESTED + IGNORE_NON_AUTO_CONFIGURED: Granite4VisionTextModel is an internal subcomponent tested implicitly through Granite4VisionModel - Doc: add autodoc entries for Granite4VisionTextConfig, Granite4VisionTextModel, Granite4VisionImageProcessor, Granite4VisionImageProcessorPil Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Needed for ruff F821 (undefined name) to pass under make style. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…isionModel.forward Aligns with reviewer feedback: these args are not needed in the explicit signature since they flow through kwargs: Unpack[TransformersKwargs]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@strict

- TRF010: add @strict to Granite4VisionTextConfig (direct PreTrainedConfig subclass) - TRF002: set base_model_prefix = "model" on Granite4VisionTextModel (was "") - TRF009: add trf-ignore comment on Blip2QFormerModel lazy import (cross-model import is intentional — QFormer is a shared building block) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Reorder imports in modular to satisfy ruff isort (stdlib → third-party → first-party) - Sync processing_granite4_vision.py to match converter output (BatchFeature from feature_extraction_utils, no model_type on processor) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

downsampling_granite4_vision.py, image_processing_granite4_vision.py, and image_processing_pil_granite4_vision.py are regenerated by the converter but were previously intentionally deleted: image processors delegate to LlavaNext (registered in image_processing_auto.py), and downsampling is inlined in modular/modeling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… this model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tent ImageProcessor autodocs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-28T07:16:02Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, granite4_vision

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-28T07:52:42Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45597&sha=2f2f52

artem-spector force-pushed the add-gv41 branch from 5ae88fd to 5160359 Compare April 23, 2026 07:59

artem-spector force-pushed the add-gv41 branch from 5160359 to da43584 Compare April 23, 2026 09:21

artemspector and others added 8 commits April 23, 2026 14:13

Fix conflict marker in image_processing_auto.py

624d66a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix check-repo: remove spatial_stride (unused in modeling), fix auto_…

3ad5f7f

…mappings duplicate Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix duplicate legacy key in conversion_mapping.py

a1ed13d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Regenerate auto_mappings.py after rebase onto upstream/main

c9c3c3c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

artem-spector force-pushed the add-gv41 branch from da43584 to ede8894 Compare April 23, 2026 11:13

artemspector added 3 commits April 23, 2026 14:17

Revert dependency_versions_table.py to match setup.py (upstream state)

59335e3

Signed-off-by: artemspector <artems@il.ibm.com>

zucchini-nlp reviewed Apr 27, 2026

View reviewed changes

artemspector and others added 10 commits April 27, 2026 14:30

Fix class ordering: define Granite4VisionPreTrainedModel before TextM…

8c4e3cc

…odel The converter respects source order; TextModel must come after PreTrainedModel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: artemspector <artems@il.ibm.com>

artemspector and others added 13 commits April 27, 2026 18:46

Address remaining review items 20 and 29

1e1eefe

- Item 20: drop get_image_token_mask override, use parent's get_placeholder_mask - Item 29: delete test_image_processing_granite4_vision.py (identical to LlavaNext) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix missing imports in modular: math, AutoConfig, select_best_resolution

2ee1d91

Needed for ruff F821 (undefined name) to pass under make style. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove autodoc entries for ImageProcessor classes that don't exist in…

72b0d22

… this model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Allow unused LlavaNext attrs in Granite4VisionConfig; remove non-exis…

e49084d

…tent ImageProcessor autodocs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

artemspector and others added 2 commits April 28, 2026 10:28

Fix ruff formatting in check_config_attributes.py

bfbc523

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix model card date for add_dates.py check

2f2f523

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

evalstate mentioned this pull request Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

Conversation

artem-spector commented Apr 23, 2026

What does this PR do?

Architecture highlights

Files added

Auto-registration

Tests

Before submitting

Related

Uh oh!

artem-spector commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

artem-spector commented Apr 23, 2026

Uh oh!

zucchini-nlp commented Apr 23, 2026

Uh oh!

artem-spector commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Apr 27, 2026

Uh oh!

zucchini-nlp left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 28, 2026

Uh oh!

github-actions Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

artem-spector commented Apr 23, 2026 •

edited

Loading

artem-spector commented Apr 25, 2026 •

edited

Loading

zucchini-nlp left a comment •

edited

Loading