🚨 [generate] update paligemma mask updates (and other assisted generation-related fixes) #40917
Conversation
```python
@parameterized.expand([("random",), ("same",)])
@pytest.mark.generate
@unittest.skip("Paligemma2 does not seem to be compatible with assisted decoding")
def test_assisted_decoding_matches_greedy_search(self, assistant_type):
```
removing one skip at a time 🫡
great work!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
looks good to me, thanks a lot! left some minor comments
Great, happy to see all Gemmas updated with the new API. Left a few questions 👇🏻
Before we merge, could you also check if the correct dtype is used for mask creation with the general API? The reproducer is linked to this PR (#40912) and was fixed just yesterday. I believe we should use the text config's dtype, just for safety :)
run-slow: gemma, gemma3, gemma3n, paligemma, paligemma2
Most PR comments addressed :) The only open one is this one (@zucchini-nlp)
This comment contains run-slow, running the specified jobs: models: ['models/gemma', 'models/gemma3', 'models/gemma3n', 'models/paligemma', 'models/paligemma2']
The dtype fix in an upstream issue has been preserved:

```python
from transformers import ColPaliForRetrieval, ColPaliProcessor
import torch
import numpy as np
from PIL import Image

device = "cuda"

model = ColPaliForRetrieval.from_pretrained(
    "vidore/colpali-v1.3-hf",
    dtype=torch.float16,  # can also be bfloat16
).to(device)
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3-hf")

image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]

with torch.no_grad():
    image_inputs = processor(images=image_inputs)
    image_inputs = image_inputs.to(model.device, model.dtype)
    image_outputs = model(**image_inputs)
    image_embeddings_torch = image_outputs.embeddings

print(image_embeddings_torch.dtype)
# torch.float16
```
Thanks, replied under the comment. Agreed with your suggestion :)
run-slow: gemma, gemma3, gemma3n, helium, paligemma, paligemma2
This comment contains run-slow, running the specified jobs: models: ['models/gemma', 'models/gemma3', 'models/gemma3n', 'models/helium', 'models/paligemma', 'models/paligemma2']
run-slow: gemma, gemma3, gemma3n, helium, paligemma, paligemma2
This comment contains run-slow, running the specified jobs: models: ['models/gemma', 'models/gemma3', 'models/gemma3n', 'models/helium', 'models/paligemma', 'models/paligemma2']
force-pushed from fe16966 to 0ee61ee
("rocm", (9, 5)): "detect shoe\n<loc0051><loc0309><loc0708><loc0644> shoe", | ||
(None, None): "detect shoe\n<loc0051><loc0309><loc0708><loc0646> shoe", | ||
("cuda", 8): "detect shoe\n<loc0045><loc0309><loc0708><loc0646> shoe", | ||
("cuda", 8): "detect shoe\n<loc0051><loc0309><loc0708><loc0646> shoe", |
this test was failing on `main` as well, with the same output
@molbap @zucchini-nlp requesting a full re-review :) the final steps required a few chains of changes (change base model -> modular kicks in -> more models need changes).
Huuge work, thanks for getting to the root of the problem. I can feel the pain of refactoring these models' attentions.
Overall I agree with the changes. In unrelated models, I think we can and should delete attributes in modular, so we don't keep unused config attributes.
```python
super().__init__()
self.config = config
self.layer_idx = layer_idx
self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
self.scaling = self.head_dim**-0.5
self.attention_dropout = config.attention_dropout
self.is_causal = not getattr(config, "use_bidirectional_attention", False)

self.q_proj = nn.Linear(
    config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
)
self.k_proj = nn.Linear(
    config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
)
self.v_proj = nn.Linear(
    config.num_key_value_heads * self.head_dim if False else config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
)
self.o_proj = nn.Linear(
    config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
)
```
I think we can indicate only `self.is_causal`; the other attributes will be copied by modular. Same for the decoder layer.
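A minimal sketch of what that could look like in the modular file, assuming the attention module inherits from `GemmaAttention` (the class name is illustrative; the modular converter expands the inherited attributes automatically):

```python
# Illustrative modular-file sketch: only the attribute that differs is spelled out,
# everything else (projections, scaling, dropout, ...) is inherited from the base class.
from transformers.models.gemma.modeling_gemma import GemmaAttention


class MyTextAttention(GemmaAttention):  # hypothetical class name
    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)
        # the only line that needs to live in the modular file
        self.is_causal = not getattr(config, "use_bidirectional_attention", False)
```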
```python
# NOTE: this `may_have_image_input` logic is not flawless, it fails when we're using a cache eagerly initialized
# (e.g. compiled prefill) AND `pixel_values` are not provided (i.e. the image data is provided through other
# means). Determining prefill in that case requires checking data values, which is not compile-compatible.
may_have_image_input = past_key_values is None or not past_key_values.is_initialized or pixel_values is not None
```
I think we should pass `token_type_ids` in all cases when it is present, since the logic isn't perfect. The only issue with passing it in all cases that I can think of is that `token_type_ids` does not grow together with the attention mask.
So we can do: if `token_type_ids.shape == attention_mask.shape`, apply the optional bidirectional mask.
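A self-contained toy sketch of that guard (the mask helper below is made up for illustration; it is not the actual modeling code):

```python
import torch


def bidirectional_over_image_tokens(causal_mask: torch.Tensor, token_type_ids: torch.Tensor) -> torch.Tensor:
    """Toy helper: let image tokens (token_type_ids == 1) attend to each other bidirectionally."""
    image = token_type_ids.bool()
    # allow attention between any pair of image tokens, on top of the causal mask
    return causal_mask | (image[:, None, :, None] & image[:, None, None, :])


batch, seq_len = 1, 6
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
token_type_ids = torch.tensor([[1, 1, 1, 0, 0, 0]])  # 3 image tokens followed by text

causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))[None, None]
# only true at prefill: during decoding the attention mask keeps growing while token_type_ids does not
if token_type_ids.shape == attention_mask.shape:
    mask = bidirectional_over_image_tokens(causal, token_type_ids)
else:
    mask = causal
```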
```python
self.mlp = HeliumMLP(config)
self.input_layernorm = HeliumRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = HeliumRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.attention_type = config.layer_types[layer_idx]
```
I think we should avoid using attention types with Helium. The changes in modular are done for layers using Llama as the base model, so we shouldn't need them, no?
```python
# The logic below was originally written for gemma3, where `token_type_ids` is reversed. Let's reverse it to
# then use exactly the same logic.
token_type_ids = 1 - token_type_ids
```
I saw that we need to break BC to get this working; maybe we can do as follows to keep the model as on the main branch:

```python
if is_training:
    token_type_ids = 1 - token_type_ids
else:
    token_type_ids = torch.ones_like(attention_mask)
```
```
    scaling factor when applying tanh softcapping on the logits.
attn_logit_softcapping (`float`, *optional*, defaults to 50.0):
    scaling factor when applying tanh softcapping on the attention scores.
use_bidirectional_attention (`bool`, *optional*):
```
What if we `del use_bidirectional_attention` in the modular file's config?
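A rough sketch of that pattern, assuming the config inherits from `Gemma2Config` in the modular file (the class name is a placeholder, and the attribute is assumed to exist on the parent config, as added in this PR); the generated configuration file would then simply not contain the attribute:

```python
from transformers import Gemma2Config


# Illustrative modular-config sketch: remove an inherited attribute so it never
# ends up in the auto-generated configuration file.
class MyTextConfig(Gemma2Config):  # hypothetical class name
    def __init__(self, **super_kwargs):
        super().__init__(**super_kwargs)
        # assumes the parent config sets this attribute; otherwise `del` would raise
        del self.use_bidirectional_attention
```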
```
    scaling factor when applying tanh softcapping on the logits.
attn_logit_softcapping (`float`, *optional*, defaults to 50.0):
    scaling factor when applying tanh softcapping on the attention scores.
use_bidirectional_attention (`bool`, *optional*):
```
Same here, I'd prefer to explicitly `del` it from modular.
```
use_bidirectional_attention (`bool`, *optional*):
    If True, the model will attend to all text tokens instead of using a causal mask.
```
wow, this makes so much sense. I wonder how gemma3 worked prev, afair we didn't have a flag for defining bidirectional attention at release time
I actually took it from gemma3 🤗 Most of the changes here are gemma3-inspired
looks like it was added recently. Previously it used `is_causal = True` 🙈
@zucchini-nlp comments addressed and CI green 💚 Summary of the changes:
```python
elif may_have_image_input:
    logger.warning_once(
        "There may be an image in the input to Gemma3 but `token_type_ids` is not provided. We recommend "
        "passing `token_type_ids` to the model to prevent bad attention masking."
    )
```
I think this message will fire a false warning when the model is used with text-only input. Can we keep the logic but without the warning?
I'm going to delete it then.
(I'm not super happy about it -- in general, if a model needs some input to correctly do some operation, and we can't safely detect whether we need that operation, then the input should be required. Otherwise, it's prone to silent bugs, which are the worst kind of bugs 😢 )
Agreed, we need to keep the token types growing in the correct way so that they are always used, without us checking for prefill etc.
Thanks, approving so it can be merged when CI is green :)
[For maintainers] Suggested jobs to run (before merge): run-slow: bark, chameleon, colpali, colqwen2, gemma, gemma2, gemma3, gemma3n, helium, idefics2, paligemma, t5gemma, vaultgemma
…tion-related fixes) (huggingface#40917) * tmp * fix modular inheritance * nit * paligemma 1 doesn't have swa * use same pattern as in models with hybrid layers * PR comments * helium also needs layer_typed (bc it relies on gemma) * paligemma/gemma3: same mask creation fn in fwd and generate * propagate changes to helium (gemma-based) * tmp commit * slow paligemma tests passing, let's see what breaks * fix test_left_padding_compatibility * tmp commit * tmp commit * rebase error * docs * reduce diff * like this? * t5gemma * better comment * shorter diff * exception * ffs type * optional * shorter modular_gemma.py * helium model actually needs no changes -- the tester is the issue * t5gemma modular config * a few more modular; paligemma BC * fix processor issues? * rm config exception * lift warning in gemma
`token_type_ids` is required as a model input when training (huggingface/trl#4142)
Thanks for the PR! A pity no core maintainers were pinged as adding a flag that is not part of the original model release is rarely something we are gonna agree on 😓
```python
self.final_logit_softcapping = final_logit_softcapping
self.attn_logit_softcapping = attn_logit_softcapping
self.layer_types = layer_types
self.use_bidirectional_attention = use_bidirectional_attention
```
this should not have been added, it goes against our philosophy 😢
```python
# BC: `use_bidirectional_attention` was originally unset in PaliGemma1 (backbone = Gemma1) AND PaliGemma2
# (backbone = Gemma2). Both PaliGemmas want to default to True.
if self.text_config.use_bidirectional_attention is None:
    self.text_config.use_bidirectional_attention = True
```
I don't think that's what we want TBH
What does this PR do?

🚨 BC-breaking: the `paligemma` processor now returns `token_type_ids` by default. This is required to disambiguate forward passes, due to the bidirectional attention mask in the prompt. Advanced generation methods may run forward passes with prompt + generated tokens, so they will fail without `token_type_ids`.

This PR is originally aimed at fixing two flaky tests:
- `imageGPT` + test_prompt_lookup_decoding_matches_greedy_search -> skip the test, imageGPT has dodgy layer initialization. This is better documented in the skip;
- `paligemma2` + test_prompt_lookup_decoding_matches_greedy_search -> upstreams attention mask creation from gemma3 to paligemma, since their masking strategy is the same. This also improves standardization, as we got rid of some legacy code 💛 Fixing this actually required a cascade of changes (changes in `gemma` for `paligemma` -> `gemma`-dependent models also needed updates)

✅ slow paligemma tests passing
✅ slow paligemma2 tests passing (but there are no integration tests ⚠️)
✅ no regressions on slow gemma tests (i.e. some failures, same as in `main`)
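A minimal usage sketch of the BC-breaking processor change described above (checkpoint id, prompt, and placeholder image are illustrative; the point is that the processor output now includes `token_type_ids`, which should be forwarded to the model):

```python
# Sketch only, assuming a PaliGemma(2) checkpoint is available.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, dtype=torch.bfloat16)

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(images=image, text="detect shoe", return_tensors="pt")

# After this PR, `token_type_ids` is returned by default and marks image vs. text tokens,
# so the bidirectional prefix mask can be rebuilt on any forward pass (including the ones
# run by advanced decoding methods over prompt + generated tokens).
print("token_type_ids" in inputs)  # True

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated, skip_special_tokens=True))
```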