
Allow bi-directional attention for all models #43705

Merged

Cyrilvallez merged 9 commits into main from power-mask on Feb 4, 2026

Conversation

@Cyrilvallez (Member) commented Feb 3, 2026

What does this PR do?

Allow the is_causal kwarg and config attribute to make well-behaved decoder-only models act as encoders
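
For readers landing here, a minimal usage sketch of what the PR enables. The checkpoint is just an example, and the `is_causal` kwarg/config attribute is exactly what this PR adds, so details may differ in your installed version:

```python
# Minimal sketch, assuming this PR's `is_causal` kwarg / config attribute is available.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModel.from_pretrained("Qwen/Qwen3-0.6B")

inputs = tokenizer("A sentence to embed with bidirectional attention", return_tensors="pt")

with torch.no_grad():
    causal_out = model(**inputs).last_hidden_state                  # default causal mask
    bidir_out = model(**inputs, is_causal=False).last_hidden_state  # per-call override

# Alternatively, make the override persistent through the config attribute
model.config.is_causal = False

# Earlier tokens can now attend to later ones, so the hidden states differ
print(torch.allclose(causal_out, bidir_out))  # expected: False
```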


```python
return_dict = getattr(self.config, "return_dict", True)

# Maybe temporarily overwrite config value to create the correct mask - kwarg takes precedence
is_causal = kwargs.get("is_causal", True) and getattr(self.config, "is_causal", True)
```
Contributor
If we default to True, it will break all encoder and encoder-decoder models.

Imo, we should properly add an is_causal flag to all models where it's obvious, and default to None, i.e. don't do anything.
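
A rough sketch of the tri-state resolution this proposal implies (a hypothetical helper, not code from the PR):

```python
from typing import Optional

def resolve_is_causal(kwarg_value: Optional[bool], config_value: Optional[bool], model_default: bool) -> bool:
    """Hypothetical sketch of a None-default: an explicit kwarg wins, then an explicit
    config value, and None means 'don't do anything', i.e. keep the model's native masking."""
    if kwarg_value is not None:
        return kwarg_value
    if config_value is not None:
        return config_value
    return model_default  # causal for decoders, bidirectional for encoders

# Example: a decoder keeps its causal mask unless someone explicitly opts out
assert resolve_is_causal(None, None, model_default=True) is True
assert resolve_is_causal(False, None, model_default=True) is False
```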

Member Author

We can default to None to be sure, but I was under the impression that we would only add this to a few models. Then, since we allow the kwarg, it's true that it becomes usable for all of them.

Contributor

Imo, it's just a bit brittle and we don't properly document it --> easy for users to encounter weird behavior

Member Author

Made the change! But in general encoder-decoder models will explicitly use the create_bi_directional_mask functions, so behavior will never be causal (though we could do the opposite in the mask functions as well, i.e. turn bi-directional -> causal as well).

Contributor

If we make one wrong move and pass is_causal as a kwarg, we already get that mess 👀 but yes, the mask will definitely not be affected in any case.

So the edge case is a user using BERT as a causal model (for whatever reason) and passing a custom mask along with the kwarg.

Member

> Imo, we should properly add an is_causal flag to all models where it's obvious

This might help vLLM as well; it makes it easier to recognize pure decoders vs encoders.

Contributor

Yes, I will try to open a PR about this. For example, it could allow users to change just the causality of the text model portion.
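
As a purely hypothetical sketch of what such a follow-up could look like (the sub-config attribute below is assumed, not something this PR adds):

```python
# Hypothetical: flip only the text tower of a vision-language model to bidirectional.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")
config.text_config.is_causal = False  # assumed attribute from the discussed follow-up PR

model = AutoModel.from_pretrained("llava-hf/llava-1.5-7b-hf", config=config)
```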

Comment on lines 926 to 934
```python
# Maybe temporarily overwrite config value to create the correct mask - kwarg takes precedence
is_causal = kwargs.get("is_causal", True) and getattr(self.config, "is_causal", True)
if not is_causal:
    is_causal_in_config = hasattr(self.config, "is_causal")
    if is_causal_in_config:
        is_causal_original_value = self.config.is_causal
    # Set it to both config and kwargs (it's needed in both, and can come from only 1 of the sources)
    self.config.is_causal = False
    kwargs["is_causal"] = False
```
Member

If I understand correctly, then the final is_causal is:

|  | kwargs["is_causal"] = True | kwargs["is_causal"] = False | kwargs["is_causal"] not defined |
|---|---|---|---|
| config.is_causal = True | True | False | True |
| config.is_causal = False | **False** | False | False |
| config.is_causal not defined | True | False | True |

The bold False here is an outlier: if the architecture is bidirectional in nature (i.e. config.is_causal=False), then the user can't override that to causal. Personally, I'm exclusively interested in the decoder -> encoder case, so this is not a problem for my use cases, but perhaps we want to allow the kwargs to always have priority?

Then we'd instead use something like this

Suggested change:

```diff
-# Maybe temporarily overwrite config value to create the correct mask - kwarg takes precedence
-is_causal = kwargs.get("is_causal", True) and getattr(self.config, "is_causal", True)
-if not is_causal:
-    is_causal_in_config = hasattr(self.config, "is_causal")
-    if is_causal_in_config:
-        is_causal_original_value = self.config.is_causal
-    # Set it to both config and kwargs (it's needed in both, and can come from only 1 of the sources)
-    self.config.is_causal = False
-    kwargs["is_causal"] = False
+# Maybe temporarily overwrite config value to create the correct mask - kwarg takes precedence
+is_causal = kwargs.get("is_causal", getattr(self.config, "is_causal", True))
+is_causal_in_config = hasattr(self.config, "is_causal")
+if is_causal_in_config:
+    is_causal_original_value = self.config.is_causal
+# Set it to both config and kwargs (it's needed in both, and can come from only 1 of the sources)
+self.config.is_causal = is_causal
+kwargs["is_causal"] = is_causal
```

And then also drop the `if not is_causal` a little later on:

```python
if is_causal_in_config:
    self.config.is_causal = is_causal_original_value
else:
    del self.config.is_causal
```
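
For readers skimming the thread, the suggested logic restated as a self-contained helper (a sketch only, not the PR's actual implementation):

```python
from contextlib import contextmanager

@contextmanager
def temporary_is_causal(config, kwargs):
    """Sketch of the suggested flow: the kwarg takes precedence over the config attribute,
    the resolved value is pushed to both sources, and the config is restored afterwards
    so a per-call override does not leak into later calls."""
    is_causal = kwargs.get("is_causal", getattr(config, "is_causal", True))
    is_causal_in_config = hasattr(config, "is_causal")
    if is_causal_in_config:
        is_causal_original_value = config.is_causal
    config.is_causal = is_causal
    kwargs["is_causal"] = is_causal
    try:
        yield is_causal
    finally:
        if is_causal_in_config:
            config.is_causal = is_causal_original_value
        else:
            del config.is_causal
```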

Member Author

This just got changed haha, sorry bad timing here 😬

Member

Edit: Looks like this is a bit dated now, and I see now that you were planning on only having this act on a few models.

Member

After another look at the new changes, it looks like you're now doing pretty much what I proposed re. also being able to go from encoder -> decoder.

@vasqu (Contributor) left a comment

Can we maybe add a fast test to llama to check that we get different logits? Otherwise lgtm, @tomaarsen does it work for you?
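
Something along these lines, presumably (a sketch of such a fast test, not the one that ended up in the PR; it assumes the `is_causal` forward kwarg added here):

```python
import torch
from transformers import LlamaConfig, LlamaModel

def test_bidirectional_attention_changes_hidden_states():
    # Tiny randomly-initialized Llama so the test stays fast
    config = LlamaConfig(
        vocab_size=64,
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        num_key_value_heads=4,
    )
    model = LlamaModel(config).eval()
    input_ids = torch.randint(0, config.vocab_size, (1, 8))

    with torch.no_grad():
        causal = model(input_ids).last_hidden_state
        bidirectional = model(input_ids, is_causal=False).last_hidden_state

    # Letting earlier positions attend to later ones must change the outputs
    assert not torch.allclose(causal, bidirectional)
```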

@tomaarsen (Member) left a comment

> does it work for you?

Yes, I've run some tests with voyage-4-nano, and apart from some tiny numerical differences caused by having an all-True attention_mask (Voyage's custom code) vs a None attention_mask (this PR), the performance is identical. It should allow this model, as well as many others that are simply e.g. "Qwen3 but bidirectional", to work by setting is_causal in the config.

  • Tom Aarsen

@Cyrilvallez merged commit 83bce8d into main on Feb 4, 2026
23 of 26 checks passed
@Cyrilvallez deleted the power-mask branch on February 4, 2026 at 17:24
