[CTRL] Support attn_implementation="sdpa" dispatch by YangKai0616 · Pull Request #46073 · huggingface/transformers

YangKai0616 · 2026-05-19T09:25:05Z

What does this PR do?

The CTRL model fails during initialization when loaded with from_pretrained(..., attn_implementation="sdpa"). This PR enables standard attn_implementation="sdpa" dispatch for CTRL.

Hi @ArthurZucker, pls help review, thx!

github-actions · 2026-05-29T08:41:26Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: ctrl

YangKai0616 · 2026-05-29T08:42:15Z

Hi @vasqu , please help review this PR, thank you!

vasqu

Overall looks good. I'd like to align a few smaller things here and there, but it shouldn't be too much. Nice work!

vasqu · 2026-06-05T08:33:26Z

+        self.is_causal = True

-        self.depth = int(d_model_size / self.num_heads)
+        self.depth = int(self.d_model_size / self.num_heads)


I think this model is super old; I'd rather align a bit more with modern terminology. I guess this should be head_dim.

vasqu · 2026-06-05T08:36:25Z

        q = self.split_into_heads(q, batch_size)
        k = self.split_into_heads(k, batch_size)
        v = self.split_into_heads(v, batch_size)



Imo we should remove this function and just follow what llama did (with the same reshapes etc)
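For reference, a minimal sketch of what that alignment could look like. The `split_into_heads` body below is an assumption about the older CTRL helper (reshape, then permute), and `split_heads_llama_style` is a hypothetical name for the view-plus-transpose pattern; neither is taken from this PR's diff.

```python
import torch


def split_into_heads(x, batch_size, num_heads, head_dim):
    # Assumed shape of the older helper: reshape, then permute to
    # (batch, heads, seq, head_dim).
    x = x.reshape(batch_size, -1, num_heads, head_dim)
    return x.permute([0, 2, 1, 3])


def split_heads_llama_style(x, num_heads, head_dim):
    # Hypothetical helper: derive the target shape from the input itself,
    # then view + transpose, so no batch_size argument is needed.
    input_shape = x.shape[:-1]
    hidden_shape = (*input_shape, num_heads, head_dim)
    return x.view(hidden_shape).transpose(1, 2)


torch.manual_seed(0)
batch_size, seq_len, num_heads, head_dim = 2, 5, 4, 8
x = torch.randn(batch_size, seq_len, num_heads * head_dim)

old = split_into_heads(x, batch_size, num_heads, head_dim)
new = split_heads_llama_style(x, num_heads, head_dim)
same_layout = torch.equal(old, new)
```

Under these assumptions both paths yield the same (batch, heads, seq, head_dim) tensor, so dropping the helper should not change numerics.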

vasqu · 2026-06-05T08:43:41Z

    ):
        normed = self.layernorm1(x)
-        attn_outputs = self.multi_head_attention(
+        attn_output = self.multi_head_attention(


Suggested change

-        attn_output = self.multi_head_attention(
+        attn_output, _ = self.multi_head_attention(

Would rather do this than the ...[0]

vasqu · 2026-06-05T08:44:27Z

+    _supports_sdpa = True
+    _can_record_outputs = {
+        "hidden_states": EncoderLayer,
+        "attentions": OutputRecorder(MultiHeadAttention, index=1),


Suggested change

-        "attentions": OutputRecorder(MultiHeadAttention, index=1),
+        "attentions": MultiHeadAttention,

Pretty sure we default to index 1 for attention, so there's no need for this, but not 100% sure.

vasqu · 2026-06-05T08:45:13Z

 class CTRLPreTrainedModel(PreTrainedModel):
    config: CTRLConfig
    base_model_prefix = "transformer"
+    _supports_sdpa = True


The attention looks fairly simple, could we also support all the other flags for this - flex, flash, attention backend (needs interface + kwargs passing from top to bottom)?
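To illustrate the "interface + kwargs passing" idea, here is a rough, self-contained sketch. The `ATTENTION_BACKENDS` dict, the two backend functions, and `attend` are hypothetical names made up for this example; they are not the transformers attention interface, only the general shape of a lookup-by-name dispatch that forwards extra kwargs.

```python
import torch
import torch.nn.functional as F


def eager_backend(query, key, value, is_causal=True, **kwargs):
    # Explicit softmax(Q K^T / sqrt(d)) V; also returns the attention weights.
    seq_len = query.shape[-2]
    scores = torch.matmul(query, key.transpose(-1, -2)) * query.shape[-1] ** -0.5
    if is_causal:
        mask = torch.ones(seq_len, seq_len, dtype=torch.bool, device=query.device)
        scores = scores.masked_fill(mask.triu(diagonal=1), float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights


def sdpa_backend(query, key, value, is_causal=True, **kwargs):
    # Fused kernel; attention weights are not materialized, so return None.
    output = F.scaled_dot_product_attention(query, key, value, is_causal=is_causal)
    return output, None


# Hypothetical registry: backend name -> callable with a shared signature.
ATTENTION_BACKENDS = {"eager": eager_backend, "sdpa": sdpa_backend}


def attend(implementation, query, key, value, **kwargs):
    # Look up the backend by name and forward any extra kwargs untouched.
    if implementation not in ATTENTION_BACKENDS:
        raise ValueError(f"Unknown attention implementation: {implementation}")
    return ATTENTION_BACKENDS[implementation](query, key, value, **kwargs)


torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 5, 8) for _ in range(3))
eager_out, eager_weights = attend("eager", q, k, v, is_causal=True)
sdpa_out, sdpa_weights = attend("sdpa", q, k, v, is_causal=True)
```

The point of the shared signature is that the calling module does not need to know which backend is active; it only unpacks `(output, weights)` and tolerates `weights` being `None`.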

vasqu · 2026-06-05T08:45:29Z

    def set_input_embeddings(self, new_embeddings):
        self.w = new_embeddings

+    @capture_outputs


missing merge config with defaults

vasqu · 2026-06-05T08:45:49Z

-def scaled_dot_product_attention(q, k, v, mask, attention_mask=None):
-    # calculate attention
-    matmul_qk = torch.matmul(q, k.permute(0, 1, 3, 2))
+def eager_attention_forward(module, query, key, value, attention_mask, scaling=None, dropout=0.0, **kwargs):


Imo, we could use a "# Copied from" comment here?
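As a sanity check on the new signature, a small sketch of an eager attention function compared against PyTorch's fused SDPA. `eager_attention_sketch` is a hypothetical stand-in written for this comment (it drops the `module` argument and takes `training` explicitly); it is not the function from the diff or from any other model file.

```python
import torch
import torch.nn.functional as F


def eager_attention_sketch(query, key, value, attention_mask=None,
                           scaling=None, dropout=0.0, training=False):
    # Default scaling is 1 / sqrt(head_dim), matching SDPA's default.
    if scaling is None:
        scaling = query.shape[-1] ** -0.5
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        # Additive mask: 0 where attention is allowed, -inf where it is not.
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1)
    attn_weights = F.dropout(attn_weights, p=dropout, training=training)
    attn_output = torch.matmul(attn_weights, value)
    return attn_output, attn_weights


torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 5, 8) for _ in range(3))

# Causal additive mask: -inf strictly above the diagonal, 0 elsewhere.
causal_mask = torch.full((5, 5), float("-inf")).triu(diagonal=1)

eager_out, eager_weights = eager_attention_sketch(q, k, v, attention_mask=causal_mask)
sdpa_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
max_diff = (eager_out - sdpa_out).abs().max().item()
```

With dropout disabled, the two paths should agree up to floating-point tolerance, which is the property an eager/SDPA equivalence test would rely on.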

vasqu · 2026-06-05T08:48:06Z

@@ -289,26 +285,6 @@
            position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
            position_ids = position_ids.unsqueeze(0)


Not on you, but imo we can refactor this as well. See e.g. llama, where we create the input embeds earlier and get the shape from there; most of this can then be reduced by quite a bit.
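Roughly what that refactor could look like, as a sketch only: `default_position_ids` is a hypothetical helper, and the embedding sizes are arbitrary. The idea is to embed first and read the sequence length and device from `inputs_embeds` rather than from `input_shape` bookkeeping.

```python
import torch
from torch import nn


def default_position_ids(inputs_embeds, past_length=0):
    # Sequence length and device both come from the embeddings, so the same
    # code path works whether the caller passed input_ids or inputs_embeds.
    seq_len = inputs_embeds.shape[1]
    position_ids = torch.arange(
        past_length, past_length + seq_len,
        dtype=torch.long, device=inputs_embeds.device,
    )
    return position_ids.unsqueeze(0)


embed = nn.Embedding(100, 16)  # arbitrary vocabulary and hidden sizes
input_ids = torch.tensor([[5, 6, 7], [8, 9, 10]])
inputs_embeds = embed(input_ids)  # created early, before any shape logic

# Four cached tokens, three new ones: positions continue from 4.
position_ids = default_position_ids(inputs_embeds, past_length=4)
```

The resulting tensor has shape (1, seq_len) and broadcasts over the batch dimension.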

YangKai0616 added 2 commits May 19, 2026 08:52

[CTRL] Support attn_implementation dispatch

a2a6264

Fix sdpa+attention_mask

80edd73

YangKai0616 changed the title ~~[CTRL] Support attn_implementation dispatch~~ [CTRL] Support attn_implementation="sdpa" dispatch May 20, 2026

YangKai0616 added 5 commits May 20, 2026 06:53

Doc

432dcd6

Merge branch 'main' into sdpa-ctrl

9230721

Refactor

ec1b5fc

Refactor

e33eaec

Refactor

2758c61

vasqu reviewed Jun 5, 2026

YangKai0616 wants to merge 7 commits into huggingface:main from YangKai0616:sdpa-ctrl


2 participants
