[RecurrentGemma] Support attn_implementation dispatch by YangKai0616 · Pull Request #46320 · huggingface/transformers

YangKai0616 · 2026-06-01T10:19:44Z

What does this PR do?

As per the title.

github-actions · 2026-06-01T10:20:54Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: recurrent_gemma

YangKai0616 · 2026-06-01T10:24:12Z

+        return hidden_states
+
+
+class RecurrentGemmaRecurrentDecoderLayer(GradientCheckpointingLayer):


Refer to Jamba.

I would not follow Jamba in this case. Could we fuse them into one class? The only difference is the temporal block, right? I.e. the class that is used.

YangKai0616 · 2026-06-01T10:32:52Z

On my local A100, the `output_text` values from the run-slow tests on this PR are identical to those on upstream/main. Note that upstream/main itself currently has 4 failing tests and 1 passing test; the IntegrationTest issues will not be addressed in this PR for now.

YangKai0616 · 2026-06-04T08:38:52Z

@vasqu

vasqu

Just one bigger comment, but other than that only rather small changes 🤗 I'm not sure how long we will keep maintaining this model, so I will focus a bit less on this one.

vasqu · 2026-06-05T08:54:36Z

@@ -178,7 +179,32 @@ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


-class RecurrentGemmaSdpaAttention(nn.Module):
+def eager_attention_forward(


probably can be copied from 🤔

vasqu · 2026-06-05T08:54:55Z

-        self.partial_rotary_factor = config.partial_rotary_factor
+        self.scaling = self.head_dim**-0.5
+        self.is_causal = True
+        self.rotary_ndims = int(self.head_dim * config.partial_rotary_factor)


Suggested change

-        self.rotary_ndims = int(self.head_dim * config.partial_rotary_factor)
+        self.rotary_dim = int(self.head_dim * config.partial_rotary_factor)

nit

vasqu · 2026-06-05T08:56:02Z

+        key_rot, key_pass = (
+            key_states[..., : self.rotary_ndims],
+            key_states[..., self.rotary_ndims :],
+        )
        query_rot, key_rot = apply_rotary_pos_emb(query_rot, key_rot, cos, sin)


Hmm, we have a few partial RoPE implementations already; could you check whether this could be refactored to use existing functions as well? Not high priority, but it would be nice.
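For readers unfamiliar with partial rotary embeddings, here is a minimal pure-Python sketch of the split that the hunk above performs: only the first `rotary_dim` features of each head are rotated, and the remaining features pass through unchanged. The helper `apply_partial_rotary`, the `base` value and the toy vector are assumptions made for illustration; this is not the model's `apply_rotary_pos_emb`, which works on batched tensors with precomputed `cos` and `sin`.

```python
import math


def rotate_half(x):
    """Map [x1, x2] to [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_partial_rotary(vec, position, rotary_dim, base=10000.0):
    """Rotate the first `rotary_dim` features of `vec`; pass the rest through.

    Hypothetical helper for illustration only (single vector, single position).
    """
    rot, passthrough = vec[:rotary_dim], vec[rotary_dim:]
    half = rotary_dim // 2
    # One frequency per feature pair, duplicated so it covers both halves.
    inv_freq = [base ** (-2 * i / rotary_dim) for i in range(half)]
    angles = [position * f for f in inv_freq] * 2
    rot_half = rotate_half(rot)
    rot = [r * math.cos(a) + rh * math.sin(a) for r, rh, a in zip(rot, rot_half, angles)]
    return rot + passthrough


head_dim, partial_rotary_factor = 8, 0.5  # toy values, not the model's config
rotary_dim = int(head_dim * partial_rotary_factor)
query = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_partial_rotary(query, position=3, rotary_dim=rotary_dim)
```

The rotation changes the direction of the first `rotary_dim` features but not their norm, while the trailing features are returned untouched, which is the property a shared helper would need to preserve.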

vasqu · 2026-06-05T08:57:43Z

@@ -452,9 +487,6 @@ def _setup_cache(self, batch, device, dtype):
        self.conv1d_state = torch.zeros((batch, self.hidden_size, self.conv1d_width - 1), device=device, dtype=dtype)


-TEMPORAL_BLOCK_CLASSES = {"recurrent": RecurrentGemmaRecurrentBlock, "attention": RecurrentGemmaSdpaAttention}


This is too breaking imo and not worth the effort, let's keep this

vasqu · 2026-06-05T08:58:30Z

+        return hidden_states
+
+
+class RecurrentGemmaRecurrentDecoderLayer(GradientCheckpointingLayer):


I would not follow Jamba in this case. Could we fuse them into one class? The only difference is the temporal block, right? I.e. the class that is used.
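One way to read this suggestion, sketched under assumptions: a single decoder layer class that looks up its temporal block in a mapping such as `TEMPORAL_BLOCK_CLASSES`, so the recurrent and attention variants differ only in the class they instantiate. The block classes below are toy stand-ins operating on plain lists, and `DecoderLayer` is a hypothetical name; the real layer also contains norms and an MLP that are omitted here.

```python
class RecurrentBlock:
    """Toy stand-in for the recurrent temporal block (not the real module)."""

    def __call__(self, hidden_states):
        return [h + 1 for h in hidden_states]


class AttentionBlock:
    """Toy stand-in for the attention temporal block (not the real module)."""

    def __call__(self, hidden_states):
        return [h * 2 for h in hidden_states]


# Same idea as the mapping in the diff: block type name -> class to instantiate.
TEMPORAL_BLOCK_CLASSES = {"recurrent": RecurrentBlock, "attention": AttentionBlock}


class DecoderLayer:
    """Hypothetical fused layer: only the temporal block depends on the type."""

    def __init__(self, block_type):
        self.temporal_block = TEMPORAL_BLOCK_CLASSES[block_type]()

    def __call__(self, hidden_states):
        residual = hidden_states
        mixed = self.temporal_block(hidden_states)
        # Residual connection shared by both variants.
        return [r + m for r, m in zip(residual, mixed)]


block_types = ["recurrent", "recurrent", "attention"]  # assumed layer pattern
layers = [DecoderLayer(block_type) for block_type in block_types]
```

With this shape, adding or removing a block type is a change to the mapping rather than a new layer class, which is also why keeping the mapping avoids a breaking change.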

vasqu · 2026-06-05T08:59:00Z

-    _supports_flash_attn = False
-    _supports_sdpa = False  # we can't compare with eager for now
+    _supports_flash_attn = True
+    _supports_sdpa = True


flex, attn backend?
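As background for the question about additional backends, here is a toy sketch of what dispatching on an `attn_implementation` string amounts to: a registry maps the name to a function with a shared signature, and an unknown name is rejected. `ATTENTION_FUNCTIONS`, `attend` and `fused_attention` are hypothetical names for illustration, and the pure-Python math stands in for real kernels; this does not reproduce the library's actual registry or support flags.

```python
import math


def eager_attention(query, keys, values, scaling):
    """Plain softmax attention for a single query vector."""
    scores = [scaling * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def fused_attention(query, keys, values, scaling):
    """Placeholder for an optimized kernel; here it simply reuses the eager math."""
    return eager_attention(query, keys, values, scaling)


# Hypothetical registry: implementation name -> attention function.
ATTENTION_FUNCTIONS = {"eager": eager_attention, "sdpa": fused_attention}


def attend(attn_implementation, query, keys, values):
    """Look up the requested implementation and call it with a common signature."""
    try:
        attention_fn = ATTENTION_FUNCTIONS[attn_implementation]
    except KeyError:
        raise ValueError(f"Unsupported attn_implementation: {attn_implementation!r}") from None
    scaling = len(query) ** -0.5  # same form as head_dim**-0.5 in the diff
    return attention_fn(query, keys, values, scaling)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
output = attend("sdpa", query, keys, values)
```

In this picture, supporting another backend means registering one more function under a new name, provided it accepts the same arguments and produces the same result as the eager path.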

YangKai0616 added 2 commits June 1, 2026 10:12

[RecurrentGemma] Support attn_implementation dispatch

346b5af

Doc

a2c00f8

Code quality

524dd55

YangKai0616 wants to merge 3 commits into huggingface:main from YangKai0616:sdpa-RecurrentGemmaForCausalLM
