batched_mm is slow on cpu#43438

Merged
vasqu merged 9 commits into main from grouped-mm-cpu
Jan 27, 2026

Conversation

@IlyasMoutawwakil
Member

What does this PR do?

Fixes # (issue)
cpu

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


Let's update the doc to add justification!

Collaborator

@ArthurZucker ArthurZucker left a comment


TY! Let's make this explicit in comments or in the doc with a justification, maybe?

Contributor

@vasqu vasqu left a comment


Similar problem as we had with test_torch_compile_for_training - can you take a look at test_generate_compile_model_forward_fullgraph?

Forcing batched_mm or changing the dtype (although we do compare outputs, so not sure if it would introduce flakiness) should solve it.

Member

@stevhliu stevhliu left a comment


thanks for clarifying!

| `"grouped_mm"` | Orders tokens by selected experts and uses `torch._grouped_mm` to project all tokens in a single grouped GEMM (requires PyTorch 2.9+). |

`batched_mm` is fastest for very small inputs and compilation speeds it up further. `grouped_mm` performs best for larger inputs.
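The routing idea the table row describes can be sketched without the private PyTorch kernel. Below is a minimal NumPy emulation (function names are illustrative, not the library's API): the "batched" path runs every expert over every token and then selects, while the "grouped" path sorts tokens by expert so each expert does one GEMM over a contiguous slice, which is what `torch._grouped_mm` fuses into a single call.

```python
import numpy as np

def batched_mm_experts(x, expert_ids, weights):
    # Naive batched path: project every token through every expert,
    # then keep only each token's selected expert.
    # x: (T, d_in), weights: (E, d_in, d_out), expert_ids: (T,)
    all_out = np.einsum("td,edo->teo", x, weights)   # (T, E, d_out)
    return all_out[np.arange(len(x)), expert_ids]    # (T, d_out)

def grouped_mm_experts(x, expert_ids, weights):
    # Grouped path: sort tokens by expert so each expert's tokens are
    # contiguous, then do one GEMM per expert on its slice.
    order = np.argsort(expert_ids, kind="stable")
    x_sorted = x[order]
    counts = np.bincount(expert_ids, minlength=len(weights))
    out_sorted = np.empty((len(x), weights.shape[2]))
    start = 0
    for e, n in enumerate(counts):
        if n:
            out_sorted[start:start + n] = x_sorted[start:start + n] @ weights[e]
        start += n
    # Scatter results back to the original token order.
    out = np.empty_like(out_sorted)
    out[order] = out_sorted
    return out
```

Both paths produce identical outputs; they differ only in how much redundant work and memory traffic they incur, which is why the faster choice depends on input size and device.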
Member


i think it'd be cleaner to add two separate columns to the table for GPU and CPU, and then you can add the relevant comments for each implementation. makes it easier to quickly scan as well!

Member Author


aah makes sense! hope it won't get crowded when rendered

IlyasMoutawwakil and others added 4 commits January 26, 2026 09:23
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@IlyasMoutawwakil
Member Author

thanks @stevhliu, I updated the table and left one note about the decode-stage optimization on gpu.

@vasqu I switched to bf16 on cpu+grouped_mm+compile. imo it's better to test grouped_mm on cpu here because it's what a user will get by default; switching to batched_mm would pass the tests but wouldn't catch errors in the default cpu path. wdyt?

Contributor

@vasqu vasqu left a comment


Yes, ok, let's move to bf16, but we gotta keep an eye out for whether it does indeed produce flakiness / failing tests.
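The flakiness worry comes from bf16's 8-bit mantissa. A rough sketch of the precision loss (emulating bf16 in NumPy by truncating float32 mantissa bits; real bf16 casts round to nearest-even, so this slightly overstates the error, and it is not what the transformers tests actually do):

```python
import numpy as np

def to_bfloat16(x):
    # Emulate bfloat16 by zeroing the low 16 bits of float32, keeping
    # sign (1) + exponent (8) + mantissa (7). This truncates toward zero.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = a @ b
low = to_bfloat16(a) @ to_bfloat16(b)

# bf16 keeps only ~3 decimal digits, so a relative error on the order
# of 1e-3 over the whole output is expected; comparisons against an
# fp32 reference need correspondingly loose tolerances.
rel_err = np.linalg.norm(ref - low) / np.linalg.norm(ref)
print(f"relative Frobenius error: {rel_err:.3e}")
```

This is why comparing compiled-vs-eager outputs in bf16 needs looser `atol`/`rtol` than fp32, and why borderline tolerances can turn into intermittent test failures.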

Member

@stevhliu stevhliu left a comment


one last nit, otherwise lgtm! 😄

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43438&sha=1d01c7

@vasqu vasqu merged commit a99a913 into main Jan 27, 2026
21 of 26 checks passed
@vasqu vasqu deleted the grouped-mm-cpu branch January 27, 2026 13:32