
[Performance] FP8 Grouped and Batched Matmuls #44231

Merged

ArthurZucker merged 27 commits into main from fp8-grouped-mm on Mar 11, 2026

Conversation

@IlyasMoutawwakil (Member) commented Feb 23, 2026

What does this PR do?

Up to 30x faster than the current FP8 experts; the kernels are also tailored for full torch.compile and CUDA graph compatibility.

============================================================
FP8Expert parity: eager / batched_mm / grouped_mm
============================================================

[case 1/5]
device=cuda  batch_size=1  num_tokens=64  total_tokens=64  num_experts=8  hidden=256  intermediate=512  top_k=2
  [eager vs eager_fused]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs batched_mm]  max=0.000000  mean=0.000000  PASS ✓
  [batched_mm vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓

[case 2/5]
device=cuda  batch_size=1  num_tokens=1  total_tokens=1  num_experts=8  hidden=256  intermediate=512  top_k=2
  [eager vs eager_fused]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs batched_mm]  max=0.000000  mean=0.000000  PASS ✓
  [batched_mm vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓

[case 3/5]
device=cuda  batch_size=1  num_tokens=7  total_tokens=7  num_experts=8  hidden=256  intermediate=512  top_k=1
  [eager vs eager_fused]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs batched_mm]  max=0.000000  mean=0.000000  PASS ✓
  [batched_mm vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓

[case 4/5]
device=cuda  batch_size=1  num_tokens=4  total_tokens=4  num_experts=8  hidden=256  intermediate=512  top_k=8
  [eager vs eager_fused]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs batched_mm]  max=0.000000  mean=0.000000  PASS ✓
  [batched_mm vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓

[case 5/5]
device=cuda  batch_size=4  num_tokens=64  total_tokens=256  num_experts=8  hidden=256  intermediate=512  top_k=2
  [eager vs eager_fused]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓
  [eager vs batched_mm]  max=0.000000  mean=0.000000  PASS ✓
  [batched_mm vs grouped_mm]  max=0.000000  mean=0.000000  PASS ✓

============================================================
All parity checks PASSED ✓

============================================================
Benchmark sweep
============================================================

────────────────────────────────────────────────────────────
Benchmark  device=cuda  batch_size=1  tokens=1  total=1  experts=8
           hidden=256  intermediate=512  top_k=2
────────────────────────────────────────────────────────────
  impl                            median (ms)    p10 (ms)    p90 (ms)   speedup
  eager                                 1.477       1.458       1.503  (baseline)
  eager_fused                           1.206       1.193       1.238     1.22x
  grouped_mm                            0.830       0.808       0.858     1.78x
  batched_mm                            0.423       0.418       0.441     3.49x
  eager (compiled)                      1.413       1.396       1.465     1.05x
  grouped_mm (compiled)                 0.138       0.136       0.147    10.67x
  batched_mm (compiled)                 0.138       0.136       0.151    10.71x

────────────────────────────────────────────────────────────
Benchmark  device=cuda  batch_size=1  tokens=8  total=8  experts=8
           hidden=256  intermediate=512  top_k=2
────────────────────────────────────────────────────────────
  impl                            median (ms)    p10 (ms)    p90 (ms)   speedup
  eager                                 4.506       4.430       4.581  (baseline)
  eager_fused                           3.672       3.654       3.700     1.23x
  grouped_mm                            0.809       0.797       0.840     5.57x
  batched_mm                            0.423       0.419       0.444    10.64x
  eager (compiled)                      4.752       4.733       4.775     0.95x
  grouped_mm (compiled)                 0.158       0.155       0.168    28.52x
  batched_mm (compiled)                 0.153       0.151       0.163    29.41x

────────────────────────────────────────────────────────────
Benchmark  device=cuda  batch_size=1  tokens=32  total=32  experts=8
           hidden=256  intermediate=512  top_k=2
────────────────────────────────────────────────────────────
  impl                            median (ms)    p10 (ms)    p90 (ms)   speedup
  eager                                 5.106       5.064       5.142  (baseline)
  eager_fused                           4.065       4.052       4.086     1.26x
  grouped_mm                            0.815       0.807       0.844     6.27x
  batched_mm                            0.433       0.428       0.455    11.80x
  eager (compiled)                      5.346       5.326       5.379     0.95x
  grouped_mm (compiled)                 0.159       0.157       0.167    32.01x
  batched_mm (compiled)                 0.167       0.165       0.177    30.62x

────────────────────────────────────────────────────────────
Benchmark  device=cuda  batch_size=1  tokens=128  total=128  experts=8
           hidden=256  intermediate=512  top_k=2
────────────────────────────────────────────────────────────
  impl                            median (ms)    p10 (ms)    p90 (ms)   speedup
  eager                                 5.236       5.098       5.275  (baseline)
  eager_fused                           4.094       4.073       4.197     1.28x
  grouped_mm                            0.826       0.815       0.854     6.34x
  batched_mm                            0.441       0.436       0.460    11.87x
  eager (compiled)                      5.572       5.545       5.611     0.94x
  grouped_mm (compiled)                 0.183       0.180       0.193    28.65x
  batched_mm (compiled)                 0.284       0.282       0.291    18.42x

────────────────────────────────────────────────────────────
Benchmark  device=cuda  batch_size=1  tokens=512  total=512  experts=8
           hidden=256  intermediate=512  top_k=2
────────────────────────────────────────────────────────────
  impl                            median (ms)    p10 (ms)    p90 (ms)   speedup
  eager                                 5.341       5.305       5.375  (baseline)
  eager_fused                           4.215       4.199       4.233     1.27x
  grouped_mm                            0.822       0.813       0.848     6.50x
  batched_mm                            1.405       1.391       1.416     3.80x
  eager (compiled)                      5.622       5.563       5.674     0.95x
  grouped_mm (compiled)                 0.219       0.217       0.224    24.40x
  batched_mm (compiled)                 0.724       0.712       0.734     7.38x

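For intuition on the two implementations benchmarked above, here is a minimal pure-Python sketch of the dispatch difference (illustrative only — the PR's actual kernels are FP8 CUTLASS/Triton matmuls, and every name below is hypothetical). grouped_mm gathers each expert's tokens and runs one variable-size matmul per expert over a contiguous buffer; batched_mm pads every expert's token buffer to a common capacity so all shapes stay fixed, a property that plausibly helps with CUDA graph capture.

```python
# Illustrative sketch (pure Python, no FP8): contrasts the two MoE expert
# dispatch strategies. All names here are hypothetical, not transformers code.

def matmul(a, b):
    """Naive (m, k) x (k, n) matmul on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

# 2 experts, hidden=2 -> out=2; one weight matrix per expert.
weights = [
    [[1, 0], [0, 1]],   # expert 0: identity
    [[2, 0], [0, 2]],   # expert 1: doubles its input
]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
assignment = [0, 1, 1]  # expert chosen for each token

def grouped_mm(tokens, assignment, weights):
    """Gather each expert's tokens (equivalent to sorting by expert), run one
    variable-size matmul per expert, then scatter back to token order."""
    out = [None] * len(tokens)
    for e in range(len(weights)):
        group = [i for i in range(len(tokens)) if assignment[i] == e]
        if group:
            res = matmul([tokens[i] for i in group], weights[e])
            for i, row in zip(group, res):
                out[i] = row
    return out

def batched_mm(tokens, assignment, weights):
    """Pad every expert's token buffer to the same capacity so one
    fixed-shape batched matmul covers all experts."""
    cap = max(assignment.count(e) for e in range(len(weights)))
    pad_row = [0.0] * len(tokens[0])
    out = [None] * len(tokens)
    for e in range(len(weights)):           # conceptually a single bmm call
        group = [i for i in range(len(tokens)) if assignment[i] == e]
        padded = [tokens[i] for i in group] + [pad_row] * (cap - len(group))
        res = matmul(padded, weights[e])
        for i, row in zip(group, res):      # drop the padding rows
            out[i] = row
    return out
```

Both paths must agree with the eager per-token loop, which is exactly what the parity harness above verifies.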

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -256 to +257:

- return is_grouped_mm_available()
+ return hasattr(torch.nn.functional, "grouped_mm") or hasattr(torch, "_grouped_mm")

@IlyasMoutawwakil (Member, Author):

Not sure why, but sometimes is_grouped_mm_available() and other functions that rely on package metadata/versions result in compilation failures.
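A rough sketch of why the inlined hasattr probe can be friendlier to torch.compile (using a hypothetical stand-in module, not transformers code): hasattr is a plain attribute lookup that the tracer can evaluate directly, whereas helpers that read package metadata or parse version strings at call time are more likely to break tracing.

```python
import types

# Hypothetical stand-in for torch: a bare module object we can probe.
fake_torch = types.ModuleType("fake_torch")
fake_torch.nn = types.SimpleNamespace(functional=types.SimpleNamespace())
fake_torch._grouped_mm = lambda *args: None  # pretend the private op exists

def grouped_mm_is_available(mod):
    # Mirrors the new check: a pure attribute probe, no metadata/version parsing.
    return hasattr(mod.nn.functional, "grouped_mm") or hasattr(mod, "_grouped_mm")
```

The check degrades gracefully: if neither attribute exists, it simply returns False.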

Copilot AI (Contributor) left a comment:

Pull request overview

This PR introduces new FP8 MoE expert implementations (batched and grouped matmuls) intended to significantly speed up fine-grained FP8 expert execution, with torch.compile / CUDA graphs compatibility, and expands tests to cover the different expert implementations.

Changes:

  • Add FP8 batched/grouped experts forward paths and CUTLASS/Triton dispatch plumbing in the fine-grained FP8 integration.
  • Update MoE grouped_mm availability checks and make grouped_mm token reordering more graph-friendly.
  • Parameterize the fine-grained FP8 MoE forward test across eager, batched_mm, and grouped_mm implementations.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/quantization/finegrained_fp8/test_fp8.py — Runs the FP8 MoE forward smoke test across multiple expert implementations.
  • src/transformers/utils/generic.py — Fixes a docstring typo (“though” → “through”).
  • src/transformers/quantizers/quantizer_finegrained_fp8.py — Updates the quantizer to target the new FP8 experts module type and marks it as compileable.
  • src/transformers/integrations/moe.py — Adjusts grouped_mm availability detection and permutation inversion logic for compile/cudagraph friendliness.
  • src/transformers/integrations/finegrained_fp8.py — Major refactor: kernel loading/dispatch changes, new FP8Experts, and new FP8 batched/grouped expert forward implementations.

Comment on lines +399 to +400
ALL_EXPERTS_FUNCTIONS["batched_mm"] = fp8_batched_mm_experts_forward
ALL_EXPERTS_FUNCTIONS["grouped_mm"] = fp8_grouped_mm_experts_forward
Copilot AI commented Mar 3, 2026:
Assigning ALL_EXPERTS_FUNCTIONS["batched_mm"] / ["grouped_mm"] here overrides the global experts dispatch used by all @use_experts_implementation models. That will route non-FP8 MoE layers into these FP8-specific functions (which expect FP8 scales / kernel APIs) and will likely crash. Prefer keeping the global mappings intact and dispatching to the FP8 implementations only for FP8 expert modules (e.g., by handling the selection inside FP8Experts.forward, or by using distinct keys and setting config._experts_implementation accordingly for FP8 models only).

Suggested change:
- ALL_EXPERTS_FUNCTIONS["batched_mm"] = fp8_batched_mm_experts_forward
- ALL_EXPERTS_FUNCTIONS["grouped_mm"] = fp8_grouped_mm_experts_forward
+ ALL_EXPERTS_FUNCTIONS["fp8_batched_mm"] = fp8_batched_mm_experts_forward
+ ALL_EXPERTS_FUNCTIONS["fp8_grouped_mm"] = fp8_grouped_mm_experts_forward
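A minimal sketch of the namespacing this thread converges on (registry, function, and config names here are hypothetical, not the actual transformers dispatch): keep the stock keys intact and have the FP8 path look up a prefixed key.

```python
# Hypothetical registry mirroring the reviewer's suggestion: FP8 forwards go
# under prefixed keys so generic MoE models keep the stock implementations.
ALL_EXPERTS_FUNCTIONS = {
    "batched_mm": lambda hidden: ("generic_batched", hidden),
    "grouped_mm": lambda hidden: ("generic_grouped", hidden),
}

def fp8_batched_mm_experts_forward(hidden):
    return ("fp8_batched", hidden)

def fp8_grouped_mm_experts_forward(hidden):
    return ("fp8_grouped", hidden)

# Namespaced registration instead of overwriting the generic keys.
ALL_EXPERTS_FUNCTIONS["fp8_batched_mm"] = fp8_batched_mm_experts_forward
ALL_EXPERTS_FUNCTIONS["fp8_grouped_mm"] = fp8_grouped_mm_experts_forward

def select_experts_forward(impl, is_fp8):
    # An FP8 quantizer could prefix the configured implementation name.
    key = f"fp8_{impl}" if is_fp8 else impl
    return ALL_EXPERTS_FUNCTIONS[key]
```

This keeps non-FP8 MoE layers on the generic implementations while FP8 expert modules opt in explicitly.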

@IlyasMoutawwakil (Member, Author) commented Mar 3, 2026:
My understanding is that using the setter results in file-specific changes, while registering results in global changes. @ArthurZucker, can you confirm?

Member:
Not sure about this, but maybe run some tests.

Collaborator:
Yeah, I don't remember the scope; registering would work better, and having the quantization config class change the impl to an FP8-prefixed one would be better, no?

@IlyasMoutawwakil (Member, Author):
I created a new interface specifically for FP8. As for having the "quantization config class change the impl to prefix with FP8", I'm not sure how to achieve that, because we read directly from model.config._experts_implementation.

IlyasMoutawwakil and others added 2 commits March 3, 2026 11:10
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@IlyasMoutawwakil IlyasMoutawwakil marked this pull request as ready for review March 3, 2026 14:34
@SunMarc (Member) left a comment:
Thanks a lot, left mostly minor comments


@Cyrilvallez (Member) left a comment:

Nice! Did not check the exact mathematics, but trusting you and @SunMarc on this!

@ArthurZucker (Collaborator) left a comment:

LGTM


@github-actions (Contributor):

[For maintainers] Suggested jobs to run (before merge)

run-slow: finegrained_fp8

@SunMarc (Member) left a comment:

Thanks for fixing the last bits! Merging.

@SunMarc SunMarc enabled auto-merge March 10, 2026 17:04
@SunMarc SunMarc added this pull request to the merge queue Mar 10, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 10, 2026
@ArthurZucker ArthurZucker merged commit ff2ba44 into main Mar 11, 2026
29 checks passed
@ArthurZucker ArthurZucker deleted the fp8-grouped-mm branch March 11, 2026 08:51