Conversation

@remi-or (Collaborator) commented Oct 7, 2025

This PR overhauls the benchmarking suite that is included in transformers.
The benchmarking suite is now based around three main components:

  • BenchmarkingConfig is a dataclass-like object that contains everything needed to reproduce a benchmark on the same machine: input length, generation length, whether to use kernels or compile, attention implementation, etc. (subject to name change)
  • BenchmarkRunner is the class that runs the benchmarks defined by the configs, with a given number of measurement iterations, warmup iterations, and a model id. The runner sets up the runs so that no run interferes with the downstream ones: the model is reloaded, the cache is emptied, and the GPU memory is flushed. It also saves the results, the config, and any additional metadata needed to reproduce the benchmark, such as hardware information and package versions.
  • The results files contain enough information to derive (to my knowledge) most of the metrics used to evaluate a model: e2e_latency, tpot, ttft, even inter-token latency. Results also include a sample of what was generated, which is useful for checking whether it was gibberish. The results files are in JSON format and are made to be easily created from the dataclass-like objects and vice versa.
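The latency metrics listed above can all be derived from per-token timestamps. A minimal sketch of those derivations (not the PR's actual code; the function and field names here are illustrative):

```python
# Derive common generation-latency metrics from per-token timestamps.
# request_start: time the request was issued; token_times: wall-clock
# time at which each generated token became available.

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_start            # time to first token
    e2e_latency = token_times[-1] - request_start    # end-to-end latency
    # inter-token latencies: gaps between consecutive tokens
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # time per output token, averaged over the decode phase
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {"ttft": ttft, "e2e_latency": e2e_latency, "tpot": tpot, "itl": itl}
```

Storing raw timestamps in the results file, rather than pre-computed aggregates, is what lets all of these be recomputed after the fact.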

For now, the new benchmarking suite replaces the benchmark_v2 part of transformers, but it could also replace the benchmark (v1) part. It would be good to make that decision in this PR and to update the CI workflows that rely on the current benchmark_v2 (the PR stays in draft mode until then).
An example of how to use the new benchmarking suite can be found in run_benchmarks.py.

The format of the results files can (and likely will) change as we develop tools to analyze them.
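To illustrate the "easily created from the dataclass-like objects and vice versa" point: a plain dataclass round-trips to and from JSON with only the standard library. This is a sketch, assuming a simplified config; the real BenchmarkingConfig has different fields.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative config; field names here are hypothetical.
@dataclass
class BenchmarkingConfig:
    input_length: int
    generation_length: int
    attn_implementation: str = "sdpa"
    compile: bool = False

def to_json(cfg: BenchmarkingConfig) -> str:
    # asdict() recursively converts the dataclass to plain dicts/lists
    return json.dumps(asdict(cfg))

def from_json(payload: str) -> BenchmarkingConfig:
    # the JSON keys match the dataclass fields, so ** unpacking suffices
    return BenchmarkingConfig(**json.loads(payload))
```

Keeping the on-disk schema a direct mirror of the dataclass fields is what makes the "vice versa" direction a one-liner.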
If there is a metric you want to see measured in transformers, please leave a comment before this is merged 🙂

@remi-or remi-or self-assigned this Oct 7, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@McPatate (Member) left a comment


btw to disable the associated CI workflows while I rework them later:

on:
  workflow_dispatch:

and you can delete the rest of the triggers
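Put together, a workflow file disabled this way would look roughly like the following (a sketch; the workflow name, job, and runner are placeholders, not the repo's actual CI config):

```yaml
# Only the manual trigger remains; push/pull_request/schedule triggers
# are deleted, so the workflow runs only when dispatched by hand.
name: Benchmarks
on:
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest  # real benchmarks would target a GPU runner
    steps:
      - uses: actions/checkout@v4
      - run: python run_benchmarks.py
```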

A few naming comments overall, but minor stuff I believe, feel free to ignore, gj 👌🏻

remi-or and others added 5 commits October 14, 2025 13:28
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
@remi-or remi-or marked this pull request as ready for review October 14, 2025 13:42
@remi-or remi-or merged commit 94df0e6 into huggingface:main Oct 14, 2025
13 checks passed
i3hz pushed a commit to i3hz/transformers that referenced this pull request Oct 15, 2025
* Big refactor, still classes to move around and script to re-complexify

* Move to streamer, isolate benches, propagate num tokens

* Some refacto

* Added compile mode to name

* Re-order

* Move to dt_tokens

* Better format

* Fix and disable use_cache by default

* Fixed compile and SDPA backend default

* Refactor results format

* Added default compile mode

* Always use cache

* Fixed cache and added flex

* Plan for missing modules

* Experiments: no cg and shuffle

* Disable compile for FA

* Remove wall time, add sweep mode, get git commit

* Review compliance, start

* Apply suggestions from code review

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Update benchmark_v2/framework/benchmark_runner.py

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Disable workflow

* Pretty print

* Added some pretty names to have pretty logs

* Review n2 compliance (end?)

* Style and end of PR

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
ngazagna-qc pushed a commit to ngazagna-qc/transformers that referenced this pull request Oct 23, 2025
(same commit message as above)

3 participants