
qa: speed up dtype regex weight load + reduce dtype tests to 3 random#45635

Open
tarekziade wants to merge 3 commits into main from tarek-loadweight-hotspot

Conversation

Collaborator

@tarekziade tarekziade commented Apr 24, 2026

What does this PR do?

  1. make sure we reuse the compiled regex across calls
  2. reduce the test to 3 dtypes only (picked randomly from all supported ones) so the tests run faster without losing coverage over time.

On my M5, that drops

tests/models/d_fine/test_modeling_d_fine.py::DFineModelTest::test_bc_torch_dtype

from 7.13s to 2.17s.
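
For illustration, the dtype reduction amounts to something like this in the test; the dtype list below is a stand-in, the real one is derived from the dtypes the model supports:

import random

import torch

# Stand-in superset; the actual test builds this from the model's supported dtypes.
SUPPORTED_DTYPES = [torch.float32, torch.float16, torch.bfloat16, torch.float64]

# Sample 3 per run so coverage rotates across CI runs without paying for the full matrix.
dtypes_to_test = random.sample(SUPPORTED_DTYPES, k=3)
for dtype in dtypes_to_test:
    ...  # load the model in `dtype` and run the usual checks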

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@Cyrilvallez Cyrilvallez left a comment


Hey @tarekziade! I'm not really convinced that the regex compiling and the get_submodule calls are actual bottlenecks!
This very very simple snippet:

import re
import time

N = 1000

a = ["qwe|ui"] * 15
# Construct the regex we will use to rename keys from the sources to the targets
branches = []
for i, source_pattern in enumerate(a):
    group_name = f"g{i}"
    pattern = source_pattern.replace(".*.", r"\..*\.")
    branches.append(f"(?P<{group_name}>{pattern})")

t0 = time.time()
for _ in range(N):
    compiled_sources = re.compile("|".join(branches))
dt = time.time() - t0
print(f"Took {dt/N:.2e} s")

shows that the compile call is only on the order of 1e-6 to 1e-7 seconds.
And for get_submodule, the complexity is bounded by the depth of the module graph, which is super small as well (usually 10 at most, I'd say).
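
A similar back-of-the-envelope check for get_submodule, on a throwaway nested module (not anything from the PR), points the same way:

import time

import torch.nn as nn

# Nest a module ~10 levels deep, roughly the maximum depth mentioned above.
module = nn.Linear(4, 4)
for _ in range(10):
    module = nn.Sequential(module)
path = ".".join(["0"] * 10)  # dotted path down to the innermost Linear

N = 1000
t0 = time.time()
for _ in range(N):
    module.get_submodule(path)
dt = time.time() - t0
print(f"Took {dt / N:.2e} s per call")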

Comment thread src/transformers/core_model_loading.py
Comment on lines +962 to +968
def _resolve_pending_tensor(tensor_or_future: Future | Callable | torch.Tensor) -> torch.Tensor | None:
    if isinstance(tensor_or_future, Future):
        return tensor_or_future.result()
    elif callable(tensor_or_future):
        return tensor_or_future()
    else:
        return tensor_or_future
Member


Probably not needed to have an outer function here

Collaborator Author


it's used in two spots, that's why I factored it out into a helper here

@tarekziade
Collaborator Author

(quoting Cyrilvallez's review comment above, about regex compiling and get_submodule not being actual bottlenecks)

Sorry, I should have been clearer: re.compile itself is not that slow. The point is that the compiled regex is rarely needed, so compiling it lazily gives a small speedup because of the numerous WeightRenaming instances we create when loading weights. That class is on a hot path and its constructor should stay as lightweight as possible.

Using ~18,000 keys and 9 repeats, median times:

  • WeightRenaming init only
    • old: 0.589529s
    • new: 0.069270s
    • gain: 0.520258s (88.2%), 8.51x faster
  • WeightRenaming init + first rename_source_key()
    • old: 0.605083s
    • new: 0.602909s
    • gain: 0.002174s (0.4%), effectively flat
  • top-level rename_source_key() prefix remove path
    • old: 0.009985s
    • new: 0.005094s
    • gain: 0.004892s (49.0%), 1.96x faster
  • top-level rename_source_key() prefix add path
    • old: 0.004415s
    • new: 0.004897s
    • delta: -0.000483s (-10.9%)
    • this is a tiny absolute regression, about 0.48 ms over all 18,424 keys

As for the tests being slow right now, picking 3 random dtypes solves that. The remaining question is whether optimizing WeightRenaming is worth the added complexity.

@tarekziade
Collaborator Author

Using the D-FINE test_bc_torch_dtype run as the baseline, the lazy regex compile is much smaller than cutting the dtype matrix, but it is still measurable.

  • Full 7-dtype version of tests/test_modeling_common.py: 7.13s
  • 3-random-dtype version: 2.17s
  • Benefit from reducing dtypes: 4.96s saved, about 69.6% faster

For the lazy regex change in src/transformers/core_model_loading.py, I benchmarked the exact number of weights loaded by the old 7-dtype D-FINE run:

  • The test does 28 loads total: 14 x 650 weights and 14 x 666 weights
  • That is 18,424 loaded weights total
  • The isolated old-vs-new regex benchmark over 18,424 keys showed 0.520s saved from lazy regex compilation

So, at the whole-test level:

  • Lazy regex benefit on the old 7-dtype run: about 0.52s
  • 3-random-dtype benefit: 4.96s
  • Lazy regex is about 10.5% of the dtype-reduction benefit

To recap, the lazy regex compile is a smaller but real secondary win. In practice, though, it stays a fairly small win in production, since we only load a single dtype at a time.

That said, I would still recommend doing it, and adding a comment in that class saying that it's instantiated a lot at load time and its constructor should stay as thin as possible.

@Cyrilvallez
Member

Alright I see, thanks for the added explanation!
Agreed that we can keep the lazy regex compile, since it's a very easy and sometimes noticeable improvement! Let's just add why we do it in the property: mostly the fact that every key that doesn't match any weight ops still goes through a fresh WeightRenaming for convenience, but those never need to call rename_source_key, so eager compiling would be wasted.
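
For illustration, the property plus that comment could look roughly like this (a sketch: attribute names such as source_patterns are assumptions, not the exact ones in core_model_loading.py):

import re
from functools import cached_property

class WeightRenaming:  # only the relevant part is sketched here
    def __init__(self, source_patterns):
        self.source_patterns = source_patterns  # keep __init__ cheap, one instance per key

    @cached_property
    def compiled_sources(self):
        # Compiled lazily on first access: every key that doesn't match any weight ops
        # still gets a fresh WeightRenaming for convenience, and those instances never
        # call rename_source_key, so compiling eagerly in __init__ would be wasted work
        # on the model-loading hot path.
        branches = []
        for i, source_pattern in enumerate(self.source_patterns):
            pattern = source_pattern.replace(".*.", r"\..*\.")
            branches.append(f"(?P<g{i}>{pattern})")
        return re.compile("|".join(branches))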

@tarekziade tarekziade force-pushed the tarek-loadweight-hotspot branch from 817f736 to f94c674 on April 28, 2026 at 07:21
@ydshieh
Collaborator

ydshieh commented Apr 28, 2026

Hi @tarekziade. Thanks for the work.

The origin of this optimization is to make some tests faster. The random-3-dtypes part works great for that.

For the changes in the core model loading file, the benefit so far is negligible (in our tests, or if I load a single tiny model). I am not saying it's not useful, but I am wondering:

  • whether that change would show more benefit when we load a large (huge) model a single time?
  • and, for small models (like in our CircleCI), whether the (absolute) amount of saving would become more visible if we load a model multiple times?
  • For the deepcopy part, is it really necessary? Doesn't re.compile already use some cache mechanism, so compiling the same pattern again would not be slow? (I am not sure and I might be wrong here.)

Overall, I am fine as long as @Cyrilvallez is happy with the change. But it would be nice if we could identify which part is really improving the loading, and limit the change to the smallest possible scope.

@tarekziade
Collaborator Author

@ydshieh Thanks for the thoughtful questions, that’s very helpful.

You’re right that most of the measurable speedup comes from reducing the number of dtypes we iterate over.

The changes in the core loading path are more about a design concern that surfaced while doing this: the weight class used for conversion does some preparation work in its constructor that is rarely used, while it sits directly on the critical path (one instance per key).

So the goal there is mainly to keep that constructor as lightweight as possible, e.g. avoid creating objects (like regexes) that we don't actually use, and prevent this from growing over time.

In terms of impact:

  • On large key counts (~18k), this shaves ~0.5s (what we see in tests today)
  • For typical models (~500 keys), the gain is indeed negligible on a single load

I haven't tested large models yet, but I'd expect the benefit to scale with the number of keys rather than with model size itself.

So I agree this is more of a “keep the hot path clean” improvement than a big performance win.

On deepcopy: it’s needed because collected_tensors, layer_targets, and _was_used accumulate state during loading, so we need a fresh instance per target. This is orthogonal to re.compile.
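
A minimal sketch of that point, with only the accumulating fields shown (the real class carries more than this):

import copy

class RenamingState:  # stand-in holding only the mutable, per-load state
    def __init__(self):
        self.collected_tensors = []  # filled in as weights are loaded
        self.layer_targets = {}
        self._was_used = False

template = RenamingState()

# Each target needs its own copy, otherwise state accumulated while loading one
# target would leak into the next.
first = copy.deepcopy(template)
second = copy.deepcopy(template)
first.collected_tensors.append("some_tensor")
first._was_used = True
assert second.collected_tensors == [] and second._was_used is False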

And you're right about re.compile: it's cached in the stdlib itself, so repeated calls are not a concern here.
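
That cache is easy to observe (the identity check below is a CPython implementation detail rather than a documented guarantee):

import re

# CPython keeps an internal cache of recently compiled patterns, so compiling the
# same pattern string twice returns the very same object.
first = re.compile(r"model\.layers\.\d+\.weight")
second = re.compile(r"model\.layers\.\d+\.weight")
print(first is second)  # True, thanks to the internal cache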

Happy to reduce the scope if we feel this is too much for the current benefit 👍

@ydshieh
Collaborator

ydshieh commented Apr 28, 2026

OK, thanks for explaining. So deepcopy is needed (but not because of re.compile).

Happy to reduce the scope if we feel this is too much for the current benefit 👍

Since Cyril is convinced, it's fine from my side after your comment above 👍
