
qa: speed up dtype regex weight load + reduce dtype tests to 3 random#45635

Open
tarekziade wants to merge 3 commits into main from tarek-loadweight-hotspot

Conversation

Collaborator

@tarekziade tarekziade commented Apr 24, 2026

What does this PR do?

  1. make sure we reuse the compiled regex across calls
  2. reduce the test to 3 dtypes only (picked randomly from all supported ones) so the tests run faster without losing coverage over time.

On my M5, that drops

tests/models/d_fine/test_modeling_d_fine.py::DFineModelTest::test_bc_torch_dtype

from 7.13s to 2.17s.
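
For illustration, the dtype reduction amounts to something like this in the test; the dtype list below is a stand-in, the real one is derived from the dtypes the model supports:

import random

import torch

# Stand-in superset; the actual test builds this from the model's supported dtypes.
SUPPORTED_DTYPES = [torch.float32, torch.float16, torch.bfloat16, torch.float64]

# Sample 3 per run so coverage rotates across CI runs without paying for the full matrix.
dtypes_to_test = random.sample(SUPPORTED_DTYPES, k=3)
for dtype in dtypes_to_test:
    ...  # load the model in `dtype` and run the usual checks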

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@Cyrilvallez Cyrilvallez left a comment


Hey @tarekziade! I'm not really convinced that the regex compiling and the get_submodule calls are actual bottlenecks!
This very very simple snippet:

import re
import time

N = 1000

a = ["qwe|ui"] * 15
# Construct the regex we will use to rename keys from the sources to the targets
branches = []
for i, source_pattern in enumerate(a):
    group_name = f"g{i}"
    pattern = source_pattern.replace(".*.", r"\..*\.")
    branches.append(f"(?P<{group_name}>{pattern})")

t0 = time.time()
for _ in range(N):
    compiled_sources = re.compile("|".join(branches))
dt = time.time() - t0
print(f"Took {dt/N:.2e} s")

shows that the compile call is only on the order of 1e-6 to 1e-7 seconds.
And for get_submodule, the complexity is bounded by the depth of the module graph, which is super small as well (usually 10 at most, I'd say).
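
A similar back-of-the-envelope check for get_submodule, on a throwaway nested module (not anything from the PR), points the same way:

import time

import torch.nn as nn

# Nest a module ~10 levels deep, roughly the maximum depth mentioned above.
module = nn.Linear(4, 4)
for _ in range(10):
    module = nn.Sequential(module)
path = ".".join(["0"] * 10)  # dotted path down to the innermost Linear

N = 1000
t0 = time.time()
for _ in range(N):
    module.get_submodule(path)
dt = time.time() - t0
print(f"Took {dt / N:.2e} s per call")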

Comment thread src/transformers/core_model_loading.py
Comment on lines +962 to +968
def _resolve_pending_tensor(tensor_or_future: Future | Callable | torch.Tensor) -> torch.Tensor | None:
    if isinstance(tensor_or_future, Future):
        return tensor_or_future.result()
    elif callable(tensor_or_future):
        return tensor_or_future()
    else:
        return tensor_or_future
Member


Probably not needed to have an outer function here

Collaborator Author


it's used in two spots, that's why I factored it out into a helper here

@tarekziade
Collaborator Author

(quoting Cyrilvallez's review comment above, about regex compiling and get_submodule not being actual bottlenecks)

Sorry, I should have been clearer: re.compile itself is not that slow. The point is that the compiled regex is rarely needed, so compiling it lazily gives a small speedup because of the numerous WeightRenaming instances we create when loading weights. That class is on a hot path and its constructor should stay as lightweight as possible.

Using ~18,000 keys and 9 repeats, median times:

  • WeightRenaming init only
    • old: 0.589529s
    • new: 0.069270s
    • gain: 0.520258s (88.2%), 8.51x faster
  • WeightRenaming init + first rename_source_key()
    • old: 0.605083s
    • new: 0.602909s
    • gain: 0.002174s (0.4%), effectively flat
  • top-level rename_source_key() prefix remove path
    • old: 0.009985s
    • new: 0.005094s
    • gain: 0.004892s (49.0%), 1.96x faster
  • top-level rename_source_key() prefix add path
    • old: 0.004415s
    • new: 0.004897s
    • delta: -0.000483s (-10.9%)
    • this is a tiny absolute regression, about 0.48 ms over all 18,424 keys

As for the tests being slow right now, picking 3 random dtypes solves that. The remaining question is whether optimizing WeightRenaming is worth the added complexity.

@tarekziade
Collaborator Author

Using the D-FINE test_bc_torch_dtype run as the baseline, the lazy regex compile is much smaller than cutting the dtype matrix, but it is still measurable.

  • Full 7-dtype version of tests/test_modeling_common.py: 7.13s
  • 3-random-dtype version: 2.17s
  • Benefit from reducing dtypes: 4.96s saved, about 69.6% faster

For the lazy regex change in src/transformers/core_model_loading.py, I benchmarked the exact number of weights loaded by the old 7-dtype D-FINE run:

  • The test does 28 loads total: 14 x 650 weights and 14 x 666 weights
  • That is 18,424 loaded weights total
  • The isolated old-vs-new regex benchmark over 18,424 keys showed 0.520s saved from lazy regex compilation

So, at the whole-test level:

  • Lazy regex benefit on the old 7-dtype run: about 0.52s
  • 3-random-dtype benefit: 4.96s
  • Lazy regex is about 10.5% of the dtype-reduction benefit

To recap, the lazy regex compile is a smaller but real secondary win. In practice, though, it stays a fairly small win in production, since we only load a single dtype at a time.

That said, I would still recommend doing it, and adding a comment in that class saying that it's instantiated a lot at load time and its constructor should stay as thin as possible.

@Cyrilvallez
Member

Alright I see, thanks for the added explanation!
Agreed that we can keep the lazy regex compile, since it's a very easy and sometimes noticeable improvement! Let's just add why we do it in the property: mostly the fact that every key that doesn't match any weight ops still goes through a fresh WeightRenaming for convenience, but those never need to call rename_source_key, so eager compiling would be wasted.
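
For illustration, the property plus that comment could look roughly like this (a sketch: attribute names such as source_patterns are assumptions, not the exact ones in core_model_loading.py):

import re
from functools import cached_property

class WeightRenaming:  # only the relevant part is sketched here
    def __init__(self, source_patterns):
        self.source_patterns = source_patterns  # keep __init__ cheap, one instance per key

    @cached_property
    def compiled_sources(self):
        # Compiled lazily on first access: every key that doesn't match any weight ops
        # still gets a fresh WeightRenaming for convenience, and those instances never
        # call rename_source_key, so compiling eagerly in __init__ would be wasted work
        # on the model-loading hot path.
        branches = []
        for i, source_pattern in enumerate(self.source_patterns):
            pattern = source_pattern.replace(".*.", r"\..*\.")
            branches.append(f"(?P<g{i}>{pattern})")
        return re.compile("|".join(branches))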

@tarekziade tarekziade force-pushed the tarek-loadweight-hotspot branch from 817f736 to f94c674 on April 28, 2026 at 07:21
@ydshieh
Collaborator

ydshieh commented Apr 28, 2026

Hi @tarekziade. Thanks for the work.

The origin of this optimization is to make some tests faster. The random-3-dtypes part works great for that.

For the changes in the core model loading file, the benefit so far is negligible (in our tests, or if I load a single tiny model). I am not saying it's not useful, but I am wondering:

  • whether that change would show more benefit when we load a large (huge) model a single time?
  • and, for small models (like in our CircleCI), whether the (absolute) amount of saving would become more visible if we load a model multiple times?
  • For the deepcopy part, is it really necessary? Doesn't re.compile already use some cache mechanism, so compiling the same pattern again would not be slow? (I am not sure and I might be wrong here.)

Overall, I am fine as long as @Cyrilvallez is happy with the change. But it would be nice if we could identify which part is really improving the loading, and limit the change to the smallest possible scope.

@tarekziade
Collaborator Author

@ydshieh Thanks for the thoughtful questions, that’s very helpful.

You’re right that most of the measurable speedup comes from reducing the number of dtypes we iterate over.

The changes in the core loading path are more about a design concern that surfaced while doing this: the weight class used for conversion does some preparation work in its constructor that is rarely used, while it sits directly on the critical path (one instance per key).

So the goal there is mainly to keep that constructor as lightweight as possible, e.g. avoid creating objects (like regexes) that we don't actually use, and prevent this from growing over time.

In terms of impact:

  • On large key counts (~18k), this shaves ~0.5s (what we see in tests today)
  • For typical models (~500 keys), the gain is indeed negligible on a single load

I haven't tested large models yet, but I'd expect the benefit to scale with the number of keys rather than with model size itself.

So I agree this is more of a “keep the hot path clean” improvement than a big performance win.

On deepcopy: it’s needed because collected_tensors, layer_targets, and _was_used accumulate state during loading, so we need a fresh instance per target. This is orthogonal to re.compile.
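
A minimal sketch of that point, with only the accumulating fields shown (the real class carries more than this):

import copy

class RenamingState:  # stand-in holding only the mutable, per-load state
    def __init__(self):
        self.collected_tensors = []  # filled in as weights are loaded
        self.layer_targets = {}
        self._was_used = False

template = RenamingState()

# Each target needs its own copy, otherwise state accumulated while loading one
# target would leak into the next.
first = copy.deepcopy(template)
second = copy.deepcopy(template)
first.collected_tensors.append("some_tensor")
first._was_used = True
assert second.collected_tensors == [] and second._was_used is False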

And you're right about re.compile: it's cached in the stdlib itself, so repeated calls are not a concern here.
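
That cache is easy to observe (the identity check below is a CPython implementation detail rather than a documented guarantee):

import re

# CPython keeps an internal cache of recently compiled patterns, so compiling the
# same pattern string twice returns the very same object.
first = re.compile(r"model\.layers\.\d+\.weight")
second = re.compile(r"model\.layers\.\d+\.weight")
print(first is second)  # True, thanks to the internal cache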

Happy to reduce the scope if we feel this is too much for the current benefit 👍

@ydshieh
Collaborator

ydshieh commented Apr 28, 2026

OK, thanks for explaining. So deepcopy is needed (but not because of re.compile).

Happy to reduce the scope if we feel this is too much for the current benefit 👍

Since Cyril is convinced, it's fine from my side after your comment above 👍
