qa: speed up dtype regex weight load + reduce dtype tests to 3 random #45635
tarekziade wants to merge 3 commits into main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cyrilvallez left a comment:

Hey @tarekziade! I'm not really convinced that the regex compiling and the `get_submodule` calls are actual bottlenecks!

This very simple snippet:
```python
import re
import time

N = 1000
a = ["qwe|ui"] * 15

# Construct the regex we will use to rename keys from the sources to the targets
branches = []
for i, source_pattern in enumerate(a):
    group_name = f"g{i}"
    pattern = source_pattern.replace(".*.", r"\..*\.")
    branches.append(f"(?P<{group_name}>{pattern})")

t0 = time.time()
for _ in range(N):
    compiled_sources = re.compile("|".join(branches))
dt = time.time() - t0
print(f"Took {dt/N:.2e} s")
```

shows that the compile call is only on the order of 1e-6/1e-7 s per call.
And for get_submodule, the complexity is bounded by the depth of the graph, which is super small as well (usually maximum 10 I'd say)
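To illustrate the depth-bound claim, here is a pure-Python analogue of a dotted-path submodule lookup (the `Node` class and `get_submodule` helper below are illustrative stand-ins, not the actual torch/transformers implementation): resolving a key walks one attribute per path segment, so its cost is bounded by the depth of the module tree.

```python
from functools import reduce

# Minimal stand-in for a module tree; names are illustrative only.
class Node:
    def __init__(self, **children):
        for name, child in children.items():
            setattr(self, name, child)

def get_submodule(root, dotted_path):
    # One getattr per path segment: cost is O(depth), not O(model size).
    return reduce(getattr, dotted_path.split("."), root)

model = Node(encoder=Node(layer0=Node(attn="attn-weights")))
print(get_submodule(model, "encoder.layer0.attn"))  # attn-weights
```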
```python
def _resolve_pending_tensor(tensor_or_future: Future | Callable | torch.Tensor) -> torch.Tensor | None:
    if isinstance(tensor_or_future, Future):
        return tensor_or_future.result()
    elif callable(tensor_or_future):
        return tensor_or_future()
    else:
        return tensor_or_future
```
Probably not needed to have an outer function here
it's used in two spots, that's why I factored it out into a helper here
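For reference, the three branches of the helper can be exercised with stdlib futures alone (a sketch assuming the `Future` type in the PR is compatible with `concurrent.futures.Future`, and using a plain string in place of a `torch.Tensor`):

```python
from concurrent.futures import Future

def resolve_pending(tensor_or_future):
    if isinstance(tensor_or_future, Future):
        return tensor_or_future.result()   # already-scheduled load
    elif callable(tensor_or_future):
        return tensor_or_future()          # lazy materializer
    return tensor_or_future                # eager value, pass through

fut = Future()
fut.set_result("weights")
print(resolve_pending(fut))             # weights
print(resolve_pending(lambda: "lazy"))  # lazy
print(resolve_pending("eager"))         # eager
```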
Sorry, I should have been clearer. Using ~18,000 keys and 9 repeats, median times:

Now, with the tests being slow, the random 3x dtypes change solves it. I guess it's deciding if optimizing
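The "random 3x dtypes" idea can be sketched as follows (the dtype list and seed handling are illustrative, not the actual transformers test code, though the old run did iterate 7 dtypes):

```python
import random

# Hypothetical dtype list standing in for the 7-dtype test matrix.
ALL_DTYPES = ["float32", "float16", "bfloat16", "float64", "int8", "int32", "uint8"]

def sample_dtypes(k=3, seed=None):
    # Sample k distinct dtypes per test run instead of iterating all of them.
    rng = random.Random(seed)
    return rng.sample(ALL_DTYPES, k)

print(sample_dtypes(seed=0))
```

An explicit seed keeps a given CI run reproducible while still rotating coverage across runs.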
Using the D-FINE model: for the lazy regex change in src/transformers/core_model_loading.py, I benchmarked with the exact number of weights loaded by the old 7-dtype D-FINE run:
So, at the whole-test level:
To recap, the lazy regex compile is a smaller but real secondary win. In practice, when running in production, the regex change is quite a small win, since we only load a single dtype at a time. That said, I would still recommend doing it, and adding a comment in that class to say it's instantiated a lot at load time and its constructor should stay as thin as possible.
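The "thin constructor" recommendation can be sketched with `functools.cached_property`, deferring the regex compile until the first match (the class name and patterns below are hypothetical, not the actual transformers code; only the branch-building logic mirrors the snippet earlier in the thread):

```python
import re
from functools import cached_property

class SourceKeyMatcher:
    """Hypothetical sketch: __init__ only stores inputs, since one instance
    is created per key at load time; the regex is compiled on first use."""

    def __init__(self, source_patterns):
        self.source_patterns = source_patterns  # cheap: no compilation here

    @cached_property
    def compiled_sources(self):
        # Same construction as the loader snippet: one named group per pattern.
        branches = []
        for i, source_pattern in enumerate(self.source_patterns):
            pattern = source_pattern.replace(".*.", r"\..*\.")
            branches.append(f"(?P<g{i}>{pattern})")
        return re.compile("|".join(branches))

    def match(self, key):
        return self.compiled_sources.match(key)

matcher = SourceKeyMatcher(["encoder.*.weight", "decoder.*.bias"])
print("compiled_sources" in matcher.__dict__)          # False: nothing compiled yet
print(bool(matcher.match("encoder.layers.0.weight")))  # True
```

Instances whose `match` is never called then pay only the cost of storing a list, which is the point of keeping the constructor on the critical path thin.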
Alright, I see, thanks for the added explanation!
Force-pushed from 817f736 to f94c674.
Hi @tarekziade. Thanks for the work. The origin of this optimization is to make some tests faster. The random 3x dtypes part works great. For the changes in the core model loading file, so far the benefit is minor (in our tests, or if I load a single tiny model). I am not saying it's not useful, but I am wondering:

Overall, I am fine as long as @Cyrilvallez is happy with the change. But it would be nice if we can identify which part is really improving the loading, and limit the change to the smallest possible scope.
@ydshieh Thanks for the thoughtful questions, that's very helpful. You're right that most of the measurable speedup comes from reducing the number of dtypes we iterate over. The changes in the core loading path are more about a design concern that surfaced while doing this: the weight class used for conversion does some preparation work in its constructor that is rarely used, while it sits directly on the critical path (one instance per key). So the goal there is mainly to keep that constructor as lightweight as possible, e.g. avoid creating objects (like regexps) that we don't actually use, and prevent this from growing over time. In terms of impact:
I haven't tested large models yet, but I'd expect the benefit to scale with the number of keys rather than model size itself. So I agree this is more of a "keep the hot path clean" improvement than a big performance win. On …, and you're right about …. Happy to reduce the scope if we feel this is too much for the current benefit 👍
OK, thanks for explaining. So …

Since Cyril is convinced, it's fine from my side after your comment above 👍
What does this PR do?
on my m5, that drops from 7.13s to 2.17s.