
fix: performance of _ls_tree #2103

Merged · 3 commits · Apr 11, 2024
Conversation

awgr
Contributor

@awgr awgr commented Mar 9, 2024

No description provided.

@awgr
Contributor Author

awgr commented Mar 9, 2024

huggingface/datasets#6726

Before change:
[profiling screenshot]

After change:
[profiling screenshot]

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin
Contributor

Wauplin commented Mar 11, 2024

Hi @awgr, thanks a lot for your PR and clear profiling explanations! And thanks for finding out the deepcopy was the culprit here 😬 I'd like to take advantage of your PR to slightly refactor this part of the code here and in particular:

  • move your make_cache_path_info method to a separate (internal) helper
  • reuse this helper both in _ls_tree and info methods which have the same duplicated logic

Would you like to do it? Otherwise, I am more than happy to take care of it myself if it's fine for you that I push changes to your branch?

Thanks in advance!

@awgr
Contributor Author

awgr commented Mar 11, 2024

> Hi @awgr, thanks a lot for your PR and clear profiling explanations! And thanks for finding out the deepcopy was the culprit here 😬 I'd like to take advantage of your PR to slightly refactor this part of the code here and in particular:
>
>   • move your make_cache_path_info method to a separate (internal) helper
>   • reuse this helper both in _ls_tree and info methods which have the same duplicated logic
>
> Would you like to do it? Otherwise, I am more than happy to take care of it myself if it's fine for you that I push changes to your branch?
>
> Thanks in advance!

Ah feel free to push code to my branches haha, but if you don't have time I'll make an effort to take care of it later tonight.

@awgr
Contributor Author

awgr commented Mar 12, 2024

After reviewing this again I've realized that a slightly larger change is needed, beyond both the original change (which would've caused problems later) and what you've asked for above.
It's to do with the integrity of the dircache; will consider it a bit tomorrow and return.

@Wauplin
Contributor

Wauplin commented Mar 12, 2024

Thanks for rechecking and letting me know @awgr! Let me know when the PR is suitable for a second review then :) If you have any question, feel free to ask.

@awgr
Contributor Author

awgr commented Mar 19, 2024

> Thanks for rechecking and letting me know @awgr! Let me know when the PR is suitable for a second review then :) If you have any question, feel free to ask.

Sorry, not yet. I hadn't had time, but came back tonight to take another look.

What I'd realized is this: my local modification produced the improved profile, but it did not keep the dircache safe from callers mutating the returned values.

After thinking more, I think there are two typical ways to prevent modification:

  1. Don't use any mutable types in the dircache (too late)
  2. Make an eventually consistent read mirror for the dircache and ensure consistency in the background

(1) is ideal, but I think the API components (LastCommitInfo, BlobLfsInfo, BlobSecurityInfo, ...) would need to be effectively immutable on creation. I'm not sure which patterns in huggingface_hub that would break...

While I respect the intent of the defensive copy, it accounts for a large share of the compute time; I wonder whether a caller might simply opt to disable it, accepting responsibility for any problems that arise from mutating the values returned by the API.

What do you think?
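For what it's worth, option (1) could look something like frozen dataclasses. A minimal sketch, assuming the class and field names are illustrative stand-ins rather than the real huggingface_hub definitions:

```python
from dataclasses import dataclass

# Hypothetical frozen variant of an info type such as BlobLfsInfo.
# Instances reject attribute assignment after creation, so dircache
# entries built from them could be handed to callers without copying.
@dataclass(frozen=True)
class FrozenBlobLfsInfo:
    size: int
    sha256: str
    pointer_size: int

info = FrozenBlobLfsInfo(size=1024, sha256="abc123", pointer_size=134)
print(info.size)  # `info.size = 0` would raise dataclasses.FrozenInstanceError
```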

@Wauplin
Contributor

Wauplin commented Mar 19, 2024

Thanks @awgr for taking a look back at it. I agree that (1) would be ideal but I'd leave this to a separate issue. Making LastCommitInfo, BlobLfsInfo, BlobSecurityInfo, ... immutable will require some internal changes that are not trivial - especially to make sure things are staying backward compatible.

However given the numbers, I agree we should find a solution to replace the deepcopy. What do you think of implementing a weak copy specific to our needs. Something like

```python
from typing import List

def _weak_copy_ls_tree(items: List) -> List:
    return [
        {
            "name": item["name"],
            "size": item["size"],
            "type": item["type"],
            "blob_id": item.get("blob_id"),
            "tree_id": item.get("tree_id"),
            "lfs": BlobLfsInfo(**item["lfs"]) if item.get("lfs") is not None else None,
            "last_commit": LastCommitInfo(**item["last_commit"]) if item.get("last_commit") is not None else None,
            "security": BlobSecurityInfo(**item["security"]) if item.get("security") is not None else None,
        }
        for item in items
    ]
```

would work and should be faster than a generic deepcopy. Would you like to give it a try?

@awgr
Contributor Author

awgr commented Mar 19, 2024

Thanks @Wauplin! I had the same thought, so either there are no nested mutable objects, or we are not concerned with mutations of those modifying the dircache? This approach will certainly improve the performance of ls_tree, but may introduce the same dircache-polluting problems, e.g. BlobSecurityInfo's field av_scan is also a mutable dictionary.
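To make the concern concrete, here is a small standalone demo (plain dicts stand in for the real cache entries; names are hypothetical) showing that a top-level shallow copy still shares nested mutables like av_scan:

```python
import copy

# Stand-in for a dircache entry; the nested "security"/"av_scan" dicts
# mimic mutable fields such as BlobSecurityInfo.av_scan.
cached_item = {"name": "model.bin", "security": {"av_scan": {"status": "clean"}}}

weak = dict(cached_item)  # shallow copy: nested dicts are the SAME objects
weak["security"]["av_scan"]["status"] = "mutated"
print(cached_item["security"]["av_scan"]["status"])  # "mutated" -- the cache changed too

source = {"security": {"av_scan": {"status": "clean"}}}
deep = copy.deepcopy(source)  # deepcopy duplicates every level
deep["security"]["av_scan"]["status"] = "mutated"
print(source["security"]["av_scan"]["status"])  # "clean" -- the source is isolated
```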

@mariosasko
Contributor

s3fs and gcsfs, as the two most popular fsspec filesystems, do not have special logic to make dircache resistant to modifications, so I think @Wauplin's solution is already good enough (not guarding against modifying deeply nested fields should be okay)

@awgr force-pushed the fix/performance_of_ls_tree branch 3 times, most recently from 72e192a to 773070c on April 7, 2024 at 20:22
@awgr force-pushed the fix/performance_of_ls_tree branch from 773070c to db7a6ec on April 7, 2024 at 20:24
@awgr
Contributor Author

awgr commented Apr 7, 2024

Hey, sorry I'm a bit late here; I've been a little busy.

I've updated the PR to shallow copy in the manner described; I used copy.copy for the shallow copy, and it performs identically to @Wauplin's code.

So, this PR now moves the logic for making the cache_path_info into a single function, and that function can return either a shallow copy, or not.

I almost revised to

```python
def _shallow_copy(self, obj: Any, depth: int) -> Any:
    if depth == 0:
        return copy.copy(obj)
    if isinstance(obj, dict):
        return {k: self._shallow_copy(v, depth - 1) for k, v in obj.items()}
    if isinstance(obj, list):
        return [self._shallow_copy(v, depth - 1) for v in obj]
    return obj
```

to be called as _shallow_copy(out, depth=1), but it felt a little too generic.
I'll be more available this week, if there is any other feedback to handle.

@awgr
Contributor Author

awgr commented Apr 7, 2024

I'll profile this again on Monday or so...

@Wauplin
Contributor

Wauplin commented Apr 8, 2024

Thanks for getting back to this @awgr!

Actually I think that if we go for a "non-safe-but-ok" solution, we should have a flag to enable/disable deepcopy that would be set to False by default (i.e. to not deepcopy). And when we don't deepcopy, there is no need to make copies at all (i.e. no intermediate state where we partially copy stuff).

@mariosasko
Contributor

I'm not a fan of a flag that would be hidden/rarely used (does not follow the spec).

I couldn't find a fsspec filesystem that takes special care of such modifications, so now I think removing the copy.deepcopy calls (introduced in #1888) is probably the best solution. (Maybe defining a typing.TypedDict with fixed keys and typing.ReadOnly attributes would make it more obvious that the cache is not robust to modifications.)
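A sketch of that TypedDict idea, assuming hypothetical names (ReadOnly itself is PEP 705, available in `typing` only on Python 3.13+ or via typing_extensions, and has no runtime effect):

```python
from typing import Optional, TypedDict

# Hypothetical shape for a dircache entry. With PEP 705 one would write
# e.g. `name: ReadOnly[str]` so type checkers flag mutation of cached
# entries; at runtime a TypedDict instance is still a plain dict.
class CachedTreeEntry(TypedDict):
    name: str
    size: int
    type: str
    blob_id: Optional[str]

entry: CachedTreeEntry = {"name": "README.md", "size": 512, "type": "file", "blob_id": None}
print(entry["name"])
```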

@Wauplin
Contributor

Wauplin commented Apr 9, 2024

Ok for not implementing the flag then. Thanks for letting me know about the other fsspec implementations.

@mariosasko
Contributor

Performing shallow copy is also quite expensive (on large repos) based on this simple benchmark:

```python
import timeit

setup1 = """\
from huggingface_hub import HfFileSystem
fs = HfFileSystem(skip_instance_cache=True)
"""
t1 = timeit.timeit("fs.find('datasets/bigcode/the-stack-dedup')", setup=setup1, number=20)
print("Avg time taken:", t1)

setup2 = """\
from huggingface_hub import HfFileSystem
fs = HfFileSystem(skip_instance_cache=True)
fs.find('datasets/bigcode/the-stack-dedup')
"""
t2 = timeit.timeit("fs.find('datasets/bigcode/the-stack-dedup')", setup=setup2, number=20)
print("Avg time taken (cached):", t2)
```

Results:

```
## Shallow copy
# Avg time taken: 3.8506534169999997
# Avg time taken (cached): 0.5571763330000001

## Deep copy
# Avg time taken: 4.7072871670000005
# Avg time taken (cached): 1.402280417

## No copy
# Avg time taken: 2.913884667
# Avg time taken (cached): 0.026945500000000067
```

The datasets lib's file resolution relies on the cached .find, so it's important not to have unnecessary copies for the best performance. @Wauplin So, would you be okay with removing the copy.copy/copy.deepcopy calls and using them in the tests instead, as it was initially done in #1888?

PS: This PR does not properly copy the cached .find's result, so https://github.com/huggingface/huggingface_hub/tree/fix/performance_of_ls_tree was used instead in the benchmark (for the "shallow copy" case)
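Independent of Hub access, the relative ordering of the three strategies can be reproduced offline with a toy cache (absolute numbers are machine-dependent; this is a sketch, not the PR's benchmark):

```python
import copy
import timeit

# 10k fake cache entries with one level of nesting, roughly mimicking
# _ls_tree output for a large repo.
items = [{"name": f"file{i}", "size": i, "lfs": {"size": i}} for i in range(10_000)]

t_none = timeit.timeit(lambda: items, number=10)                          # return as-is
t_shallow = timeit.timeit(lambda: [dict(it) for it in items], number=10)  # per-item shallow copy
t_deep = timeit.timeit(lambda: copy.deepcopy(items), number=10)           # full deep copy

print(f"no copy: {t_none:.4f}s  shallow: {t_shallow:.4f}s  deep: {t_deep:.4f}s")
```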

@Wauplin
Contributor

Wauplin commented Apr 10, 2024

> So, would you be okay with removing the copy.copy/copy.deepcopy calls and using them in the tests instead, as it was initially done in #1888?

Yes I am! Thanks for benchmarking these. If a downstream script (or the tests) really wants to modify the output, then it can make the copy itself.

@mariosasko mariosasko requested a review from Wauplin April 10, 2024 17:20
@awgr
Contributor Author

awgr commented Apr 10, 2024

It seems we've come full circle 😁. Dropping the copy is ideal for me as it achieves the ~3x speed up we recorded in the linked issue.

Contributor

@Wauplin Wauplin left a comment


Finally approving! 🎉

Thanks both for the insights and discussion! At least we'll be able to come back here in the future if we wonder why we took this decision :)

@Wauplin Wauplin merged commit 43ceaa4 into huggingface:main Apr 11, 2024
14 checks passed