
fix: performance of _ls_tree #2103

Merged · 3 commits · Apr 11, 2024
Conversation

awgr
Contributor

@awgr awgr commented Mar 9, 2024

No description provided.

@awgr
Contributor Author

awgr commented Mar 9, 2024

huggingface/datasets#6726

Before change:
[profiling screenshot]

After change:
[profiling screenshot]

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin
Contributor

Wauplin commented Mar 11, 2024

Hi @awgr, thanks a lot for your PR and clear profiling explanations! And thanks for finding out the deepcopy was the culprit here 😬 I'd like to take advantage of your PR to slightly refactor this part of the code here and in particular:

  • move your make_cache_path_info method to a separate (internal) helper
  • reuse this helper both in _ls_tree and info methods which have the same duplicated logic

Would you like to do it? Otherwise, I am more than happy to take care of it myself if it's fine for you that I push changes to your branch?

Thanks in advance!

@awgr
Contributor Author

awgr commented Mar 11, 2024

> Hi @awgr, thanks a lot for your PR and clear profiling explanations! And thanks for finding out the deepcopy was the culprit here 😬 I'd like to take advantage of your PR to slightly refactor this part of the code here and in particular:
>
>   • move your make_cache_path_info method to a separate (internal) helper
>   • reuse this helper both in _ls_tree and info methods which have the same duplicated logic
>
> Would you like to do it? Otherwise, I am more than happy to take care of it myself if it's fine for you that I push changes to your branch?
>
> Thanks in advance!

Ah feel free to push code to my branches haha, but if you don't have time I'll make an effort to take care of it later tonight.

@awgr
Contributor Author

awgr commented Mar 12, 2024

After reviewing this again I've realized that a slightly larger change is needed, beyond both the original change (which would've caused problems later) and what you've asked for above.
It's to do with the integrity of the dircache; will consider it a bit tomorrow and return.

@Wauplin
Contributor

Wauplin commented Mar 12, 2024

Thanks for rechecking and letting me know @awgr! Let me know when the PR is suitable for a second review then :) If you have any question, feel free to ask.

@awgr
Contributor Author

awgr commented Mar 19, 2024

> Thanks for rechecking and letting me know @awgr! Let me know when the PR is suitable for a second review then :) If you have any question, feel free to ask.

Sorry, not yet. I hadn't had time, but came back tonight to take another look.

What I'd realized is this: my local modification produced the improved profile, but it did not keep the dircache safe from callers mutating the returned values.

After thinking more, I think there are two typical ways to prevent modification:

  1. Don't use any mutable types in the dircache (too late)
  2. Make an eventually consistent read mirror for the dircache and ensure consistency in the background

(1) is ideal, but I think the API components (LastCommitInfo, BlobLfsInfo, BlobSecurityInfo, ...) would need to be effectively immutable on creation. I'm not sure which patterns in huggingface_hub that would break...

While I respect the intent of the defensive copy, it accounts for a large share of the compute time; I wonder whether a caller might simply opt to disable it, accepting responsibility for any problems that arise from mutating the values returned by the API.

What do you think?
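For what it's worth, option (1) could look something like frozen dataclasses. A minimal sketch, assuming the class and field names are illustrative stand-ins rather than the real huggingface_hub definitions:

```python
from dataclasses import dataclass

# Hypothetical frozen variant of an info type such as BlobLfsInfo.
# Instances reject attribute assignment after creation, so dircache
# entries built from them could be handed to callers without copying.
@dataclass(frozen=True)
class FrozenBlobLfsInfo:
    size: int
    sha256: str
    pointer_size: int

info = FrozenBlobLfsInfo(size=1024, sha256="abc123", pointer_size=134)
print(info.size)  # `info.size = 0` would raise dataclasses.FrozenInstanceError
```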

@Wauplin
Contributor

Wauplin commented Mar 19, 2024

Thanks @awgr for taking a look back at it. I agree that (1) would be ideal but I'd leave this to a separate issue. Making LastCommitInfo, BlobLfsInfo, BlobSecurityInfo, ... immutable will require some internal changes that are not trivial - especially to make sure things are staying backward compatible.

However given the numbers, I agree we should find a solution to replace the deepcopy. What do you think of implementing a weak copy specific to our needs. Something like

```python
from typing import List

def _weak_copy_ls_tree(items: List) -> List:
    return [
        {
            "name": item["name"],
            "size": item["size"],
            "type": item["type"],
            "blob_id": item.get("blob_id"),
            "tree_id": item.get("tree_id"),
            "lfs": BlobLfsInfo(**item["lfs"]) if item.get("lfs") is not None else None,
            "last_commit": LastCommitInfo(**item["last_commit"]) if item.get("last_commit") is not None else None,
            "security": BlobSecurityInfo(**item["security"]) if item.get("security") is not None else None,
        }
        for item in items
    ]
```

would work and should be faster than a generic deepcopy. Would you like to give it a try?

@awgr
Contributor Author

awgr commented Mar 19, 2024

Thanks @Wauplin! I had the same thought, so either there are no nested mutable objects, or we are not concerned with mutations of those modifying the dircache? This approach will certainly improve the performance of ls_tree, but may introduce the same dircache-polluting problems, e.g. BlobSecurityInfo's field av_scan is also a mutable dictionary.
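To make the concern concrete, here is a small standalone demo (plain dicts stand in for the real cache entries; names are hypothetical) showing that a top-level shallow copy still shares nested mutables like av_scan:

```python
import copy

# Stand-in for a dircache entry; the nested "security"/"av_scan" dicts
# mimic mutable fields such as BlobSecurityInfo.av_scan.
cached_item = {"name": "model.bin", "security": {"av_scan": {"status": "clean"}}}

weak = dict(cached_item)  # shallow copy: nested dicts are the SAME objects
weak["security"]["av_scan"]["status"] = "mutated"
print(cached_item["security"]["av_scan"]["status"])  # "mutated" -- the cache changed too

source = {"security": {"av_scan": {"status": "clean"}}}
deep = copy.deepcopy(source)  # deepcopy duplicates every level
deep["security"]["av_scan"]["status"] = "mutated"
print(source["security"]["av_scan"]["status"])  # "clean" -- the source is isolated
```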

@mariosasko
Contributor

s3fs and gcsfs, as the two most popular fsspec filesystems, do not have special logic to make dircache resistant to modifications, so I think @Wauplin's solution is already good enough (not guarding against modifying deeply nested fields should be okay)

@awgr force-pushed the fix/performance_of_ls_tree branch 3 times, most recently from 72e192a to 773070c on April 7, 2024 at 20:22
@awgr force-pushed the fix/performance_of_ls_tree branch from 773070c to db7a6ec on April 7, 2024 at 20:24
@awgr
Contributor Author

awgr commented Apr 7, 2024

Hey, sorry I'm a bit late here; I've been a little busy.

I've updated the PR to shallow copy in the manner described; I used copy.copy for the shallow copy, and it performs identically to @Wauplin's code.

So, this PR now moves the logic for making the cache_path_info into a single function, and that function can return either a shallow copy, or not.

I almost revised to

```python
def _shallow_copy(self, obj: Any, depth: int) -> Any:
    if depth == 0:
        return copy.copy(obj)
    if isinstance(obj, dict):
        return {k: self._shallow_copy(v, depth - 1) for k, v in obj.items()}
    if isinstance(obj, list):
        return [self._shallow_copy(v, depth - 1) for v in obj]
    return obj
```

to be called as _shallow_copy(out, depth=1), but it felt a little too generic.
I'll be more available this week, if there is any other feedback to handle.

@awgr
Contributor Author

awgr commented Apr 7, 2024

I'll profile this again on Monday or so...

@Wauplin
Contributor

Wauplin commented Apr 8, 2024

Thanks for getting back to this @awgr!

Actually I think that if we go for a "non-safe-but-ok" solution, we should have a flag to enable/disable deepcopy that would be set to False by default (i.e. to not deepcopy). And when we don't deepcopy, there is no need to make copies at all (i.e. no intermediate state where we partially copy stuff).

@mariosasko
Contributor

I'm not a fan of a flag that would be hidden/rarely used (does not follow the spec).

I couldn't find a fsspec filesystem that takes special care of such modifications, so now I think removing the copy.deepcopy calls (introduced in #1888) is probably the best solution. (Maybe defining a typing.TypedDict with fixed keys and typing.ReadOnly attributes would make it more obvious that the cache is not robust to modifications.)
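A sketch of that TypedDict idea, assuming hypothetical names (ReadOnly itself is PEP 705, available in `typing` only on Python 3.13+ or via typing_extensions, and has no runtime effect):

```python
from typing import Optional, TypedDict

# Hypothetical shape for a dircache entry. With PEP 705 one would write
# e.g. `name: ReadOnly[str]` so type checkers flag mutation of cached
# entries; at runtime a TypedDict instance is still a plain dict.
class CachedTreeEntry(TypedDict):
    name: str
    size: int
    type: str
    blob_id: Optional[str]

entry: CachedTreeEntry = {"name": "README.md", "size": 512, "type": "file", "blob_id": None}
print(entry["name"])
```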

@Wauplin
Contributor

Wauplin commented Apr 9, 2024

Ok for not implementing the flag then. Thanks for letting me know about the other fsspec implementations.

@mariosasko
Contributor

Performing shallow copy is also quite expensive (on large repos) based on this simple benchmark:

```python
import timeit

setup1 = """\
from huggingface_hub import HfFileSystem
fs = HfFileSystem(skip_instance_cache=True)
"""
t1 = timeit.timeit("fs.find('datasets/bigcode/the-stack-dedup')", setup=setup1, number=20)
print("Avg time taken:", t1)

setup2 = """\
from huggingface_hub import HfFileSystem
fs = HfFileSystem(skip_instance_cache=True)
fs.find('datasets/bigcode/the-stack-dedup')
"""
t2 = timeit.timeit("fs.find('datasets/bigcode/the-stack-dedup')", setup=setup2, number=20)
print("Avg time taken (cached):", t2)
```

Results:

```
## Shallow copy
# Avg time taken: 3.8506534169999997
# Avg time taken (cached): 0.5571763330000001

## Deep copy
# Avg time taken: 4.7072871670000005
# Avg time taken (cached): 1.402280417

## No copy
# Avg time taken: 2.913884667
# Avg time taken (cached): 0.026945500000000067
```

The datasets lib's file resolution relies on the cached .find, so it's important not to have unnecessary copies for the best performance. @Wauplin So, would you be okay with removing the copy.copy/copy.deepcopy calls and using them in the tests instead, as it was initially done in #1888?

PS: This PR does not properly copy the cached .find's result, so https://github.com/huggingface/huggingface_hub/tree/fix/performance_of_ls_tree was used instead in the benchmark (for the "shallow copy" case)
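Independent of Hub access, the relative ordering of the three strategies can be reproduced offline with a toy cache (absolute numbers are machine-dependent; this is a sketch, not the PR's benchmark):

```python
import copy
import timeit

# 10k fake cache entries with one level of nesting, roughly mimicking
# _ls_tree output for a large repo.
items = [{"name": f"file{i}", "size": i, "lfs": {"size": i}} for i in range(10_000)]

t_none = timeit.timeit(lambda: items, number=10)                          # return as-is
t_shallow = timeit.timeit(lambda: [dict(it) for it in items], number=10)  # per-item shallow copy
t_deep = timeit.timeit(lambda: copy.deepcopy(items), number=10)           # full deep copy

print(f"no copy: {t_none:.4f}s  shallow: {t_shallow:.4f}s  deep: {t_deep:.4f}s")
```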

@Wauplin
Contributor

Wauplin commented Apr 10, 2024

> So, would you be okay with removing the copy.copy/copy.deepcopy calls and using them in the tests instead, as it was initially done in #1888?

Yes I am! Thanks for benchmarking these. If a downstream script (or the tests) really wants to modify the output, then it can make the copy itself.

@mariosasko mariosasko requested a review from Wauplin April 10, 2024 17:20
@awgr
Contributor Author

awgr commented Apr 10, 2024

It seems we've come full circle 😁. Dropping the copy is ideal for me as it achieves the ~3x speed up we recorded in the linked issue.

Contributor

@Wauplin Wauplin left a comment


Finally approving! 🎉

Thanks both for the insights and discussion! At least we'll be able to come back here in the future if we wonder why we took this decision :)

@Wauplin Wauplin merged commit 43ceaa4 into huggingface:main Apr 11, 2024
14 checks passed