diff/dvctree: optimize dir cache access #4626
Conversation
Current changes in a repo with a directory containing 1K files cut cProfile time for …
dvc/tree/dvc.py
```python
# cache metadata for sequential exists/isdir/isfile/etc calls
@lru_cache(maxsize=1)
def _get_metadata(tree, path_info):
```
Same `lru_cache(maxsize=1)` optimization that we use in the git tree `_get_object_by_path`.
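For reference, a minimal, self-contained sketch of the pattern (dummy arguments, not DVC's actual tree API):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def _get_metadata(tree, path_info):
    # with maxsize=1 only the single most recent result is kept:
    # back-to-back exists()/isdir()/isfile() calls for the same path
    # reuse it, and any new path simply evicts the previous entry
    print(f"computing metadata for {path_info}")  # visible on cache misses
    return (tree, path_info)

_get_metadata("tree", "a.txt")  # miss: computes
_get_metadata("tree", "a.txt")  # hit: served from cache, no print
_get_metadata("tree", "b.txt")  # miss: evicts the a.txt entry
```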
With git it is pretty safe, as the tree is read-only and tied to a specific reference. Are there scenarios in which this could cause a bug?
Btw, which method was causing too many `metadata` calls? Would it be solved by our planned exception unification and moving to an exception-based workflow?
In `RepoTree.get_hash` we end up making several repeated `metadata` calls through `DvcTree.exists`. I think moving to an exception-based workflow and avoiding all the `exists` calls would definitely help in this case.
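To make that concrete, a hedged sketch of the two workflows; the tree methods here (`exists`, `isdir`, `metadata`, `get_file_hash`, `get_dir_hash`) stand in for the real API:

```python
# Check-based: each predicate triggers its own metadata lookup
# for the same path.
def get_hash_checks(tree, path):
    if not tree.exists(path):        # metadata lookup 1
        raise FileNotFoundError(path)
    if tree.isdir(path):             # metadata lookup 2
        return tree.get_dir_hash(path)
    return tree.get_file_hash(path)  # metadata lookup 3

# Exception-based: one metadata() call; a missing path surfaces
# as an exception instead of being pre-checked.
def get_hash_exceptions(tree, path):
    meta = tree.metadata(path)       # single lookup, raises if missing
    if meta.isdir:
        return tree.get_dir_hash(path)
    return tree.get_file_hash(path)
```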
I suppose in theory we could run into issues here with the dvc repo (and repo tree) being modified in between `metadata` calls. I can adjust this so that we only cache `metadata` calls within the scope of `DvcTree.walk`, since the tree should not be changing within a single `walk` call.
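A minimal sketch of what that walk-scoped caching could look like; `self._walk` accepting a metadata callback is an assumption for illustration:

```python
from functools import lru_cache

def walk(self, top):
    # the memoized helper is defined inside walk(), so its cache only
    # lives for this one traversal; the next walk() starts fresh and
    # can never serve metadata that predates repo modifications
    @lru_cache(maxsize=1)
    def cached_metadata(path_info):
        return self.metadata(path_info)

    # hypothetical: traversal helper that accepts the metadata callback
    yield from self._walk(top, metadata=cached_metadata)
```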
Ended up reverting this change, as the main performance blocker in our `metadata` calls was still `_get_granular_checksum`. As noted before though, moving to an exception-based workflow should cut down on duplicated `metadata` calls.
For a repo containing a single tracked directory with 100k files, cProfiled …
dvc/tree/dvc.py
```python
def _update_dir_entry_hashes(self, out, remote):
    # cache the most recently used output dir cache to avoid expensive
    # repeated lookups of individual files within the same large output dir
    dir_cache = out.get_dir_cache(remote=remote)
```
So this particular one is slow, right? Note that we have `dir_cache` caching in the Cache itself (line 64 in `7b590e7`): `dir_info = self._dir_info.get(hash_info.value)`
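Roughly, that Cache-level caching amounts to a dict keyed by the dir hash value; a sketch, with `_load_dir_info` standing in for the real cache-file read:

```python
class Cache:
    def __init__(self):
        # dir hash value -> parsed dir listing; repeated requests for
        # the same .dir file skip re-reading and re-parsing it
        self._dir_info = {}

    def get_dir_info(self, hash_info):
        dir_info = self._dir_info.get(hash_info.value)
        if dir_info is None:
            dir_info = self._load_dir_info(hash_info)  # hypothetical loader
            self._dir_info[hash_info.value] = dir_info
        return dir_info
```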
Probably because iterating over the whole `dir_cache` to find one hash is slow. Do I understand correctly that `walk` was slow? If so, would it be possible to bulk up the operation somehow so we don't call `metadata()` for each file in a giant dataset?
Right, `walk` was slow because we would re-iterate over the whole `dir_cache` when looking for each individual entry/file hash.
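In other words, each per-file lookup was O(len(dir_cache)), making the full walk quadratic; indexing the entries once makes each lookup O(1). A simplified sketch using plain `"relpath"`/`"md5"` keys in place of the real parameter constants:

```python
# O(n) per file: rescan the whole dir cache for every lookup,
# so walking n files costs O(n^2) overall
def find_hash_scan(dir_cache, relpath):
    for entry in dir_cache:
        if entry["relpath"] == relpath:
            return entry["md5"]
    raise KeyError(relpath)

# O(1) per file: build the index once, then each lookup is a dict hit
def build_index(dir_cache):
    return {entry["relpath"]: entry["md5"] for entry in dir_cache}
```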
```python
info = {}
for entry in dir_info:
    relpath = None
    hash_info = None
    # each entry holds a PARAM_RELPATH key plus one hash name/value pair
    for key, value in entry.items():
        if key == self.tree.PARAM_RELPATH:
            relpath = value
        else:
            hash_info = HashInfo(key, value)
    info[relpath] = hash_info
return info
```
This looks a bit ugly, but `dir_info` may actually contain hashes with a different type than `cache.tree.PARAM_CHECKSUM` - i.e. `dir_info` contains S3 etag entries, but we are working in a local cache (with md5 `PARAM_CHECKSUM`).
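An illustrative input/output pair (made-up values) showing why the loop keys on `PARAM_RELPATH` and treats whatever other key it finds as the hash name:

```python
# entry from a dir cache produced against S3: the hash key is "etag"
dir_info = [{"relpath": "data/a.csv", "etag": "0123abcd..."}]

# result of the conversion above: relpath -> HashInfo, with the
# original hash name ("etag") preserved even in a local md5 cache
# {"data/a.csv": HashInfo("etag", "0123abcd...")}
```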
@pmrowla there are some merge conflicts
`_get_granular_checksum()` - mostly unnecessary after granular checksum changes, will be better addressed by the future exception handling updates
☑ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Will fix #4580.
- `HashInfo.dir_info` is now stored as a dict instead of a list; the dir cache is still kept in the list format everywhere else
- `DvcTree._get_granular_checksum` now uses `out.hash_info.dir_info` to look up file hashes
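A simplified sketch of the shape change (illustrative values; the dict values mirror the relpath-to-`HashInfo` conversion shown earlier):

```python
# list format, still used for the on-disk dir cache
dir_cache = [
    {"relpath": "images/1.png", "md5": "aaa111..."},
    {"relpath": "images/2.png", "md5": "bbb222..."},
]

# HashInfo.dir_info as a dict keyed by relpath, so
# _get_granular_checksum can look up one file without scanning
dir_info = {
    "images/1.png": HashInfo("md5", "aaa111..."),
    "images/2.png": HashInfo("md5", "bbb222..."),
}
```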