Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff/dvctree: optimize dir cache access #4626

Merged
merged 12 commits into from Oct 3, 2020
Merged

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Sep 28, 2020

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Will fix #4580.

  • HashInfo.dir_info is now stored as a dict instead of list, dir cache is still kept in the list format everywhere else
  • DvcTree._get_granular_checksum now uses out.hash_info.dir_info to lookup file hashes

@pmrowla pmrowla added performance improvement over resource / time consuming tasks bugfix fixes bug labels Sep 28, 2020
@pmrowla pmrowla self-assigned this Sep 28, 2020
@pmrowla pmrowla added this to In progress in DVC 22 September - 6 October 2020 via automation Sep 28, 2020
@pmrowla pmrowla marked this pull request as draft September 28, 2020 10:01
@pmrowla
Copy link
Contributor Author

pmrowla commented Sep 28, 2020

Current changes in a repo w/directory containing 1K files cuts cprofile time for dvc diff from ~20s to ~2s. Still investigating why diff remains very slow with larger directories (100K+ files)

dvc/tree/dvc.py Outdated
Comment on lines 15 to 17
# cache metadata for sequential exists/isdir/isfile/etc calls
@lru_cache(maxsize=1)
def _get_metadata(tree, path_info):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same lru_cache(maxsize=1) optimization that we use in git tree _get_object_by_path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With git it is pretty safe, as tree is read only, tied to a specific reference. Are there scenarios in which this could cause a bug?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, which method was causing too much metadata calls? Would it be solved by our planned exception unification and moving to an exception-based workflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In RepoTree.get_hash we end up making several repeated metadata calls through DvcTree.exists. I think moving to an exception based workflow and avoiding all the exists calls would definitely help in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose in theory we could run into issues here with the dvc repo (and repo tree) being modified in between metadata calls. I can adjust this so that we only cache metadata calls within the scope of dvctree.walk, since the tree should not be changing within a single walk call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up reverting this change, as the main performance blocker in our metadata calls was still _get_granular_checksum. As noted before though, moving to an exception based workflow should cut down on duplicated metadata calls

@pmrowla pmrowla marked this pull request as ready for review September 29, 2020 09:17
@pmrowla pmrowla changed the title [WIP] diff/dvctree: optimize dir cache access diff/dvctree: optimize dir cache access Sep 29, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Sep 29, 2020

For a repo containing a single tracked directory with 100k files, cprofiled dvc diff in master previously took almost 2 hours on my machine, with these changes it's down to around 3 minutes. 3 minutes still seems kind of slow for a diff command, especially since we display no progress or status information, but it's better than before at least.

master
Screen Shot 2020-09-29 at 6 23 10 PM

PR branch
Screen Shot 2020-09-29 at 6 23 31 PM

dvc/tree/dvc.py Outdated

def _update_dir_entry_hashes(self, out, remote):
# cache the most recently used output dir cache to avoid expensive
# repeated lookups of individual files within the same large output dir
dir_cache = out.get_dir_cache(remote=remote)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this particular one is slow, right?

Note that we have dir_cache caching in the Cache itself

dir_info = self._dir_info.get(hash_info.value)
, wonder why that one wasn't enough and we need to introduce another level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably because iterating over the whole dir_cache to find one hash is slow. Do I understand correctly that walk was slow? If so, would it be possible to bulk up the operation somehow so we don't metadata() for each file in a giant dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, walk was slow because we would re-iterate over the whole dir_cache when looking for each individual entry/file hash

@efiop efiop moved this from In progress to Review in progress in DVC 22 September - 6 October 2020 Sep 29, 2020
Comment on lines +618 to +628
info = {}
for entry in dir_info:
relpath = None
hash_info = None
for key, value in entry.items():
if key == self.tree.PARAM_RELPATH:
relpath = value
else:
hash_info = HashInfo(key, value)
info[relpath] = hash_info
return info
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit ugly but dir_info may actually contain hashes with a different type than cache.tree.PARAM_CHECKSUM - i.e. dir_info contains S3 etag entries, but we are working in local cache (w/md5 PARAM_CHECKSUM)

@pared
Copy link
Contributor

pared commented Oct 1, 2020

@pmrowla there are some merge conflicts

@efiop efiop merged commit fcdb503 into iterative:master Oct 3, 2020
DVC 22 September - 6 October 2020 automation moved this from Review in progress to Done Oct 3, 2020
@pmrowla pmrowla deleted the 4580-diff-hang branch October 3, 2020 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug performance improvement over resource / time consuming tasks
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

diff: appears to hang for large directories
3 participants