diff/dvctree: optimize dir cache access #4626

pmrowla · 2020-09-28T09:59:56Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Will fix #4580.

HashInfo.dir_info is now stored as a dict instead of a list; the dir cache is still kept in list format everywhere else.
DvcTree._get_granular_checksum now uses out.hash_info.dir_info to look up file hashes.

pmrowla · 2020-09-28T10:02:35Z

In a repo with a directory containing 1K files, the current changes cut the cProfile time for dvc diff from ~20s to ~2s. Still investigating why diff remains very slow with larger directories (100K+ files).

pmrowla · 2020-09-28T10:04:35Z

dvc/tree/dvc.py

+# cache metadata for sequential exists/isdir/isfile/etc calls
+@lru_cache(maxsize=1)
+def _get_metadata(tree, path_info):


Same lru_cache(maxsize=1) optimization that we use in the git tree's _get_object_by_path.

With git it is pretty safe, as tree is read only, tied to a specific reference. Are there scenarios in which this could cause a bug?

Btw, which method was causing too many metadata calls? Would it be solved by our planned exception unification and the move to an exception-based workflow?

In RepoTree.get_hash we end up making several repeated metadata calls through DvcTree.exists. I think moving to an exception-based workflow and avoiding all the exists calls would definitely help in this case.

I suppose in theory we could run into issues here with the dvc repo (and repo tree) being modified in between metadata calls. I can adjust this so that we only cache metadata calls within the scope of dvctree.walk, since the tree should not be changing within a single walk call.
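To make the staleness concern concrete, here is a minimal sketch of how an lru_cache(maxsize=1) wrapper can return outdated metadata when the underlying state changes between calls. FakeTree and get_metadata are hypothetical stand-ins for illustration, not DVC's actual classes or functions.

```python
from functools import lru_cache


class FakeTree:
    """Hypothetical mutable tree; stands in for repo state that can change."""

    def __init__(self):
        self.files = {"data/a.txt"}


@lru_cache(maxsize=1)
def get_metadata(tree, path):
    # The cache key is (tree, path); the tree's contents are not part of it.
    return {"exists": path in tree.files}


tree = FakeTree()
first = get_metadata(tree, "data/a.txt")  # computed and cached
tree.files.discard("data/a.txt")          # tree is modified between calls
stale = get_metadata(tree, "data/a.txt")  # same arguments, so the cached result is reused

# Clearing the cache (for example at the start of a walk) forces a fresh lookup.
get_metadata.cache_clear()
fresh = get_metadata(tree, "data/a.txt")
```

Limiting the cache lifetime to a single walk, as suggested above, would correspond to calling cache_clear() at the boundaries of that walk.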

Ended up reverting this change, as the main performance blocker in our metadata calls was still _get_granular_checksum. As noted before though, moving to an exception based workflow should cut down on duplicated metadata calls
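As a rough illustration of why an exception-based workflow reduces duplicated metadata calls, the sketch below counts lookups for a check-then-act style versus a try/except style. CountingTree and both helper functions are hypothetical and only approximate the pattern discussed here; they are not DVC's API.

```python
class CountingTree:
    """Hypothetical tree that counts how often metadata is looked up."""

    def __init__(self, hashes):
        self._hashes = hashes
        self.metadata_calls = 0

    def _metadata(self, path):
        self.metadata_calls += 1
        if path not in self._hashes:
            raise FileNotFoundError(path)
        return self._hashes[path]

    def exists(self, path):
        try:
            self._metadata(path)
            return True
        except FileNotFoundError:
            return False

    def get_file_hash(self, path):
        return self._metadata(path)


def hash_with_checks(tree, path):
    # Check-then-act: one metadata call for exists, another for the hash.
    if not tree.exists(path):
        return None
    return tree.get_file_hash(path)


def hash_with_exceptions(tree, path):
    # Exception-based: a single metadata call, with failure handled afterwards.
    try:
        return tree.get_file_hash(path)
    except FileNotFoundError:
        return None


checked = CountingTree({"a": "h1"})
direct = CountingTree({"a": "h1"})
result_checked = hash_with_checks(checked, "a")
result_direct = hash_with_exceptions(direct, "a")
```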

pmrowla · 2020-09-29T09:27:13Z

For a repo containing a single tracked directory with 100k files, a cProfiled dvc diff on master previously took almost 2 hours on my machine; with these changes it's down to around 3 minutes. That still seems slow for a diff command, especially since we display no progress or status information, but it's better than before.

[Images omitted: master vs. PR branch]

efiop · 2020-09-29T11:54:04Z

dvc/tree/dvc.py

+
+    def _update_dir_entry_hashes(self, out, remote):
+        # cache the most recently used output dir cache to avoid expensive
+        # repeated lookups of individual files within the same large output dir
        dir_cache = out.get_dir_cache(remote=remote)


So this particular one is slow, right?

Note that we already have dir_cache caching in the Cache itself (dvc/dvc/cache/base.py, line 64 in 7b590e7):

dir_info = self._dir_info.get(hash_info.value)

I wonder why that one wasn't enough and we need to introduce another level.

Probably because iterating over the whole dir_cache to find one hash is slow. Do I understand correctly that walk was slow? If so, would it be possible to bulk up the operation somehow so we don't metadata() for each file in a giant dataset?

Right, walk was slow because we would re-iterate over the whole dir_cache when looking for each individual entry/file hash
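The cost difference can be sketched as follows: scanning a list-format dir cache for every file is linear per lookup, so resolving every file in the directory is quadratic overall, while building a dict once makes each lookup constant time on average. The entry layout and function names here are assumptions for illustration only.

```python
def find_hash_in_list(dir_cache, relpath):
    """Linear scan: cost per lookup grows with the number of entries."""
    for entry in dir_cache:
        if entry["relpath"] == relpath:
            return entry["md5"]
    return None


def build_index(dir_cache):
    """Build a relpath -> hash mapping once; later lookups avoid the scan."""
    return {entry["relpath"]: entry["md5"] for entry in dir_cache}


# Hypothetical dir cache with 1000 entries.
dir_cache = [{"relpath": f"file{i}", "md5": f"hash{i}"} for i in range(1000)]
index = build_index(dir_cache)

# Both approaches return the same answers; only the cost per lookup differs.
slow = [find_hash_in_list(dir_cache, f"file{i}") for i in range(1000)]
fast = [index.get(f"file{i}") for i in range(1000)]
```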

dvc/repo/diff.py

pmrowla · 2020-09-30T09:54:31Z

dvc/cache/base.py

+        info = {}
+        for entry in dir_info:
+            relpath = None
+            hash_info = None
+            for key, value in entry.items():
+                if key == self.tree.PARAM_RELPATH:
+                    relpath = value
+                else:
+                    hash_info = HashInfo(key, value)
+            info[relpath] = hash_info
+        return info


This looks a bit ugly but dir_info may actually contain hashes with a different type than cache.tree.PARAM_CHECKSUM - i.e. dir_info contains S3 etag entries, but we are working in local cache (w/md5 PARAM_CHECKSUM)
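A self-contained sketch of the same conversion, showing why the hash key is taken from each entry rather than assumed to be md5. The HashInfo dataclass, the "relpath" key name, and the function name below are simplified assumptions, not the real DVC definitions.

```python
from dataclasses import dataclass

PARAM_RELPATH = "relpath"  # assumed key name for the relative path


@dataclass(frozen=True)
class HashInfo:
    """Simplified stand-in for DVC's HashInfo: a hash type and its value."""

    name: str
    value: str


def dir_info_to_dict(dir_info):
    """Convert list-format dir info into a relpath -> HashInfo mapping."""
    info = {}
    for entry in dir_info:
        relpath = None
        hash_info = None
        for key, value in entry.items():
            if key == PARAM_RELPATH:
                relpath = value
            else:
                # Whatever the other key is (md5, etag, ...) becomes the hash type.
                hash_info = HashInfo(key, value)
        info[relpath] = hash_info
    return info


entries = [
    {"relpath": "a.txt", "md5": "abc"},
    {"relpath": "b.txt", "etag": "def"},
]
converted = dir_info_to_dict(entries)
```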

pared · 2020-10-01T13:05:41Z

@pmrowla there are some merge conflicts

pmrowla added the performance (improvement over resource / time consuming tasks) and bugfix (fixes bug) labels Sep 28, 2020

pmrowla self-assigned this Sep 28, 2020

pmrowla added this to In progress in DVC 22 September - 6 October 2020 via automation Sep 28, 2020

pmrowla marked this pull request as draft September 28, 2020 10:01

pmrowla force-pushed the 4580-diff-hang branch from 0671714 to a2d5ea7 Compare September 28, 2020 10:05

pmrowla marked this pull request as ready for review September 29, 2020 09:17

pmrowla changed the title ~~[WIP] diff/dvctree: optimize dir cache access~~ diff/dvctree: optimize dir cache access Sep 29, 2020

pmrowla requested review from efiop, pared and skshetry September 29, 2020 09:27

efiop reviewed Sep 29, 2020

efiop moved this from In progress to Review in progress in DVC 22 September - 6 October 2020 Sep 29, 2020

pmrowla force-pushed the 4580-diff-hang branch from c498b9c to d3f9306 Compare September 30, 2020 09:16

pared approved these changes Oct 1, 2020

pmrowla added 9 commits October 2, 2020 09:42

dvctree: use string paths for performance reasons in _get_granular_checksum()

2052c80

dvctree: cache most recently accessed metadata

3702668

dvctree: cache output dir cache during walk() calls

444ced7

diff: use get_file_hash directly instead of get_hash when appropriate

2cab1d7

dvctree: cache most recent dir_cache hashes in _get_granular_checksum

a795a2c

revert metadata cache changes

146a24b

- mostly unnecessary after granular checksum changes, will be better addressed by the future exception handling updates

fix windows path issue

e4fbae8

HashInfo: store HashInfo.dir_info as a dict instead of list

b0fea97

DvcTree: use hash_info.dir_info for getting granular file hashes

dee2d91

pmrowla added 3 commits October 2, 2020 09:45

use str relpath

ade9bc2

use correct dir_info hash type

32558b9

fix windows paths

5e0a3eb

pmrowla force-pushed the 4580-diff-hang branch from 0873b85 to 5e0a3eb Compare October 2, 2020 00:55

efiop approved these changes Oct 3, 2020

efiop merged commit fcdb503 into iterative:master Oct 3, 2020

DVC 22 September - 6 October 2020 automation moved this from Review in progress to Done Oct 3, 2020

pmrowla deleted the 4580-diff-hang branch October 3, 2020 02:29
