ignore: dynamically collect dvcignore #4284

efiop · 2020-07-25T21:58:14Z

This allows us to avoid collecting dvcignore for the whole repo if we only
care about particular paths. As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec (most of it is actually dvc module initialization) with this PR.

This is also a pre-requisite for dynamic dvcignore and subrepo
collection (#4247) while walking
the tree.

Also, it is important to clarify that regular dvc status (without
arguments) has the same performance after this PR, because when we check
a dataset for changes, we call things like tree.exists(), which calls
dvcignore and makes it collect dvcignore in the dataset itself, so we
still end up collecting dvcignore for the whole repo (including walking
into the datasets). This should be solved soon by telling dvcignore that
it shouldn't walk into the datasets searching for .dvcignores.
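
To illustrate the idea of collecting ignore files only for the paths we care about, here is a minimal sketch. `LazyIgnore`, `patterns_for` and `files_read` are hypothetical names made up for this example, not DVC's actual classes or API, and the sketch only gathers pattern strings rather than implementing real gitignore-style matching.

```python
import os
import tempfile


class LazyIgnore:
    """Hypothetical collector: reads .dvcignore files only on demand."""

    def __init__(self, root_dir):
        self.root_dir = root_dir
        self.patterns = {}    # dirname -> patterns visible in that directory
        self.files_read = []  # ignore files that were actually opened

    def _collect(self, dirname):
        if dirname in self.patterns:
            return self.patterns[dirname]  # already collected, do nothing
        # Patterns are inherited from the parent, up to the repo root.
        if dirname == self.root_dir:
            inherited = []
        else:
            inherited = self._collect(os.path.dirname(dirname))
        own = []
        ignore_file = os.path.join(dirname, ".dvcignore")
        if os.path.isfile(ignore_file):
            self.files_read.append(ignore_file)
            with open(ignore_file) as fobj:
                own = [line.strip() for line in fobj if line.strip()]
        self.patterns[dirname] = inherited + own
        return self.patterns[dirname]

    def patterns_for(self, path):
        dirname = os.path.dirname(path)
        if os.path.commonpath([self.root_dir, dirname]) != self.root_dir:
            raise ValueError("path is outside of the repo root")
        return self._collect(dirname)


with tempfile.TemporaryDirectory() as root:
    # Two "datasets", each with its own ignore file.
    for sub, pattern in (("data_a", "*.tmp"), ("data_b", "*.log")):
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, ".dvcignore"), "w") as fobj:
            fobj.write(pattern + "\n")

    ignore = LazyIgnore(root)
    found = ignore.patterns_for(os.path.join(root, "data_a", "file.csv"))
    read = [os.path.relpath(path, root) for path in ignore.files_read]
```

In this sketch only `data_a/.dvcignore` is opened when we ask about a path inside `data_a`; `data_b` is never touched, which is the kind of saving described above.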

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

efiop · 2020-07-25T22:47:57Z

@karajan1001 Would really appreciate if you could share your thoughts about this 🙂


karajan1001 · 2020-07-26T03:33:45Z

dvc/ignore.py

+        self._update(self.root_dir)

    def _update(self, dirname):
+        old_pattern = self.ignores_trie_tree.longest_prefix(dirname).value


Maybe here we also need

ignore_pattern = self.ignores_trie_tree.get(dirname)
if ignore_pattern:
    return

to prevent it from running multiple times. But if nothing except _get_trie_pattern calls it, this is not necessary.

It seems nothing else calls it.

Good point! Indeed, for now _update is only called in _get_trie_pattern, so that is already handled for us.
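
The guard being discussed can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dict keyed by path parts stands in for the trie, and `IgnoreCache`, `get_pattern`, `update_calls` and `FAKE_IGNORE_FILES` are hypothetical names, not the code in dvc/ignore.py.

```python
# Pretend contents of ignore files, keyed by directory (made up for the example).
FAKE_IGNORE_FILES = {
    "data": ["*.tmp"],
    "data/raw": ["*.bak"],
}


class IgnoreCache:
    """Hypothetical sketch: a dict stands in for the trie of patterns."""

    def __init__(self, loader):
        self._loader = loader  # callable: dirname -> list of patterns
        self._patterns = {}    # tuple of path parts -> merged pattern list
        self.update_calls = 0

    def _longest_prefix(self, parts):
        # Walk from the full key towards the root until a cached entry is found.
        for end in range(len(parts), -1, -1):
            if parts[:end] in self._patterns:
                return self._patterns[parts[:end]]
        return []

    def _update(self, dirname, parts):
        self.update_calls += 1
        inherited = self._longest_prefix(parts)
        self._patterns[parts] = inherited + self._loader(dirname)

    def get_pattern(self, dirname):
        parts = tuple(part for part in dirname.split("/") if part)
        cached = self._patterns.get(parts)
        # The only caller of _update checks the cache first, so _update
        # itself needs no guard. `is not None` is used so that an empty
        # pattern list also counts as cached.
        if cached is None:
            self._update(dirname, parts)
        return self._patterns[parts]


cache = IgnoreCache(lambda dirname: FAKE_IGNORE_FILES.get(dirname, []))
first = cache.get_pattern("data")
again = cache.get_pattern("data")      # served from the cache
nested = cache.get_pattern("data/raw")  # inherits from "data"
```

Because the lookup happens in the single caller, repeated queries for the same directory do not re-run the update step.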

karajan1001 · 2020-07-26T08:33:42Z

@efiop
This is wonderful work. It saves lots of time when we are scanning parts of the repo.

But I have a question about this part:

As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec(most of it is actually dvc module initialization) with this PR.

Are there lots of directories as well? If not, I would be surprised that it took so much time after I changed

dirs[:], files[:] = self(root, dirs, files)

to

dirs[:], _ = self(root, dirs, [])

to prevent useless ignore checks on all files.
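
For readers unfamiliar with the `dirs[:]` idiom in the two lines above, here is a small standalone sketch of pruning an `os.walk` traversal in place while leaving the file list unfiltered. `walk_dirs` and the `is_ignored_dir` callback are hypothetical helpers for this example only.

```python
import os
import tempfile


def walk_dirs(top, is_ignored_dir):
    """Return the directories visited under `top`, skipping ignored ones."""
    visited = []
    for root, dirs, _files in os.walk(top):  # top-down by default
        visited.append(os.path.relpath(root, top))
        # Slice assignment mutates the very list os.walk iterates next,
        # so ignored directories are never entered or listed.
        # `_files` is deliberately left unfiltered: only directories matter here.
        dirs[:] = sorted(d for d in dirs if not is_ignored_dir(d))
    return visited


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "keep"))
    os.makedirs(os.path.join(top, "skip", "nested"))
    visited = walk_dirs(top, lambda name: name == "skip")
    print(visited)
```

Rebinding the name instead (`dirs = [...]`) would not prune anything, because `os.walk` keeps using its own list object.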

efiop · 2020-07-26T20:03:51Z

@karajan1001 Your dvcignore implementation works flawlessly and is extremely efficient! 🙏 The thing that was taking time is that previously, when dvcignore was initialized, we would walk into those datasets looking for dvcignores. Regardless of whether there are lots of directories in them, os.walk still has to do listdir() for each directory, which lists all of the files inside it, and that takes a while with such a large number of files. That issue is actually a bigger problem for dvc; we will solve it later by telling dvcignore not to walk into datasets searching for dvcignores, but for now this dynamic collection mitigates it pretty well as an added bonus.

Thank you so much for your amazing work on dvcignores, what you've built is very neat, I'm truly enjoying working with the architecture you've created. 🙏

karajan1001 · 2020-07-27T04:26:15Z

@efiop
I tried 1M files on my computer.

It takes considerable time.

This PR also improves performance on #4282.

pared

post-merge LGTM

Prevents us from duplicating the work by walking into directories searching for subrepos. Saves around ~1sec (5.8 -> 4.8) in `dvc metrics diff` in a big git-only repo. Related to iterative#4284 (comment)


efiop force-pushed the dynamic_dvcignore branch 2 times, most recently from bc51716 to 8e68fc3 Compare July 25, 2020 22:40

efiop requested a review from pared July 25, 2020 22:47

efiop force-pushed the dynamic_dvcignore branch from 8e68fc3 to 88addc3 Compare July 25, 2020 22:58

karajan1001 reviewed Jul 26, 2020


weekly-digest bot mentioned this pull request Jul 26, 2020

Weekly Digest (19 July, 2020 - 26 July, 2020) #4285

Closed

efiop merged commit fe9ae2c into iterative:master Jul 26, 2020

efiop deleted the dynamic_dvcignore branch July 27, 2020 08:31

pared reviewed Jul 27, 2020


weekly-digest bot mentioned this pull request Aug 2, 2020

Weekly Digest (26 July, 2020 - 2 August, 2020) #4315

Closed

efiop mentioned this pull request Dec 14, 2020

dvcignore: don't walk directories twice #5098

Merged

2 tasks
