Skip to content

Conversation

@efiop
Copy link
Contributor

@efiop efiop commented Jul 25, 2020

This allows us avoid collecting dvcignore for the whole repo if we only
care about particular paths. As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec(most of it is actually dvc module initialization) with this PR.

This is also a pre-requisite for dynamic dvcignore and subrepo
collection (#4247) while walking
the tree.

Also, it is important to clarify that regular dvc status(without
arguments) has the same performance after this PR, because when we check
dataset for changes, we call things like tree.exists(), which call
dvcignore and make it collect dvcignore in the dataset itself, so we
still endup collecting dvcignore for the whole repo (including walking
into the datasets). This should be solved soon by telling dvcignore that
it shouldn't walk into the datasets searching for .dvcignores.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop efiop force-pushed the dynamic_dvcignore branch 2 times, most recently from bc51716 to 8e68fc3 Compare July 25, 2020 22:40
@efiop efiop requested a review from pared July 25, 2020 22:47
@efiop
Copy link
Contributor Author

efiop commented Jul 25, 2020

@karajan1001 Would really appreciate if you could share your thoughts about this 🙂

This allows us avoid collecting dvcignore for the whole repo if we only
care about particular paths. As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec(most of it is actually dvc module initialization) with this PR.

This is also a pre-requisite for dynamic dvcignore and subrepo
collection (iterative#4247) while walking
the tree.

Also, it is important to clarify that regular `dvc status`(without
arguments) has the same performance after this PR, because when we check
dataset for changes, we call things like `tree.exists()`, which call
dvcignore and make it collect dvcignore in the dataset itself, so we
still endup collecting dvcignore for the whole repo (including walking
into the datasets). This should be solved soon by telling dvcignore that
it shouldn't walk into the datasets searching for `.dvcignore`s.
@efiop efiop force-pushed the dynamic_dvcignore branch from 8e68fc3 to 88addc3 Compare July 25, 2020 22:58
self._update(self.root_dir)

def _update(self, dirname):
old_pattern = self.ignores_trie_tree.longest_prefix(dirname).value
Copy link
Contributor

@karajan1001 karajan1001 Jul 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe here we also need

        ignore_pattern = self.ignores_trie_tree.get(dirname)
        if ignore_pattern:
            return

to prevent it from running multiply times. But if nowhere else except _get_trie_pattern calls it, it is not necessary.

Seems nowhere else calls it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Indeed, for now _update is only called in _get_trie_pattern, so that is already handled for us.

@karajan1001
Copy link
Contributor

@efiop
It's a wonderful work. It saves lots of time when we are scanning parts of the repo.

But I have a question that in

As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec(most of it is actually dvc module initialization) with this PR.

There are lots of directories as well? If not so, it would a surprise to me that it took so much time after I had changed

dirs[:], files[:] = self(root, dirs, files)

to

dirs[:], _ = self(root, dirs, [])

preventing useless ignore check on all files.

@efiop
Copy link
Contributor Author

efiop commented Jul 26, 2020

@karajan1001 Your dvcignore implementation works flawlessly and is extremely efficient! 🙏 The thing that was taking time is that before, when dvcignore was initialized, we would walk into those datasets looking for dvcignores and no matter if there are lots of directories in it or not, os.walk still has to do listdir() for each directory, that has to list all of the files inside it, which takes a while with such a large number of files. That issue is actually a bigger problem for dvc, we will solve it by later telling dvcignore to not walk into datasets searching for dvcignores, but for now this dynamic collection mitigates that pretty well as an added bonus.

Thank you so much for your amazing work on dvcignores, what you've built is very neat, I'm truly enjoying working with the architecture you've created. 🙏

@efiop efiop merged commit fe9ae2c into iterative:master Jul 26, 2020
@karajan1001
Copy link
Contributor

@efiop
I tried 1M files on my computer.
image
It takes considerable time.

This PR also improves performance on #4282.

@efiop efiop deleted the dynamic_dvcignore branch July 27, 2020 08:31
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-merge LGTM

efiop added a commit to efiop/dvc that referenced this pull request Dec 14, 2020
Prevents us from duplicating the work by walking into directories
searching for subrepos. Saves around ~1sec (5.8 -> 4.8) in
`dvc metrics diff` in a big git-only repo.

Related to iterative#4284 (comment)
efiop added a commit that referenced this pull request Dec 14, 2020
Prevents us from duplicating the work by walking into directories
searching for subrepos. Saves around ~1sec (5.8 -> 4.8) in
`dvc metrics diff` in a big git-only repo.

Related to #4284 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants