ignore: dynamically collect dvcignore #4284
@karajan1001 Would really appreciate it if you could share your thoughts about this 🙂
This allows us to avoid collecting dvcignore for the whole repo if we only care about particular paths. As a result, in a repo with 2 datasets (2M + 0.5M files), creating a defunct stage takes ~4 sec on 1.2.0, but ~1 sec (most of it is actually dvc module initialization) with this PR. This is also a prerequisite for dynamic dvcignore and subrepo collection (iterative#4247) while walking the tree. Also, it is important to clarify that regular `dvc status` (without arguments) has the same performance after this PR: when we check a dataset for changes, we call things like `tree.exists()`, which call dvcignore and make it collect dvcignore in the dataset itself, so we still end up collecting dvcignore for the whole repo (including walking into the datasets). This should be solved soon by telling dvcignore that it shouldn't walk into the datasets searching for `.dvcignore`s.
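The idea of collecting ignore rules only for the paths we actually ask about can be sketched roughly as follows. This is a simplified illustration, not DVC's implementation: the `LazyIgnore` class, its `dirs_scanned` counter, and the exact-name matching are all assumptions made for the example (real dvcignore uses gitignore-style patterns and a trie).

```python
import os


class LazyIgnore:
    """Hypothetical, simplified stand-in for lazy dvcignore collection."""

    def __init__(self, root_dir):
        self.root_dir = os.path.abspath(root_dir)
        # dirname -> names ignored there (its own rules plus inherited ones)
        self._patterns = {}
        # how many directories we had to look at so far
        self.dirs_scanned = 0

    def _collect(self, dirname):
        # Already collected for this directory: nothing to do.
        if dirname in self._patterns:
            return self._patterns[dirname]

        parent = os.path.dirname(dirname)
        if dirname == self.root_dir or parent == dirname:
            inherited = []
        else:
            # Only ancestors of the requested path are visited,
            # never sibling directories such as other datasets.
            inherited = self._collect(parent)

        own = []
        ignore_file = os.path.join(dirname, ".dvcignore")
        if os.path.isfile(ignore_file):
            with open(ignore_file) as fobj:
                own = [line.strip() for line in fobj if line.strip()]

        self.dirs_scanned += 1
        self._patterns[dirname] = inherited + own
        return self._patterns[dirname]

    def is_ignored(self, path):
        path = os.path.abspath(path)
        patterns = self._collect(os.path.dirname(path))
        # Assumption for brevity: exact file-name match only.
        return os.path.basename(path) in patterns
```

With this shape, checking a single file only reads `.dvcignore` files in that file's ancestor directories, so an unrelated dataset with millions of files is never touched.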
    self._update(self.root_dir)

    def _update(self, dirname):
        old_pattern = self.ignores_trie_tree.longest_prefix(dirname).value
Maybe here we also need

    ignore_pattern = self.ignores_trie_tree.get(dirname)
    if ignore_pattern:
        return

to prevent it from running multiple times. But if nothing except `_get_trie_pattern` calls it, this is not necessary.
It seems nothing else calls it.
Good point! Indeed, for now _update is only called in _get_trie_pattern, so that is already handled for us.
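For readers unfamiliar with the lookup being discussed, here is a rough sketch of a longest-prefix search over directory paths combined with the guard suggested above. It uses a plain dict instead of a trie, and the helper names `longest_prefix`, `update`, and `read_patterns` are hypothetical, chosen for this example only; how DVC's `_update` merges patterns is not shown in this thread.

```python
import posixpath


def longest_prefix(known, dirname):
    """Return (prefix, value) for the deepest ancestor of dirname in known.

    Returns (None, None) when no ancestor (or dirname itself) is registered.
    """
    current = dirname
    while True:
        if current in known:
            return current, known[current]
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top without a match
            return None, None
        current = parent


def update(known, dirname, read_patterns):
    """Register dirname once, extending the nearest ancestor's patterns."""
    if dirname in known:  # the guard from the review comment
        return False
    _, inherited = longest_prefix(known, dirname)
    known[dirname] = (inherited or []) + read_patterns(dirname)
    return True
```

Calling `update` twice for the same directory reads its patterns only once, which is the behavior the guard is meant to ensure when there is more than one caller.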
@efiop But I have a question: are there lots of directories as well? If not, it would be a surprise to me that it took so much time after I had changed it to prevent useless ignore checks on all files.
@karajan1001 Your dvcignore implementation works flawlessly and is extremely efficient! 🙏 The thing that was taking time is that before, when dvcignore was initialized, we would walk into those datasets looking for dvcignores and no matter if there are lots of directories in it or not, os.walk still has to do
Thank you so much for your amazing work on dvcignores, what you've built is very neat, I'm truly enjoying working with the architecture you've created. 🙏
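To illustrate the cost being described, here is a minimal sketch of an eager search for `.dvcignore` files with `os.walk`, plus the standard way to keep it from descending into chosen directories (editing `dirnames` in place while walking top-down). The function name `find_ignore_files` and the `skip_dirs` parameter are hypothetical and not part of DVC.

```python
import os


def find_ignore_files(root, skip_dirs=()):
    """Walk root looking for .dvcignore files, not descending into skip_dirs.

    Returns (found_paths, number_of_directories_visited).
    """
    skip = {os.path.abspath(p) for p in skip_dirs}
    found = []
    visited = 0
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        visited += 1
        # With topdown=True, pruning dirnames in place stops os.walk
        # from ever listing the contents of the removed directories.
        dirnames[:] = [
            d for d in dirnames
            if os.path.abspath(os.path.join(dirpath, d)) not in skip
        ]
        if ".dvcignore" in filenames:
            found.append(os.path.join(dirpath, ".dvcignore"))
    return found, visited
```

Without `skip_dirs`, every directory under a dataset is listed even if none contains a `.dvcignore`; with the dataset paths passed in, the walk never enters them.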
post-merge LGTM
Prevents us from duplicating the work by walking into directories searching for subrepos. Saves ~1 sec (5.8 -> 4.8) in `dvc metrics diff` in a big git-only repo. Related to iterative#4284 (comment)
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏