New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ignore: CleanTree should always check ignores, not just for walk() #3876
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thoguht it seems to me that checking parents in each exists call might have an impact on performance. Creating issue in dvc-bench
for that.
iterative/dvc-bench#18
@pmrowla Could you please run the test from #3867 and see how that compares with and without your patch? There is no doubt that we need to fix these methods, but maybe we'll have to do some lowhanger optimizations too right away π , if the performance degrades too much. We clearly need to revisit dvcignore in general #3869, but maybe there are some simple optimizations (e.g. checking |
I'll look into the performance impact first thing tomorrow. Also need to fix the windows only issues from CI - looks like it's a problem with the case where local cache is on a different drive. |
# if path is outside of tree, assume this is a local remote/local cache | ||
# link/move operation where we do not need to filter ignores | ||
path = relpath(path, self.tree_root) | ||
if path.startswith("..") or ( | ||
os.name == "nt" | ||
and not os.path.commonprefix( | ||
[os.path.abspath(path), self.tree_root] | ||
) | ||
): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is non-intuitive, it seems to me that paths outside of repo root should by definition not exist in a repo CleanTree, but the problem is that for local cache/remote we do not currently differentiate between WorkingTree/CleanTree/filesystem paths.
This can be cleaned up by the remote/cache/tree separation:
- local remote/cache should have their own working tree that would handle the case where DVC cache dir is outside of the repo root or on a different logical drive
- DVC repo CleanTree could then disallow (ignore) paths outside of the repo root, since CleanTree paths would always and only be used to refer to paths inside the DVC repo, and not arbitrary local filesystem paths.
@efiop across several profiler runs from #3867, I agree that dvcignore/CleanTree performance definitely needs to be improved, but it seems to me like we can avoid making any optimization related changes in this PR, and handle #3869 in the next sprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
β I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. π
CleanTree.exists()
/isfile()
/isdir()
/etc now work as expected.path
relative to tree root, CleanTree will now check thatpath
and its parents are valid and not in our DVC ignores.Will close #3444.