Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignore: CleanTree should always check ignores, not just for walk() #3876

Merged
merged 5 commits into from May 27, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented May 26, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • CleanTree.exists()/isfile()/isdir()/etc now work as expected.
  • For a given path relative to tree root, CleanTree will now check that path and its parents are valid and not in our DVC ignores.

Will close #3444.

@pmrowla pmrowla changed the title ignore: CleanTree should always check ignores, not just for walk() [WIP] ignore: CleanTree should always check ignores, not just for walk() May 26, 2020
@pmrowla pmrowla self-assigned this May 26, 2020
@pmrowla pmrowla changed the title [WIP] ignore: CleanTree should always check ignores, not just for walk() ignore: CleanTree should always check ignores, not just for walk() May 26, 2020
@pmrowla pmrowla requested review from pared, efiop and skshetry May 26, 2020 08:37
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thoguht it seems to me that checking parents in each exists call might have an impact on performance. Creating issue in dvc-bench for that.
iterative/dvc-bench#18

@efiop
Copy link
Member

efiop commented May 26, 2020

@pmrowla Could you please run the test from #3867 and see how that compares with and without your patch?

There is no doubt that we need to fix these methods, but maybe we'll have to do some lowhanger optimizations too right away πŸ™ , if the performance degrades too much. We clearly need to revisit dvcignore in general #3869, but maybe there are some simple optimizations (e.g. checking root before running matches for each file/dir) that we could do right now, if needed.

@pmrowla
Copy link
Contributor Author

pmrowla commented May 26, 2020

I'll look into the performance impact first thing tomorrow. Also need to fix the windows only issues from CI - looks like it's a problem with the case where local cache is on a different drive.

Comment on lines +181 to +190
# if path is outside of tree, assume this is a local remote/local cache
# link/move operation where we do not need to filter ignores
path = relpath(path, self.tree_root)
if path.startswith("..") or (
os.name == "nt"
and not os.path.commonprefix(
[os.path.abspath(path), self.tree_root]
)
):
return True
Copy link
Contributor Author

@pmrowla pmrowla May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is non-intuitive, it seems to me that paths outside of repo root should by definition not exist in a repo CleanTree, but the problem is that for local cache/remote we do not currently differentiate between WorkingTree/CleanTree/filesystem paths.

This can be cleaned up by the remote/cache/tree separation:

  • local remote/cache should have their own working tree that would handle the case where DVC cache dir is outside of the repo root or on a different logical drive
  • DVC repo CleanTree could then disallow (ignore) paths outside of the repo root, since CleanTree paths would always and only be used to refer to paths inside the DVC repo, and not arbitrary local filesystem paths.

@pmrowla
Copy link
Contributor Author

pmrowla commented May 27, 2020

@efiop across several profiler runs from #3867, ignore.py takes 5-6s of runtime for both master and this PR branch on my machine.

I agree that dvcignore/CleanTree performance definitely needs to be improved, but it seems to me like we can avoid making any optimization related changes in this PR, and handle #3869 in the next sprint

Copy link
Member

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@efiop efiop merged commit 10f2bb9 into iterative:master May 27, 2020
@pmrowla pmrowla deleted the 3444-cleantree branch May 27, 2020 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

forbid using subrepo as a target
3 participants