-
Notifications
You must be signed in to change notification settings - Fork 1.2k
state: use dvcignore when accessing metadata #2161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole construction there I need to make DvcIgnoreFileHandler and then pass it to walk looks wrong. Here is why:
- it only parameterized by tree
- dvc_walk() also walks the tree
If this must be the same tree, then we should not pass it twice because in case of mismatch we get a bug.
Another question is do we need dvc_walk() with no ignore? If not then this design was flawed from the start and lead to the issue you are fixing here. If yes but trees always match then a simple boolean flag instead of ignore_handler param would do.
P.S. Maybe I don't understand how this works, but walking one tree and reading ignore files from another one may only work by coincidence, which is probable since there is usually a single ignore file at the top. However, if ignore files in different trees differ we will get tricky error and if dir structures differ this may simply blow up.
dvc/utils/fs.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str(path) is an error. The rule is never pass path stringification to file utils, we should use fspath() or make dvc_walk() accept Path-like objects.
If that is done to force unicode then why it wasn't there already? As far as I remember @efiop removed that stringification.
5ecec0c to
fe8ed15
Compare
|
@Suor |
|
@pared what happens if So, the thing is we shouldn't pass path to |
|
UPDATE. If we never need |
e84b603 to
f9bfcf5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice clean up atop of bug fix) Should prevent bugs of this type in the future.
One suspicious place and a few ones to polish are commented below.
dvc/ignore.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we pass wdir to find_root()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. In case when we have .dvcignore in repository root, and have data folder, and run dvc_walk(data), we will not apply ignores contained in repository root to data content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't pass wdir then it will use cwd, which could be anything. So tree is always made based on repo we are currently in. Is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, sorry, I just realized that you were talking about passign wdir to find_root. I agree with that completely. Also, as Ruslan mentioned below, we cannot ignore external directory. I quess the best way to proceed will be to try find repo root, and if it cannot be found, use wdir as WorkingTree root.
Thought in such case we should mention in documentation that DVC will not handle .dvcignore files if they are outside of target external directory
dvc/ignore.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected to be outside repo sometimes? Should we throw when trying to ignore outside repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides some test, I don't think we could expect that it will be used outside of repo. I guess its better to expect user to use dvc_walk in dvc_repo, than try to forcefully pass tests that did not use dvc_repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we don't make a tree, no .dvcignore files would be applied, even if we will find some there, because .update() is skipped. Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are talking about dvcignore itself, then it can't be outside of dvc repo. But keep in mind that we support external directories as dependencies and outputs (e.g. dvc add /path/to/some/external/dir), so some methods like dvc walk should be able to work there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
I don't think it is working correctly. For example: one would expect the last status to not show anything ,right? because bar was ignored. But it does show that bar was deleted. |
|
@efiop |
tests/func/test_ignore.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this is an interesting one. So dvcignores work in repo for everything in repo and also for external dependencies and, i suppose, external outputs? So one would be able to add .dvcignore inside of his output directory and dvc would use it? Also, do things like *.pyc(so any *.pyc file, with no specific path, don't confuse it with /*.pyc) and stuff in the .dvcignore in the root of the repo work for external dependencies and outputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I see repos root .dvcignore is not used when outside of repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor is right.
I see one problem with such use case: if we start to support it, we would probably have to be prepare to handle ignores in particular project from .dvcignore laying anywhere inside the project, thus switching from loading .dvcignore during traversing, to loading all dvcignores before traversing any directory.
Other way would be probably to restrict where we can put ignores for external dirs. (eg. root of the project)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not asking to support it, I'm just asking if it is the intension or not 🙂 But the .dvcignore can be placed inside of an output directory and it will work, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If .dvcignore inside of an output directory is supported, then consider this scenario:
dvc runoutputs a new directory and places .dvcignore in it; We use.dvcignoreto not cache what is says and it itself.- now we clean everything up and run
dvc checkout, which will not createdir/.dvcignoreand so if anything creates some ignored file in there, it no longer will be ignored. E.g. imagine*.swpfiles created, when you decide to open one of the output files with vim. Is this an intended behavior? Should we not even use.dvcignorethat is located inside of an output dir? or do we ignore it already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As to supporting .dvcignore for external outputs/deps in general, I see a few appraoches:
-
Use rules defined in the root of the repo for external deps and outs. This kinda makes sense if you think about external outs/deps as a continuation of your repo workspace. But at the same time, if rules like
*.swpare quite straightforward, it surely creates a confusion when you want to ignore something by an abs path/path/to/*.swp, since in git terms it would actually meenrepo_root/path/to/*.swp; -
Support system-level and globa-level dvcignores, like git does. This one is actually pretty straightforward and makes totall sense for local files(e.g. not s3 or ssh).
-
Invent some syntax for the dvcignore to handle external deps/outs or even a separate file like
.dvcignore-externalor something like that. This one is tricky, but at least it could also be applied to any s3/ssh/etc deps/outs, since we would have to treat those separately anyway.
But yeah, these are just some thoughts, I would not bother with it until we see the demand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the effort of just ignoring .dvcignore files?
@efiop system/global dvcignores is a no go since they will be an invisible dependencies for all stages - when you change one dvc repro will result in different thing than dvc repro -f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great point about dependencies @Suor !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor Although, in that case, our local .dvcignores are also dependencies :) We shouldn't treat .dvcignores as dependencies at all, because the scenarios that they are used in could totally wait for the next repro to ignore listed files. I guess this is the case where we could sacrifice dep logic. Or maybe we should actually re-evaluate dvcignores and make them distributes across dvc-files,? The funny thing is that this problem is the same that we have with locks for imported external repos 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though, unlike locks, dvcignores are clearly a global concept and I can totally see people requesting the current global dvcignore later and I don't see them requesting per-dvcfile dvcignore. So I would stick with the current idea of dvcignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
dvc/ignore.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like self.tree is always there.
69580b6 to
93b3162
Compare
tests/func/test_repo.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor this test seems to be artifical, and similar case as
#2161 (comment)
I think i should remove that, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are testing real behavior with code backing out here. If we break that this will show it
tests/func/test_repo.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are testing real behavior with code backing out here. If we break that this will show it
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
|
@pared Please address DeepSource issues. I can see only |
|
Thank you @pared ! 🎉 |
Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.
Fixes #1876 #1834 #1863
This PR should probably be merged after #2159