Skip to content

Conversation

@pared
Copy link
Contributor

@pared pared commented Jun 20, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fixes #1876 #1834 #1863

This PR should probably be merged after #2159

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole construction there I need to make DvcIgnoreFileHandler and then pass it to walk looks wrong. Here is why:

  • it only parameterized by tree
  • dvc_walk() also walks the tree

If this must be the same tree, then we should not pass it twice because in case of mismatch we get a bug.

Another question is do we need dvc_walk() with no ignore? If not then this design was flawed from the start and lead to the issue you are fixing here. If yes but trees always match then a simple boolean flag instead of ignore_handler param would do.

P.S. Maybe I don't understand how this works, but walking one tree and reading ignore files from another one may only work by coincidence, which is probable since there is usually a single ignore file at the top. However, if ignore files in different trees differ we will get tricky error and if dir structures differ this may simply blow up.

dvc/utils/fs.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str(path) is an error. The rule is never pass path stringification to file utils, we should use fspath() or make dvc_walk() accept Path-like objects.

If that is done to force unicode then why it wasn't there already? As far as I remember @efiop removed that stringification.

@pared pared force-pushed the 1876 branch 3 times, most recently from 5ecec0c to fe8ed15 Compare June 24, 2019 17:53
@pared
Copy link
Contributor Author

pared commented Jun 24, 2019

@Suor
I removed DvcIgnoreFileHandler. I agree that it is unnecessary, but I am afraid that flag is not enough. It actually will not serve any purpose now that we plan to use dvcignore in each use case of walk. We need to pass tree object to access tree-related operations like exists and open which are required for ignore mechanism to work. I guess we could discuss usability of dvcignore in case of git branches, because that is the main obstacle in removing tree from ignore mechanics.

@pared pared requested a review from Suor June 25, 2019 08:10
@Suor
Copy link
Contributor

Suor commented Jun 25, 2019

@pared what happens if path passed to dvc_walk() doesn't correspond the tree we use for ignores? This could be the same dir, but path means whatever we have on disk (Working Tree), which tree may go over branch. This will mean you will read files and ignore files from different branches, which will work some way most of the time, but is an obvious error.

So, the thing is we shouldn't pass path to dvc_walk() we should pass tree only and use it for both listing and ignore.

@Suor
Copy link
Contributor

Suor commented Jun 25, 2019

UPDATE. If we never need dvc_walk() to go by anything but working tree. We can still pass path and set tree = WorkingTree(path) inside, we should not have to pass both path and tree.

@pared pared force-pushed the 1876 branch 2 times, most recently from e84b603 to f9bfcf5 Compare June 25, 2019 22:06
@pared pared requested a review from efiop June 25, 2019 22:12
@pared pared changed the title [WIP] state: use ignore file handler when accessing metadata state: use ignore file handler when accessing metadata Jun 25, 2019
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice clean up atop of bug fix) Should prevent bugs of this type in the future.

One suspicious place and a few ones to polish are commented below.

dvc/ignore.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we pass wdir to find_root()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. In case when we have .dvcignore in repository root, and have data folder, and run dvc_walk(data), we will not apply ignores contained in repository root to data content.

Copy link
Contributor

@Suor Suor Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't pass wdir then it will use cwd, which could be anything. So tree is always made based on repo we are currently in. Is this expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, sorry, I just realized that you were talking about passign wdir to find_root. I agree with that completely. Also, as Ruslan mentioned below, we cannot ignore external directory. I quess the best way to proceed will be to try find repo root, and if it cannot be found, use wdir as WorkingTree root.
Thought in such case we should mention in documentation that DVC will not handle .dvcignore files if they are outside of target external directory

dvc/ignore.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected to be outside repo sometimes? Should we throw when trying to ignore outside repo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides some test, I don't think we could expect that it will be used outside of repo. I guess its better to expect user to use dvc_walk in dvc_repo, than try to forcefully pass tests that did not use dvc_repo.

Copy link
Contributor

@Suor Suor Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we don't make a tree, no .dvcignore files would be applied, even if we will find some there, because .update() is skipped. Is this intended?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are talking about dvcignore itself, then it can't be outside of dvc repo. But keep in mind that we support external directories as dependencies and outputs (e.g. dvc add /path/to/some/external/dir), so some methods like dvc walk should be able to work there.

Copy link
Contributor Author

@pared pared Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop thats a good point, I guess dvc_walk should be able to handle out of repository directories.
@Suor That is my mistake. I guess we should proceed with tree, if it exists, and if not, we should just go with os.path. What do you think?

@pared pared changed the title state: use ignore file handler when accessing metadata state: use dvcignore when accessing metadata Jun 26, 2019
@efiop
Copy link
Contributor

efiop commented Jun 26, 2019

I don't think it is working correctly. For example:

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo

cd myrepo

git init
dvc init

mkdir data_dir
echo foo > data_dir/foo
echo bar > .dvcignore
echo bar > data_dir/bar

dvc add data_dir

dvc status

rm -rf data_dir/bar

dvc status

one would expect the last status to not show anything ,right? because bar was ignored. But it does show that bar was deleted.

@pared
Copy link
Contributor Author

pared commented Jun 26, 2019

@efiop
It seems I haven't removed os.walk in RemoteLOCAL.walk when writin initial dvc_walk implementation. Introduced proper change and test for that case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this is an interesting one. So dvcignores work in repo for everything in repo and also for external dependencies and, i suppose, external outputs? So one would be able to add .dvcignore inside of his output directory and dvc would use it? Also, do things like *.pyc(so any *.pyc file, with no specific path, don't confuse it with /*.pyc) and stuff in the .dvcignore in the root of the repo work for external dependencies and outputs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see repos root .dvcignore is not used when outside of repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor is right.
I see one problem with such use case: if we start to support it, we would probably have to be prepare to handle ignores in particular project from .dvcignore laying anywhere inside the project, thus switching from loading .dvcignore during traversing, to loading all dvcignores before traversing any directory.

Other way would be probably to restrict where we can put ignores for external dirs. (eg. root of the project)

Copy link
Contributor

@efiop efiop Jun 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not asking to support it, I'm just asking if it is the intension or not 🙂 But the .dvcignore can be placed inside of an output directory and it will work, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If .dvcignore inside of an output directory is supported, then consider this scenario:

  1. dvc run outputs a new directory and places .dvcignore in it; We use .dvcignore to not cache what is says and it itself.
  2. now we clean everything up and run dvc checkout, which will not create dir/.dvcignore and so if anything creates some ignored file in there, it no longer will be ignored. E.g. imagine *.swp files created, when you decide to open one of the output files with vim. Is this an intended behavior? Should we not even use .dvcignore that is located inside of an output dir? or do we ignore it already?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to supporting .dvcignore for external outputs/deps in general, I see a few appraoches:

  1. Use rules defined in the root of the repo for external deps and outs. This kinda makes sense if you think about external outs/deps as a continuation of your repo workspace. But at the same time, if rules like *.swp are quite straightforward, it surely creates a confusion when you want to ignore something by an abs path /path/to/*.swp, since in git terms it would actually meen repo_root/path/to/*.swp;

  2. Support system-level and globa-level dvcignores, like git does. This one is actually pretty straightforward and makes totall sense for local files(e.g. not s3 or ssh).

  3. Invent some syntax for the dvcignore to handle external deps/outs or even a separate file like .dvcignore-external or something like that. This one is tricky, but at least it could also be applied to any s3/ssh/etc deps/outs, since we would have to treat those separately anyway.

But yeah, these are just some thoughts, I would not bother with it until we see the demand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the effort of just ignoring .dvcignore files?

@efiop system/global dvcignores is a no go since they will be an invisible dependencies for all stages - when you change one dvc repro will result in different thing than dvc repro -f.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point about dependencies @Suor !

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor Although, in that case, our local .dvcignores are also dependencies :) We shouldn't treat .dvcignores as dependencies at all, because the scenarios that they are used in could totally wait for the next repro to ignore listed files. I guess this is the case where we could sacrifice dep logic. Or maybe we should actually re-evaluate dvcignores and make them distributes across dvc-files,? The funny thing is that this problem is the same that we have with locks for imported external repos 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, unlike locks, dvcignores are clearly a global concept and I can totally see people requesting the current global dvcignore later and I don't see them requesting per-dvcfile dvcignore. So I would stick with the current idea of dvcignore.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

dvc/ignore.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like self.tree is always there.

@pared pared force-pushed the 1876 branch 3 times, most recently from 69580b6 to 93b3162 Compare July 1, 2019 16:41
@pared pared changed the title state: use dvcignore when accessing metadata [WIP] state: use dvcignore when accessing metadata Jul 1, 2019
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor this test seems to be artifical, and similar case as
#2161 (comment)
I think i should remove that, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are testing real behavior with code backing out here. If we break that this will show it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are testing real behavior with code backing out here. If we break that this will show it

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@efiop
Copy link
Contributor

efiop commented Jul 12, 2019

@pared Please address DeepSource issues. I can see only Method could be a function PYL-R0201 issue being relevant here.

@efiop efiop merged commit 393d926 into iterative:master Jul 12, 2019
@efiop
Copy link
Contributor

efiop commented Jul 12, 2019

Thank you @pared ! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dvc: use dvcignore when adding data with dvc add/run

3 participants