Skip to content

Conversation

@pared
Copy link
Contributor

@pared pared commented Dec 18, 2019

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • πŸ“– Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Fixes #2914

@pared pared force-pushed the 2914 branch 5 times, most recently from b238c05 to 90e39c6 Compare December 19, 2019 11:00
@pared pared requested review from Suor and efiop December 20, 2019 18:09
@pared pared changed the title [WIP] repo: move dvcignore from repo to tree repo: move dvcignore from repo to tree Dec 20, 2019
@pared
Copy link
Contributor Author

pared commented Dec 20, 2019

Note to self: what about resetting dvcignore of CleanTree

dvc/repo/add.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use repo.tree.walk?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess need to wrap tree from down below too, right?

Copy link
Contributor

@efiop efiop Dec 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared ^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop Do you mean the one obtained with get_tree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dvc/scm/tree.py Outdated
Comment on lines 81 to 108
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better keep it in the dvcignore.py? Not a strong opinion, just wondering, since it is more of a wrapper than a thing that is actually related to scm.

dvc/scm/tree.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still using dvc_walk with dvcignore, which you probably shouldn't since you have CleanTree now πŸ™‚ Also, maybe we could get rid of dvc_walk at all, since we now have CleanTree?

dvc/state.py Outdated
Comment on lines 472 to 474
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could pass it a tree instead? Seems to be natural. Maybe I'm missing something though.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The principles of CleanTree approach are:

  • all the code outside dvcignore module should know only about CleanTree, i.e. should not know about filter, etc. The only way to use it is to wrap a tree with .walk() with a CleanTree, then use its public methods.
  • .walk() and .walk_files() should never accept dvcignore
  • there should be no walk_files() nor dvc_walk() functions, only methods

Also the question is when we should or should not filter by with .dvcignore files. Now we always do that for local remotes, which might not make sense at all, be a useful feature or lead to a surprsing behavior, say we have .dvcignore:

*.backup
work-data  # Just excluding some dir in a repo root 

It excludes *.backup both within a repo and in external deps, which is a useful feature. It also excludes work-data in repo root, repo nested dirs (not intended) and external deps, like a network share (a surprising behavior).

Also consider this dir structure:

/some-dir
    /repo
        .dvcignore
        /repo-data
        /ignored-data
    /external-dep
        ...

Here .dvcignore entries will affect collecting files in /external-dep, which is in an outer scope to the file.

Some remarks below illustrate the principles at the start.

dvc/scm/tree.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about scm. This is about ignore, should be able to wrap WorkingTree, GitTree and Remote as long as they provide those methods.

This one should also provide .walk_files().

dvc/scm/tree.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wrapped tree should know nothing about dvcignores.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a bug here, this function is still broken by design, i.e. it has two arguments that need to be consistent, i.e. dvcignore tree might be of some branch, while we list working dir.

Other issue with this is that dvcignores should not work outside repo (or should they?), but they are used here. We should carefully distinguish these two cases, now they look mixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, this works by a coincidence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scm/git should know nothing about clean trees.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No **kwargs. You are leaving a possibility of still ignoring passed dvcignore.

dvc/state.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should accept a tree, not dvcignore.

@pared
Copy link
Contributor Author

pared commented Dec 23, 2019

@Suor

It excludes *.backup both within a repo and in external deps, which is a useful feature. It also excludes work-data in repo root, repo nested dirs (not intended) and external deps, like a network share (a surprising behavior).

Did you prepare some reproduction script? I prepared a little test and both statements seem not to be valid.:

def test_ignore_external(tmp_dir, scm, dvc):
    tmp_dir.gen(
        {
            "dir": {
                "subdir": {
                    ".dvcignore": "*.backup\nwork-data",
                    "x.backup": "x backup",
                    "work-data": "content"
                },
                "subdir_external": {"y.backup": "y.backup"},
                "work-data": "work-data",
            }
        }
    )

    result = [e for e in dvc.tree.walk("dir")]

If you analyze result content you will see that only x.backup and work-content in subdir is ignored.

Also, as to ignoring external dependencies: AFAIK we do not support ignoring them. (some context is here: #2161)

@Suor Suor mentioned this pull request Dec 24, 2019
@Suor
Copy link
Contributor

Suor commented Dec 24, 2019

@pared you are right external deps/outs are not affected, we only get surprising behavior by ignoring nested, but this is what git also does. Checked current behavior with a couple of tests #3003

The principles still stand. We still need walk_files() in RemoteLOCAL.list_cache_path() all the other code should use trees either clean or unfiltered. dvc_walk() should be removed. Ignore tests should be changed to use either dvc.tree.walk_files() or RemoteLOCAL.walk_files() instead of walk_files() function.

BTW, in the PR you've linked here I've stated this design vulnerability issue :) #2161 (review)

@pared
Copy link
Contributor Author

pared commented Dec 24, 2019

@Suor sure, I do not oppose removing dvc_walk and walk_files, just wanted to make sure we are on the same page.

@pared pared force-pushed the 2914 branch 2 times, most recently from 1053606 to 81ebd85 Compare December 27, 2019 12:01
dvc/utils/fs.py Outdated
Comment on lines 37 to 38
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But only to WorkingTree, right? πŸ™‚

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared Please remove this FIXME, since you've created an issue already

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just need to remove that "FIXME".

@pared
Copy link
Contributor Author

pared commented Dec 27, 2019

@efiop removed FIXME, also created #3011

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. Some cleanup and protections below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to assert that tree is a working tree or a clean working tree. We want to be sure that we never walk the git tree here.

dvc/ignore.py Outdated
Comment on lines +108 to +118
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check that path is not dvcignored to be entirely safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should check if its ignored, a path can be both file and be ignored at the same time. What is the reasoning for that?

Copy link
Contributor

@Suor Suor Dec 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if file is ignored then it should not be visible, i.e. tree should pretend it doesn't exist. This is not an issue for now because we only open files that we walk, probably.

I see at least one issue though dvc pipeline show ignored.dvc will actually instantiate this ignored stage, and we will get a puzzling KeyError later.

We may postpone this though because this might be solved in various ways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same assert here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SCM code should not know anything about clean trees.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If diff will not be dvcignore-aware, we will get different tree for a branch when using diff, and when traversing it using dvcignore-wrapped GitTree, I don't see a way to make GitTree completely unaware of dvcignore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, we can wrap this into CleanTree later in a repo method.

dvc/utils/fs.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to assert that tree is a working tree/clean working tree. Passing git tree here will break the abstraction - we walk git, but os.stat() working tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in is a very non reliable test, please use == instead. Why can't you just:

assert _files_set(".", tree) == {"./bar"}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing here: I changed this assert, but it seems that GitTree always return abs root, which is not how os.walk behaves, I think that should be changed, @efiop, @Suor would you agree with me?

@pared pared requested a review from Suor December 30, 2019 13:21
@pared pared changed the title repo: move dvcignore from repo to tree [WIP] repo: move dvcignore from repo to tree Dec 30, 2019
@pared pared changed the title [WIP] repo: move dvcignore from repo to tree repo: move dvcignore from repo to tree Dec 30, 2019
@efiop efiop merged commit df5fda2 into iterative:master Dec 30, 2019
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix a couple of asserts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update this assert the same way as in walk_files().

dvc/utils/fs.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update this assert the same way as in walk_files().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dvc: dvc_walk() and related things are broken by design

3 participants