Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RepoTree: work with dirty repos #4005

Merged
merged 1 commit into from
Jun 11, 2020
Merged

RepoTree: work with dirty repos #4005

merged 1 commit into from
Jun 11, 2020

Conversation

efiop
Copy link
Member

@efiop efiop commented Jun 10, 2020

Currently RepoTree.open will throw FileNotFoundError if you have a
file that is tracked by dvc, but which has no md5 or a cache for which
is missing, even though you have the file itself in your workspace.

This corresponds very nicely to the --no-commit or --no-exec
scenario, where you might not want to commit your artifacts to cache
just yet.

Fixes #3974

  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop efiop requested a review from pmrowla June 10, 2020 01:58
@efiop
Copy link
Member Author

efiop commented Jun 10, 2020

The failing test might be related to #3991 CC @pmrowla

https://travis-ci.com/github/iterative/dvc/jobs/346909743#L8052

@efiop efiop changed the title RepoTree: work with dirty repos [WIP] RepoTree: work with dirty repos Jun 10, 2020
dvc/repo/tree.py Outdated Show resolved Hide resolved
Comment on lines 683 to 685
if not (
tree.isdvc(path_info, strict=False) and tree.fetch
):
Copy link
Member Author

@efiop efiop Jun 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what has been causing that test failure. We rely on a side-effect here, that might not work if the tree is dirty. So now we try to open first, which might download the cache, and then check if we indeed have the cache now and if not - try to save it to cache. CC @pmrowla

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks kind of strange now, since we're calling changed_cache twice, before and after tree.open

Copy link
Contributor

@pmrowla pmrowla Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that relying on this side effect the way it was originally written is correct, since we should only be calling open specifically when we need to download the cache file (which was checked when we call changed_cache before calling open)

With your changes to DvcTree/RepoTree, but without this change to _save_file, tests/func/test_update.py::test_update_import_after_remote_updates_to_dvc fails, but I think that is a bug in the test.

If I make this change in that test:

     with erepo_dir.branch("branch", new=False), erepo_dir.chdir():
         erepo_dir.scm.repo.index.remove(["version"])
         erepo_dir.dvc_gen("version", "updated")
-        erepo_dir.scm.add(["version", "version.dvc"])
+        erepo_dir.scm.add(["version.dvc"])
         erepo_dir.scm.commit("upgrade to DVC tracking")
         new_rev = erepo_dir.scm.get_rev()

it passes with the original _save_file implementation (but with your changes to prefer dirty version in DvcTree).

I think the issue is that scm.add() calls git.index.add(<path>) directly, which is not the same as git add <path>. It looks like git.index.add doesn't respect gitignores, so the original test actually makes both version and version.dvc into git-tracked files.

In that test, when we open("version") to copy/fetch it into cache, repo.tree.exists("version") should be false, since it's a DVC out from our clean erepo clone tree, which does not have any dirty/modified copy of version at all. version.dvc should exist in the git tree, but not .gitignored DVC out file version.

I think the issue w/scm.add() not respecting .gitignore can be verified with

def test_gittree_add_ignore(tmp_dir, scm):
    tmp_dir.gen({"foo": "foo", ".gitignore": "foo"})
    with tmp_dir.chdir():
        scm.add(["foo", ".gitignore"])
        scm.commit("init")
    tree = scm.get_tree("HEAD")
    assert not tree.exists(tmp_dir / "foo")

(test will fail in master)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great explanation! You are absolutely right, we don't have to touch that line for that particular test right now.

The double if changed_cache is just our way of verifying the side-effect to tell if the cache was actually downloaded and there is nothing we need to do here.

Also, it seems like wiith the working tree we don't really go through that branch of execution (because of the if tree), so my change might not even affect it. So that also might be something that is just not abstracted yet.

I'll remove that change in favor of fixing the test itself, and will keep it in mind for the future.

Thanks!

@efiop efiop force-pushed the fix-3974 branch 2 times, most recently from 56378fc to 7569772 Compare June 10, 2020 15:48
@efiop efiop changed the title [WIP] RepoTree: work with dirty repos RepoTree: work with dirty repos Jun 10, 2020
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments on remote/base.py

Currently `RepoTree.open` will throw FileNotFoundError if you have a
file that is tracked by dvc, but which has no md5 or a cache for which
is missing, even though you have the file itself in your workspace.

This corresponds very nicely to the `--no-commit` or `--no-exec`
scenario, where you might not want to commit your artifacts to cache
just yet.

Fixes iterative#3974
@@ -52,9 +52,9 @@ def test_update_import_after_remote_updates_to_dvc(tmp_dir, dvc, erepo_dir):
new_rev = None
with erepo_dir.branch("branch", new=False), erepo_dir.chdir():
erepo_dir.scm.repo.index.remove(["version"])
erepo_dir.dvc_gen("version", "updated")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #4015

@efiop efiop merged commit 87276fa into iterative:master Jun 11, 2020
@efiop efiop deleted the fix-3974 branch June 11, 2020 12:12
Suor added a commit to Suor/dvc that referenced this pull request Jun 29, 2020
efiop pushed a commit that referenced this pull request Jun 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make sure metrics/plots can compare dirty repo with the last commit by default
2 participants