Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: does not behave properly when cache is not available #3392

Closed
shcheklein opened this issue Feb 24, 2020 · 6 comments · Fixed by #3468
Closed

diff: does not behave properly when cache is not available #3392

shcheklein opened this issue Feb 24, 2020 · 6 comments · Fixed by #3468
Assignees
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@shcheklein
Copy link
Member

Version

DVC version: 0.86.5+f67314.mod
Python version: 3.7.6
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce

git clone git@github.com:iterative/example-get-started.git
cd example-get-started
dvc pull
dvc diff baseline-experiment bigrams-experiment

Output

Added:
    data/features/test.pkl
    data/features/train.pkl

Modified:
    auc.metric
    data/features/
    model.pkl

files summary: 2 added, 0 deleted, 2 modified

which is wrong. Files data/features/test.pkl and data/features/train.pkl actually exist in both experiments. The only difference is that one of them cached, one is not.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 24, 2020
@dmpetrov dmpetrov added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. and removed triage Needs to be triaged labels Feb 24, 2020
@dmpetrov
Copy link
Member

dmpetrov commented Feb 24, 2020

@shcheklein good catch!

The only difference is that one of them cached, one is not.

Just a clarification: these two files are not cached for master/bigrams-experiment but not for baseline-experiment.

I'm very surprised that dvc diff is reading the cache and depends on its state. We agreed that #2998 will be also included in the first version of the diff. It is crucial for CI/CD scenarios. Reopening it.

PS: @shcheklein the url didn't work by default

git clone git@github.com:iterative/example-get-started.git
Cloning into 'example-get-started'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Feb 24, 2020

the url didn't work by default

git clone git@github.com:iterative/example-get-started.git

Hmmm weird. The repo is public. Is it because of the git@ SSH URL? Try with git clone git@github.com:iterative/example-get-started.git @dmpetrov?

If this is the case, we should update in https://dvc.org/doc/command-reference/get#example-compare-different-versions-of-data-or-model

UPDATE: I'm changing this just in case in a regular updates PR to docs.

@pared
Copy link
Contributor

pared commented Mar 6, 2020

Worth noting that after pull on baseline, message seems to be ok (that both paths are modified).

@pared
Copy link
Contributor

pared commented Mar 10, 2020

So, a little update on my side:
The reason for such original behavior is as follows:

  1. There is .dir cache file for pulled revision
  2. There is no .dir cache file for the other one.
  3. When collecting data for non-existing (in cache) revision, we raise an exception which is silently ignored here
  4. In case of existing one loading dir cache is successful
  5. In diff's _paths_checskums we "unpack" dir cache content for later comparison.
  6. This logic in addition to silent ignoring of DirCacheError causes as to detect deleted or added new files (depending on revision order), instead of detecting actual change (modification).

I see the following solutions for this problem:

  1. Raise DirCacheError further and fail diff if there is no cache file - I don't like this idea, because we fail "diagnostic" operation and require the user to download files that he actually might not want to download
  2. Implement logic in remote/base.get_dir_cache for downloading absent cache files (and only them) - I also don't like this idea, as it will be heavy in implementation and still does not guarantee we get proper results (what if someone gc'd his remote? dir cache will be unavailable)
  3. Make dir outs equal from diff point of view to file outs: just remove this unpacking code - I would go with this option, as it does not require from us downloading anything in the situation presented in the original issue and still is able to inform user about changes that happened in the repository.

@dmpetrov @shcheklein what do you think?

@shcheklein
Copy link
Member Author

I think I'm fine with 3 as a first step, and may be think more about 2. implementation mentioned in 2 is needed potentially in other cases - e.g. dvc list - #3394 (not even sure why is it p2 tbh - it's p1 to me), so I think this mechanism will be needed down the road anyway.

@shcheklein
Copy link
Member Author

thanks for the great research @pared , btw :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants