Skip to content

Conversation

pared
Copy link
Contributor

@pared pared commented Oct 3, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fixes #2507

@efiop efiop requested a review from Suor October 3, 2019 07:33
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several things:

  • we should test for local (git handled) file first, because that one is faster. Plus this will allow us to not swallow OutputNotFoundError as we do here.
  • I am not a big fan of double catch like here, I will prefer .exists() test
  • not sure we should use self.tree here, why not simply use open()? see EDIT
  • we lost symmetry between Repo.open() and Repo._open(), ._open() only open outputs now while .open() does everything, so rename ._open() -> ._open_out(). Still starts with underscore since it raises unwrapped exception.

EDIT: It looks like we use Repo.tree everywhere in Repo/Stage for now, so that should be kept. I still don't like this abstraction, but that one is a separate topic.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Oct 3, 2019

Hi. Will this PR implement #2515 as suggested in #2515 (comment)?

@pared
Copy link
Contributor Author

pared commented Oct 4, 2019

@jorgeorpinel
It does not seem so. Problem with this task lies inside dvc/repo/Repo.open, while as I look in the code, #2515 requires changes in dvc/repo/Repo.get at dvc file check and out file search
It should probably be handled by separate PR.

@efiop efiop requested a review from Suor October 4, 2019 09:13
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One point from my first review still stands:

  • we should test for local (git handled) file first, because that one is faster. Plus this will allow us to not swallow OutputNotFoundError as we do here.

OutputNotFoundError is used as cause now, which doesn't make much sense to me, I would let it show itself, e.g. not catch it, after the change above.


# Remove cache to make foo missing
shutil.rmtree(dvc_repo.cache.local.cache_dir)
os.remove(repo_dir.FOO)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

dvc_repo.run(
metrics_no_cache=[metric_file], cmd="python {}".format(code_file)
)
os.remove(metric_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I don't think there is another way to test it. We want to test whether dvc will inform us that some output cannot be checked out. If the file is in a repo, it seems natural that it will be used, instead of checking whether it is in the cache, don't you think?

@contextmanager
def open(self, path, remote=None, mode="r", encoding=None):
"""Opens a specified resource as a file descriptor"""
if not remote and self.tree.exists(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tree.exists() will be True if someone done checkout and might not be what we need if dvc file then changed (e.g. by git checkout). Shouldn't we first look in cached outputs and then fallback to tree check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, at the beginning I agreed with @Suor s comment here
but now that you mention checkout, it probably better to check cache first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared Good point. Or maybe we can improve this current approach so that it doesn't suffer from that issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could solve that either by checking a checksum, or prompting the user?
Checking the checksum would result in losing the advantage suggested by @Suor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should more strictly go over all the possibilities here:

  • get out at the start
  • if we have out and it's cached then we go to ._open_cached()
  • if we don't have out or have out, which is not cached, we go .tree.open(), no need to check for exists then.

There is never such thing as retrying tree open after cached open or vice versa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I should have started with separating finding out from opening it. @efiop do you have any concerns about this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared sounds good to me.

Comment on lines 270 to 274
class NotCachedOutputFileMissingError(DvcException):
def __init__(self, path_info):
super(NotCachedOutputFileMissingError, self).__init__(
"Can't find '{}'. This file was not cached.".format(path_info)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both class name and message blur more than they explain. And, BTW, what's the value of distinguishing this from OutputFileMissing?

I say we should stop inventing more exception classes. This complicates both our code and API.

os.remove(metric_file)

with pytest.raises(NotCachedOutputFileMissingError):
api.read(metric_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two tests are almost the same. I would say simply move:

    os.remove(metric_file)
    with pytest.raises(NotCachedOutputFileMissingError):
        api.read(metric_file)

to the end of the first one and remove the second test.

Also, why do you create new code file, while we already have code.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New code is required due to issue regarding original code described in #2574

def _open_out(self, path, remote=None, mode="r", encoding=None):
out, = self.find_outs_by_path(path)

if not out.use_cache and not out.exists:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is wrong - if out is not using cache, but is missing we continue to try open it from cache, which doesn't make sense.

I would say we should check if out is cached before ._open_out() then, rename ._open_out() to ._open_cached() and pass out there instead of path.

@contextmanager
def open(self, path, remote=None, mode="r", encoding=None):
"""Opens a specified resource as a file descriptor"""
if not remote and self.tree.exists(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should more strictly go over all the possibilities here:

  • get out at the start
  • if we have out and it's cached then we go to ._open_cached()
  • if we don't have out or have out, which is not cached, we go .tree.open(), no need to check for exists then.

There is never such thing as retrying tree open after cached open or vice versa.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using

..., cmd="python -c '{}'".format(metric_code)

and not creating any file should be easier. This not that important though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will introduce it

@efiop efiop merged commit d2f8cb0 into iterative:master Oct 8, 2019
@efiop
Copy link
Contributor

efiop commented Oct 8, 2019

@pared ok, one more question. So does this make dvc get work with scm-controlled files as well?

@pared
Copy link
Contributor Author

pared commented Oct 8, 2019

@efiop no,
here is the reason

@shcheklein
Copy link
Member

@efiop no, here is the reason

@pared @efiop let's create a ticket then? We need to get to a parity here.

@efiop
Copy link
Contributor

efiop commented Oct 8, 2019

@shcheklein We do have the ticket #2515 . I was just double-checking.

@pared pared deleted the 2507 branch December 17, 2019 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

api: open file that is not cached by dvc but stored in git

5 participants