-
Notifications
You must be signed in to change notification settings - Fork 1.2k
api: open scm controlled files #2566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several things:
- we should test for local (git handled) file first, because that one is faster. Plus this will allow us to not swallow
OutputNotFoundError
as we do here. - I am not a big fan of double catch like here, I will prefer
.exists()
test not sure we should usesee EDITself.tree
here, why not simply useopen()
?- we lost symmetry between
Repo.open()
andRepo._open()
,._open()
only open outputs now while.open()
does everything, so rename._open()
->._open_out()
. Still starts with underscore since it raises unwrapped exception.
EDIT: It looks like we use Repo.tree
everywhere in Repo/Stage for now, so that should be kept. I still don't like this abstraction, but that one is a separate topic.
Hi. Will this PR implement #2515 as suggested in #2515 (comment)? |
@jorgeorpinel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One point from my first review still stands:
- we should test for local (git handled) file first, because that one is faster. Plus this will allow us to not swallow
OutputNotFoundError
as we do here.
OutputNotFoundError
is used as cause now, which doesn't make much sense to me, I would let it show itself, e.g. not catch it, after the change above.
tests/func/test_api.py
Outdated
|
||
# Remove cache to make foo missing | ||
shutil.rmtree(dvc_repo.cache.local.cache_dir) | ||
os.remove(repo_dir.FOO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this one needed? Because it will trigger https://github.com/iterative/dvc/pull/2566/files#diff-16d20341660af559adf7d9e7306c760dR458 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
dvc_repo.run( | ||
metrics_no_cache=[metric_file], cmd="python {}".format(code_file) | ||
) | ||
os.remove(metric_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this one needed for https://github.com/iterative/dvc/pull/2566/files#diff-16d20341660af559adf7d9e7306c760dR458 again? This seems weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I don't think there is another way to test it. We want to test whether dvc
will inform us that some output cannot be checked out. If the file is in a repo, it seems natural that it will be used, instead of checking whether it is in the cache, don't you think?
dvc/repo/__init__.py
Outdated
@contextmanager | ||
def open(self, path, remote=None, mode="r", encoding=None): | ||
"""Opens a specified resource as a file descriptor""" | ||
if not remote and self.tree.exists(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tree.exists()
will be True
if someone done checkout
and might not be what we need if dvc file then changed (e.g. by git checkout). Shouldn't we first look in cached outputs and then fallback to tree check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pared Good point. Or maybe we can improve this current approach so that it doesn't suffer from that issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could solve that either by checking a checksum, or prompting the user?
Checking the checksum would result in losing the advantage suggested by @Suor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we should more strictly go over all the possibilities here:
- get
out
at the start - if we have
out
and it's cached then we go to._open_cached()
- if we don't have out or have out, which is not cached, we go
.tree.open()
, no need to check for exists then.
There is never such thing as retrying tree open after cached open or vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I should have started with separating finding out from opening it. @efiop do you have any concerns about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pared sounds good to me.
dvc/exceptions.py
Outdated
class NotCachedOutputFileMissingError(DvcException): | ||
def __init__(self, path_info): | ||
super(NotCachedOutputFileMissingError, self).__init__( | ||
"Can't find '{}'. This file was not cached.".format(path_info) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both class name and message blur more than they explain. And, BTW, what's the value of distinguishing this from OutputFileMissing
?
I say we should stop inventing more exception classes. This complicates both our code and API.
os.remove(metric_file) | ||
|
||
with pytest.raises(NotCachedOutputFileMissingError): | ||
api.read(metric_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two tests are almost the same. I would say simply move:
os.remove(metric_file)
with pytest.raises(NotCachedOutputFileMissingError):
api.read(metric_file)
to the end of the first one and remove the second test.
Also, why do you create new code file, while we already have code.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code is required due to issue regarding original code described in #2574
dvc/repo/__init__.py
Outdated
def _open_out(self, path, remote=None, mode="r", encoding=None): | ||
out, = self.find_outs_by_path(path) | ||
|
||
if not out.use_cache and not out.exists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is wrong - if out is not using cache, but is missing we continue to try open it from cache, which doesn't make sense.
I would say we should check if out is cached before ._open_out()
then, rename ._open_out()
to ._open_cached()
and pass out
there instead of path.
dvc/repo/__init__.py
Outdated
@contextmanager | ||
def open(self, path, remote=None, mode="r", encoding=None): | ||
"""Opens a specified resource as a file descriptor""" | ||
if not remote and self.tree.exists(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we should more strictly go over all the possibilities here:
- get
out
at the start - if we have
out
and it's cached then we go to._open_cached()
- if we don't have out or have out, which is not cached, we go
.tree.open()
, no need to check for exists then.
There is never such thing as retrying tree open after cached open or vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
tests/func/test_api.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using
..., cmd="python -c '{}'".format(metric_code)
and not creating any file should be easier. This not that important though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will introduce it
@pared ok, one more question. So does this make |
@shcheklein We do have the ticket #2515 . I was just double-checking. |
Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.
Fixes #2507