Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get: copy/download files tracked by Git #2837

Merged
merged 23 commits into from
Dec 9, 2019
Merged

Conversation

danihodovic
Copy link
Contributor

@danihodovic danihodovic commented Nov 22, 2019

Allows dvc get to copy regular files or directories.

fixes: #2515

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • πŸ“– Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addresses. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@@ -41,7 +41,8 @@ class TestDirFixture(object):
# in tests, we replace foo with bar, so we need to make sure that when we
# modify a file in our tests, its content length changes.
BAR_CONTENTS = BAR + "r"
CODE = "code.py"
CODE_DIR = "code"
CODE = "code/code.py"
Copy link
Member

@efiop efiop Nov 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use os.path.join πŸ™‚

Also, CODE is used in lots of tests, so moving it like that is dangerous. If you need some file like that in your test - just create it where needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't add any new files to this fixture, it is basically a remnant from pre-pytest times. You have 2 better choices now:

  • create whatever you need in test itself,
  • create a separate pytest fixture that creates it, if you need the same thing in several tests.

@@ -736,9 +736,9 @@ def test(self):
os.mkdir(dname)
foo = os.path.join(dname, self.FOO)
bar = os.path.join(dname, self.BAR)
code = os.path.join(dname, self.CODE)
code_dir = os.path.join(dname, self.CODE_DIR)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to modify unrelated tests πŸ™‚ Just create what you need for your tests and don't touch other tests.

@efiop
Copy link
Member

efiop commented Nov 23, 2019

@danihodovic Please take a look at failed tests.

Copy link
Member

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @danihodovic πŸ™ Please see a few comments above. πŸ™‚

@@ -48,6 +48,12 @@ class TestDirFixture(object):
)
UNICODE = "тСст"
UNICODE_CONTENTS = "ΠΏΡ€ΠΎΠ²Π΅Ρ€ΠΊΠ°"
REGULAR_DIR = "lib"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From TestDirFixture perspective, all files/dirs(DATA, DATA_DIR, FOO, BAR, etc) are regular and not tracked by dvc. You probably want to modify erepo fixture to have those files, not this global fixture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly do those files need to be tracked by dvc? I thought we were testing the retrieval of a normal file that is tracked by Git.

FOO, BAR are regular files, but they have an additional .dvc file which (I assume) represents the dvc format of the file. When running a breakpoint in a test I can find these files created on the filesytem. I want to test that dvc get works with files that don't have a *.dvc representation.

    src = erepo.REGULAR_FILE
    dst = erepo.REGULAR_FILE + "_imported"

    breakpoint()
    Repo.get(erepo.root_dir, src, dst)

    assert os.path.exists(dst)
    assert os.path.isfile(dst)
    assert filecmp.cmp(src, dst, shallow=False)

Creates the following structure

$ ls /tmp/dvc-test.15186.qxfon8q6.evynmKznxaWjUTntqBTJiP
.rw-rw-r-- dani dani  16 B  Sat Nov 23 21:39:30 2019 ο€–  тСст
drwxrwxr-x dani dani   4 KB Sat Nov 23 21:39:30 2019 ο„•  lib
.rw-rw-r-- dani dani  66 B  Sat Nov 23 21:39:30 2019 ξ˜†  code.py
.rw-rw-r-- dani dani 143 B  Sat Nov 23 21:39:30 2019 ο€–  foo.dvc
.rw-rw-r-- dani dani   3 B  Sat Nov 23 21:39:30 2019 ο€–  foo
.rw-rw-r-- dani dani   4 B  Sat Nov 23 21:39:30 2019 ο€–  bar
.rw-rw-r-- dani dani 143 B  Sat Nov 23 21:39:30 2019 ο€–  bar.dvc
.rw-rw-r-- dani dani 152 B  Sat Nov 23 21:39:30 2019 ο€–  data_dir.dvc
drwxrwxr-x dani dani   4 KB Sat Nov 23 21:39:30 2019 ο„•  data_dir
.rw-rw-r-- dani dani   6 B  Sat Nov 23 21:39:30 2019 ο€–  version
.rw-rw-r-- dani dani 147 B  Sat Nov 23 21:39:30 2019 ο€–  version.dvc

The added tests fail when the logic I've added is removed.

https://github.com/danihodovic/dvc/blob/feat/2515/dvc/repo/get.py#L39-L43

If I've still misunderstood the problem please inform me :)

Copy link
Member

@efiop efiop Nov 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly do those files need to be tracked by dvc? I thought we were testing the retrieval of a normal file that is tracked by Git.

They don't, that is why "regular file/dir" terminology doesn't make sense here πŸ™‚These files(FOO, BAR, etc) are dvc added in erepo fixture, and this base fixture you are modifying doesn't do any assumptions like that. That is why you should modify erepo fixture(or somewhere on top of it) to add files/dirs that are not dvc added, instead of modifying the unrelated base fixture and breaking unrelated tests. πŸ™‚

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should stop creating those gigantic fixtures. I suggest making them more granular so that you only create whatever you need for your tests. Now it is lots of unneeded work for most tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we can start by not adding things to erepo not to TestDirFixture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @Suor "One structure for all" also makes debugging harder.

@@ -115,7 +115,7 @@ def test(self):
[
(
self._root_dir,
["data_dir"],
["lib", "data_dir"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are modifying unrelated tests, please see the comment above πŸ™‚

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disregarding the comment above; the changes were made because the addition of the lib directory changes the filesystem tree. If you'd like me to avoid modifying other tests I'd need to create a separate fixture which extends TestDirFixture and adds regular files.

Copy link
Member

@efiop efiop Nov 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there is a misunderstading here πŸ™‚ I'm simply trying to say that the correct way to approach this is to modify exiting erepo fixture or build on top of it (you can even avoid using var names like FOO,BAR and use plain string filename instead, no problem) to add files/dirs that you need. That way you won't need to break and fix unrelated tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's clarify things to avoid confusion. How exactly do you want a basic smoke test setup to look?

Here are my assumptions:

  • We create a test directory into which we want to use dvc get to retrieve and store files. The test directory should probably exist in /tmp/. Let's call this directory A.
  • We want a remote git project to dvc get files from. Ideally this should be a local temporary directory as well to avoid going over the network for reproducibility and performance reasons. Let's call this directory B.
  • We want to retrieve a file called file.txt from directory B.
  • We use dvc get and afterwards we want to assert that the filename and contents of the newly created file in directory A is equal to the file in directory B.
         Dir A                  Dir B


     +----------+            +----------+
     |          |            |          |
     |          |            |          |
1.   |          +<-----------+ file.txt |
     |          |            |          |
     +----------+            +----------+



     +----------+            +----------+
     |          |            |          |
2.   |          |            |          |
     |  file.txt|            |  file.txt|
     |          |            |          |
     +----------+            +----------+


3.     assert A/file.txt == B/file.txt

The questions that remain for me before implementation:

e.g

import os
import tempfile

def test_get_regular_file(erepo):
    dir = tempfile.TemporaryDirectory()
    dst = os.path.join(dir.name, "code_imported")
    os.chdir(dir.name)
    Repo.get(erepo.root_dir, erepo.CODE, dst)
    assert filecmp.cmp(src, dst, shallow=False)

This would still require creating a nested directory in the erepo fixture in order to test dvc get with directories. Do you want me to create the nested directory or to create a new fixture specifically for these two tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is definitely a misunderstanding here, your current tests test_get_regular_file and test_get_regular_dir are totally fine, except for the fact that you are modifying TestDirFixture for no good reason. Just stop modifying it and either create those files in place as @Suor mentioned or modify erepo(or create a fixture based on it) to have those files. I'm not against the latter case, because we will soon have to test the same stuff for dvc import and we'll need the same files anyway, but we could do that when we actually need it.

Just to absolutely clarify this, here is an UNTESTED example of how an acceptable test_get_regular_file could look like:

def test_get_regular_file(repo_dir, erepo):
    src = "file"
    dst = src + "_imported"

    with open(os.path.join(erepo.root_dir, src), "w+") as fobj:
        fobj.write("something")

    erepo.scm.add([src])
    erepo.scm.commit("add file")

    Repo.get(erepo.root_dir, src, dst)

    assert os.path.exists(dst)
    assert os.path.isfile(dst)
    assert filecmp.cmp(src, dst, shallow=False)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use:

erepo.create(src, "something")

instead of with open... I guess.

Copy link
Member

@efiop efiop Nov 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor Right, forgot that erepo here is not a Repo instance. Thanks!

@efiop
Copy link
Member

efiop commented Nov 24, 2019

@danihodovic Oh, looks like we forgot to explicitly mention #2780 . We need to make sure that dvc get for git-tracked files will work when there is no default remote setup in the erepo. Could you please take a look? πŸ™‚ I.e.

#!/bin/bash

set -e
set -x

rm -rf mytest
mkdir mytest
cd mytest

mkdir erepo
pushd erepo
git init
dvc init
dvc run -O out 'echo foo > out'
git add .
git commit -m "init"
popd

dvc get erepo out

currently fails with

+ dvc get erepo out
ERROR: failed to get 'out' from 'erepo' - No DVC remote is specified in the target repository 'erepo': config file error: no remote specified. Setup default remote with
    dvc config core.remote <name>
or use:
    dvc pull -r <name>

but it shouldn't as we're trying to dvc get a file that is tracked by git and so it doesn't need the dvc remote to be setup. πŸ™‚

dvc/repo/get.py Outdated Show resolved Hide resolved
dvc/repo/get.py Outdated
Comment on lines 61 to 71
o = repo.find_out_by_relpath(path)

with repo.state:
repo.cloud.pull(o.get_used_cache())
o.path_info = PathInfo(os.path.abspath(out))
with o.repo.state:
o.checkout()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not on this PR, but on this code:

  • looks like repo and o.repo are the same, so we shouldn't confuse code reader by referencing it differently
  • hence we can use single with clause,
  • we don't need to use .pull() if we are to checkout it later from cache anyway, we should use .fetch()
  • if we use .pull() then we can move an artifact, which might be faster than checkout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor there is no cloud.fetch. cloud.pull doesn't checkout, so we are fine, even though the naming could be improved πŸ™‚

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we leave this as is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cloud.fetch? Let's move it into a separate refactoring ticket (for all of those points) or can solve it right away, doesn't look too complex.

# specific order so that we can make deterministic assumptions on the
# directory order.
# https://stackoverflow.com/a/54773660/2966951
dirs.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment explains what are we doing but not why, at least not in any specific form. Why do we need it anyway? Why can't we sort it where we need it? Why don't we sort files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests for tree equality fail because the order is not deterministic for directories. https://github.com/iterative/dvc/blob/master/tests/func/test_tree.py#L109-L128

I can sort it only for my tests, but I'd imagine most developers writing tests to compare tree structure would assume there is a deterministic order in the first place. It's unfortunate it's not consistent.

Why don't we sort files?

There are multiple files in the test directory and they preserved the sorted order. As soon as we added more than one directory the order would change between test runs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we don't modify the big fixtures this won't be an issue? BTW, generally we shouldn't change how code works to make tests pass, we should change tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸ’― agree with @Suor . When we encounter that the ordering matters, we just use sets or explicitly sort in our tests, there are a few places in our tests already that do that IIRC.

@@ -41,7 +41,8 @@ class TestDirFixture(object):
# in tests, we replace foo with bar, so we need to make sure that when we
# modify a file in our tests, its content length changes.
BAR_CONTENTS = BAR + "r"
CODE = "code.py"
CODE_DIR = "code"
CODE = "code/code.py"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't add any new files to this fixture, it is basically a remnant from pre-pytest times. You have 2 better choices now:

  • create whatever you need in test itself,
  • create a separate pytest fixture that creates it, if you need the same thing in several tests.

Comment on lines 40 to 41
src = erepo.REGULAR_FILE
dst = erepo.REGULAR_FILE + "_imported"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you use it exactly once. You should simply create a file in erepo and add it to git, no need to change existing or adding your fixtures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against of changing erepo or creating a new fixture on top because we will soon have to test the same stuff for dvc import. But it is also okay to do that when we actually need it and for now just create files in-place as noted in #2837 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a separate fixture even in the future.

@danihodovic
Copy link
Contributor Author

PTAL :)

re: #2837 (comment)

Running your test script works. I don't know if you want a test for this explicitly as the current tests should prove correctness because erepo doesn't have a git upstream by default.

dvc/command/get.py Outdated Show resolved Hide resolved
@efiop efiop requested review from jorgeorpinel, pared and a user November 26, 2019 18:39
dvc/repo/get.py Outdated
Comment on lines 61 to 71
o = repo.find_out_by_relpath(path)

with repo.state:
repo.cloud.pull(o.get_used_cache())
o.path_info = PathInfo(os.path.abspath(out))
with o.repo.state:
o.checkout()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we leave this as is?

tests/func/test_get.py Show resolved Hide resolved
def test_get_regular_file(erepo):
src = os.path.join(erepo.root_dir, "file")
dst = src + "_imported"
erepo.create(src, "hello")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work if we haven't even added this new file to git? We are probably not testing it hard enough or smth.

dst = src + "_imported"

os.mkdir(src)
erepo.create(os.path.join(src, "file"), "hello")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.create("directory/file", ...) will create a directory for you.

@jorgeorpinel jorgeorpinel changed the title get: copy regular files get: copy/download files tracked by Git Nov 26, 2019
@jorgeorpinel
Copy link
Contributor

What is "PTAL"?
Will dvc import also support Git tracked files?

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small things on the command output. Please also see my questions above. Thanks

dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff @danihodovic!
Two (actually one) minor things.

@@ -36,6 +36,31 @@ def test_get_repo_dir(repo_dir, erepo):
trees_equal(src, dst)


def test_get_regular_file(erepo):
src = os.path.join(erepo.root_dir, "file")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be relative path, that is also the reason for strange behaviour mentioned by @Suor here.
Also probably an idea for new issue ticket: forbid the user from providing the full path in get.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, erepo already has few things inside that can be get-ed:
for example:
in case of file: Repo.get(erepo.root_dir, erepo.FOO, "foo_imported")
in case of dir: Repo.get(erepo.root_dir, erepo.DATA_DIR, "dir_imported")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also probably an idea for new issue ticket: forbid the user from providing the full path in get.

There was a ticket recently from a user that uses full path to get his external local output like that, so we can't really forbid it for good πŸ™‚

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared

Also, erepo already has few things inside that can be get-ed:
for example:
in case of file: Repo.get(erepo.root_dir, erepo.FOO, "foo_imported")
in case of dir: Repo.get(erepo.root_dir, erepo.DATA_DIR, "dir_imported")

Yes, but those are dvc added in erepo fixture and this PR is aiming to test geting files that are tracked by git and not dvc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right, sorry. Still, for a file that could be erepo.code for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared sure, but we also need to test a dir too :)



def test_get_regular_dir(repo_dir, erepo):
src = os.path.join(erepo.root_dir, "directory")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, src cannot be full path.

assert filecmp.cmp(src, dst, shallow=False)


def test_get_regular_dir(repo_dir, erepo):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repo_dir is unnecessary here, can be removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared it is necessary, otherwise we will pollute repo root.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop I do not agree:
erepo already uses repo_dir, so temp test dir is already created, and we are back to it thanks to os.chdir inside erepo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared Unless I'm missing something, if we take a look at erepo fixture https://github.com/iterative/dvc/blob/0.71.0/tests/conftest.py#L173 we can see that it returns back to the saved dir, which is our repository root. And then down below we do Repo.get, so the file will be downloaded to our repository root, which is bad.

Copy link
Contributor

@pared pared Nov 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop I disagree:

  1. erepo uses repo_dir fixture, which gets created first and it creates temporary test dir and moves into it in _pushd
  2. erepo creates TestDvcGitFixture which creates another temporary dir, but thanks to 1.
    its _saved_dir is actually our repo_dir root.
  3. at the end of erepo creation, chdir moves back to repo_dir root

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conclusion gentlemen?

Copy link
Member

@efiop efiop Nov 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danihodovic The conclusion is that @pared tried it and it works, so he is right and removing is fine :) That being said, I'm a bit surprised by that behavior of erepo, don't remember if it was meant to do that. Seems like it creates a yet another test directory, which harmless but might not be what we need. Need to finally revisit our old unittest classes and new fixtures (I feel your pain @Suor πŸ™‚).

@danihodovic
Copy link
Contributor Author

@efiop @pared @Sour - test changes in 1ca336d
@jorgeorpinel docs changes in bb0925f

@danihodovic danihodovic force-pushed the feat/2515 branch 3 times, most recently from 96ba415 to 0baa307 Compare December 6, 2019 17:51
dvc/repo/get.py Show resolved Hide resolved
shutil.copytree(src_full_path, dst_full_path)
else:
shutil.copy2(src_full_path, dst_full_path)
except FileNotFoundError:
Copy link
Member

@efiop efiop Dec 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not really raсing against anything, so how about we

if not os.path.exists(src_full_path):
    raise PathOutsideRepoError(src, repo_url)

before "if os.path.isdir()" instead of wrapping it in try&except, to make it more linear? Looks like shutil.copy2(src_full_path, dst_full_path) is the only line that could raise this exception, as isdir will return False on non-existing path. Seems like it would make it easier to grasp. I don't have a strong opinion here. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like so?

def _copy_git_file(repo, src, dst, repo_url):
    src_full_path = os.path.join(repo.root_dir, src)
    dst_full_path = os.path.abspath(dst)

    if os.path.isdir(src_full_path):
        shutil.copytree(src_full_path, dst_full_path)
        return

    try:
        shutil.copy2(src_full_path, dst_full_path)
    except FileNotFoundError:
        raise PathOutsideRepoError(src, repo_url)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danihodovic Well, that would do it for me too πŸ˜„ Thanks! πŸ™‚

Copy link
Member

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A few minor comments up above.

@efiop efiop requested a review from Suor December 6, 2019 19:00
Copy link
Member

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! πŸŽ‰

tests/func/test_get.py Outdated Show resolved Hide resolved
dvc/repo/get.py Outdated Show resolved Hide resolved
Comment on lines +77 to +78
is_git_file = output_error and not os.path.isabs(path)
is_not_cached = output and not output.use_cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is not cached then this is also a git file, so this var names are confusing. Overall this logic is more complicated than necessary. It is simply either cached or a git managed file, so:

try:
    if out and out.use_cache:
        # do the pull and checkout
        ...
    else:
        # Non cached out outside repo, can't be handled
        if os.path.abspath(src):
            raise PathOutsideRepoError(...)

        # it's git-handled and already checked out to tmp dir
        # we should just copy, not a git specific operation
        ...
        copy_dir_or_file(src_full, dst_full)  
except FileNotFoundError:
    raise FileMissingError(...)

And again forgot about generic exception, basically:

$ dvc get http://some.repo non-existing-path
Can't find non-existing-path in some.repo neither as output 
nor as git handled file/dir

Message may be different, but the idea is that we don't know whether user tried to get an out or a git handled file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor We've agreed to not use FileMissingError as its message is not applicable here. Hence PathOutsideRepoError, which is more suitable. The current logic corresponds to what we have in Repo.open.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the naming.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And indeed output_error is not wrapped as it is in repo.open.

Copy link
Member

@efiop efiop Dec 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking closer at Repo.open, it indeed has a more clear implementation than this, as it doesn't introduce is_git_file confusion.

tests/func/test_get.py Show resolved Hide resolved
@efiop
Copy link
Member

efiop commented Dec 9, 2019

For the record: @danihodovic decided to resign, so we are merging as is and will be fixing on top.

@efiop efiop merged commit 26a9702 into iterative:master Dec 9, 2019
@Baranowski
Copy link
Contributor

@efiop, should I wait with #2862 until you fix or resume working on it?

@efiop
Copy link
Member

efiop commented Dec 10, 2019

@Baranowski Fixed get, please rebase. Very sorry for the delay πŸ™

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get: allow downloading regular files/dirs tracked by Git
7 participants