Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to pull after importing a git repo #3377

Closed
charlesbaynham opened this issue Feb 21, 2020 · 15 comments · Fixed by #3503
Closed

Failure to pull after importing a git repo #3377

charlesbaynham opened this issue Feb 21, 2020 · 15 comments · Fixed by #3503
Assignees
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do research

Comments

@charlesbaynham
Copy link
Contributor

Please provide information about your setup
DVC version(i.e. dvc --version), Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))

DVC 0.86.4, conda installation, Windows.

See https://discuss.dvc.org/t/importing-from-a-git-repo-then-pulling/320 for background.

After importing a directory from a git repository (not a dvc repository), dvc pull fails with ERROR: unexpected error - 'ExternalGitRepo' object has no attribute 'cache'.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 21, 2020
@charlesbaynham
Copy link
Contributor Author

Script to reproduce:

mkdir git_repo
cd git_repo
git init
mkdir a_dir
touch a_dir/test.txt
git add .
git commit -m "Init git repo"

cd -
mkdir dvc_repo
cd dvc_repo
git init
dvc init
git add .
git commit -m "Init"
dvc import ../git_repo a_dir
dvc pull

@efiop efiop added the p0-critical Critical issue. Needs to be fixed ASAP. label Feb 21, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 21, 2020
@efiop efiop added the bug Did we break something? label Feb 21, 2020
@efiop
Copy link
Contributor

efiop commented Feb 21, 2020

Thanks for the bug report @charlesbaynham ! We'll take a look ASAP.

@skshetry
Copy link
Member

@jorgeorpinel, @efiop, how should this behave for git repo? Not pull at all or like dvc update? Could not find anything on https://man.dvc.org/pull about this.

@charlesbaynham
Copy link
Contributor Author

Am I right in thinking that dvc repos don't put imported files into their cache / remote? So a pull will always need to access the original source?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Feb 25, 2020

@charlesbaynham it actually does put it in the cache. Continuing your script:

$ cat a_dir.dvc
...
  repo:
    url: ../git_repo
    rev_lock: 314393c2a7b82004302d5b7f219e724522db88ca
outs:
- md5: 59446444eae8d3c063f64fefd37b43b5
...
$ tree .dvc/cache 
.dvc/cache
├── 59
│   └── 446444eae8d3c063f64fefd37b43b5.dir
└── d4
    └── 1d8cd98f00b204e9800998ecf8427e
$ cat .dvc/cache/59/446444eae8d3c063f64fefd37b43b5.dir 
[{"md5": "d41d8cd98f00b204e9800998ecf8427e", "relpath": "test.txt"}]

But shouldn't push it to the remote storage, indeed.

@skshetry in this case, I don't understand why dvc pull is even trying to pull from this import stage, since the output is already in cache ^ (and workspace).

image

https://dvc.org/doc/command-reference/pull

I think that's the current p0 bug.

  • Whether there are other problems with Git-tracked imported files, we would need to extensively Q/A re-importing and updating them... And probably dvc fetch\pull and dvc push as well, just in case.

@efiop
Copy link
Contributor

efiop commented Feb 28, 2020

@charlesbaynham it actually does put it in the cache. Continuing your script:

It actually doesn't, just that .dir file, which is more of a temporary file than a proper cache in that particular case. So we will indeed need to have access to the original source.

@efiop
Copy link
Contributor

efiop commented Feb 28, 2020

@skshetry It should git clone (as we usually do) and just copy stuff over, same as if you would've re-imported it again.

@jorgeorpinel
Copy link
Contributor

It actually doesn't

OK. But in my script you can see that both .dvc/cache/59/446444e...b5.dir (a_dir) AND .dvc/cache/d4/1d8cd98... (a_dir/test.txt) are in cache... what am I missing? 🙂

@pared pared added the research label Mar 3, 2020
@skshetry skshetry self-assigned this Mar 16, 2020
@skshetry
Copy link
Member

skshetry commented Mar 16, 2020

Able to reproduce with following tests:

def test_pull_imports(tmp_dir, dvc, scm, git_dir):
    with git_dir.chdir():
        git_dir.scm_gen("foo", "foo", commit="first version")

    dvc.imp(fspath(git_dir), "foo")
    dvc.pull()

def test_pull_imports_erepo_dvc(tmp_dir, dvc, scm, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen("foo", "foo", commit="first version")

    dvc.imp(fspath(erepo_dir), "foo")
    dvc.pull()

Looks like it never worked before. dvc import already puts the file in the cache, and dvc update should be used for updating. Nevertheless, I'm trying to see what's going on.

@skshetry skshetry added p1-important Important, aka current backlog of things to do and removed p0-critical Critical issue. Needs to be fixed ASAP. labels Mar 16, 2020
@skshetry
Copy link
Member

@jorgeorpinel, current pull behavior is to download missing files from the remote storage to the cache. The error is obviously a bug, but, to be able to pull from the git-imported artifacts is a change in behavior.

Do we need to support this? Or, we should delegate it to dvc import/update for this?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Mar 18, 2020

The error is obviously a bug, but, to be able to pull from the git-imported artifacts is a change in behavior

How else can the bug be fixed? If theres something in the middle that would probably be a good start. Maybe just make pull ignore import stages? I'm not sure tbh, please check with @efiop and @shcheklein. Do we want pull to treat import stages like any other stages? I'm guessing yes if possible.

@charlesbaynham
Copy link
Contributor Author

Do we need to support this? Or, we should delegate it to dvc import/update for this?

Shouldn't it be that dvc pull retrieves the version of the git import specified by rev_lock in the dvc file, in contrast to dvc update / dvc import which git pulls the latest revision and updates rev_lock accordingly?

@skshetry
Copy link
Member

skshetry commented Mar 24, 2020

@charlesbaynham, #3503 implemented exactly how you described. dvc pull will fetch using specified rev_lock. dvc import can be used to import (or, forced re-import), and update will update rev_lock if rev moves. We have update --rev to move to a different revision.

@charlesbaynham
Copy link
Contributor Author

@skshetry @efiop Fantastic, thank you! I'll give it a try.

@charlesbaynham
Copy link
Contributor Author

Works like a charm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do research
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants