Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pull: fails in V2 if original import was in V1 and specified a git tag #6255

Closed
tomaszpietruszka-globality opened this issue Jun 30, 2021 · 5 comments · Fixed by #6274
Closed
Assignees
Labels
awaiting response we are waiting for your reply, please respond! :) bug Did we break something? p1-important Important, aka current backlog of things to do

Comments

@tomaszpietruszka-globality
Copy link

tomaszpietruszka-globality commented Jun 30, 2021

Bug Report

Description

DVC 2.x does not seem to be backwards compatible with 1.x in cases where the dvc import specified a tag

Attempting to dvc pull after a dvc upgrade results in

ERROR: unexpected error - '_pygit2.Tag' object has no attribute 'tree'

Of course re-importing the same files using a new DVC version is a "solution", but the migration would be quite painful for us - large number of files in multiple places.

BTW, once a file has been re-imported using new DVC, the resulting .dvc file differs in 2 ways:

  • the rev_lock SHA
    • the new one has the SHA of the commit the tag refers to
    • the old one has the SHA of the tag object itself
  • the md5 in the first line of the file (not surprising)

Reproduce

  1. import a file using DVC 1.X
  2. Remove the data file - keeping the .dvc
  3. Upgrade DVC to 2.X
  4. dvc pull - fails with ERROR: unexpected error - '_pygit2.Tag' object has no attribute 'tree'

(the example below includes a path to an existing repo created for debugging and an existing tag within that repo)

~/dev ❯ mkdir test; cd test; mkvirtualenv test
~/dev/test ❯ pip install --quiet "dvc<2"
~/dev/test ❯ deactivate; workon test
~/dev/test ❯ dvc --version 
1.11.16
~/dev/test ❯ git init; dvc init
~/dev/test master +9 !8 ❯ dvc import git@github.com:tomaszpietruszka-globality/toy-dvc-registry.git subdir/images/owl_sticker.png --rev some_tag
Importing 'subdir/images/owl_sticker.png (git@github.com:tomaszpietruszka-globality/toy-dvc-registry.git)' -> 'owl_sticker.png'

To track the changes with git, run:

        git add .gitignore owl_sticker.png.dvc
~/dev/test master +9 !8 ?2 ❯ rm owl_sticker.png
~/dev/test master +9 !8 ?2 ❯ pip install --upgrade --quiet dvc; deactivate; workon test;
~/dev/test master +9 !8 ?2 ❯ dvc --version
2.4.3
~/dev/test master +9 !8 ?2 ❯ dvc pull
ERROR: unexpected error - '_pygit2.Tag' object has no attribute 'tree'

Expected

The imported file should get pulled again

Environment information

It's an issue of compatibility between versions - below showing the version after the upgrade.

Output of dvc doctor:
After the V2 upgrade:

$ dvc doctor
DVC version: 2.4.3 (pip)
---------------------------------
Platform: Python 3.7.8 on Darwin-19.6.0-x86_64-i386-64bit
Supports: http, https
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s6
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s6
Repo: dvc, git

Additional Information (if any):
Verbose log:

~/dev/test master ?1 ❯ dvc pull --verbose                                                                                                                                                                                        17s  test
2021-06-30 13:08:17,592 DEBUG: Check for update is enabled.
2021-06-30 13:08:17,644 DEBUG: Creating external repo git@github.com:tomaszpietruszka-globality/toy-dvc-registry.git@c9c612b36d7a803ea3e0856ec28e7900fa40b6a3
2021-06-30 13:08:17,644 DEBUG: erepo: git clone 'git@github.com:tomaszpietruszka-globality/toy-dvc-registry.git' to a temporary dir
2021-06-30 13:08:18,977 ERROR: unexpected error - '_pygit2.Tag' object has no attribute 'tree'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/command/base.py", line 50, in do_run
    return self.run()
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/command/data_sync.py", line 41, in run
    glob=self.args.glob,
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/__init__.py", line 51, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/pull.py", line 38, in pull
    run_cache=run_cache,
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/__init__.py", line 51, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/fetch.py", line 54, in fetch
    revs=revs,
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/__init__.py", line 408, in used_objs
    filter_info=filter_info,
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/stage/__init__.py", line 633, in get_used_objs
    for odb, objs in out.get_used_objs(*args, **kwargs).items():
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/output.py", line 855, in get_used_objs
    return self.get_used_external(**kwargs)
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/output.py", line 899, in get_used_external
    return dep.get_used_objs()
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/dependency/repo.py", line 107, in get_used_objs
    locked=locked, cache_dir=local_odb.cache_dir
  File "/Users/tomaszpietruszka/.pyenv/versions/3.7.8/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/external_repo.py", line 70, in external_repo
    repo = Repo(**repo_kwargs)
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/__init__.py", line 157, in __init__
    root_dir=root_dir, scm=scm, rev=rev, uninitialized=uninitialized
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/repo/__init__.py", line 104, in _get_repo_dirs
    fs = scm.get_fs(rev) if isinstance(scm, Git) and rev else None
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/scm/git/__init__.py", line 354, in get_fs
    tree_obj = self.pygit2.get_tree_obj(rev=resolved)
  File "/Users/tomaszpietruszka/.virtualenvs/test/lib/python3.7/site-packages/dvc/scm/git/backend/pygit2.py", line 206, in get_tree_obj
    tree = self.repo[rev].tree
AttributeError: '_pygit2.Tag' object has no attribute 'tree'
@tomaszpietruszka-globality
Copy link
Author

I think the problem is that here:
https://github.com/iterative/dvc/blob/master/dvc/scm/git/backend/pygit2.py#L216
we are assuming that the first element of the returned tuple is a commit, but I'm actually getting

(<pygit2.Object{tag:c9c612b36d7a803ea3e0856ec28e7900fa40b6a3}>, None)

We then proceed to treat it as a commit, which does not work.

The expected behaviour (I think) would be to get, in my example:

(<pygit2.Object{commit:b5f95c4dd49aa5843e6d0ae0113e13ebb8b92352}>, None)

(the tag object does have a .target attribute with the SHA we actually want btw)

I suspect it might be a pygit2 bug, as the resolve_refish should return a commit object as the first element:

https://github.com/libgit2/pygit2/blob/e2c0fdbfbd7f84374f8c7a3df256547b7ea99192/pygit2/repository.py#L294-L296

(haven't found an open issue there).

What's even more worrying - seems like this
commit, _ref = self.repo.resolve_refish(rev)
pattern is used several times in the pygit2 backend within dvc - perhaps more surprises are in store

@tomaszpietruszka-globality
Copy link
Author

fwiw, I've opened an issue in pygit2: libgit2/pygit2#1082

@pmrowla
Copy link
Contributor

pmrowla commented Jul 2, 2021

@tomaszpietruszka-globality I'm not able to reproduce this in DVC or with using pygit directly (tested with pygit 1.5.0 and 1.6.1). Can you confirm which pygit version you have installed through pip, and which libgit2 version it was built against (using pygit2.LIBGIT2_VER)?

test-pygit git:master  py:dvc ❯ git --no-pager show mytag
tag mytag
Tagger: Peter Rowlands <peter@pmrowla.com>
Date:   Fri Jul 2 12:23:59 2021 +0900

annotated tag

commit a17cfb86bc8da41f24222185277fdde4824bf3d3 (HEAD -> master, tag: mytag)
Author: Peter Rowlands <peter@pmrowla.com>
Date:   Fri Jul 2 12:23:48 2021 +0900

    foo

diff --git a/foo b/foo
new file mode 100644
index 0000000..257cc56
--- /dev/null
+++ b/foo
@@ -0,0 +1 @@
+foo

test-pygit git:master  py:dvc ❯ pip freeze|grep pygit2
pygit2==1.6.1

test-pygit git:master  py:dvc ❯ python
Python 3.8.10 (default, May  4 2021, 03:06:52)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pygit2
>>> pygit2.LIBGIT2_VER
(1, 1, 0)
>>> repo = pygit2.Repository(".")
>>> commit, ref = repo.resolve_refish("mytag")
>>> commit
<pygit2.Object{commit:a17cfb86bc8da41f24222185277fdde4824bf3d3}>
>>> ref
<_pygit2.Reference object at 0x10b2aa3d0>
>>> str(commit)
'<pygit2.Object{commit:a17cfb86bc8da41f24222185277fdde4824bf3d3}>'
>>> str(ref)
'<_pygit2.Reference object at 0x10b2aa3d0>'

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Jul 2, 2021
@tomaszpietruszka-globality
Copy link
Author

@pmrowla I believe I'm using the same versions:

~/dev/dvc-test-import master ?1 ❯ pip freeze | grep pygit
pygit2==1.6.1
~/dev/dvc-test-import master ?1 ❯ python -c "import pygit2; print(pygit2.LIBGIT2_VER)" 
(1, 1, 0)

The key thing is: the problem does not manifest when resolving the name of the annotated tag, it manifests when resolving the SHA of the annotated tag.

It's probably a very rare use case, because why would someone use the SHAs of tags... except that's what DVC 1.x stores as rev_lock, if a name of an annotated tag was given in --rev.

Another way of reproducing, purely with pygit2:
(please note the 3rd command includes the output of the second one (the tag SHA) - so requires a manual edit when reproducing):

~ ❯ mkdir repo; cd repo; git init; touch file; git add file; git commit -m "first commit"; git tag -a "new_tag" -m "tag message"                                                                                                      test
Initialized empty Git repository in /Users/tomaszpietruszka/repo/.git/
[master (root-commit) f5c40fd] first commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file
~/repo master ❯ git show-ref --tags                                                                                                                                                                                                   test
b3f85c779d0a9fae876284efd166cd2af48df819 refs/tags/new_tag
~/repo master ❯ python -c "from pygit2 import Repository; r = Repository('.'); print(r.resolve_refish('new_tag')); print(r.resolve_refish('b3f85c779d0a9fae876284efd166cd2af48df819'))"                                               test
(<pygit2.Object{commit:f5c40fd1f5806fc9a2646697497bf7b3e7bf9072}>, <_pygit2.Reference object at 0x10452bcb0>)
(<pygit2.Object{tag:b3f85c779d0a9fae876284efd166cd2af48df819}>, None)

@pmrowla
Copy link
Contributor

pmrowla commented Jul 2, 2021

That makes sense. I'm not sure if this is actually a pygit2 bug, or just an incorrect docstring explanation of the resolve_refish behavior (will have to wait and see how they respond in the pygit repo).

The problem here is that the refish isn't a refname, it's a SHA - and the SHA is for the annotated tag object, not the commit pointed to by the annotated tag. I'm guessing that the underlying libgit2 function is always supposed to return the actual object in the case that it is passed a SHA, which means that for this case resolve_refish should probably be returning the annotated tag object (and not a commit).

Either way, we can handle this in DVC, whether or not it ends up being fixed upstream.

@efiop efiop added the bug Did we break something? label Jul 2, 2021
@efiop efiop added the p1-important Important, aka current backlog of things to do label Jul 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting response we are waiting for your reply, please respond! :) bug Did we break something? p1-important Important, aka current backlog of things to do
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants