Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make pygit2 throw if tree of a commit is not found #682

Merged
merged 1 commit into from
Dec 20, 2016

Conversation

bisho
Copy link
Contributor

@bisho bisho commented Dec 19, 2016

Commit objects in git always have a tree_id associated, that points to the corresponding Tree object.

When the Tree object is missing, the repo is corrupted.

In those cases:

  • official git cli fatals with status code 128 and message:
    fatal: unable to read tree <hash>
  • libgit2 returns error GIT_ENOTFOUND when calling git_commit_tree()
  • pygit2 when accessing the tree by id with repo[commit.tree_id] raises a KeyError: <hash>

But on the other hand, on the commit object, rather than throwing and exception, pygit2 is swallowing the error returned by libgit2 and setting the <Commit object>.tree property to None.

This patch changes the behavior to raise an error in those cases.

Rationale:
None is arguably the wrong choice to encode an error condition, specially in python that is used heavily.

In particular this caused in our system to assume there was an empty tree, and the sync service that tails git repo changes decided to DELETE everything. The code was using None to represent empty tree, useful for example when we need to compare a path between two commits (the path might be non-existent at one of the commits you are comparing).

I think that in this case the right decision would be to raise since is an exceptional case, caused by a corrupted repo, is more consistent with other tools, and ensures user code does not take the wrong decisions.

For curiosity the corrupted repository can happen more commonly than expected. We run our repositories on a shared NFS filer, and one of our servers didn't have the lookupcache=positive option. This makes NFS cache the metadata (files on a directory for example) and use that for negative lookups (to deny existance of files). In this case, the commit object was on a directory not cached, so the commit was seen immediately, but the tree object was in a folder that was cached, the cache didn't contained the tree object, and thus for some seconds the tree was not existing and the repo was corrupted. Our sync service saw tree being None and decided to delete everything, causing a lot of issues down the way.

Repro steps:

  • Create a test repo with a couple of commits.
  • Find the tree_id for HEAD:
$ git cat-file -p HEAD
tree 3683f870be446c7cc05ffaef9fa06415276e1828
parent 37d122c3ffcaa77dbe12930c2a573648fdecee06
[...]
  • Delete the tree object:
    rm .git/objects/36/83f870be446c7cc05ffaef9fa06415276e1828
  • Try using that repo with pygit2:
    Before: repo.revparse_single('HEAD').tree will be None
    After: it repo.revparse_single('HEAD') raises GitError: Unable to read tree 3683f870be446c7cc05ffaef9fa06415276e1828

Commit objects in git always have a tree_id associated, that points
to the corresponding Tree object.

When the tree object is missing, the repo is corrupted.

In those cases:
* official git cli fatals with status code 128 and message:
  fatal: unable to read tree <hash>
* libgit2 returns error GIT_ENOTFOUND when calling git_commit_tree()
* pygit2 when accessing the tree by id with repo[commit.tree_id]
raises a KeyError: <hash>

But on the other hand, on the commit object, rather than throwing
and exception, pygit2 is swallowing the error returned by libgit2
and setting the <Commit object>.tree property to None.

None is arguable the wrong choice to encode an error condition,
specially in python that is used heavily.

In particular this caused in our system to assume there was
an empty tree, and the sync service that tails git repo changes
decided to DELETE everything. The code was using None to
represent empty tree, usefull for example when we need to
compare a path between two commits (the path might be non-existant
at one of the commits you are comparing).

I think that in this case the right decision would be to raise
since is an exceptional case, caused by a corrupted repo,
is more consistent with other tools, and ensures user code
does not take the wrong decissions.

For curiosity the corrupted repository can happen more commonly
than expected. We run our repositories on a shared NFS filer,
and one of our servers didn't have the lookupcache=positive
flag. This makes NFS cache the metadata (files on a directory
for example) and use that for negative lookups (to deny existance
of files). In this case, the commit object was on a directory
not cached, so the commit was seen immediately, but the tree
object was in a folder that was cached, the cache didn't
contained the tree object, and thus for some seconds the tree
was not existing and the repo was corrupted. Our sync service
saw tree being None and decided to delete everything, causing
a lot of issues down the way.
@jdavid jdavid merged commit 809fe33 into libgit2:master Dec 20, 2016
@bisho
Copy link
Contributor Author

bisho commented Dec 20, 2016

Thanks!

@jdavid
Copy link
Member

jdavid commented Dec 20, 2016

Thank you @bisho , great explanation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants