Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating a raw diff, without generating deltas (patches) #736

Closed
jnareb opened this issue Nov 18, 2017 · 13 comments
Closed

Creating a raw diff, without generating deltas (patches) #736

jnareb opened this issue Nov 18, 2017 · 13 comments

Comments

@jnareb
Copy link

jnareb commented Nov 18, 2017

As far as I can see there currently is no way to request only tree-level diff, that is as if with git diff-tree or git diff --raw. If I am interested only in names of changed files, I don't want to pay the penalty for generating text of diff between files (generating patch).

I would also like to have recipe for equivalent of $ git diff-tree -r --name-only

@pypingou
Copy link

Once you have a diff, you can browse the patches and get the file names: http://www.pygit2.org/diff.html

@jnareb
Copy link
Author

jnareb commented Nov 20, 2017

The problem is of the performance. For example one way to get list of all files in the repository at given revision, recursively, that I have found recommended is [p.delta.old_file.path for p in repo.revparse_single('foo').tree.diff_to_tree()] - but in large enough repository it takes multiple seconds.

@ethomson
Copy link
Member

I'm not familiar with pygit2, and certainly not familiar with Python's FFI, but it looks like this API is designed to not ask for file patches to be generated, just to perform a tree diff. File patches should be lazily generated when you request them. Unless I've misunderstood what I'm read, I think that there's a couple of possibilities:

  1. pygit2 is inadvertently invoking file patch generation on the tree diff before handing it back to you.
  2. Your code is inadvertently invoking some method that requests file patch generation.
  3. Neither of these two things is happening, it's just slow for another reason. (libgit2 brain damage, incredibly slow marshaling of strings, etc, etc.)

I suspect it's not 2 and that your code is obvious, but would you mind sharing just so that we have a baseline to start from? It would be nice if one of the pygit2 maintainers could jump in and help with clarifying point 1. If that's no help then we may need to explore other options.

@ethomson
Copy link
Member

Also, I'm assuming that this repository is not public. But can you give us a scale of size? Is it one linux kernel's worth of code? Ten? :)

@jnareb
Copy link
Author

jnareb commented Nov 22, 2017

I'm not familiar with pygit2, and certainly not familiar with Python's FFI, but it looks like this API is designed to not ask for file patches to be generated, just to perform a tree diff. File patches should be lazily generated when you request them.

It is not documented whether file patches are lazily generated or not. Because of low performance of one specific use I guessed they were not, but maybe the problem lies in somewhere else.

The repository in question is any sufficiently large repository: I have tested it with JDT, SWT and AspectJ repositories.

I have done the following benchmarks:

#!/bin/sh

DEFAULT_REPO="."
DEFAULT_COMMIT="HEAD"
REPO=${1:-$DEFAULT_REPO}
COMMIT=${2:-$DEFAULT_COMMIT}

echo "pygit2 diff_to_tree()"
python -m timeit \
       -s "import pygit2; repo = pygit2.Repository('$REPO')" \
       "[p.delta.old_file.path for p in repo.revparse_single('$COMMIT').tree.diff_to_tree()]"

echo "pygit2 recursive walk"
python -m timeit \
       -s "import pygit2; repo = pygit2.Repository('$REPO')" \
       -s 'def walktree(tree, path=[]):' \
       -s '    result=[]' \
       -s '    for e in tree:' \
       -s '        if e.type == "blob":' \
       -s '            result.append("/".join(path+[e.name]))' \
       -s '        elif e.type == "tree":' \
       -s '           result.extend(walktree(repo[e.id], path+[e.name]))' \
       -s '    return result' \
       "walktree(repo.revparse_single('$COMMIT').tree)"

echo "GitPython traverse()"
python -m timeit \
       -s "import git; repo = git.Repo('$REPO')" \
       "[e.path for e in repo.commit('$COMMIT').tree.traverse() if e.type == 'blob']"

For small repository with few files the diff_to_tree() solution was around the same (but slower) than GitPython traverse(), but faster than running git ls-files -r. But for JDT the diff_to_tree() solution took around 3 seconds to get result, while GitPython took 0.8 of second (750 msec), and recursive solution took 0.04 seconds (40 msec).

@jdavid
Copy link
Member

jdavid commented Nov 22, 2017

Thanks for the script. But are they fair comparisons? I don't know GitPython, but tree.traverse() looks like the pygit2 recursive walk test, iterating over the commit's tree, not checking against another tree.

Anyway, the slow part is iterating over the Diff object, https://github.com/libgit2/pygit2/blob/master/src/diff.c#L416 There are 2 calls there, git_patch_from_diff and wrap_patch; wrap_patch should be optimized (PRs welcome) because it is indeed building the hunks. However from the tests I have done using the libgit2 repo, the call to git_patch_from_diff is the slowest, so don't expect a huge improvement.

I am not very familiar with this part of the libgit2 API. Maybe there is another way. It would be interesting to compare the pygit2 code against a C program that uses only libgit2.

By the way in pygit2 this part is all in C, no FFI.

@ethomson
Copy link
Member

I think that the opportunity here is to avoid doing the git_patch_from_diff unless patch data is actually being requested. You should be able to populate p.delta (file paths, etc) without generating the patch. If you could populate p.delta via git_diff_get_delta and then only populate patch data from git_patch_from_diff lazily, that would speedup the tree diff-only use case that @jnareb has.

@jnareb
Copy link
Author

jnareb commented Nov 22, 2017

Using commit.tree.diff_to_tree() to get equivalent to git ls-tree -r is not the only case where support for neither the raw output nor lazy generation of patch data decreases pygit2 performance. It is only the most striking one.

When comparing performance for the equivalent of git diff --name-only, that is getting only the names of changed files, in pygit2, GitPython and subprocess (calling git diff-tree -r --name-only) all three performances were either comparable, or pygit2 was up to an order of magnitude faster. Still, in my opinion the case where pygit2 is slower that calling git or using GitPython, as was the case for JDT repository, should not happen.

@jdavid
Copy link
Member

jdavid commented Nov 28, 2017

Okay. Right now in pygit2 iterating over Diff object returns a sequence of Patch objects, and from a Patch object you can get the delta:

for patch in diff:
    patch.delta

I think best would be to define 2 iterators:

for delta in diff.deltas
for patch in diff.patches

And keep the current one as a deprecated alias to diff.patches, for backwards compatibility.

Sounds good?

@jnareb
Copy link
Author

jnareb commented Nov 28, 2017

@jdavid this has also the advantage of generating patches (textual diff), which may be time consuming, on demand and as needed.

@jdavid
Copy link
Member

jdavid commented Dec 6, 2017

Just pushed Diff.deltas

Could you check @jnareb ?

@jdavid
Copy link
Member

jdavid commented Dec 24, 2017

Closing this. According to my tests using the new diff.deltas to get the paths is ~ 200x faster.

Will add a comment in the source to some day add diff.patches replacing diff as an iterable.

Thanks for reporting!

@jdavid jdavid closed this as completed Dec 24, 2017
@jdavid
Copy link
Member

jdavid commented Dec 24, 2017

This is my test code by the way:

import pygit2
from time import time

REPO = '/.../libgit2'
COMMIT = 'HEAD'

if __name__ == '__main__':
    repo = pygit2.Repository(REPO)
    diff = repo.revparse_single(COMMIT).tree.diff_to_tree()

    t0 = time()
    [p.delta.old_file.path for p in diff]
    print('patches %f' % (time()-t0))

    t0 = time()
    [d.old_file.path for d in diff.deltas]
    print('deltas  %f' % (time()-t0))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants