Creating a raw diff, without generating deltas (patches) #736
Once you have a diff, you can browse the patches and get the file names: http://www.pygit2.org/diff.html
The problem is performance. For example, one way I have found recommended to get a list of all files in the repository at a given revision, recursively, is
I'm not familiar with pygit2, and certainly not familiar with Python's FFI, but it looks like this API is designed not to ask for file patches to be generated, just to perform a tree diff. File patches should be lazily generated when you request them. Unless I've misunderstood what I've read, I think that there are a couple of possibilities:
I suspect it's not 2 and that your code is obvious, but would you mind sharing it just so that we have a baseline to start from? It would be nice if one of the pygit2 maintainers could jump in and help clarify point 1. If that's no help then we may need to explore other options.
Also, I'm assuming that this repository is not public. But can you give us a sense of its size? Is it one Linux kernel's worth of code? Ten? :)
It is not documented whether file patches are lazily generated or not. Because of the low performance of one specific use case I guessed they were not, but maybe the problem lies somewhere else. The repository in question is any sufficiently large repository: I have tested it with the JDT, SWT and AspectJ repositories. I have done the following benchmarks:
#!/bin/sh
DEFAULT_REPO="."
DEFAULT_COMMIT="HEAD"
REPO=${1:-$DEFAULT_REPO}
COMMIT=${2:-$DEFAULT_COMMIT}
echo "pygit2 diff_to_tree()"
python -m timeit \
-s "import pygit2; repo = pygit2.Repository('$REPO')" \
"[p.delta.old_file.path for p in repo.revparse_single('$COMMIT').tree.diff_to_tree()]"
echo "pygit2 recursive walk"
python -m timeit \
-s "import pygit2; repo = pygit2.Repository('$REPO')" \
-s 'def walktree(tree, path=[]):' \
-s '    result=[]' \
-s '    for e in tree:' \
-s '        if e.type == "blob":' \
-s '            result.append("/".join(path+[e.name]))' \
-s '        elif e.type == "tree":' \
-s '            result.extend(walktree(repo[e.id], path+[e.name]))' \
-s '    return result' \
"walktree(repo.revparse_single('$COMMIT').tree)"
echo "GitPython traverse()"
python -m timeit \
-s "import git; repo = git.Repo('$REPO')" \
"[e.path for e in repo.commit('$COMMIT').tree.traverse() if e.type == 'blob']"
For a small repository with few files the
Thanks for the script. But are they fair comparisons? I don't know GitPython, but
Anyway, the slow part is iterating over the
I am not very familiar with this part of the libgit2 API. Maybe there is another way. It would be interesting to compare the pygit2 code against a C program that uses only libgit2. By the way, in pygit2 this part is all in C, no FFI.
I think that the opportunity here is to avoid doing the
Using When comparing performance for the equivalent of
Okay. Right now in pygit2 iterating over
I think it would be best to define two iterators:
And keep the current one as a deprecated alias to Sounds good?
@jdavid this also has the advantage that patches (textual diffs), which may be time-consuming to generate, are produced only on demand, as needed.
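To illustrate the on-demand idea in the comment above, here is a minimal, library-independent sketch. It does not use pygit2 or libgit2; the `Delta` class, its `patch` property and the `patch_computed` flag are hypothetical names made up for this example, and the standard-library `difflib` stands in for the real patch generator.

```python
import difflib
from functools import cached_property


class Delta:
    """Hypothetical stand-in for a file-level change (not a pygit2 class)."""

    def __init__(self, path, old_text, new_text):
        self.path = path
        self._old = old_text
        self._new = new_text
        # Flag used only to observe when the expensive step actually runs.
        self.patch_computed = False

    @cached_property
    def patch(self):
        # The expensive part: a textual diff, built on first access only
        # and cached afterwards.
        self.patch_computed = True
        lines = difflib.unified_diff(
            self._old.splitlines(keepends=True),
            self._new.splitlines(keepends=True),
            fromfile="a/" + self.path,
            tofile="b/" + self.path,
        )
        return "".join(lines)


deltas = [
    Delta("README", "hello\n", "hello world\n"),
    Delta("setup.py", "version = 1\n", "version = 2\n"),
]

# Listing the changed file names never touches the patch text.
names = [d.path for d in deltas]

# Asking for one patch computes only that one.
readme_patch = deltas[0].patch
```

With this shape, a caller who only wants file names pays nothing for textual diffs, and a caller who wants one patch pays for exactly one.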
Just pushed Could you check @jnareb ?
Closing this. According to my tests using the new Will add a comment in the source to some day add Thanks for reporting!
This is my test code by the way:
As far as I can see, there is currently no way to request only a tree-level diff, that is, as if with git diff-tree or git diff --raw. If I am interested only in the names of changed files, I don't want to pay the penalty of generating the text of the diff between files (generating a patch).
I would also like to have a recipe for the equivalent of
$ git diff-tree -r --name-only
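For readers unfamiliar with what a tree-level, names-only diff involves, the following is a toy model of the idea, assuming trees are plain nested dictionaries. It is not the pygit2 or libgit2 API, and `changed_paths` and the tuple layout are hypothetical. It shows why no file content needs to be read: two entries with the same object id are identical, so whole unchanged subtrees can be skipped.

```python
def changed_paths(old, new, prefix=""):
    """Return the sorted paths of files that differ between two toy trees.

    A tree is a dict mapping entry name -> (object_id, subtree), where
    subtree is a nested dict for a directory and None for a file.
    """
    paths = []
    for name in sorted(set(old) | set(new)):
        old_id, old_sub = old.get(name, (None, None))
        new_id, new_sub = new.get(name, (None, None))
        if old_id == new_id:
            # Same id means same content: skip without looking inside.
            continue
        path = prefix + name
        old_is_file = old_id is not None and old_sub is None
        new_is_file = new_id is not None and new_sub is None
        if old_is_file or new_is_file:
            # A file was added, removed or modified on at least one side.
            paths.append(path)
        if old_sub is not None or new_sub is not None:
            # At least one side is a directory: descend, like the -r flag.
            paths.extend(changed_paths(old_sub or {}, new_sub or {}, path + "/"))
    return sorted(paths)


old_tree = {
    "README": ("a1", None),
    "docs": ("t9", {"index.rst": ("d1", None)}),
    "src": ("t1", {"main.py": ("b1", None), "util.py": ("c1", None)}),
}
new_tree = {
    "README": ("a1", None),
    "docs": ("t9", {"index.rst": ("d1", None)}),
    "src": ("t2", {"main.py": ("b2", None), "new.py": ("e1", None), "util.py": ("c1", None)}),
}

result = changed_paths(old_tree, new_tree)
```

Here `docs` is never descended into because its id is unchanged, which is the property that makes a raw tree diff cheap compared with generating patches.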