state of pygit2 #139
Comments
It definitely makes sense to hide the C details behind objects that feel native to python. There's a couple of details:
This might just be implicit in your proposal, but it'd be nice to have the types in there as well. |
We certainly need a more friendly and consistent API.
|
You are right about the efficiency part (in-memory object creation), but I do not like this exclusive way of creating objects (especially trees). I think there has to be a more consistent solution. Also, pygit2 is the only binding that forbids object instantiation. |
Can you show what object instantiation looks like in Rugged, for instance? |
repo = Rugged::Repository.new("path/to/repo")
oid = Rugged::Blob.create(repo, "a new blob content")
repo = Rugged::Repository.new("path/to/repo")
ref = Rugged::Reference.create(repo, "refs/heads/unit_test", "refs/heads/master")
repo = Rugged::Repository.new("path/to/repo")
entry = {:type => :blob,
         :name => "README.txt",
         :oid => "1385f264afb75a56a5bec74243be9b367ba4ca08",
         :filemode => 33188}
builder = Rugged::Tree::Builder.new
builder << entry
sha = builder.write(repo) |
I don't know Ruby, so I will learn a little from the code. For blobs and references I don't see any sensible difference between Ruby and Python. There is not a
If we wanted pygit2 to look more like Ruby we would use a static method (though it is longer and, in my opinion, ugly):
Regarding the tree builder, there is a difference in when the repo is passed: at the beginning (Python) or at the end (Rugged). I do not have a strong opinion here. In any case, the tree builder is fundamentally different from blobs, references, ... in that it needs to be constructed progressively before it is written. |
On Fri, Nov 23, 2012 at 05:20:25AM -0800, cholin wrote:
While I agree that a clean Python API is more than a bunch of C-API wrappers, I think that it's worth minimizing unnecessary differences to keep the mental overhead of translating between the APIs small. In this case, |
In my opinion the target users for pygit2 are not libgit2 developers. But maybe I'm wrong about that... libgit2's API is getting cleaner and I think there will soon be a major release (version 1.0). So it would be good to have a consistent API in the near future. Maybe there are some pygit2 hackers at http://git-merge.com/ in May in Berlin, so we can discuss this there. But I would prefer to come to a decision earlier! |
I am (slowly) working on the docs. Now going through every function to add arguments (issue #85), and to use PyDoc_STRVAR for every docstring. If you want to help, you are welcome. This issue is large. I think it would be simpler to split it by discussing and fixing one topic at a time (index file, references, etc.). I also like to look at the problem through the documentation, so I wanted to get the docs up-to-date first. PS : I didn't know about the Git merge event, I may be there ... |
I have some concrete ideas for this issue and for #71. Basically, when I tried pygit2 for the first time, it seemed quite cumbersome to handle simple things, e.g. looking up all branches for a given repository. Essentially, what I was missing was a pure-Python productivity layer on top of pygit2: something that makes it easy to get an overview quickly. To illustrate how this might work, I sketched the following piece of code: https://github.com/esc/pythonicgit2/blob/master/pythonicgit2.py And this reminds me, in principle, of the example shown in this pull-request. However, I am not yet convinced that this pythonic layer necessarily needs to be written as a C extension. |
I agree. One idea is to write a low-level API written in C, where there is about one Python wrapper for every libgit2 function; and then write the high level stuff in Python. The obvious place to start is the repository, by sub-classing it. For instance |
Commit 9ffc141 introduces a |
The As it seems it's important to have the same API as libgit2 in Python, I would prefer to use the same function names for the low-level API as well, or at least a similar naming convention ( |
I concur: mapping the libgit2 API directly to Python using the same function/method names is a good idea. For connecting high- and low-level repository objects you can use either inheritance or composition (facade pattern). The disadvantage of inheritance is that you get a bloated class which may expose a lot of low-level functionality. Using an underscore may be one way to mitigate this (maybe even by monkey patching at runtime). The disadvantage of composition is that you may need a lot of boilerplate to forward and transform the high-level calls into low-level calls. On the other hand, you are cleanly separating low-level from high-level. |
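To make the trade-off concrete, here is a minimal sketch of both options. It is only an illustration: LowLevelRepository is a hypothetical stand-in for the C-backed class, and its method names (reference_list, reference_lookup) are assumptions made up for this example, not pygit2's real API.

```python
class LowLevelRepository:
    """Hypothetical stand-in for a thin, C-backed wrapper (not pygit2's real API)."""

    def __init__(self, refs):
        self._refs = dict(refs)

    def reference_list(self):
        # Imagine this maps one-to-one onto a single libgit2 call.
        return sorted(self._refs)

    def reference_lookup(self, name):
        return self._refs[name]


class InheritedRepository(LowLevelRepository):
    """Inheritance: high-level helpers sit next to every low-level method."""

    def branches(self):
        prefix = "refs/heads/"
        return [n[len(prefix):] for n in self.reference_list() if n.startswith(prefix)]


class FacadeRepository:
    """Composition: only the high-level surface is exposed; calls are forwarded."""

    def __init__(self, low_level):
        self._low = low_level

    def branches(self):
        prefix = "refs/heads/"
        return [n[len(prefix):] for n in self._low.reference_list() if n.startswith(prefix)]

    def resolve(self, branch):
        # Forwarding boilerplate: translate the high-level argument first.
        return self._low.reference_lookup("refs/heads/" + branch)


# Made-up reference data, for illustration only.
refs = {
    "refs/heads/master": "a" * 40,
    "refs/heads/dev": "b" * 40,
    "refs/tags/v1": "c" * 40,
}
inherited = InheritedRepository(refs)
facade = FacadeRepository(LowLevelRepository(refs))
```

With inheritance, inherited.reference_lookup is still reachable by users (the "bloat"); with the facade it is hidden, at the cost of writing forwarding methods such as resolve by hand.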
On Sun, Mar 03, 2013 at 07:26:38AM -0800, Valentin Haenel wrote:
-∞ for monkeypatching namespace fixes ;). I don't see a problem with class bloat here. Command line Git has |
Problem is there are methods which are too simple to wrap. For instance
What I think does not provide any value. I would rather use the Maybe talking about low/high level APIs is wrong. Maybe it is just about mixing Python and C code to better handle complexity. Anyway, one thing I am concerned about is having clearer criteria for the API design and for the coding conventions, so that in the end pygit2 is consistent, and not a collage. |
My example above is likely wrong. There is value in implementing (in Python) a cache for Git objects, so the One advantage of having a low-level API where one Python method maps to one libgit2 function is that it makes it brain-dead simple to implement new features. If we go fully that way and decide to keep the same name, then we could keep the |
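As a rough illustration of the object cache idea, here is a minimal sketch in pure Python. ObjectCache and slow_lookup are hypothetical names invented for this example; slow_lookup stands in for whatever low-level lookup the binding provides. Caching by id is safe in principle because Git objects are content-addressed and therefore immutable.

```python
class ObjectCache:
    """Memoize objects by id in front of a slower lookup callable."""

    def __init__(self, lookup):
        self._lookup = lookup  # assumed: a callable taking an object id
        self._objects = {}

    def get(self, oid):
        try:
            return self._objects[oid]
        except KeyError:
            # Miss: ask the low-level layer once and remember the result.
            obj = self._objects[oid] = self._lookup(oid)
            return obj


calls = []


def slow_lookup(oid):
    """Hypothetical stand-in for a low-level object lookup."""
    calls.append(oid)
    return {"oid": oid}


cache = ObjectCache(slow_lookup)
first = cache.get("1385f264")
second = cache.get("1385f264")  # served from the cache, no second lookup
```

This sketch never evicts anything; a real cache for a large repository would need a size limit.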
On Sun, Mar 03, 2013 at 10:19:51AM -0800, J. David Ibáñez wrote:
Brain dead wrapping + intelligent Pythonization sounds good to me. |
+2 |
Hmm, in my opinion it's not that easy. For example, for iterators/generators we often use multiple git functions in each iteration step. But let's give it a try... |
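The concern about iterators can be sketched as follows. This is an assumption-laden illustration, not pygit2 code: FakeWalker.next_oid and lookup_commit are hypothetical stand-ins for two separate low-level calls (advance the walk, then load the object), and the Python generator combines them in each step.

```python
class FakeWalker:
    """Hypothetical low-level walker: next_oid() returns an id, or None when done."""

    def __init__(self, oids):
        self._oids = list(oids)

    def next_oid(self):
        return self._oids.pop(0) if self._oids else None


def lookup_commit(oid):
    """Hypothetical second low-level call: turn an id into an object."""
    return {"oid": oid, "message": "commit " + oid}


def walk(walker, lookup):
    """High-level generator: every iteration step makes two low-level calls."""
    while True:
        oid = walker.next_oid()  # call 1: advance the walk
        if oid is None:
            return
        yield lookup(oid)  # call 2: load the object for that id


commits = list(walk(FakeWalker(["c3", "c2", "c1"]), lookup_commit))
```

If each low-level call is wrapped one-to-one, the glue between them (the loop and the end-of-walk check) can live in Python, which is the "intelligent Pythonization" part.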
Have you looked at CFFI for wrapping the C API with little effort? https://cffi.readthedocs.org/en/release-0.6/#examples |
I only recently discovered cffi and to my knowledge there is no attempt to wrap libgit2 for Python with it. There is however an experimental cygit wrapper at: |
I did not know about cffi, nor was I aware that there is an active effort to develop cygit bindings. There are also the glib bindings, https://git.gnome.org/browse/libgit2-glib |
After spending a couple of days fighting with python's memory system, I took a look at CFFI. It's pretty neat, and it does make wrapping the C easier, but performance is worse than with the current codebase (though it is faster with pypy than cpython). I've been doing some performance measurements, and a walk of the git repo takes twice as long. Looking at the cProfile output, it looks like we're spending as much time creating our objects as doing the walk, which is disappointing. There's still some work left to be done perf-wise (we should be able to avoid copying data and creating objects in a few places) but we'd have to look at whether we're willing to take the performance hit for the ease of writing and extending the bindings. |
Thanks @carlosmn for taking the time to experiment with CFFI and look at the performance. In my opinion twice as long is too much. If we were starting pygit2 now we might take a different path. But we have not come this far to switch to CFFI now and give up that much performance. |
I have made a couple of changes to avoid making extra copies and whatnot, which brings the time for a walk of git.git plus extracting the author down to (rough measurements in seconds)
So current pygit2 adds about 0.2s to the raw time from the walk, and the cffi version 0.7. I'm still optimistic about improving the performance. |
Here's a fairer comparison, with the Linux repository. Instead of counting just the time for the walk, the C version also looks up the object, which is something we do in pygit2 (427782 commits, in seconds, with a warm filesystem cache).
So it's not twice as slow, but it is about 25-35% slower, which is a shame. Even though pypy needs to simulate refcounting, which it doesn't do internally, and other things to pretend that it's CPython, over the long run the jitter makes up for it. Using cffi on CPython means running more python code than with "pure" pygit2, so CPython with cffi does come last. Where the jitter shines is with cffi, as we do end up running a lot of native code, which is what we're doing by compiling the C code in pygit2. These figures should be taken with a pinch of salt, as they are in many ways a micro-benchmark that simply tests how fast we can create python objects and pump data out of libgit2. A "real" application would be doing much more with this data, such that the differences would become more noisy. The benchmark is essentially

    import pygit2

    repo = pygit2.Repository('linux/linux')
    i = 0
    for commit in repo.walk(repo.head.target, 0):
        i += 1
    del repo
|
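For anyone wanting to reproduce numbers like these, a small timing wrapper around such a loop could look like the sketch below. time_walk is a hypothetical helper, and the example times a plain range as a stand-in so that it runs without a repository; to time the real walk one would pass repo.walk(repo.head.target, 0) instead. It assumes Python 3.3 or later for time.perf_counter.

```python
import time


def time_walk(commits):
    """Count the items of an iterable and return (count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in commits:
        count += 1
    return count, time.perf_counter() - start


# Stand-in iterable; not a real commit walk.
count, elapsed = time_walk(iter(range(100000)))
```

As with the figures above, a single run of such a micro-benchmark is noisy; repeating it with a warm filesystem cache gives steadier numbers.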
Concerning the decision to make, I give little weight to the pypy numbers, because it is not the mainstream implementation of Python we should optimize for. What matters to me here is the 26.5% difference between pygit2 and pygit2-cffi, which still looks like too much. Of course that's half the picture; the other half is: how much simpler is the code with cffi? |
The code with cffi is both much smaller and simpler; most of the time we can simply proxy a call to the libgit2 function we want. We avoid all of the type definition boilerplate and documentation macros. We also get to write python instead of C for any decisions we need to make, which also simplifies some expressions. |
I took a quick look at your cffi branch to see what CFFI looks like (and it looks good). There are some fixable details, like it drops support for Python 2.6; but better to focus on the core subject for the decision to make. For instance, the new To the point, I am +1 to mixing CFFI and a C extension. CFFI would be used to implement new stuff, and why not to rewrite current features where performance is not an issue. I do think CFFI would be of great help to fill the gap between libgit2 and pygit2, that is to say: to implement the missing features and to stay up-to-date with changes in libgit2. Thanks again @carlosmn for this effort, you have been the driving force behind pygit2 for a while now. |
Walking the commit graph is probably the operation where we do the most back-and-forth between libgit2 and pygit2, and where we create the most objects in a short time, so this is where the largest differential would be found (unless you're iterating over all refs in a huge gerrit project). I would expect a benchmark of the index operations to show that the difference in performance is small enough for the reduced code size (especially as it's C code) to be worth it. The Mixing cffi and our custom C extension seems reasonable, though I wonder how tricky it would be. I don't think we currently use pure-python classes from within the extension, but we can work around that. If/When we merge this, we should definitely be implementing new features with cffi first and convert to custom C code if it gets painful performance-wise. Stuff that's IO-bound like remotes shouldn't need any C code on our side. |
I would like to see this implemented not as a huge PR, but one functional block at a time; this would help to better visualize the changes and simplify review. Yes, the remote code looks like a good candidate to start with. (I would put the ffi code in a separate C file, just to enable syntax highlighting in the editor.) |
Regarding support for Python 2.6 I am |
I'll see about writing remotes with cffi then. I'm not sure what you mean about the C file. All the C code we'd need should be a string with contents of a pseudo header for the stuff we want to import. Should that be in a file we read in at load time? |
Yes, that's what I mean, a |
I've gone with that approach, and loading it at runtime works fine, other than that I am having some problems getting the file copied to where the code can find it.
|
Just pushed a The unit tests pass with Python 3. But with Python 2 I get 9 errors like:
|
I was using Not sure where those errors come from yet. I see them with pypy but not with CPython 2.7. |
Time to close this one in my opinion. The largest change since this discussion started has been the partial move to cffi. Further discussion on this topic should start fresh with a new issue. |
With git_revparse_single I think we could improve object lookups a little bit. For example, repo['HEAD'] or repo.lookup_object('HEAD') would be much more convenient to use than repo.revparse_single('HEAD'). At the moment the pygit2 API looks a little confusing because we just write bindings for C functions. I would like to have a clean, well-structured, pythonic Repository class instead of just a blind collection of git_* functions. Something like the following (of course written in C...):
I don't know if it's really necessary to create all objects through Repository... Also, I don't like the naming differences for object creation: repo.create_X or repo.TreeBuilder().write().
What do you think? Any suggestions?
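The lookup shorthand suggested in this post could be sketched roughly as below. This is not the code from the original proposal (which is not preserved here) and not pygit2's implementation: FakeRepository is a hypothetical stand-in whose revparse_single only consults two dictionaries, FriendlyRepository is an invented name, and the object id is made up.

```python
class FakeRepository:
    """Hypothetical stand-in for the C-backed repository class."""

    def __init__(self, objects, refs):
        self._objects = dict(objects)
        self._refs = dict(refs)

    def revparse_single(self, spec):
        # Stand-in for a binding over git_revparse_single:
        # resolve a ref name to an id, then return the object.
        oid = self._refs.get(spec, spec)
        return self._objects[oid]  # raises KeyError for unknown specs


class FriendlyRepository(FakeRepository):
    """Adds the two spellings proposed in the post on top of revparse_single."""

    def lookup_object(self, spec):
        return self.revparse_single(spec)

    def __getitem__(self, spec):
        return self.revparse_single(spec)


commit_oid = "c0ffee" + "0" * 34  # made-up id, for illustration only
repo = FriendlyRepository(
    objects={commit_oid: "commit object placeholder"},
    refs={"HEAD": commit_oid},
)
head = repo["HEAD"]
```

All three spellings (repo['HEAD'], repo.lookup_object('HEAD'), repo.revparse_single('HEAD')) return the same object here; the sketch only shows that the convenience forms can be thin Python wrappers.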