New implementation of Git #227
Tomorrow I come back to Cambridge, so it will be easier to discuss the next steps of
To be clear, this PR needs a strong knowledge of what Git is. I will not explain what a tree, a blob, or the PACK file is. I will give a talk in India about that.
Wait, what is it?
Some people from the MirageOS team know me; I started in February to improve
This work will (I hope) fix some bugs about the memory consumption of
For all of this, we need to rethink the decoding of a Git object and, fortunately, most Git objects live in the PACK file. So changing the implementation of the PACK file decoder will (I hope) fix this kind of bug.
However, when we write a new Git object, we write a loose Git object. That is the purpose of
Now, this implementation comes with delta-ification (I don't know the right word for this computation, but yes... deal with it) and compresses the PACK file with it. In some tests, we produce the same PACK file as
So, with this compression and the possibility to store all objects in one PACK file, we can consider that we are close to implementing the
Finally, the PACK file is the key to the mystery of the chocolate, and the key to Git. Indeed, when you clone a Git repository, you receive a PACK file. When you push to a repository, you send a PACK file. So we need to take care with the implementation of the PACK file - I will explain precisely what my implementation is.
So for all of this, I refactored the
To prove my work, ALL COMMITS OF THIS PR WERE SENT BY MY IMPLEMENTATION!
Finally, we can restart the implementation of the HTTP protocol in the right way. So if you read all of this, you could say: oh wait, but you missed some of the goals we expected! Like, yes, I have not yet implemented the HTTP protocol and yes, the
At first, I started by implementing only the PACK file in another project. Its name is
So, it's time to explain the strong constraints on my implementation:
The last two points are very important because, in Git, we may handle some very big files (blobs). It's not realistic to store the entire PACK file in memory, compute some Git objects in memory, and keep other allocated data at the same time.
I wrote a little article (in French, sorry) about that which says something like:
Because the state will be in the minor heap, it will be collected quickly. So the final point is: the decoding of the PACK file never allocates any buffer in the major heap. Only the client is able to allocate what is needed to decode a PACK file. In truth, we allocate a little buffer of
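To make this idea concrete, here is a minimal sketch (with hypothetical names, not the PR's actual interface) of a non-blocking decoder where every buffer is owned by the client:

```ocaml
(* Hypothetical sketch of a non-blocking PACK decoder: it never allocates
   by itself, the client hands it every buffer it needs. *)
module type PACK_DECODER = sig
  type decoder

  type result =
    | Await             (* the decoder needs more input *)
    | Flush of int      (* [n] bytes of an object are available in [dst] *)
    | Object of string  (* an object (identified by its hash) is complete *)
    | Error of string

  (* [ztmp] (temporary buffer for inflation) and [window] (inflation
     window) are allocated once, by the client. *)
  val decoder : ztmp:Cstruct.t -> window:Cstruct.t -> decoder

  val eval : decoder -> src:Cstruct.t -> dst:Cstruct.t -> result
end
```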
However, we have the delta-ification. The delta-ification is:
So, we either need a second pass to construct all delta-ified objects, or we need the whole PACK file to resolve them. I provide both of these ways to reconstruct any Git object from a PACK file.
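For readers who have not met it before, a PACK delta is just a list of two kinds of instructions; here is a minimal sketch of the idea (the constructor names are mine, not Git's):

```ocaml
(* A delta rebuilds an object from a base: either copy a slice of the
   base, or insert raw bytes carried by the delta itself. *)
type hunk =
  | Insert of string                      (* raw bytes stored in the delta *)
  | Copy of { offset : int; len : int }   (* slice of the base object *)

let apply ~base (hunks : hunk list) : string =
  let buf = Buffer.create (String.length base) in
  List.iter
    (function
      | Insert s -> Buffer.add_string buf s
      | Copy { offset; len } -> Buffer.add_substring buf base offset len)
    hunks;
  Buffer.contents buf
```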
But, again, it's about memory consumption. If we want to construct the new Git object, we need to keep the list of
You can understand that, if you implement this naively, your computer will burn.
So, my implementation is like a toolbox to deserialize any Git object while trying to avoid any memory allocation. If you know the size of the biggest Git object in your PACK file, and if you know the deepest delta-ification chain of your PACK file, you can allocate strictly what is needed to decode any object of this PACK file (delta-ified or not). It's an unpleasant game between all of your buffers to store the information (metadata, list of hunks, Git base object) strictly needed to construct the requested Git object. Otherwise, if you don't know one of these pieces of information, my decoder will allocate what is needed.
So, for all of that, I think my implementation will fix the bug about memory consumption, because we can know strictly what is needed to construct any Git object. However, to obtain that information, you need to do one pass over the PACK file and some computation, and this has a cost.
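As a hypothetical illustration of "allocate strictly what is needed": once a first pass has computed the size of the biggest object and the depth of the longest delta chain, two client-owned buffers are enough to ping-pong between a base and the object rebuilt on top of it:

```ocaml
(* Hypothetical sketch: pre-allocate exactly the buffers needed to
   resolve any object of the PACK file. *)
type buffers =
  { base : Cstruct.t    (* holds the current base object *)
  ; target : Cstruct.t  (* receives the reconstructed object *)
  ; depth : int }       (* bound on the delta-chain recursion *)

let allocate ~biggest_object ~deepest_chain =
  { base = Cstruct.create biggest_object
  ; target = Cstruct.create biggest_object
  ; depth = deepest_chain }
```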
If you want a concrete example of all of this, you can look at the
I think this is the most interesting part of my job: trying to encode any Git object to a new PACK file and computing a good application of the delta-ification. But the specification of this hard computation is only described in an IRC discussion between one guy and Linus Torvalds... I have seen better specifications than that.
But, with @samoht, we tried many times to get a good result and to stay close to the implementation of
The key to the implementation of the delta-ification is the heuristic for deciding which object is the best base for delta-ifying the current object. I don't want to explain the details of this heuristic (you can read the IRC discussion!), but when we find a good base object, we compute the delta-ification.
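For the curious, here is a rough sketch of what such a window-based heuristic can look like (this is my own illustration, not ocaml-git's code; `delta_size` is an assumed estimator of the delta cost):

```ocaml
(* For each object, try the [window] previous objects as delta bases and
   keep the one producing the smallest delta, refusing chains deeper
   than [max_depth]. Returns, for each object, the index of its base. *)
let choose_base ~delta_size ~window ~max_depth (objs : string array) =
  let n = Array.length objs in
  let depth = Array.make n 0 in
  let base = Array.make n None in
  for i = 0 to n - 1 do
    let best = ref None in
    for j = max 0 (i - window) to i - 1 do
      if depth.(j) + 1 <= max_depth then begin
        let size = delta_size ~base:objs.(j) ~target:objs.(i) in
        match !best with
        | Some (_, best_size) when size >= best_size -> ()
        | _ -> best := Some (j, size)
      end
    done;
    (match !best with
     | Some (j, size) when size < String.length objs.(i) ->
       (* delta-ifying is worth it only if it is smaller than the object *)
       base.(i) <- Some j;
       depth.(i) <- depth.(j) + 1
     | _ -> ())
  done;
  base
```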
Before, I believed
But, finally, we produced a good PACK file and, as I said, sometimes we produce a better PACK file than
And, after this delta-ification, you need to re-order the list of your Git objects (a topological sort) because Git has some assertions about the format of the PACK file (understandable only from the only exhaustive specification of Git: the C code).
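A sketch of this re-ordering, assuming `base.(i)` holds the delta base chosen for object `i` (as in the previous sketch): every base must be emitted before the deltas built on top of it.

```ocaml
(* Depth-first walk over the delta chains: visit the base of an object
   before the object itself, so bases appear first in the result. *)
let topological_order (base : int option array) =
  let n = Array.length base in
  let seen = Array.make n false in
  let order = ref [] in
  let rec visit i =
    if not seen.(i) then begin
      seen.(i) <- true;
      (match base.(i) with Some j -> visit j | None -> ());
      order := i :: !order
    end
  in
  for i = 0 to n - 1 do visit i done;
  List.rev !order
```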
Finally, we have an encoder (in the same style as the decoder) to encode this list to a PACK file. To be clear, at this point we use a lightweight representation of the Git objects. You can understand that it's not possible to manipulate something like 4,000,000 commits (and trees, and blobs, and tags, ...) in memory.
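A hypothetical shape for such a lightweight entry: everything the encoder needs to know about an object, without keeping its content in memory.

```ocaml
(* Illustrative only: the real representation in the PR may differ. *)
type entry =
  { hash : string                              (* object identifier *)
  ; kind : [ `Commit | `Tree | `Blob | `Tag ]
  ; length : int                               (* inflated length, for the entry header *)
  ; delta : string option                      (* hash of the chosen base, if delta-ified *)
  }
```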
We have a good implementation of the PACK file. It's not perfect; for example, we can still optimize the serialization (in the same way as
Thanks! I will try to review all the patches shortly, but first a few questions/remarks:
crc32: please keep in mind that xapi-project/crc is C code, and requires effort to get it running on all platforms
performance: I'd be happy to see some numbers (I don't have any) -- esp. whether a bytes/string backed
About CRC-32, I took the previous implementation available in
Note: A good point to debug,
We have different ways to fix this:
For the first and the second ways, it will be easy to switch between a
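For illustration, a portable table-driven CRC-32 (the IEEE polynomial used by the PACK index) fits in a few lines of plain OCaml, which is what makes such a switch cheap; a sketch:

```ocaml
(* Standard CRC-32 (polynomial 0xEDB88320), table-driven. *)
let table =
  Array.init 256 (fun n ->
      let c = ref (Int32.of_int n) in
      for _i = 0 to 7 do
        c :=
          if Int32.logand !c 1l <> 0l
          then Int32.logxor 0xEDB88320l (Int32.shift_right_logical !c 1)
          else Int32.shift_right_logical !c 1
      done;
      !c)

(* [digest ~crc s] updates the running checksum [crc] with the bytes of
   [s]; start from the default [0l] for a fresh checksum. *)
let digest ?(crc = 0l) s =
  let crc = ref (Int32.logxor crc 0xFFFFFFFFl) in
  String.iter
    (fun chr ->
      let i =
        Int32.to_int
          (Int32.logand
             (Int32.logxor !crc (Int32.of_int (Char.code chr)))
             0xFFl)
      in
      crc := Int32.logxor table.(i) (Int32.shift_right_logical !crc 8))
    s;
  Int32.logxor !crc 0xFFFFFFFFl
```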
I will take my flight :) !
I've done a first pass on the patches (and I expect to have a few more :p)
First of all, this is great, thanks very much for your impressive effort. The code is clean and I can mostly follow what happens :-) I am really looking forward to merging your PR. But before that, a few more comments:
- Please keep/update existing copyright headers to add your name (check that all the code that you committed here is compatible with MIT first);
- Please update your `XXX` comments to clarify whether they are normal comments (in that case, simply remove them) or TODOs. If one is an important TODO, it would be great to open an issue in the repo to track it and explain its severity (this can be done after the PR is merged);
- The library should remain split between what is Unix-specific and what is not. Currently `git` is the core library and is independent of any backend, `git-unix` is the Unix backend and `git-mirage` is the MirageOS backend. Please do not bring Unix-only dependencies (and non-portable C code) into the core. So could you try to split your changes between
- There are a few places where you wrote a hex converter again and again. Could you use
- I haven't seen where you are using `digestif` (although you seem to depend on the C backend). Could you depend on the portable bits and let the final link choose between the C and OCaml backends?
- Please use `Fmt` as much as possible instead of writing the format combinators manually :-) (Ideally, all uses of `Format.*` should be replaced by `Fmt`);
- Please do not use `Printexc.record_backtrace true` in the library code. The only place where it is fine to set it is in
- What about the missing modules: `search.mli`, etc.? Do you have any plans to add them back? :-)
That's it for now; I will make a new pass after you make the requested changes :-) And thanks again for the huge work!
From the commit 795c11b, we can switch between a

```ocaml
module Digest : sig
  type buffer = Cstruct.t

  val feed : ctx -> buffer -> unit
  ...
end

type hex = string

val of_hex : hex -> t
val to_hex : t -> hex
val of_string : string -> t
val to_string : t -> string
...
```
I just implemented the atomic operations on references (as @samoht requested, as a deep requirement of
I protected all mutable operations on the references, but I need to explain why I added an(other) functor
In this design, reads and writes are all done in a non-atomic way. That means that, in a concurrent context, we can partially write a Git object, then switch to another process which partially reads another (or the same) Git object. About
Unfortunately, it's not the same for the
I don't have any strong opinion about this choice (IMHO, I prefer the first one, which is already available via
In my shower time, I asked myself about the existence of the
In the unix system (
However, in the MirageOS system, this is not the case.
The link between my previous comment and this point is the choice between some assumptions about how we implement the lock logic (highlighted by the
It's a good argument for the second proposition, but I'm still open to other opinions, and it's the weekend now.
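To fix the vocabulary of this discussion, here is a minimal sketch (hypothetical names, not the PR's code) of what the lock functor argument can look like, with an in-memory flavour; a Unix flavour could implement `with_lock` with a lock file instead:

```ocaml
(* The store only needs a way to run a function while holding an
   exclusive lock on a given key (one key per reference). *)
module type LOCK = sig
  type t
  val make : string -> t
  val with_lock : t -> (unit -> 'a Lwt.t) -> 'a Lwt.t
end

module Memory_lock : LOCK = struct
  type t = Lwt_mutex.t

  (* one mutex per reference name, created on demand *)
  let locks : (string, Lwt_mutex.t) Hashtbl.t = Hashtbl.create 16

  let make name =
    match Hashtbl.find_opt locks name with
    | Some m -> m
    | None ->
      let m = Lwt_mutex.create () in
      Hashtbl.add locks name m; m

  let with_lock m f = Lwt_mutex.with_lock m f
end
```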
About the re-implementation of the
So, the point is, I can reduce the new
The point (and it's the goal of this implementation) is to localize the possible memory problem in this module - because it's the only part of the code where a buffer can grow automatically (and where it can be difficult to infer the memory consumption).
And it's enough for the last commit :)!
Now, the current version of
Indeed, the current implementation of the Rabin fingerprint is close to the
But this case cannot happen, because it appears only when we want to access a value of some specific arrays, which have some other assertions to avoid this possible case.
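For context, the fingerprint in question is a rolling hash over a fixed window; a generic illustration (the constants are mine, not the PR's), where the caller must guarantee `off + window <= String.length s`, which is exactly the kind of assertion mentioned above:

```ocaml
let window = 16
let prime = 153191

(* Polynomial fingerprint of the [window] bytes starting at [off];
   the caller guarantees [off + window <= String.length s]. *)
let fingerprint (s : string) (off : int) : int =
  let h = ref 0 in
  for i = off to off + window - 1 do
    h := (!h * prime + Char.code s.[i]) land 0x3FFFFFFF
  done;
  !h
```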
Thanks for all the updates! Is the PR in a state when I can review it again?
Regarding locks: I agree that the second solution seems more elegant. About the in-memory implementation: it is pretty useful for testing purposes. I fully agree with modifying the API to be non-blocking for values, and adding atomic/locking operations for references only. I think we should also keep some convenience functions to read/write full values.
Regarding pack files and in-memory stores: I don't understand what the current limitation is: are you able to read objects in the pack but only write loose objects? If yes, do you have a way to re-compact the pack files using the in-memory backend? I agree that this is a bit of a stretch use-case (as it doesn't matter to control the memory consumption when everything stays in memory anyway) but it would be nice if it could somehow work :-)
At this time, I implemented a small common API between the memory back-end and the store (file-system related back-end).
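A guess at the shape of such a minimal common API (the names are illustrative, not the PR's):

```ocaml
(* Just enough to test membership, read, write and list objects,
   whatever the backend (in-memory or file-system). *)
module type MINIMAL = sig
  type t
  type hash = string
  type value

  val mem : t -> hash -> bool Lwt.t
  val read : t -> hash -> value option Lwt.t
  val write : t -> value -> hash Lwt.t
  val list : t -> hash list Lwt.t
end
```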
It's time to think about whether this is the best design, and to compare it to the store.mli interface (which has more functions). Then, this interface will be used to make the
The minimal API looks great. About the
The store API looks ok. I am not sure I will need to use all these functions, but I guess it is nice to have them just in case :-) How do you feed data to the repository/pack file state? Is the internal decoder similar to jsonm/xmlm (with an
Feel free to change the
So now all the tests pass on Travis; the remaining failures are due to missing constraints in the revdeps. These should be fixed by ocaml/opam-repository#10904 (this is not a blocker for the merge).
The tests are failing on windows for git-mirage with:
which I guess is just missing a
Also the git-unix tests are failing on windows:
The current status is:
So I am happy to merge the PR in its current state. Thank you very much @dinosaure for all that hard work. The new