New implementation of Git #227
Conversation
Thanks! I will try to review all the patches shortly, but first a few questions/remarks:
- crc32: please keep in mind that the xapi-project/crc is C code, and requires effort to get it running on all platforms
- performance: I'd be happy to see some numbers (I don't have any) - esp. whether a bytes/string backed …
To ease the review, could you replace the existing code with the new one instead of adding a new subdirectory? It would be easier to catch API changes and what is currently missing.
About CRC-32, I took the previous implementation available in …

About … Note: a good point to debug, …

About … We have different ways to fix this: … For the first and the second way, it will be easy to switch between …

I will take my flight :) !
I've done a first pass on the patches (and I expect to have a few more :p)
First of all, this is great, thanks very much for your impressive effort. The code is clean and I can mostly follow what happens :-) I am really looking forward to merging your PR. But before that, a few more comments:
- Please keep/update existing copyright headers to add your name (check that all the code that you committed here is compatible with MIT first);
- Please update your `XXX` comments to clarify whether they are normal comments (in this case, simply remove them) or if this is a TODO. If it is an important TODO, it would be great to open an issue in the repo to track it and explain the severity (this can be done after the PR is merged);
- The library should remain split between what is unix-specific and what is not. Currently `git` is the core library and is independent of any backend. `git-unix` is the Unix backend and `git-mirage` is the mirage backend. Please do not bring unix-only dependencies (and non-portable C code) into the core. So could you try to split your changes between `git` and `git-unix`?
- There are a few places where you wrote a hex converter again and again. Could you use `ocaml-hex` instead?
- I haven't seen where you are using `digestif` (although you seem to depend on the C backend). Could you depend on the portable bits and let the final link choose between the C and OCaml backends?
- Please use `Fmt` as much as possible instead of writing the format combinators manually :-) (Ideally, all uses of `Format.*` should be replaced by `Fmt.*`.)
- Please do not use `Printexc.record_backtrace true` in the library code. The only place where it is fine to set it is in `main.ml`.
- What about the missing modules: `git.mli`, `search.mli`, etc. Do you have any plans to add them back? :-)
That's it for now, I will make a new pass after you've made the requested changes :-) And thanks again for the huge work!
src/git/_tags
Outdated
@@ -0,0 +1,4 @@
true: safe_string, bin_annot, debug, keep_locs, use_menhir
can you update the jbuild file instead?
src/git/blob.ml
Outdated
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *)
Please keep the copyright header (and add your name to it)
src/git/blob.mli
Outdated
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *)
same here for the copyright + everywhere else in the code :-)
src/git/_tags
Outdated
true: safe_string, bin_annot, debug, keep_locs, use_menhir
<*/**>: warn(@20@25@26@33@39)

<*/**.{ml,mli,byte,native}>: package(fpath cstruct lwt lwt.unix angstrom faraday farfadet digestif.c bos decompress lru camlzip uri astring), linkdep(rabin_stubs.o)
we don't want a dependency on lwt.unix, bos and digestif.c in the core library. Some of the deps should be moved to git-unix
Yes, it's to compile main.ml, I will delete this - for sure, the library doesn't have these dependencies (bos is no longer necessary, lwt.unix is for lwt_fs_unix.ml and digestif.c is for sha1.ml).
src/git/commit.ml
Outdated
committer = @[<hov>%a@];@ \
message = @[<hov>%S@];@] }"
Hash.pp tree
(list_pp ~sep:(fun fmt () -> Format.fprintf fmt ";@ ") Hash.pp) parents
Please use Fmt.list instead
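For readers unfamiliar with the combinator being suggested, here is a minimal sketch of the `Fmt.list`/`Fmt.unit` idiom this review keeps coming back to, assuming the fmt library's classic API (where `Fmt.unit` builds a fixed separator from a format string). `pp_hashes` is a hypothetical printer, and `Fmt.string` stands in for `Hash.pp`:

```ocaml
(* Print a list with a ";" + break-hint separator, without writing
   the separator closure by hand. *)
let pp_hashes pp_hash ppf hashes =
  Fmt.pf ppf "[ %a ]" Fmt.(list ~sep:(unit ";@ ") pp_hash) hashes

let () =
  Format.printf "%a@." (pp_hashes Fmt.string) [ "aabb"; "ccdd" ]
```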
src/git/search.ml
Outdated
@@ -1,93 +0,0 @@
(*
would be great to re-add this.
src/git/smart.mli
Outdated
(** The [Digest] module used to make the module. *)

module Decoder
  : DECODER with type Hash.t = Digest.t and module Digest = Digest
why does the DECODER need to care about the Digest implementation?
The Smart decoder (and the encoder too) needs to know the length of the current hash implementation, and the Digest module exposes this information via Digest.length. I can reduce the signature needed by these functors (like: sig val length : int end) if you want.
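As a sketch of the reduction mentioned above (module and value names here are illustrative, not ocaml-git's actual ones), the functor argument could shrink to just the hash length:

```ocaml
(* The decoder only needs the hash length, not a full Digest module. *)
module type LENGTH = sig
  val length : int  (* size of a hash value, in bytes *)
end

module Decoder (H : LENGTH) = struct
  (* e.g. how many hex characters encode one hash on the wire *)
  let hash_size_in_hex = H.length * 2
end

module Sha1_length = struct let length = 20 end
module D = Decoder (Sha1_length)
```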
src/git/store.mli
Outdated
state-defined buffer. That means the client can not use this function in a
concurrency context with the same [state]. *)

val size : state -> Hash.t -> (int64, error) result Lwt.t
I think I would just expose the top-level function (with the _s suffix) and move the non-blocking one (with the _p suffix) into a sub-module.
src/git/store.ml
Outdated
val buffer_io : t -> Cstruct.t
val fold : t -> ('a -> ?name:Path.t -> length:int64 -> Hash.t -> Value.t -> 'a Lwt.t) -> path:Path.t -> 'a -> Hash.t -> 'a Lwt.t

module Ref :
how do you read/write references now? what about the test_and_set function?
src/git/tree.mli
Outdated
| `Dir (** A sub-{!Tree.t}. *)
| `Commit (** A sub-module ({!Commit.t}). *)
]
and t = entry list
The entry list is supposed to stay sorted (with a weird lexicographic order distinguishing trees and objects). Do you manage this? If yes, I guess the type would better be `and t = private entry list` to avoid random orders.
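For context, the "weird lexicographic order" is Git's canonical tree order: entries are sorted byte-wise by name, except that sub-tree entries compare as if their name ended with '/'. A sketch with a simplified entry type (not ocaml-git's actual one):

```ocaml
type kind = [ `Normal | `Exec | `Link | `Dir | `Commit ]
type entry = { name : string; kind : kind }

(* Sub-trees sort as if their name had a trailing slash. *)
let sort_key e = match e.kind with
  | `Dir -> e.name ^ "/"
  | _ -> e.name

let compare_entry a b = String.compare (sort_key a) (sort_key b)
```

So a directory `foo` sorts after a file `foo.c`, because `"foo/" > "foo.c"` byte-wise.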
f666d99 to 755dd39
LGTM with a few comments (check Fmt.unit, it can save you a lot of code :p)
src/git/blob.ml
Outdated
for i = 0 to Cstruct.len cs - 1
do match Cstruct.get_char cs i with
| '\000' .. '\031' | '\127' -> Format.fprintf fmt "."
| chr -> Format.fprintf fmt "%c" chr
| '\000' .. '\031' | '\127' -> Fmt.const Fmt.char '.' ppf ()
Fmt.char ppf '.'
src/git/commit.ml
Outdated
committer = %a;@ \
message = %a;@] }"
(Fmt.hvbox Hash.pp) tree
(Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf ";@ ") Hash.pp) parents
Fmt.(list ~sep:(unit ";@ ") Hash.pp)
src/git/idx.ml
Outdated
t.i_off t.i_pos t.i_len pp_state t.state
let pp_state ppf = function
| Header _ ->
Fmt.pf ppf "(Header #k)"
maybe try to keep the function compact (as it was before) by not adding a newline after -> when the line is short enough? And keep the only long case as the last one. Also you can use Fmt.string ppf "..." when you don't have parameters.
src/git/idx.ml
Outdated
| Hash _ ->
Fmt.pf ppf "(Hash #k)"
| End ->
Fmt.pf ppf "End"
same here, why so many endlines? :-)
src/git/minienc.ml
Outdated
aux lst
let pp ppv ppf q =
Fmt.pf ppf "[ %a ]"
(Fmt.hvbox (Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf ";@ ") ppv)) (to_list q)
Fmt.(list ~sep:(unit ";@ ") ppv)
src/git/negociator.ml
Outdated
@@ -50,7 +41,7 @@ struct
then [ "ALTERNATE" ] else [] ]
|> List.concat
in
pp_list ~sep:(fun fmt () -> Format.fprintf fmt " |@ ") Format.pp_print_string fmt flags
Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf " |@ ") Fmt.string ppf flags
Same here: Fmt.(list ~sep:(unit " |@ ") string)
src/git/pack.ml
Outdated
| Consume ->
Fmt.pf ppf "Consume"
| Exception err ->
Fmt.pf ppf "(Error %a)" (Fmt.hvbox pp_error) err
too many endlines, please keep the code compact :p
src/git/radix.ml
Outdated
(pp_list ~sep:(fun fmt () -> pp fmt ";@ ") (fun fmt (k, v) -> pp_data fmt k v)) (to_list radix)
let pp ppv ppf radix =
Fmt.pf ppf "[ %a ]"
(Fmt.hvbox (Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf ";@ ") (fun ppf x -> ppv ppf x)))
Fmt.(list ~sep:(unit ";@ ") ppv)
src/git/smart.ml
Outdated
let sep fmt () = Format.fprintf fmt ";@ " in
let pp_ref fmt (hash, refname, peeled) =
let pp_advertised_refs ppf { shallow; refs; capabilities; } =
let sep ppf () = Fmt.pf ppf ";@ " in
let sep = Fmt.unit ";@ "
src/git/smart.ml
Outdated
pp_a a pp_b b
in
let pp_report_status ppf { unpack; commands; } =
let sep ppf () = Fmt.pf ppf ";@ " in
you can inline that sep value using Fmt.unit
From the commit 795c11b, we can switch between implementations of a module `Digest`:

```ocaml
module Digest = struct
  type buffer = Cstruct.t
  val feed : ctx -> buffer -> unit
  ...
end

type hex = string

val of_hex : hex -> t
val to_hex : t -> hex

val of_string : string -> t
val to_string : t -> string

...
```
I deleted the constraint about the error of the …
I just implemented the atomic computation for the references (as @samoht requested). I protected all mutable computation on the references, but I need to explain why I put an (other) functor …

In this way, to read or write something, all of this is done in a non-atomic way. That means that, in a concurrent context, we can partially write a Git object, switch to another process, which partially reads another (or the same) Git object. Unfortunately, it's not the same for the …

I don't have any strong opinion about this choice (IMHO, I prefer the first one, which is already available by …)
In my shower time, I asked myself about the existence of the …

In the unix system, … However, in the mirage system, it's not the case. The link between my previous comment and this is the choice between some assumptions about how we implement the lock logic (highlighted by the …).

It's a good argument for the second proposition, but I'm still open to other opinions, and it's the weekend now.
About the re-implementation of the …

So, the point is, I can reduce the new …

The point (and it's the goal of this implementation) is to localize the possible memory problems in this module, because it's the only part of the code where a buffer can grow automatically (and where it can be difficult to infer the memory consumption). And it's enough for the last commit :) !
src/git/mem.ml
Outdated
val remove_reference : t -> Reference.t -> unit Lwt.t
val test_and_set_reference : t -> Reference.t -> test:Hash.t option -> set:Hash.t option -> bool Lwt.t
val read_inflated : t -> Hash.t -> string option Lwt.t
val write_inflated : t -> Cstruct.t -> Hash.t Lwt.t
not sure, but why is this within Lwt.t? The side effects I can observe below are mutation (of the hashtable) and errors (using either raise or Lwt.fail) -- I feel these functions could live outside of the Lwt monad...
Indeed, some parts of mem.ml don't need to be wrapped in Lwt.t. But we need to keep in mind the compatibility between store.ml (where Lwt.t is needed, because we interact with the file-system) and mem.ml, which share the same API. So that's the global point (and this signature comes from the previous version of ocaml-git).

However, for this specific function, in the current version of the implementation of store.ml with a file-system, I don't provide a write_inflated but only a write function.

The better question should be: is it necessary to expose this function? And, if the response is yes, how can we expose this kind of computation? In fact, in my implementation, the signature should be:

`val write_inflated_s : t -> Cstruct.t -> (Hash.t * int, error) result Lwt.t`

The `_s` suffix notes that we use the state-defined buffer (instead of a `write_inflate_p : t -> Cstruct.t -> ztmp:Cstruct.t -> raw:Cstruct.t -> Cstruct.t -> (Hash.t * int, error) result Lwt.t`); the `Cstruct.t` is the already-serialized Git object, and we return the produced `Hash.t` and how many byte(s) we wrote.

Finally, I think it's not needed (for the internal purpose of ocaml-git) to have this function (but a read_inflated function is needed in several places of the PACK encoder - so it is necessary on both sides, store.ml and mem.ml). Then again, maybe some people use ocaml-git and this function. We need to take care about that.
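To make the discussion concrete, here is a rough sketch of what a `write_inflated` on the memory back-end amounts to: hash the already-serialized object, store it, and return the hash plus the byte count. The stdlib `Digest` (MD5) and a plain `Hashtbl` with `string` values are stand-ins for SHA-1, `Hash.t`, `Cstruct.t` and the real store state:

```ocaml
let store : (string, string) Hashtbl.t = Hashtbl.create 16

(* [serialized] is the already-serialized Git object, e.g.
   "blob 3\000abc"; we return its id and how many bytes we wrote. *)
let write_inflated serialized =
  let hash = Digest.to_hex (Digest.string serialized) in
  Hashtbl.replace store hash serialized;
  (hash, String.length serialized)
```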
About the …
Now, the current version of … Indeed, the current implementation of the Rabin fingerprint is close to the … But this case cannot happen, because it appears only when we want to access a value of some specific arrays, which have some other assertions to avoid this possible case.
Thanks for all the updates! Is the PR in a state where I can review it again?

Regarding locks: I agree that the second solution seems more elegant.

About the in-memory implementation: it is pretty useful for testing purposes. I fully agree with modifying the API to be non-blocking for values, and adding atomic/locking operations for references only. I think we should also keep some convenience functions to read/write full values.

Regarding pack files and in-memory stores: I don't understand what the current limitation is: are you able to read objects in the pack but only write loose objects? If yes, do you have a way to re-compact the pack files using the in-memory backend? I agree that this is a bit of a stretch use-case (as it doesn't matter to control the memory consumption since everything will stay in memory anyway) but it would be nice if it could somehow work :-)
This change adds the kind information of the Git object to the internal Hashtbl of the memory back-end. When we return the inflated Git object, we indicate the kind.
…JavaScript world. This change is deeper: we change decompress (which already has the trampoline pattern) and angstrom (where I sent an issue). The compilation of ocaml-git and its usability in a web browser work, but keep in mind that Git in your browser is just for fun.
This is the first abstraction of the minimal common interface between the memory back-end and the store (file-system related) back-end.
The tests on 4.04 seem to pass now (yay!). The errors on 4.03 should be fixed by ocaml/opam-repository#10897
(to try to keep some kind of backward compatibility)
Try to remove unused type signatures: use Fpath directly instead of an abstract PATH, use Lwt.t instead of io, etc.
So now all the tests pass on Travis; the failures are due to missing constraints in the revdeps. These should be fixed by ocaml/opam-repository#10904 (this is not a blocker for that merge).

The tests are failing on Windows for git-mirage with: … which I guess is just missing a …

Also the git-unix tests are failing on Windows: …
More revdeps fixes there: ocaml/opam-repository#10910
The current status is: …

So I am happy to merge this PR in its current state. Thank you very much @dinosaure for all that hard work. The new …
Hi all,
Tomorrow, I come back to Cambridge, so it's better to discuss the next steps of `ocaml-git` with concrete code. This PR will be a good place to talk about the performance, the API, the code and the kind of beer at the Cambridge Blue.

To be clear, this PR needs a strong knowledge of what Git is. I won't explain what a tree, a blob or the PACK file is. I will give a talk in India for that.
Wait, what is it?
Some people from the mirage team know me: I started in February to improve `ocaml-git`, provide a new implementation of the PACK/IDX files, implement the `git gc` command, implement `git push` and the HTTP protocol, and some other stuff I forgot.

This work will (I hope) fix some bugs about the memory consumption of `ocaml-git` too. The design of this library will change internally for that. We will put some strong constraints on the memory consumption and change the scheme of the deserialization/serialization of a Git object.

For all of this, we need to rethink the decoding of a Git object and, fortunately, most of the Git objects are present in the PACK file. So, changing the implementation of the decoding of the PACK file will (I hope) fix this kind of bug.
However, when we write a new Git object, we write a loose Git object. That is the purpose of `git gc`: to avoid any fragmentation of the Git objects inside your file-system, it's better to call a `compact` function periodically. This function will pack all of your Git objects inside one PACK file (and delete all the other loose objects). So, we need to implement a smart packer. Before this implementation, `ocaml-git` just put the deflated Git objects inside the PACK file and nothing else.

Now, this implementation comes with a delta-ification (I don't know the right word for this computation, but yes... deal with it) and compresses the PACK file by the delta-ification. From some tests, we produce the same PACK file as `git` and sometimes we produce a smaller PACK file than `git` - yes, we can be better than `git`!

So, with this compression and the possibility to store all objects in one PACK file, we can consider that we are close to implementing the `git gc` command - now it's about some polishing, like computing the memory needed to compact everything (and avoid an explosion of your computer) and the deletion of the loose objects.

Finally, the PACK file is the key to the mystery of the chocolate. And the key to Git. Indeed, when you clone a Git repository, you receive a PACK file. When you push to a repository, you send a PACK file. So we need to take care about the implementation of the PACK file - I will explain precisely what my implementation is.
So for all of this, I refactored the `Sync` module and provide a better, modular API to communicate (via the Smart protocol) with a remote Git repository. I re-implemented the clone command, I implemented a negotiation engine (directly inspired by the `git` engine) to then implement the fetch command, and finally I implemented the PUSH COMMAND!

To prove my work, ALL COMMITS OF THIS PR WERE SENT BY MY IMPLEMENTATION!
Finally, we can restart the implementation of the HTTP protocol in the right way. So if you read all this, you can say: oh wait, but you missed some goals of what we expected! Yes, I did not yet implement the HTTP protocol and yes, `git gc` is not yet available. But I will start a detailed explanation of what happened in this PR, my choices, and why I took a long time to do all of it.

sirodepac

Firstly, I started to implement only the PACK file in another project. The name is `sirodepac`, a famous lemon syrup from the south of France. This project was focused only on the PACK file: trying to decode a PACK file and encode a PACK file. But, because @avsm commits faster than his shadow, if you want to naively decode a big PACK file from any repository of @avsm, your computer will die.

So, it's time to explain the strong constraints of my implementation: … the `state` which we had kept. The last two points are very important because, in Git, we can handle some huge, over-big files (blobs). And it's not realistic to store the PACK file entirely in memory, compute some Git objects in memory, and have some other allocated data besides.

I wrote a little article (in French, sorry) about that, which says something like: …
Decoder
Because the state will be in the minor heap, it will be deleted quickly. So the final point is: the decoding of the PACK file never allocates any buffer in the major heap. Only the client is able to allocate what is needed to decode a PACK file. In truth, we allocate a little buffer of `0x7F` bytes, but only one, it's ok...

However, we have the delta-ification. The delta-ification is: … From these `hunk`s, we can construct a new Git object which is a delta-ified object.

So, we need a second pass to construct all delta-ified objects, or we need the totality of the PACK file to resolve all delta-ified objects. I provided these two ways to reconstruct any Git object from a PACK file.
But, again, it's about the memory consumption. If we want to construct the new Git object, we need to keep the list of `hunk`s, which contains some little buffers (of, at maximum, `0x7F` bytes), we need to store the base Git object, and a result buffer to store the result of the resolving. And, another point: a base Git object can be delta-ified too (we can have a chain of 50 delta-ifications in the current implementation of Git).

You can understand that, if you implement this naively, your computer will burn.
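As a sketch of the structure described above (types are illustrative, not ocaml-git's; in Git's delta encoding an insert instruction carries at most `0x7F` literal bytes, hence the "little buffers"):

```ocaml
type hunk =
  | Copy of { offset : int; length : int }  (* copy a range from the base *)
  | Insert of string                        (* literal bytes, <= 0x7F each *)

(* Resolving a delta-ified object = folding the hunks over the base. *)
let resolve base hunks =
  let buf = Buffer.create 16 in
  List.iter (function
      | Copy { offset; length } ->
        Buffer.add_string buf (String.sub base offset length)
      | Insert s -> Buffer.add_string buf s)
    hunks;
  Buffer.contents buf
```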
So, my implementation is like a toolbox to deserialize any Git object while trying to avoid any memory allocation. If you know the size of the biggest Git object of your PACK file, and if you know the deepest delta-ification chain of your PACK file, you can allocate strictly what is needed to decode any object of this PACK file (delta-ified or not). It's an unpleasant game between all of your buffers to store any information (meta-data, list of hunks, base Git object) strictly needed to construct the requested Git object. Otherwise, if you don't know one of these required pieces of information, my decoder will allocate what is needed.

So, for all of that, I think my implementation will fix the bug about the memory consumption, because we can know strictly what is needed to construct any Git object. However, to obtain the required information, you need to do one pass over the PACK file and do some computation, and this has a cost.
If you want a concrete example of all of this, you can look at the `clone` function, which calculates this information, allocates what is needed, checks all Git objects and, if it's a thin pack, repacks all of these Git objects into a new PACK file.

Encoder
I think it's the most interesting part of my job: trying to encode any Git object to a new PACK file and to calculate a good application of the delta-ification. But the specification of this hard computation is only described in an IRC discussion between one guy and Linus Torvalds... I have known better specifications than that.

But, with @samoht, we tried many times to get a good result and to be close to the implementation of `git` - to produce the same small PACK file.

The key of the implementation of the delta-ification is the heuristic for which object is better as a base to apply a delta-ification to the current object. I don't want to explain the detail of this heuristic (we can read the IRC discussion!) but when we find a good base object, we compute the delta-ification.
Before, I believed `git` did a patience diff between the source and the target to produce a good list of `hunk`s. I re-implemented the patience diff without Core (you can look in the history of the `sirodepac` repository). But the result was bad.

In truth, `git` uses a tuned implementation of the Rabin fingerprint from `libXdiff` to generate a good list of `hunk`s. And because the only reference implementation of this algorithm is in C, I put a C stub in this library (`rabin_stubs.c`) to calculate the Rabin fingerprint. Then, I translated the tuned implementation from `git` into OCaml code. I think we can move this code fully to OCaml but I did not have the time to do it - and I think the translation is not easy because it is very close to the representation of integers in C.

But, finally, we produce a good PACK file and, as I said, sometimes we produce a better PACK file than `git` - though for some other reasons.

And, after this delta-ification, you need to re-order the list of your Git objects (a topological sort) because Git has some assertions about the format of the PACK file (only understandable from the only exhaustive specification of Git: the C code).

Finally, we have an encoder (in the same way as the decoder) to encode this list to a PACK file. To be clear, at this time we use a lightweight representation of the Git objects. You can understand that it's not possible to manipulate something like 4,000,000 commits (and trees, and blobs, and tags, ...) in memory.
Conclusion
We have a good implementation of the PACK file. It's not perfect; for example, we can optimize the serialization again (in the same way as `git`) and we can, again, restrict the memory consumption. But, I think, it's a good bedrock for the next things.

ocaml-git

Then, the job is to integrate this big part into `ocaml-git`. And, as you can understand, it's not really easy.

A bottom-to-top development
When you know the code-base of `ocaml-git`, you can understand why we have some bugs about the memory consumption. The decoders were not designed in a non-blocking way: they use the totality of the input (stored in memory), and they need the totality of the input precisely because the decoding is not non-blocking. The same goes for the serialization.

One other problem is the use of an interface close to `camlzip`, which cannot be used for a non-blocking decoder/encoder.

For all of these reasons, it was necessary to start a big replacement of some specific layers of `ocaml-git`. But it's not that this project is bad and we restart a new one. This project is good, the API is good, and that's why this is a bottom-to-top development: the goal is to provide the same previous API but with a big internal change.

So, I didn't restart the project; I took it and tried to improve some deep layers - and, of course, this surgical strike changed the API a little bit.
Angstrom
I think all people who want to read this PR know `angstrom` from @seliopou. This is a good library and, I think, the best choice I made to handle the deserialization. In fact, it's easy to describe the Git format with it and, when you understand the problem of the alteration (and the problem of keeping an internal buffer so as not to drop what was consumed by the alteration), you can provide a useful API to decode everything.

But, to be clear (and I explained this in the code), all the decoder/encoder stuff in this project is close to the Git format and, because I know when we can have an alteration and how many bytes I need to keep during the alteration, I produced a user-friendly interface. But don't copy/paste...

So the big deal happens in the `Helper` module, which makes a new non-blocking, memory-limited decoder from an Angstrom decoder.

Faraday / Farfadet
However, I will not say the same for `Faraday` (sorry @seliopou). `Faraday` was designed to be run in two cooperative processes: one to feed the encoder, the second to flush it. But Faraday doesn't care about the memory consumption of the encoder because, in this pattern, we can easily say that the encoder will not grow.

So, I provided a Faraday encoder (made with the Farfadet library) but I don't use it. I'm still open on this question.
Minienc
And yes, if I don't use Faraday, what do I use to encode a Git object? My little pony^Wminienc. This module provides a memory-limited encoder in the same way as Faraday (with the same interface). The big change is: …

From all of this, I provide exactly the same interface. I did not have the time to play with GADTs and provide the same kind of library as `Farfadet`, but it's still in my TODO.

Finally, from this `Minienc`, I can digest a fixed-size buffer to obtain the hash of a Git object.

camlzip and decompress

I keep this compatibility from the previous version to use
Minienc, I can digest a fixed-size buffer to obtain the hash of a Git object.camlzipanddecompressI keep this compatibility from the previous version to use
camlzipinsteaddecompress. You need to know that is not easy to find a common interface between these two projects and I'm not sure if all work (but I prefer to believe than all is ok).So, we can continue to improve the compression or the decompression of the Git object independently than
decompress. And yes, the interface required is close to the interface ofdecompressbut if a can provide the same interface withcamlzipwith the same internal constraint (the non-blocking state), I think it's a good interface. Then, this interface is an interface "à la @dbuenzli" and, to use this kind of intf, I think it's a good choice to constraint the inflater and the deflater to this intf.However, about the memory consumption, we can have a problem when we want to deflate. Indeed, for both, the implementation need to keep all of the input to produce then an Huffman tree. For a large file,
gitcallszlibwith a specificFLUSHmethod (likeSYNC_FLUSH) to make a new deflate block and free new memory.In the state, I don't do this and it can be a problem for a large file.
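For readers who don't know the "à la @dbuenzli" style, here is a sketch of what such a non-blocking interface looks like (illustrative names, not ocaml-git's exact signature): the client owns every buffer and drives a state machine that awaits input, asks to flush output, ends, or fails.

```ocaml
module type INFLATE = sig
  type t      (* immutable decoder state *)
  type error

  val default : t

  (* One step: consume from [src], produce into [dst]. *)
  val eval : src:Bytes.t -> dst:Bytes.t -> t ->
    [ `Await of t | `Flush of t | `End of t | `Error of t * error ]

  val used_in : t -> int             (* bytes consumed from [src] *)
  val used_out : t -> int            (* bytes produced into [dst] *)
  val refill : int -> int -> t -> t  (* tell the state about new input *)
  val flush : int -> int -> t -> t   (* tell the state output was drained *)
end
```

The point of this shape is exactly the constraint discussed above: the interface never allocates, so the memory consumption is entirely in the caller's hands.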
CRC-32
I provide a C implementation of the CRC-32 checksum. For sure, if you want a pure OCaml implementation, I can do it, but I left this as it is because it's not the biggest goal for the next step in my TODO list.
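For the record, a pure OCaml version is short; here is a bit-at-a-time sketch of CRC-32 (IEEE polynomial `0xEDB88320`, reflected), assuming a 64-bit `int`. A serious replacement for the C stub would use a precomputed table (or slice-by-8) for speed:

```ocaml
(* CRC-32 over a string; [crc] allows incremental computation. *)
let crc32 ?(crc = 0) s =
  let crc = ref (crc lxor 0xFFFFFFFF) in
  String.iter
    (fun c ->
       crc := !crc lxor Char.code c;
       for _ = 0 to 7 do
         (* shift right; xor the polynomial in when the low bit was set *)
         crc := (!crc lsr 1) lxor (0xEDB88320 * (!crc land 1))
       done)
    s;
  !crc lxor 0xFFFFFFFF

(* Standard check value: crc32 "123456789" = 0xCBF43926. *)
```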
FileSystem and Path
I decided to split the representation of the Path from the FileSystem because I consider that the Path can be useful to represent a reference independently of a file-system back-end (like for `Git.Mem`). Then, the `Store` module needs an implementation of the file-system which provides a partial `read` and a partial `write` (to connect with a non-blocking encoder/decoder) and the `mmap` function for the PACK file decoder (to resolve a delta-ified object in one pass).

However, this API is surely not final, and it's one other reason why I decided to do a bottom-to-top development. Before this version, `ocaml-git` loaded the whole file into memory at once and, obviously, it's another point that explains the memory consumption of `ocaml-git`.

Digestif and Bytes.t

I don't know what people think about that but, for me, an end-user will use a lot of hash values. And for this reason, I consider that it's better to localize the hash values in the minor heap as a `Bytes.t` instead of a `bigstring`.

So I constrained, in the PACK decoder/encoder, that the result of the `digest` function will be a `Bytes.t`. I'm not sure about this choice and we can change it (not easily, on the other hand). But I think we need a proof that it's better to use `bigstring` instead of `Bytes.t`.

Smart and Sync
And for all of this, I re-implemented the Smart protocol in the same way as `ocaml-imap`. I think it's better to implement the decoder and the encoder in this way instead of using Angstrom, and I explained why in the file.
Obviously, the API is more modular than the previous implementation of `clone`, `push` and `fetch`. In this API, we can change the negotiator engine, the packer engine, and let the user specify what he wants. The `main.ml` is a good example to understand how to use the `Sync` module. The implementation to connect, send and receive the data from the network is functorized - and in the example, you can look at 2 implementations of the `Net` interface (one for `ssh`, the other to communicate with a `git` server).

I will push documentation about `Sync` soon.

What is next?

As I said, I tried to do a bottom-to-top development and, if you know the API of the `Store` module, you can say that this work is not done yet (but close).

So the next step is to continue the development to provide the same previous interface. But I think it's a good moment to have a critique of the API and make a better library to interact with a Git repository. From this project, we will obviously break compatibility so, for me, it's an opportunity to make the best choices for the next thing.
Voili, Voilou!