
New implementation of Git #227

Merged
merged 331 commits into from Dec 3, 2017

Conversation


@dinosaure dinosaure commented Jul 25, 2017

Hi all,

Tomorrow I come back to Cambridge, so it's better to discuss the next step of ocaml-git with concrete code. This PR will be the right place to talk about the performance, the API, the code, and the kind of beer at the Cambridge Blue.

To be clear, this PR assumes a solid knowledge of what Git is. I won't explain what a tree, a blob, or the PACK file is; I will give a talk in India for that.

Wait, what is it?

Some people from the mirage team know me: since February I have been working to improve ocaml-git, provide a new implementation of the PACK/IDX files, implement the git gc command, implement git push and the HTTP protocol, and some other stuff I forget.

This work will (I hope) also fix some bugs about the memory consumption of ocaml-git. The design of the library changes internally for that: we put some strong constraints on memory consumption and change the scheme of deserialization/serialization of a Git object.

For all of this, we need to rethink the decoding of a Git object and, fortunately, most Git objects live in the PACK file. So changing the implementation of the PACK file decoding will (I hope) fix this kind of bug.

However, when we write a new Git object, we write a loose Git object; this is the purpose of git gc. To avoid fragmentation of Git objects inside your file-system, it's better to call a compact function periodically. This function packs all of your Git objects inside one PACK file (and deletes all the other loose objects). So we need to implement a smart packer: before this implementation, ocaml-git just put the deflated Git objects inside the PACK file and nothing else.

Now, this implementation comes with delta-ification (I don't know the right word for this computation, but yes... deal with it) and compresses the PACK file with it. In some tests we produce the same PACK file as git, and sometimes we produce a smaller PACK file than git; yes, we can be better than git!

So, with this compression and the possibility to store all objects in one PACK file, we can consider that we are close to implementing the git gc command. What remains is polishing, like bounding the memory needed to compact everything (and avoid an explosion of your computer) and the deletion of loose objects.

Finally, the PACK file is the key to the mystery of the chocolate, and the key to Git. Indeed, when you clone a Git repository, you receive a PACK file; when you push to a repository, you send a PACK file. So we need to take care about the implementation of the PACK file, and I will explain my implementation precisely.

So, for all of this, I refactored the Sync module and provide a more modular API to communicate (with the Smart protocol) with a remote Git repository. I re-implemented the clone command, implemented a negotiation engine (directly inspired by git's) to then implement the fetch command, and finally I implemented the PUSH COMMAND!

To prove my work, ALL COMMITS OF THIS PR WERE SENT BY MY IMPLEMENTATION!

Finally, we can restart the implementation of the HTTP protocol in the right way. If you have read all of this, you may say: oh wait, you missed some of the expected goals! Yes, I have not yet implemented the HTTP protocol, and yes, git gc is not yet available. But I will now give a detailed explanation of what happened in this PR, my choices, and why it took me so long.

sirodepac

Firstly, I started to implement only the PACK file in another project named sirodepac, after a famous lemon syrup from the south of France. This project was focused only on decoding and encoding a PACK file. But, because @avsm commits faster than his shadow, if you naively decode a big PACK file from any of @avsm's repositories, your computer will die.

So it's time to explain the strong constraints of my implementation:

  • First, my decoder (and encoder) must be non-blocking. That means we can stop at any step of the decoding, do what we want, and restart the process with only the state we kept.
  • Secondly, the buffer used to store the input must have a fixed size.
  • Finally, all internal buffers have the same constraint.
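These constraints can be pictured as a pure, resumable state machine. Here is a toy sketch (hypothetical names, not the PR's API) of a decoder that is fed caller-owned chunks and can stop and restart with only its state, counting newline-terminated records until a terminator:

```ocaml
(* Toy sketch of the non-blocking decoder style: the decoder never reads
   by itself, it only consumes the bytes it is given and returns a new
   pure state the caller can resume later. *)
type state =
  | Await of int   (* records counted so far; feed me more input *)
  | Done of int    (* terminator seen; final count *)

(* Consume one chunk of input: count '\n' bytes, stop at '\000'. *)
let feed st buf off len =
  match st with
  | Done _ -> st
  | Await n ->
    let rec go i n =
      if i >= off + len then Await n
      else match Bytes.get buf i with
        | '\000' -> Done n
        | '\n' -> go (i + 1) (n + 1)
        | _ -> go (i + 1) n
    in
    go off n
```

The caller decides when and how to refill the buffer; the state survives between chunks, which is exactly what lets the client keep a single fixed-size input buffer.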

The last two points are very important because, in Git, we may handle some hugely oversized files (blobs), and it's not realistic to store the entire PACK file in memory, compute some Git objects in memory, and hold other allocated data at the same time.

I wrote a little article (in French, sorry) about that, which says roughly:

  • The state must be pure, so that it lives only in the minor heap.
  • The fixed-size buffers will be in the major heap, but these objects live as long as we want to handle Git objects.

Decoder

Because the state lives in the minor heap, it will be collected quickly. So the point is: the decoding of the PACK file never allocates any buffer in the major heap; only the client allocates what is needed to decode a PACK file. In truth, we allocate one little buffer of 0x7F bytes, but only one, it's ok...

However, we have the delta-ification. A delta-ification means:

  • We have a base, which is a Git object.
  • We have a list of hunks.
  • From the base and the hunks, we can construct a new Git object, which is a delta-ified object.
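The reconstruction itself can be sketched in a few lines (hypothetical constructors, mirroring the copy/insert hunks of the PACK format; the real decoder works on fixed-size buffers, not a growing Buffer):

```ocaml
(* A hunk either copies a range of the base object or inserts literal
   bytes (at most 0x7F bytes per insert in a real PACK file). *)
type hunk =
  | Copy of int * int   (* offset and length into the base *)
  | Insert of string    (* literal bytes *)

(* Rebuild a delta-ified object from its base and its hunks. *)
let apply base hunks =
  let buf = Buffer.create 16 in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_string buf (String.sub base off len)
      | Insert s -> Buffer.add_string buf s)
    hunks;
  Buffer.contents buf
```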

So we need either a second pass to construct all delta-ified objects, or the totality of the PACK file to resolve them. I provide both ways to reconstruct any Git object from a PACK file.

But, again, it's about memory consumption. If we want to construct the new Git object, we need to keep the list of hunks (which contains some little buffers of at most 0x7F bytes each), we need to store the base Git object, and we need a result buffer to store the result of the resolution. And another point: a base Git object can itself be delta-ified (we can have a chain of 50 delta-ifications in the current implementation of Git).

You can understand that if you implement this naively, your computer will burn.

So my implementation is like a toolbox to deserialize any Git object while avoiding memory allocation. If you know the size of the biggest Git object of your PACK file, and if you know the deepest delta-ification chain of your PACK file, you can allocate strictly what is needed to decode any object of this PACK file (delta-ified or not). It's an unpleasant game between all of your buffers to store exactly the information (meta-data, list of hunks, base Git object) strictly needed to construct the requested Git object. Otherwise, if you don't know one of these pieces of information, my decoder will allocate what is needed.

So, for all of that, I think my implementation will fix the memory-consumption bug, because we can know strictly what is needed to construct any Git object. However, to obtain this information, you need one pass over the PACK file plus some computation, and this has a cost.

If you want a concrete example of all of this, look at the clone function, which calculates this information, allocates what is needed, checks all Git objects, and, if it's a thin pack, repacks all of these Git objects into a new PACK file.

Encoder

I think this is the most interesting part of my job: encoding any Git object to a new PACK file and computing a good delta-ification. But the specification of this hard computation is only described in an IRC discussion between one guy and Linus Torvalds... I've seen better specifications than that.

But, with @samoht, we tried many times to get a good result and to be close to the implementation of git, i.e. to produce the same small PACK file.

The key to the delta-ification implementation is the heuristic of which object is the best base to delta-ify the current object against. I don't want to detail this heuristic (we can read the IRC discussion!), but when we find a good base object, we compute the delta-ification.

Before, I believed git did a patience diff between the source and the target to produce a good list of hunks. I re-implemented the patience diff without Core (you can look in the history of the sirodepac repository), but the result was bad.

In truth, git uses a tuned implementation of Rabin's fingerprint from libXdiff to generate a good list of hunks. And because the only reference implementation of this algorithm is in C, I put a C stub in this library (rabin_stubs.c) to calculate the Rabin fingerprint. Then I translated git's tuned implementation to OCaml code. I think we could move the stub to OCaml too, but I did not have the time, and I don't think the translation is easy, because the code is very close to the representation of integers in C.

But, finally, we produce a good PACK file and, as I said, sometimes we produce a better PACK file than git, though for some other reasons.

And, after this delta-ification, you need to re-order the list of your Git objects (a topological sort), because Git has some assertions about the format of the PACK file (understandable only from the one exhaustive specification of Git, the C code).
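The re-ordering is a plain topological sort over the delta dependencies: a base must appear before any object delta-ified against it. A minimal sketch (hypothetical, each object reduced to a name and an optional base):

```ocaml
(* Sort objects so that every base precedes the objects that delta-ify
   against it. [deps n] returns the base of [n], if any. *)
let topo_sort deps nodes =
  let seen = Hashtbl.create 16 in
  let out = ref [] in
  let rec visit n =
    if not (Hashtbl.mem seen n) then begin
      Hashtbl.add seen n ();
      (* emit the base first, then the delta-ified object *)
      (match deps n with Some base -> visit base | None -> ());
      out := n :: !out
    end
  in
  List.iter visit nodes;
  List.rev !out
```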

Finally, we have an encoder (working in the same way as the decoder) to encode this list to a PACK file. To be clear, at this stage we use a lightweight representation of the Git objects: you can understand that it's not possible to manipulate something like 4,000,000 commits (and trees, and blobs, and tags, ...) in your memory.

Conclusion

We have a good implementation of the PACK file. It's not perfect: for example, we can optimize the serialization further (in the same way as git) and we can restrict the memory consumption even more. But I think it's a good bedrock for the next things.

ocaml-git

Then the job is to integrate this big part into ocaml-git. And, as you can understand, it's not really easy.

A bottom-to-top development

When you know the code-base of ocaml-git, you can understand why we have some bugs about memory consumption. No decoder was designed in a non-blocking way: they use the totality of the input (stored in memory), and they need the totality of the input precisely because the decoding is not non-blocking. The same goes for the serialization.

Another problem is the use of an interface close to camlzip, which cannot be used with a non-blocking decoder/encoder.

For all of these reasons, we need to start a big replacement of some specific layers of ocaml-git. But it's not like this project is bad and we restart a new one: this project is good, its API is good, and that's why this is a bottom-to-top development. The goal is to provide the same API as before but with a big internal change.

So I did not restart the project; I took it and tried to improve some deep layers, and, of course, this surgical strike changed the API a little bit.

Angstrom

I think all the people who want to read this PR know angstrom by @seliopou. It is a good library and, I think, the best choice I made to handle the deserialization. In fact, it's easy to describe the Git format with it first, and when you understand the problem of alternation (and the problem of keeping an internal buffer so as not to drop what was already consumed when an alternative fails), you can provide a useful API to decode everything.

But, to be clear (and I explain this in the code), all the decoder/encoder stuff in this project is tied to the Git format: because I know when an alternation can happen and how many bytes I need to keep across it, I can produce a user-friendly interface. But don't copy/paste...

So the big deal happens in the Helper module, which builds a new non-blocking, memory-bounded decoder from an Angstrom decoder.

Faraday / Farfadet

However, I will not say the same about Faraday (sorry @seliopou). Faraday was designed to run as two cooperative processes, one to feed the encoder and the second to flush it. But Faraday doesn't care about the memory consumption of the encoder because, in this pattern, we can easily assume that the encoder will not grow.

So I provide a Faraday encoder (made with the Farfadet library) but I don't use it. I'm still open on this question.

Minienc

And yes, if I don't use Faraday, what do I use to encode a Git object? My little pony^Wminienc. This module provides a memory-bounded encoder in the same way as Faraday (with the same interface). The big changes are:

  • If we are close to filling the internal buffer, we ask the client to flush.
  • The state is pure.
  • The internal queue is pure (and comes from containers by @companion-cube and Chris Okasaki).

From all of this, I provide exactly the same interface. I did not have the time to play with GADTs and provide a library like Farfadet, but it's still on my TODO list.

Finally, from this Minienc, I can digest a fixed-size buffer to obtain the hash of a Git object.
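The flush-on-full idea can be sketched in a few lines (hypothetical names; the real Minienc keeps a pure state, while this toy uses a mutable position for brevity):

```ocaml
(* A fixed-size internal buffer; when it fills up, the encoder suspends
   and hands the pending bytes to the client before continuing. *)
type encoder = { buf : Bytes.t; mutable pos : int }

type 'a emit =
  | Flush of string * (unit -> 'a emit)  (* client must write these bytes *)
  | Done of 'a

let create size = { buf = Bytes.create size; pos = 0 }

let rec write_string e s k =
  let room = Bytes.length e.buf - e.pos in
  if String.length s <= room then begin
    Bytes.blit_string s 0 e.buf e.pos (String.length s);
    e.pos <- e.pos + String.length s;
    k ()
  end else begin
    (* fill the buffer, ask the client to flush, then continue *)
    Bytes.blit_string s 0 e.buf e.pos room;
    let pending = Bytes.sub_string e.buf 0 (e.pos + room) in
    e.pos <- 0;
    Flush (pending, fun () ->
      write_string e (String.sub s room (String.length s - room)) k)
  end
```

The client drives the loop: on `Flush` it writes the bytes wherever it wants (file, socket, hash context) and resumes the continuation, so the encoder itself never grows past its fixed buffer.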

camlzip and decompress

I keep the compatibility from the previous version to use camlzip instead of decompress. You need to know that it's not easy to find a common interface between these two projects, and I'm not sure everything works (but I prefer to believe that all is ok).

So we can continue to improve the compression or decompression of Git objects independently of decompress. And yes, the required interface is close to the interface of decompress, but if I can provide the same interface with camlzip, with the same internal constraint (the non-blocking state), I think it's a good interface. This interface is an interface "à la @dbuenzli" and, to use this kind of intf, I think it's a good choice to constrain the inflater and the deflater to it.
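For readers who don't know this style, a non-blocking interface "à la @dbuenzli" looks roughly like this (a hypothetical signature in the spirit of decompress, not the exact module type of this PR):

```ocaml
(* Hypothetical sketch of a non-blocking inflater signature. The state
   [t] is immutable; [eval] consumes from [src], produces into [dst],
   and suspends with `Await (needs more input) or `Flush (output full). *)
module type INFLATE = sig
  type t
  type error

  val default : t
  val eval : src:Bytes.t -> dst:Bytes.t -> t ->
    [ `Await of t | `Flush of t | `End of t | `Error of t * error ]
  val used_in : t -> int    (* bytes consumed from [src] *)
  val used_out : t -> int   (* bytes produced into [dst] *)
end
```

Because both camlzip and decompress can be wrapped behind such a signature, the Store can be functorized over it and the choice of back-end becomes a link-time decision.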

However, about memory consumption, we can have a problem when we want to deflate. Indeed, for both implementations, we need to keep all of the input to then produce a Huffman tree. For a large file, git calls zlib with a specific FLUSH method (like SYNC_FLUSH) to start a new deflate block and free some memory.

In the current state, I don't do this, and it can be a problem for a large file.

CRC-32

I provide a C implementation of the CRC-32 checksum. For sure, if you want a pure OCaml implementation, I can do it, but I leave this as-is because it's not the biggest goal in my TODO list.
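For reference, the pure-OCaml alternative is short: the standard bit-wise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and the PACK/IDX files). A table-driven version would be faster; this is the minimal form:

```ocaml
(* Bit-wise CRC-32 over a string, polynomial 0xEDB88320. *)
let crc32 s =
  let crc = ref 0xFFFFFFFF in
  String.iter
    (fun c ->
      crc := !crc lxor Char.code c;
      for _ = 0 to 7 do
        let lsb = !crc land 1 in
        crc := !crc lsr 1;
        if lsb = 1 then crc := !crc lxor 0xEDB88320
      done)
    s;
  !crc lxor 0xFFFFFFFF
```

The standard check value `crc32 "123456789" = 0xCBF43926` makes it easy to validate against the C stub.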

FileSystem and Path

I decided to split the representation of the Path from the FileSystem because I consider that the Path can be useful to represent a reference independently of a file-system back-end (as for Git.Mem). Then the Store module needs an implementation of the file-system that provides a partial read and a partial write (to connect with a non-blocking encoder/decoder) and the mmap function for the PACK file decoder (to resolve a delta-ified object in one pass).

However, this API surely could not stay as it was, and it's one more reason why I decided on a bottom-to-top development. Before this version, ocaml-git loaded the whole file into memory at once, which, obviously, is another explanation for the memory consumption of ocaml-git.

Digestif and Bytes.t

I don't know what people think about this, but for me, an end-user will use a lot of hash values, and for this reason I consider it better to localize hash values in the minor heap as a Bytes.t instead of a bigstring.

So I constrain, in the PACK decoder/encoder, the result of the digest function to be a Bytes.t. I'm not sure about this choice and we can change it (though not easily). But I think we need a proof that it's better to use a bigstring instead of a Bytes.t.

Smart and Sync

And for all of this, I re-implemented the Smart protocol in the same way as ocaml-imap.

I think it's better to implement the decoder and the encoder in this way instead of using Angstrom; I explain why in the file.

Obviously, the API is more modular than the previous implementation of clone, push and fetch. In this API we can change the negotiator engine and the packer engine, and let the user specify what he wants. The main.ml is a good example of how to use the Sync module. The implementation to connect, send and receive data from the network is functorized, and in the example you can see 2 implementations of the Net interface (one for ssh, the other to communicate with a git server).

I will push a documentation about Sync soon.

What's next?

As I said, I tried to do a bottom-to-top development, and if you know the API of the Store module, you can tell this work is not done yet (but close).

So the next step is to continue the development to provide the same interface as before. But I think it's a good moment to critique the API and make a better library to interact with a Git repository. From this project we will obviously break compatibility, so, for me, it's an opportunity to make the best choices for the next thing.

Voili, Voilou!


@samoht samoht commented Jul 25, 2017

Thanks! I will try to review all the patches shortly, but first a few questions/remarks:

  • about digestif/camlzip: I don't have a strong opinion about keeping the choice to have both (it was useful when digestif was not yet stable -- which is not the case anymore). If you think you can do simpler and better things by breaking the compat with camlzip, please do: having a portable non-blocking codec for pack files is a good tradeoff vs. pure efficiency.

  • about hashes in the major/minor heap: I agree that we need some perf numbers. Nocrypto uses cstruct buffers everywhere, so I guess @pqwy and @hannesm have already considered the question.

  • about crc32: have you seen https://github.com/xapi-project/ocaml-crc? ocaml-git was using that before switching to a pure version of it, to ease portability.

  • and more generally: it would be great to minimize the amount of C code that we add to this library, as it will cause headaches when 1/ porting to Xen and 2/ porting to JavaScript. I could see that the parts where this is critical for speed could have two backends, but in general I would prefer the simplicity of having only one portable code path written in OCaml. But some numbers would probably be needed there too.


@hannesm hannesm commented Jul 26, 2017

crc32: please keep in mind that the xapi-project/crc is C code, and requires effort to get it running on all platforms

performance: I'd be happy to see some numbers (I don't have any) -- esp. whether a bytes/string backed cstruct is faster/slower than the current one (which uses bigarray) -- but also taking other buffer structures into account (iovec?).


@samoht samoht commented Jul 26, 2017

To ease the review, could you replace the existing code with the new one instead of adding a new subdirectory? Would be easier to catch API changes and what is currently missing.

@dinosaure dinosaure force-pushed the dinosaure:ngit branch from c76bea1 to b23f7ac Jul 26, 2017

@dinosaure dinosaure commented Jul 26, 2017

About CRC-32, I took the previous implementation available in ocaml-git 👍 .

About decompress/camlzip, I will keep the current interface. Indeed, we don't have any overhead between this interface and decompress (because, as I said, this interface is close to the interface of decompress). The available implementation of the inflate/deflate modules with camlzip was used to compare the ratio between decompress and camlzip, and it seems that decompress has a good ratio.

Note: a good point for debugging: decompress always puts an empty flush at the end of the compression block, so when you want to extract a git object from a PACK file, the block starts with 78 9c (the header) and finishes with 00 00 ff ff; this trick saved my brain.

About Digestif and the constraint with Bytes.t: it's because I need a concrete type to define a get function needed by the Radix functor. The get : t -> int -> char function is not available in the IDIGEST interface because it's not the purpose of an abstract hash representation (you should never use get on your hash). So, when I constrain the type to Bytes.t, I can then say in the lower layer that, to get a character, I will use Bytes.get.

We have different ways to fix this:

  • We can consider that it's worth putting the get function in the IDIGEST interface.
  • We can pass a proof (a GADT "à la decompress") of the concrete type of the hash and, from this proof, provide the function for the Radix functor.
  • We can keep the current structure.

With the first and second ways, it would be easy to switch between a bigstring and a Bytes.t representation. With the current structure, we have a lot of noise to change the concrete type (like changing all the constraints), and that's why it's not easy to switch.

I will take my flight :) !


@samoht samoht left a comment

I've done a first pass on the patches (and I expect to have a few more :p)

First of all, this is great, thanks very much for your impressive effort. The code is clean and I can mostly follow what happens :-) I am really looking forward to merging your PR. But before that, a few more comments:

  • Please keep/update existing copyright headers to add your name (check that all the code that you committed here is compatible with MIT first);
  • Please update your XXX comments to clarify whether they are normal comments (in this case, simply remove them) or if this is a TODO. If it is an important TODO, it would be great to open an issue in the repo to track it and explain the severity (this can be done after the PR is merged);
  • The library should remain split between what is unix-specific and what is not. Currently git is the core library and is independent of any backend. git-unix is the Unix backend and git-mirage is the mirage backend. Please do not bring unix-only dependencies (and non-portable C code) into the core. So could you try to split your changes between git and git-unix?
  • There are a few places where you wrote a hex converter again and again. Could you use ocaml-hex instead?
  • I haven't seen where you are using digestif (although you seem to depend on the C backend). Could you depend on the portable bits and let the final link choose between C and OCaml backend?
  • Please use Fmt as much as possible instead of writing the format combinators manually :-) (Ideally, all use of Format.* should be replaced by Fmt.*.)
  • Please do not use Printexc.record_backtrace true in the library code. The only place where it is fine to set it is in main.ml.
  • What about the missing modules: git.mli, search.mli, etc. Do you have any plans to add them back? :-)

That's it for now, I will make a new pass after you made the requested changes :-) And thanks again for the huge work!

@@ -0,0 +1,4 @@
true: safe_string, bin_annot, debug, keep_locs, use_menhir

@samoht samoht commented Jul 26, 2017

can you update the jbuild file instead?

* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*)

@samoht samoht commented Jul 26, 2017

Please keep the copyright header (and add your name to it)

* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*)

@samoht samoht commented Jul 26, 2017

same here for the copyright + everywhere else in the code :-)

true: safe_string, bin_annot, debug, keep_locs, use_menhir
<*/**>: warn(@20@25@26@33@39)

<*/**.{ml,mli,byte,native}>: package(fpath cstruct lwt lwt.unix angstrom faraday farfadet digestif.c bos decompress lru camlzip uri astring), linkdep(rabin_stubs.o)

@samoht samoht commented Jul 26, 2017

we don't want a dependency on lwt.unix, bos and digestif.c in the core library. Some of the deps should be moved to git-unix

@dinosaure dinosaure commented Jul 27, 2017

Yes, it's to compile main.ml, I will delete this. For sure, the library doesn't have these dependencies (bos is no longer necessary, lwt.unix is for lwt_fs_unix.ml, and digestif.c is for sha1.ml).

committer = @[<hov>%a@];@ \
message = @[<hov>%S@];@] }"
Hash.pp tree
(list_pp ~sep:(fun fmt () -> Format.fprintf fmt ";@ ") Hash.pp) parents

@samoht samoht commented Jul 26, 2017

Please use Fmt.list instead

@@ -1,93 +0,0 @@
(*

@samoht samoht commented Jul 26, 2017

would be great to re-add this.

(** The [Digest] module used to make the module. *)

module Decoder
: DECODER with type Hash.t = Digest.t and module Digest = Digest

@samoht samoht commented Jul 26, 2017

why does the DECODER needs to care about the Digest implementation?

@dinosaure dinosaure commented Jul 28, 2017

The Smart decoder (and the encoder too) needs to know the length of the current hash implementation, and the Digest module provides this information via Digest.length. I can reduce the signature needed by these functors (to something like: sig val length : int end) if you want.

state-defined buffer. That means the client can not use this function in a
concurrency context with the same [state]. *)

val size : state -> Hash.t -> (int64, error) result Lwt.t

@samoht samoht commented Jul 26, 2017

I think I would just expose the top-level function (with the _s suffix) and move the non-blocking one (with the _p suffix) into a sub-module.

val buffer_io : t -> Cstruct.t
val fold : t -> ('a -> ?name:Path.t -> length:int64 -> Hash.t -> Value.t -> 'a Lwt.t) -> path:Path.t -> 'a -> Hash.t -> 'a Lwt.t

module Ref :

@samoht samoht commented Jul 26, 2017

how do you read/write references now? what about the test_and_set function?

| `Dir (** A sub-{!Tree.t}. *)
| `Commit (** A sub-module ({!Commit.t}). *)
]
and t = entry list

@samoht samoht commented Jul 26, 2017

The entry list is supposed to stay sorted (with a weird lexicographic order distinguishing trees and objects). Do you manage this? If yes, I guess the type t would better be and t = private entry list to avoid random orders.

@dinosaure dinosaure force-pushed the dinosaure:ngit branch 2 times, most recently from f666d99 to 755dd39 Jul 27, 2017

@samoht samoht left a comment

LGTM with a few comments (check Fmt.unit, it can save you a lot of code :p)

for i = 0 to Cstruct.len cs - 1
do match Cstruct.get_char cs i with
| '\000' .. '\031' | '\127' -> Format.fprintf fmt "."
| chr -> Format.fprintf fmt "%c" chr
| '\000' .. '\031' | '\127' -> Fmt.const Fmt.char '.' ppf ()

@samoht samoht commented Jul 28, 2017

Fmt.char ppf '.'

committer = %a;@ \
message = %a;@] }"
(Fmt.hvbox Hash.pp) tree
(Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf ";@ ") Hash.pp) parents

@samoht samoht commented Jul 28, 2017

Fmt.(list ~sep:(unit ";@ ") Hash.pp)

t.i_off t.i_pos t.i_len pp_state t.state
let pp_state ppf = function
| Header _ ->
Fmt.pf ppf "(Header #k)"

@samoht samoht commented Jul 28, 2017

maybe try to keep the function compact (as it was before) by not adding a newline after -> when the line is short enough? And keep the only long case as the last one. Also you can use Fmt.string ppf "..." when you don't have parameters.

| Hash _ ->
Fmt.pf ppf "(Hash #k)"
| End ->
Fmt.pf ppf "End"

@samoht samoht commented Jul 28, 2017

same here, why so many endlines? :-)

aux lst
let pp ppv ppf q =
Fmt.pf ppf "[ %a ]"
(Fmt.hvbox (Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf ";@ ") ppv)) (to_list q)

@samoht samoht commented Jul 28, 2017

Fmt.(list ~sep:(unit ";@ ") ppv)

@@ -50,7 +41,7 @@ struct
then [ "ALTERNATE" ] else [] ]
|> List.concat
in
pp_list ~sep:(fun fmt () -> Format.fprintf fmt " |@ ") Format.pp_print_string fmt flags
Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf " |@ ") Fmt.string ppf flags

@samoht samoht commented Jul 28, 2017

Same here: Fmt.(list ~sep:(unit " |@ ") string)

| Consume ->
Fmt.pf ppf "Consume"
| Exception err ->
Fmt.pf ppf "(Error %a)" (Fmt.hvbox pp_error) err

@samoht samoht commented Jul 28, 2017

too many endlines, please keep the code compact :p

(pp_list ~sep:(fun fmt () -> pp fmt ";@ ") (fun fmt (k, v) -> pp_data fmt k v)) (to_list radix)
let pp ppv ppf radix =
Fmt.pf ppf "[ %a ]"
(Fmt.hvbox (Fmt.list ~sep:(fun ppf () -> Fmt.pf ppf ";@ ") (fun ppf x -> ppv ppf x)))

@samoht samoht commented Jul 28, 2017

Fmt.(list ~sep:(unit ";@ ") ppv)

let sep fmt () = Format.fprintf fmt ";@ " in
let pp_ref fmt (hash, refname, peeled) =
let pp_advertised_refs ppf { shallow; refs; capabilities; } =
let sep ppf () = Fmt.pf ppf ";@ " in

@samoht samoht commented Jul 28, 2017

let sep = Fmt.unit ";@ "

pp_a a pp_b b
in
let pp_report_status ppf { unpack; commands; } =
let sep ppf () = Fmt.pf ppf ";@ " in

@samoht samoht commented Jul 28, 2017

you can inline that sep value using Fmt.unit

@dinosaure dinosaure force-pushed the dinosaure:ngit branch from 4146895 to d34da3a Jul 28, 2017

@dinosaure dinosaure commented Aug 2, 2017

Since commit 795c11b, we can switch between a Bigarray implementation and a Bytes implementation of the hash. I relaxed the constraints: the module expected as the argument of the functor needs to provide:

module Digest : sig
  type ctx
  type buffer = Cstruct.t

  val feed : ctx -> buffer -> unit

  ...
end

type hex = string

val of_hex : hex -> t
val to_hex : t -> hex

val of_string : string -> t
val to_string : t -> string

...

@dinosaure dinosaure commented Aug 2, 2017

I deleted the constraint about the error type of the FileSystem; I kept the polymorphic variant, but the user is free to define a more precise error. I did this change because, later, it could break the API (so it's preferable to do this change now).


@dinosaure dinosaure commented Aug 4, 2017

I just implemented the atomic computation for references (as @samoht requested, as a deep requirement of irmin). I followed the previous implementation in its details, reviewed the code (about EINTR and other unix-dependent stuff), and matched it to the Mirage back-end (currently in the pipe).

I protected all mutable computations on references, but I need to explain why I introduced a(nother) functor, Lock. As I said, by design I made all computation non-blocking, using only some fixed-size buffers, to improve the memory consumption (in general) of ocaml-git.

This way, reading or writing anything is done in a non-atomic way. That means that, in a concurrent context, we can partially write a Git object, then switch to another process which partially reads another (or the same) Git object. For trees, commits, blobs and tags it does not matter, because these Git objects are immutable (it's forbidden for any process to modify them).

Unfortunately, this is not the case for the Reference, which is mutable. This is the purpose of the Lock module: to protect any modification of a Reference against race conditions. So we have a choice:

  • We can expect an atomic write function from the FileSystem module (one which takes care to open, write everything, close, and, most importantly, lock the whole computation). This was the case in the previous version.
  • We can provide a Lock abstraction to the Reference module to protect the Reference.write function (which does the open, n-write and close syscalls) against race conditions. This is the current case in this PR.

I don't have a strong opinion about this choice (IMHO, I prefer the first one, which is already available as atomic_write but not exposed). Since we have the chance to change the API now, I leave this question to you, reviewers 👀!
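For concreteness, here is a rough sketch in OCaml of the two shapes being discussed. All names and types are hypothetical (Lwt and error handling omitted); these are not the actual signatures of this PR:

```ocaml
(* Option 1: the file-system itself provides the atomicity.
   The whole open / write-all / close sequence happens under a lock. *)
module type FS_ATOMIC = sig
  type t
  val atomic_write : t -> path:string -> string -> (unit, [ `Msg of string ]) result
end

(* Option 2: a Lock abstraction is given to the Reference module, which
   brackets its own open / n-write / close sequence between lock and unlock. *)
module type LOCK = sig
  type t                 (* the lock store: lock_dir on Unix, a Hashtbl on Mirage *)
  type elt               (* one lock, keyed by a reference name *)
  val make : t -> string -> elt
  val with_lock : elt -> (unit -> 'a) -> 'a
end
```

Option 2 pushes the locking policy out of the file-system signature, which is why it composes better with back-ends that have no real file-system.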

@dinosaure dinosaure commented Aug 4, 2017

During my shower, I started to question the existence of the lock_dir value in the Git state. Indeed, this value is tied to the implementation of the Lock module: we consider a lock as a file, which will be stored in lock_dir.

In the Unix system (git-unix), this is what happens: we create a *.lock file to protect a reference update (as I said) against race conditions.

However, in the Mirage system, this is not the case: lock_dir is an unused value and we prefer to use a Hashtbl to represent this bucket of locks.

The link between my previous comment and this observation is the choice between baking in an assumption about how the lock logic is implemented (as the lock_dir value does) or fully abstracting the lock logic behind an abstract type covering both how we store the locks (file-system? Hashtbl?) and the lock itself.

This is a good argument for the second proposal, but I'm still open to other opinions, and it's the weekend now.
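As an illustration, a Mirage-style lock store needs no lock_dir at all: a Hashtbl keyed by reference name is enough. The following is a minimal single-threaded sketch with hypothetical names (a held flag stands in for a real mutex); it is not the code of this PR:

```ocaml
(* One lock per reference name, stored in a Hashtbl instead of a directory
   of *.lock files. *)
type lock = { name : string; mutable held : bool }
type store = (string, lock) Hashtbl.t

(* Find or create the lock associated with a reference name. *)
let make (s : store) name =
  match Hashtbl.find_opt s name with
  | Some l -> l
  | None ->
    let l = { name; held = false } in
    Hashtbl.add s name l; l

(* Run [f] with the lock held; release it even if [f] raises. *)
let with_lock l f =
  if l.held then failwith ("already locked: " ^ l.name);
  l.held <- true;
  Fun.protect ~finally:(fun () -> l.held <- false) f

let () =
  let s : store = Hashtbl.create 16 in
  let l = make s "refs/heads/master" in
  assert (with_lock l (fun () -> 42) = 42);
  assert (not l.held)
```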

@dinosaure dinosaure commented Aug 10, 2017

About the re-implementation of the mem module:

  • Removal of the PACK I/O: I clearly understand the purpose of write_pack for a memory back-end, and obviously for a file-system back-end. So I will work on a better interface and on a compact function. But note that to read a PACK file we need a virtual representation of a file-system; the point is that de-serialization of PACK files will not be available in the memory back-end.

  • The common part of store.ml and mem.ml: I tried to keep the new version close to the old interface of store.ml. However, I kept the old interface for mem.ml. This is the main purpose of the PR: to think about a new API. The new store.ml is what I propose as the new interface (with some functions depending on a virtual representation of the file-system, such as reading a PACK file). The current mem.ml keeps the old interface (all of whose functions can be found in store.ml).

So the point is: I can either reduce the new store.ml API to the same interface as mem.ml, or we can think about a new common interface.

  • Serialization of the Value: this was the biggest source of memory consumption in the previous version of ocaml-git. As I said in the introduction, serialization/deserialization only needs some fixed-size buffers to write to a file. However, in the memory back-end, we need to write to a buffer (Cstruct.t) which can grow like a file does. So the memory back-end needs a new functor, Value.BUFFER, which implements this idea. Obviously, we can use the Buffer module available in the standard library (with some little tricks to match the interface) or create an optimized implementation.

The point (and it's the goal of this implementation) is to localize the possible memory problem in this module, because it's the only part of the code where a buffer can grow automatically (and where it can be difficult to infer the memory consumption).
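As an illustration, here is a minimal sketch of what such a growable-buffer abstraction could look like, and how the standard library's Buffer already satisfies it. The module-type name and its exact contents are assumptions, not the PR's actual Value.BUFFER:

```ocaml
(* Hypothetical sketch of a growable-buffer abstraction for the memory
   back-end; the real Value.BUFFER in the PR may differ. *)
module type BUFFER = sig
  type t
  val create : int -> t                  (* initial size hint *)
  val add_string : t -> string -> unit   (* may grow the buffer *)
  val contents : t -> string
  val length : t -> int
end

(* The stdlib Buffer module already matches this interface as-is. *)
module Std_buffer : BUFFER = Buffer

let () =
  let b = Std_buffer.create 16 in
  Std_buffer.add_string b "blob 4\000";  (* a loose-object-style header *)
  Std_buffer.add_string b "data";
  assert (Std_buffer.length b = 11)
```

Keeping all automatic growth behind this one module type is what makes the memory behaviour easy to audit: everything else uses fixed-size buffers.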

And that's enough for the last commit :)!

val remove_reference : t -> Reference.t -> unit Lwt.t
val test_and_set_reference : t -> Reference.t -> test:Hash.t option -> set:Hash.t option -> bool Lwt.t
val read_inflated : t -> Hash.t -> string option Lwt.t
val write_inflated : t -> Cstruct.t -> Hash.t Lwt.t

@hannesm (Member) commented Aug 10, 2017

Not sure, but why is this within Lwt.t? The only side effects I can observe below are mutation (of the hashtable) and errors (using either raise or Lwt.fail); I feel these functions could live outside of the Lwt monad...

@dinosaure (Author, Member) commented Aug 10, 2017

Indeed, some parts of mem.ml don't need to be wrapped in Lwt.t. But we need to keep in mind the compatibility between store.ml (where Lwt.t is needed, because we interact with the file-system) and mem.ml, which should share the same API. That's the global point (and this signature comes from the previous version of ocaml-git).

However, for this specific function, in the current implementation of store.ml on a file-system, I don't provide a write_inflated but only a write function.

The better question would be: is it necessary to expose this function? And if the answer is yes, how can we expose this kind of computation? In fact, in my implementation, the signature would be:

val write_inflated_s : t -> Cstruct.t -> (Hash.t * int, error) result Lwt.t

The _s suffix indicates that we use the state-defined buffers (as opposed to a write_inflated_p : t -> ztmp:Cstruct.t -> raw:Cstruct.t -> Cstruct.t -> (Hash.t * int, error) result Lwt.t which takes them explicitly); the Cstruct.t argument is the already-serialized Git object, and we return the produced Hash.t and how many bytes we wrote.

Finally, I think this function is not needed for the internal purposes of ocaml-git (though a read_inflated function is needed at several points in the PACK encoder, so it is necessary on both sides, store.ml and mem.ml). Still, some people may use ocaml-git through this function; we need to take care about that.

@dinosaure dinosaure commented Aug 11, 2017

About the Object_graph module, I changed the pack function, which can no longer use Lwt_list.map_p to compute all Git objects. Instead, we use Lwt_pool to allocate the needed buffers and compute several Git objects at the same time.
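A sketch of that pattern (hypothetical buffer size and function names; Digest.string stands in for the real Git hashing, and the lwt package is required): a fixed pool bounds both concurrency and allocations, instead of one buffer per object.

```ocaml
(* At most 4 objects are processed at once, each borrowing one
   pre-allocated fixed-size buffer from the pool. *)
let buffer_pool =
  Lwt_pool.create 4 (fun () -> Lwt.return (Bytes.create 0x8000))

let hash_all objects =
  Lwt_list.map_p
    (fun obj ->
       Lwt_pool.use buffer_pool (fun _buf ->
         (* deserialize/hash obj using the fixed-size buffer here *)
         Lwt.return (Digest.string obj)))
    objects

let () =
  let hashes = Lwt_main.run (hash_all [ "tree ..."; "blob ..." ]) in
  assert (List.length hashes = 2)
```

Lwt_list.map_p still starts all the promises, but Lwt_pool.use makes each one wait for a free buffer, so memory stays bounded by the pool size.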

@dinosaure dinosaure commented Aug 15, 2017

Now the current version of ocaml-git does not have any dependency on C code. I replaced the C code that computes the Rabin fingerprint (needed for delta-ification) with OCaml code. However, we have a strong limitation on 32-bit architectures.

Indeed, the current implementation of the Rabin fingerprint is tied to an unsigned int32 representation. I used @dbuenzli's trick from mtime to compare two unsigned int32 values. However, when I cast an unsigned int32 to a native integer on a 32-bit architecture, the value may not fit in a native integer; I raise a Failure in this case.

But this case should never happen in practice: it can only occur when we access values of some specific arrays, and those accesses are guarded by other assertions which rule it out.
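For reference, the comparison trick works by flipping the sign bit, and the cast shows where the 32-bit limitation comes from. This is a sketch of the idea (the function names are hypothetical), not the code of this PR:

```ocaml
(* Unsigned int32 comparison, the same sign-bit trick mtime uses:
   subtracting min_int flips the sign bit, turning unsigned order
   into signed order. *)
let uint32_compare a b =
  Int32.compare (Int32.sub a Int32.min_int) (Int32.sub b Int32.min_int)

(* Casting an unsigned int32 to a native int shows the 32-bit limitation:
   on a 32-bit host (Sys.int_size = 31), values >= 2^30 cannot fit. *)
let uint32_to_int_exn v =
  if Sys.int_size > 32 then Int32.to_int v land 0xFFFFFFFF
  else if Int32.compare v 0l >= 0 && Int32.to_int v >= 0 then Int32.to_int v
  else failwith "unsigned int32 does not fit in a native int"

let () =
  assert (uint32_compare 0xFFFFFFFFl 1l > 0);  (* unsigned: 4294967295 > 1 *)
  assert (Int32.compare 0xFFFFFFFFl 1l < 0)    (* signed: -1 < 1 *)
```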

@samoht samoht commented Aug 21, 2017

Thanks for all the updates! Is the PR in a state where I can review it again?

Regarding locks: I agree that the second solution seems more elegant. About the in-memory implementation: it is pretty useful for testing purposes. I fully agree with modifying the API to be non-blocking for values, and adding atomic/locking operations for references only. I think we should also keep some convenience functions to read/write full values.

Regarding pack files and in-memory stores: I don't understand what the current limitation is: are you able to read objects in the pack but only write loose objects? If yes, do you have a way to re-compact the pack files using the in-memory back-end? I agree that this is a bit of a stretch use-case (as it doesn't matter to control the memory consumption when everything stays in memory anyway), but it would be nice if it could somehow work :-)

dinosaure added 7 commits Sep 5, 2017
This change adds the kind information of the Git object
to the internal Hashtbl of the memory back-end. When we
return the inflated Git object, we indicate its kind.
…vaScript world.

This change is deeper. We change decompress (which already has
the trampoline pattern) and angstrom (where I opened an issue).
The compilation of ocaml-git and its use in a web browser work,
but keep in mind that Git in your browser is just for fun.
This is the first draft of the minimal common interface
between the memory back-end and the store (file-system related) back-end.
@samoht samoht commented Dec 1, 2017

The tests on 4.04 seem to pass now (yay!). The errors on 4.03 should be fixed by ocaml/opam-repository#10897

samoht added 4 commits Dec 1, 2017
(to try to keep some kind of backward compatibility)
Try to remove unused type signatures: use Fpath directly instead of an abstract
PATH, use Lwt.t instead of io, etc.
@samoht samoht force-pushed the dinosaure:ngit branch from 52fbf59 to 176fb5b Dec 1, 2017
samoht and others added 5 commits Dec 1, 2017
FileSystem -> FS
exists -> mem
@samoht samoht force-pushed the dinosaure:ngit branch from 7920c01 to c092c4f Dec 2, 2017
@samoht samoht commented Dec 2, 2017

So now all the tests pass on Travis; the remaining failures are due to missing constraints in the revdeps. These should be fixed by ocaml/opam-repository#10904 (this is not a blocker for the merge).

The tests are failing on Windows for git-mirage with:

- [exec] cmd /d /v:off /c rd /s /q "test-git-mirage-store"
- The system cannot find the file specified.
- [exec] error 2

which I guess is just a missing Sys.file_exists check somewhere in the test init code.

Also, the git-unix tests are failing on Windows:

[ERROR]             Unix Store             4   Operations on references.
...
#=== ERROR while compiling git-unix.dev =======================================#
# opam-version         1.3.0~dev (2c2399d98ad0ebcb43a4bc856f3636030bb376db)
# os                   win32
# command              jbuilder runtest -p git-unix
# path                 C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev
# exit-code            1
# env-file             C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev\git-unix-288-d2c37b.env
# stdout-file          C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev\git-unix-288-d2c37b.out
# stderr-file          C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev\git-unix-288-d2c37b.err
### stderr ###
# [...]
# ASSERT r1
# --------------------------------------------------------------------------------
# --------------------------------------------------------------------------------
# ASSERT r2
# --------------------------------------------------------------------------------
# [exception] "Assert_failure src/git/reference.ml:309:8"
# Raised at file "bytes.ml", line 219, characters 25-34
# Called from file "src0/sexp.ml", line 90, characters 13-47
# 
# 
# The full test results are available in `_build/_tests`.
# 1 error! in 0.094s. 16 tests run.
@samoht samoht force-pushed the dinosaure:ngit branch from d87d817 to 70f113a Dec 2, 2017
@samoht samoht force-pushed the dinosaure:ngit branch from 70f113a to d6f15b0 Dec 2, 2017
@samoht samoht commented Dec 2, 2017

More revdeps fixes there: ocaml/opam-repository#10910

@samoht samoht force-pushed the dinosaure:ngit branch from d7157fb to 454288c Dec 3, 2017
@samoht samoht commented Dec 3, 2017

The current status is:

  • all the test runs are green on Linux / Travis CI
  • the test runs are green on AppVeyor / Windows for git-unix, but they fail for an unknown reason on git-mirage. It is very hard to investigate the Windows error without a Windows environment, which neither @dinosaure nor I have at the moment, so let's say this is not a blocker for the merge but it should be fixed ASAP.

So I am happy to merge this PR in its current state. Thank you very much @dinosaure for all that hard work. The new ocaml-git (which I will probably bump to 2.0) is a great piece of code :-)

@samoht samoht merged commit 35c72dc into mirage:master Dec 3, 2017
0 of 2 checks passed
continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
continuous-integration/appveyor/pr AppVeyor build failed
Details