
New implementation of Git #227

Merged
merged 331 commits into mirage:master on Dec 3, 2017

Conversation

@dinosaure
Member

dinosaure commented Jul 25, 2017

Hi all,

Tomorrow I come back to Cambridge, so it's better to discuss the next step of ocaml-git with some concrete code. This PR will be a good place to talk about the performance, the API, the code, and the kinds of beer at the Cambridge Blue.

To be clear, this PR assumes a solid knowledge of how Git works. I won't explain what a tree, a blob or the PACK file is. I will give a talk in India for that.

Wait, what is this?

Some people on the MirageOS team know me: I started in February to improve ocaml-git, to provide a new implementation of the PACK/IDX files, to implement the git gc command, git push, the HTTP protocol, and some other stuff I have forgotten.

This work will (I hope) also fix some bugs about the memory consumption of ocaml-git. The design of the library changes internally for that: we put some strong constraints on memory consumption and change the scheme of the deserialization/serialization of Git objects.

For all of this, we need to rethink the decoding of a Git object and, fortunately, most Git objects live in the PACK file. So changing the implementation of the PACK file decoding should (I hope) fix this kind of bug.

However, when we write a new Git object, we write a loose Git object. This is the purpose of git gc: to avoid fragmentation of the Git objects inside your file-system, it's better to call a compact function periodically. This function packs all of your Git objects inside one PACK file (and deletes all the other loose objects). So we need to implement a smart packer. Before this implementation, ocaml-git just put the deflated Git objects inside the PACK file and nothing else.

Now, this implementation comes with delta-ification (I don't know the right word for this computation, but yes... deal with it) and compresses the PACK file with it. In some tests we produce the same PACK file as git and sometimes we produce a smaller PACK file than git - yes, we can be better than git!

So, with this compression and the ability to store all objects in one PACK file, we can consider that we are close to implementing the git gc command - what remains is some polishing, like bounding the memory needed to compact everything (and avoiding an explosion of your computer) and the deletion of loose objects.

Finally, the PACK file is the key to the mystery of the chocolate. And the key to Git. Indeed, when you clone a Git repository, you receive a PACK file. When you push to a repository, you send a PACK file. So we need to take care with the implementation of the PACK file - I will explain precisely what my implementation is.

For all of this, I refactored the Sync module and provide a better, more modular API to communicate (over the Smart protocol) with a remote Git repository. I re-implemented the clone command, implemented a negotiation engine (directly inspired by the git engine) to then implement the fetch command, and finally I implemented the PUSH COMMAND!

To prove my work, ALL COMMITS OF THIS PR WERE SENT BY MY IMPLEMENTATION!

Finally, we can restart the implementation of the HTTP protocol in the right way. If you read all of this, you may say: oh wait, you missed some of the goals we expected! Yes, I have not yet implemented the HTTP protocol, and yes, git gc is not yet available. But I will now give a detailed explanation of what happens in this PR, my choices, and why it took me a long time to do all of this.

sirodepac

Firstly, I started by implementing only the PACK file handling in another project. The name is sirodepac, after a famous lemon syrup from the south of France. This project focused only on the PACK file: decoding a PACK file and encoding a PACK file. But, because @avsm commits faster than his shadow, if you naively decode a big PACK file from any of @avsm's repositories, your computer will die.

So it's time to explain the strong constraints of my implementation:

  • First, my decoder (and encoder) needs to be non-blocking. That means we can stop at any step of the decoding, do whatever we want, and restart the process with only the state we kept.
  • Secondly, the buffer used to store the input must be of fixed size.
  • Finally, all internal buffers have the same constraint.

The last two points are very important because, in Git, we can handle some very big files (blobs). It's not realistic to store the whole PACK file in memory, compute some Git objects in memory, and keep other allocated data around at the same time.
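To make this concrete, here is a minimal .mli-style sketch of such a non-blocking interface, in the spirit of jsonm/xmlm. The names are hypothetical, not the exact signatures of this PR:

(* decoder.mli: a sketch only. The client owns the fixed-size buffers;
   the decoder never allocates or grows them. *)
type decoder

val decoder : src:Bytes.t -> dst:Bytes.t -> decoder

val eval : decoder ->
  [ `Await of decoder          (* more input is needed in [src] *)
  | `Flush of decoder * int    (* [n] bytes are ready in [dst]  *)
  | `End of decoder
  | `Error of string ]

(* Tell the decoder that [len] fresh bytes are available in [src]. *)
val refill : decoder -> off:int -> len:int -> decoder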

I wrote a little article about this (in French, sorry), which says something like:

  • The state must be pure, so that it lives only on the minor heap.
  • The fixed-size buffers will be on the major heap, but they live as long as we want to handle Git objects.

Decoder

Because the state lives in the minor heap, it will be collected quickly. So the bottom line is: the decoding of the PACK file never allocates any buffer in the major heap. Only the client allocates what is needed to decode a PACK file. In truth, we allocate one little buffer of 0x7F bytes, but only one, it's OK...

However, we have the delta-ification. Delta-ification works like this (see the sketch after this list):

  • We have a base, which is a Git object.
  • We have a list of hunks.
  • From the base and the hunks, we can construct a new Git object, which is a delta-ified object.
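As a naive sketch (the hunk names are hypothetical), the resolution is just replaying the hunks against the base; the whole game of this PR is to do exactly this work into pre-allocated, fixed-size buffers instead of the growing one used here:

(* A naive sketch of resolving a delta-ified object. [Insert] payloads
   are at most 0x7F bytes in the PACK format. This version allocates
   freely; the PR does the same work into pre-allocated buffers. *)
type hunk =
  | Copy of int * int   (* offset and length in the base object *)
  | Insert of string    (* literal bytes carried by the delta   *)

let apply ~base hunks =
  let buf = Buffer.create (String.length base) in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_substring buf base off len
      | Insert s -> Buffer.add_string buf s)
    hunks;
  Buffer.contents buf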

So we need a second pass to construct all the delta-ified objects, or we need the whole PACK file to resolve them. I provide both ways to reconstruct any Git object from a PACK file.

But, again, it's about memory consumption. If we want to construct the new Git object, we need to keep the list of hunks, which contains some little buffers (of at most 0x7F bytes each), we need to store the base Git object, and we need a result buffer to store the outcome of the resolution. Another point is that a base Git object can itself be delta-ified (a chain can be up to 50 delta-ifications deep in the current implementation of Git).

You can understand that, if you implement this naively, your computer will burn.

So my implementation is like a toolbox to deserialize any Git object while trying to avoid memory allocation. If you know the size of the biggest Git object in your PACK file, and if you know the deepest delta-ification chain of your PACK file, you can allocate strictly what is needed to decode any object of this PACK file (delta-ified or not). It's an unpleasant game between all of your buffers, storing exactly the information (meta-data, list of hunks, base Git object) needed to construct the requested Git object. Otherwise, if you don't know one of these pieces of information, my decoder will allocate what is needed.

For all of that, I think my implementation will fix the memory-consumption bug, because we can know exactly what is needed to construct any Git object. However, to obtain this information, you need one pass over the PACK file and some computation, and that has a cost.

If you want a concrete example of all of this, look at the clone function, which computes this information, allocates what is needed, checks all the Git objects and, if it's a thin pack, repacks all of these Git objects into a new PACK file.

Encoder

I think this is the most interesting part of my job: encoding Git objects to a new PACK file and computing a good delta-ification. But the only specification of this hard computation is an IRC discussion between one guy and Linus Torvalds... I have seen better specifications than that.

But, with @samoht, we tried many times to get a good result, close to the implementation of git - producing the same small PACK file.

The key to the delta-ification is the heuristic for choosing which object is the best base against which to delta-ify the current object. I don't want to detail this heuristic here (you can read the IRC discussion!), but when we find a good base object, we compute the delta-ification.

At first, I believed git did a patience diff between the source and the target to produce a good list of hunks. I re-implemented the patience diff (you can look in the history of the sirodepac repository) without Core. But the result was bad.

In truth, git uses a tuned implementation of Rabin's fingerprint from libXdiff to generate a good list of hunks. Because the only reference implementation of this algorithm is in C, I put a C stub in this library (rabin_stubs.c) to compute Rabin's fingerprint, and I translated git's tuned logic on top of it to OCaml. I think we can move the C code to OCaml too, but I did not have the time to do it - and I think the translation is not easy because the code is very close to the representation of integers in C.
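For intuition only, here is a generic Rabin-style rolling hash over a fixed window. It is not git's tuned libXdiff variant, nor the code of rabin_stubs.c, just the idea:

(* A generic rolling hash over a [window]-byte window, for intuition only.
   [prime] is an arbitrary multiplier for this sketch. *)
let window = 16
let prime = 153191

let pow =             (* prime^(window-1), to remove the outgoing byte *)
  let p = ref 1 in
  for _ = 1 to window - 1 do p := !p * prime land max_int done;
  !p

let init s =          (* hash of the first [window] bytes of [s] *)
  let h = ref 0 in
  for i = 0 to window - 1 do
    h := (!h * prime + Char.code s.[i]) land max_int
  done;
  !h

(* Slide the window one byte to the right in O(1). *)
let roll h ~outgoing ~incoming =
  ((h - (Char.code outgoing * pow)) * prime + Char.code incoming) land max_int

Every window of the base object is hashed and indexed; a window of the target that hashes to the same value is a candidate for a copy hunk, and the bytes in between become insert hunks.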

But, finally, we produce a good PACK file and, as I said, sometimes a better PACK file than git, although for other reasons.

From this delta-ification, you then need to re-order the list of your Git objects (a topological sort), because Git has some assertions about the format of the PACK file (understandable only from the only exhaustive specification of Git: the C code).

Finally, we have an encoder (in the same style as the decoder) to encode this list into a PACK file. To be clear, at this point we use a lightweight representation of the Git objects: you can understand that it's not possible to keep something like 4,000,000 commits (and trees, and blobs, and tags, ...) in memory.

Conclusion

We have a good implementation of the PACK file. It's not perfect: for example, we can further optimize the serialization (in the same way as git) and we can further restrict the memory consumption. But I think it's a good bedrock for what comes next.

ocaml-git

Then the job is to integrate this big piece into ocaml-git. And, as you can imagine, that's not easy.

A bottom-to-top development

When you know the code-base of ocaml-git, you can understand why we have some bugs about memory consumption. No decoder was designed in a non-blocking way: each uses the totality of the input (stored in memory), and each needs the totality of the input precisely because the decoding is not non-blocking. The same goes for the serialization.

Another problem is the use of an interface close to camlzip, which cannot be used with a non-blocking decoder/encoder.

For all of these reasons, a big replacement of some specific layers of ocaml-git is needed. But it's not as if the project were bad and we were restarting from scratch. The project is good and the API is good, and that is why this is a bottom-to-top development: the goal is to provide the same API as before, but with a big internal change.

So I did not restart the project; I took it and tried to improve some deep layers - and, of course, this surgical strike changed the API a little bit.

Angstrom

I think everybody who wants to read this PR knows angstrom by @seliopou. It's a good library and, I think, the best choice I made for the deserialization. It's easy to describe the Git format with it, and once you understand the problem of alternation (and the problem of keeping an internal buffer so as not to drop what an alternative already consumed), you can provide a useful API to decode everything.

But, to be clear (and I explain this in the code), all the decoder/encoder stuff in this project is specific to the Git format: because I know where an alternation can happen and how many bytes I need to keep across it, I can produce a user-friendly interface. But don't copy/paste it...

So the big deal happens in the Helper module, which builds a new non-blocking, memory-bounded decoder from an Angstrom decoder.
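To give the flavour, here is a tiny Angstrom parser for a loose-object header ("<kind> <length>\000"). This is only an illustration, not the PR's Helper module:

open Angstrom

let kind =
  string "commit" <|> string "tree" <|> string "blob" <|> string "tag"

let length =
  take_while1 (function '0' .. '9' -> true | _ -> false) >>| int_of_string

(* "blob 42\000" parses to ("blob", 42). *)
let header =
  lift2 (fun k l -> (k, l)) (kind <* char ' ') (length <* char '\000')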

Faraday / Farfadet

However, I will not say the same about Faraday (sorry @seliopou). Faraday was designed to run as two cooperative processes: one feeding the encoder, the other flushing it. But Faraday does not care about the memory consumption of the encoder because, in this pattern, it is easy to argue that the encoder will not grow.

So I provide a Faraday encoder (made with the Farfadet library) but I don't use it. I'm still open on this question.

Minienc

And yes, if I don't use Faraday, what do I use to encode a Git object? My little pony^Wminienc. This module provides a memory-bounded encoder in the same style as Faraday (with the same interface). The big changes are (sketched after this list):

  • If we are close to filling the internal buffer, we ask the client to flush.
  • The state is pure.
  • The internal queue is pure (it comes from containers by @companion-cube and Chris Okasaki).
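Here is a sketch of what this flush-driven control flow looks like from the client side (the names are hypothetical):

(* When the fixed internal buffer is nearly full, the encoder suspends
   with `Flush`; the client drains the bytes (to a socket, a file, ...)
   and resumes with the number of bytes it consumed. *)
type 'a state =
  | Ok of 'a
  | Flush of { buffer : Bytes.t; off : int; len : int
             ; continue : int -> 'a state }

let rec run write = function
  | Ok v -> v
  | Flush { buffer; off; len; continue } ->
      let n = write buffer off len in
      run write (continue n)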

On top of all this, I provide exactly the same interface. I did not have time to play with GADTs and provide a library like Farfadet, but it's still on my TODO list.

Finally, from this Minienc, I can digest a fixed-size buffer to obtain the hash of a Git object.

camlzip and decompress

I keep the compatibility from the previous version of being able to use camlzip instead of decompress. You need to know that it is not easy to find a common interface between these two projects, and I'm not sure everything works (but I prefer to believe it's all OK).

So we can continue to improve the compression or decompression of Git objects independently of decompress. Yes, the required interface is close to the interface of decompress, but if I can provide the same interface over camlzip with the same internal constraint (the non-blocking state), I think it's a good interface. This interface is an interface "à la @dbuenzli" and, to use this kind of interface, I think it's a good choice to constrain the inflater and the deflater to it.
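Roughly, the shape of such a non-blocking interface is the following; this is a sketch, and the real signature constrains a few more things:

module type INFLATE = sig
  type t
  type error

  val default : t
  val eval : src:Cstruct.t -> dst:Cstruct.t -> t ->
    [ `Await of t           (* more compressed input is needed in [src] *)
    | `Flush of t           (* [dst] is full: drain it, then continue   *)
    | `End of t             (* the stream is finished                   *)
    | `Error of t * error ]
  val used_in : t -> int    (* bytes consumed from [src] *)
  val used_out : t -> int   (* bytes produced in [dst]   *)
end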

However, about memory consumption, we can have a problem when we want to deflate. Indeed, both implementations need to keep all of the input of a block to then produce a Huffman tree. For a large file, git calls zlib with a specific flush method (like SYNC_FLUSH) to start a new deflate block and free some memory.

For now, I don't do this, and it can be a problem for large files.

CRC-32

I provide a C implementation of the CRC-32 checksum. For sure, if you want a pure OCaml implementation, I can do it, but I left it like this because it's not the biggest goal on my TODO list.
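For reference, a table-driven pure OCaml CRC-32 (with the standard reflected polynomial 0xEDB88320) is short; this is only a sketch of what replacing the C stub could look like:

let table =
  Array.init 256 (fun n ->
    let c = ref (Int32.of_int n) in
    for _ = 0 to 7 do
      c := if Int32.logand !c 1l <> 0l
           then Int32.logxor 0xEDB88320l (Int32.shift_right_logical !c 1)
           else Int32.shift_right_logical !c 1
    done;
    !c)

(* [digest ~crc s] updates a running checksum, so a file can be fed
   chunk by chunk. *)
let digest ?(crc = 0l) s =
  let c = ref (Int32.lognot crc) in
  String.iter (fun ch ->
    let i = Int32.to_int Int32.(logand (logxor !c (of_int (Char.code ch))) 0xffl) in
    c := Int32.logxor table.(i) (Int32.shift_right_logical !c 8))
    s;
  Int32.lognot !c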

FileSystem and Path

I decided to split the representation of the Path from the FileSystem because I consider that a Path can be useful to represent a reference independently of any file-system back-end (as for Git.Mem). Then the Store module needs an implementation of the file-system which provides partial reads and partial writes (to connect with a non-blocking encoder/decoder) and an mmap function for the PACK file decoder (to resolve a delta-ified object in one pass).

However, this API cannot be changed lightly, and it's another reason why I decided to develop bottom-to-top. Before this version, ocaml-git loaded whole files into memory in one go and, obviously, that's another explanation for the memory consumption of ocaml-git.
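A sketch of the file-system operations the store needs; the names and error handling here are hypothetical:

module type FILE = sig
  type fd
  type error

  val open_r : Fpath.t -> (fd, error) result Lwt.t
  (* Partial read: fills at most [len] bytes of [buf], returns how many. *)
  val read : fd -> buf:Cstruct.t -> off:int -> len:int -> (int, error) result Lwt.t
  val write : fd -> buf:Cstruct.t -> off:int -> len:int -> (int, error) result Lwt.t
  val close : fd -> (unit, error) result Lwt.t
end

module type MAPPER = sig
  type fd
  (* Map [len] bytes of the underlying PACK file at offset [pos]. *)
  val map : fd -> pos:int64 -> len:int -> Cstruct.t Lwt.t
end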

Digestif and Bytes.t

I don't know what people think about this but, for me, an end-user will manipulate a lot of hash values. For this reason, I consider it better to keep hash values in the minor heap as a Bytes.t rather than as a bigstring.

So I constrained the PACK decoder/encoder so that the result of the digest function is a Bytes.t. I'm not sure about this choice and we can change it (though not easily). But I think we would need proof that it's better to use a bigstring instead of a Bytes.t.

Smart and Sync

On top of all this, I re-implemented the Smart protocol, in the same way as ocaml-imap.

I think it's better to implement this decoder and encoder by hand instead of using Angstrom; I explain why in the file.

Obviously, the API is more modular than the previous implementation of clone, push and fetch. With this API, we can change the negotiation engine and the packer engine, and let the user specify what they want. The main.ml is a good example of how to use the Sync module. The implementation to connect, send and receive data from the network is functorized - and in the examples, you can see two implementations of the Net interface (one for ssh, the other to communicate with a git daemon).
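A sketch of what such a functorized network layer can look like; the real Net signature in the PR may differ:

module type NET = sig
  type socket

  val socket : host:string -> port:int -> socket Lwt.t
  val read : socket -> Bytes.t -> int -> int -> int Lwt.t
  val write : socket -> Bytes.t -> int -> int -> int Lwt.t
  val close : socket -> unit Lwt.t
end

(* The synchronization engine is then a functor over it, e.g.:
   module Sync (Net : NET) = struct (* clone, fetch, push... *) end *)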

I will push some documentation about Sync soon.

What's next?

As I said, I tried to do a bottom-to-top development, and if you know the API of the Store module, you can tell that this work is not done yet (but it's close).

So the next step is to continue the development to provide the same interface as before. But I think it's a good moment for a critique of the API, to make a better library to interact with a Git repository. This project will obviously break compatibility, so for me it's an opportunity to make the best choices for the next thing.

Voili, Voilou!

@samoht
Member

samoht commented Jul 25, 2017

Thanks! I will try to review all the patches shortly, but first a few questions/remarks:

  • about digestif/camlzip: I don't have a strong opinion about keeping the choice to have both (it was useful when digestif was not yet stable -- which is no longer the case). If you think you can do simpler and better things by breaking the compat with camlzip, please do: having a portable non-blocking codec for pack files is a good tradeoff vs. pure efficiency.

  • about hashes in the major/minor heap: I agree that we need some perf numbers. Nocrypto uses cstruct buffers everywhere, so I guess @pqwy and @hannesm have already considered the question.

  • about crc32: have you seen https://github.com/xapi-project/ocaml-crc? ocaml-git was using that before switching to a pure version of it, to ease portability.

  • and more generally: it would be great to minimize the amount of C code that we add to this library, as it will cause headaches when 1/ porting to Xen and 2/ porting to JavaScript. I could see that the parts where this is critical for speed could have two backends, but in general I would prefer the simplicity of having only one portable code path written in OCaml. But some numbers would probably be needed there too.

@hannesm
Member

hannesm commented Jul 26, 2017

crc32: please keep in mind that xapi-project/crc is C code, and requires effort to get it running on all platforms.

performance: I'd be happy to see some numbers (I don't have any) -- esp. whether a bytes/string-backed cstruct is faster/slower than the current one (which uses bigarray) -- but also taking other buffer structures into account (iovec?).

@samoht
Member

samoht commented Jul 26, 2017

To ease the review, could you replace the existing code with the new one instead of adding a new subdirectory? It would be easier to catch API changes and what is currently missing.

@dinosaure
Member

dinosaure commented Jul 26, 2017

About CRC-32, I took the previous implementation available in ocaml-git 👍.

About decompress/camlzip, I will keep the current interface. Indeed, we don't have any overhead with this interface over decompress (because, as I said, this interface is close to the interface of decompress). The implementation of the inflate/deflate modules with camlzip was used to compare the ratio between decompress and camlzip, and it seems that decompress has a good ratio.

Note: a good point for debugging: decompress always puts an empty flush at the end of the compression block, so when you want to extract a Git object from a PACK file, the block starts with 78 9c (the header) and finishes with 00 00 ff ff - this trick saved my brain.

About Digestif and the Bytes.t constraint: it's there because I need a concrete type to define a get function needed by the Radix functor. The get : t -> int -> char function is not available in the IDIGEST interface because it's not the purpose of an abstract hash representation (you should never use get on your hash). So, when I constrain the type to Bytes.t, I can say in the lower layer that, to get a character, I will use Bytes.get.

We have different ways to fix this:

  • We can consider that it's worth putting the get function in the IDIGEST interface
  • We can pass a proof (a GADT "à la decompress") to learn the concrete type of the hash and, from this proof, derive the function for the Radix functor
  • We can keep the current structure

With the first and second ways, it would be easy to switch between a bigstring and a Bytes.t representation. With the current structure, we have a lot of noise to change the concrete type (we have to change all the constraints) - and that's why it's not easy to switch.

I will take my flight :)!

@samoht

I've done a first pass on the patches (and I expect to have a few more :p)

First of all, this is great, thanks very much for your impressive effort. The code is clean and I can mostly follow what happens :-) I am really looking forward to merging your PR. But before that, a few more comments:

  • Please keep/update the existing copyright headers and add your name (first check that all the code you committed here is compatible with MIT);
  • Please update your XXX comments to clarify whether they are normal comments (in this case, simply remove them) or TODOs. If one is an important TODO, it would be great to open an issue in the repo to track it and explain the severity (this can be done after the PR is merged);
  • The library should remain split between what is Unix-specific and what is not. Currently git is the core library and is independent of any backend, git-unix is the Unix backend and git-mirage is the MirageOS backend. Please do not bring Unix-only dependencies (and non-portable C code) into the core. So could you try to split your changes between git and git-unix?
  • There are a few places where you wrote a hex converter again and again. Could you use ocaml-hex instead?
  • I haven't seen where you are using digestif (although you seem to depend on the C backend). Could you depend on the portable bits and let the final link choose between the C and OCaml backends?
  • Please use Fmt as much as possible instead of writing the format combinators manually :-) (ideally, all uses of Format.* should be replaced by Fmt);
  • Please do not use Printexc.record_backtrace true in the library code. The only place where it is fine to set it is in main.ml.
  • What about the missing modules: git.mli, search.mli, etc. Do you have any plans to add them back? :-)

That's it for now, I will make a new pass after you've made the requested changes :-) And thanks again for the huge work!

@samoht

LGTM with a few comments (check Fmt.unit, it can save you a lot of code :p)

@dinosaure
Member

dinosaure commented Aug 2, 2017

As of commit 795c11b, we can switch between a Bigarray implementation and a Bytes implementation of the hash. I cleaned up the constraints: the module expected as the argument of the functor needs to provide:

module Digest : sig
  type buffer = Cstruct.t

  val feed : ctx -> buffer -> unit

  ...
end

type hex = string

val of_hex : hex -> t
val to_hex : t -> hex

val of_string : string -> t
val to_string : t -> string

...
@dinosaure
Member

dinosaure commented Aug 2, 2017

I deleted the constraint on the FileSystem error type: I kept the polymorphic variant, but the user is free to define a more precise error. I did this change because it could otherwise break the API later (better to do this change now).

@dinosaure
Member

dinosaure commented Aug 4, 2017

I just implemented the atomic update of references (as @samoht requested; it's a deep requirement of irmin). I followed the previous implementation closely, reviewed the code (about EINTR and other Unix-dependent stuff), and matched it to the MirageOS back-end (currently in the pipe).

I protected all mutations of references, but I need to explain why I introduced an(other) functor, Lock. As I said, by design I made all computations non-blocking, using only fixed-size buffers, to improve the memory consumption of ocaml-git in general.

This way, reads and writes are all done non-atomically. That means that, in a concurrent context, we can partially write a Git object, then switch to another process, which partially reads another (or the same) Git object. For trees, commits, blobs and tags it does not matter, because these Git objects are immutable (no process is allowed to modify them).

Unfortunately, it's not the same for references, which are mutable. That is the purpose of the Lock module: to protect any modification of any reference from race conditions. So we have a choice (a sketch of the second option follows):

  • We can expect an atomic write function from the FileSystem module (one which takes care to open, write everything and close, and - most importantly - lock this computation); this was the case in the previous version.
  • We can provide a Lock abstraction to the Reference module to protect the Reference.write function (which does the open, n writes, and the close syscalls) from any race condition; this is the current state of this PR.

I don't have a strong opinion about this choice (IMHO, I prefer the first one, which is already available as atomic_write but not exposed). Then, since we have the opportunity to change the API, I leave this question to you, reviewer 👀!
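For clarity, a sketch of the second option, with hypothetical names: an abstract Lock is passed to the Reference module, and the store decides how locks are represented (a *.lock file for git-unix, an in-memory table for MirageOS):

module type LOCK = sig
  type t
  type key          (* the resource to lock, e.g. a reference name *)

  val make : key -> t
  (* Run the protected section while holding the lock. *)
  val with_lock : t -> (unit -> 'a Lwt.t) -> 'a Lwt.t
end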

@dinosaure
Member

dinosaure commented Aug 4, 2017

In my shower time, I asked myself about the existence of the lock_dir value available in the Git state. Indeed, this value is tied to the implementation of the Lock module: we consider a lock to be a file, which will be stored in the lock_dir.

On a Unix system (git-unix), this is what happens: we create a *.lock file to protect an update of a reference (as I said) against race conditions.

However, on the MirageOS system, it's not the case: lock_dir is an unused value and we prefer to use a Hashtbl to represent this bucket of locks.

The link between my previous comment and this one is the choice between some assumptions about how we implement the lock logic (highlighted by the lock_dir value), or the full abstraction of the lock logic, with an abstract type for how we store these locks (file-system, hashtbl?) and for the lock itself.

That's a good argument for the second proposition, but I'm still open to other opinions, and it's the week-end now.

@dinosaure
Member

dinosaure commented Aug 10, 2017

About the re-implementation of the mem module:

  • Deletion of the PACK I/O: I understood clearly the purpose of write_pack for a memory back-end, as for a file-system back-end. So I will work on a better interface and a compact function. But I need to note that, to read a PACK file, we need a virtual representation of a file-system - the point is: the de-serialization of a PACK file will not be available in the memory back-end.

  • The common part of store.ml and mem.ml: I tried to stay close to the old interface of store.ml from the previous version. However, I kept the old interface for mem.ml. This is the main purpose of the PR: to think about a new API. The new store.ml is what I think the new interface should be (with some functions that depend on a virtual representation of the file-system - like reading a PACK file). The current mem.ml is the old interface (all of whose functions can be found in store.ml).

So, the point is, I can reduce the new store.ml API to the same interface as mem.ml, or we can think about a new common interface.

  • Serialization of the Value: this was the biggest source of memory consumption in the previous version of ocaml-git. As I said in the introduction, all serialization/deserialization only needs some fixed-size buffers to write to a file. However, in the memory back-end, we need to write to a buffer (Cstruct.t) which can grow like a file. So the memory back-end needs a new functor argument, a Value.BUFFER, which implements this idea (a sketch follows this comment). Obviously, we can use the Buffer module available in the standard library (with some little tricks to match the interface) or create an optimized implementation.

The point, and it's the goal of this design, is to localize the possible memory problem in this module - because it's the only part of the code where a buffer can grow automatically (and where it can be difficult to infer the memory consumption).

And that's enough for this last commit :)!
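A sketch of such a BUFFER argument, shaped like the Buffer module of the standard library; the exact signature in the PR may differ:

module type BUFFER = sig
  type t
  type raw = Cstruct.t

  val create : int -> t
  val add : t -> raw -> unit     (* append one chunk of serialized output *)
  val contents : t -> raw
  val clear : t -> unit
end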

@dinosaure
Member

dinosaure commented Aug 11, 2017

About the Object_graph module: I changed the pack function, which can no longer use Lwt_list.map_p to compute all the Git objects. Instead, we use Lwt_pool, allocate the needed buffers, and compute several Git objects at the same time.

@dinosaure
Member

dinosaure commented Aug 15, 2017

Now, the current version of ocaml-git does not have any dependency on C code. I replaced the C code that computes Rabin's fingerprint (needed for the delta-ification) with OCaml code. However, we have a strong limitation on 32-bit architectures.

Indeed, the current implementation of Rabin's fingerprint is tied to the unsigned int32 representation. I used @dbuenzli's hack (to compare two unsigned int32 values) from mtime (see the sketch below). However, when I want to cast an unsigned int32 to a native integer on a 32-bit architecture, it's possible that the value cannot be stored in a native integer - I raise a Failure in this case.

But this case should not happen, because it can appear only when we access some specific arrays, which have other assertions that rule this case out.
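For reference, the comparison trick looks like this (this exact formulation is a sketch, not a copy of mtime):

(* Flip the sign bit of both operands, then compare as signed int32:
   this maps the unsigned order onto the signed order. *)
let unsigned_compare (a : int32) (b : int32) =
  Int32.compare (Int32.logxor a Int32.min_int) (Int32.logxor b Int32.min_int)

(* unsigned_compare 0xFFFFFFFFl 1l > 0: as a signed value 0xFFFFFFFFl
   is -1l, but as an unsigned value it is the greatest uint32. *)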

@samoht
Member

samoht commented Aug 21, 2017

Thanks for all the updates! Is the PR in a state where I can review it again?

Regarding locks: I agree that the second solution seems more elegant. About the in-memory implementation: it is pretty useful for testing purposes. I fully agree with modifying the API to be non-blocking for values, and adding atomic/locking operations for references only. I think we should also keep some convenience functions to read/write full values.

Regarding pack files and in-memory stores: I don't understand what the current limitation is: are you able to read objects in the pack but only write loose objects? If yes, do you have a way to re-compact the pack files using the in-memory backend? I agree that this is a bit of a stretch use-case (it doesn't matter much to control the memory consumption, as everything will stay in memory anyway) but it would be nice if it could somehow work :-)

dinosaure added some commits Sep 5, 2017

Change the internal Hashtbl of the memory back-end.
This change adds the kind information of the Git object
to the internal Hashtbl of the memory back-end. When we
return the inflated Git object, we return its kind alongside.
Break the recursion (return Cont) to avoid stack overflows in a JavaScript world.

This change goes deeper: we change decompress (which already has
the trampoline pattern) and angstrom (where I opened an issue).
The compilation of ocaml-git and its use in a web browser work,
but keep in mind that Git in your browser is just for fun.
Use a Minimal interface of the Git repository
This is the first draft of the minimal common interface
between the memory back-end and the store (file-system related) back-end.
Implementation of the Smart HTTP protocol
This implementation is not complete (pull and push).
@dinosaure
Member

dinosaure commented Sep 5, 2017

At this point, I have implemented a small common API between the memory back-end and the store (file-system related) back-end.

It's time to think about whether it's the best one, comparing it to the store.mli interface (which has more functions). This interface will then be used to build the Object_graph and Search modules first, then the Smart and Smart_http protocols (with the negotiation engine and the Revision DSL).

@samoht
Member

samoht commented Sep 5, 2017

Thanks!

The minimal API looks great. About the size function: would it be possible to get the inflated size instead? Usually you want to pre-allocate some structure before reading the contents; in that case I am not sure what to do with the deflated size.

The store API looks OK. I am not sure I will need to use all these functions, but I guess it is nice to have them just in case :-) How do you feed data to the repository/pack file state? Is the internal decoder similar to jsonm/xmlm (with an Await state)?

Feel free to change the Object_graph and Search APIs as much as you want; I think they are just there to allow easy testing of the library. All the complex searches on the Git DAG should be handled by Irmin anyway.

@samoht
Member

samoht commented Dec 1, 2017

The tests on 4.04 seem to pass now (yay!). The errors on 4.03 should be fixed by ocaml/opam-repository#10897

samoht and others added some commits Dec 1, 2017

Fix case of Git_unix.FS
(to try to keep some kind of backward compatibility)
Code cleanups
Try to remove unused type signatures: use Fpath directly instead of an abstract
PATH, use Lwt.t instead of io, etc.
More cleanups
FileSystem -> FS
exists -> mem
@samoht
Member

samoht commented Dec 2, 2017

So now all the tests pass on Travis; the remaining failures are due to missing constraints in the revdeps. These should be fixed by ocaml/opam-repository#10904 (this is not a blocker for the merge).

The tests are failing on Windows for git-mirage with:

- [exec] cmd /d /v:off /c rd /s /q "test-git-mirage-store"
- The system cannot find the file specified.
- [exec] error 2

which I guess is just a missing Sys.file_exists check somewhere in the test init code.

Also, the git-unix tests are failing on Windows:

[ERROR]             Unix Store             4   Operations on references.
...
#=== ERROR while compiling git-unix.dev =======================================#
# opam-version         1.3.0~dev (2c2399d98ad0ebcb43a4bc856f3636030bb376db)
# os                   win32
# command              jbuilder runtest -p git-unix
# path                 C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev
# exit-code            1
# env-file             C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev\git-unix-288-d2c37b.env
# stdout-file          C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev\git-unix-288-d2c37b.out
# stderr-file          C:/cygwin64/home/appveyor/.opam/4.03.0+mingw64c/build/git-unix.dev\git-unix-288-d2c37b.err
### stderr ###
# [...]
# ASSERT r1
# --------------------------------------------------------------------------------
# --------------------------------------------------------------------------------
# ASSERT r2
# --------------------------------------------------------------------------------
# [exception] "Assert_failure src/git/reference.ml:309:8"
# Raised at file "bytes.ml", line 219, characters 25-34
# Called from file "src0/sexp.ml", line 90, characters 13-47
# 
# 
# The full test results are available in `_build/_tests`.
# 1 error! in 0.094s. 16 tests run.
@samoht
Member

samoht commented Dec 2, 2017

More revdeps fixes here: ocaml/opam-repository#10910

@samoht
Member

samoht commented Dec 3, 2017

The current status is:

  • all the test runs are green on Linux / Travis CI
  • the test runs are green on AppVeyor / Windows for git-unix, but they fail for an unknown reason for git-mirage. It is very hard to investigate the Windows error without a Windows environment, which neither @dinosaure nor I have at the moment, so let's say this is not a blocker for the merge but it should be fixed ASAP.

So I am happy to merge this PR in its current state. Thank you very much @dinosaure for all that hard work. The new ocaml-git (which I will probably bump to 2.0) is a great piece of code :-)

@samoht samoht merged commit 35c72dc into mirage:master Dec 3, 2017

0 of 2 checks passed

continuous-integration/travis-ci/pr - The Travis CI build could not complete due to an error
continuous-integration/appveyor/pr - AppVeyor build failed