
Include an even-higher level interface? #115

Closed
brendanlong opened this issue Mar 13, 2021 · 6 comments

Comments

@brendanlong

I had some trouble using this library since the higher-level interface is undocumented and still too low-level for what I'm trying to do (decompress some data that I have in memory).

I eventually came up with:

let inflate_string ?(buffer_size_bytes = 100000) data =
  let input = Bigstring.of_string data
  and buffer = Bigstring.create buffer_size_bytes
  and output = Queue.create () in
  (* the whole input is already in memory, so [refill] just reports its length *)
  let refill s = Bigstring.length s
  (* [flush] copies each decompressed chunk out of the shared buffer *)
  and flush s n = Queue.enqueue output (Bigstring.sub s ~pos:0 ~len:n) in
  Gz.Higher.uncompress ~refill ~flush input buffer
  |> Result.map ~f:(fun _ ->
         Queue.to_list output |> Bigstring.concat |> Bigstring.to_string)

Would it make sense to include something like this in the code, or at least give an example in the documentation?

@brendanlong
Author

Not sure if you want to go there, but it would also be useful to have an interface for inflating an async input, like:

val inflate_pipe : string Async.Pipe.t -> string Async.Pipe.t

@dinosaure
Member

dinosaure commented Mar 14, 2021

Hi, thanks for trying out decompress!

This question about an even simpler interface is well known (see #25). I agree with you that the documentation is not the best, and we definitely need to show a simple example like to_string/of_string, as you said.

However, I would not like to provide such a function in the package, because of the responsibilities it implies. The memory/ratio/speed trade-off remains for any kind of input, and to me it's important to let the end-user control those details.

  • For example, an inflate_string : string -> string needs to choose the length of the output buffer arbitrarily, and we have no clue about that length in advance. We could assume the usual compression ratio of 1/3 and allocate a buffer such as Bigstringaf.create (String.length input * 3), but that's not reliable: sometimes the needed output buffer is smaller than the input, and sometimes it is much bigger (a ratio of 1/5).
  • Such a situation can be handled with a Queue, as you did, with a simple Buffer, or even with a rope. In any case, several choices appear and none of them is perfect. A queue fits best for a fully unknowable input; a buffer can take advantage of a hypothetical length hint; a rope has the same advantage as the queue. However, with most of them we lose the benefit of the underlying Bigarray and must copy between Bigarray and string (unless the data structure is abstract enough to manipulate Bigarray directly).
  • Again, the string/Bigarray question is not clear-cut. We know that, in the context of inflation/deflation, it's fair to use Bigarray, but that's not necessarily true for upstream users. In some contexts, for small objects, it's better to manipulate string and reuse an underlying Bigarray (due to the cost of allocation). In other contexts, it's better to keep Bigarray so as to be able to do further processing (such as computing a hash) and take advantage of the fact that a Bigarray is non-relocatable.
  • Finally, maybe you just want to print out the output. In that context, it's definitely better to allocate one large output buffer (independently of the input length) and write it out via a syscall.
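To illustrate the Buffer strategy mentioned in the second bullet, a minimal sketch could look like the following. This mirrors the call shape of the snippet at the top of the thread but accumulates into a stdlib Buffer; the `Bigstringaf` helpers come from decompress's own dependency, and the size hints (`0x1000`, the 1/3 ratio guess) are arbitrary assumptions, not recommendations:

```ocaml
(* Sketch of the Buffer-based accumulation strategy, mirroring the
   earlier Queue-based snippet. Buffer sizes are arbitrary guesses. *)
let inflate_string_buf ?(buffer_size_bytes = 0x1000) data =
  let input = Bigstringaf.of_string data ~off:0 ~len:(String.length data)
  and tmp = Bigstringaf.create buffer_size_bytes
  (* initial hint only: Buffer grows geometrically when the 1/3
     ratio guess turns out to be wrong *)
  and output = Buffer.create (String.length data * 3) in
  (* the whole input is in memory, so refill reports its length once *)
  let refill bs = Bigstringaf.length bs
  and flush bs len =
    (* copy out of the shared bigarray before decompress reuses it *)
    Buffer.add_string output (Bigstringaf.substring bs ~off:0 ~len)
  in
  Gz.Higher.uncompress ~refill ~flush input tmp
  |> Result.map (fun _ -> Buffer.contents output)
```

As the bullets note, this trades the zero-copy benefit of Bigarray for the convenience of a growable string accumulator.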

So, several questions arise, and no single answer fits every use. If we provide such a function, we take responsibility for its implementation, and, as I explained, such an implementation involves several choices. We know that these choices can be really bad for some use-cases, so to avoid such problems we prefer to stick, at least, to the interface provided by camlzip.

The current Higher module gives full control over these details. Even if most of them could be replaced by default values, we don't want to push the user towards a naive use of decompress; that's where we should provide better documentation! Default values imply several hidden choices, and we don't want to hide the magic too much, nor take responsibility for those choices.

Another aspect is time: decompress is a full re-implementation of zlib (and gzip), so we are more focused on algorithms/speed/memory consumption than on a simpler interface. For that purpose, you can take a look at ezgzip, which experiments on top of decompress. I think it's a better place if you want a simpler API.

And about async: decompress is a bedrock of several other MirageOS-compatible projects, so we have a strict constraint on dependencies (no async, no lwt, no unix).

I will try to improve the documentation next week, and if you want to participate, I will be happy to integrate your contributions 👍.

@brendanlong
Author

This comment says there's documentation somewhere on how to write a simple inflate/deflate function: #25 (comment)

I think just linking to where that is from this section of the readme would help a lot: https://github.com/mirage/decompress#higher-interface

I can understand the difficulty of making trade-offs here, but I think it's generally fine to just pick something "good enough" for a high-level interface and explain in the documentation/comments why people might want to use a lower-level version. I can understand leaving that for a higher-level library though, although in that case you might want to mention in the docs that people looking for a higher-level interface could try [insert library here].

Note: Based on what you said above, I ended up making my high-level interface look like:

let inflate_bigstring_to_bigbuffer
    ?(buffer_size_bytes = 10000)
    ?(expected_compression_ratio = 3)
    input
  =
  let buffer = Bigstring.create buffer_size_bytes
  and output = Bigbuffer.create (Bigstring.length input * expected_compression_ratio) in
  let refill s = Bigstring.length s
  and flush s n = Bigbuffer.add_bigstring output (Bigstring.sub_shared s ~pos:0 ~len:n) in
  Gz.Higher.uncompress ~refill ~flush input buffer |> Result.map ~f:(fun _ -> output)
;;

let inflate_bigstring ?buffer_size_bytes ?expected_compression_ratio input =
  inflate_bigstring_to_bigbuffer ?buffer_size_bytes ?expected_compression_ratio input
  |> Result.map ~f:Bigbuffer.big_contents
;;

let inflate_string ?buffer_size_bytes ?expected_compression_ratio input =
  Bigstring.of_string input
  |> inflate_bigstring_to_bigbuffer ?buffer_size_bytes ?expected_compression_ratio
  |> Result.map ~f:Bigbuffer.contents
;;

I think the biggest downside of this is that it uses Bigbuffer from Core_kernel, but that lets it be fairly efficient, except for a copy at the end to get the contents.

@dinosaure
Member

  and flush s n = Bigbuffer.add_bigstring output (Bigstring.sub_shared s ~pos:0 ~len:n) in

I'm not sure about sub_shared. Concretely, s is buffer: decompress never allocates temporary buffers, so you should copy bytes from s/buffer into output. I don't know the semantics of Bigbuffer.add_bigstring (whether it copies or not), but if it re-uses s, the output will be wrong.
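If `Bigbuffer.add_bigstring` did not copy, the `flush` callback would need an explicit copy before returning, since decompress reuses the same output buffer on the next round. A minimal sketch of such a copying `flush`, using `Bigstringaf` (a decompress dependency; the `output` queue here is purely illustrative):

```ocaml
(* Sketch: copy the flushed bytes out of [s] explicitly, so the
   accumulated chunks never alias decompress's reused buffer. *)
let output : Bigstringaf.t Queue.t = Queue.create ()

let flush s len =
  (* Bigstringaf.copy allocates a fresh bigarray and blits [len]
     bytes into it, so [output] holds independent chunks *)
  Queue.add (Bigstringaf.copy s ~off:0 ~len) output
```

In the snippet under discussion this is unnecessary precisely because `Bigbuffer.add_bigstring` performs the copy itself.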

Then: I added documentation, added examples and clarified some aspects of decompress; you can see it here: #116

@brendanlong
Author

Bigbuffer.add_bigstring copies the input:

https://github.com/janestreet/core_kernel/blob/0ca444d236acc08bbb52e50c67af69a3fa7f8bd7/src/bigbuffer.ml#L100

So I think what I was doing makes sense (I'm letting the function copy the substring without doing an additional intermediate copy).

Thanks for adding more documentation to this!

@dinosaure
Member

Thanks for the feedback 👍.

kit-ty-kate pushed a commit to ocaml/opam-repository that referenced this issue Apr 22, 2021
CHANGES:

- Add a well-known limitation about encoding to the documentation: the
  output buffer must be larger than 2 bytes in all cases.
  (@dinosaure, mirage/decompress#114)
- Improve the documentation
  (@dinosaure, @brendanlong, mirage/decompress#115)
- **breaking change**: the type of the window used to deflate according to
  RFC 1951 was changed to `De.Lz77.window`
  (@dinosaure, mirage/decompress#116 & mirage/decompress#115)
- Fix a bug when adding the EOB _op-code_ into the queue; the deflation
  now takes care of that. Note that the queue must be larger than 2.
  (@dinosaure, @kluvin, mirage/decompress#117)
- Improve the documentation a bit
  (@mseri, @dinosaure, mirage/decompress#119)
- Fix the use of `optint` and how we handle large files with `Gz`. Fix an error
  when we encode the `isize` into the deflated stream.
  (@dinosaure, @igarnier, mirage/decompress#121 & mirage/decompress#120)
- Add a _non-stream_ implementation (@clecat, @dinosaure, mirage/decompress#102 & mirage/decompress#92)

  The non-blocking stream API has a cost: maintaining a _state_ across _syscalls_
  such as `read` and `write`. It's useful when we want to plug `decompress`
  behind something like a `socket` and care about memory consumption, but it has
  a big cost when we want to compress/decompress an object stored in one
  single buffer.

  The _non-stream_ API gives the opportunity to inflate/deflate one single
  buffer without the usual plumbing required by the non-blocking stream API.
  However, we are limited to objects which can fit into a `bigarray`.

  About performance, the non-stream API is better than the non-blocking stream
  API; see the PR for more details. On the
  `book2` file (from the Calgary corpus):
  - `decompress` (stream):
     15 Mb/s (deflation), 76 Mb/s (inflation), ratio: 42.46 %
  - `decompress` (non-stream):
     17 Mb/s (deflation), 105 Mb/s (inflation), ratio: 34.66 %

  Even if we checked the implementation with our tests (we ran `ocaml-git` and
  `irmin` with this patch), the implementation is young and we probably missed
  some details/bugs. So we advise users to compare, at least, the non-stream
  implementation with the non-blocking stream implementation if something seems
  wrong.