
Include an even-higher level interface? #115

Closed
brendanlong opened this issue Mar 13, 2021 · 6 comments

Comments

@brendanlong

I had some trouble using this library since the higher-level interface is undocumented and still too low-level for what I'm trying to do (decompress some data that I have in memory).

I eventually came up with:

let inflate_string ?(buffer_size_bytes = 100000) data =
  let input = Bigstring.of_string data
  and buffer = Bigstring.create buffer_size_bytes
  and output = Queue.create () in
  (* the whole input is already in memory, so [refill] just reports its length *)
  let refill s = Bigstring.length s
  (* [flush] copies each decompressed chunk out of the shared buffer *)
  and flush s n = Queue.enqueue output (Bigstring.sub s ~pos:0 ~len:n) in
  Gz.Higher.uncompress ~refill ~flush input buffer
  |> Result.map ~f:(fun _ ->
         Queue.to_list output |> Bigstring.concat |> Bigstring.to_string)

Would it make sense to include something like this in the code, or at least give an example in the documentation?

@brendanlong
Author

Not sure if you want to go there, but it would also be useful to have an interface for inflating an async input, like:

val inflate_pipe : string Async.Pipe.t -> string Async.Pipe.t

@dinosaure
Member

dinosaure commented Mar 14, 2021

Hi, thanks for trying out decompress!

This question about an even simpler interface is well known (see #25). I agree with you that the documentation is not the best, and we definitely need to show a simple example like to_string/of_string, as you said.

However, I would not like to provide such a function in the package, because of the responsibilities it implies. The memory/ratio/speed trade-off remains for any kind of input, and to me it's important to let the end-user control those details.

  • For example, an inflate_string : string -> string needs to choose the length of the output buffer arbitrarily, and we have no clue about that length in advance. We could assume the usual compression ratio of 1/3 and allocate a buffer such as Bigstringaf.create (String.length input * 3), but that's not reliable: sometimes the needed output buffer is smaller than the input, and sometimes it is much bigger (a ratio of 1/5).
  • Such a situation can be handled with a Queue, as you did, with a simple Buffer, or even with a rope. In any case, several choices appear and none of them is perfect. A queue fits best for a fully unknowable input; a buffer can take advantage of a hypothetical length hint; a rope has the same advantage as the queue. However, with most of them we lose the benefit of the underlying Bigarray and must copy between Bigarray and string (unless the data structure is abstract enough to manipulate Bigarray directly).
  • Again, the string/Bigarray question is not clear-cut. We know that, in the context of inflation/deflation, it's fair to use Bigarray, but that's not necessarily true for upstream users. In some contexts, for small objects, it's better to manipulate string and reuse an underlying Bigarray (due to the cost of allocation). In other contexts, it's better to keep Bigarray so as to be able to do further processing (such as computing a hash) and take advantage of the fact that a Bigarray is non-relocatable.
  • Finally, maybe you just want to print out the output. In that context, it's definitely better to allocate one large output buffer (independently of the input length) and write it out via a syscall.
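To illustrate the Buffer strategy mentioned in the second bullet, a minimal sketch could look like the following. This mirrors the call shape of the snippet at the top of the thread but accumulates into a stdlib Buffer; the `Bigstringaf` helpers come from decompress's own dependency, and the size hints (`0x1000`, the 1/3 ratio guess) are arbitrary assumptions, not recommendations:

```ocaml
(* Sketch of the Buffer-based accumulation strategy, mirroring the
   earlier Queue-based snippet. Buffer sizes are arbitrary guesses. *)
let inflate_string_buf ?(buffer_size_bytes = 0x1000) data =
  let input = Bigstringaf.of_string data ~off:0 ~len:(String.length data)
  and tmp = Bigstringaf.create buffer_size_bytes
  (* initial hint only: Buffer grows geometrically when the 1/3
     ratio guess turns out to be wrong *)
  and output = Buffer.create (String.length data * 3) in
  (* the whole input is in memory, so refill reports its length once *)
  let refill bs = Bigstringaf.length bs
  and flush bs len =
    (* copy out of the shared bigarray before decompress reuses it *)
    Buffer.add_string output (Bigstringaf.substring bs ~off:0 ~len)
  in
  Gz.Higher.uncompress ~refill ~flush input tmp
  |> Result.map (fun _ -> Buffer.contents output)
```

As the bullets note, this trades the zero-copy benefit of Bigarray for the convenience of a growable string accumulator.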

So, several questions arise, and no single answer fits every use. If we provide such a function, we take responsibility for its implementation, and, as I explained, such an implementation involves several choices. We know that these choices can be really bad for some use-cases, so to avoid such problems we prefer to stick, at least, to the interface provided by camlzip.

The current Higher module gives full control over these details. Even if most of them could be replaced by default values, we don't want to push the user towards a naive use of decompress; that's where we should provide better documentation! Default values imply several hidden choices, and we don't want to hide the magic too much, nor take responsibility for those choices.

Another aspect is time: decompress is a full re-implementation of zlib (and gzip), so we are more focused on algorithms/speed/memory consumption than on a simpler interface. For that purpose, you can take a look at ezgzip, which experiments on top of decompress. I think it's a better place if you want a simpler API.

And about async: decompress is a bedrock of several other MirageOS-compatible projects, so we have a strict constraint on dependencies (no async, no lwt, no unix).

I will try to improve the documentation next week, and if you want to participate, I will be happy to integrate your contributions 👍.

@brendanlong
Author

This comment says there's documentation somewhere on how to write a simple inflate/deflate function: #25 (comment)

I think just linking to where that is from this section of the readme would help a lot: https://github.com/mirage/decompress#higher-interface

I can understand the difficulty of making trade-offs here, but I think it's generally fine to just pick something "good enough" for a high-level interface and explain in the documentation/comments why people might want to use a lower-level version. I can understand leaving that for a higher-level library though, although in that case you might want to mention in the docs that people looking for a higher-level interface could try [insert library here].

Note: Based on what you said above, I ended up making my high-level interface look like:

let inflate_bigstring_to_bigbuffer
    ?(buffer_size_bytes = 10000)
    ?(expected_compression_ratio = 3)
    input
  =
  let buffer = Bigstring.create buffer_size_bytes
  and output = Bigbuffer.create (Bigstring.length input * expected_compression_ratio) in
  let refill s = Bigstring.length s
  and flush s n = Bigbuffer.add_bigstring output (Bigstring.sub_shared s ~pos:0 ~len:n) in
  Gz.Higher.uncompress ~refill ~flush input buffer |> Result.map ~f:(fun _ -> output)
;;

let inflate_bigstring ?buffer_size_bytes ?expected_compression_ratio input =
  inflate_bigstring_to_bigbuffer ?buffer_size_bytes ?expected_compression_ratio input
  |> Result.map ~f:Bigbuffer.big_contents
;;

let inflate_string ?buffer_size_bytes ?expected_compression_ratio input =
  Bigstring.of_string input
  |> inflate_bigstring_to_bigbuffer ?buffer_size_bytes ?expected_compression_ratio
  |> Result.map ~f:Bigbuffer.contents
;;

I think the biggest downside of this is that it uses Bigbuffer from Core_kernel, but that lets it be fairly efficient, except for a copy at the end to get the contents.

@dinosaure
Member

  and flush s n = Bigbuffer.add_bigstring output (Bigstring.sub_shared s ~pos:0 ~len:n) in

I'm not sure about sub_shared. Concretely, s is buffer: decompress never allocates temporary buffers, so you should copy bytes from s/buffer into output. I don't know the semantics of Bigbuffer.add_bigstring (whether it copies or not), but if it re-uses s, the output will be wrong.
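If `Bigbuffer.add_bigstring` did not copy, the `flush` callback would need an explicit copy before returning, since decompress reuses the same output buffer on the next round. A minimal sketch of such a copying `flush`, using `Bigstringaf` (a decompress dependency; the `output` queue here is purely illustrative):

```ocaml
(* Sketch: copy the flushed bytes out of [s] explicitly, so the
   accumulated chunks never alias decompress's reused buffer. *)
let output : Bigstringaf.t Queue.t = Queue.create ()

let flush s len =
  (* Bigstringaf.copy allocates a fresh bigarray and blits [len]
     bytes into it, so [output] holds independent chunks *)
  Queue.add (Bigstringaf.copy s ~off:0 ~len) output
```

In the snippet under discussion this is unnecessary precisely because `Bigbuffer.add_bigstring` performs the copy itself.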

Then: I added documentation, added examples and clarified some aspects of decompress; you can see it here: #116

@brendanlong
Author

Bigbuffer.add_bigstring copies the input:

https://github.com/janestreet/core_kernel/blob/0ca444d236acc08bbb52e50c67af69a3fa7f8bd7/src/bigbuffer.ml#L100

So I think what I was doing makes sense (I'm letting the function copy the substring without doing an additional intermediate copy).

Thanks for adding more documentation to this!

@dinosaure
Member

Thanks for the feedback 👍.

kit-ty-kate pushed a commit to ocaml/opam-repository that referenced this issue Apr 22, 2021
CHANGES:

- Add a well-known limitation about encoding to the documentation: the
  output buffer must be larger than 2 bytes in all cases.
  (@dinosaure, mirage/decompress#114)
- Improve the documentation
  (@dinosaure, @brendanlong, mirage/decompress#115)
- **breaking change**: the type of the window used to deflate according to
  RFC 1951 was changed to `De.Lz77.window`
  (@dinosaure, mirage/decompress#116 & mirage/decompress#115)
- Fix a bug when adding the EOB _op-code_ into the queue; the deflation
  now takes care of that. Note that the queue must be larger than 2.
  (@dinosaure, @kluvin, mirage/decompress#117)
- Improve the documentation a bit
  (@mseri, @dinosaure, mirage/decompress#119)
- Fix the use of `optint` and how we handle large files with `Gz`. Fix an error
  when we encode the `isize` into the deflated stream.
  (@dinosaure, @igarnier, mirage/decompress#121 & mirage/decompress#120)
- Add a _non-stream_ implementation (@clecat, @dinosaure, mirage/decompress#102 & mirage/decompress#92)

  The non-blocking stream API has a cost: maintaining a _state_ across _syscalls_
  such as `read` and `write`. It's useful when we want to plug `decompress`
  behind something like a `socket` and care about memory consumption, but it has
  a big cost when we want to compress/decompress an object stored in one
  single buffer.

  The _non-stream_ API gives the opportunity to inflate/deflate one single
  buffer without the usual plumbing required by the non-blocking stream API.
  However, we are limited to objects which can fit into a `bigarray`.

  About performance, the non-stream API is better than the non-blocking stream
  API; see the PR for more details. On the
  `book2` file (from the Calgary corpus):
  - `decompress` (stream):
     15 Mb/s (deflation), 76 Mb/s (inflation), ratio: 42.46 %
  - `decompress` (non-stream):
     17 Mb/s (deflation), 105 Mb/s (inflation), ratio: 34.66 %

  Even if we checked the implementation with our tests (we ran `ocaml-git` and
  `irmin` with this patch), the implementation is young and we probably missed
  some details/bugs. So we advise users to compare, at least, the non-stream
  implementation with the non-blocking stream implementation if something seems
  wrong.