
Functions to read/write binary representations of numbers #1864

Merged: 20 commits from alainfrisch:binary_numbers into ocaml:trunk on Nov 9, 2018 (9 participants)
@alainfrisch (Contributor) commented Jun 26, 2018

MPR#5771 introduced primitives to read/write binary representations of 16, 32 and 64-bit integers directly to/from (big)strings (and now bytes), which is very useful for any program that needs to serialize numbers efficiently. Those primitives are exposed only in the ocplib-endian library, but I believe most people would expect to find the functionality in the stdlib, and this is easy since the hard work of supporting the primitives in the compiler and runtime system is behind us. Moreover, extending Buffer directly allows for more efficient functions than one could get with an external library.

This PR is an attempt to launch the discussion around the addition of such features. It is far from ready to be merged! Currently, it adds:

  • Functions to read/write 16-, 32- and 64-bit "binary" integers to bytes (only versions with bound checks are exposed).
  • Functions to append 16-, 32- and 64-bit "binary" integers to Buffer.t, as well as single-precision and double-precision floating-point numbers.

These functions all use the native endianness of the current machine. This might be problematic for some use cases, but is perfectly fine for others (and better, performance-wise), and one could consider adding versions with explicit endianness (as ocplib-endian does).

Missing:

  • Functions to read/write floats to/from bytes values.
  • Functions to read numbers from string values.
  • Functions to read/write floats to/from bigarrays of chars.
  • Variants without bound checks.
  • Perhaps also functions to read/write from/to channels.

The performance benefits of the new functions can be illustrated with the following micro-benchmark:

let () =
  (* Fill a float array with 1000 values. *)
  let a = Float.Array.create 1000 in
  for j = 0 to Float.Array.length a - 1 do
    Float.Array.set a j (float j)
  done;
  (* Serialize the whole array into a reused buffer, 100_000 times. *)
  let b = Buffer.create 100 in
  for i = 1 to 100000 do
    Buffer.clear b;
    for j = 0 to Float.Array.length a - 1 do
      Buffer.add_float64_bin b (Float.Array.get a j)
    done;
  done

and a variant of it using the following hand-written function instead:

(* Hand-written equivalent: write the 8 bytes of the IEEE-754 representation
   one character at a time, least-significant byte first. *)
let add_float64_bin b x =
  let i = Int64.bits_of_float x in
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand i 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*1)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*2)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*3)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*4)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*5)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*6)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*7)) 0xffL)))

The version using the new function runs about 6.5 times faster on my machine.

@alainfrisch added the stdlib label on Jun 26, 2018

@nojb (Contributor) commented Jun 26, 2018

I strongly support adding these functions to the stdlib.

  • I think adding little-endian and big-endian variants would be quite useful: a very common use case for these functions is interfacing with binary protocols and/or formats, in which case one always needs a specific endianness. Suffixes _le and _be could be used. Adding all three versions (little endian, big endian, and host endian) may be optimal but a bit excessive.
  • For the case of bigstrings (1-dimensional bigarrays), I think that adding a dedicated Bigstring module makes sense, containing roughly the union of operations on bytes and Bigarray. See also MPR#7772.
  • For naming, is the suffix _bin really necessary? What about introducing a signature
module type S = sig
  type t
  val get_int16: t -> int
  val get_int32: t -> int32
  val get_int64: t -> int64
  val set_int16: t -> int -> unit
  val set_int32: t -> int32 -> unit
  val set_int64: t -> int64 -> unit
  (* etc *)
end

and define modules Bytes.BE : S with t := bytes, Bytes.LE : S with t := bytes, same for Buffer (with t := Buffer.t, etc.), and so on. Unfortunately, this signature does not work for the case of channels, where the type t is different for the get and set functions; maybe splitting the signature into an "input" part and an "output" part would be better, and would also handle the case of string.
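A minimal sketch of this grouping, assuming an index argument is added to the signature above (the accessors need an offset in practice) and using the per-endianness Bytes accessors this PR introduces; the module name BytesLE is illustrative only:

module type S = sig
  type t
  val get_int32 : t -> int -> int32
  val set_int32 : t -> int -> int32 -> unit
end

(* One possible instance, built on the little-endian Bytes accessors. *)
module BytesLE : S with type t := bytes = struct
  let get_int32 = Bytes.get_int32_le
  let set_int32 = Bytes.set_int32_le
end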

@alainfrisch (Contributor, Author) commented Jun 26, 2018

  • I tend to agree that exposing versions with explicit endianness would be good, but I think it's good to keep the version with the native endianness of the machine (which is useful e.g. for inter-process communication on the same machine, or to interact with buffers containing native "C" data). The number of functions is a bit worrying, though, between int16/int32/int64/float32/float64, le/be/native, string/bytes/bigstring/buffer(/channel), possibly safe/unsafe.

  • An alternative could be to expose fewer functions, taking arguments, and hope/check that the inliner will be good enough to eliminate the overhead (a sketch of this approach follows this list). E.g.:

type _ number_ty = Int16: int number_ty | Int32: int32 number_ty | ...
val Buffer.add_number: ?endian:[`LE|`BE](*default:Native*) -> Buffer.t -> 'a number_ty -> 'a -> unit  
  • About the _bin suffix: I had in mind Stdlib.print_int and co, working with the ASCII decimal representation. But perhaps we don't care.

  • I forgot about input_binary_int/output_binary_int, which we already have: big-endian, 4 bytes, but working with int, not int32.
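A minimal sketch of that GADT-based alternative, assuming the per-width Buffer.add_int*_ne functions this PR adds; the names number_ty and add_number are illustrative, not a settled API:

type _ number_ty =
  | Int16 : int number_ty
  | Int32 : int32 number_ty
  | Int64 : int64 number_ty

(* Dispatch on the GADT witness; each branch calls the matching fixed-width writer. *)
let add_number : type a. Buffer.t -> a number_ty -> a -> unit =
 fun b ty x ->
  match ty with
  | Int16 -> Buffer.add_int16_ne b x
  | Int32 -> Buffer.add_int32_ne b x
  | Int64 -> Buffer.add_int64_ne b x

(* Usage: add_number buf Int32 42l *)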

@alainfrisch (Contributor, Author) commented Jun 27, 2018

Yet another approach would be to expose only functions using the native endianness, plus:

 val Int32.Endian.little: int32 -> int32
   (** Modify the value, if needed, so that functions such as Bytes.set_int32 produce a little-endian
        representation of the integer, independently of the native encoding of the host machine. Similarly,
        this function can be applied to the result of functions such as Bytes.get_int32 when reading
        a 32-bit integer in little-endian. *)

 val Int32.Endian.big: int32 -> int32
    (** .... *)

(and same for 16-bit and 64-bit)
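A minimal sketch of what such conversions could look like for 32-bit values, assuming the hypothetical names little and big; swap32 is an ordinary byte swap written with existing Int32 operations:

let swap32 (x : int32) : int32 =
  let open Int32 in
  logor
    (shift_left (logand x 0xffl) 24)
    (logor
       (shift_left (logand (shift_right_logical x 8) 0xffl) 16)
       (logor
          (shift_left (logand (shift_right_logical x 16) 0xffl) 8)
          (logand (shift_right_logical x 24) 0xffl)))

(* Identity on a little-endian host, byte swap on a big-endian one. *)
let little x = if Sys.big_endian then swap32 x else x

(* Identity on a big-endian host, byte swap on a little-endian one. *)
let big x = if Sys.big_endian then x else swap32 x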

@xclerc (Contributor) commented Jun 27, 2018

This looks like a very useful addition.

Even if unsafe versions are not exposed, I think it would make sense to avoid some bound checks in Buffer.add_xyz (they are avoided in Buffer.add_char by using Bytes.unsafe_set rather than Bytes.set).

(I would also consider adding add_int8/get_int8/set_int8. They would of course be mere aliases to existing functions, but would allow one to write more uniform code when working on file formats or network protocols.)
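A sketch of such int8 aliases, written on top of existing Bytes operations; the names follow the get_/set_ scheme discussed in this thread but are illustrative only:

let get_uint8 b i = Char.code (Bytes.get b i)

let get_int8 b i =
  let v = Char.code (Bytes.get b i) in
  if v >= 0x80 then v - 0x100 else v  (* sign-extend bit 7 *)

let set_int8 b i v = Bytes.set b i (Char.chr (v land 0xff))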

@dbuenzli (Contributor) commented Jun 27, 2018

In my experience what you need most of the time is the explicit endianness (network, codecs, etc.), and it's less error prone if the native endianness has an explicit suffix; otherwise you tend to use it without thinking about the issue (or it will be the first one to auto-complete...) and you likely end up with a bug.

@alainfrisch I think your last suggestion would make things more confusing than enlightening. Personally I don't mind the number of functions. I think this should be simply:

{add,get,set}_int16_{ne,be,le} 
{add,get,set}_int32_{ne,be,le}
{add,get,set}_int64_{ne,be,le}

I don't think it's worth adding unsafe versions; people willing to go into unsafe territory can play the primitive game. I don't think it's worth adding the float versions: they are an {Int64,Int32}.{bits_of_float,float_of_bits} away.
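For illustration, the float case is indeed a one-liner on top of the integer writers proposed here (a sketch, assuming the Buffer.add_int64_le/add_int32_le functions discussed in this thread):

(* Write a double (resp. single) precision float in little-endian order by
   going through its bit-level integer representation. *)
let add_float64_le b x = Buffer.add_int64_le b (Int64.bits_of_float x)
let add_float32_le b x = Buffer.add_int32_le b (Int32.bits_of_float x)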

@alainfrisch (Contributor, Author) commented Jun 27, 2018

I don't think it's worth adding the float versions: they are an {Int64,Int32}.{bits_of_float,float_of_bits} away.

I was concerned that this would cause extra boxing of the intermediate int64 if the function (e.g. Buffer.add_int64) is not inlined. To be checked.

@alainfrisch (Contributor, Author) commented Jun 27, 2018

OK, it seems fine to let the user do that float->int64 mapping (checked only with the "Closure" inliner, but I'm sure flambda will be OK), with default inlining settings.

@alainfrisch (Contributor, Author) commented Jun 27, 2018

I also checked that the test on Sys.big_endian is eliminated in e.g. Buffer.add_int64_le, and that the performance is the same as Buffer.add_int64_ne (on a little-endian system).

For the naming: are _be, _le, _ne good enough?

Concerning the scope of this PR:

  • I suggest keeping channel operations for later.
  • Bigstring, which would be a whole new module, would deserve its own PR.
  • Unsafe versions are indeed not needed (the impact on performance is rather small anyway).
  • Perhaps adding get_int* functions to String makes sense, to keep it in sync with Bytes, even though most people would probably use bytes anyway for parsing binary data, I guess.
@alainfrisch (Contributor, Author) commented Jun 27, 2018

One more point: it could be useful to provide both get_int16 (sign extension) and get_uint16 (always returning positive integers).
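A small illustration of the difference, assuming the 16-bit accessors discussed in this thread (Bytes.set_int16_le, Bytes.get_int16_le, Bytes.get_uint16_le):

let () =
  let b = Bytes.create 2 in
  Bytes.set_int16_le b 0 0xFFFF;
  assert (Bytes.get_uint16_le b 0 = 0xFFFF);  (* zero-extended: 65535 *)
  assert (Bytes.get_int16_le b 0 = -1)        (* sign-extended: -1 *)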

@dbuenzli (Contributor) commented Jun 27, 2018

Agreed on all points. I think it would be good to add the functions in String though.

One more point: it could be useful to provide both get_int16 (sign extension) and get_uint16 (always returning positive integers).

Yes! FWIW I have this in some unpublished code of similar functionality. It's the kind of bit-fiddling you prefer to have handled by the library.

@@ -463,6 +463,91 @@ val of_seq : char Seq.t -> t
(** Create a string from the generator
@since 4.07 *)

(** {6 Binary storage of numbers} *)

external get_int16_ne : bytes -> int -> int = "%caml_bytes_get16"

@dbuenzli (Contributor) commented Jun 27, 2018

IIRC there's no advantage to exposing this as an external at the .mli level.

@nojb (Contributor) commented Jun 27, 2018

Exposing the external in the .mli (and hence in the .cmi) allows every use of this function to compile a call to the primitive directly (no passing through a wrapper function). Same as inlining, but this works even if inlining does not take place (e.g. in bytecode).

@gasche (Member) commented Jun 27, 2018

Is this an important property, though? Because the cost is that if we ever decide to change the implementation (for example wrapping it into a bound-checking or exception-changing routine) or primitive name, we break the interface, and that seems quite high to me.

@nojb (Contributor) commented Jun 27, 2018

I am not sure I understand: is there a way to write code that would break if one changes the signature from external to val?

@damiendoligez (Member) commented Jun 27, 2018

Some people like to copy the interfaces from the stdlib for their own implementations, and they would get bitten by this.

Other people avoid this problem by include-ing the interfaces, and they get bitten every time we add a function or even an optional parameter to an existing function.

So I would like to see the externals removed from the .mlis. Bytecode speed is not an issue nowadays.

@xclerc (Contributor) commented Jun 27, 2018

I guess @gasche was referring to something like:

module F (X : sig
    external f : unit -> unit = "%ignore"
  end) = struct
end

(* Accepted: the argument declares f with the same external. *)
module OK = F (struct
    external f : unit -> unit = "%ignore"
  end)

(* Rejected: a regular let/val binding does not match an external
   declaration in the parameter signature. *)
module KO = F (struct
    let f () = ()
  end)
@pmetzger (Member) commented Jun 27, 2018

Two comments:

  1. Why not include 8 bit quantities as well? Yes, the function to do so is trivial, but it will make some protocol handling code look more uniform.
  2. About endianness, I'm always reminded of Rob Pike's insight that code that knows about endianness (other than compilers) is generally badly written. Here's his blog post on the topic. The code is all written in C but it's simple enough not to matter. https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
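For reference, the byte-order-agnostic style Pike advocates looks like this in OCaml: the stream's byte order (little-endian here) is handled by plain arithmetic on individual bytes, and the host's endianness is never consulted (a sketch, with an illustrative function name):

let read_uint32_le (s : string) (i : int) : int32 =
  (* Extract byte k of the field and widen it to int32. *)
  let byte k = Int32.of_int (Char.code s.[i + k]) in
  Int32.logor (byte 0)
    (Int32.logor
       (Int32.shift_left (byte 1) 8)
       (Int32.logor
          (Int32.shift_left (byte 2) 16)
          (Int32.shift_left (byte 3) 24)))
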
@dbuenzli (Contributor) commented Jun 27, 2018

About endianness, I'm always reminded of Rob Pike's insight that code that knows about endianness (other than compilers) is generally badly written.

I don't understand what this is supposed to bring to the discussion. Would you care to expand, maybe?

@alainfrisch (Contributor, Author) commented Jun 27, 2018

  • Are we ok with 21 new functions just in Bytes (12 get, 9 set)? Should we organize them in submodules (per endianness) as suggested by @nojb? (I generally prefer having explicit signatures.)
  • Assuming there is no runtime overhead, would it be good to pass the endianness choice (ne/le/be) as an argument, in order to divide the number of functions by 3?
  • What about adding int8/uint8 variants?
  • I believe that using Sys.int_size instead of Sys.word_size (as done in ocplib-endian) is better for JS backends, but that would need to be checked.
@jhwoodyatt commented Jun 27, 2018

For comparison purposes, I should point out that I have implemented logic for this purpose in my Orsetto project. Start here for details if you're interested.

Accordingly, I have no strong opinion on how this pull request might be evaluated. My unsolicited opinions on the various questions discussed above are summarized below:

  • unsafe interfaces are useful in the case where a higher-level interface is ensuring that data format invariants are respected.
  • byte-order should not be inferred from the system architecture when formatting data for interchange, i.e. it should always be exposed in the programming interface without a default.
  • formatting IEEE-754 floating point for interchange should provide support for binary16, binary32 and binary64 formats.
@alainfrisch (Contributor, Author) commented Jun 27, 2018

  • I've tried with
val Buffer.add_int64: [`LE|`BE|`NE] -> t -> int64 -> unit

let[@ocaml.inline] add_int64 e b x =
  let new_position = b.position + 8 in
  if new_position > b.length then resize b 8;
  unsafe_set_int64 b.buffer b.position
    (match e, Sys.big_endian with
     | `LE, true | `BE, false -> swap64 x
     | _ -> x);
  b.position <- new_position

and thanks to the inline annotation, this gives the same performance as the ad hoc functions when endianness is known statically.

  • @jhwoodyatt: do you mean that we should not expose "native endianness" functions, and let the user rely on Sys.big_endian to choose between the little/big-endian variants if they really want to use the native endianness (e.g. to interact with C data structures)? This indeed makes it more explicit which parts of the code depend on the local native endianness.
@dbuenzli (Contributor) commented Jun 27, 2018

and thanks to the inline annotation, this gives the same performance as the ad hoc functions when endianness is known statically.

I think having all the functions is in general more predictable. Except for guaranteed statically known checks done at toplevel (like Sys.big_endian), I'm a little bit wary of relying too much on compiler optimizations to get the baseline performance. Suppose I want to make something generic over endianness: with the current proposal I will abstract over the functions. With your last proposal I might abstract over [LE|NE|BE], which (I guess) might give poor performance if I'm not careful.

Personally I don't have a problem with all the additions (uint8 and int8 are also something I have in that unpublished code, if you want to add them) and I personally don't see a huge benefit in gathering those in submodules. Having everything flat, including native endianness, is nice to write and obvious to read in my opinion.
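A small sketch of "abstracting over the functions", assuming the Buffer.add_int32_{le,be} writers discussed here; the header layout and names are purely illustrative:

(* Endianness-generic code simply takes the writer as an argument. *)
let emit_header ~(add_int32 : Buffer.t -> int32 -> unit) b ~magic ~version =
  add_int32 b magic;
  add_int32 b version

let () =
  let b = Buffer.create 16 in
  emit_header ~add_int32:Buffer.add_int32_le b ~magic:0x11223344l ~version:1l;
  assert (Buffer.length b = 8)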

@jhwoodyatt commented Jun 27, 2018

@alainfrisch no, I mean that you should provide three variations and require the choice to be made explicitly in the programming interface:

  • Little-endian byte order
  • Big-endian byte order (or so-called "network byte order")
  • System architecture byte order

That's what I did, and I think it was the right choice. Another thing I did was [basically] this:

type t = [ `BE | `LE | `SE ]
val create: [< t ] -> (module Endian)

Usually, the byte order is known statically. I think if dynamic byte order is really on the menu, then it's best to use a function that selects the right module.
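A sketch of that "function that selects the right module" idea, assuming the per-endianness Bytes accessors discussed in this PR; the ENDIAN signature and create are illustrative, not Orsetto's actual API:

module type ENDIAN = sig
  val get_int32 : bytes -> int -> int32
  val set_int32 : bytes -> int -> int32 -> unit
end

module BE : ENDIAN = struct
  let get_int32 = Bytes.get_int32_be
  let set_int32 = Bytes.set_int32_be
end

module LE : ENDIAN = struct
  let get_int32 = Bytes.get_int32_le
  let set_int32 = Bytes.set_int32_le
end

(* `SE picks the system (native) byte order at run time. *)
let create : [ `BE | `LE | `SE ] -> (module ENDIAN) = function
  | `BE -> (module BE)
  | `LE -> (module LE)
  | `SE -> if Sys.big_endian then (module BE) else (module LE)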

@dbuenzli (Contributor) commented Jul 30, 2018

@alainfrisch any progress on that one? I would like to rely on this for further stdlib improvements.

@alainfrisch (Contributor, Author) commented Jul 30, 2018

Remaining points:

  • Write the doc for all functions. I'd like to get feedback on the existing text before copying them furiously to all other functions
  • Port functions from Bytes to String (once the exact design/doc for Bytes is stable)
  • Add tests
(Five review threads on stdlib/bytes.mli resolved; three marked outdated.)
@dbuenzli (Contributor) commented Jul 30, 2018

Write the doc for all functions. I'd like to get feedback on the existing text before copying them furiously to all other functions

Reviewed, see my comments. I think we should keep the doc strings simple using the patterns I mentioned.

@nojb referenced this pull request on Aug 2, 2018: [RFC] Add Bigstring module #1961 (closed)

@alainfrisch force-pushed the alainfrisch:binary_numbers branch from 658f94e to ba03e5a on Nov 8, 2018

@alainfrisch (Contributor, Author) commented Nov 8, 2018

Just one minor pending point about the use of Char.unsafe_chr. Waiting for @dbuenzli's feedback on that one.

@alainfrisch changed the title from "[RFC] Functions to read/write binary representations of numbers" to "Functions to read/write binary representations of numbers" on Nov 8, 2018

@dbuenzli (Contributor) left a comment

I'd say we are good to go. Thanks for putting in the work on this.

@alainfrisch (Contributor, Author) commented Nov 8, 2018

Thanks for your help with the design and documentation!

@gasche approved these changes Nov 8, 2018

@alainfrisch merged commit 1ec0ece into ocaml:trunk on Nov 9, 2018

2 checks passed: continuous-integration/appveyor/pr (AppVeyor build succeeded) and continuous-integration/travis-ci/pr (Travis CI build passed).