Add I/O primitives for Bigarrays #12365

nojb · 2023-07-08T10:20:41Z

Following discussion in #12360 this PR proposes adding I/O primitives for bigarrays to the standard library and Unix. I took the liberty of copying the In_channel and Out_channel functions from #12360. On top of that, Unix variants are also added here:

In_channel.input_bigarray, In_channel.really_input_bigarray
Out_channel.output_bigarray
Unix.read_bigarray, Unix.write_bigarray, Unix.single_write_bigarray

In each case the signatures of the functions are identical to the existing ones on bytes, except that they take a _ Bigarray.Genarray.t in place of bytes. Offset and length parameters are interpreted in terms of bytes.

An alternative would be to use _ Bigarray.Array1.t instead of Bigarray.Genarray.t and interpret offset and length in terms of elements. I was going to do this originally, but I sensed a small difficulty with the Unix.single_write operation, which could fail to write a "full" element of a bigarray (but perhaps this is not a problem). Opinions welcome.
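To make the "partial element" concern concrete, here is a minimal sketch. The helper name `split_partial_write` is an illustration only and is not part of the PR; it assumes nothing beyond `Bigarray.kind_size_in_bytes` from the standard library.

```ocaml
(* Hypothetical helper (not in the PR): given the byte count reported by a
   single write and the element kind of a bigarray, return how many whole
   elements were written and how many bytes of a trailing, incomplete
   element were written as well. *)
let split_partial_write kind nbytes =
  let elt_size = Bigarray.kind_size_in_bytes kind in
  (nbytes / elt_size, nbytes mod elt_size)

let () =
  (* A float64 element occupies 8 bytes; suppose the OS accepted 20 bytes. *)
  let whole, stray = split_partial_write Bigarray.float64 20 in
  Printf.printf "%d whole elements, %d stray bytes\n" whole stray
```

With byte-sized elements the second component is always 0, which is why the difficulty only shows up for wider element kinds.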

There is some code duplication in the Unix bindings, but it is difficult to share code: we take advantage of the fact that the data part of a bigarray does not move in memory, which lets us avoid copying data to an intermediate buffer and release the runtime lock more liberally.
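For contrast, this is roughly what filling a char bigarray from a channel looks like without a dedicated primitive, going through an intermediate `Bytes` buffer. It is a sketch of the copying approach that the C stubs avoid, not the PR's implementation; the name `really_input_via_bytes` and the 64 KiB chunk size are assumptions made for the example.

```ocaml
type char_bigarray =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical fallback: read exactly [len] bytes from [ic] into [ba]
   starting at index [pos], copying through a temporary Bytes chunk.
   Raises End_of_file if the channel runs out before [len] bytes are read. *)
let really_input_via_bytes ic (ba : char_bigarray) pos len =
  if pos < 0 || len < 0 || pos > Bigarray.Array1.dim ba - len then
    invalid_arg "really_input_via_bytes";
  let chunk = Bytes.create (min len 65536) in
  let rec loop pos remaining =
    if remaining > 0 then begin
      let n = min remaining (Bytes.length chunk) in
      really_input ic chunk 0 n;
      (* This element-by-element copy is the extra work a direct primitive
         does not need. *)
      for i = 0 to n - 1 do
        Bigarray.Array1.set ba (pos + i) (Bytes.get chunk i)
      done;
      loop (pos + n) (remaining - n)
    end
  in
  loop pos len
```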

xavierleroy · 2023-07-11T09:06:56Z

Thanks for getting the ball rolling on this idea. Some high-level comments following a quick look:

The C stub code looks fine.
The proposed API is appropriate for 1D bigarrays of characters (and 8-bit integers?) but super-confusing for other bigarrays, where the notions of byte offset and byte length don't make obvious sense.
For arbitrary bigarrays, I'm afraid the only operations that make intuitive sense are writing the whole bigarray or reading the whole bigarray (a la really_input, raising end-of-file if too short).

A compromise could be to have "read whole" and "write whole" operations for arbitrary bigarrays in the stdlib, and byte-oriented operations over 1D char bigarrays in the Unix module... but this is to be discussed more.

stedolan · 2023-07-12T09:48:40Z

I think it would be nice to have functions for bigarray I/O in the stdlib (since bigarrays are in the stdlib now), but there is a reason that my original patch with @shindere was limited to char bigarrays: if you do I/O on non-char bigarrays, you have to think about endianness, partial writes, offsets and lengths that are non-integer multiples of the element size, and so on.

This is because channels read and write byte sequences, so some conversion needs to be done if you are reading and writing sequences of things other than bytes. I'm not sure that the stdlib I/O functions are the right place for such a conversion API: perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?
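As a small illustration of the endianness point, the sketch below serializes the same 32-bit value with both byte orders using the standard `Bytes.set_int32_be` and `Bytes.set_int32_le`. The helper names `encode_be`, `encode_le` and `hex` are made up for the example.

```ocaml
(* Serialize one int32 as 4 bytes, most significant byte first. *)
let encode_be (x : int32) : string =
  let b = Bytes.create 4 in
  Bytes.set_int32_be b 0 x;
  Bytes.to_string b

(* Serialize one int32 as 4 bytes, least significant byte first. *)
let encode_le (x : int32) : string =
  let b = Bytes.create 4 in
  Bytes.set_int32_le b 0 x;
  Bytes.to_string b

(* Render a string as space-separated hexadecimal bytes. *)
let hex s =
  String.concat " "
    (List.map
       (fun c -> Printf.sprintf "%02x" (Char.code c))
       (List.of_seq (String.to_seq s)))

let () =
  let x = 0x01020304l in
  Printf.printf "big-endian:    %s\n" (hex (encode_be x));
  Printf.printf "little-endian: %s\n" (hex (encode_le x))
```

A channel only sees the resulting byte sequence, so a reader on another platform has to know which of the two orders was used.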

nojb · 2023-07-12T09:53:40Z

perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

This sounds reasonable to me.

nojb · 2023-07-13T07:41:41Z

perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

This sounds reasonable to me.

I am planning to rework the PR to restrict I/O primitives to 1-D char bigarrays, with the following signatures:

Unix.read_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
Unix.write_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
Unix.single_write_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
In_channel.input_bigarray : t -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
In_channel.really_input_bigarray : t -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> unit
Out_channel.output_bigarray : t -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> unit

Please speak up if you have any objections!

yallop · 2023-07-13T08:15:22Z

I think it'd be clearer to also restrict the layout to c_layout, to avoid any possible confusion around 0-based and 1-based indexing. An alternative would be to carefully document the behaviour on Fortran-layout bigarrays, e.g. whether passing ~pos:1 to write_bigarray will start writing at the first or the second element.
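The indexing difference behind this suggestion can be seen with two small bigarrays built from the same data. This sketch uses only `Bigarray.Array1.of_array` and `Bigarray.Array1.get`; the binding names are illustrative.

```ocaml
(* Same contents, two layouts. *)
let c_arr =
  Bigarray.Array1.of_array Bigarray.char Bigarray.c_layout [| 'a'; 'b'; 'c' |]

let f_arr =
  Bigarray.Array1.of_array Bigarray.char Bigarray.fortran_layout
    [| 'a'; 'b'; 'c' |]

(* With the C layout the first element lives at index 0 ... *)
let first_c = Bigarray.Array1.get c_arr 0

(* ... with the Fortran layout it lives at index 1. *)
let first_f = Bigarray.Array1.get f_arr 1

(* Index 1 therefore means "second element" for one array and
   "first element" for the other. *)
let at_one_c = Bigarray.Array1.get c_arr 1
let at_one_f = Bigarray.Array1.get f_arr 1
```

Restricting the functions to c_layout means a position argument can only be read one way.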

nojb · 2023-07-13T09:13:28Z

I think it'd be clearer to also restrict the layout to c_layout, to avoid any possible confusion around 0-based and 1-based indexing.

Good point, will do.

xavierleroy · 2023-07-13T16:36:06Z

I am planning to rework the PR to restrict I/O primitives to 1-D char bigarrays, with the following signatures: Unix.read_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int

I agree that the layout should be forced to be c_layout, to avoid misunderstandings w.r.t. Fortran layout. On the other hand, the first parameter char could be generalized to _, so as to support both char and int8_unsigned bigarray kinds. Isn't that cute?
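A sketch of why leaving the first type parameter open works: both `Bigarray.char` and `Bigarray.int8_unsigned` share the element representation type, which the Bigarray module names `int8_unsigned_elt` (the signatures quoted in this thread write it as `int8_unsigned_int`). The function `byte_count` below is a hypothetical stand-in for an I/O primitive and performs no I/O.

```ocaml
(* Any OCaml-side element type, but 8-bit unsigned storage and C layout. *)
type 'a byte_array =
  ('a, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical stand-in for a primitive taking a buffer, a position and a
   length: it only validates the range and reports the length. *)
let byte_count (buf : _ byte_array) pos len =
  if pos < 0 || len < 0 || pos > Bigarray.Array1.dim buf - len then
    invalid_arg "byte_count";
  len

(* Elements seen as [char] on the OCaml side. *)
let chars = Bigarray.Array1.create Bigarray.char Bigarray.c_layout 16

(* Elements seen as [int] in the range 0..255 on the OCaml side. *)
let ints = Bigarray.Array1.create Bigarray.int8_unsigned Bigarray.c_layout 16

(* Both are accepted by the same function; a float64 bigarray would be
   rejected by the type checker. *)
let n_chars = byte_count chars 0 16
let n_ints = byte_count ints 4 8
```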

xavierleroy · 2023-07-13T16:46:35Z

This is because channels read and write byte sequences, so some conversion needs to be done if you are reading and writing sequences of things other than bytes. I'm not sure that the stdlib I/O functions are the right place for such a conversion API: perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

On the one hand, I agree that endianness differences are best handled by using marshaling (input_value/output_value).

On the other hand, the memory-mapping API (Unix.map_file) supports all kinds of bigarrays without any conversions, just using the platform's native endianness. For I/O on pipes or sockets, memory mapping is not an option, hence functions for reading and writing bigarrays without conversions could make sense.

On the third hand, just like we have Bigarray.reshape* functions to change the dimensions of bigarrays without copying, we could have Bigarray.repr* functions that provide a view of a bigarray as an int8_unsigned 1D bigarray, over which I/O can be performed. Looks like the most flexible approach.
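No such `Bigarray.repr*` function is given in this thread, so the following is only a copying approximation of the idea: it converts an int32 bigarray into an int8_unsigned 1D bigarray with an explicit little-endian byte order. A real "repr" view, as proposed above, would share memory instead of copying; the name `bytes_of_int32_le` is hypothetical.

```ocaml
module A1 = Bigarray.Array1

(* Hypothetical, copying stand-in for a "repr" view: each int32 element
   becomes 4 consecutive bytes, least significant byte first. *)
let bytes_of_int32_le
    (src : (int32, Bigarray.int32_elt, Bigarray.c_layout) A1.t)
  : (int, Bigarray.int8_unsigned_elt, Bigarray.c_layout) A1.t =
  let n = A1.dim src in
  let dst = A1.create Bigarray.int8_unsigned Bigarray.c_layout (4 * n) in
  for i = 0 to n - 1 do
    let x = A1.get src i in
    for k = 0 to 3 do
      (* Extract byte number k of x, counting from the low end. *)
      let byte =
        Int32.to_int (Int32.logand (Int32.shift_right_logical x (8 * k)) 0xFFl)
      in
      A1.set dst ((4 * i) + k) byte
    done
  done;
  dst
```

The result has the element kind that byte-oriented I/O functions would accept, at the cost of one copy.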

nojb · 2023-07-13T19:56:13Z

On the other hand, the first parameter char could be generalized to _, so as to support both char and int8_unsigned bigarray kinds. Isn't that cute?

It is. Amended!

stedolan · 2023-07-18T13:45:05Z

The latest GC safety patch looks good to me, although appveyor points out that one remaining fd->kind should become a Descr_kind_val(fd) at read_win32.c:75.

stedolan · 2023-07-18T13:47:20Z

we could have Bigarray.repr* functions that provide a view of a bigarray as an int8_unsigned 1D bigarray

I agree with the third hand here - the Bigarray.repr* functions sound useful, and I actually thought we already had them! (For a separate PR, though)

xavierleroy · 2023-07-18T15:53:06Z

the Bigarray.repr* functions sound useful, and I actually thought we already had them!

We have reshape functions that change the number and values of dimensions, but not the element kind. repr functions would serve a different purpose, but I agree they would be useful too.

nojb · 2023-07-18T20:41:59Z

The latest GC safety patch looks good to me, although appveyor points out that one remaining fd->kind should become a Descr_kind_val(fd) at read_win32.c:75.

Thanks, fixed!

stedolan · 2023-07-19T09:25:14Z

otherlibs/unix/write_win32.c

+    ofs += numwritten;
+    len -= numwritten;
+  }
+  caml_leave_blocking_section();


I think this line shouldn't be here, and is causing the remaining appveyor failure.

nojb · 2023-07-19

Indeed, fixed. Thanks!

nojb · 2023-07-27T12:41:42Z

This PR needs two official approvals if it is to move forward. Any takers?

Thanks!

shindere · 2023-07-27T12:50:04Z

Nicolás Ojeda Bär (2023/07/27 05:41 -0700):

This PR needs two official approvals if it is to move forward. Any takers?

Sure. Would you please mind squashing all the commits?

nojb · 2023-07-27T12:53:28Z

Sure. Would you please mind squashing all the commits?

Sure, done.

shindere · 2023-07-27T13:00:54Z

Thanks! What's the status of the change requested by @yallop? GH shows it to me as not taken into account, is that correct?

nojb · 2023-07-27T13:03:47Z

What's the status of the change requested by @yallop? GH shows it to me as not taken into account, is that correct?

I believe all mentioned issues have been addressed.

xavierleroy · 2023-08-09 (commit messages)

The `read()` and `write()` system calls take a length with type `size_t` ≈ `uintnat` and return a result of type `ssize_t` ≈ `intnat`. So, on a 64-bit platform, the number of bytes read or written may not fit in type `int` and must be given type `intnat`.

`ReadFile` and `WriteFile` take a length of type `DWORD` (unsigned 32 bits), so the number of bytes to read or write must be capped at 0xFFFFFFFF. `recv` and `send` take a length of type `int` (signed 32 bits), so the number of bytes to read or write must be capped at INT_MAX.
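The capping described here happens in the C stubs; the sketch below only models the arithmetic in OCaml, splitting one large transfer into pieces that each fit under a platform limit. The function `capped_chunks` and its `~cap` parameter are assumptions for illustration, not code from the PR.

```ocaml
(* Split a transfer of [len] bytes starting at [ofs] into (offset, length)
   pieces, none longer than [cap]. [cap] stands for a platform limit such
   as 0xFFFFFFFF for a DWORD length or INT_MAX for an int length. *)
let capped_chunks ~cap ofs len =
  if cap <= 0 || ofs < 0 || len < 0 then invalid_arg "capped_chunks";
  let rec go ofs len acc =
    if len = 0 then List.rev acc
    else
      let n = min len cap in
      go (ofs + n) (len - n) ((ofs, n) :: acc)
  in
  go ofs len []

let () =
  (* A tiny cap keeps the example readable. *)
  List.iter
    (fun (ofs, n) -> Printf.printf "ofs=%d len=%d\n" ofs n)
    (capped_chunks ~cap:4 0 10)
```

A looping write would issue one system call per piece, whereas a single read or single write would presumably issue only the first piece and report a short count.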

xavierleroy · 2023-08-09 (review)

I reviewed the implementation. I think it's OK except for the cases where the number of bytes to read or write is not representable as an int or a DWORD. I took the liberty to push fixes directly on this PR: one commit is for Unix, the other for Win32. Let me know what you think.

I'm also tempted to factor out the C code between the "write" and "single_write" cases, but haven't done anything in this direction yet.

c-cube · 2023-08-09T13:41:03Z

This is going to be another nail in the coffin for modular IO, isn't it? More C functions, now operating on something that's not byte buffers…

xavierleroy · 2023-08-09T14:02:04Z

I'm also tempted to factor out the C code between the "write" and "single_write" cases, but haven't done anything in this direction yet.

Here is a first try, on a personal branch: 8b954fb

nojb · 2023-08-10T14:09:45Z

I reviewed the implementation. I think it's OK except for the cases where the number of bytes to read or write is not representable as an int or a DWORD. I took the liberty to push fixes directly on this PR: one commit is for Unix, the other for Win32. Let me know what you think.

Thanks for the fix and the careful comments. Both commits look good to me.

nojb · 2023-08-17T07:11:29Z

I'm also tempted to factor out the C code between the "write" and "single_write" cases, but haven't done anything in this direction yet.

Here is a first try, on a personal branch: 8b954fb

Thanks, looks good to me so I cherry-picked to this branch. Should we do the same for Unix.write and Unix.single_write?

xavierleroy · 2023-08-18T17:40:14Z

Should we do the same for Unix.write and Unix.single_write?

I thought about it, but was afraid to break 3rd-party reimplementations of the Unix module (JS_of_ocaml, maybe?) that might assume there are two different primitives. So, let's leave it at that.

xavierleroy · 2023-08-18 (review)

Looks good to me, approving! Formally, a second approval is needed, as this is a stdlib extension.

shindere · 2023-08-21T12:44:23Z

Xavier Leroy (2023/08/18 10:43 -0700):

@xavierleroy approved this pull request. Looks good to me, approving! Formally, a second approval is needed, as this is a stdlib extension.

Isn't your approval the second one? I did approve this PR a few weeks ago but perhaps you'd prefer the second approval to be from a core dev more seasoned with stdlib changes?

xavierleroy · 2023-08-22T15:59:34Z

I did approve this PR a few weeks ago

Ah, sorry, I forgot about it (was one month ago) and didn't look at the full history.

but perhaps you'd prefer the second approval to be from a core dev more seasoned with stdlib changes?

I'm neutral. More eyeballs is always good, but as stdlib extensions go, this PR isn't controversial, I believe.

At any rate, I'll look into this again when I'm back next week.

shindere · 2023-08-22T16:04:13Z

Xavier Leroy (2023/08/22 08:59 -0700):

Ah, sorry, I forgot about it (was one month ago) and didn't look at the full history.

No problem. :)

but perhaps you'd prefer the second approval to be from a core dev more seasoned with stdlib changes?

I'm neutral. More eyeballs is always good,

Yeah, especially given the poor quality of mine. Sorry, couldn't resist.

but as stdlib extensions go, this PR isn't controversial, I believe.

I don't believe so either. :)

At any rate, I'll look into this again when I'm back next week.

Thanks. Its merge will then unblock #12360 which will be made simpler once rebased, I expect.

nojb · 2023-08-22T16:25:56Z

but perhaps you'd prefer the second approval to be from a core dev more seasoned with stdlib changes?

I'm neutral. More eyeballs is always good, but as stdlib extensions go, this PR isn't controversial, I believe.

@shindere's review took place before the changes explained in #12365 (review) so it would be best to have another review of the current state of the patch.

At any rate, I'll look into this again when I'm back next week.

Thanks!

dra27 · 2023-08-23 (review)

LVGTM2! There is one actual typo to fix in the doc comment for Out_channel.output_bigarray.

Only because of that, the description of Unix.read_bigarray uses the verb "read", yet the descriptions of In_channel.input_bigarray and In_channel.really_input_bigarray use the verb "write", which reads oddly in this unusual position of reading them all at once. I'd alter the In_channel functions to use "read the data into a bigarray" instead.

Finally, in the doc strings, there's an inconsistency between "take data" and "take the data" in otherwise identical descriptions. FWIW I'd go for adding "the" (i.e. "read the data into a bigarray" and "take the data from a bigarray").

shindere · 2023-08-24T08:19:59Z

Cool! Thanks a lot for the review! Perhaps worth taking the opportunity of doing the fixes to squash all the commits or at least make sure their number is minimal and the history coherent?

nojb · 2023-08-24T08:20:44Z

Perhaps worth taking the opportunity of doing the fixes to squash all the commits or at least make sure their number is minimal and the history coherent?

No need, I'll squash the PR when merging.

nojb · 2023-08-24T08:42:49Z

LVGTM2!

Thanks @dra27 for your review! I accepted all your suggestions.

I suggest we wait for @xavierleroy's second look before merging.

xavierleroy · 2023-08-28T08:33:39Z

I had a (quick) second look and I think it's high time to merge this PR! Thanks to all who participated.

nojb · 2023-08-28T08:35:51Z

Thanks!

nojb mentioned this pull request Jul 8, 2023

Get rid of the LongString module #12360

Merged

yallop requested changes Jul 13, 2023

gasche reviewed Jul 13, 2023

nojb force-pushed the bigarray_io branch from 9bebdd6 to 25a318a Compare July 13, 2023 20:04

gasche reviewed Jul 14, 2023

hhugo mentioned this pull request Jul 15, 2023

Support for OCaml trunk ocsigen/js_of_ocaml#1487

Draft

stedolan reviewed Jul 17, 2023

stedolan reviewed Jul 19, 2023

dra27 reviewed Jul 19, 2023

Add I/O primitives for Bigarrays

71a2099

nojb force-pushed the bigarray_io branch from 65596de to 71a2099 Compare July 27, 2023 12:53

shindere approved these changes Jul 27, 2023

xavierleroy added 2 commits August 9, 2023 14:52

xavierleroy reviewed Aug 9, 2023

Share the C stub code between write_bigarray and `write_single_bigarray`

0626507

nojb force-pushed the bigarray_io branch from e03f6b1 to 0626507 Compare August 17, 2023 07:08

xavierleroy approved these changes Aug 18, 2023

dra27 approved these changes Aug 23, 2023

nojb and others added 3 commits August 24, 2023 10:41

Update stdlib/out_channel.mli

7000ebb

Co-authored-by: David Allsopp <david.allsopp@metastack.com>

Update stdlib/in_channel.mli

77a2802

Co-authored-by: David Allsopp <david.allsopp@metastack.com>

Update stdlib/in_channel.mli

b7babe9

Co-authored-by: David Allsopp <david.allsopp@metastack.com>

dra27 and others added 2 commits August 25, 2023 11:06

More the

60bf35c

Update reviewers in Changes

0a243b8

xavierleroy merged commit f772ae0 into ocaml:trunk Aug 28, 2023
9 checks passed

nojb deleted the bigarray_io branch August 28, 2023 08:35

gasche added a commit that referenced this pull request Aug 29, 2023

hot fix for a build-breaking conflict between #12365 and #12446

76c4617

xavierleroy mentioned this pull request Oct 10, 2023

Unix.read isn't POSIX-confirmant even if the OS is #8352

Closed
