Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add I/O primitives for Bigarrays #12365

Merged
merged 9 commits into from
Aug 28, 2023
Merged

Add I/O primitives for Bigarrays #12365

merged 9 commits into from
Aug 28, 2023

Conversation

nojb
Copy link
Contributor

@nojb nojb commented Jul 8, 2023

Following discussion in #12360 this PR proposes adding I/O primitives for bigarrays to the standard library and Unix. I took the liberty of copying the In_channel and Out_channel functions from #12360. On top of that, Unix variants are also added here:

  • In_channel.input_bigarray, In_channel.really_input_bigarray
  • Out_channel.output_bigarray
  • Unix.read_bigarray, Unix.write_bigarray, Unix.single_write_bigarray

In each case the signature of the functions are identical to the existing ones on bytes, except that they use _ Bigarray.Genarray.t in its place. Offset and length parameters are interpreted in terms of bytes.

An alternative would be to use _ Bigarray.Array1.t instead of Bigarray.Genarray.t and interpret offset and length in terms of elements. I was going to do this originally, but I sensed a small difficulty with the Unix.single_write operation, which could fail to write a "full" element of a bigarray (but perhaps this is not a problem). Opinions welcome.

There is some code duplication in the Unix bindings but it is a bit difficult to share code as we take advantage of the fact that the data part of a bigarray does not move in memory to avoid copying data to an intermediate buffer and to release the runtime lock more liberally.

@xavierleroy
Copy link
Contributor

Thanks for getting the ball rolling on this idea. Some high-level comments following a quick look:

  • The C stub code looks fine.
  • The proposed API is appropriate for 1D bigarrays of characters (and 8-bit integers?) but super-confusing for other bigarrays, where the notions of byte offset and byte length don't make obvious sense.
  • For arbitrary bigarrays, I'm afraid the only operations that make intuitive sense is writing the whole bigarray or reading the whole bigarray (a la really_input, raising end-of-file if too short).

A compromise could be to have "read whole" and "write whole" operations for arbitrary bigarrays in the stdlib, and byte-oriented operations over 1D char bigarrays in the Unix module... but this is to be discussed more.

@stedolan
Copy link
Contributor

I think it would be nice to have functions for bigarray I/O in the stdlib (since bigarrays are in the stdlib now), but there is a reason that my original patch with @shindere was limited to char bigarrays: if you do I/O on non-char bigarrays, you have to think about endianness, partial writes, offsets and lengths that are non-integer multiples of the element size, and so on.

This is because channels read and write byte sequences, so some conversion needs to be done if you are reading and writing sequences of things other than bytes. I'm not sure that the stdlib I/O functions are the right place for such a conversion API: perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

@nojb
Copy link
Contributor Author

nojb commented Jul 12, 2023

perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

This sounds reasonable to me.

@nojb
Copy link
Contributor Author

nojb commented Jul 13, 2023

perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

This sounds reasonable to me.

I am planning to rework the PR to restrict I/O primitives to 1-D char bigarrays, with the following signatures:

Unix.read_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
Unix.write_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
Unix.single_write_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
In_channel.input_bigarray : t -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int
In_channel.really_input_bigarray : t -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> unit
Out_channel.output_bigarray : t -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> unit

Please speak up if you have any objections!

@yallop
Copy link
Member

yallop commented Jul 13, 2023

I think it'd be clearer to also restrict the layout to c_layout, to avoid any possible confusion around 0-based and 1-based indexing. An alternative would be to carefully document the behaviour on Fortran-layout bigarrays, e.g. whether passing ~pos:1 to write_bigarray will start writing at the first or the second element.

@nojb
Copy link
Contributor Author

nojb commented Jul 13, 2023

I think it'd be clearer to also restrict the layout to c_layout, to avoid any possible confusion around 0-based and 1-based indexing.

Good point, will do.

@xavierleroy
Copy link
Contributor

I am planning to rework the PR to restrict I/O primitives to 1-D char bigarrays, with the following signatures: Unix.read_bigarray : Unix.file_descr -> (char, int8_unsigned_int, _) Bigarray.Array1.t -> int -> int -> int

I agree that the layout should better be forced to be c_layout, to avoid misunderstandings w.r.t. Fortran layout. On the other hand, the first parameter char could be generalized to _, so as to support both char and int8_unsigned bigarray kinds. Isn't that cute?

@xavierleroy
Copy link
Contributor

xavierleroy commented Jul 13, 2023

This is because channels read and write byte sequences, so some conversion needs to be done if you are reading and writing sequences of things other than bytes. I'm not sure that the stdlib I/O functions are the right place for such a conversion API: perhaps the user should do the conversion to bytes using functions from Bigarray, and the I/O should operate only on bigarrays of bytes?

On the one hand, I agree that endianness differences are best handled by using marshaling (input_value/output_value).

On the other hand, the memory-mapping API (Unix.map_file) supports all kinds of bigarrays without any conversions, just using the platform's native endianness. For I/O on pipes or sockets, memory mapping is not an option, hence functions for reading and writing bigarrays without conversions could make sense.

On the third hand, just like we have Bigarray.reshape* functions to change the dimensions of bigarrays without copying, we could have Bigarray.repr* functions that provide a view of a bigarray as an int8_unsigned 1D bigarray, over which I/O can be performed. Looks like the most flexible approach.

otherlibs/unix/write_unix.c Show resolved Hide resolved
@nojb
Copy link
Contributor Author

nojb commented Jul 13, 2023

On the other hand, the first parameter char could be generalized to _, so as to support both char and int8_unsigned bigarray kinds. Isn't that cute?

It is. Amended!

otherlibs/unix/read_unix.c Outdated Show resolved Hide resolved
runtime/io.c Show resolved Hide resolved
@stedolan
Copy link
Contributor

The latest GC safety patch looks good to me, although appveyor points out that one remaining fd->kind should become a Descr_kind_val(fd) at read_win32.c:75.

@stedolan
Copy link
Contributor

we could have Bigarray.repr* functions that provide a view of a bigarray as an int8_unsigned 1D bigarray

I agree with the third hand here - the Bigarray.repr* functions sound useful, and I actually thought we already had them! (For a separate PR, though)

@xavierleroy
Copy link
Contributor

the Bigarray.repr* functions sound useful, and I actually thought we already had them!

We have reshape functions that change the number and values of dimensions, but not the element kind. repr functions would serve a different purpose, but I agree they would be useful too.

@nojb
Copy link
Contributor Author

nojb commented Jul 18, 2023

The latest GC safety patch looks good to me, although appveyor points out that one remaining fd->kind should become a Descr_kind_val(fd) at read_win32.c:75.

Thanks, fixed!

ofs += numwritten;
len -= numwritten;
}
caml_leave_blocking_section();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line shouldn't be here, and is causing the remaining appveyor failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, fixed. Thanks!

@nojb
Copy link
Contributor Author

nojb commented Jul 27, 2023

This PR needs two official approvals if it is to move forward. Any takers?

Thanks!

@shindere
Copy link
Contributor

shindere commented Jul 27, 2023 via email

@nojb
Copy link
Contributor Author

nojb commented Jul 27, 2023

Sure. Would you please mind squashing all the commits?

Sure, done.

@shindere
Copy link
Contributor

shindere commented Jul 27, 2023 via email

@nojb
Copy link
Contributor Author

nojb commented Jul 27, 2023

What's the status of the change requested by @yallop? GH shows it to me as not taken into account, is that correct?

I believe all mentioned issues have been addressed.

The `read()` and `write()` system calls take a length with type
`size_t` ≈ `uintnat` and return a result of type `ssize_t` ≈ `intnat`.
So, on a 64-bit platform, the number of bytes read or written may not
fit in type `int` and must be given type `intnat`.
`ReadFile` and `WriteFile` take a length of type `DWORD` (unsigned 32 bits),
so the number of bytes to read or write must be capped at 0xFFFFFFFF.

`recv` and `send` take a length of type `int` (signed 32 bits),
so the number of bytes to read or write must be capped at INT_MAX.
Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the implementation. I think it's OK except for the cases where the number of bytes to read or write is not representable as an int or a DWORD. I took the liberty to push fixes directly on this PR: one commit is for Unix, the other for Win32. Let me know what you think.

I'm also tempted to factor out the C code between the "write" and "single_write" cases, but haven't done anything in this direction yet.

@c-cube
Copy link
Contributor

c-cube commented Aug 9, 2023

This is going to be another nail in the coffin for modular IO, isn't it? More C functions, now operating on something that's not byte buffers…

@xavierleroy
Copy link
Contributor

I'm also tempted to factor out the C code between the "write" and "single_write" cases, but haven't done anything in this direction yet.

Here is a first try, on a personal branch: 8b954fb

@nojb
Copy link
Contributor Author

nojb commented Aug 10, 2023

I reviewed the implementation. I think it's OK except for the cases where the number of bytes to read or write is not representable as an int or a DWORD. I took the liberty to push fixes directly on this PR: one commit is for Unix, the other for Win32. Let me know what you think.

Thanks for the fix and the careful comments. Both commits look good to me.

@nojb
Copy link
Contributor Author

nojb commented Aug 17, 2023

I'm also tempted to factor out the C code between the "write" and "single_write" cases, but haven't done anything in this direction yet.

Here is a first try, on a personal branch: 8b954fb

Thanks, looks good to me so I cherry-picked to this branch. Should we do the same for Unix.write and Unix.single_write?

@xavierleroy
Copy link
Contributor

Should we do the same for Unix.write and Unix.single_write?

I thought about it, but was afraid to break 3rd-party reimplementations of the Unix module (JS_of_ocaml, maybe?) that might assume there are two different primitives. So, let's leave it as that.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, approving! Formally, a second approval is needed, as this is a stdlib extension.

@shindere
Copy link
Contributor

shindere commented Aug 21, 2023 via email

@xavierleroy
Copy link
Contributor

I did approve thisPR a few weeks ago

Ah, sorry, I forgot about it (was one month ago) and didn't look at the full history.

but perhaps you'd prefer the second approval to be from a core dev more seasoned with stdlib changes?

I'm neutral. More eyeballs is always good, but as stdlib extensions go. this PR isn't controversial, I believe.

At any rate, I'll look into this again when I'm back next week.

@shindere
Copy link
Contributor

shindere commented Aug 22, 2023 via email

@nojb
Copy link
Contributor Author

nojb commented Aug 22, 2023

but perhaps you'd prefer the second approval to be from a core dev more seasoned with stdlib changes?

I'm neutral. More eyeballs is always good, but as stdlib extensions go. this PR isn't controversial, I believe.

@shindere's review took place before the changes explained in #12365 (review) so it would be best to have another review of the current state of the patch.

At any rate, I'll look into this again when I'm back next week.

Thanks!

Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LVGTM2! There is one actual typo to fix in the doc comment for Out_channel.output_bigarray.

Only because of that, the description of Unix.read_bigarray uses the verb "read" in its description yet the descriptions of In_channel.input_bigarray and In_channel.really_input_bigarray use the verb "write" which read oddly in this unusual position of reading them all at once. I'd alter the In_channel functions to use "read the data into a bigarray" instead.

Finally, in the doc strings, there's inconsistency between "take data" and "take the data" between otherwise identical descriptions - FWIW I'd go for adding the (i.e. "read the data into a bigarray" and "take the data from a bigarray").

stdlib/out_channel.mli Outdated Show resolved Hide resolved
stdlib/in_channel.mli Outdated Show resolved Hide resolved
stdlib/in_channel.mli Outdated Show resolved Hide resolved
@shindere
Copy link
Contributor

shindere commented Aug 24, 2023 via email

@nojb
Copy link
Contributor Author

nojb commented Aug 24, 2023

Perhaps worth taking the opportunity of dong the fixes to squash all the commits or at least make sure their number is minimal and the history coherent?

No need, I'll squash the PR when merging.

nojb and others added 3 commits August 24, 2023 10:41
Co-authored-by: David Allsopp <david.allsopp@metastack.com>
Co-authored-by: David Allsopp <david.allsopp@metastack.com>
Co-authored-by: David Allsopp <david.allsopp@metastack.com>
@nojb
Copy link
Contributor Author

nojb commented Aug 24, 2023

LVGTM2!

Thanks @dra27 for your review! I accepted all your suggestions.

I suggest we wait for @xavierleroy's second look before merging.

@xavierleroy
Copy link
Contributor

I had a (quick) second look and I think it's high time to merge this PR! Thanks to all who participated.

@xavierleroy xavierleroy merged commit f772ae0 into ocaml:trunk Aug 28, 2023
9 checks passed
@nojb
Copy link
Contributor Author

nojb commented Aug 28, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants