modular IO #19

c-cube · 2020-06-15T20:14:11Z

This RFC proposes to update the stdlib's types in_channel and out_channel to make them user-definable and composable.

yakobowski · 2020-06-15T21:54:40Z

This is a very interesting proposal, that I strongly support. This especially resonated with me:

In my own experience, Format is a better choice because a single Format.formatter -> t -> unit function

In practice, I always end up writing everything in Format (even when not using pretty-printing) just to benefit from composable formatters.
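For readers unfamiliar with the idiom, here is a minimal sketch of what "composable formatters" means with the existing `Format` module. The `point` type and the `pp_*` names are made up for illustration; only `Format.fprintf` and the `%a` conversion come from the standard library.

```ocaml
(* Illustrative types and printers; none of these names come from the RFC. *)
type point = { x : int; y : int }

(* A printer has the shape [Format.formatter -> t -> unit]. *)
let pp_point fmt p = Format.fprintf fmt "(%d, %d)" p.x p.y

(* Printers compose through %a: pp_segment reuses pp_point unchanged. *)
let pp_segment fmt (a, b) =
  Format.fprintf fmt "%a -> %a" pp_point a pp_point b
```

The same `pp_segment` can then target stdout, a buffer, or a string without being rewritten, which is the property the RFC would like channels to share.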

gasche · 2020-06-16T05:36:12Z

There are a lot of extra features in built-in formats that correspond to usage modes they appear to have been designed for:

an offset (to support seek operations?)
a mutex for concurrent usage
a refcount (it looks like it was used by Cash to create several channels on the same underlying file descriptor?) and "revealed" count (also for Cash)
a global list of all channels

It would be nice to get feedback on which of those features we can get rid of, and which are important to preserve, because they may also constrain the design of the user-exposed interface. For example:

your interface does not appear to support seek_{in,out}, would those operations just fail on user-defined channels?
do we want the guarantee that a channel can be safely used in a concurrent scenario, with accesses to the same channel linearized sequentially? (Today the OCaml runtime lock would guarantee that calls to the read function are linearized, but tomorrow in a Multicore world...)

nojb · 2020-06-16T06:55:19Z

There are a lot of extra features in built-in formats that correspond to usage modes they appear to have been designed for:
...

a refcount (it looks like it was used by Cash to create several channels on the same underlying file descriptor?) and "revealed" count (also for Cash)

My understanding is that these two fields could be removed (Cash is unmaintained and does not compile with the current OCaml version).

a global list of all channels

This is used to implement flush_all.

It would be nice to get feedback on which of those features we can get rid of, and which are important to preserve, because they may also constrain the design of the user-exposed interface. For example:

c-cube · 2020-06-16T13:45:52Z

1. your interface does not appear to support `seek_{in,out}`, would those operations just fail on user-defined channels?

yes, there is a mention of that there. Seek only makes sense on some channels (even now), when the user knows it's mapped to a file underneath. Seek would just be a partial function that raises on user defined channels.

Of course an alternative is to have seek: (int -> unit) option in the user-defined record, with the default, None, meaning that seek must fail.
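A minimal sketch of that alternative, assuming a record of closures. The type `user_in`, its field names, and the helpers `seek_in` and `of_string` are hypothetical and are not taken from the RFC text.

```ocaml
(* Hypothetical user-defined input channel; field names are illustrative. *)
type user_in = {
  read : bytes -> int -> int -> int;  (* read buf off len: bytes read, 0 at EOF *)
  seek : (int -> unit) option;        (* None: seeking is unsupported *)
  close : unit -> unit;
}

(* Partial operation: raises on channels that do not provide seek. *)
let seek_in (ic : user_in) (pos : int) : unit =
  match ic.seek with
  | Some f -> f pos
  | None -> raise (Sys_error "seek_in: channel does not support seeking")

(* An in-memory channel over a string, which can support seek. *)
let of_string (s : string) : user_in =
  let pos = ref 0 in
  let read buf off len =
    let n = min len (String.length s - !pos) in
    Bytes.blit_string s !pos buf off n;
    pos := !pos + n;
    n
  in
  let seek p =
    if p < 0 || p > String.length s then invalid_arg "seek";
    pos := p
  in
  { read; seek = Some seek; close = ignore }
```

A channel backed by a decompressor or a socket would simply set `seek = None`, and callers would get an exception rather than a silent no-op.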

2. do we want the guarantee that a channel can be safely used in a concurrent scenario, with accesses to the same channel linearized sequentially? (Today the OCaml runtime lock would guarantee that calls to the `read` function are linearized, but tomorrow in a Multicore world...)

Seems very bad practice to me. If you want a lock, you should use one (see rust again about a locked wrapper around stdin/stdout).

I dislike the list of all channels (again, bad practice, flush_all is a code smell in my book; use with_in / with_out guards to ensure resources are properly disposed of) but it should probably be kept for all raw channels, for compatibility. I see no reason to extend it to user defined channels (why would you want flush_all to also work on channels that might print into buffers anyway)?
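For reference, such guards can be sketched with today's stdlib as below. The names `with_out` and `with_in` are the ones used in the comment above, but these definitions are only an assumption about their shape, built on `Fun.protect` (available since OCaml 4.08).

```ocaml
(* Run [f] on a freshly opened output channel and close it afterwards,
   whether [f] returns normally or raises. Flushing on the success path
   lets write errors surface instead of being swallowed by close_out_noerr. *)
let with_out (path : string) (f : out_channel -> 'a) : 'a =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out_noerr oc)
    (fun () -> let r = f oc in flush oc; r)

(* Same guard for input channels. *)
let with_in (path : string) (f : in_channel -> 'a) : 'a =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in_noerr ic) (fun () -> f ic)
```

With this pattern each channel is flushed and closed at a known point, so nothing relies on `flush_all` at exit.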

There is also a global map, in unix, from raw file descriptors to channels, to implement Unix.{in,out}_channel_of_descr. I think we also need to live with that one.

gasche · 2020-07-23T14:26:24Z

We discussed this RFC at a developer meeting, not in depth but rather to get quick feedback from people who hadn't already looked at it. One remark was that it was not completely clear that the proposed API would, indeed, enable the cool applications mentioned (in particular efficient middle-layer adapters for compression or encryption), and that this could be conclusively answered without doing a PR that modifies the stdlib, just with a third-party prototype of the interface. Would you (@c-cube) be interested in doing this?

Some related works were briefly mentioned:

I mentioned BatIO, that you also mentioned in the proposal. I seem to remember that it is indeed flexible enough to offer, say, compression on the fly, so personally I'm not too worried about your proposal.
@jeremiedimino mentioned that Lwt_io has a similar "more flexible" abstraction.

Having a prototype would also have the nice property that it's easy to see the "final" state of the proposal and how issues are addressed. (For example I would see seek : (int -> unit) option, and not need to browse the discussion about that. I'm fine with this proposal, by the way; a type-based distinction would be nicer in theory and too painful to deploy in practice if we want backward-compatibility.)

c-cube · 2020-07-26T21:56:30Z

A proof of concept is being developed there, fyi.

c-cube · 2020-12-09T02:46:45Z

will the PoC be discussed at some developer meeting?

gasche · 2020-12-09T07:34:04Z

I looked at the proof-of-concept again and raised a couple small issues; I'm planning next to encourage others to give feedback.

c-cube · 2021-01-12T19:09:01Z

I updated the poc, btw.

Octachron · 2021-02-05T17:17:19Z

Looking at the RFCs and the proof-of-concept, I think this is a worthwhile extension: this is both fully backward compatible and improves the interoperability of existing libraries.

Octachron · 2021-02-05T17:17:42Z

rfcs/modular_io.md

+  type t
+
+  (** Obtain a slice of the current buffer. Empty iff EOF was reached *)
+  val fill_buf : t -> (bytes * int * int)


One small difference with the previous interface is that the user cannot choose an upper bound on the size of the slice. It seems that it could matter in terms of latency. Is that a limitation in practice?

hmm, I'm not sure I follow where latency is involved.

The contract is that fill_buf returns a non-empty slice, but it doesn't have to be the "full" underlying buffer. The name might be suboptimal.

My thought process is that if the client is latency-sensitive (a graphical or an audio client, for instance), it might make sense to provide an upper bound on the amount of work done by a call to fill_buf, to get some data now rather than an unknown amount later.

the current interface also doesn't give you anything to control that, does it? It's best effort in both cases. As long as you return a non-empty slice you can be as lazy as possible (typically, one syscall for reading).

edit: to be more clear, input chan buf i n asks to read at most n bytes into the buffer (and at least one, unless EOF is reached). fill_buf chan asks to return a non empty slice, unless EOF is reached. I don't think one is lazier than the other.

With the current interface, one can ask for at most one byte of data, so there is at least (very) theoretically some negotiation between the client and the buffer in terms of latency versus throughput. The added complexity is probably not worth it however.

My guess would be that a natural implementation of fill_buf would stop before a blocking operation, unless there is no data to read at all, similarly to how input would return less data in that case.

That is indeed the current implementation in the proof of concept, no syscall is done unless all has been consumed. I think the name is misleading, it doesn't try to refill the whole buffer on every call.
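The behaviour described here can be sketched as follows. This is not the proof-of-concept code: the record layout, the `refill` callback (standing in for the underlying read syscall), and the `consume` function (modeled on Rust's `BufRead`) are all assumptions made for illustration.

```ocaml
(* Hypothetical buffered reader; [refill] stands in for the underlying
   syscall and returns the number of bytes written into the buffer
   (0 at EOF). *)
type t = {
  buf : bytes;
  mutable off : int;   (* start of unconsumed data *)
  mutable len : int;   (* number of unconsumed bytes *)
  refill : bytes -> int;
}

let create ?(size = 4096) refill =
  { buf = Bytes.create size; off = 0; len = 0; refill }

(* Return the unconsumed slice; call [refill] only when it is empty.
   The result is empty iff EOF was reached. *)
let fill_buf (r : t) : bytes * int * int =
  if r.len = 0 then begin
    r.off <- 0;
    r.len <- r.refill r.buf
  end;
  (r.buf, r.off, r.len)

(* Mark [n] bytes of the current slice as consumed. *)
let consume (r : t) (n : int) : unit =
  if n < 0 || n > r.len then invalid_arg "consume";
  r.off <- r.off + n;
  r.len <- r.len - n
```

In this sketch a second call to `fill_buf` after a partial `consume` returns the remaining bytes without touching `refill`, which is why the name "fill" can mislead.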

Indeed, fill_buf seems misleading; might something along the lines of get_segment or get_slice be better?

Like Florian, I wonder if it might make sense to let the user specify a lower bound and/or an upper bound on the size of the slice that is returned. E.g., specifying a lower bound would save the user the trouble of writing a loop. (The operation would block or raise an exception if there is not enough data.) Specifying an upper bound may help prevent reading data which the user knows is not needed.

get_slice is probably better. I think specifying bounds complicates the whole design and contract between consumer and producer: the underlying producer could be, for example, decompressing a stream and not in control of how much is decompressed in one batch. Similarly, reading a TCP stream doesn't give you a lot of control (same as the current API).

To avoid writing a loop, there can be helpers like a read_exactly that would try to read n bytes, for example. I think Lwt has a similar thing (see "read_into_exactly").
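Such a helper might look like the sketch below. It assumes a slice-returning function and a `consume` function are passed in as labelled arguments (`~get_slice` and `~consume` are illustrative names), with an empty slice meaning EOF as in the contract discussed above.

```ocaml
(* Copy exactly [n] bytes into [dst] starting at [dst_off], looping over
   slices. Raises End_of_file if the input ends before [n] bytes are read.
   [get_slice] and [consume] are hypothetical parameters. *)
let read_exactly ~get_slice ~consume (dst : bytes) (dst_off : int) (n : int) =
  let rec loop dst_off remaining =
    if remaining > 0 then begin
      let (buf, off, len) = get_slice () in
      if len = 0 then raise End_of_file;
      (* Take no more than what is still needed. *)
      let k = min len remaining in
      Bytes.blit buf off dst dst_off k;
      consume k;
      loop (dst_off + k) (remaining - k)
    end
  in
  loop dst_off n
```

The loop lives in one place, so the producer stays free to return slices of whatever size it has on hand.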

dra27 · 2021-02-05T19:40:01Z

One possibility which could come from exposing seek and channel_length in the user-defined functions is that we could lift Windows text mode channels out of C, and even to a separate library (which would allow text mode to be available cross-platform). Loosely related to ocaml/ocaml#10109

fpottier · 2021-02-12T16:20:51Z

The proposal suggests re-implementing sprintf in terms of fprintf. Would there be a cost in efficiency, compared to the current implementation?

c-cube · 2021-02-12T16:25:35Z

@fpottier we would need to keep sprintf as it is for compatibility, I think. The idea is that you could have asprintf which takes an out_channel printer and does the same dance of "allocate buffer, make out_channel out of it, call fprintf, call Buffer.contents".
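The "dance" can be illustrated with the existing `Format` module, which already allows building a formatter on top of a `Buffer.t`. This is an analogy only: the RFC version would build an out_channel from the buffer instead, and the name `to_string` is made up here.

```ocaml
(* [to_string pp x] runs the printer [pp] against an in-memory buffer:
   allocate a buffer, wrap it, print, flush, then extract the contents. *)
let to_string (pp : Format.formatter -> 'a -> unit) (x : 'a) : string =
  let buf = Buffer.create 64 in
  let fmt = Format.formatter_of_buffer buf in
  Format.fprintf fmt "%a" pp x;
  (* Flush so that pending output reaches the buffer before reading it. *)
  Format.pp_print_flush fmt ();
  Buffer.contents buf
```

The extra allocation of the wrapper is the kind of cost the efficiency question above is about.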

Octachron · 2021-03-08T13:02:15Z

This RFC was discussed at today's OCaml developer meeting, and there was a general consensus that this is a step in the right direction in terms of interoperability.

There were some API design questions on the usability of flush in a network context and the best way to handle seek.

There were also some questions about the possibility of completely switching to the OO-interface and removing the builtin implementations (probably at the cost of bytecode performance).
My personal opinion is that the hybrid approach is fine for a first step.

Overall, it seems that the remaining finer points can be discussed in a PR.
This RFC is thus approved.

c-cube added 10 commits June 15, 2020 15:36

add rfc for modular IO types

19c9623

fix

17a3763

add more explanations on new interface

a9b42a5

add note on oc

41f6dd9

reword

79f9c57

reword

5b4417f

point out magical external

24cdb00

extensible new api

5c0bfb0

add a few more explanations of why

01735fc

add related works

715d36f

note on compat of seek/fd

45c5f60

gasche added rfc stdlib labels Jul 15, 2020

c-cube added 2 commits July 26, 2020 16:19

fix typo

b80b234

point to proof of concept

54f6cfc

Octachron reviewed Feb 5, 2021


Octachron merged commit 2ab45a2 into ocaml:master Mar 8, 2021

c-cube deleted the rfc-io branch March 8, 2021 15:18

alainfrisch mentioned this pull request Mar 10, 2021

gzip : allow reading from / writing to in-memory buffer xavierleroy/camlzip#27

Open

zindel mentioned this pull request Jul 21, 2021

Functorize Zip reading/writing functionality to permit different backends xavierleroy/camlzip#36

Open

c-cube mentioned this pull request Feb 3, 2022

add Buffer.unsafe_internal_buffer ocaml/ocaml#10982

Closed

gasche mentioned this pull request Mar 22, 2022

Add {In,Out}_channel.isatty ocaml/ocaml#11128

Merged
