Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modular IO #19

Merged
merged 13 commits into from
Mar 8, 2021
Merged

modular IO #19

merged 13 commits into from
Mar 8, 2021

Conversation

c-cube
Copy link
Contributor

@c-cube c-cube commented Jun 15, 2020

This RFC proposes to update the stdlib's types in_channel and out_channel to make them user-definable and composable.

view file

@yakobowski
Copy link

This is a very interesting proposal, that I strongly support. This especially resonated with me:

In my own experience, Format is a better choice because a single Format.formatter -> t -> unit function

In practice, I always end up writing everything in Format (even when not use pretty-printing) just to benefit from composable formatters.

@gasche
Copy link
Member

gasche commented Jun 16, 2020

There are a lot of extra features in built-in formats that correspond to usage modes they appear to have been designed for:

  • an offset (to support seek operations?)
  • a mutex for concurrent usage
  • a refcount (it looks like it was used by Cash to create several channels on the same underlying file descriptor?) and "revealed" count (also for Cash)
  • a global list of all channels

It would be nice to get feedback on which of those features we can get rid of, and which are important to preserve, because they may also constrain the design of the user-exposed interface. For example:

  1. your interface does not appear to support seek_{in,out}, would those operations just fail on user-defined channels?
  2. do we want the guarantee that a channel can be safely used in a concurrent scenario, with accesses to the same channel linearized sequentially? (Today the OCaml runtime lock would guarantee that calls to the read function are linearized, but tomorrow in a Multicore world...)

@nojb
Copy link

nojb commented Jun 16, 2020

There are a lot of extra features in built-in formats that correspond to usage modes they appear to have been designed for:
...

  • a refcount (it looks like it was used by Cash to create several channels on the same underlying file descriptor?) and "revealed" count (also for Cash)

My understanding is that these two fields could be removed (Cash is unmaintained and does not compile with the current OCaml version).

  • a global list of all channels

This is used to implement flush_all.

It would be nice to get feedback on which of those features we can get rid of, and which are important to preserve, because they may also constrain the design of the user-exposed interface. For example:

@c-cube
Copy link
Contributor Author

c-cube commented Jun 16, 2020

1. your interface does not appear to support `seek_{in,out}`, would those operations just fail on user-defined channels?

yes, there is a mention of that there. Seek only makes sense on some channels (even now), when the user knows it's mapped to a file underneath. Seek would just be a partial function that raises on user defined channels.

Of course an alternative is to have seek: (int -> unit) option in the user-defined record, with the default, None, meaning that seek must fail.

2. do we want the guarantee that a channel can be safely used in a concurrent scenario, with accesses to the same channel linearized sequentially? (Today the OCaml runtime lock would guarantee that calls to the `read` function are linearized, but tomorrow in a Multicore world...)

Seems very bad practice to me. If you want a lock, you should use one (see rust again about a locked wrapper around stdin/stdout).

I dislike the list of all channels (again, bad practice, flush_all is a code smell in my book; use with_in / with_out guards to ensure resources are properly disposed of) but it should probably be kept for all raw channels, for compatibility. I see no reason to extend it to user defined channels (why would you want flush_all to also work on channels that might print into buffers anyway)?

There is also a global map, in unix, from raw file descriptors to channels, to implement Unix.{in,out}_channel_of_descr. I think we also need to live with that one.

@gasche
Copy link
Member

gasche commented Jul 23, 2020

We discussed this RFC at a developer meeting, not in depth but rather to get quick feedback from people who hadn't already looked at it. One remark was that it was not completely clear that the proposed API would, indeed, enable the cool applications mentioned (in particular efficiencient) middle-layer adapters for compression or encryption, and that this could be conclusively answered without doing a PR that modifies the stdlib, just with a third-party prototype of the interface. Would you (@c-cube) be interested in doing this?

Some related works were briefly mentioned:

  • I mentioned BatIO, that you also mentioned in the proposal. I seem to remember that it is indeed flexible enough to offer, say, compression on the fly, so personally I'm not too worried about your proposal.
  • @jeremiedimino mentioned that Lwt_io has a similar "more flexible" abstraction.

Having a prototype would also have the nice property that it's easy to see the "final" state of the proposal to see how issues are adressed. (For example I would see seek : (int -> unit) option, and not need to browse the discussion about that. I'm fine with this proposal, by the way; a type-based distinction would be nicer in theory and too painful to deploy in practice if we want backward-compatibility.)

@c-cube
Copy link
Contributor Author

c-cube commented Jul 26, 2020

A proof of concept is being developped there fyi.

@c-cube
Copy link
Contributor Author

c-cube commented Dec 9, 2020

will the PoC be discussed at some developer meeting?

@gasche
Copy link
Member

gasche commented Dec 9, 2020

I looked at the proof-of-concept again and raised a couple small issues; I'm planning next to encourage others to give feedback.

@c-cube
Copy link
Contributor Author

c-cube commented Jan 12, 2021

I updated the poc, btw.

@Octachron
Copy link
Member

Looking at the RFCs and the proof-of-concept, I think this is a worthwhile extension: this is both fully backward compatible and improves the interoperability of existing libraries.

type t

(** Obtain a slice of the current buffer. Empty iff EOF was reached *)
val fill_buf : t -> (bytes * int * int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small difference with the previous interface is that the user cannot choose an upper bound on the size of the slice. It seems that it could matter in term of latency. Is that a limitation in practice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I'm not sure I follow where latency is involved.

The contract is that fill_buf returns a non-empty slice, but it doesn't have to be the "full" underlying buffer. The name might be suboptimal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought process is that if the client is latency-sensitive (a graphical or a audio client for instance) and it might make sense to provide an upper bound on the amount of work done by a call of fill_buf to get some data now rather than an unknown amount later.

Copy link
Contributor Author

@c-cube c-cube Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current interface also doesn't give you anything to control that, doesn't it? It's best effort in both cases. As long as you return a non empty slice you can be as lazy as possible (typically, one syscall for reading).

edit: to be more clear, input chan buf i n asks to read at most n bytes into the buffer (and at least one, unless EOF is reached). fill_buf chan asks to return a non empty slice, unless EOF is reached. I don't think one is lazier than the other.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current interface, one can ask for at most one byte of data, so there is at least (very) theoretically some negotiation between the client and the buffer in term of latency versus throughput. The added complexity is probably not worth it however.

Copy link
Member

@gasche gasche Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess would be that a natural implementation of fill_buf would stop before a blocking operation, unless there is no data to read at all, similarly to how input would return less data in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is indeed the current implementation in the proof of concept, no syscall is done unless all has been consumed. I think the name is misleading, it doesn't try to refill the whole buffer on every call.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, fill_buf seems misleading; might something along the lines of get_segment or get_slice be better?

Like Florian, I wonder if it might make sense to let the user specify a lower bound and/or an upper bound on the size of the slice that is returned. E.g., specifying a lower bound would save the user the trouble of writing a loop. (The operation would block or raise an exception if there is not enough data.) Specifying an upper bound may help prevent reading data which the user knows is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_slice is probably better. I think specifying bounds complicates the whole design and contract between consumer and producer: the underlying producer could be, for example, decompressing a stream and not in control of how much is decompressed in one batch. Similarly, reading a TCP stream doesn't give you a lot of control (same as the current API).

To avoid writing a loop, there can be helpers like a read_exactly that would try and read n bytes, for example. I think Lwt has similar thing (see "read_into_exactly").

@dra27
Copy link
Member

dra27 commented Feb 5, 2021

One possibility which could come from exposing seek and channel_length in the user-defined functions is that we could lift Windows text mode channels out of C, and even to a separate library (which would allow text mode to be available cross-platform). Loosely related to ocaml/ocaml#10109

@fpottier
Copy link

The proposal suggests re-implementing sprintf in terms of fprintf. Would there be a cost in efficiency, compared to the current implementation?

@c-cube
Copy link
Contributor Author

c-cube commented Feb 12, 2021

@fpottier we would need to keep sprintf as it is for compatibility, I think. The idea is that you could have asprintf which takes a out_channel printer and does the same dance of "allocate buffer, make out_channel out of it, call fprintf, call Buffer.content".

@Octachron
Copy link
Member

This RFC was discussed at today OCaml developer meeting, and there was a general consensus that this a step in the right direction in term of interoperability.

There were some API design questions on the usability of flush in a network context and the best way to handle seek.

There was also some interrogation on the possibility to completely switch to the OO-interface and remove the builtin implementations (probably at the cost of bytecode performance).
My personal opinion is that the hybrid approach is fine for a first step.

Overall, it seems that the remaining finer points can be discussed in a PR.
This RFC is thus approved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants