Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation for Unix.[in/out]_channel_of_descr is misleading and leads to unwanted closing of file descriptor #9786

Closed
jhjourdan opened this issue Jul 21, 2020 · 10 comments

Comments

@jhjourdan
Copy link
Contributor

A common pattern for using sockets in OCaml is to transform them in two channels: one for input and one for output:

  let sock =
    Unix.socket [...] Unix.SOCK_STREAM 0 in
  let inchan = Unix.in_channel_of_descr sock in
  let outchan = Unix.out_channel_of_descr sock in

However, at some point, inchan and outchan will be closed, and this will close the underlying file descriptor. The documentation of Unix.[in/out]_channel_of_descr says:

Closing the channel also closes the underlying file descriptor (unless it was already closed).

In other words, it says implicitly that it is fine to call both close_in and close_out on inchan and outchan for closing the socket: what says the documentation says that the closing of the second channel will simply not close the underlying file descriptor, since it has already been closed.

However, things actually happens differently in the current implementation: close_in and close_out both always call the close system call, and simply ignore the return value. This is problematic, since the POSIX file descriptor may very well be reused between the closing of the two channels, resulting in the closing of a possibly completely unrelated file descriptor.

This is exemplified with the following OCaml program:

let () =
  (* Create a socket, and convert it in two channels *)
  let addr = Unix.ADDR_INET(Unix.inet_addr_loopback, 0) in
  let sock =
    Unix.socket (Unix.domain_of_sockaddr addr) Unix.SOCK_STREAM 0 in
  let inchan = Unix.in_channel_of_descr sock in
  let outchan = Unix.out_channel_of_descr sock in

  (* Close the input channel (and thus the underlying socket) *)
  close_in inchan;

  (* Open a completely unrelated file. This will reuse the file descriptor or the socket. *)
  let file = Unix.(openfile "file.tmp" [O_WRONLY;O_CREAT;O_TRUNC] 0o666) in

  (* Close the output channel of the socket. This will close the file. *)
  close_out outchan;

  (* Try to write to the file. *)
  ignore(Unix.write_substring file "aa" 0 2)
  (* Its file descriptor is closed, but we never asked so. This crashes with "exception Unix.Unix_error(Unix.EBADF, "write", "")" *)

This pattern is rather common. It is used in several tutorials for using the OCaml socket API. For example, see https://caml.inria.fr/pub/docs/oreilly-book/html/book-ora187.html, https://rosettacode.org/wiki/Sockets, https://ocaml.github.io/ocamlunix/sockets.html, ... Moreover, this bug is already present in ocamldebug implementation (not using the Unix interface, but by using the internal C API, but the idea is the same).

The reason nobody discovered the bug in real code is probably that the input and output channels are most often closed at the same time. The impression of soundness is, however, misleading: indeed, between the two closing calls, many things can happen, including a context switch to another thread, which may very well open e.g., another socket.

I can see two fixes to this bug:

  • Say that this is not a bug: make more precise the documentation and explain that one should not call both Unix.in_channel_of_descr and Unix.out_channel_of_descr on a given file descriptor. There is a simple workaround: simply call Unix.dup on the file descriptor. However, the various books, tutorials, etc... will remain there and the pattern will very probably will still be used in the wild for long.
  • Say what the docs implicitly says. The simplest is perhaps to change the internal implementation of the Unix.file_descr type, and to make it a record of two fields: the first one contains the actual POSIX file descriptor number, while the second is a mutable flag indicating whether the file descriptor has been closed. But this is a massive change in the Unix implementation, since pretty much all its functions use the file_descr type.
@jhjourdan
Copy link
Contributor Author

Cc @gadmm, who discovered first the bug in the ocamldebug implementation.

@xavierleroy
Copy link
Contributor

What about the following: if there is already a channel using the given file descriptor fd, Unix.{in,out}_channel_of_descr fd duplicates fd and returns a channel opened on the duplicate. This should work for the example in this PR, and gets us closer to a programming model where a channel "owns" the underlying file descriptor.

Determining whether a file descriptor is already used by a channel can be done today by scanning the list of all active channels, or if that is not efficient enough, by using a more efficient data structure (did I say "skiplist"?).

@gadmm
Copy link
Contributor

gadmm commented Aug 10, 2020

and gets us closer to a programming model where a channel "owns" the underlying file descriptor.

I am not sure to understand why this is closer to an ownership model, i.e. what would the high-level documentation be in terms of ownership? I think the specification would be complicated, while the current behaviour is the simplest one where the function assumes ownership of the file descriptor (I mean at least the the corrected version that correctly closes the file descriptor on error at https://github.com/ocaml/ocaml/pull/8997/files#diff-ab941d7100a652a3a5b09eec8270173cR72-R104).

I like the proposition in the first bullet point (clarify the documentation) for that reason. Moreover one often learns that things that used to work accidentally in single-threaded (e.g. closing a file descriptor twice) becomes dangerous in multi-threaded. With a clear documentation of ownership, which is along what C++/Rust programmers would expect, the working programmer should not make the error. (A helper function for creating in and out channels for sockets could be included if only to draw the programmer's attention to the issue.)

@damiendoligez
Copy link
Member

damiendoligez commented Jan 5, 2021

I see two other solutions:

  • like @jhjourdan's first bullet point, but simply document the fact that, if you open several channels on the same descriptor, only one of them should be closed. Optionally, raise an error when closing a channel whose fd is already closed.
  • symmetrically to @xavierleroy's solution, when closing a channel, find all the other channels that share the same fd and erase their fd field (set it to -1). This would implement the documented behavior without inserting spurious system calls.

@xavierleroy
Copy link
Contributor

symmetrically to @xavierleroy's solution, when closing a channel, find all the other channels that share the same fd and erase their fd field (set it to -1). This would implement the documented behavior without inserting spurious system calls.

That's a neat idea! Probably the simplest way to implement the documented behavior.

@gadmm
Copy link
Contributor

gadmm commented Jan 8, 2021

I hear “skiplist” but there is also an upcoming multithreading issue. It is not yet clear to me how the global linked list of channels will evolve with multicore (it seems that the multicore version of io.c has not yet been adapted to multicore). A global lock will be an easy sell if it is only used inside flush_all. It is probably a harder sell if needed inside every call to close_{in,out}. (An alternative idea to have a per-domain list of channels and no lock has other drawbacks like being unable to give ownership of a channel to another domain.)

This is a costly price for a feature that is actually not of much use. The few situations where you could rely on this part of the documented behaviour are also those where you could rewrite them not to. In particular, the documentation of posix close warns against closing a file descriptor while in use in another thread, and in this case setting fd to -1 does not fix anything. The documentation already forces the channels to remained owned by the same person, like @damiendoligez's proposed clarification makes clear.

Alternatively, Damien proposes optionally to raise an error. I assume checking for EBADF to report an error unreliably (the race condition remains). We could instead decide that the behaviour, which is correct in single-threaded code, remains correct under this assumption, so as not to break working code. The specification of posix close itself needed clarifying for multithreaded code, and code needed to be adapted for multithreading, but no working single-threaded code was broken in the process. I did not check again but it is possible that the code in ocamldebug was correct under this assumption and I was just splitting hairs.

In the spirit of @jhjourdan's first bullet point I propose to simply remove (unless it was already closed) from the documentation. This gets us to a programming model where a channel owns the underlying file descriptor.

(In addition, it would make sense to have a flag to indicate that a channel does not own its file descriptor, set upon creation, or a function to close a channel without closing the file descriptor. Intuitively, you rarely want a channel on stdin/out/err to close its file descriptor. This would be a clean way to express the idiom described by Jacques-Henri without having to call dup.)

@jhjourdan also proposed to play with the implementation of Unix.file_descr but I remember seeing its current definition (= the system file descriptor) being relied upon by some libraries through the FFI. (This was during some unrelated sightseeing on opam, so I do not have a pointer.)

@xavierleroy
Copy link
Contributor

Before we start speculating on performance issues in multicore, I wanted to fact-check some points of the discussion.

However, things actually happens differently in the current implementation: close_in and close_out both always call the close system call, and simply ignore the return value.

This has not been the case since 2002: 0738514

Closing an already-closed channel is a no-op. Closing a channel whose file descriptor has already been closed is a Sys_error exception.

In particular, the following code has been broken for more than 18 years:

  let sock = Unix.socket [...] Unix.SOCK_STREAM 0 in
  let inchan = Unix.in_channel_of_descr sock in
  let outchan = Unix.out_channel_of_descr sock in
  ...
  close_out outchan; close_in inchan

The close_in will consistently raise an exception.

So, I'm pretty sure our users actually write

  let sock = Unix.socket [...] Unix.SOCK_STREAM 0 in
  let inchan = Unix.in_channel_of_descr sock in
  let outchan = Unix.out_channel_of_descr sock in
  ...
  close_out outchan    (* or:  close_in inchan *)

I agree this should be documented better. We could say that if several channels are opened on the same file descriptor, only one should be closed (preferably the output channel so that it gets flushed correctly) and the other should not be used further. We could also recommend to avoid this situation by appropriate use of Unix.dup.

There is still a potential for error if the other channel (inchan in the example above) is used later:

  • if there's data in the buffer, reading from inchan may not fail immediately;
  • if another file is opened for reading and gets the same file descriptor, as in @jhjourdan's attack, reading from inchan may read from the new file.

If we really want to protect against this, @damiendoligez 's proposal (close the other channels sharing the same FD) sounds good to me. It has the side-effect of un-breaking the code above that has been broken for more than 18 years, but that's not the main motivation. The naive implementation is a bit expensive, but we can improve that with better data structures.

My suggestion of duplicating the FD so that each channel has its own FD could break example number 2 above. After close_out outchan we would still have one FD open on the socket, and this may change the behavior of the program.

So, what do we do?

  • Fix the documentation (we need to do that in all cases, but it may be enough to fix the documentation and not change the implementation).
  • Implement @damiendoligez's proposal, with a linear traversal of the list of all channels at every close operation.
  • Likewise, with better data structures to quickly find all the channels that share a FD.

@gadmm
Copy link
Contributor

gadmm commented Jan 9, 2021

Indeed, one should not confuse caml_ml_close_channel with caml_close_channel. The former checks the error code unlike the latter, which is used in ocamldebug but not much elsewhere. This clarification being made, we seem to agree on the three bullet points.

Since it is fact-checking time, I went for some sightseeing. By order of commonness:

  1. The most common behaviour is to leak the resources: small programs, including example files (!), and sometimes shutting down (not closing) the connection. This includes OCaml's own Unix.establish_server. (Correct e.g. for clients rather than servers.)

  2. The second most common behaviour is to directly close the descriptor but leak the channels (odb-server, obrowser, qmp-protocol, freetennis, hydro) (Correct, thanks to finalization.)

  3. More rarely, close only one (ocamlnet, message-switch+Lwt/Async, ocaml-http) (Correct)

  4. More rarely, close both and catch the exception (flowtype, functory) (Incorrect except under single-threading assumptions.)

  5. Close both and let the exception escape (message-switch. Possible explanation: there are three backends, and the other two (Lwt and Async) correctly close only one of them, so this might be dead code.) (Incorrect.)

  6. Get a high-value fd by calling dup a lot of times and closing all but the highest (sibylfs) (?)

  7. None of the programs used dup to get it right.

I only looked for usage of {in,out}_channel_of_descr in opam, but the dangerous idiom also appears in OCaml's own Unix.open_connection and Unix.establish_server and one could look at usage of those as well.

I understand how the solution 2. can appear more intuitive than 3. This mimics a semantics where the channel borrows the file descriptor.

Given the involvement of Unix.open_connection instead of just, e.g., third-party tutorials, it is hard to fix the fuzzy ownership semantics of the stdlib. One goal (A) is to (try to) fix this semantics (which includes fixing 4. & 5. above).

The other goal you mention (B) is to have a clean exception on use-after-free (for 3., but also desirable for 2.).

  • The @damiendoligez proposal fixes (A) by embracing the complexity of the fuzzy ownership semantics (including increasing reliance on a global knowledge of channels, my assumption here being that the semantics remains more malleable if it the global structure is only needed for flush_all). It fixes (B) for 3. but not for 2., which cannot be fixed by a similar trick.

  • Alternatively, one can clarify the documentation, and add something towards an optional borrowing semantics (for instance, a close channel function but that does not close the descriptor). This fixes (A) by making the borrowing semantics used in 2. official (an action is needed from developers to apply it; essentially they do by hand what the @damiendoligez proposal does with a global data structure). This also fixes (B) for both 2. and 3. (again, after explicit action from the developers, e.g. introducing calls to the new close function).

The first one magically fixes existing code (4. and 5. above, provided they are really buggy), while the second one gets to a situation with clearer ownership semantics (although still a bit fuzzy).

Finally let us appreciate how ownership is far from being an obvious notion, even for seemingly-simply issues such as double-free.

@xavierleroy
Copy link
Contributor

Thank you for the survey of usages in the wild.

One correction: you're unfair to Unix.establish_server. The channels are allocated in a child process and are automatically closed, along with the socket, when the child exits. As a comment in the implementation explains, there is no point in closing anything explicitly just before exiting.

Coming back to the survey: the documentation for {in,out}_channel_of_descr could describe and suggest approaches 2, 3, and 7 (using dup). Closing the out_channel (in case 3) is safer, as it guarantees that the channel is flushed and no data to be sent is lost.

I'm not sure we need more than that, but I'll think about it some more.

@gadmm
Copy link
Contributor

gadmm commented Jan 11, 2021

Yes of course, it is fine to leak for short-lived processes. Only for the example files it is worrying.

The missing functionality programmers cannot write themselves is probably captured by a function consume:

val consume_X : X_channel -> file_descr

that does the effect of close_X except it returns the file descriptor instead of closing it.

let close_out_without_closing_descr c = ignore (consume_out c)
(* for the flowtype code linked above *)
let close_two (i,o) =
  let fd1, fd2 = consume_in i, consume_out o in
  Unix.close fd1 ;
  if fd2 <> fd1 then Unix.close fd2

xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Jan 29, 2021
Explain more precisely what to do and what not to do to close channels
and their underlying descriptor.

Fixes: ocaml#9786
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Jan 30, 2021
Explain more precisely what to do and what not to do to close channels
and their underlying descriptor.

Fixes: ocaml#9786
smuenzel pushed a commit to smuenzel/ocaml that referenced this issue Mar 30, 2021
Explain more precisely what to do and what not to do to close channels and their underlying descriptor.

Fixes: ocaml#9786
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants