Reinitialize IO mutexes after fork #12886

TheNumbat · 2024-01-05T19:49:26Z

(Upstreaming this)

In OCaml 4, systhreads would re-initialize IO channel mutexes while resetting after a fork.
This PR makes the 5 runtime do so as well, fixing the relevant FIXME.

When resetting the mutexes matters, we're likely in an inconsistent state (i.e. another thread locked the mutex and then disappeared), but as the comment in caml_thread_reinitialize says, support for forking with threads is only best-effort.
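To make the scenario concrete, here is a minimal sketch of the general technique with plain POSIX threads, outside the OCaml runtime. The names `channel_mutex`, `holder` and `fork_with_reinit` are hypothetical stand-ins, not runtime identifiers. It forks while a second thread holds a mutex, then reinitializes the mutex in the child so the child can lock it again. Reinitializing a locked mutex is not sanctioned by POSIX, so this only illustrates the best-effort behaviour discussed here.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for one channel's mutex. */
static pthread_mutex_t channel_mutex = PTHREAD_MUTEX_INITIALIZER;
static int held_pipe[2];    /* holder -> main: "the mutex is locked" */
static int release_pipe[2]; /* main -> holder: "you may unlock now"  */

/* Locks the mutex and keeps it locked until told to release it. */
static void *holder(void *arg) {
  char c = 'x';
  ssize_t n;
  (void)arg;
  pthread_mutex_lock(&channel_mutex);
  n = write(held_pipe[1], &c, 1);
  if (n == 1) n = read(release_pipe[0], &c, 1);
  (void)n;
  pthread_mutex_unlock(&channel_mutex);
  return NULL;
}

/* Forks while another thread holds channel_mutex. Returns the child's exit
   code: 0 if the child could lock the mutex after reinitializing it. */
int fork_with_reinit(void) {
  pthread_t t;
  char c = 'x';
  int status = 0;
  ssize_t n;
  pid_t pid;

  if (pipe(held_pipe) != 0 || pipe(release_pipe) != 0) return -1;
  if (pthread_create(&t, NULL, holder, NULL) != 0) return -1;
  n = read(held_pipe[0], &c, 1); /* wait until the mutex is held */
  if (n != 1) return -1;

  pid = fork();
  if (pid == 0) {
    /* Child: only the forking thread exists here, yet the copied mutex
       still looks locked. Reinitialize it (best-effort), then lock it. */
    int ok;
    pthread_mutex_init(&channel_mutex, NULL);
    ok = (pthread_mutex_lock(&channel_mutex) == 0);
    if (ok) pthread_mutex_unlock(&channel_mutex);
    _exit(ok ? 0 : 1);
  }

  n = write(release_pipe[1], &c, 1); /* let the holder finish in the parent */
  (void)n;
  pthread_join(t, NULL);
  if (pid < 0) return -1;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Without the `pthread_mutex_init` call in the child, the `pthread_mutex_lock` that follows would typically block forever, since the thread that owns the lock does not exist in the child.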

TheNumbat · 2024-01-05T19:53:47Z

Not entirely related, but I also noticed that in io.c, caml_seek_in and caml_ml_close_channel both call caml_enter_blocking_section_no_pending without first calling check_pending. Is this wrong? Once we enter the blocking section, we may swap to another thread and process pending signals while the channel is locked. But the point of check_pending appears to be ensuring that the channel is not locked while signal handlers/finalizers run.

TheNumbat · 2024-01-05T20:50:28Z

Looks like thread-related tests are failing in bytecode mode with this error:

> Fatal error: cannot load shared library dllthreads
> Reason: flexdll error: cannot relocate caml_plat_mutex_init RELOC_REL32, target is too far: fffffffdd13b3a07  ffffffffd13b3a07

Is this a problem with the cygwin CI build?

xavierleroy · 2024-01-06T16:42:57Z

This has already been fixed in 4.14: ea90edc. There is no consensus that reinitializing the I/O mutexes in the child process is better than doing nothing, but this can be rediscussed. It's mentioned as a "remaining issue for OCaml 5" in #12399. For OCaml 4, the behavior changed in #12646 (commit ea90edc) to fix #12636.

xavierleroy · 2024-01-07T08:52:17Z

This PR was closed by mistake. Reopening and adding more context to my previous comment.

TheNumbat · 2024-01-08T16:52:07Z

I see, thanks. Is this patch the proper way to port reinitialization to the 5 runtime, then?

I agree reinitializing the mutexes is not obviously better than doing nothing, but if we don't, I don't think it's ever feasible to fork with threads (in which case it shouldn't be supported at all).
Since the 4 runtime reinitializes, I think preserving the behavior makes sense, even though it could lead to inconsistent IO state.

xavierleroy · 2024-01-08T18:40:10Z

> Is this patch the proper way to port reinitialization to the 5 runtime, then?

I tend to think so. For OCaml 4, an option we didn't investigate was to forget the old mutexes (but not destroy nor free them!) and reallocate new mutexes on demand, at the cost of leaking memory. In OCaml 5, I/O channel descriptors contain a mutex, not a pointer to a heap-allocated mutex, so this option is not available. Reinitializing a mutex is technically undefined behavior, but looks like the safest option to me.
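The two options contrasted above can be sketched as follows. This is an illustration under stated assumptions, not the runtime's code: `struct chan_ptr` and `struct chan_embedded` are hypothetical layouts standing in for a channel that points to a heap-allocated mutex and a channel that embeds its mutex, and all function names are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical layout: the channel points to a heap-allocated mutex. */
struct chan_ptr { pthread_mutex_t *mutex; };
/* Hypothetical layout: the channel embeds its mutex. */
struct chan_embedded { pthread_mutex_t mutex; };

/* Allocate the mutex on first use, then lock it. Returns 0 on success. */
int chan_ptr_lock(struct chan_ptr *c) {
  if (c->mutex == NULL) {
    c->mutex = malloc(sizeof *c->mutex);
    if (c->mutex == NULL) return -1;
    if (pthread_mutex_init(c->mutex, NULL) != 0) {
      free(c->mutex);
      c->mutex = NULL;
      return -1;
    }
  }
  return pthread_mutex_lock(c->mutex);
}

/* "Forget" option: drop the pointer without destroying or freeing the old
   mutex. The old mutex is leaked on purpose. */
void chan_ptr_forget(struct chan_ptr *c) { c->mutex = NULL; }

/* "Reinitialize" option: the only one left when the mutex is embedded. */
int chan_embedded_reinit(struct chan_embedded *c) {
  return pthread_mutex_init(&c->mutex, NULL);
}

/* Returns 0 if both options leave the channel lockable again. */
int demo_forget_vs_reinit(void) {
  struct chan_ptr p = { NULL };
  struct chan_embedded e;
  pthread_mutex_t *old;

  if (chan_ptr_lock(&p) != 0) return -1; /* simulate a stale lock */
  old = p.mutex;
  chan_ptr_forget(&p);
  if (chan_ptr_lock(&p) != 0 || p.mutex == old) return 1;
  pthread_mutex_unlock(p.mutex);
  pthread_mutex_destroy(p.mutex);
  free(p.mutex); /* the fresh mutex is cleaned up; the old one stays leaked */

  if (pthread_mutex_init(&e.mutex, NULL) != 0) return -1;
  pthread_mutex_lock(&e.mutex); /* simulate a stale lock */
  if (chan_embedded_reinit(&e) != 0) return 2;
  if (pthread_mutex_trylock(&e.mutex) != 0) return 2;
  pthread_mutex_unlock(&e.mutex);
  return 0;
}
```

The pointer-based variant never touches the stale mutex again, which sidesteps the undefined behavior at the price of a leak. The embedded variant has no pointer to swap, so reinitializing in place is what remains.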

This said, this part of the OCaml code was rewritten by @Engil, then much amended by @gadmm, so both of them should chime in now.

gadmm · 2024-01-09T23:38:14Z

Before I reply, has it been discussed to change fork such that it starts by acquiring all the channel mutexes? I have a vague recollection of a discussion, but I did not find it and do not remember the issue if any.

xavierleroy · 2024-01-10T17:52:21Z

> has it been discussed to change fork such that it starts by acquiring all the channel mutexes?

I thought about it briefly, but it won't work: a channel mutex can be held forever (e.g. while reading from a socket or pipe where the other end is not providing data), so this would prevent Unix.fork from completing.

gadmm · 2024-01-12T14:00:06Z

Ideally, programs that are wrong should fail as early as possible. Due to the "best-effort" and anti-POSIX aspect of the implementation, though, it is not clear how to reach a clear specification while keeping the feature useful.

The best way to look at it is from a backwards-compatibility perspective. One should make sure to preserve programs that are correct a posteriori (e.g. due to relying on underspecified behaviour, due to machine- and OS-specific reasons...). From this point of view, a program that leaves some channel in an inconsistent state, but that never observes this inconsistent state (for whichever programming reason), is correct.

Since no one seems to have the time or motivation to do an in-depth study of programs in the wild, it is far simpler to preserve the behaviour from OCaml 4 in a literal sense. So I think this PR should be reviewed with an eye towards acceptance.

In the longer term, a way to better specify the behaviour, whilst preserving most of the correct programs, could be to try to acquire all channel locks before forking (using try_lock rather than lock), and to place all channels for which try_lock fails into an invalid state (a state that deterministically causes an exception to be raised whenever the program accesses it).
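One possible reading of the try_lock idea above, sketched with plain POSIX threads. Everything here is hypothetical: `struct chan`, its `invalid` flag, and the function names are assumptions for illustration, and returning -1 stands in for raising an exception. The demo follows the child's code path without actually forking, and uses a lock held by the calling thread to stand in for a lock held by another thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical channel: an embedded mutex plus an "invalid" flag. */
struct chan {
  pthread_mutex_t mutex;
  bool invalid;
};

/* Before fork: try to take every lock without blocking, and record
   which attempts succeeded. */
void chans_trylock_all(struct chan *chans, size_t n, bool *acquired) {
  for (size_t i = 0; i < n; i++)
    acquired[i] = (pthread_mutex_trylock(&chans[i].mutex) == 0);
}

/* After fork, in parent and child: release the locks we took. In the child
   only: a channel we could not lock was held by a thread that no longer
   exists, so reinitialize its mutex and mark the channel invalid. */
void chans_after_fork(struct chan *chans, size_t n, const bool *acquired,
                      bool in_child) {
  for (size_t i = 0; i < n; i++) {
    if (acquired[i]) {
      pthread_mutex_unlock(&chans[i].mutex);
    } else if (in_child) {
      pthread_mutex_init(&chans[i].mutex, NULL);
      chans[i].invalid = true;
    }
  }
}

/* Any access to an invalid channel fails deterministically. */
int chan_access(struct chan *c) {
  int result;
  pthread_mutex_lock(&c->mutex);
  result = c->invalid ? -1 : 0;
  pthread_mutex_unlock(&c->mutex);
  return result;
}

/* Returns a bitmask of the channels that end up invalid on the child path. */
unsigned demo_trylock_fork(void) {
  struct chan chans[3];
  bool acquired[3];
  unsigned mask = 0;

  for (size_t i = 0; i < 3; i++) {
    pthread_mutex_init(&chans[i].mutex, NULL);
    chans[i].invalid = false;
  }
  pthread_mutex_lock(&chans[1].mutex); /* stands in for another thread */
  chans_trylock_all(chans, 3, acquired);
  chans_after_fork(chans, 3, acquired, true); /* child path, no real fork */
  for (size_t i = 0; i < 3; i++)
    if (chan_access(&chans[i]) != 0) mask |= 1u << i;
  return mask;
}
```

Because `pthread_mutex_trylock` never blocks, this avoids the problem raised earlier in the thread, where a channel lock held during a long read would prevent the fork from completing.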

> Not entirely related, but I also noticed that in io.c, caml_seek_in and caml_ml_close_channel both call caml_enter_blocking_section_no_pending without first calling check_pending. Is this wrong?

It does not seem to be a problem to me. Since the signal handlers/finalisers would run on a different thread, they would correctly block if they tried to access the same channel.

gadmm

This PR gives programs a chance to crash (or not) rather than deadlock when observing a channel in an inconsistent state, following a fork while the channel is locked by another thread.

I recommend accepting it because this is closer to OCaml 4, and no one really knows what programs in the wild do (nor, probably, has the time to look into it). Reinitializing the mutex directly differs a bit from OCaml 4, but is already used as a method for resetting the domain lock just above.

Still, this is not great and meant as a short-term fix. A possible better fix using try_lock on all channels is described above.

However, it does not seem useful nor necessary to reinitialize the all_opened_channels mutex, see below.

gadmm · 2024-01-12T14:22:49Z

otherlibs/systhreads/st_stubs.c

+  /* Reinitialize IO mutexes, in case the fork happened while another thread
+     had locked the channel. If so, we're likely in an inconsistent state,
+     but we may be able to proceed anyway. */
+  caml_plat_mutex_init(&caml_all_opened_channels_mutex);


caml_all_opened_channels_mutex was introduced in OCaml 5 for the case of concurrent access by two domains. For single-domain programs, caml_all_opened_channels is protected by the domain lock. Since fork is only for single-domain programs, I think caml_all_opened_channels_mutex is always in an unlocked state and its data is valid. Thus, maybe there is no need to reinitialise it?

TheNumbat · 2024-01-12

I see, so in a single-domain program, caml_all_opened_channels_mutex can't be locked at the fork because it's never locked during a blocking section. Is that right?

(Deleted this change)

gadmm · 2024-01-12

Yes, or I would see it as a bug with OCaml 5 (there is no such mutex in OCaml 4).

TheNumbat · 2024-01-16T16:45:29Z

This looks ready to merge (at least as a temporary fix), other than the failing Windows build. I'm not sure what's going on there, can someone take a look?

gadmm · 2024-01-16T16:52:11Z

My guess would be a missing CAMLexport for caml_plat_mutex_init in runtime/caml/platform.h.

TheNumbat · 2024-01-16T19:09:31Z

Yep, that was it

gasche

I am approving based on @gadmm's review and my understanding of the generally positive conversation on this best-effort change. Thanks!

nojb · 2024-02-01T06:47:53Z

I think the Changes entry for this was mistakenly put in the 5.2 section (in trunk).

gasche · 2024-02-01T09:52:40Z

Actually I think that this is a bugfix (as in, it un-crashes programs in the wild), so I will cherry-pick into 5.2 instead of moving the Changes entry to trunk.

Reinitialize IO mutexes after fork (cherry picked from commit 5ef295c)

gasche · 2024-02-01T10:02:10Z

Cherry-picked in 5.2 as 800ba1f (nice hash).

gasche · 2024-02-05T07:08:48Z

I have a question about this change. caml_atfork_hook in domain.c currently reads as follows (in trunk, after the change is merged):

/* default handler for unix_fork, will be called by unix_fork. */
static void caml_atfork_default(void)
{
  caml_reset_domain_lock();
  caml_acquire_domain_lock();
  /* FIXME: For best portability, the IO channel locks should be
     reinitialised as well. (See comment in
     caml_reset_domain_lock.) */
}

The FIXME is suggesting the same thing that was added by the present PR in otherlibs/systhreads/st_stubs.c. If I understand correctly, this makes the comment slightly inconsistent, but not wrong:

- when the systhreads library is linked, IO channel locks will be reset (thanks to the present PR)
- but when it is not, they will not be reset
- for OCaml 4, only systhreads would require taking IO channel locks, so having that logic in systhreads only made perfect sense
- but with OCaml 5, non-systhreads-using multi-domain programs will routinely take channel locks, so they would also need the logic

My current understanding is that the IO-channel logic should move from systhreads.c to domain.c, preferably with a more specific name than caml_atfork_default, and then caml_thread_reinitialize in systhreads.c should call that logic from domain.c.

What do you think?

xavierleroy · 2024-02-05T08:26:42Z

Unix.fork fails if the program spawned multiple domains, and if the fork is done in C code, the child is not supposed to use any of the runtime system services. So, I don't think any action is required, except perhaps removing the FIXME comment if it turns out to be misleading.

This said, the purpose and use of caml_atfork_hook is not clear to me. Right now, its only use is in runtime/afl.c, where the child runs the whole OCaml code and the parent monitors it, and I don't know if this is 100% safe.

gasche · 2024-02-05T10:01:46Z

I looked at caml_atfork_hook again after seeing this Discuss thread where someone has OCaml FFI code (ocamlfuse) that works on OCaml 4 and breaks on OCaml 5. The FFI code calls a C function that "daemonizes" the process by forking under the hood (the parent terminates and execution continues in the child process). The proposed fix is to call caml_atfork_hook after this daemonization has been done: https://github.com/astrada/ocamlfuse/blob/c27893c2d5ad5eca733b59e586448dc850aa2788/lib/Fuse_util.c#L733-L736 . This seems to be a reasonable use-case to me.

(The ocamltest codebase also calls fork from C and needs caml_atfork_hook.)

TheNumbat · 2024-02-05T19:23:06Z

My use case for this PR was also daemonizing a child process (which worked in 4 but failed in 5), but this change was sufficient because the fork occurred in an OCaml library that used systhreads.

I think the fork in ocamlfuse is currently not supported, since it's either forking with multiple domains or using the runtime from a C thread without linking systhreads. (i.e. what @xavierleroy said)

TheNumbat added 2 commits January 5, 2024 14:42

reinit channel mutexes

de4a65a

changes entry

f55c3af

SGrondin mentioned this pull request Jan 5, 2024

_os_unfair_lock_corruption_abort after fork on MacOS ocaml-multicore/eio#660

Closed

gasche added the runtime-system label Jan 6, 2024

xavierleroy closed this Jan 6, 2024

xavierleroy reopened this Jan 7, 2024

gadmm approved these changes Jan 12, 2024

TheNumbat added 2 commits January 12, 2024 12:18

don't reinit caml_all_opened_channels_mutex

829577e

update changes

4340fc4

CAMLexport

43d5057

gasche approved these changes Jan 16, 2024

gasche merged commit 5ef295c into ocaml:trunk Jan 16, 2024
9 checks passed

gasche added a commit that referenced this pull request Feb 1, 2024

Merge pull request #12886 from TheNumbat/trunk

800ba1f

Reinitialize IO mutexes after fork (cherry picked from commit 5ef295c)
