Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinitialize IO mutexes after fork #12886

Merged
merged 5 commits into from
Jan 16, 2024
Merged

Reinitialize IO mutexes after fork #12886

merged 5 commits into from
Jan 16, 2024

Conversation

TheNumbat
Copy link
Contributor

(Upstreaming this)

In OCaml 4, systhreads would re-initialize IO channel mutexes while resetting after a fork.
This PR makes the 5 runtime do so as well, fixing the relevant FIXME.

When resetting the mutexes matters, we're likely in an inconsistent state (i.e. another thread locked the mutex and then disappeared), but as the comment in caml_thread_reinitialize says, support for forking with threads is only best-effort.

@TheNumbat
Copy link
Contributor Author

TheNumbat commented Jan 5, 2024

Not entirely related, but I also noticed that in io.c, caml_seek_in and caml_ml_close_channel both call caml_enter_blocking_section_no_pending without first calling check_pending. Is this wrong? Once we enter the blocking section, we may swap to another thread and process pending signals while the channel is locked. But the point of check_pending appears to be ensuring that the channel is not locked while signal handlers/finalizers run.

@TheNumbat
Copy link
Contributor Author

TheNumbat commented Jan 5, 2024

Looks like thread-related tests are failing in bytecode mode with this error:

> Fatal error: cannot load shared library dllthreads
> Reason: flexdll error: cannot relocate caml_plat_mutex_init RELOC_REL32, target is too far: fffffffdd13b3a07  ffffffffd13b3a07

Is this a problem with the cygwin CI build?

@xavierleroy
Copy link
Contributor

xavierleroy commented Jan 6, 2024

This has already been fixed in 4.14: ea90edc . There is no consensus that reinitializing the I/O mutexes in the child process is better than doing nothing, but this can be rediscussed. It's mentioned as a "remaining issue for OCaml 5" in #12399. For OCaml 4, the behavior changed in #12646 (commit ea90edc) to fix #12636.

@xavierleroy xavierleroy closed this Jan 6, 2024
@xavierleroy
Copy link
Contributor

This PR was closed by mistake. Reopening and adding more context to my previous comment.

@xavierleroy xavierleroy reopened this Jan 7, 2024
@TheNumbat
Copy link
Contributor Author

I see, thanks. Is this patch the proper way to port reinitialization to the 5 runtime, then?

I agree reinitializing the mutexes is not obviously better than doing nothing, but if we don't, I don't think it's ever feasible to fork with threads (in which case it shouldn't be supported at all).
Since the 4 runtime reinitializes, I think preserving the behavior makes sense, even though it could lead to inconsistent IO state.

@xavierleroy
Copy link
Contributor

Is this patch the proper way to port reinitialization to the 5 runtime, then?

I tend to think so. For OCaml 4, an option we didn't investigate was to forget the old mutexes (but not destroy nor free them!) and reallocate new mutexes on demand, at the cost of leaking memory. In OCaml 5, I/O channel descriptors contain a mutex, not a pointer to a heap-allocated mutex, so this option is not available. Reinitializing a mutex is technically undefined behavior, but looks like the safest option to me.

This said, this part of the OCaml code was rewritten by @Engil, then much amended by @gadmm, so both of them should chime in now.

@gadmm
Copy link
Contributor

gadmm commented Jan 9, 2024

Before I reply, has it been discussed to change fork such that it starts by acquiring all the channel mutexes? I have a vague recollection of a discussion, but I did not find it and do not remember the issue if any.

@xavierleroy
Copy link
Contributor

has it been discussed to change fork such that it starts by acquiring all the channel mutexes?

I thought about it briefly, but it won't work: a channel mutex can be held forever (e.g. while reading from a socket or pipe where the other end is not providing data), so this would prevent Unix.fork from completing.

@gadmm
Copy link
Contributor

gadmm commented Jan 12, 2024

Ideally, programs that are wrong should fail as early as possible. Due to the "best-effort" and anti-POSIX aspect of the implementation, though, it is not clear how to reach a clear specification while keeping the feature useful.

The best way to look at it is from a backwards-compatibility perspective. One should make sure to preserve programs that are correct a posteriori (e.g. due to relying on underspecified behaviour, due to machine- and OS-specific reasons...). From this point of view, a program that leaves some channel in an inconsistent state, but that never observes this inconsistent state (for whichever programming reason), is correct.

Since no-one seems to have the time nor motivation to do an in-depth study of programs in the wild, it is far much simpler to preserve the behaviour from OCaml 4 in a litteral sense. So I think this PR should be reviewed with an eye towards acceptance.

In the longer term, a way to better specify the behaviour, whilst preserving most of the correct programs, could be to try to acquire all channel locks before forking (using try_lock rather than lock), and to place all channels for which try_lock fails into an invalid state (a state that deterministically causes an exception to be raised whenever the program accesses it).

Not entirely related, but I also noticed that in io.c, caml_seek_in and caml_ml_close_channel both call caml_enter_blocking_section_no_pending without first calling check_pending. Is this wrong?

It does not seem to be a problem to me. Since the signal handlers/finalisers would run on a different thread, correctly they would block if they tried to access the same channel.

Copy link
Contributor

@gadmm gadmm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR gives programs a chance to crash (or not) rather than deadlock when observing a channel in an inconsistent state, following a fork while the channel is locked by another thread.

I recommend to accept it because this is closer to OCaml 4, and one does not really know what programs in the wild do (nor has the time to look into it, probably). Reinitializing the mutex directly differs a bit from OCaml 4, but is already used as a method for resetting the domain lock just above.

Still, this is not great and meant as a short-term fix. A possible better fix using try_lock on all channels is described above.

However, it does not seem useful nor necessary to reinitialize the all_opened_channels mutex, see below.

/* Reinitialize IO mutexes, in case the fork happened while another thread
had locked the channel. If so, we're likely in an inconsistent state,
but we may be able to proceed anyway. */
caml_plat_mutex_init(&caml_all_opened_channels_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caml_all_opened_channels_mutex was introduced in OCaml 5 for the case of concurrent access by two domains. For single-domain programs, caml_all_opened_channels is protected by the domain lock. Since fork is only for single-domain programs, I think caml_all_opened_channels_mutex is always in an unlocked state and its data is valid. Thus, maybe there is no need to reinitialise it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so in a single domain program, caml_all_opened_channels_mutex can't be locked at the fork because it's never locked during a blocking section. Is that right?

(Deleted this change)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, or I would see it as a bug with OCaml 5 (there is no such mutex in OCaml 4).

@TheNumbat
Copy link
Contributor Author

This looks ready to merge (at least as a temporary fix), other than the failing Windows build. I'm not sure what's going on there, can someone take a look?

@gadmm
Copy link
Contributor

gadmm commented Jan 16, 2024

My guess would be a missing CAMLexport for caml_plat_mutex_init in runtime/caml/platform.h.

@TheNumbat
Copy link
Contributor Author

Yep, that was it

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving based on @gadmm's review and my understanding of the generally positive conversation on this best-effort change. Thanks!

@gasche gasche merged commit 5ef295c into ocaml:trunk Jan 16, 2024
9 checks passed
@nojb
Copy link
Contributor

nojb commented Feb 1, 2024

I think the Changes entry for this was mistakingly put in the 5.2 section (in trunk).

@gasche
Copy link
Member

gasche commented Feb 1, 2024

Actually I think that this is a bugfix (as it, it un-crashes programs in the wild), so I will cherry-pick into 5.2 instead of moving the Changes entry to trunk.

gasche added a commit that referenced this pull request Feb 1, 2024
Reinitialize IO mutexes after fork

(cherry picked from commit 5ef295c)
@gasche
Copy link
Member

gasche commented Feb 1, 2024

Cherry-picked in 5.2 as 800ba1f (nice hash).

@gasche
Copy link
Member

gasche commented Feb 5, 2024

I have a question about this change. caml_atfork_hook in domain.c currently reads as follows (in trunk, after the change is merged):

/* default handler for unix_fork, will be called by unix_fork. */
static void caml_atfork_default(void)
{
  caml_reset_domain_lock();
  caml_acquire_domain_lock();
  /* FIXME: For best portability, the IO channel locks should be
     reinitialised as well. (See comment in
     caml_reset_domain_lock.) */
}

The FIXME is suggesting the same thing that was added by the present PR in otherlibs/systhreads/st_stubs.c. If I understand correctly, this makes the comment slightly inconsistent, but not wrong:

  • when the systhreads library are linked, IO channel locks will be reset (thanks to the present PR)
  • but when it is not, they will not be reset
  • for OCaml 4, only systhreads would require taking IO channel locks, so having that logic in systhreads only made perfect sense
  • but with OCaml 5, non-systhreads-using multi-domain program will routinely take channel locks, so they would also need the logic

My current understanding is that the IO-channel logic should move from systhreads.c to domain.c, preferably with a more specific name than caml_atfork_default, and then caml_thread_reinitialize in systhreads.c should call that logic from domain.c.

What do you think?

@xavierleroy
Copy link
Contributor

Unix.fork fails if the program spawned multiple domains, and if the fork is done in C code, the child is not supposed to use any of the runtime system services. So, I don't think any action is required, except perhaps removing the FIXME comment if it turns to be misleading.

This said, the purpose and use of caml_atfork_hook is not clear to me. Right now, its only use is in runtime/afl.c, where the child runs the whole OCaml code and the parent monitors it, and I don't know if this is 100% safe.

@gasche
Copy link
Member

gasche commented Feb 5, 2024

I looked at caml_atfork_hook again after seeing this Discuss thread where someone has OCaml FFI code (ocamlfuse) that works on OCaml 4 and breaks on OCaml 5. The FFI code calls a C function that "daemonizes" the process by forking under the hood (the parent terminates and execution continues in the child process). The proposed fix is to call caml_atfork_hook after this daemonization has been done: https://github.com/astrada/ocamlfuse/blob/c27893c2d5ad5eca733b59e586448dc850aa2788/lib/Fuse_util.c#L733-L736 . This seems to be a reasonable use-case to me.

(The ocamltest codebase also calls fork from C and needs caml_atfork_hook.)

@TheNumbat
Copy link
Contributor Author

My use case for this PR was also daemonizing a child process (which worked in 4 but failed in 5), but this change was sufficient because the fork occurred in an OCaml library that used systhreads.

I think the fork in ocamlfuse is currently not supported, since it's either forking with multiple domains or using the runtime from a C thread without linking systhreads. (i.e. what @xavierleroy said)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants