Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix fork() usage in ocamltest #11111

Merged
merged 2 commits into from Mar 16, 2022
Merged

fix fork() usage in ocamltest #11111

merged 2 commits into from Mar 16, 2022

Conversation

gasche
Copy link
Member

@gasche gasche commented Mar 13, 2022

The error-reporting code in ocamltest's run_unix.c is incorrect with respect to the new Multicore runtime, because it tries to use the OCaml runtime (to write to an OCaml channel) without calling caml_atfork_hook first.

reproduction information

Add

child_error("boom");

at the entry of run_command_child in ocamltest/run_unix.c, then run

make -C ocamltest
./ocamltest/ocamltest testsuite/tests/lib-queue/test.ml

Observed output contains:

 ... testing 'test.ml' with 2 (bytecode) => Fatal error: Fatal error during unlock: Operation not permitted
Process 1228576 got signal 6(Aborted), core dumped

On my system, the core dump has the following backtrace:

(gdb) #0  0x00007f18a664e2a2 in raise () from /lib64/libc.so.6
#1  0x00007f18a66378a4 in abort () from /lib64/libc.so.6
#2  0x00000000004377e1 in caml_fatal_error (msg=msg@entry=0x4496e8 "Fatal error during %s: %s\n") at misc.c:117
#3  0x00000000004389bc in caml_plat_fatal_error (action=action@entry=0x44a63b "unlock", err=<optimized out>) at platform.c:36
#4  0x0000000000424b0e in check_err (err=<optimized out>, action=0x44a63b "unlock") at caml/platform.h:130
#5  caml_plat_unlock (m=<optimized out>) at caml/platform.h:163
#6  caml_release_domain_lock () at domain.c:1322
#7  0x000000000043ddb5 in caml_write_fd (fd=3, flags=<optimized out>, buf=buf@entry=0x23c3dbc, n=n@entry=16) at unix.c:97
#8  0x000000000042feb7 in caml_flush_partial (channel=channel@entry=0x23c3d50) at io.c:257
#9  0x000000000042ff60 in caml_flush (channel=0x23c3d50) at io.c:272
#10 0x000000000041b2d7 in logToChannel ()
#11 0x000000000041ac59 in mylog ()
#12 0x000000000041ad29 in myperror_with_location.constprop ()
#13 0x000000000041b227 in run_command ()
#14 0x000000000041b4b8 in caml_run_command ()
#15 0x0000000000441bcd in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at interp.c:1012
#16 0x0000000000443572 in caml_main (argv=0x7ffce74a5da8) at startup_byt.c:578
#17 0x000000000041401c in main (argc=<optimized out>, argv=<optimized out>) at main.c:37

The problem comes from trying to use the OCaml runtime from a forked
child, without calling the new Multicore-OCaml function
caml_atfork_hook from domain.h. At this point the forked child has
its domain lock in an incoherent state, and calling caml_write_fd
fails. (Compare the ceremony around fork() in ocamltest/run_unix.c and
in otherlibs/unix/unix_fork.c)

Note: the backtrace also shows that C files in ocamltest were not
compiled with debug information. The PR also adds CFLAGS=-g in
ocamlest/Makefile to fix this.

@gasche gasche added the bug label Mar 13, 2022
@gasche
Copy link
Member Author

gasche commented Mar 13, 2022

Note: I wonder if there is a deeper problem here.

  1. Can OCaml 5.0 users use caml_atfork_hook, is it exposed in the public API? (And, ideally, documented somewhere?)
  2. Is it possible to write fork()-using C code that works with both 4.x and 5.0?

Maybe we can recommend to always use unix_fork from C code, and never the naked fork() call, to avoid this issue?

(Note: even with 4.x, there were reasons to recommend this, for example for Eventlog users, as unix_fork has the right cleanup logic there.)

@xavierleroy
Copy link
Contributor

It's not obvious to me why the child process uses the OCaml runtime. If we were to implement spawning using posix_spawn instead of fork + exec (resulting in nice performance gains), there would be no opportunity for the child to use the OCaml runtime. For the same reason, the Win32 implementation of command spawning cannot use the OCaml runtime. So, what's going on here?

@gasche
Copy link
Member Author

gasche commented Mar 14, 2022

The ocamltest implementation of "running a command" takes as parameter an OCaml channel for logging. At the point where the child sets things up to call excevp, any error will call error-reporting functions that write to this logging channel by calling OCaml runtime functions. See run_command_child and logToChannel.

In the medium term, maybe we could reconsider the idea of maintaining yet another implementation of spawn. In the short term, the PR is a minimal fix for a bug introduced by the change of runtime assumptions around fork().

@xavierleroy
Copy link
Contributor

At the point where the child sets things up to call excevp, any error will call error-reporting functions that write to this logging channel by calling OCaml runtime functions

Well, I find this questionable, even in OCaml 4. For example, there's a risk of double-flushing of OCaml channels (once in the parent, once in the child). resulting in duplicate output.

@gasche
Copy link
Member Author

gasche commented Mar 14, 2022

Good point. I did wonder about flushing before the fork() and then writing to the file descriptor directly, for example. But this also requires subtle and invasive changes to the implementation. I'm not currently planning to reimplement the run_command routine of ocamltest.

On the other hand, I think we do have OCaml programs that fork() and then use the OCaml runtime on both sides. I would expect those programs to call Unix.fork directly, but some could also possibly fork from C. We have decided that forking a multi-domain program would not be supported (and Unix.fork will error in this situation), but we intend to support forking a single-domain program, even under the Multicore runtime. So we can take this one example from the compiler distribution as a case study of compatibility issues that our users could encounter inthe wild. (Possibly as well in situations where the fork() usage is questionable, but that work fine in practice under 4.x.)

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's merge this as a bug fix and think about alternate implementations later. Can you please fix the Changes entry then squash-and-merge?

@xavierleroy
Copy link
Contributor

On the other hand, I think we do have OCaml programs that fork() and then use the OCaml runtime on both sides. I would expect those programs to call Unix.fork directly, but some could also possibly fork from C.

I really hope this pattern (fork in C, then use the OCaml runtime system in the child process) is not to be found in the wild. If I understand correctly, this pattern causes a fatal error in 5.0, so we'll hear about it.

@gasche
Copy link
Member Author

gasche commented Mar 16, 2022

Thanks! I updated and rebased.

If I understand correctly, this pattern causes a fatal error in 5.0, so we'll hear about it.

It fails on my machine on the first time that we try to release the runtime lock / enter a blocking section. (I'm slightly nervous at the idea that our options to fix the issue in a backward-compatible ways will be very limited once 5.0 and 4.14 are release without a coherent support for forking, but oh well, let's assume that nobody does that indeed.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants