Force all domain threads to exit before the main thread #13010

Closed
wants to merge 2 commits into from


dustanddreams
Contributor

This is a subtask of the "memory cleanup at exit time" work in progress.

In order to be able to correctly release the various memory resources held by the program's domains, I need to make sure that no other threads are running.

In the current state of the runtime, no effort is made to ensure that all domain threads exit at the end of the program execution; it relies upon the operating system to reap all the threads at program termination time. This can be a problem if the caml runtime is embedded in a larger program which does not necessarily exit after running a caml program.

I have experimented with various ways of getting these threads to terminate. It turns out that the simplest and probably safest way is to force an STW rendezvous, during which all the threads but the main thread will ask their backup thread to exit (if it had been started) and exit themselves. Of course, there is a risk of not releasing locks or mutexes, but since all the other threads will also exit this should not matter.

For the record, there are five tests in the compiler testsuite which end up with threads running and cause this new code path to be taken:

  • tests/lazy/lazy3.ml
  • tests/lib-runtime-events/test_dropped_events.ml
  • tests/parallel/backup_thread.ml
  • tests/parallel/major_gc_wait_backup.ml
  • tests/parallel/mctest.ml

This PR passes all the compiler tests and also the multicoretests.

Member

@gasche gasche left a comment

I have two questions inline, and two questions here:

  1. caml_do_exit seems to also be called from a side thread under Windows (see caml_signal_thread in runtime/win32.c). Is that correct? My uninformed intuition is that it is wrong, but was already wrong before the current change. (But maybe the change makes it worse?)

  2. You made the choice to call pthread_exit on all non-main domains and their backup threads, but not to call their termination logic (the many things done in domain_terminate after we exit the STW). I suppose this is an intentional design choice? Do we believe that the cleanup-mode-at-exit will be good enough to reclaim all the domain-owned resources that would have been released by the termination logic, or do we risk leaking resources with your approach?

runtime/domain.c
@@ -191,6 +191,7 @@ CAML_RUNTIME_EVENTS_DESTROY();
 #ifndef NATIVE_CODE
   caml_debugger(PROGRAM_EXIT, Val_unit);
 #endif
+  caml_terminate_all_domains();
   if (caml_params->cleanup_on_exit)
     caml_shutdown();
Member

Do we want to call caml_terminate_all_domains unconditionally or only when cleanup_on_exit is used?

Contributor Author

I think it would be preferable to do this regardless of whether cleanup will occur, in order not to leave "dangling" threads in programs embedding the caml runtime (as mentioned in the PR description).

But if people are worried about the consequences of this change, I'll move it to caml_shutdown.

Member

I don't personally have a strong opinion (I don't know enough about threads and systems programming to judge the amount of risk involved here), so I am happy to follow your preference unless someone else speaks out.

Member

On the one hand, this adds some time overhead.

On the other hand, I think there is a race in the existing code between caml_terminate_signals and the delivery of signals to some other (still-running) domain, which this will fix.

Contributor

@gadmm gadmm Apr 6, 2024

I am not sure this call to caml_terminate_signals is problematic, but I also think it can be removed.

I am not sure I understand the point about programs embedding the ocaml runtime. These programs will indeed call caml_shutdown directly rather than caml_do_exit.

@dustanddreams
Contributor Author

> 1. `caml_do_exit` seems to also be called from a side thread under Windows (see `caml_signal_thread` in runtime/win32.c). Is that correct? My uninformed intuition is that it is wrong, but was already wrong before the current change. (But maybe the change makes it worse?)

Thanks for noticing this. I have not looked at the windows code paths yet. I will have a look and answer later.

> 2. You made the choice to call `pthread_exit` on all non-main domains and their backup threads, but not to call their termination logic (the many things done in `domain_terminate` after we exit the STW). I suppose this is an intentional design choice? Do we believe that the cleanup-mode-at-exit will be good enough to reclaim all the domain-owned resources that would have been released by the termination logic, or do we risk leaking resources with your approach?

My understanding is that the STW rendezvous mechanism requires callbacks to be limited in what they do. In particular, runtime/domain.c mentions that:

    - STW sections must not trigger other callbacks into mutator code
      (eg. finalisers or signal handlers).

which is why I am not doing any of the domain cleanup work here at this point (my intent is to have the main thread do it in caml_shutdown eventually after all the other threads have exited), and I believe it will be able to release all the memory resources allocated by other threads.

@dustanddreams
Contributor Author

> 1. `caml_do_exit` seems to also be called from a side thread under Windows (see `caml_signal_thread` in runtime/win32.c). Is that correct? My uninformed intuition is that it is wrong, but was already wrong before the current change. (But maybe the change makes it worse?)

This looks harmless to me. That use case is specific to the (apparently undocumented) CAMLSIGPIPE environment variable, which, if present, is expected to be the numeric handle of a pipe on which the other side can write 'C' to inflict a Ctrl-C or 'T' to request termination. caml_do_exit will get invoked when an I/O error on this pipe occurs (i.e. the pipe gets closed by the remote end) to terminate the Caml process immediately.

@kayceesrk
Contributor

The code looks reasonable to me. IIUC, the current PR unconditionally forces all the non-main domains to terminate before the main domain terminates, irrespective of whether memory cleanup at exit (MCE) mode is on. Do we know by how much this slows down multi-domain programs that do not use MCE?

My motivation is to allow command-line tools to be written using multiple domains. My experimental observations with OCaml 5 lead me to think that the cost of parallelism creation is still quite high. Without significant work (100s of milliseconds), the parallelism isn't worth it now. I would like to bring this cost down.

I wonder whether it is worth considering doing this only when the "memory cleanup at exit" mode is on for performance?

@dustanddreams
Contributor Author

> IIUC, the current PR unconditionally forces all the non-main domains to terminate before the main domain terminates, irrespective of whether memory cleanup at exit (MCE) mode is on.

Yes.

> Do we know by how much this slows down multi-domain programs that do not use MCE?

I have not attempted to measure this, and it will depend upon the kind of activity done in each domain thread. This adds the cost of an STW rendezvous; the cost of pthread_exit is negligible. Of course, if you design your code so that the domain threads exit by themselves when your program has completed its work, the overhead is zero since this is guarded by !caml_domain_alone.
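
To illustrate the zero-overhead case, here is a minimal sketch (OCaml 5, standard library only) of a program whose worker domains return on their own and are joined before the main domain finishes. The function names and the workload are hypothetical. The assumption is that, once every domain has been joined, the main domain is alone at exit and the new termination path is not taken.

```ocaml
(* Sum the integers in [lo, hi]; stands in for real per-domain work. *)
let sum_range lo hi =
  let total = ref 0 in
  for i = lo to hi do
    total := !total + i
  done;
  !total

let parallel_sum () =
  (* Hypothetical workload: two domains each sum half of 1..1000. *)
  let d1 = Domain.spawn (fun () -> sum_range 1 500) in
  let d2 = Domain.spawn (fun () -> sum_range 501 1000) in
  (* Domain.join blocks until the domain has terminated, so no worker
     is still running when the main domain reaches the end. *)
  Domain.join d1 + Domain.join d2

let () = Printf.printf "sum = %d\n" (parallel_sum ())
```

Programs built around long-lived domain pools do not fit this shape without an explicit stop request.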

> I wonder whether it is worth considering doing this only when the "memory cleanup at exit" mode is on for performance?

This point is open for discussion.

@kayceesrk
Contributor

Sorry, I noticed that the point that I had raised was already discussed here: https://github.com/ocaml/ocaml/pull/13010/files#r1514040251.

I'm happy to go with the unconditional wait to terminate all domains. We can come back to this point if and when we solve other issues with scaling short-lived programs with multiple domains.

@dustanddreams
Contributor Author

dustanddreams commented Mar 13, 2024

For the record, @damiendoligez suggested raising an exception to force the domain thread to exit in a more "natural" way and to allow proper cleanup.

However, I can't get this to work at the moment (and also, raising the exception in the STW callback will cause it to be taken care of immediately, which goes against the "no mutator code" STW requirement).

@kayceesrk
Contributor

> For the record, @damiendoligez suggested raising an exception to force the domain thread to exit in a more "natural" way and to allow proper cleanup.

Can you elaborate on this idea? Does this mean that any GC safe point (allocation, poll points, etc) can raise an exception that the user code is expected to handle cleanly? IINM, today, only Out_of_memory may be raised at GC safe points. Out_of_memory is a catastrophic exception that can't be handled cleanly; we're out of memory and anything that we do cannot allocate more. Are we proposing to add another exception that may arise out of GC safe points? CC @gadmm as he may understand the (expected) semantics of these well.

@gasche
Member

gasche commented Mar 19, 2024

My understanding is that the idea is not to encourage users to try to catch failures at poll points (in fact many of them are not visible in the source), but to use coarser-grained exception handlers, typically using helpers such as Fun.protect, to release resources allocated in the execution context / call stack. This cleanup behavior would be similar to the way the Sys.Break exception is raised asynchronously on Ctrl-C, the use of the Thread.Exit exception in threads, or some uses of discontinue within effect handlers.

The documentation of Thread.Exit is explicit about this:

val exit : unit -> unit
[@@ocaml.deprecated "Use 'raise Thread.Exit' instead."]
(** Raise the {!Thread.Exit} exception.
    In a thread created by {!Thread.create}, this will cause the thread
    to terminate prematurely, unless the thread function handles the
    exception itself.  {!Fun.protect} finalizers and catch-all
    exception handlers will be executed.

    To make it clear that an exception is raised and will trigger
    finalizers and catch-all exception handlers, it is recommended
    to write [raise Thread.Exit] instead of [Thread.exit ()]. [...] *)
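
As a small illustration of the coarse-grained cleanup pattern described above, here is a sketch using Fun.protect from the standard library. Stop_requested is a hypothetical exception standing in for whatever would be raised to stop a thread or domain, and it is raised synchronously here to keep the example deterministic.

```ocaml
(* Hypothetical stand-in for an asynchronous stop request. *)
exception Stop_requested

(* Records whether the finalizer ran. *)
let released = ref false

let run_with_cleanup () =
  try
    Fun.protect
      ~finally:(fun () -> released := true)  (* runs on normal and exceptional exit *)
      (fun () -> raise Stop_requested)
  with Stop_requested -> "stopped"

let () = print_endline (run_with_cleanup ())
```

The code that gets interrupted does not need to know where the exception comes from; only the finalizer at the resource boundary does the cleanup.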

There are proposals to change the way we handle asynchronous exceptions. As far as I know there are currently two competing proposals:

  1. one around the idea to use masking to let users explicitly avoid asynchronous exceptions in critical sections (Masking of asynchronous callbacks, #8961);
  2. the other around the idea of relegating asynchronous exceptions to a second-class status that does not use raise or try .. with ..., but a separate layer of handlers that are only exposed as a Sys library function (Improve the semantics of asynchronous exceptions (new simpler version), ocaml-flambda/flambda-backend#802).

In any case I would expect Fun.protect to still work as expected (if we take route (2) we would change its implementation to catch and reraise asynchronous exceptions). The future compatibility status of catch-all exception handlers is less clear.

@kayceesrk
Contributor

Thanks @gasche. I like the idea (2); it makes sense to relegate asynchronous exceptions to a second-class status.

@dustanddreams I would like to know the proposed design for the asynchronous exception. In particular, what exception are we planning to raise on the other threads? Will this be Thread.Exit?

> raising the exception in the STW callback will cause it to be taken care of immediately, which goes against the "no mutator code" STW requirement

Curious whether this requirement is documented somewhere in the runtime.

@dustanddreams
Contributor Author

> > For the record, @damiendoligez suggested raising an exception to force the domain thread to exit in a more "natural" way and to allow proper cleanup.
>
> Can you elaborate on this idea? Does this mean that any GC safe point (allocation, poll points, etc) can raise an exception that the user code is expected to handle cleanly? IINM, today, only Out_of_memory may be raised at GC safe points. Out_of_memory is a catastrophic exception that can't be handled cleanly; we're out of memory and anything that we do cannot allocate more. Are we proposing to add another exception that may arise out of GC safe points? CC @gadmm as he may understand the (expected) semantics of these well.

He suggested raising a discontinue exception to cause the thread to exit. However, the signature of this exception does not fit well with a simple "stop what you are doing now, period" signal.

I was thinking of introducing something similar to Out_of_memory, and in fact I have been raising Out_of_memory in my experiments.

Unfortunately, in the bytecode flavour, this reliably trips a siglongjmp sanity check under Linux with glibc.

@dustanddreams
Contributor Author

> > raising the exception in the STW callback will cause it to be taken care of immediately, which goes against the "no mutator code" STW requirement
>
> Curious whether this requirement is documented somewhere in the runtime.

See #13010 (comment)

@gasche
Member

gasche commented Mar 20, 2024

Discontinuing non-main domains with an exception

> However I can't get [raising an exception] to work at the moment (and also, raising the exception in the STW callback will cause it to be taken care of immediately, which goes against the "no mutator code" STW requirement).

I would try to do this by handling this "domain exit" as a "pending action" in the sense of the pending_action flag in signals.c, see the long note on "Pending asynchronous actions" and the caml_do_pending_actions_exn function.

I don't think you need to run a STW yourself to do this. You could presumably just set a global flag in the runtime ("we are in the process of shutting down"), then call caml_interrupt_all_signal_safe, and I think that this should result in caml_do_pending_actions being called on each domain on their next allocation/poll point. (This does not set action_pending on each domain, I'm not sure whether we would need to do that, I never played with this code myself.)

One question that arises is how to prevent a sort of race where you ask all current domains to shut down, but there are other domains being spawned at the same time that don't get the memo. Maybe this can be avoided by taking one of the locks on domain states that are needed to spawn (so as to prevent new domains from spawning), or by changing the spawning code to fail once we are in the "we are trying to shut down everyone" global state.
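
To make the spawn-versus-shutdown race concrete, here is a user-level analogue in OCaml 5 (a sketch under stated assumptions, not the runtime's spawning code). All names (lock, shutting_down, workers, try_spawn, shutdown) are hypothetical. The same lock guards both the "shutting down" flag and the act of spawning, so a spawn either completes before shutdown starts, and is then joined, or is refused.

```ocaml
(* User-level analogue only; this is not the runtime's spawning code. *)
let lock = Mutex.create ()
let shutting_down = ref false                  (* protected by [lock] *)
let workers : unit Domain.t list ref = ref []  (* protected by [lock] *)

(* Run [f] with [lock] held, releasing it even if [f] raises. *)
let with_lock f =
  Mutex.lock lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock lock) f

(* Spawn a worker unless shutdown has begun; report whether it was accepted. *)
let try_spawn (f : unit -> unit) : bool =
  with_lock (fun () ->
    if !shutting_down then false
    else begin
      workers := Domain.spawn f :: !workers;
      true
    end)

(* Refuse further spawns, then wait for every registered worker. *)
let shutdown () =
  let to_join =
    with_lock (fun () ->
      shutting_down := true;
      let ds = !workers in
      workers := [];
      ds)
  in
  List.iter Domain.join to_join

let () =
  let counter = Atomic.make 0 in
  let early = try_spawn (fun () -> Atomic.incr counter) in
  shutdown ();
  let late = try_spawn (fun () -> Atomic.incr counter) in
  Printf.printf "early=%b late=%b counter=%d\n" early late (Atomic.get counter)
```

This sketch waits for workers rather than interrupting them; how to make a running domain notice the flag is the separate question discussed above.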

Feature scope

I wonder what is the feature scope for the present PR. I think that you started with "valgrind should not report memory leaks due to uncollected domain structures on program exit", and now we are in the territory of "let's make a best effort to let OCaml code cleanup its own logical resources on program exit", which is a much harder problem.

I would propose two things moving forward:

  • for now let's gate any behavior change on this "resource cleanup at exit" flag; it is easy to set it up to test the new behavior, but not the default, so it's no big deal if we find out later that the changes were too large
  • if I were you I would propose to stick to the "all malloc-ed memory is freed by exit time" as a first, easier objective, and only consider larger changes if they make implementation easier (maybe they do!)

@gasche
Member

gasche commented Mar 20, 2024

(I edited my post above to discuss how the "let's shutdown everyone" decision should affect domains that are being spawned concurrently.)

@gadmm
Contributor

gadmm commented Apr 3, 2024

> > For the record, @damiendoligez suggested raising an exception to force the domain thread to exit in a more "natural" way and to allow proper cleanup.
>
> Can you elaborate on this idea? Does this mean that any GC safe point (allocation, poll points, etc) can raise an exception that the user code is expected to handle cleanly? IINM, today, only Out_of_memory may be raised at GC safe points. Out_of_memory is a catastrophic exception that can't be handled cleanly; we're out of memory and anything that we do cannot allocate more. Are we proposing to add another exception that may arise out of GC safe points? CC @gadmm as he may understand the (expected) semantics of these well.

Out of curiosity, do you know of a particular situation where minor allocations (or poll points) can raise Out_of_memory? When the minor GC cannot allocate a pool on the major heap, there is a fatal error. But there are many code paths, and I wonder if there are other code paths that can lead to Out_of_memory. If so, one could discuss whether this should be replaced with a fatal error to ensure that minor allocations and poll points are consistent in not raising Out_of_memory.

It has already been mentioned that asynchronous exceptions are a thing, but there has been ample discussion and research on actual code usage in the past couple of years, and there is no longer a debate about whether these exceptions are relied-upon in practice. Perhaps you mean to say that out of the box, this sort of exception does not arise. Indeed, async exceptions are currently opt-in. One of the requirements of our current work on async exceptions is that programs built with the assumption that there are no async exceptions should remain correct by keeping async exceptions opt-in. For this reason, the solution cannot be to interrupt threads by default.

With this in mind, I propose to split the idea of interruption in two. The default behaviour should be for the runtime running on domain 0 to wait for the completion of other domains, similarly to how the main systhread of a domain waits on other systhreads at termination. The programmer can then interrupt domains with asynchronous exceptions as a deliberate choice. (A variant of exit that exits all domains at once could also be used in some situations, assuming it remains opt-in for the programmer.)
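
A minimal sketch of this split, using only the OCaml 5 standard library: the main domain always waits with Domain.join, and interruption is something the programmer requests explicitly. Note that the sketch uses a cooperative stop flag rather than an asynchronous exception, which is a simpler technique than the one proposed; worker and run_and_stop are hypothetical names.

```ocaml
(* The worker polls a shared flag: a cooperative stop, not an asynchronous
   exception. It returns how many iterations it ran before stopping. *)
let worker stop () =
  let iterations = ref 0 in
  while not (Atomic.get stop) do
    incr iterations;
    Domain.cpu_relax ()
  done;
  !iterations

let run_and_stop () =
  let stop = Atomic.make false in
  let d = Domain.spawn (worker stop) in
  Atomic.set stop true;  (* explicit, opt-in interruption by the programmer *)
  Domain.join d          (* the main domain waits for the worker to finish *)

let () = Printf.printf "worker ran %d iterations\n" (run_and_stop ())
```

A flag only works for code that polls it; a worker blocked in a system call would need a different mechanism.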

Then, the support for interruption in OCaml should be developed to become more convenient and adapted to multicore. Part of this support can be delegated to specialized libraries. The cancellation tokens from memprof-limits can already serve for this purpose in some situations, and serve as a proof of concept. There are many ways in which support for interruption could be improved with better support from the runtime. Since cancellation becomes all the more important in a parallel setting, interruption by asynchronous exceptions is indeed deserving of attention from core developers.


I agree that there is an issue with domain termination order even without the "memory cleanup at exit" mode. I'd also be wary about introducing major differences between two modes.

By the way, the "memory cleanup at exit" mode already introduces semantic changes relative to the regular mode that make it unusable in practice, especially for library writers (see footnote 1). I was also left with the impression that this mode was unmaintained, cf. the conclusion of #10304.

So I think there is a problem to solve worthy of some efforts, but not because of the "memory cleanup at exit" mode, itself not worthy of too much effort.


Also, I think it would be beneficial for discussions in the OCaml github that they are not based on hearsay about private experiments in 3rd-party tools that have not yet been proposed for public discussion. It is not true that one could adapt Fun.protect to work with the mentioned approach. In fact, the compatibility path for the programs that already make use of asynchronous exceptions most likely requires explicit opt-in of the new feature by programmers. It cannot consist in a straight-up replacement of the current mechanism and so it is necessarily orthogonal to the current discussion.

Footnotes

  1. Some explanation: this is due to the choice of behaviour being all-or-nothing, and to changes in behaviour being as large as 1) some functions that otherwise did not need the runtime lock needed it in OCaml 4 (in particular for deallocation), and 2) incompatible clean-up semantics. In OCaml 5, replace 1) with: every caml_stat_alloc/caml_stat_free operation acquires a global lock. As a consequence, library writers using caml_stat_alloc must choose to write with one or the other behaviour in mind, and all libraries in a program must have relied on the same choice.

Contributor

@gadmm gadmm left a comment

In summary:

  • There indeed seems to be an issue with domain termination order, and not only in the "memory cleanup at exit" mode. It seems very preferable to have the same behaviour in the two modes. (Edit: this mainly concerns caml_shutdown, rather than caml_do_exit, but this PR only changes caml_do_exit.)
  • Interrupting domains with asynchronous exceptions as a default behaviour breaks an implicit rule that asynchronous exceptions are opt-in, which some want to see preserved.
  • The default behaviour could be to wait on the completion of all domains before tearing down the global runtime state.
  • Separately, support for interruption could be improved in the runtime.
  • The current PR could be adapted into adding some "exit_all_domains" primitive that the programmer calls explicitly if that's the right solution.

@kayceesrk
Contributor

> Out of curiosity, do you know of a particular situation where minor allocations (or poll points) can raise Out_of_memory? When the minor GC cannot allocate a pool on the major heap, there is a fatal error. But there are many code paths, and I wonder if there are other code paths that can lead to Out_of_memory. If so, one could discuss whether this should be replaced with a fatal error to ensure that minor allocations and poll points are consistent in not raising Out_of_memory.

I don't know the answer to the question of whether there is a divergence in the failure mode (Out_of_memory vs fatal error). I agree that making the behaviour consistent is a good idea. It would be useful to track this somewhere (perhaps not an issue because we don't know whether the issue even exists). Perhaps just document the prescribed behaviour in code / manual in the appropriate place?

@kayceesrk
Contributor

>   • The default behaviour could be to wait on the completion of all domains before tearing down the global runtime state.
>   • Separately, support for interruption could be improved in the runtime.
>   • The current PR could be adapted into adding some "exit_all_domains" primitive that the programmer calls explicitly if that's the right solution.

In the interest of not making this PR more complicated, I would split this into two steps.

  1. The default behaviour could be to wait on the completion of all domains before tearing down the global runtime state. (Current state of the PR)
  2. Make an issue to track exit_all_domains, because there are different ways this could be implemented.

Would this be acceptable? Also, it would be interesting to hear from @damiendoligez, who originally suggested the idea of tearing down by raising exceptions.

@damiendoligez
Member

Waiting for the completion of all domains seems the cleanest solution to me, as it avoids the many drawbacks of asynchronous termination. However, there is one caveat: the expected behavior of Sys.exit is to asynchronously terminate all domains and threads, and deallocate all the resources held by the program. Waiting for all domains is a big deviation from that and I expect a non-trivial amount of breakage.

I don't remember why I suggested raising an exception, probably as an attempt at implementing the asynchronous termination while allowing for in-program resource cleanup, but I think that was misguided, since an exception doesn't guarantee termination.
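
The point that an exception does not guarantee termination can be shown with a short sketch: a catch-all handler inside the worker's loop swallows the stop request and the loop carries on. Stop_requested and stubborn_worker are hypothetical names, and the exception is raised synchronously at a chosen iteration to keep the example deterministic.

```ocaml
(* Hypothetical stop signal. *)
exception Stop_requested

(* Runs [steps] iterations; a stop is "requested" at iteration [stop_at].
   Returns the number of iterations that completed their work. *)
let stubborn_worker ~steps ~stop_at =
  let completed = ref 0 in
  for i = 1 to steps do
    (try
       if i = stop_at then raise Stop_requested;
       incr completed
     with _ -> ())  (* catch-all handler swallows the stop request *)
  done;
  !completed

let () =
  Printf.printf "completed %d of 5 iterations\n"
    (stubborn_worker ~steps:5 ~stop_at:2)
```

Only the interrupted iteration is lost; the worker keeps running to the end of its loop.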

As a side note, I totally support making sure that Out_of_memory doesn't get raised asynchronously and killing the program with a fatal error instead.

@dustanddreams
Contributor Author

This looks like I opened a can of worms larger than expected, and there is no strong consensus on what to do.

I think we can all agree that the expected behaviour of a well-written Caml program using multiple domains is to make sure to Domain.join all of them during the final steps of the program.

If we can be bold and write somewhere that terminating the main thread while other domains are still running is a programming error, then the current intent of this PR, forcing remaining domains to pthread_exit by themselves, is a good enough way to ensure there will be no rogue threads remaining, and what needs to be done is some polishing to use something simpler than an STW rendezvous to achieve this.

If the behaviour of letting domain threads outlive the main thread is accepted, then this PR won't do and I'll withdraw it.

@c-cube
Contributor

c-cube commented Apr 5, 2024

> I think we can all agree that the expected behaviour of a well-written Caml program using multiple domains is to make sure to Domain.join all of them during the final steps of the program.

Please don't: it was not a documented invariant, and it's not a particularly nice one for people who are already trying to use domain pools. I've never had to make sure all threads were stopped before exiting the main thread, so why would I have to do it for domains (which are a necessity even for those of us who prefer to run threads, as it's necessary to run domains even just to dispatch threads on them)?

The domain interface is already not the greatest imho, I'd like it if it was not made even more complicated to deal with.

@gadmm
Contributor

gadmm commented Apr 5, 2024

@c-cube Threads are automatically joined when the main thread exits. Is this what you would like to see for domains?


One could distinguish calling exit from normal termination of the program. In POSIX, programmers have a choice between exit and pthread_exit (which can be used from the main thread to let other threads continue).

It makes sense to halt other domains inside caml_do_exit in whatever way to prevent them from accessing the runtime during shutdown, since they're going to be collected by POSIX exit anyway. This is consistent with OCaml 4 systhreads due to the runtime lock.

(From this point of view my suggestion of an exit_all_domains was redundant with normal exit.)

@c-cube
Contributor

c-cube commented Apr 5, 2024

@gadmm are they?

let () =
  let rd, _wr = Unix.pipe () in
  let _t =
    Thread.create
      (fun () ->
        Printf.printf "thread started…\n%!";
        let buf = Bytes.create 4 in
        let n = Unix.read rd buf 0 4 in
        Printf.printf "read %dB\n%!" n)
      ()
  in
  Thread.delay 0.1;
  ()

This prints "thread started" and exits after ~ 0.1s. The thread is not joined since it's in a blocking syscall that will never return. Am I missing something?

@gadmm
Contributor

gadmm commented Apr 6, 2024

Thanks for the example. I missed an important code path: regarding systhreads, domain termination is handled differently for domain 0 and other domains. In the case of domain 0, threads on this domain are not joined during termination.

Other points:

  • I do not see the race with caml_terminate_signals, AFAIU it only affects the signal stack of domain 0. But if there is an issue, this function call could also be made conditional on cleanup_on_exit since the point of doing it here is memory leak detection (according to 04118b0#r747341331 — a comment could be added here). But then I don't understand the point of having it inside caml_do_exit since it is already called inside caml_shutdown.
  • There is already a STW in CAML_RUNTIME_EVENTS_DESTROY(), conditional on runtime events being enabled. If a STW on exit becomes necessary, and the delay is an issue for short-lived programs, then perhaps a single "shutdown" STW could be used.

To go back to the core of this PR, it makes sense to have a conditional STW to stop all domains in the cleanup_on_exit mode for correctness.

Lastly, I am missing at this point the link with the scenario from the description of this PR:

> the caml runtime is embedded in a larger program which does not necessarily exit after running a caml program

There is likely something to do directly in caml_shutdown for this scenario. But this PR only affects caml_do_exit, and it always calls exit.

Contributor

@gadmm gadmm left a comment

I think caml_shutdown, rather than caml_do_exit, should be fixed. In the specific case of callers that are going to call exit, it makes sense to call pthread_exit in all threads that hold the domain lock of their domain to avoid interference with clean-up functions.

But I sense a need to go to the blackboard regarding the desired semantics of caml_shutdown, and also the cleanup_on_exit mode that relies on the pooled mode.

thread to terminate here. */
terminate_backup_thread(domain_self);
atomic_fetch_add(&caml_num_domains_running, -1);
pthread_exit(0);
Contributor

@gadmm gadmm Apr 6, 2024

If I understand correctly, this is dealing with other threads running on the same domain by not releasing the domain lock. Since the other threads cannot acquire the domain lock, they won't access the runtime state. Doesn't the same reasoning apply for the backup thread? As in, you do not need to terminate it?

Another potential bug (perhaps unrelated to this PR) is that threads calling caml_stat_free without the domain lock will race with the cleanup of the pools. For this bug specifically, rather than changing this PR to stop all threads (some of which might well be blocked actually), I think one should fix the caml_stat_* functions to prevent this race, if the pooled mode is to be retained.

Comment on lines +1650 to +1653
while (!caml_domain_alone()) {
  pthread_t myself = pthread_self();
  caml_try_run_on_all_domains(stw_terminate_domain, &myself, NULL);
}
Contributor

Suggested change
- while (!caml_domain_alone()) {
-   pthread_t myself = pthread_self();
-   caml_try_run_on_all_domains(stw_terminate_domain, &myself, NULL);
- }
+ pthread_t myself = pthread_self();
+ while (!caml_try_run_on_all_domains(stw_terminate_domain, &myself, NULL)) {};

Does this work? It avoids hijacking caml_num_domains_running.
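
The retry-until-success pattern in this suggestion can be sketched in isolation. This is a toy model, not the runtime's STW machinery: `leader_busy`, `try_run_exclusive`, `run_until_success` and `count_call` are hypothetical names, and the sketch assumes the "try" function returns nonzero only when the handler has actually run.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "leader" slot: true while some requester runs a section. */
static atomic_bool leader_busy = false;

typedef void (*section_handler)(void *data);

/* Runs handler(data) if the leader slot is free.
   Returns 1 if the handler ran, 0 if another requester held the slot. */
int try_run_exclusive(section_handler handler, void *data) {
  bool expected = false;
  if (!atomic_compare_exchange_strong(&leader_busy, &expected, true))
    return 0;
  handler(data);
  atomic_store(&leader_busy, false);
  return 1;
}

/* Retries until the handler has run; the loop condition is the return
   value itself, so no separate counter has to be inspected.
   Returns the number of attempts that were needed. */
int run_until_success(section_handler handler, void *data) {
  int attempts = 1;
  while (!try_run_exclusive(handler, data))
    attempts++;
  return attempts;
}

/* Example handler: increments the int that data points to. */
void count_call(void *data) {
  (*(int *)data)++;
}
```

The point of looping on the return value is that success is defined by the call itself rather than by a shared counter such as caml_num_domains_running.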

run domain_terminate() on their own, so we need to ask the backup
thread to terminate here. */
terminate_backup_thread(domain_self);
atomic_fetch_add(&caml_num_domains_running, -1);
Contributor

Suggested change
- atomic_fetch_add(&caml_num_domains_running, -1);

(see other comment)

@@ -191,6 +191,7 @@ CAML_RUNTIME_EVENTS_DESTROY();
#ifndef NATIVE_CODE
caml_debugger(PROGRAM_EXIT, Val_unit);
#endif
caml_terminate_all_domains();
Contributor

Note that caml_shutdown still runs OCaml code. After you have run pthread_exit from all the threads holding a domain lock, the runtime is probably in an inconsistent state, and trying to run OCaml code can go badly. If running OCaml code afterwards (even trivial functions), as if on a lone domain, is intended, I expect it to be brittle. Terminating all domains could be moved to caml_shutdown, taking into account the fact that it is also better not to introduce delays at exit.

But then it is not clear to me whether immediately terminating all domains is always the expected behaviour of caml_shutdown (e.g. for callers of caml_shutdown who do not call exit immediately thereafter). As it leaves threads waiting for their domain lock, it also fails to achieve the goal of cleaning up for programs embedding the OCaml runtime.

@@ -191,6 +191,7 @@ CAML_RUNTIME_EVENTS_DESTROY();
#ifndef NATIVE_CODE
caml_debugger(PROGRAM_EXIT, Val_unit);
#endif
caml_terminate_all_domains();
if (caml_params->cleanup_on_exit)
caml_shutdown();
Contributor

@gadmm gadmm Apr 6, 2024

I am not sure this call to caml_terminate_signals is problematic, but I also think it can be removed.

I am not sure I understand the point about programs embedding the ocaml runtime. These programs will indeed call caml_shutdown directly rather than caml_do_exit.

@kayceesrk
Contributor

#13010 (comment)

Thanks for the reply @damiendoligez.

@c-cube
Contributor

c-cube commented Apr 8, 2024

Thanks for the example. I missed an important code path: regarding systhreads, domain termination is handled differently for domain 0 and other domains. In the case of domain 0, threads on this domain are not joined during termination.

running this in a Domain.spawn doesn't seem to change anything, the program still exits. So, really, I don't think the current state of things is to join anything at exit.

Going back to the initial premise of this PR:

In the current state of the runtime, no effort is made to ensure that all domain threads exit at the end of the program execution; it relies upon the operating system to reap all the threads at program termination time.

that's great for the use case of embedding OCaml as a .so in another program, but if it breaks (in a horrible deadlock way) some already existing programs relying on the current behavior, I'm less than enthusiastic about it. And yes, this would break in a non-trivial way some of my programs that use domain pools and background threads.

@gadmm
Contributor

gadmm commented Apr 8, 2024

running this in a Domain.spawn doesn't seem to change anything, the program still exits. So, really, I don't think the current state of things is to join anything at exit.

To clarify: domain 1 waits for the threads to stop, whereas domain 0 waits for neither the threads nor the domains. To observe a difference with Domain.spawn, you need both domain 1 to wait on its threads and domain 0 to wait on domain 1, and only the former happens.

@gadmm
Contributor

gadmm commented Apr 8, 2024

To further clarify and sum up, I think a way forward is the following:

  • There seems to be an issue with domain termination order, and not just in the "memory cleanup at exit" mode.
  • The behaviour for caml_do_exit is to exit without waiting. In normal mode there is nothing to change, but for the "memory cleanup at exit" mode, this requires stopping other domains (by having their threads either exit or be stuck waiting for the domain lock).
  • To be discussed separately, the best behaviour of caml_shutdown for callers that do not immediately call exit might be to wait (but this might motivate splitting caml_shutdown into two versions already).
  • Separately, support for interruption could be improved in the runtime.

@c-cube
Contributor

c-cube commented Apr 8, 2024

To clarify: domain 1 waits for the threads to stop, whereas domain 0 waits neither for threads nor the domains.

Ah!!! That makes a lot more sense to me, thank you for clarifying.

@gasche gasche mentioned this pull request May 15, 2024
@dustanddreams
Contributor Author

I'm closing this PR for now. This comment sums up what needs to be done, and this PR does not fit. Thanks for the feedback, everyone! I'll try to come up with a better PR in the not-so-distant future.

6 participants