runtime: ensure that finalizers are not called in [noexc] mode #11868

gasche · 2023-01-04T22:23:55Z

This is an attempt at fixing #11865. It is a small but invasive change to the runtime code, and we would need feedback from people familiar with shared_heap.c to tell if this may work.

The problem

#11865 is a deadlock caused by the execution of a finalizer during the scanning of minor GC roots. The reason that a finalizer might run during the minor GC root scanning is that moving live roots to the major heap will try to find "free slots" in the major heap, which will in some case try to sweep some dead values to get such free slots. When the values swept are custom blocks with a finalizer, sweeping calls the finalizer.

I suspect that calling finalizers from oldification code of the minor GC is adefect of the current runtime, that is bound to create many more issues than #11865. (In particular issues that are not specific to global roots, which are involved in the #11865 deadlock.) For example, it appears that currently the intern_alloc_obj function, which also calls caml_shared_try_alloc, could also call finalizers.

(A look at the 4.x runtime sources suggests that the 4.x runtime would not try to sweep existing dead blocks in these situations, it would extend the major heap right away.)

The proposed fix

The fix proposed here is to propagate a noexc boolean to all runtime functions that may eventually call a sweep function. (noexc is an existing runtime convention to say that a function is called in a setting where raising exceptions is not safe. In particular, if one may not raise exceptions then one cannot call finalizers.) When sweeping functions find a dead custom block with finalizer in noexc mode, the value is left as-is instead of being freed. Those values can only be freed by major-GC invocations that are not in noexc mode.

I could observe that this fixes the issue in #11865.

Remaining issues

This change may affect the dynamics of the major GC in non-trivial way, and we should test the performance impact carefully.

In particular, the test compaction_corner_case.ml seems to slow down massively due to this change (to the point that it appears to never terminate), and I currently have no idea why. (The rest of the tessuite passes.)

I have marked the PR as "draft" while I haven't found the cause of the compaction_corner_case slowdown, but design comments and reviews are still warmly welcome in the meantime.

This commit implements no functional change, it merely propagates a 'noexc' parameter from the shared-heap entry points to "sweep" functions -- which are in charge of potentially running finalizers. These 'noexc' parameters are currently not used.

bluddy · 2023-01-05T07:04:08Z

(A look at the 4.x runtime sources suggests that the 4.x runtime would not try to sweep existing dead blocks in these situations, it would extend the major heap right away.)

Is there a reason we can't just reuse the 4.x logic?

gasche · 2023-01-05T09:29:09Z

We could, but this also requires an invasive small change, and the performance implications would be similarly hard to predict, even harder than the current approach that preserves the current behavior in absence of custom blocks with finalisers.

gasche · 2023-01-05T09:31:00Z

Note: @xavierleroy pointed out to me that there is a difference between "C finalisers", which are added by some custom blocks, and "OCaml finalisers" which are added by Gc.finalise. The finalisers that are executed during sweeping are C finalisers, not OCaml finalisers.

gasche · 2023-01-05T10:10:14Z

On the multicore side, I think the relevant people to ping for feedback would be @kayceesrk and @ctk21.

gasche · 2023-01-05T13:16:34Z

@damiendoligez pointed out (from just an oral description of the PR) that it is not correct to just "skip" a garbage value during sweeping. I tried to change the code to "stop sweeping" on the first custom value with a finalizer, but this proved a bit too hard to someone unfamiliar with the sweeping logic. (I have the impression that pool_sweep expects to always sweep the whole pool, and I don't know what I would break if I stopped earlier. In particular there seems to be code assuming that if no sweeping work was done, then the sweeping phase has finished.)

What I did instead is to update this PR to avoid sweeping completely in noexc settings. This sounds simpler but the logic is also a bit tricky in the case of opportunistic major slices -- I'm not sure I handled them correctly.

The old implementation remains available on my remote fork:
trunk...gasche:ocaml:noexc-pool-sweep-old

The new version still behaves badly with the "compaction corner case" (the previous version was 200x slower than 5.00 on this code, the new version is only 100x slower, but something is still clearly wrong).

gasche · 2023-01-05T14:54:53Z

The slowdown in completion_corner_case is due to a quadratic behavior in the runtime if sweeping is disabled during minor collections. The corner_case code is very simple:

let c = ref []

let () =
  for i = 0 to 1000 do
    c := 0 :: !c;
    Gc.compact ()
  done

(In 5.x, Gc.compact () just means Gc.major ().)

So: a loop with an allocation followed by a major GC.

The quadratic behavior when sweeping is disabled during the minor GC is as follows;

at the beginning of the loop, there are no "available pools" in the major GC, only "unswept pools"
some loop iterations trigger a minor GC
- with the trunk code, this sweeps the existing "unswept pool" which becomes "available" for the minor GC
- with the PR code, no sweeping is allowed so a new pool is acquired
at the end of each iteration, the major GC cycles the heap and moves all available pools into "unswept pools"

With the trunk code, a single pool moves from unswept to available to unswept again during each iteration performing a minor GC.

With the PR code, each minor GC allocates a new pool that immediately becomes "unswept" and is never available for further allocations. Each major GC then has one more pool to sweep, resulting in quadratic behavior.

With an iteration count of 1000, trunk creates 23 pools in total, while the PR creates 1027.

gasche · 2023-01-05T14:59:27Z

Currently we are in a weird situation where the major GC cycles never make pools available to the allocator, only major slices and the minor GC make pools available. If we disable sweeping during minor GCs, and the user continually forces major cycles, we have lost.

I'm out of my league to fix this issue. I would expect a full major cycle (or whatever it is that Gc.compaction () do) to end with pools available, instead of being unavailable. Maybe it would be enough to just run a GC slice at the end of the cycle, rather than ending on heap cycling. But this feels like a hack, and I think that we need someone who actually knows the major GC to tell what a reasonable approach would be.

sadiqj · 2023-01-05T15:19:52Z

I'm not sure I understand why we can't sweep during opportunistic major slices. These are done before and after a domain has finished it's minor collection. I don't think there's a circumstance whereby you could do an opportunistic major slice while iterating over global roots?

gasche · 2023-01-05T15:31:38Z

I'm not fully sure, but my understanding is that during an opportunistic major slice, other domains may be still in their own minor GC. For example one domain may be in the middle of scanning the global roots.

Note that restoring sweeping in opportunistic major slices would not prevent the quadratic behavior in this test, which is single-domain and thus never performs opportunistic slices.

gadmm · 2023-01-05T17:33:02Z

I'll take a contrarian view from the description.

C (custom) finalizers can run outside of safe points, unlike OCaml finalizers. In exchange, custom finalizers should not "access the runtime" (they do not possess the runtime capability, in ownership parlance). They are meant to deal with external data and resources. In the original bug report, removing a global root is an access to the runtime (e.g. this borrows the runtime capability in the Rust-OCaml FFI). From this point of view, the original code has a programming error. This restriction was undocumented AFAIR, but clear when reading the runtime code already in OCaml 4 (for the few people doing so; I am not blaming the user).

The question is what to do about it; from this angle the choice is between keeping the restriction and trying to relax it. I do not yet understand what state of affairs this PR tries to achieve. What would be a high-level description of the new situation with this PR, in terms that can be explained to users? I suspect that if you want to allow runtime access inside custom finalizers, there is more work to do.

Possible alternative solutions:

Keep the restriction, fix the code from the issue. If keeping the restriction, there is a systematic fix that works for the reported issue and any other code in the wild that incorrectly accesses the runtime inside a custom finalizer: simply convert the custom finalizer into an OCaml finalizer, so that it is guaranteed to run at a safe point.
Keep the restriction, relax the requirements of caml_remove_[generational_]global_root. The latter two functions are in a unique situation in the sense that both can move from requiring the runtime access in OCaml 4 to having no special requirement in OCaml 5, thanks to the addition of a mutex in OCaml 5. In this case, fix the runtime so that it is always safe to call caml_remove_[generational_]global_root anytime (the discussion in the issue mentions using a recursive mutex, but there are likely other solutions involving minor changes to the runtime).
Remove the restriction entirely. A solution to "Keep It Simple" would be to treat C finalizers like OCaml finalizers and delay their execution to safe points, using the same queue as OCaml finalizers. (I did not look deeply into this solution, this might have drawbacks such as further delaying the reclamation of managed bigarrays.)

Solution 2. sounds sensible to me, because it is very accidental that the original code worked in OCaml 4. There are very few things one can do inside custom finalizers, and it is very accidental that remove_global_roots worked in this situation. There are probably few functions that would work accidentally, and I am not aware of other runtime functions being in situation of being relaxed to no longer require the runtime capability. In addition, it is desirable to be able to call these functions from custom finalizers. It is therefore a reasonable way to manage backwards-compatibility expectations.

gadmm · 2023-01-05T18:05:18Z

As for the documentation that could be clearer:

Note: the "finalize", "compare", "hash", "serialize" and "deserialize" functions attached to custom block descriptors must never trigger a garbage collection. Within these functions, do not call any of the OCaml allocation functions, and do not perform a callback into OCaml code. Do not use "CAMLparam" to register the parameters to these functions, and do not use "CAMLreturn" to return the result.

This is very informal speak for saying that it does not have the runtime capability; as we noticed in our work on the OCaml-Rust FFI, we can see the benefit of adding "runtime capability" or "domain capability" to the vocabulary because with it we see that the remove_global_root functions require it and we do not have it. There are probably other things one cannot do inside custom finalizers in OCaml 4 already, that are not listed in the documentation above. (It is also unclear that the argument to caml_alloc_final has the same restrictions, this could be clarified too.)

xavierleroy · 2023-01-05T18:52:09Z

custom finalizers should not "access the runtime" (they do not possess the runtime capability, in ownership parlance). They are meant to deal with external data and resources

I have sympathy for this viewpoint. Maybe the appropriate response to #11865 is to better document the restrictions on custom block operations, and to mention OCaml finalizers as a more flexible alternative to custom block finalizers.

tbrk · 2023-01-05T20:21:41Z

In our work on Sundials/ML, the permissibility of calling caml_remove_global_root within a C finalizer was based on the thread “Custom blocks and finalization” started by @mmottl on the caml-list over April and May 2009. There are only around 20 such instances in Sundials/ML. I will try @gadmm's solution no. 1, i.e., use OCaml finalizers, and link to the patch when I have a chance.

gasche · 2023-01-05T20:47:58Z

Excellent points, thanks @gadmm. My answers:

If we can easily enough enable calling C finalisers from safe points, I think we should consider doing it. (In particular: currently it is not possible to define custom values from OCaml, but it would be very nice to eventually do so and this is one of the requirements.)
In this PR my aim was rather to ensure that they are not called from the minor GC and some other "obviously unsafe" places. Now that you ask, I guess that I make a mental distinction between the minor GC and the major GC because (most of) the minor GC runs during a critical section, while the major GC is written to be safe against concurrent mutators. (So why not reentrant finalisers, I guess, was my thinking.)

I can see a use-case for calling root-registration functions from a C finaliser: when we have a C value that we want to pass around as an OCaml value, and may in turn refer internally to OCaml values.It seems relatively natural in this case to store the C value in an OCaml custom block, and register its OCaml children as roots, deleted by the custom finalization function.
I guess that your proposal in this case is to use, instead of the custom C finalisation function, an OCaml finaliser attached to the custom value. Is that right? (That sounds doable but also a somewhat artifical limitation to work around a runtime restriction.)

gasche · 2023-01-05T20:50:27Z

Note: the library boxroot that @gadmm and myself are working on tries hard to allow "remote" deallocation of boxroots, that is without holding the runtime capability, precisely because @gadmm observed that this is required in some FFI scenarios -- in our case when called from Rust destructors.

gadmm · 2023-01-06T01:26:45Z

@tbrk Nice find. What I do not understand, though, is that the thread says that one should probably not do so.

The discussion still reports a good use-case and the existence of programs relying on it (which did work in practice).

For example not being allowed to register/unregister roots to values seems overly restrictive, since global roots (e.g. for protecting OCaml-callback closures) sometimes need to be associated with C-values (e.g. for allowing some external C-library to efficiently perform callbacks into OCaml). Registering the finalizer from within C during allocation rather than having to wrap up the value a second time with an OCaml-finalizer would seem much simpler.

In relation to boxroot, yes, one should ideally not require the runtime capability to remove a root for various reasons (discussed in the paper). I used to believe that OCaml 5 improved on this point thanks to the mutex. But as the issue shows, OCaml 5 has not entirely lifted the restriction, so it would be nice to go the (easy) extra step. This would be a win-win.

For a bigger change such as delaying custom finalizers to a safe point, what is the use-case for defining custom values from OCaml? Please also take into account whether you have sufficient maintainance workforce to implement the change and carefully evaluate its effects. I might be able to review such a change, but please prioritize the collective time.

gasche · 2023-01-06T08:26:53Z

For a bigger change such as delaying custom finalizers to a safe point, what is the use-case for defining custom values from OCaml? Please also take into account whether you have sufficient maintainance workforce to implement the change and carefully evaluate its effects. I might be able to review such a change, but please prioritize the collective time.

I read this as Guillaume-speak for "there is no point in working on this now". I agree that it's not a priority, and even my intermediate fix here (which is not providing all the guarantees we want, but at least fixing the bug) is running into unexpected snags that I don't have the expertise to fix by myself. Closing.

I will restart discussing lifting the mutex restriction in the relevant issue thread., #11865.

To summarize the understanding of the runtime that I acquired during the work on this PR:

I don't see how to support sweeping a pool only partially. This means that we cannot easily implement restrictions to delay sweeping of certain values, except by not sweeping at all at this point. (It also means that pool sizes should remain small enough for latency; @gadmm already remarked that we needed small-ish pool to give memory back to the OS.)
The major GC has a design (that was unintuitive to me) where, if i understand correctly, a major cycle first sweeps and then marks. At the end of the cycle, which corresponds to the point where colors are cycled, marking is done but pools are all unswept. We thus crucially rely on the minor GC sweeping on-demand -- if we disabled this we would allocate at least one new pool per major GC cycle.

(The sweep-then-mark design is clearly explained in the "Retrofitting" paper, but I had forgotten about it since.)

tbrk · 2023-01-06T09:20:56Z

@gadmm: My reading of the thread is that calls to caml_remove_global_root were considered unproblematic in May 2009. @mmottl refers to explicit statements in the OCaml manual but the paragraph about finalizers in the 3.11.0 (26/11/2008) release is the same as the one you cite above. I don't want to build a legal case 🙂, just point out that the principle "C finalizers must not access any part of the runtime" did not necessarily hold up until now. The principle to date seems to have been "C finalizers must not allocate memory, make callbacks, or raise exceptions". The new principle may well be the better one, but I would be surprised if it does not require changes to a number of existing libraries. To make the required changes explicit, the previous example goes from

external c_wrap : 'a -> ptr = "mlwrap"
let make v = { payload = v; cptr = c_wrap v }

to

external c_finalize : ptr -> unit = "mlfinalize"
let make v =
  let cptr = c_wrap v in
  Gc.finalise c_finalize cptr;
  { payload = v; cptr }

This is not a big deal, but it is slightly more complicated. The finalizer can no longer be a static function hidden in the C file, but has to be mentioned across the C and OCaml files, the call to Gc.finalise must not be forgotten, and care must be taken with exceptions between a caml_alloc* in c_wrap and the call to Gc.finalize in make.

bluddy · 2023-01-06T09:40:44Z

Slightly orthogonal, but this reminds me that we desperately need design documents IMO. Published papers that don't stay up to date with the code do not cut it, I'm afraid. There's too much institutional knowledge kept in the heads of specific people. This would never be acceptable in a modern company setting.

gasche · 2023-01-06T10:43:19Z

Note: I parked my last iteration on this PR in a branch of its own:
https://github.com/gasche/ocaml/commits/noexc-pool-nosweep

The previous iteration:
https://github.com/gasche/ocaml/commits/noexc-pool-sweep-old

gadmm · 2023-01-06T19:02:26Z

Yes, it happened before that the documentation was incomplete and left users to make assumptions based on observed behaviour.

I read this as Guillaume-speak for "there is no point in working on this now".

No, it transparently said what it said (and asked a question).

gasche force-pushed the globroots-mutex branch from 1c2db0a to ff33f3b Compare January 4, 2023 22:24

gasche mentioned this pull request Jan 4, 2023

caml_remove_global_root in finalizer spins on roots_mutex #11865

Open

gasche added 2 commits January 5, 2023 14:12

runtime: do not try to sweep at all in [noexc] mode

6042263

Changes

967aad8

gasche force-pushed the globroots-mutex branch from ff33f3b to 967aad8 Compare January 5, 2023 13:12

gasche closed this Jan 6, 2023

gadmm mentioned this pull request Jan 9, 2023

Document the situation regarding remove_global_root inside custom finalizers #11876

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: ensure that finalizers are not called in [noexc] mode #11868

runtime: ensure that finalizers are not called in [noexc] mode #11868

gasche commented Jan 4, 2023

bluddy commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023 •

edited

Loading

sadiqj commented Jan 5, 2023

gasche commented Jan 5, 2023 •

edited

Loading

gadmm commented Jan 5, 2023 •

edited

Loading

gadmm commented Jan 5, 2023 •

edited

Loading

xavierleroy commented Jan 5, 2023

tbrk commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gadmm commented Jan 6, 2023 •

edited

Loading

gasche commented Jan 6, 2023

tbrk commented Jan 6, 2023

bluddy commented Jan 6, 2023

gasche commented Jan 6, 2023

gadmm commented Jan 6, 2023

runtime: ensure that finalizers are not called in [noexc] mode #11868

runtime: ensure that finalizers are not called in [noexc] mode #11868

Conversation

gasche commented Jan 4, 2023

The problem

The proposed fix

Remaining issues

bluddy commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023 • edited Loading

sadiqj commented Jan 5, 2023

gasche commented Jan 5, 2023 • edited Loading

gadmm commented Jan 5, 2023 • edited Loading

gadmm commented Jan 5, 2023 • edited Loading

xavierleroy commented Jan 5, 2023

tbrk commented Jan 5, 2023

gasche commented Jan 5, 2023

gasche commented Jan 5, 2023

gadmm commented Jan 6, 2023 • edited Loading

gasche commented Jan 6, 2023

tbrk commented Jan 6, 2023

bluddy commented Jan 6, 2023

gasche commented Jan 6, 2023

gadmm commented Jan 6, 2023

gasche commented Jan 5, 2023 •

edited

Loading

gasche commented Jan 5, 2023 •

edited

Loading

gadmm commented Jan 5, 2023 •

edited

Loading

gadmm commented Jan 5, 2023 •

edited

Loading

gadmm commented Jan 6, 2023 •

edited

Loading