
Conversation


@kayceesrk kayceesrk commented Nov 1, 2023

This is a fix for #12345.

Problem

Finalisers can only be orphaned and adopted in Phase_sweep_and_mark_main. This simplifies invariants since finalisers are only processed in the two subsequent phases -- Gc.finalise (finalise first) finalisers in Phase_mark_final and Gc.finalise_last (finalise last) finalisers in Phase_sweep_ephe. The existing code attempts to get to Phase_sweep_and_mark_main by calling caml_finish_major_cycle in handover_finalisers and assumes that, after the call, the GC is in Phase_sweep_and_mark_main. However, since other domains may concurrently request phase changes through incoming interrupts, we may move past Phase_sweep_and_mark_main. This triggers the assertion failure.

Solution

The fix is in the first commit; it will be useful to review this PR commit by commit. We introduce a global counter num_domains_orphaning_finalisers that tracks the number of terminating domains currently orphaning their finalisers. While this counter is non-zero, the GC phase is prevented from moving past Phase_sweep_and_mark_main. The orphaned finalisers are guaranteed to be adopted since there is already a check that only allows phase changes when there is no orphaned work.
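To make the protocol concrete, here is a minimal, self-contained sketch of how such a counter can gate the phase change. Only the name num_domains_orphaning_finalisers comes from this PR; the helper names, the exact placement of the increment/decrement, and the use of plain C11 atomics are illustrative assumptions, not the actual runtime code (which uses the runtime's own atomic types and lives in runtime/major_gc.c and runtime/domain.c).

/* Sketch only: helper names are hypothetical; the real implementation differs. */
#include <stdatomic.h>

static atomic_int num_domains_orphaning_finalisers;

/* A terminating domain announces that it is about to orphan finalisers
 * before it forces the major cycle, so no other domain can move the GC
 * past Phase_sweep_and_mark_main in the meantime. */
static void orphan_finalisers_sketch(void)
{
  atomic_fetch_add(&num_domains_orphaning_finalisers, 1);
  /* ... caml_finish_major_cycle(); hand the final structures over to the
   * global orphan list while in Phase_sweep_and_mark_main ... */
  atomic_fetch_sub(&num_domains_orphaning_finalisers, 1);
}

/* Guard used by the phase-change logic: the GC may only leave
 * Phase_sweep_and_mark_main when nobody is orphaning finalisers. */
static int may_leave_sweep_and_mark_main(void)
{
  return atomic_load(&num_domains_orphaning_finalisers) == 0;
}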

The second commit is semantics-preserving and adds code comments for some of the global variables that are involved in phase changes. I thought it would be a good idea to add these comments while I'm actively looking at the details. It also tidies up the code a bit.

Testing

For testing, I followed @Octachron's advice of spawning a large number of concurrent tests. Without this fix, on the debug runtime, I regularly saw assertion failures. With the fix, I no longer see the failures.


@gasche gasche left a comment


Thanks!

I don't know this part of the codebase so I would encourage other people to look at this as well.

Here is what I learned from your detailed comments:

  • The major GC uses global atomic counters as a sort of "phase barrier": each domain increments the counter when it needs a particular kind of global work done, and we stay in the phase doing this work as long as the counter is non-zero (see the sketch after this list).
    You added detailed documentation for these counters, which have slightly different specifications. (Are they incremented on creation of a new domain or not? How is termination handled? etc.) Thanks!

  • You are proposing to use a new counter with a different protocol to act as a similar "phase barrier" for orphaning finalisers. The change is very small in number of lines of code. It is interesting that this synchronization mechanism is not currently used for anything else in the major GC.
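As a rough illustration of the barrier pattern described in the first bullet, here is a self-contained sketch; every name in it is invented for the example, and it is not the runtime's actual code.

#include <stdatomic.h>

/* Illustrative "phase barrier": domains register the global work they need,
 * the GC stays in the phase while the counter is non-zero, and each domain
 * deregisters once its share of the work has been done. */
static atomic_int num_domains_needing_phase_work;

static void request_phase_work(void)   /* a domain needs some global work */
{
  atomic_fetch_add(&num_domains_needing_phase_work, 1);
}

static void phase_work_done(void)      /* that domain's work is finished */
{
  atomic_fetch_sub(&num_domains_needing_phase_work, 1);
}

static int may_leave_phase(void)       /* the phase may change only at zero */
{
  return atomic_load(&num_domains_needing_phase_work) == 0;
}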

On the other hand, I still don't understand what is special about finaliser handover that makes it require this phase barrier. The synchronization logic for ephemerons, for example, seems simpler.

@kayceesrk
Contributor Author

Several questions are about the motivation for the chosen constraints. I'll try to answer that here.

A design approach that I've been following for the multicore runtime is that, in places where performance is not critical, I tend to impose strong constraints on the state. The idea is to be able to

  1. constrain the state space explosion
  2. test such invariants with assertions that will easily trigger when the invariants are violated

The aim is that this leads to simpler code that does not have to handle corner cases which are rarely exercised. Note that I am not claiming that all the code in the runtime has been written according to this. This approach is an ideal that I am striving for.

In the case of phase barriers, assuming that certain global counters are strictly decreasing makes reasoning about the concurrent program states easier. Any crash or failure where we see the counters grow will be an error. In this particular issue, the counters num_domains_to_final_update_* have the property that they are strictly decreasing. If we allow finalisers to be adopted in phases other than Phase_sweep_and_mark_main then we have to handle the case that num_domains_to_final_update_* will increase (when the adopting domain has already "updated"¹ its finalise first or last finalisers). I would like to avoid this.

The second, more challenging reason is that if the GC phase is not Phase_sweep_and_mark_main at

ocaml/runtime/domain.c

Lines 1837 to 1844 in ed7b382

if (f->todo_head != NULL || f->first.size != 0 || f->last.size != 0) {
  /* have some final structures */
  if (caml_gc_phase != Phase_sweep_and_mark_main) {
    /* Force a major GC cycle to simplify constraints for
     * handing over finalisers. */
    caml_finish_major_cycle();
    CAMLassert(caml_gc_phase == Phase_sweep_and_mark_main);
  }

this domain may or may not have "updated" its finalise first or finalise last finalisers (depending on which phase the GC is currently in). The domain-local variables updated_first and updated_last capture this fact. Based on this, we may choose to go for a different implementation that orphans the finalisers as they are (with or without updating them). Let's explore both options.

With update: Note that the logic to update the finalisers is present in major_collection_slice. It seemed easiest to call caml_finish_major_cycle to finish the pending work and also complete the cycle. The original code assumed that we would get to Phase_sweep_and_mark_main, which provided stronger invariants for the handover. The current PR sticks to this approach and fixes the current bug.

Without update: If the finalisers are orphaned without updating them, then we have to ensure that they're processed in the current cycle; UNMARKED objects on which the finalise first finalisers are installed need to be MARKED in the current cycle. Otherwise, they'll become GARBAGE in the next cycle, which will lead to a crash. It is possible that we can make this design, where finalisers can be orphaned in any phase, work. I suspect it will involve (a) more code than what we have now, (b) more cases to handle and reason about, and (c) potentially more bugs due to the extra code. It seems to me that this approach is neither going to give us better performance (as domain termination is a rare event) nor make reasoning easier. I tend to prefer the current approach.

Footnotes

  1. In the sense of generic_final_update functions. See https://github.com/ocaml/ocaml/blob/ed7b3824a260ec20a694f5070f42a7215303f8bc/runtime/finalise.c#L52.


gasche commented Nov 1, 2023

In this particular issue, the counters num_domains_to_final_update_* have the property that they are strictly decreasing. If we allow finalisers to be adopted in phases other than Phase_sweep_and_mark_main then we have to handle the case that num_domains_to_final_update_* will increase (when the adopting domain has already "updated" its finalise first or last finalisers). I would like to avoid this.

I don't follow this part of your explanation. We are talking about {handover => orphan}_finalisers, which orphans the finalisers of a terminating domain -- adoption will happen later. I don't understand why orphaning would require incrementing num_domains_to_final_update_*.

On the other hand, I see why we want to ensure that the finalisers have been updated before orphaning them, and why calling caml_finish_major_cycle is a good way to do that. (I wondered if the terminating domain could in turn adopt new finalisers at this point, but no, adopt_orphaned_work guards against this by bailing out when domain_is_terminating.)


gasche commented Nov 1, 2023

This post is a comment on my own earlier remark that this PR introduces a new synchronisation mechanism, one that is a bit different from the "phase barriers" that already exist.

If we can think of the num_domains_to_* variables as "phase barriers" (they reach 0 when a global condition holds, allowing the GC automaton to move to a new state), then num_domains_orphaning_finalisers is not really a phase barrier but rather a "phase breakpoint" -- it ensures that the GC automaton stays blocked at a specific point until we disable it.

One issue with this mechanism is that it does not compose. If some other part of the runtime decided to introduce another such breakpoint at a different phase (say Phase_mark_final), then the two variables would work against each other. Setting the breakpoint and calling caml_finish_major_cycle would not guarantee that we end up in Phase_mark_final; we could be stuck in Phase_sweep_and_mark_main because some other domain concurrently tried to orphan its finalisers. And I don't see how the code would recover from this situation -- there is no obvious way to use this mechanism to say "play along with the other breakpoint request and then come back to do what I want here".

This is not a blocker for the present PR, given that there are no plans to introduce another such "breakpoint" in the future. It does make me wonder if there are other ways to do this.

For example: would it be possible to have caml_finish_major_cycle itself take care of orphaning the finalisers, and other things that need orphaning, when it runs on a terminating domain?

@kayceesrk
Contributor Author

I don't understand why orphaning would require incrementing num_domains_to_final_update_*.

Orphaning wouldn't, but adoption might.

Let us assume that we're implementing the without update solution described in #12710 (comment). Assume that there are 3 domains (dom0, dom1 and dom2) at the start of the major cycle. So num_domains_to_final_update_first will be 3. Let's say dom0 finishes updating its finalise-first finalisers. It will decrement num_domains_to_final_update_first to 2. Now let us also assume that both dom1 and dom2 terminate before updating their respective finalise-first finalisers. Both dom1 and dom2 will orphan their finalisers.

  • We can't leave num_domains_to_final_update_first at 2 since there will only be 1 domain (dom0) running after termination.
  • We can choose to decrement num_domains_to_final_update_first when a domain terminates. This will make num_domains_to_final_update_first 0 when both domains terminate. Now, when dom0 adopts the orphaned finalisers, it must increment num_domains_to_final_update_first to 1 so that the state is consistent (see the sketch below).
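To make the arithmetic above concrete, here is a tiny self-contained trace of that scenario; it is purely illustrative, the counter name is only borrowed from the discussion, and it is not taken from an actual implementation.

#include <assert.h>
#include <stdatomic.h>

/* Hypothetical trace of the scenario above: 3 domains at the start of the
 * cycle; dom1 and dom2 terminate before updating their finalise-first
 * finalisers; dom0 later adopts the orphaned work. */
static atomic_int num_domains_to_final_update_first;

int main(void)
{
  atomic_store(&num_domains_to_final_update_first, 3);      /* dom0, dom1, dom2 */

  /* dom0 updates its own finalise-first finalisers. */
  atomic_fetch_sub(&num_domains_to_final_update_first, 1);  /* now 2 */

  /* dom1 and dom2 terminate and orphan their finalisers without updating;
   * if we decremented on termination, the counter would reach 0 ... */
  atomic_fetch_sub(&num_domains_to_final_update_first, 1);  /* now 1 */
  atomic_fetch_sub(&num_domains_to_final_update_first, 1);  /* now 0 */

  /* ... and dom0, adopting the un-updated work, would have to re-increment
   * it, breaking the "strictly decreasing within a cycle" property. */
  atomic_fetch_add(&num_domains_to_final_update_first, 1);  /* back to 1 */

  assert(atomic_load(&num_domains_to_final_update_first) == 1);
  return 0;
}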

Is that helpful?

@kayceesrk kayceesrk force-pushed the fix_finaliser_handover_flakiness branch from 6075aea to d345acf on November 2, 2023
@kayceesrk
Contributor Author

If we can think of the num_domains_to_* variables as "phase barriers" (they reach 0 when a global condition holds, allowing the GC automaton to move to a new state), then num_domains_orphaning_finalisers is not really a phase barrier but rather a "phase breakpoint" -- it ensures that the GC automaton stays blocked at a specific point until we disable it.

Generally, I think it would make sense to take a global look at the phasing protocol and see whether we can do better. We have been accumulating these changes as we developed Multicore OCaml, and the current code may not represent the ideal way to do things. Something along the lines of #12579.

For example: would it be possible to have caml_finish_major_cycle itself take care of orphaning the finalisers, and other things that need orphaning, when it runs on a terminating domain?

Would this still not need a "phase barrier" variable?

@kayceesrk
Contributor Author

This is an odd failure: https://github.com/ocaml/ocaml/actions/runs/6729852164/job/18291457670?pr=12710

the file './ocamlopt' has not the right magic number: expected Caml1999X033, got 
make[2]: *** [Makefile:2324: ocamldoc/odoc_types.cmx] Error 127
make[2]: *** Waiting for unfinished jobs....
make[3]: Leaving directory '/home/runner/work/ocaml/ocaml'

I'll restart this job.


lpw25 commented Nov 2, 2023

I was just reading this in passing, but I found this sentence a bit surprising:

We can't leave num_domains_to_final_update_first at 2 since there will only be 1 domain (dom0) running after termination.

I would have thought that such counters were intended to represent not literal domains with work to do, but partitions of the finalizers that remain to be processed. At the start of a cycle the number of partitions and domains is the same, but when one domain terminates, its partition is orphaned and adopted by a different domain. This action doesn't change the number of partitions that remain to be processed and so wouldn't affect the counter. Each domain would remember how many partitions its queue of work accounted for and would subtract that amount from the counter when it was done.
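A rough sketch of what that partition-based accounting could look like; every name here is invented for illustration, and this is not the current runtime code.

#include <stdatomic.h>

/* Hypothetical partition-based accounting: the counter tracks outstanding
 * partitions of finalisers, not domains. */
static atomic_int num_partitions_to_final_update_first;

struct domain_state {
  int partitions_owned;  /* how many partitions this domain's queue accounts for */
};

/* Adopting an orphaned partition moves ownership but leaves the counter alone. */
static void adopt_partition(struct domain_state *d)
{
  d->partitions_owned += 1;
}

/* When a domain has finished updating its queue, it retires every partition
 * it accounted for in one go. */
static void finished_updating(struct domain_state *d)
{
  atomic_fetch_sub(&num_partitions_to_final_update_first, d->partitions_owned);
  d->partitions_owned = 0;
}

The point of this variant is that adoption moves ownership of a partition without touching the global counter, so the counter still only ever decreases within a cycle.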


gasche commented Nov 2, 2023

Is that helpful?

I am still confused. I will try to give more details below, but maybe this sub-discussion-point is not so important.

My original question is: why does the implementation need to move the major GC to the phase Phase_sweep_and_mark_main in the function orphan_finalisers that is tasked with orphaning (not adopting) finalisers of a terminating domain?

I understand #12710 (comment) as providing two distinct answers:

  1. This simplifies the specification and implementation of the num_domains_to_final_update_* phase barrier.
  2. Doing the orphaning at any point would encounter finalisers that are not updated yet, and this makes things more difficult (both updating on the spot and delaying the update to the adopting domain are more difficult than the current approach).

I understand (2) and I am willing to trust your intuition that the current approach is simpler. My understanding is that we put ourselves in the phase Phase_sweep_and_mark_main to ensure that the finalisers are updated before orphaning them.

I do not understand (1), as the situation where we would need to increment num_domains_to_final_update is related to adoption of finalisers, and not orphaning of finalisers. If we stopped calling caml_finish_major_cycle() in orphan_finalisers, we would change when orphaning happens, but not when adoption happens, and in particular I don't see how this would interact with num_domains_to_final_update.

My impression is that I misunderstood your remark about (1), but this is not essential to understanding and reviewing the proposed change, so I am happy to drop this sub-thread anytime.

@kayceesrk
Contributor Author

I would have thought that such counters were intended to represent not literal domains with work to do, but partitions of the finalizers that remain to be processed.

Unfortunately, the current code doesn't work this way. We merge the adopted partition (final info structure) into the adopting domain's own structure.

ocaml/runtime/major_gc.c

Lines 432 to 447 in ed7b382

if (f->todo_head) {
  if (myf->todo_tail == NULL) {
    CAMLassert(myf->todo_head == NULL);
    myf->todo_head = f->todo_head;
    myf->todo_tail = f->todo_tail;
  } else {
    myf->todo_tail->next = f->todo_head;
    myf->todo_tail = f->todo_tail;
  }
}
if (f->first.young > 0) {
  caml_final_merge_finalisable (&f->first, &myf->first);
}
if (f->last.young > 0) {
  caml_final_merge_finalisable (&f->last, &myf->last);
}


gasche commented Nov 2, 2023

For example: would it be possible to have caml_finish_major_cycle itself take care of orphaning the finalisers, and other things that need orphaning, when it runs on a terminating domain?

Would this still not need a "phase barrier" variable?

I gave this more thought. My conclusion is that I am interested in understanding the design better and proposing alternatives, but I think that this should not block the current PR and take a lock on KC's time. I think that the current PR is not my favorite new synchronization mechanism, but it solves the issue at hand, and it adds a lot of new valuable documentation; in short, it improves the codebase. I think we should focus on reviewing for correctness and clarity, merge, and then possibly discuss alternative mechanisms.


kayceesrk commented Nov 2, 2023

I do not understand (1), as the situation where we would need to increment num_domains_to_final_update is related to adoption of finalisers, and not orphaning of finalisers.

I agree with this statement. My point was that, on trunk, the num_domains_to_final_update_* counters strictly decrease within a cycle. My initial impression was that if we wanted to orphan and adopt finalisers in phases other than Phase_sweep_and_mark_main, then we might lose the strictly decreasing nature of the counters. That said, @lpw25's suggestion of thinking about the counter values as counting partitions rather than domains is a promising approach, as it retains the strictly decreasing nature of the counters.

It may be useful to look at the phase barrier mechanisms holistically. The code currently carries the vestiges of its evolution, and there may be room for improvement. I'll try to add more documentation to the runtime in future PRs, which will help when we refactor the code.


@gasche gasche left a comment


I believe that the PR is correct. The trickiest part to review in detail was in fact the refactoring commit that you added on my suggestion :-)
The overall design of the change is of course more complex than the refactoring, but it is explained well in the extra documentation comments, which go further and document other parts of the major GC as well. Thanks!

(I would still be in favor of a slightly more explicit comment as to why we finish a major cycle when orphaning finalisers. I mentioned in a previous message today that I thought the reason was that "we want the finalisers to be updated at this point", but in fact this is not the case; on the contrary, the negation !f->updated_{first,last} holds after finishing the cycle. This suggests that I am still missing something.)


gasche commented Nov 2, 2023

What would need to be done before merging?

  • We should decide whether we are happy with a non-expert-reviewer approval (mine), or we want to put extra eyeballs on this. I would let KC judge this. My intuition is that the change is safe enough that merging it right now is not a risk -- it leaves the codebase in a better state anyway.
  • If Leo wants to request some changes to the implementation, he will say so. (My current understanding is that he is okay with leaving improvements to future PRs.)
  • This needs a Changes entry.
  • I think that the PR should include a revert of my own #12708 ("disable the finaliser_handover test"). (It appears that disabling the flaky test a few days before the bug gets fixed was not a good choice. I have overestimated the extra time it would take to resolve this issue. A good problem to have.)

@kayceesrk kayceesrk force-pushed the fix_finaliser_handover_flakiness branch from d345acf to 4280943 on November 2, 2023

gasche commented Nov 2, 2023

cc @fabbing, @OlivierNicole : if one of you would like to have a look at this, it is not a bad way to get out of your fp+TSan comfort zone and learn about other aspects of the runtime.

@kayceesrk
Contributor Author

@sadiqj has agreed to review this PR next week.

@xavierleroy
Contributor

(It appears that disabling the flaky test a few days before the bug gets fixed was not a good choice. I have overestimated the extra time it would take to resolve this issue. A good problem to have.)

Either that, or disabling flaky tests is a way to trigger examination of the underlying problems. In the latter case, it's a winning strategy that we should apply more often...


@sadiqj sadiqj left a comment


This looks good to me. Also appreciate the more extensive docs on the num_* counters. Thanks @kayceesrk


gasche commented Nov 7, 2023

Thanks @sadiqj! This is good to merge once rebased.

@kayceesrk
Contributor Author

Thanks for the review. Let me rebase the changes.

When a domain terminates, the terminating domain's finalisers must be
orphaned and adopted in Phase_sweep_and_mark_main GC phase. We introduce
a global counter [num_domains_orphaning_finalisers] to prevent the GC
from proceeding past [Phase_sweep_and_mark_main] when orphaning
finalisers.
Move orphaning code from domain.c to major_gc.c. This makes the code
more modular.
@kayceesrk kayceesrk force-pushed the fix_finaliser_handover_flakiness branch from 4280943 to a343a53 on November 8, 2023

@fabbing fabbing left a comment


I'm pretty far from being a GC expert, but the changes seem reasonable to me. Also, the comments are a welcome addition!


@gasche gasche left a comment


I was planning to merge and add Fabrice as reviewer afterwards, but there is a small change to the Changes that might be intentional, so I prefer to let @kayceesrk sort it out. After that, I would be in favor of merging without waiting for the CI again -- if only the Changes are modified.

Includes minor edits addressing review comments.
@kayceesrk kayceesrk force-pushed the fix_finaliser_handover_flakiness branch from a343a53 to aa42332 on November 9, 2023
@kayceesrk
Contributor Author

Added @fabbing to reviewers in Changes. Fixed the other typo. CI is green. Merging.

