Fix finaliser handover flakiness #12710
Conversation
gasche
left a comment
Thanks!
I don't know this part of the codebase so I would encourage other people to look at this as well.
Here is what I learned from your detailed comments:
- The major GC uses global atomic counters as a sort of "phase barrier": each domain increments the counter when it needs a particular kind of global work done, and we stay in the phase doing this work as long as the counter is non-zero. You added detailed documentation for these counters, which have slightly different specifications. (Are they incremented on creation of a new domain or not? How is termination handled? Etc.) Thanks!
- You are proposing to use a new counter with a different protocol to act as a similar "phase barrier" for orphaning finalisers. The change is very small in number of lines of code. It is interesting that this synchronization mechanism is not currently used for anything else in the major GC.
On the other hand, I still don't understand what is special about finaliser handover that makes it require this phase barrier. The synchronization logic for ephemerons, for example, seems simpler.
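The "phase barrier" pattern described above could be sketched as follows. This is a minimal illustrative model, not the actual runtime code; all names are made up:

```c
/* Sketch of the "phase barrier" pattern: each domain increments a
 * global atomic counter when it needs a particular kind of global
 * work done, and the GC stays in the corresponding phase as long as
 * the counter is non-zero. Names are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int num_domains_needing_work = 0;

/* Called by a domain that still needs work done in this phase. */
void domain_needs_work(void) {
  atomic_fetch_add(&num_domains_needing_work, 1);
}

/* Called by a domain once its share of the work is done. */
void domain_work_done(void) {
  atomic_fetch_sub(&num_domains_needing_work, 1);
}

/* The GC may only leave the phase once the counter drains to zero. */
bool phase_may_end(void) {
  return atomic_load(&num_domains_needing_work) == 0;
}
```

Under this protocol the counter only ever drains towards zero within a phase, which is what makes the "strictly decreasing" reasoning mentioned later in this thread possible.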
Several questions are about the motivation for the chosen constraints. I'll try to answer that here.

A design approach that I've been following for the multicore runtime is that, in places where performance is not critical, I tend to impose strong constraints on the state. The aim is that this leads to simpler code that does not have to handle corner cases which are rarely exercised. Note that I am not claiming that all the code in the runtime has been written according to this; it is an ideal that I am striving for. In the case of phase barriers, assuming that certain global counters are strictly decreasing makes reasoning about the concurrent program states easier. Any crash or failure where we see the counters grow will be an error.

The second, more challenging reason is that if the GC phase is not Phase_sweep_and_mark_main (lines 1837 to 1844 in ed7b382), this domain may or may not have "updated" its finalise-first or finalise-last finalisers, depending on which phase the GC is currently in. Domain-local variables track whether this update has happened. There are two cases.

With update: the logic to update the finalisers is already present in the runtime.

Without update: if the finalisers are orphaned without updating them, then we have to ensure that they're processed in the current cycle; UNMARKED objects on which the finalise-first finalisers are installed need to be MARKED in the current cycle. Otherwise, they'll become GARBAGE in the next cycle, which will lead to a crash.

It is possible that we could make this design, where finalisers can be orphaned in any phase, work. I suspect this would involve (a) more code than what we have now, (b) more cases to handle and reason about, and (c) potentially more bugs due to the extra code. It seems to me that this approach would neither give us better performance (as domain termination is a rare event) nor make reasoning easier. I tend to prefer the current approach.
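The "without update" hazard described above can be modelled in a few lines. This is a toy model with hypothetical names, not runtime code: an UNMARKED object carrying a finalise-first finaliser that is not marked before the current cycle ends becomes GARBAGE in the next cycle, so running its finaliser later would touch freed memory.

```c
/* Toy model of object colors across a major-cycle boundary. */
#include <stdbool.h>

typedef enum { UNMARKED, MARKED, GARBAGE } color;

/* At a cycle boundary, unmarked objects become garbage and marked
 * objects are reset to unmarked for the next cycle. */
color end_of_cycle(color c) {
  return (c == UNMARKED) ? GARBAGE : UNMARKED;
}

/* Adopting an orphaned finaliser without updating it is only safe if
 * the object it is installed on gets marked within the current
 * cycle; otherwise the object keeps its color and is swept. */
color adopt_without_update(color c, bool marked_this_cycle) {
  return marked_this_cycle ? MARKED : c;
}
```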
I don't follow this part of your explanation. On the other hand, I see why we want to ensure that the finalisers have been updated before orphaning them, and why calling caml_finish_major_cycle would achieve that.
The present post is a comment about my own remark on the fact that this PR introduces a new synchronisation mechanism that is a bit different from the "phase barriers" that already exist.

One issue with this mechanism is that it does not compose. If some other part of the runtime decided to introduce another such counter to stop the GC at a specific phase (say Phase_mark_final), the two constraints could conflict. This is not a blocker for the present PR, given that there are no plans to introduce another such "breakpoint" in the future. It does make me wonder if there are other ways to do this.
Orphaning wouldn't, but adoption might. Let us assume that we're implementing the "without update" solution described in #12710 (comment), and that there are 3 domains (dom0, dom1 and dom2) at the start of the major cycle. We can't leave the orphaned work unprocessed in the current cycle. Is that helpful?
Generally, I think it would make sense to take a global look at the phasing protocol and see whether we can do better. We have been accumulating these changes as we've developed Multicore OCaml, and they may not represent the ideal way to do things. Something like #12579.
Would this still not need a "phase barrier" variable?
This is an odd failure: https://github.com/ocaml/ocaml/actions/runs/6729852164/job/18291457670?pr=12710 I'll restart this job.
I was just reading this in passing, but I found this sentence a bit surprising:
I would have thought that such counters were intended to represent not literal domains with work to do, but partitions of the finalizers that remain to be processed. At the start of a cycle the number of partitions and domains is the same, but when one domain terminates, its partition is orphaned and adopted by a different domain. This action doesn't change the number of partitions that remain to be processed and so wouldn't affect the counter. Each domain would remember how many partitions its queue of work accounted for and would subtract that amount from the counter when it was done.
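The partition-based accounting suggested here could be sketched as follows. This is a hypothetical model, not the actual runtime code; all names are illustrative:

```c
/* Sketch: the global counter tracks partitions of finalisers still to
 * be processed, not domains. Adoption moves a partition between
 * domains without touching the counter; a domain subtracts however
 * many partitions it accounted for once its queue is done. */
#include <stdatomic.h>

static _Atomic int partitions_remaining = 0;

typedef struct {
  int partitions_held;  /* partitions this domain's queue accounts for */
} domain_state;

/* At the start of a cycle, each live domain owns one partition. */
void start_cycle(domain_state *d) {
  d->partitions_held = 1;
  atomic_fetch_add(&partitions_remaining, 1);
}

/* A terminating domain's partition is adopted: the counter is
 * unchanged, the adopter just remembers it holds more partitions. */
void adopt(domain_state *adopter, domain_state *orphan) {
  adopter->partitions_held += orphan->partitions_held;
  orphan->partitions_held = 0;
}

/* When a domain finishes its queue, it retires all partitions it held. */
void finish(domain_state *d) {
  atomic_fetch_sub(&partitions_remaining, d->partitions_held);
  d->partitions_held = 0;
}
```

As the following comment in the thread points out, the actual code instead merges the adopted finalisers into the adopting domain's own structure, so it does not track partition counts this way.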
I am still confused. I will try to give more details below, but maybe this sub-discussion point is not so important. My original question is: why does the implementation need to move the major GC to the phase Phase_sweep_and_mark_main? I understand #12710 (comment) as providing two distinct answers:
I understand (2) and I am willing to trust your intuition that the current approach is simpler. My understanding is that we put ourselves in the phase Phase_sweep_and_mark_main so that the orphaned finalisers can be processed in the current cycle. I do not understand (1), as I do not see in which situation we would need to increment the counter. My impression is that I misunderstood your remark about (1), but that this is not essential to understanding and reviewing the proposed change, so I am happy to drop this sub-thread anytime.
Unfortunately, the current code doesn't work this way. We merge the adopted partition (final info structure) into the adopting domain's one (lines 432 to 447 in ed7b382).
I gave this more thought. My conclusion is that I am interested in understanding the design better and proposing alternatives, but I think that this should not block the current PR and take a lock on KC's time. I think that the current PR is not my favorite new synchronization mechanism, but it solves the issue at hand, and it adds a lot of new valuable documentation; in short, it improves the codebase. I think we should focus on reviewing for correctness and clarity, merge, and then possibly discuss alternative mechanisms.
I agree with this statement. It may be useful to look at the phase barrier mechanisms holistically. The code currently carries the vestiges of its evolution, and there may be potential improvements. I'll try to add more documentation to the runtime in future PRs, which will help when we refactor the code in the future.
gasche
left a comment
I believe that the PR is correct. The trickiest part to review in detail was in fact the refactoring commit that you added at my suggestion :-)
The overall design of the change is of course more complex than the refactoring, but it is explained well in the extra documentation comments, which go further and document other parts of the major GC as well. Thanks!
(I would still be in favor of a slightly more explicit comment as to why we finish a major cycle when orphaning finalisers. I mentioned in a previous message today that I thought the reason was that "we want the finalisers to be updated at this point", but in fact this is not the case; on the contrary, the negation !f->updated_{first,last} holds after finishing the cycle. This suggests that I am still missing something.)
What would need to be done before merging?
cc @fabbing, @OlivierNicole: if one of you would like to have a look at this, it is not a bad way to get out of your fp+TSan comfort zone and learn about other aspects of the runtime.
@sadiqj has agreed to review this PR next week.
Either that, or disabling flaky tests is a way to trigger examination of the underlying problems. In the latter case, it's a winning strategy that we should apply more often... |
sadiqj
left a comment
This looks good to me. Also appreciate the more extensive docs on the num_* counters. Thanks @kayceesrk
Thanks @sadiqj! This is good to merge once rebased.
Thanks for the review. Let me rebase the changes.
When a domain terminates, the terminating domain's finalisers must be orphaned and adopted in the Phase_sweep_and_mark_main GC phase. We introduce a global counter [num_domains_orphaning_finalisers] to prevent the GC from proceeding past [Phase_sweep_and_mark_main] while finalisers are being orphaned.
Move orphaning code from domain.c to major_gc.c. This makes the code more modular.
fabbing
left a comment
I'm pretty far from being a GC expert but the changes seem reasonable to me. Also, the comments are a welcome addition!
I was planning to merge and add Fabrice as reviewer afterwards, but there is a small change to the Changes that might be intentional, so I prefer to let @kayceesrk sort it out. After that, I would be in favor of merging without waiting for the CI again -- if only the Changes are modified.
Includes minor edits addressing review comments.
Added @fabbing to reviewers in Changes. Fixed the other typo. CI is green. Merging.
This is a fix for #12345.
Problem
Finalisers can only be orphaned and adopted in `Phase_sweep_and_mark_main`. This simplifies invariants since finalisers are only processed in the two subsequent phases -- `Gc.finalise` (finalise first) finalisers in `Phase_mark_final` and `Gc.finalise_last` (finalise last) finalisers in `Phase_sweep_ephe`. The existing code attempts to get to `Phase_sweep_and_mark_main` by performing a `caml_finish_major_cycle` in `handover_finalisers` and assumes that, after the call to `caml_finish_major_cycle`, the GC is in `Phase_sweep_and_mark_main`. However, given that there may be incoming interrupts from other domains that concurrently switch phases, we may move past `Phase_sweep_and_mark_main`. This triggers the assertion failure.

Solution

The fix is in the first commit. It will be useful to review this PR commit by commit. We introduce a global counter `num_domains_orphaning_finalisers` that keeps track of terminating domains orphaning their finalisers. If such domains exist, then we prevent the GC phase from moving past `Phase_sweep_and_mark_main`. The orphaned finalisers are guaranteed to be adopted since there is already a check that only allows phase changes when there is no orphaned work.

The second commit is semantics-preserving and adds code comments for some of the global variables that are involved in phase changes. I thought it might be a good idea to add code comments while I'm actively looking at these details. It also does a bit of tidying up of the code.
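The core idea of the fix can be sketched in a few lines. This is a simplified single-threaded model, not the actual runtime code: only the counter name and the phase names come from the PR description, and the phase-change function is hypothetical.

```c
/* Sketch: a terminating domain increments
 * num_domains_orphaning_finalisers before orphaning its finalisers,
 * and the GC refuses to leave Phase_sweep_and_mark_main while the
 * counter is non-zero. */
#include <stdatomic.h>
#include <stdbool.h>

typedef enum {
  Phase_sweep_and_mark_main,
  Phase_mark_final,
  Phase_sweep_ephe
} gc_phase_t;

static _Atomic int num_domains_orphaning_finalisers = 0;
static gc_phase_t gc_phase = Phase_sweep_and_mark_main;

void begin_orphaning(void) {
  atomic_fetch_add(&num_domains_orphaning_finalisers, 1);
}

void end_orphaning(void) {
  atomic_fetch_sub(&num_domains_orphaning_finalisers, 1);
}

/* Returns false (phase unchanged) while some domain is orphaning. */
bool try_phase_change(gc_phase_t next) {
  if (gc_phase == Phase_sweep_and_mark_main &&
      atomic_load(&num_domains_orphaning_finalisers) > 0)
    return false;
  gc_phase = next;
  return true;
}
```

The orphaned work is then adopted before any phase change, since (per the description above) phase changes are already gated on there being no orphaned work.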
Testing
For testing, I followed @Octachron's advice of spawning a large number of concurrent tests. Without this fix, on the debug runtime, I regularly saw assertion failures. With the fix, I don't have the failures anymore.