New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix custom block promotion #12439
Fix custom block promotion #12439
Conversation
6ab4597
to
db2079a
Compare
I rebased on trunk to get rid of the conflict with #12445 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a GC expert, so this PR deserves a second approval, but I note that the new code is essentially the same that we have in 4.14:
Lines 403 to 415 in 6de2914
/* Run custom block finalisation of dead minor values */ | |
for (elt = Caml_state->custom_table->base; | |
elt < Caml_state->custom_table->ptr; elt++){ | |
value v = elt->block; | |
if (Hd_val (v) == 0){ | |
/* Block was copied to the major heap: adjust GC speed numbers. */ | |
caml_adjust_gc_speed(elt->mem, elt->max); | |
}else{ | |
/* Block will be freed: call finalization function, if any. */ | |
void (*final_fun)(value) = Custom_ops_val(v)->finalize; | |
if (final_fun != NULL) final_fun(v); | |
} | |
} |
This inspires much confidence. In contrast, the 5.0 code being replaced is... strange! "unconditionally promote custom blocks so accounting is correct" sounds like the wrong thing to do.
@kayceesrk : how can we move forward on this PR? Who could give a second review? To me it looks like an obvious fix to an obvious mistake, so let's give it some consideration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression is that what is going on here is fairly subtle, and indeed a parallel-minor-GC expert would be helpful to tell whether the change is correct. As far as I can see, the choice of the current behavior for custom finalizers was done by @ctk21 in ocaml-multicore/ocaml-multicore@c56f5d5, and the last person to touch this code was @sadiqj in a8fa947.
To start with, a basic question: where is the bug in the current code? Sure, promoting unconditionally sounds less efficient, but why would it trigger a failure in the testsuite? I haven't understood this point from the PR description or code.
Regarding the promotion logic, the point that is delicate, I think, is when to run the code to finalize custom blocks in the minor GC, if at all. The multicore-GC code started with a minor custom promotion logic very close to the 4.14 code, but at a place where that logic was incorrect: it was inside the part of the STW section where all domains promote their roots in parallel, and so at this point there was no guarantee that all live values had been promoted yet, and a custom block held alive by a different domain could be wrongly finalized. I believe that the "unconditional promotion" code was designed to be (less efficient but) safer / more clearly correct than the original multicore code.
The new code placement chosen by @damiendoligez looks more correct that the initial-multicore-GC placement -- it is after the second barrier of the minor GC, at a point where we know that all domains are done scanning their roots. But is it actually correct? (And why did the multicore people not think of putting it there, in that case?)
runtime/minor_gc.c
Outdated
@@ -612,6 +597,7 @@ void caml_empty_minor_heap_promote(caml_domain_state* domain, | |||
|| (get_header_val(vnew) != 0 && !Is_young(vnew))); | |||
} | |||
|
|||
struct caml_custom_elt *elt; | |||
for (elt = self_minor_tables->custom.base; | |||
elt < self_minor_tables->custom.ptr; elt++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the assertions in this debug-only code will not hold anymore, now that promotion of custom blocks has been moved to later in the minor GC logic. (The assertion will still not hold after the new promotion code, as it may now leave header-0 entries in the custom table which is cleared shortly after.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with this. Github UI seems to point to the wrong location. The assertions are here:
Lines 615 to 620 in bd952ad
for (elt = self_minor_tables->custom.base; | |
elt < self_minor_tables->custom.ptr; elt++) { | |
value vnew = elt->block; | |
CAMLassert (!Is_block(vnew) | |
|| (get_header_val(vnew) != 0 && !Is_young(vnew))); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may explain the CI failures.
runtime/minor_gc.c
Outdated
if (final_fun != NULL) final_fun(*v); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: while the code of caml_empty_minor_heap_promote
is a sprawling mess, the present function is neatly separated with all sub-phases called as auxiliary functions. Could you move this code to an auxiliary function to follow that style? I would suggest a static
function named custom_update_last_minor
, by analogy with caml_final_update_last_minor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with this one as well.
while the code of caml_empty_minor_heap_promote is a sprawling mess
The whole file needs spring cleaning and better organisation without making semantic changes. The code is hard to get into every time I come back to this file. For example, caml_empty_minor_heap_promote
is only called from within this file and should be made static (and caml_
should be dropped by convention). There are a few others such as these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall solution looks good to me. It restores the behaviour that we had in OCaml 4.
This assertion needs to go
Lines 615 to 620 in bd952ad
for (elt = self_minor_tables->custom.base; | |
elt < self_minor_tables->custom.ptr; elt++) { | |
value vnew = elt->block; | |
CAMLassert (!Is_block(vnew) | |
|| (get_header_val(vnew) != 0 && !Is_young(vnew))); | |
} |
There is no bug in the sense of "it crashes", but there's a bug in the sense of "the GC is doing useless work and fails to recover memory quickly enough". If memory serves:
It's the other way around (possibly): with this PR, finalization functions are called much earlier following the death of a custom block, so root registration problems like we had in #12436 may show up more easily. Not sure about this one. |
@xavierleroy ah, I somehow misread the original description of @damiendoligez as assuming this PR fixes an issue with intext_par. Thanks! |
db2079a
to
98f12c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the new commits. I believe the PR is correct and the code looks good, I would be in favor of merging.
There remains a small question about the runtime events related to this specific part of the minor heap code, I will discuss this with @damiendoligez right away.
Dead custom blocks in the minor heap are now finalized and not promoted to the major heap. (cherry picked from commit 210cc86) (cherry picked from commit d06f9165738be93ae61fbe75b0eaa9cf8cb60b73)
Cherry-picked as a part of the GC performance fix package for 5.1.1 as dcd29ea . |
Tying a loose end from the multicore work: the minor GC obviously shouldn't promote custom blocks that are dead, but should instead run their finalization functions.
PR based on #12436 (which is the first commit here)because it makes that bug much more likely to occur (triggers with about 10 to 20% probability intests/lib-marshal/intext_par.ml
).