fix bad behaviour of caml_alloc_custom_mem
#12318
Conversation
The CI fails because tests/lib-runtime-events/test_instrumented now performs an extra minor GC on some systems (from 31 to 32 minor collections), and I am not sure why. Runtime events involve one custom allocation (for the "cursor"), but it is a structure of callback functions, well below the previous cutoff size for mem_minor.
Do you understand why the behavior of this test changes?
Otherwise, this seems very reasonable. The one thing that makes me uneasy is that you chose a perfectly arbitrary value that happens to be slightly above what this one workflow you care about needs.
I don't understand why we need a cutoff value at all. Currently there are two mechanisms for limiting the amount of out-of-heap memory held by custom objects in the minor heap:
- custom_minor_ratio, which causes a minor collection if minor-held out-of-heap memory reaches a certain factor of the minor heap size;
- custom_minor_max_size, which (with this PR) allocates custom objects on the major heap directly when their out-of-heap size is above a certain value.
Why have the second control at all? Would anything bad happen if we used an infinite size as the default value? The first control is more robust to applications that change the default minor heap size.
Thinking out loud: could we have a caml_alloc_custom_shr function that lets users be explicit that a custom block will be long-lived? It sounds more robust than a distinction based on the out-of-heap size.
Wouldn't any large custom block allocation cause an immediate minor collection? Since that would put it on the major heap anyway, might as well just put such custom blocks straight on the major heap.
Yes, instead of an "infinite limit", a better default may be […]. But on the other hand, I don't think that a dynamic policy that major-allocates blocks of out-of-heap size […] is what we want. For example, if a program somewhat-frequently allocates short-lived custom blocks whose out-of-heap size is a third of the out-of-heap minor size we configured, then we probably want to run a minor collection every three such allocations, instead of allocating the first two on the minor heap and then promoting all the remaining ones to the major heap as long as the minor heap is not full. (If we minor-collect more often, we will notice promptly that those large-ish custom blocks are free, keeping the out-of-heap space usage in check. If we major-promote instead, we either blow up the out-of-heap space of this workflow, or we force major collections to run more often, which is even more costly.)
This needs a Changes entry.
This PR made me realize that there is no documentation of caml_alloc_custom_mem in caml/custom.h. There is documentation of custom_minor_max_size in gc.mli, but that is not a natural place to look. Could you consider adding a documentation comment in custom.h?
I investigated a bit and found out that the standard library allocates 3 custom blocks with dependent memory just above 64KiB each: stdin, stdout, and stderr. In trunk, this forces the major GC to do a slice (and thus force a minor collection) before the program has even started. After that, collections are triggered at completely different points in the program, so it's hard to conclude anything from a difference of one minor collection.
If we did have a […]
Force-pushed from ee5ee5d to 24518e0.
```c
/* [caml_alloc_custom_mem] allocates a custom block with dependent memory
   (memory outside the heap that will be reclaimed when the block is
   finalized). If [mem] is greater than [custom_minor_max_size] (see gc.mli)
   the block is allocated directly in the major heap. */
CAMLextern value caml_alloc_custom_mem(const struct custom_operations * ops,
                                       uintnat size, /* size in bytes */
                                       mlsize_t mem /* memory consumed */);
```
According to gc.mli, mem and caml_minor_max_size are supposed to be in bytes. This should be documented; for example /* memory consumed */ could be /* dependent memory in bytes */.
(The behavior of the runtime is scale-independent with respect to these variables, but we should fix a convention to avoid ending up with different libraries using different units, and so that users know how to configure the maximum.)
@damiendoligez another suggestion to make the […]
@gasche the problem here is that […]. As it stands now, we know exactly what the bureaucracy blocks are: the cursor (2 blocks) and the […]. Another solution would be to run the loop twice with different numbers of iterations (e.g. 0 and 1000000) and print the difference.
Yes, that sounds more robust than your current approach?
It looks like something is happening on the "benchmarks" CI on this PR; I found out by chance. Looking at https://autumn.ocamllabs.io/ocaml/ocaml/pull/12318/base/trunk?worker=autumn&image=ocaml%2Fopam%3Adebian-11-ocaml-4.14#line-graph-projects-ocamlgraph/typing , you can fold the "binaries" pane (which is measuring the size of compiled objects) and unfold the "projects" pane (which is measuring the compilation time), and see that: […]
Do we know how to interpret those results? As far as I can tell, the only uses of […]. cc @Octachron
Probably noise? I'm guessing you're looking at the numbers from the tooltips on the graphs rather than the table above the graphs. Note how ocaml/typing went from 224 to 230 seconds on commit 24518e0: "add Changes entry and documentation comment".
Force-pushed from 02cf796 to 0ec5f03.
I've rebased and force-pushed, let's see if these benchmarks still look suspicious.
I don't think this is just noise, because the results are stable for several commits before the change and also after the change.
Force-pushed from 0ec5f03 to 17eb163.
@gasche you're right, it's not noise. I've run some benchmarks on my machine and got a clear slowdown. I'm investigating.
Force-pushed from 17eb163 to 5de8e09.
This is the current state of my investigations. Benchmarking this PR (and its sister #12322), I get these results for the typing phase of the compiler compiling ocamlgraph.1.8.8: […]
4.14.1 is the release […]. The top of the table is raw data, the bottom is average / median / standard deviation. There is a hole in the 4.14.2 data where I removed an obvious outlier. We can see that the slowdown in typing is real, but only exists in OCaml 5. It disappears if […]. I still don't fully understand the source of this slowdown.
I found the source of the slowdown. This is because […]. In (the initial version of) this PR, we compute it at the time of allocation in the minor heap but add it to the major GC counter when (and if) the custom block gets promoted. In practice that means every time, because #12439 wasn't merged yet. This is also wrong because the heap is growing fast: the workload factor turns out large because the heap is small, but by the time the block gets promoted the heap is much larger, and we apply a large factor to it, which forces the major GC to do a lot of work (essentially a complete cycle for each minor collection). There are 3 things that can hide the slowdown: […]
Once the problem is identified, it's not hard to fix: compute the GC acceleration factor when the custom block is promoted rather than when it is allocated in the minor heap. I've pushed a commit with this fix.
The explanation is clear, but relating it to the code is not easy for me -- this is out of my memory and I need to page it back in. Here is my current belief about the current state of the PR, last commit […]:
…ater back to work units using a different heap size. When the heap is growing, this unduly accelerates the major GC, slowing down the program.
Force-pushed from ca989ec to 7ae91bc.
All your points are perfectly accurate. I have now rebased and pushed the commit in question.
There are two uses of the max_major parameter in alloc_custom_gen:
```c
if (wosize <= Max_young_wosize && mem <= caml_custom_minor_max_bsz) {
    // put the block on the minor heap...
    /* Record the extra resources in case the block gets promoted. */
    add_to_custom_table (&Caml_state->minor_tables->custom, result,
                         mem, max_major);
  }
} else {
  // put the block on the major heap...
  caml_adjust_gc_speed(mem, max_major);
}
```
I understand that you want to use 0 for the first use, in add_to_custom_table. But the patch currently also results in using 0 for caml_adjust_gc_speed in the directly-on-the-major-heap case, and this feels wrong?
runtime/minor_gc.c (outdated)
```c
   previous cycle. [heap_size / 150] is really [heap_size *
   (2/3) / 100] (but faster). */
elt->max = caml_heap_size(Caml_state->shared_heap) / 150
           * caml_custom_major_ratio;
```
Instead of exposing caml_custom_major_ratio and redoing the weird computation with 150 here, I think that custom.h should export a mlsize_t caml_custom_current_max_major(void) function to be called here. It may even be possible to factor more logic with alloc_custom_gen by exporting a void caml_custom_adjust_speed_after_major_allocation(mlsize_t mem) function.
…computation from adjust_gc_speed whenever max = 0. This streamlines alloc_custom_gen and fixes a bug reported by @gasche.
@gasche I've combined changes suggested by your two remarks (thanks for finding a bug, BTW) and got to a state that I think is both correct and elegant.
runtime/custom.c (outdated)
```diff
@@ -84,30 +102,16 @@ CAMLexport value caml_alloc_custom(const struct custom_operations * ops,
                                    mlsize_t mem,
                                    mlsize_t max)
 {
-  return alloc_custom_gen (ops, bsz, mem, max, mem, max);
+  return alloc_custom_gen (ops, bsz, mem, max, max == 0 ? 1 : max);
```
I am not sure why there is a change here, but my guess is that you are trying to avoid running into the new semantics for max == 0 for the major heap by normalizing to 1 instead.
If this is indeed the intent, then there is a mistake: the parameter order is max_major, max_minor, so you are normalizing max_minor instead of max_major here.
Whatever the intent is, there is something strange, because max_minor == 0 is already normalized to 1 within the body of alloc_custom_gen, so the test here is superfluous.
I think it would be helpful to write this function as:
```c
mlsize_t max_major = ...;
mlsize_t max_minor = ...;
return alloc_custom_gen (ops, bsz, mem, max_major, max_minor);
```
to make it easy for the code reader to follow the intent of the code writer.
The intent was to expose the new meaning of max_major == 0, while avoiding a call to alloc_custom_mem with max_minor == 0 (which is redundant, as you noted). This is to change (for the better) the behavior of programs that pass (mem=1, max=0) to caml_alloc_custom instead of the usually intended (mem=0, max=1).
But that was incomplete. I have now changed the code to make caml_alloc_custom with max=0 behave exactly the same as caml_alloc_custom_mem.
…stom_mem when max is 0
The code is now clearer than before, better documented, and it works better. Approved.
fix bad behaviour of `caml_alloc_custom_mem` (cherry picked from commit b4308a0) (cherry picked from commit aa1b842892e1e7607727cda0613bd0737d16a450)
Cherry-picked for the GC performance fix for 5.1.1 as 1bbed4e.
I got an out-of-band report of a bad behaviour of alloc_custom_mem. It is triggered by a program that allocates many short-lived 64kB-sized bigarrays. As explained in #1738 (comment), the GC counts the first 8k to accelerate the minor GC and the rest to accelerate the major GC, because "the block is assumed to be long-lived". This means the major GC is running too fast and wasting a lot of time, because the assumption is false and these blocks never hit the major heap.

A bit of brainstorming with @stedolan brought us to these conclusions: […] custom_minor_max_size is probably too low.

This PR changes alloc_custom_mem to allocate directly in the major heap if the block's dependent memory is above custom_minor_max_size, and to use the full size of the dependent memory to accelerate the relevant GC. It also increases the default setting of custom_minor_max_size to an arbitrary value slightly above 64KiB.

There will be a sister PR to backport it to 4.14 because the conflicts are non-trivial, and a follow-up PR to fix a related TODO in the minor GC that's left over from the multicore work:
ocaml/runtime/minor_gc.c, line 573 at d2ccddc