
fix bad behaviour of caml_alloc_custom_mem #12318

Merged: 9 commits merged into ocaml:trunk on Nov 13, 2023

Conversation

damiendoligez
Member

I got an out-of-band report of a bad behaviour of alloc_custom_mem. It is triggered by a program that allocates many short-lived 64kB-sized bigarrays. As explained in #1738 (comment), the GC counts the first 8k to accelerate the minor GC and the rest to accelerate the major GC, because "the block is assumed to be long-lived". This means the major GC is running too fast and wasting a lot of time because the assumption is false and these blocks never hit the major heap.

A bit of brainstorming with @stedolan brought us to these conclusions:

  1. The accounting should accurately reflect the actual allocations.
  2. If a block is assumed to be long-lived, it should be allocated in the major heap directly.
  3. The default value of custom_minor_max_size is probably too low.

This PR changes alloc_custom_mem to allocate directly in the major heap if the block's dependent memory is above custom_minor_max_size, and use the full size of dependent memory to accelerate the relevant GC. It also increases the default setting of custom_minor_max_size to an arbitrary value slightly above 64KiB.

There will be a sister PR to backport it to 4.14 because the conflicts are non-trivial, and a follow-up PR to fix a related TODO in the minor GC that's left over from the multicore work:

/* unconditionally promote custom blocks so accounting is correct */

@gasche gasche left a comment

The CI fails because tests/lib-runtime-events/test_instrumented now performs an extra minor GC on some systems (from 31 to 32 minor collections), and I am not sure why. Runtime events involve one custom allocation (for the "cursor"), but it is a structure of callback functions, well below the previous cutoff size for mem_minor.
Do you understand why the behavior of this test changes?

Otherwise, this seems very reasonable. The one thing that makes me uneasy is that you chose a perfectly arbitrary value that happens to be slightly above the block size of this one workflow that you care about.

I don't understand why we need a cutoff value at all. Currently there are two mechanisms for limiting the amount of out-of-heap memory held by custom objects in the minor heap:

  • custom_minor_ratio, which causes a minor collection if minor-held out-of-heap memory reaches a certain factor of the minor heap size
  • custom_minor_max_size, which (with this PR) allocates custom objects on the major heap directly when their out-of-heap size is above a certain value

Why have the second control at all? Would anything bad happen if we used an infinite size as the default value? The first control is more robust to applications that change the default minor heap size.

Thinking out loud: could we have a caml_alloc_custom_shr function that lets users be explicit that a custom block will be long-lived? It sounds more robust than a distinction based on the out-of-heap size.

@lpw25
Contributor

lpw25 commented Jun 22, 2023

Would anything bad happen if we used an infinite size as the default value?

Wouldn't any large custom block allocation cause an immediate minor collection? Since that would put it on the major heap anyway, might as well just put such custom blocks straight on the major heap.

@gasche
Member

gasche commented Jun 22, 2023

Yes, instead of "infinite limit", a better default may be minor-heap-size * custom-minor-heap-ratio (blocks this large cause a minor collection regardless of what is already on the minor heap).

But on the other hand, I don't think that a dynamic policy that major-allocates blocks whose out-of-heap size exceeds remaining-minor-heap-space * custom-minor-heap-ratio would be a good choice, which shows that these things are tricky indeed. (This would be another reasonable way to interpret your idea of major-allocating all custom blocks that would immediately cause a minor collection.)

For example, if a program somewhat frequently allocates short-lived custom blocks whose out-of-heap size is a third of the configured out-of-heap minor budget, then we probably want to run a minor collection every three such allocations, instead of allocating the first few on the minor heap and then promoting all the rest to the major heap as long as the minor heap is not full. (If we minor-collect more often, we notice promptly that those large-ish custom blocks are free, keeping out-of-heap space usage in check. If we major-promote instead, we either blow up the out-of-heap space usage of this workflow, or we force major collections to run more often, which is even more costly.)

@gasche gasche left a comment

This needs a Changes entry.

This PR made me realize that there is no documentation of caml_alloc_custom_mem in caml/custom.h. There is documentation of custom_minor_max_size in gc.mli, but that is not a natural place to look. Could you consider adding a documentation comment in custom.h?

@damiendoligez
Member Author

Do you understand why the behavior of this test changes?

I investigated a bit and found out that the standard library allocates 3 custom blocks with dependent memory just above 64KiB each: stdin, stdout, and stderr. In trunk, this forces the major GC to do a slice (and thus forces a minor collection) before the program has even started. After that, collections are triggered at completely different points in the program, so it's hard to conclude anything from a difference of one minor collection.

@gasche
Member

gasche commented Jun 28, 2023

If we did have a caml_custom_alloc_custom_mem_shr function to force allocation on the major heap, it would be nice to use it for std{in,out,err}.

  /* [caml_alloc_custom_mem] allocates a custom block with dependent memory
     (memory outside the heap that will be reclaimed when the block is
     finalized). If [mem] is greater than [custom_minor_max_size] (see gc.mli)
     the block is allocated directly in the major heap. */
  CAMLextern value caml_alloc_custom_mem(const struct custom_operations * ops,
                                         uintnat size, /* size in bytes */
                                         mlsize_t mem /* memory consumed */);
Member

According to gc.mli, mem and custom_minor_max_size are supposed to be in bytes. This should be documented; for example, /* memory consumed */ could be /* dependent memory in bytes */.

(The behavior of the runtime is scale-independent with respect to these variables, but we should fix a convention to avoid ending up with different libraries using different units, and so that users know how to configure the maximum.)

@gasche
Member

gasche commented Jun 30, 2023

@damiendoligez another suggestion to make the total_blocks count more accurate: use the Gc interface to count the number of "bureaucracy" words (between start () and the for loop, and after the for loop and read_cursor), and then remove those "bureaucracy words" from the printing count. The output we expect is that the allocations counted through runtime events are exactly the allocations in the loop -- and those we can compute robustly, right?

@damiendoligez
Member Author

@gasche the problem here is that Gc.quick_stat and Gc.counters both allocate and, depending on flambda, the allocated blocks may be promoted, which throws off the count reported by runtime events. Gc.minor_words would be a nice solution but in bytecode it allocates and in native code it's not accurate...

As it stands now, we know exactly what the bureaucracy blocks are: the cursor (2 blocks) and the Gc.stat record (4 blocks). When you add the 1000001 loop iterations with 2 blocks each, you get the right number of 2000008 blocks reported by runtime events.

Another solution would be to run the loop twice with different numbers of iterations (e.g. 0 and 1000000) and print the difference.

@gasche
Member

gasche commented Jun 30, 2023

Another solution would be to run the loop twice with different numbers of iterations (e.g. 0 and 1000000) and print the difference.

Yes, that sounds more robust than your current approach?

@gasche
Member

gasche commented Jul 3, 2023

It looks like something is happening on the "benchmarks" CI on this PR -- I found out by chance.

Looking at https://autumn.ocamllabs.io/ocaml/ocaml/pull/12318/base/trunk?worker=autumn&image=ocaml%2Fopam%3Adebian-11-ocaml-4.14#line-graph-projects-ocamlgraph/typing , you can fold the "binaries" pane (which is measuring the size of compiled objects) and unfold the "projects" pane (which is measuring the compilation time), and see that:

  • there is an increase in "parsing" time for menhir (from 0.7 to 1.2)
  • there is an increase in "parsing" and "typing" time for ocamlgraph (parsing: 0.5 to 0.8; typing: 4.3 to 4.7)

Do we know how to interpret those results?

As far as I can tell, the only uses of caml_alloc_custom_mem in the compiler that could be affected by this PR come from opening {input,output}_channel values. The stdlib will open 3 such files, and the compiler will open a few more as it compiles programs. Before, those would speed up the major GC; now they only provoke extra minor collections.

cc @Octachron

@damiendoligez
Member Author

Do we know how to interpret those results?

Probably noise? I'm guessing you're looking at the numbers from the tooltips on the graphs rather than the table above the graphs. Note how ocaml/typing went from 224 to 230 seconds on commit 24518e0: "add Changes entry and documentation comment".

@damiendoligez
Member Author

I've rebased and force-pushed, let's see if these benchmarks still look suspicious.

@gasche
Member

gasche commented Jul 3, 2023

I don't think this is just noise, because the results are stable for several commits before the change and also after the change.

@damiendoligez
Member Author

@gasche you're right, it's not noise. I've run some benchmarks on my machine and got a clear slowdown. I'm investigating.

@damiendoligez
Member Author

This is the current state of my investigations. Benchmarking this PR (and its sister #12322) I get these results for the typing phase of the compiler compiling ocamlgraph.1.8.8:

            4.14.1  4.14.2  4.14.2+PR  trunk  trunk+PR  trunk+60  trunk+PR+PR
            4.781   4.854   4.842      1.732  1.936     1.831     1.784
            4.806   4.982   4.938      1.716  1.934     1.783     1.815
            4.827   4.919   4.841      1.751  1.888     1.786     1.809
            4.883   4.901   4.891      1.851  1.877     1.789     1.822
            4.855   4.857   4.877      1.785  1.877     1.767     1.848
            4.960   4.852   4.855      1.761  1.872     1.765     1.818
            4.851   —       4.767      1.746  1.871     1.762     1.844
            4.873   4.962   4.873      1.795  1.872     1.747     1.846
            4.820   4.828   4.756      1.801  1.859     1.765     1.842
            4.828   4.806   4.761      1.866  1.845     1.738     1.849
    avg     4.848   4.885   4.840      1.780  1.883     1.773     1.828
    median  4.840   4.857   4.849      1.773  1.875     1.766     1.832
    stddev  0.050   0.060   0.061      0.049  0.030     0.026     0.022

  • 4.14.1 is the release
  • 4.14.2 is the current 4.14 branch (at 74fe398)
  • 4.14.2+PR is #12322 (based on 74fe398)
  • trunk is at 3301bdc
  • trunk+PR is this PR (based on 3301bdc)
  • trunk+60 is this PR with custom_minor_max_size reduced to 60kB
  • trunk+PR+PR is this PR plus #12439 (based on 3301bdc)

The top of the table is raw data, the bottom is average / median / standard deviation. There is a hole in the 4.14.2 data where I removed an obvious outlier.

We can see that the slowdown in typing is real, but only exists in OCaml 5. It disappears if custom_minor_max_size is smaller than the channel buffer size. #12439 also reduces the slowdown but doesn't eliminate it.

I still don't fully understand the source of this slowdown.

@damiendoligez damiendoligez marked this pull request as draft October 17, 2023 14:42
@damiendoligez
Member Author

I found the source of the slowdown.

This is because alloc_custom_mem computes a workload factor for the GC that gets added to extra_heap_resources. This factor is a percentage of the whole GC cycle work and, in the case of alloc_custom_mem, is relative to the size of the major heap. In trunk, it is computed at the time of allocation in the minor heap and immediately added to the major GC counter. This is wrong because it accelerates the major GC at the wrong time (and even if the custom block never gets promoted).

In (the initial version of) this PR, we compute it at the time of allocation in the minor heap but add it to the major GC counter when (and if) the custom block gets promoted; which means every time, because #12439 wasn't merged yet. This is also wrong because the heap is growing fast. The workload factor turns out large because the heap is small, but by the time the block gets promoted, the heap is much larger and we apply a large factor to it, which forces the major GC to do a lot of work (essentially a complete cycle for each minor collection).

There are 3 things that can hide the slowdown:

  • The compilation of ocamlgraph.1.8.8 stops early in trunk (after compiling the .mli files) because there are references to Pervasives in the source. Compiling .mli files uses a lot of channels. If we fix the ocamlgraph source, the slowdown is hidden in the noise of the time taken to compile .ml files. This means the 4.14 backport (#12322) will have to be fixed as well.

  • #12439 (Fix custom block promotion) also hides the problem: with it, almost none of these channels is promoted to the major heap.

  • #12678 (Free channel buffers on close rather than leaving them to the GC) also hides the problem by freeing the channel's buffer as soon as the channel is closed, instead of relying on the GC's finalization feature. Once channels are out of the picture, there's nothing in the compiler to trigger the problem (but some other programs could still suffer from it).

Once the problem is identified, it's not hard to fix: compute the GC acceleration factor when the custom block is promoted rather than when it is allocated in the minor heap. I've pushed a commit with this fix.

@damiendoligez damiendoligez marked this pull request as ready for review October 19, 2023 14:11
@gasche
Member

gasche commented Oct 19, 2023

The explanation is clear, but relating it to the code is not easy for me -- this is out of my memory and I need to page it back in.

Here is my current belief about the current state of the PR, as of the last commit (5de8e09):

  1. You say that you have pushed a commit to fix the issue, but you did not actually do it, and I am observing the old/previous version of the patch. (There are two hints about this: I cannot tell where in the patch is the fix you are describing, and also the patch is mostly identical to the 4.14-backport PR #12322, which you haven't updated yet.)

  2. Despite me guessing that this is the old version, it is hard for me to tell where the problem that you are talking about is in the code. In particular, the only obvious/explicit ratio computation occurring is the increment to extra_heap_resources_minor, which is done now and not at promotion time, so using the current values to compute the ratio feels correct.

  3. I think that the issue in the PR is not to compute a ratio too early, but to store the current max_major value in the custom table, which depends on the current heap size, instead of recomputing a correct/accurate max_major value at promotion time.

@damiendoligez
Member Author

All your points are perfectly accurate. I have now rebased and pushed the commit in question.

@gasche gasche left a comment

There are two uses of the max_major parameter in alloc_custom_gen.

  if (wosize <= Max_young_wosize && mem <= caml_custom_minor_max_bsz) {
    // put the block on the minor heap...
    /* Record the extra resources in case the block gets promoted. */
    add_to_custom_table (&Caml_state->minor_tables->custom, result,
                         mem, max_major);
  } else {
    // put the block on the major heap...
    caml_adjust_gc_speed(mem, max_major);
  }

I understand that you want to use 0 for the first use in add_to_custom_table. But the patch currently results in also using 0 for caml_adjust_gc_speed in the directly-on-the-major-heap case, and this feels wrong?

     previous cycle. [heap_size / 150] is really
     [heap_size * (2/3) / 100] (but faster). */
  elt->max = caml_heap_size(Caml_state->shared_heap) / 150
             * caml_custom_major_ratio;
Member

Instead of exposing caml_custom_major_ratio and redoing the weird computation with 150 here, I think that custom.h should export a mlsize_t caml_custom_current_max_major(void) function to be called here. It may even be possible to factor more logic with alloc_custom_gen by exporting a void caml_custom_adjust_speed_after_major_allocation(mlsize_t mem) function.

@gasche gasche self-assigned this Oct 24, 2023
computation from adjust_gc_speed whenever max = 0.
This streamlines alloc_custom_gen and fixes a bug reported by @gasche.
@damiendoligez
Member Author

@gasche I've combined changes suggested by your two remarks (thanks for finding a bug, BTW) and got to a state that I think is both correct and elegant.

runtime/custom.c
  @@ -84,30 +102,16 @@ CAMLexport value caml_alloc_custom(const struct custom_operations * ops,
                                     mlsize_t mem,
                                     mlsize_t max)
   {
  -  return alloc_custom_gen (ops, bsz, mem, max, mem, max);
  +  return alloc_custom_gen (ops, bsz, mem, max, max == 0 ? 1 : max);
Member

I am not sure why there is a change here, but my guess is that you are trying to avoid running into the new semantics for max == 0 for the major heap by normalizing to 1 instead.

If this is indeed the intent, then there is a mistake: the parameter order is max_major, max_minor, so you are normalizing max_minor instead of max_major here.

Whatever the intent is, there is something strange, because max_minor == 0 is already normalized to 1 within the body of alloc_custom_gen so the test here is superfluous.

I think it would be helpful to write this function as:

mlsize_t max_major = ...;
mlsize_t max_minor = ...;
return alloc_custom_gen (ops, bsz, mem, max_major, max_minor);

to make it easy for the code reader to follow the intent of the code writer.

Member Author

The intent was to expose the new meaning of max_major == 0, while avoiding a call to alloc_custom_gen with max_minor == 0 (which is redundant, as you noted).
This is to change (for the better) the behavior of programs that pass (mem=1, max=0) to caml_alloc_custom instead of the usually intended (mem=0, max=1).
But that was incomplete. I have now changed the code to make caml_alloc_custom with max=0 behave exactly the same as caml_alloc_custom_mem.

@gasche gasche left a comment

The code is now clearer than before, better documented, and it works better. Approved.

@gasche gasche merged commit b4308a0 into ocaml:trunk Nov 13, 2023
9 checks passed
damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Nov 21, 2023
damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Nov 21, 2023
Octachron pushed a commit that referenced this pull request Nov 28, 2023
fix bad behaviour of `caml_alloc_custom_mem`

(cherry picked from commit b4308a0)
(cherry picked from commit aa1b842892e1e7607727cda0613bd0737d16a450)
@Octachron
Member

Cherry-picked for the GC performance fix for 5.1.1 as 1bbed4e.

@damiendoligez damiendoligez deleted the fix-caml-alloc-custom-mem branch February 7, 2024 13:27