
fix bad behaviour of caml_alloc_custom_mem #12318

Merged: 9 commits merged into ocaml:trunk on Nov 13, 2023

Conversation

damiendoligez
Member

I got an out-of-band report of a bad behaviour of alloc_custom_mem. It is triggered by a program that allocates many short-lived 64kB-sized bigarrays. As explained in #1738 (comment), the GC counts the first 8k to accelerate the minor GC and the rest to accelerate the major GC, because "the block is assumed to be long-lived". This means the major GC is running too fast and wasting a lot of time because the assumption is false and these blocks never hit the major heap.

A bit of brainstorming with @stedolan brought us to these conclusions:

  1. The accounting should accurately reflect the actual allocations.
  2. If a block is assumed to be long-lived, it should be allocated in the major heap directly.
  3. The default value of custom_minor_max_size is probably too low.

This PR changes alloc_custom_mem to allocate directly in the major heap if the block's dependent memory is above custom_minor_max_size, and use the full size of dependent memory to accelerate the relevant GC. It also increases the default setting of custom_minor_max_size to an arbitrary value slightly above 64KiB.

There will be a sister PR to backport it to 4.14 because the conflicts are non-trivial, and a follow-up PR to fix a related TODO in the minor GC that's left over from the multicore work:

/* unconditionally promote custom blocks so accounting is correct */

@gasche gasche left a comment

The CI fails because tests/lib-runtime-events/test_instrumented now performs an extra minor GC on some systems (from 31 to 32 minor collections), and I am not sure why. Runtime events involve one custom allocation (for the "cursor"), but it is a structure of callback functions, well below the previous cutoff size for mem_minor.
Do you understand why the behavior of this test changes?

Otherwise, this seems very reasonable. The one thing that makes me uneasy is that you chose a perfectly arbitrary value that happens to be slightly above the block size of this one workflow that you care about.

I don't understand why we need a cutoff value at all. Currently there are two mechanisms for limiting the amount of out-of-heap memory held by custom objects in the minor heap:

  • custom_minor_ratio, which causes a minor collection if minor-held out-of-heap memory reaches a certain factor of the minor heap size
  • custom_minor_max_size, which (with this PR) allocates custom objects on the major heap directly when their out-of-heap size is above a certain value

Why have the second control at all? Would anything bad happen if we used an infinite size as the default value? The first control is more robust to applications that change the default minor heap size.

Thinking out loud: could we have a caml_alloc_custom_shr function that lets users be explicit that a custom block will be long-lived? It sounds more robust than a distinction based on the out-of-heap size.

@lpw25
Contributor

lpw25 commented Jun 22, 2023

Would anything bad happen if we used an infinite size as the default value?

Wouldn't any large custom block allocation cause an immediate minor collection? Since that would put it on the major heap anyway, might as well just put such custom blocks straight on the major heap.

@gasche
Member

gasche commented Jun 22, 2023

Yes, instead of "infinite limit", a better default may be minor-heap-size * custom-minor-heap-ratio (blocks this large cause a minor collection regardless of what is already on the minor heap).

But on the other hand, I don't think that a dynamic policy that major-allocates blocks whose out-of-heap size exceeds remaining-minor-heap-space * custom-minor-heap-ratio would be a good choice, which shows that these things are tricky indeed. (This would be another reasonable way to interpret your idea of major-allocating all custom blocks that would immediately cause a minor collection.)

For example, if a program somewhat frequently allocates short-lived custom blocks whose out-of-heap size is a third of the configured out-of-heap minor budget, then we probably want to run a minor collection every three such allocations, instead of allocating the first few on the minor heap and then promoting all the rest to the major heap as long as the minor heap is not full. (If we minor-collect more often, we notice promptly that those large-ish custom blocks are free, keeping out-of-heap space usage in check. If we major-promote instead, we either blow up the out-of-heap space usage of this workflow, or we force major collections to run more often, which is even more costly.)

@gasche gasche left a comment

This needs a Changes entry.

This PR made me realize that there is no documentation of caml_alloc_custom_mem in caml/custom.h. There is documentation of custom_minor_max_size in gc.mli, but that is not a natural place to look. Could you consider adding a documentation comment in custom.h?

@damiendoligez
Member Author

Do you understand why the behavior of this test changes?

I investigated a bit and found out that the standard library allocates 3 custom blocks with dependent memory just above 64KiB each: stdin, stdout, and stderr. In trunk, this forces the major GC to do a slice (and thus forces a minor collection) before the program has even started. After that, collections are triggered at completely different points in the program, so it's hard to conclude anything from a difference of one minor collection.

@gasche
Member

gasche commented Jun 28, 2023

If we did have a caml_custom_alloc_custom_mem_shr function to force allocation on the major heap, it would be nice to use it for std{in,out,err}.

  /* [caml_alloc_custom_mem] allocates a custom block with dependent memory
     (memory outside the heap that will be reclaimed when the block is
     finalized). If [mem] is greater than [custom_minor_max_size] (see gc.mli)
     the block is allocated directly in the major heap. */
  CAMLextern value caml_alloc_custom_mem(const struct custom_operations * ops,
                                         uintnat size, /* size in bytes */
                                         mlsize_t mem /* memory consumed */);
Member

According to gc.mli, mem and custom_minor_max_size are supposed to be in bytes. This should be documented; for example, /* memory consumed */ could be /* dependent memory in bytes */.

(The behavior of the runtime is scale-independent with respect to these variables, but we should fix a convention to avoid ending up with different libraries using different units, and so that users know how to configure the maximum.)

@gasche
Member

gasche commented Jun 30, 2023

@damiendoligez another suggestion to make the total_blocks count more accurate: use the Gc interface to count the number of "bureaucracy" words (between start () and the for loop, and after the for loop and read_cursor), and then remove those "bureaucracy words" from the printing count. The output we expect is that the allocations counted through runtime events are exactly the allocations in the loop -- and those we can compute robustly, right?

@damiendoligez
Member Author

@gasche the problem here is that Gc.quick_stat and Gc.counters both allocate and, depending on flambda, the allocated blocks may be promoted, which throws off the count reported by runtime events. Gc.minor_words would be a nice solution but in bytecode it allocates and in native code it's not accurate...

As it stands now, we know exactly what the bureaucracy blocks are: the cursor (2 blocks) and the Gc.stat record (4 blocks). When you add the 1000001 loop iterations with 2 blocks each, you get the right number of 2000008 blocks reported by runtime events.

Another solution would be to run the loop twice with different numbers of iterations (e.g. 0 and 1000000) and print the difference.

@gasche
Member

gasche commented Jun 30, 2023

Another solution would be to run the loop twice with different numbers of iterations (e.g. 0 and 1000000) and print the difference.

Yes, that sounds more robust than your current approach?

@gasche
Member

gasche commented Jul 3, 2023

It looks like something is happening on the "benchmarks" CI on this PR -- I found out by chance.

Looking at https://autumn.ocamllabs.io/ocaml/ocaml/pull/12318/base/trunk?worker=autumn&image=ocaml%2Fopam%3Adebian-11-ocaml-4.14#line-graph-projects-ocamlgraph/typing , you can fold the "binaries" pane (which is measuring the size of compiled objects) and unfold the "projects" pane (which is measuring the compilation time), and see that:

  • there is an increase in "parsing" time for menhir (from 0.7 to 1.2)
  • there is an increase in "parsing" and "typing" time for ocamlgraph (parsing: 0.5 to 0.8; typing: 4.3 to 4.7)

Do we know how to interpret those results?

As far as I can tell, the only uses of caml_alloc_custom_mem in the compiler that could be affected by this PR come from opening {input,output}_channel values. The stdlib will open 3 such files, and the compiler will open a few more as it compiles programs. Before, those would speed up the major GC; now they only provoke extra minor collections.

cc @Octachron

@damiendoligez
Member Author

Do we know how to interpret those results?

Probably noise? I'm guessing you're looking at the numbers from the tooltips on the graphs rather than the table above the graphs. Note how ocaml/typing went from 224 to 230 seconds on commit 24518e0: "add Changes entry and documentation comment".

@damiendoligez
Member Author

I've rebased and force-pushed, let's see if these benchmarks still look suspicious.

@gasche
Member

gasche commented Jul 3, 2023

I don't think this is just noise, because the results are stable for several commits before the change and also after the change.

@damiendoligez
Member Author

@gasche you're right, it's not noise. I've run some benchmarks on my machine and got a clear slowdown. I'm investigating.

@damiendoligez
Member Author

This is the current state of my investigations. Benchmarking this PR (and its sister #12322) I get these results for the typing phase of the compiler compiling ocamlgraph.1.8.8:

            4.14.1  4.14.2  4.14.2+PR  trunk  trunk+PR  trunk+60  trunk+PR+PR
            4.781   4.854   4.842      1.732  1.936     1.831     1.784
            4.806   4.982   4.938      1.716  1.934     1.783     1.815
            4.827   4.919   4.841      1.751  1.888     1.786     1.809
            4.883   4.901   4.891      1.851  1.877     1.789     1.822
            4.855   4.857   4.877      1.785  1.877     1.767     1.848
            4.960   4.852   4.855      1.761  1.872     1.765     1.818
            4.851   —       4.767      1.746  1.871     1.762     1.844
            4.873   4.962   4.873      1.795  1.872     1.747     1.846
            4.820   4.828   4.756      1.801  1.859     1.765     1.842
            4.828   4.806   4.761      1.866  1.845     1.738     1.849
    avg     4.848   4.885   4.840      1.780  1.883     1.773     1.828
    median  4.840   4.857   4.849      1.773  1.875     1.766     1.832
    stddev  0.050   0.060   0.061      0.049  0.030     0.026     0.022

  • 4.14.1 is the release
  • 4.14.2 is the current 4.14 branch (at 74fe398)
  • 4.14.2+PR is #12322 (based on 74fe398)
  • trunk is at 3301bdc
  • trunk+PR is this PR (based on 3301bdc)
  • trunk+60 is this PR with custom_minor_max_size reduced to 60kB
  • trunk+PR+PR is this PR plus #12439 (based on 3301bdc)

The top of the table is raw data, the bottom is average / median / standard deviation. There is a hole in the 4.14.2 data where I removed an obvious outlier.

We can see that the slowdown in typing is real, but only exists in OCaml 5. It disappears if custom_minor_max_size is smaller than the channel buffer size. #12439 also reduces the slowdown but doesn't eliminate it.

I still don't fully understand the source of this slowdown.

@damiendoligez damiendoligez marked this pull request as draft October 17, 2023 14:42
@damiendoligez
Member Author

I found the source of the slowdown.

This is because alloc_custom_mem computes a workload factor for the GC that gets added to extra_heap_resources. This factor is a percentage of the whole GC cycle work and, in the case of alloc_custom_mem, is relative to the size of the major heap. In trunk, it is computed at the time of allocation in the minor heap and immediately added to the major GC counter. This is wrong because it accelerates the major GC at the wrong time (and even if the custom block never gets promoted).

In (the initial version of) this PR, we compute it at the time of allocation in the minor heap but add it to the major GC counter when (and if) the custom block gets promoted; which means every time, because #12439 wasn't merged yet. This is also wrong because the heap is growing fast. The workload factor turns out large because the heap is small, but by the time the block gets promoted, the heap is much larger and we apply a large factor to it, which forces the major GC to do a lot of work (essentially a complete cycle for each minor collection).

There are 3 things that can hide the slowdown:

  • The compilation of ocamlgraph.1.8.8 stops early in trunk (after compiling the .mli files) because there are references to Pervasives in the source. Compiling .mli files uses a lot of channels. If we fix the ocamlgraph source, the slowdown is hidden in the noise of the time taken to compile .ml files. This means the 4.14 backport (#12322) will have to be fixed as well.

  • #12439 (Fix custom block promotion) also hides the problem: with it, almost none of these channels is promoted to the major heap.

  • #12678 (Free channel buffers on close rather than leaving them to the GC) also hides the problem by freeing the channel's buffer as soon as the channel is closed, instead of relying on the GC's finalization feature. Once channels are out of the picture, there's nothing in the compiler to trigger the problem (but some other programs could still suffer from it).

Once the problem is identified, it's not hard to fix: compute the GC acceleration factor when the custom block is promoted rather than when it is allocated in the minor heap. I've pushed a commit with this fix.

@damiendoligez damiendoligez marked this pull request as ready for review October 19, 2023 14:11
@gasche
Member

gasche commented Oct 19, 2023

The explanation is clear, but relating it to the code is not easy for me -- this is out of my memory and I need to page it back in.

Here is my current belief about the current state of the PR, as of the last commit (5de8e09):

  1. You say that you have pushed a commit to fix the issue, but you did not actually do it, and I am observing the old/previous version of the patch. (There are two hints about this: I cannot tell where in the patch is the fix you are describing, and also the patch is mostly identical to the 4.14-backport PR #12322, which you haven't updated yet.)

  2. Despite me guessing that this is the old version, it is hard for me to tell where the problem that you are talking about is in the code. In particular, the only obvious/explicit ratio computation occurring is the increment to extra_heap_resources_minor, which is done now and not at promotion time, so using the current values to compute the ratio feels correct.

  3. I think that the issue in the PR is not to compute a ratio too early, but to store the current max_major value in the custom table, which depends on the current heap size, instead of recomputing a correct/accurate max_major value at promotion time.

@damiendoligez
Member Author

All your points are perfectly accurate. I have now rebased and pushed the commit in question.

@gasche gasche left a comment

There are two uses of the max_major parameter in alloc_custom_gen.

  if (wosize <= Max_young_wosize && mem <= caml_custom_minor_max_bsz) {
    // put the block on the minor heap...
    /* Record the extra resources in case the block gets promoted. */
    add_to_custom_table (&Caml_state->minor_tables->custom, result,
                         mem, max_major);
  } else {
    // put the block on the major heap...
    caml_adjust_gc_speed(mem, max_major);
  }

I understand that you want to use 0 for the first use in add_to_custom_table. But the patch currently results in also using 0 for caml_adjust_gc_speed in the directly-on-the-major-heap case, and this feels wrong?

     previous cycle. [heap_size / 150] is really
     [heap_size * (2/3) / 100] (but faster). */
  elt->max = caml_heap_size(Caml_state->shared_heap) / 150
             * caml_custom_major_ratio;
Member

Instead of exposing caml_custom_major_ratio and redoing the weird computation with 150 here, I think that custom.h should export a mlsize_t caml_custom_current_max_major(void) function to be called here. It may even be possible to factor more logic with alloc_custom_gen by exporting a void caml_custom_adjust_speed_after_major_allocation(mlsize_t mem) function.

@gasche gasche self-assigned this Oct 24, 2023
computation from adjust_gc_speed whenever max = 0.
This streamlines alloc_custom_gen and fixes a bug reported by @gasche.
@damiendoligez
Member Author

@gasche I've combined changes suggested by your two remarks (thanks for finding a bug, BTW) and got to a state that I think is both correct and elegant.

runtime/custom.c
  @@ -84,30 +102,16 @@ CAMLexport value caml_alloc_custom(const struct custom_operations * ops,
                                     mlsize_t mem,
                                     mlsize_t max)
   {
  -  return alloc_custom_gen (ops, bsz, mem, max, mem, max);
  +  return alloc_custom_gen (ops, bsz, mem, max, max == 0 ? 1 : max);
Member

I am not sure why there is a change here, but my guess is that you are trying to avoid running into the new semantics for max == 0 for the major heap by normalizing to 1 instead.

If this is indeed the intent, then there is a mistake: the parameter order is max_major, max_minor, so you are normalizing max_minor instead of max_major here.

Whatever the intent is, there is something strange, because max_minor == 0 is already normalized to 1 within the body of alloc_custom_gen so the test here is superfluous.

I think it would be helpful to write this function as:

mlsize_t max_major = ...;
mlsize_t max_minor = ...;
return alloc_custom_gen (ops, bsz, mem, max_major, max_minor);

to make it easy for the code reader to follow the intent of the code writer.

Member Author

The intent was to expose the new meaning of max_major == 0, while avoiding a call to alloc_custom_gen with max_minor == 0 (which is redundant, as you noted).
This is to change (for the better) the behavior of programs that pass (mem=1, max=0) to caml_alloc_custom instead of the usually intended (mem=0, max=1).
But that was incomplete. I have now changed the code to make caml_alloc_custom with max=0 behave exactly the same as caml_alloc_custom_mem.

@gasche gasche left a comment

The code is now clearer than before, better documented, and it works better. Approved.

@gasche gasche merged commit b4308a0 into ocaml:trunk Nov 13, 2023
9 checks passed
damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Nov 21, 2023
damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Nov 21, 2023
Octachron pushed a commit that referenced this pull request Nov 28, 2023
fix bad behaviour of `caml_alloc_custom_mem`

(cherry picked from commit b4308a0)
(cherry picked from commit aa1b842892e1e7607727cda0613bd0737d16a450)
@Octachron
Member

Cherry-picked for the GC performance fix for 5.1.1 as 1bbed4e.

@damiendoligez damiendoligez deleted the fix-caml-alloc-custom-mem branch February 7, 2024 13:27