
Compaction #12193

Merged (54 commits) on Nov 6, 2023
Conversation

@sadiqj (Contributor) commented Apr 18, 2023

TL;DR: this PR reintroduces compaction for the pools in the GC's major heap (i.e. small blocks of fewer than 128 words). For now it is explicit only, triggered by calls to Gc.compact.

This PR adds a parallel compactor for the shared pools that form part of the major heap. I've favoured simplicity where possible, though the implementation should be reasonably performant. runtime/shared_heap.c has documentation describing the algorithm:

https://github.com/ocaml/ocaml/blob/trunk/runtime/shared_heap.c#L960-L987

One thing to note is that the compactor works only on the current state of the major heap at the end of one major cycle. Due to the way the major GC works in OCaml 5.x, it may take several major cycles before unreachable blocks have been swept. This means that after a compaction there may still be unreachable blocks in the major heap. To ensure that only live data remains in the heap, users will need to do a full major followed by a compaction.

Items for discussion

Automatic compaction

I think it's probably worth waiting for a release before enabling automatic compaction but could be convinced otherwise.

Pool acquire/release

To enable actually releasing pools to the OS after a compaction, I've removed the batching mechanism. It wasn't clear from benchmarks on https://discuss.ocaml.org/t/ocaml-5-gc-releasing-memory-back-to-the-os/11293/16 that batching was beneficial.

An alternative suggested by @stedolan is to use MADV_DONTNEED on Linux instead of unmapping, and potentially keep batching.
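A minimal sketch of that alternative, assuming Linux; the pool size constant and function name below are illustrative, not the runtime's actual symbols:

#include <sys/mman.h>

#define POOL_BYTES (1 << 16)  /* illustrative pool size, not the runtime's */

/* Instead of munmap()ing an empty pool, keep the address range mapped and
   let the kernel reclaim the physical pages; the next access faults in
   fresh zero pages. */
static void release_pool_to_os(void *pool)
{
#if defined(__linux__)
  madvise(pool, POOL_BYTES, MADV_DONTNEED);
#else
  munmap(pool, POOL_BYTES);
#endif
}

Keeping the mapping around would also make it cheap to reuse the same address range for future pools, which is part of why batching might still be worth retaining.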

Benchmarks

Since there are no changes to the GC itself when compaction isn't forced, I've only included benchmarks from a small allocator test (github.com/sadiqj/allocator-test), which cycles between creating a large heap and throwing most of it away.

[Graph of RSS for default / full major / compact / full major + compact]

The graph shows four different runs of the test with different actions after throwing most of the data away: default is no action, a full major, a compact and a full major then compact.

For the reasons discussed earlier, full major then compaction results in much lower overall RSS.

Thanks

Thanks to @NickBarnes, @stedolan and @mshinwell for debugging help.

A ping for @damiendoligez and @kayceesrk for review. Thanks!

@NickBarnes (Contributor):

Why does the RSS grow without bound when compaction isn't on? I don't expect it to shrink, but we're still collecting; why isn't each new iteration re-using the blocks collected by the previous iteration?

@sadiqj (Contributor, Author) commented Apr 18, 2023

Why does the RSS grow without bound when compaction isn't on? I don't expect it to shrink, but we're still collecting; why isn't each new iteration re-using the blocks collected by the previous iteration?

That is curious. I think it's an artifact of the duration of the test. I'll set some longer runs going and see if it happens over a long period of time.

@kayceesrk kayceesrk self-assigned this Apr 19, 2023
@kayceesrk (Contributor):

I've marked myself as a reviewer, but I suspect that it may be a few weeks before I get through my current reviewing backlog.

@NickBarnes would you be able to review this code? It is in your area of expertise.

@kayceesrk (Contributor):

High-level comment: most of the new code is in runtime/shared_heap.c, which is now 1400+ lines of code. The compaction feature feels modular compared to the rest of shared_heap.c. Would it make sense to split this into its own file runtime/compaction.c?

@sadiqj (Contributor, Author) commented Apr 19, 2023

Would it make sense to split this into its own file runtime/compaction.c?

The main reason it's in runtime/shared_heap.c is that it manipulates a lot of the internal structures that aren't exposed (e.g. the pool struct, the various lists of pools based on their state, etc.).

@kayceesrk (Contributor):

The main reason it's in runtime/shared_heap.c is that it manipulates a lot of the internal structures that aren't exposed (e.g. the pool struct, the various lists of pools based on their state, etc.).

Ok to keep the functionality in shared_heap.c then.

@sadiqj (Contributor, Author) commented Apr 22, 2023

@NickBarnes so I ran that allocation test for longer and the results are pretty suspicious. Will investigate next week:
[Graphs: heap overhead and RSS]

@gasche (Member) commented Apr 22, 2023

Your test program is basically a loop that allocates a lot of working memory on each iteration (get_test_data () allocates about 800 MiB) but does not retain anything after each iteration.

So the live memory throughout the program should be bounded as long as the GC does enough collection work (in particular in the versions that explicitly ask for a full GC after each iteration), and this is what we observe. (Well, maybe the "default" version actually has higher and higher peaks, this is hard to tell from the current data.)

On the other hand, the RSS depends on our ability to return unused memory back to the OS, and what we observe is that the versions without compaction never do that. Is this the part that you find surprising? (I don't remember whether we are supposed to return pools to the OS when they are completely empty, which should be the case here after a full major cycle, or whether we don't even try that.)

(One aspect that is a bit strange is that the full_compact line in the RSS plot reads as if full_compact was always at exactly 0 bytes of RSS, which sounds impossible.)

@artempyanykh:

Very exciting! I'll need to re-run the benchmarks from the original discuss post now that compaction is also available.

@nojb (Contributor) commented Apr 25, 2023

I have a naïve question. The lack of compaction had an "upside" which was the possibility of having non-movable ("pinned") blocks in memory (good for FFI). I remember some discussion about having a pinned analog of bytes introduced. Is this idea completely dead if compaction is reintroduced? Or is there a way to salvage the idea even with compaction in the picture?

@sidkshatriya (Contributor):

@nojb

I have a naïve question.

I have a potentially even more naïve follow-up query/point. Sorry if this absolutely does not make sense. Let's say we have a bigarray. A bigarray is represented by a custom block. The first field of the custom block is a pointer to the custom operations, and then we have struct caml_ba_array:

----------------------------------------------------------
| header |  pointer to operations | struct caml_ba_array |
----------------------------------------------------------

Now struct caml_ba_array contains a data member which points to the memory location where the array data is stored on the heap.
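For reference, the struct is declared in caml/bigarray.h and, quoted roughly from memory, looks like this (the header is authoritative):

/* Roughly the definition in caml/bigarray.h: the custom block stores this
   struct inline, while 'data' points to storage allocated outside the
   movable OCaml heap.  intnat is available via caml/mlvalues.h. */
struct caml_ba_array {
  void * data;                  /* pointer to the array data */
  intnat num_dims;              /* number of dimensions */
  intnat flags;                 /* element kind, layout, allocation status */
  struct caml_ba_proxy * proxy; /* proxy for sub-arrays, or NULL */
  intnat dim[];                 /* size in each dimension */
};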

Now custom blocks are not scanned by the GC. So even if this custom block itself is "compacted" and moved, the place where the data of the bigarray is stored never changes and is "pinned".

So in other words, for FFI, even in the face of a compacting GC, as long as you are using a bigarray your array data is already pinned.

Is my understanding correct?

@nojb (Contributor) commented Apr 25, 2023

Is my understanding correct?

Yes, this is correct. This (bigarrays) is what you must use for FFI whenever you need non-moving memory. However, it has an important cost in terms of API, since you can no longer use bytes (or you need to copy data around), which is why a non-moving bytes would be interesting.
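For instance, a C stub can hand a bigarray's storage to an asynchronous API without copying. In this sketch only the Caml_ba_* accessors are the real bigarray FFI macros; submit_read and the stub name are hypothetical:

#include <stddef.h>
#include <caml/mlvalues.h>
#include <caml/bigarray.h>

void submit_read(char *buf, size_t len);  /* hypothetical async API */

value start_read_stub(value ba)  /* ba : (char, int8_unsigned_elt, c_layout) Bigarray.Array1.t */
{
  /* The payload of a bigarray lives outside the movable heap, so this
     pointer stays valid across minor/major GCs and compaction. */
  char *data = (char *) Caml_ba_data_val(ba);
  size_t len = (size_t) Caml_ba_array_val(ba)->dim[0];
  submit_read(data, len);
  return Val_unit;
}

With plain bytes the same pattern is unsafe, because the block may move while the asynchronous operation still holds the old pointer.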

@avsm (Member) commented Apr 25, 2023

@nojb wrote:

The lack of compaction had an "upside" which was the possibility of having non-movable ("pinned") blocks in memory (good for FFI). I remember some discussion about having a pinned analog of bytes introduced.

There is still hope! Values above 128 words (1K or so) are still not moved, as only smaller values are compacted. But for IO, you usually do need buffers larger than 1K to write into (they are usually 4K and page aligned), and so it should be compatible with this compactor. There's some relevant discussion in ocaml-multicore/eio#140. I'm very keen to have a go at eliminating Bigarray from our stack and seeing the benchmarking differences with just using bytes -- the main blocker is not having the efficient compiler primitives for reading uint8/16/24/32 that we currently have for Bigarrays.

@nojb (Contributor) commented Apr 25, 2023

the main blocker is not having the efficient compiler primitives for reading uint8/16/24/32 that we currently have for Bigarrays.

You mean this?

ocaml/stdlib/bytes.ml, lines 451 to 467 (at 0a7c5fe):
external unsafe_get_uint8 : bytes -> int -> int = "%bytes_unsafe_get"
external unsafe_get_uint16_ne : bytes -> int -> int = "%caml_bytes_get16u"
external get_uint8 : bytes -> int -> int = "%bytes_safe_get"
external get_uint16_ne : bytes -> int -> int = "%caml_bytes_get16"
external get_int32_ne : bytes -> int -> int32 = "%caml_bytes_get32"
external get_int64_ne : bytes -> int -> int64 = "%caml_bytes_get64"
external unsafe_set_uint8 : bytes -> int -> int -> unit = "%bytes_unsafe_set"
external unsafe_set_uint16_ne : bytes -> int -> int -> unit
= "%caml_bytes_set16u"
external set_int8 : bytes -> int -> int -> unit = "%bytes_safe_set"
external set_int16_ne : bytes -> int -> int -> unit = "%caml_bytes_set16"
external set_int32_ne : bytes -> int -> int32 -> unit = "%caml_bytes_set32"
external set_int64_ne : bytes -> int -> int64 -> unit = "%caml_bytes_set64"
external swap16 : int -> int = "%bswap16"
external swap32 : int32 -> int32 = "%bswap_int32"
external swap64 : int64 -> int64 = "%bswap_int64"

@avsm (Member) commented Apr 25, 2023

When did they appear?! No blockers then! Let the Bigarray destruction begin!

@nojb (Contributor) commented Apr 25, 2023

When did they appear?! No blockers then! Let the Bigarray destruction begin!

Some time ago in #1864

@avsm (Member) commented Apr 25, 2023

I checked through my notes, and I'd gotten two things confused:

  • using non-moving bytes as the basis for IO pages. We need to ensure that the OCaml header is accounted for in the interfaces.
  • using a directly malloced buffer as an abstract type, which we can guarantee is page aligned. The compiler primitives are missing for this. But I think that the bytes option is just easier and better than going down this route.

@sadiqj (Contributor, Author) commented May 1, 2023

I have a naïve question. The lack of compaction had an "upside" which was the possibility of having non-movable ("pinned") blocks in memory (good for FFI). I remember some discussion about having a pinned analog of bytes introduced. Is this idea completely dead if compaction is reintroduced? Or is there a way to salvage the idea even with compaction in the picture?

There are ways to support pinning with this compaction algorithm; it's just that they make fragmentation a lot more likely or compaction much more expensive.

As Anil pointed out, this only applies to allocations of less than ~1k bytes. It may be that we decide to make it part of the FFI that these won't move, or introduce a pin only for allocations of this size (which would be a no-op for now but would let us change that in the future).

@kayceesrk (Contributor):

Are there updates on the suspicious results observed earlier?

#12193 (comment)

@sadiqj (Contributor, Author) commented May 3, 2023

Are there updates on the suspicious results observed earlier?

Yep, finally figured out what it was. Fixed in ff9860f. Without compaction we were not re-using (or releasing) empty pools. With that fixed, the test now gives:

[RSS graph]

With compaction on its own, there's still a bunch of garbage yet to be swept, which limits how much memory we can reclaim. Full major never really goes back down because there's no release to the OS.

I should probably test what happens if we release back to the OS at the end of a full major.

@NickBarnes (Contributor) left a review:


I've split my review into two parts. This is the first, covering everything apart from the code between "Compaction start" and "Compaction end" in shared_heap.c. All this code looks semantically plausible except where marked - minor changes only.
The inconsistent code style is annoying, as commented: where do we place whitespace; where do we put braces; and can we please decide and then stick to it?

(File-level review comments on runtime/caml/major_gc.h, runtime/major_gc.c, runtime/shared_heap.c, runtime/gc_ctrl.c, runtime/caml/runtime_events.h, and otherlibs/runtime_events/runtime_events.ml)
@kayceesrk (Contributor):

I should probably test what happens if we release back to the OS at the end of a full major.

IIUC OCaml 4 releases memory only at the end of compaction. Are you planning to do this only as a test?

@NickBarnes (Contributor) left a second review:


I think I've only found one bug (line 1049). Generally great of course. Loads of stylistic quibbles and comment typos.

(File-level review comments on runtime/shared_heap.c)
NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request May 4, 2023
…12193: Rewrote compact_update_ephe_list, renamed compact_update_field to compact_update_value, added compact_update_value_at, added POOL_FIRST_BLOCK, POOL_END, POOL_BLOCKS macros.
@sadiqj (Contributor, Author) commented Nov 3, 2023

Being unable to get the macOS CI failure to reproduce, I ran precheck against this branch and it found a segfault on the same test on Alpine. After some trial and error I got it reproducing locally (it only happens when the test is restricted to a single core), and the bug is fixed in af5d2ac (the extra barrier introduced there was dropped in fca3650). macOS CI is green and that test runs cleanly on Alpine now too.

Essentially, if we pass an argument to caml_try_run_on_all_domains, every domain must make a copy of it before the last barrier in the STW section, because there's no guarantee that the leader won't finish before the other domains do.
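A rough sketch of the resulting rule; the payload, helper names, and callback signature are approximate illustrations, not the exact runtime API or the actual fix:

#include <caml/domain_state.h>

struct compact_request { int trigger; };  /* illustrative STW argument */

/* Shape of an STW callback handed to caml_try_run_on_all_domains: the
   leader may return (and release 'arg') as soon as it finishes its own
   part, so each domain takes a private copy before the final barrier. */
static void stw_compact(caml_domain_state *domain, void *arg,
                        int participating, caml_domain_state **others)
{
  struct compact_request req = *(struct compact_request *) arg;  /* copy first */

  /* ... final barrier: beyond this point 'arg' may already be freed ... */

  if (req.trigger) {
    /* ... perform this domain's share of the compaction using 'req' ... */
  }
  (void) domain; (void) participating; (void) others;
}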

Should be all green now!

@gasche (Member) commented Nov 3, 2023

I checked whether this tricky bug occurs in the rest of the runtime code, and did not find any instance.

@sadiqj (Contributor, Author) commented Nov 3, 2023

As a last item, I used @artempyanykh's allocation tester to allocate a large (1000 100 200 300 = ~11GB) heap. On Linux this results in a single large mapping, as the series of mmaps (which now don't have a hole in them for alignment) gets coalesced:

7fec46a3c000-7fef23600000 rw-p 00000000 00:00 0

With Transparent Huge Pages enabled on my machine (kernel 6.2.0), this results in the kernel initially creating 512MB worth of (2MB) huge pages, which then slowly grows over time. It hit about 9GB after 2 hours.

I think a follow-up item (suggested by @NickBarnes) is to test doing an madvise(MADV_COLLAPSE) on contiguous ranges of pools at the end of compaction, which would make the merging synchronous. If this works it could be an optional behaviour.
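A possible shape for that follow-up, assuming a kernel where MADV_COLLAPSE exists (Linux 6.1+); the range arguments and function name are illustrative:

#include <sys/mman.h>

/* After compaction, ask the kernel to collapse a contiguous run of pools
   into huge pages synchronously instead of waiting for khugepaged.
   Failure is non-fatal; the kernel can still merge the pages lazily. */
static void collapse_pool_range(void *start, size_t len)
{
#if defined(__linux__) && defined(MADV_COLLAPSE)
  (void) madvise(start, len, MADV_COLLAPSE);
#else
  (void) start;
  (void) len;
#endif
}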

I think we're ready to merge now.

@NickBarnes (Contributor):

Essentially, if we pass an argument to caml_try_run_on_all_domains, every domain must make a copy of it before the last barrier in the STW section, because there's no guarantee that the leader won't finish before the other domains do.

D'oh!

@dra27 (Member) commented Nov 3, 2023

One final paranoid run through precheck#911, but after that I'd say :shipit:

@sadiqj, @NickBarnes - are you both OK with the history being squashed?

@dra27 merged commit bdd8d96 into ocaml:trunk on Nov 6, 2023 (9 checks passed)
@sadiqj changed the title from "Make the GC compact again" to "Compaction" on Nov 7, 2023
@sadiqj deleted the parallel_compact branch on November 15, 2023