
Compaction #12193

Merged (54 commits) on Nov 6, 2023
Conversation

@sadiqj (Contributor) commented Apr 18, 2023

TL;DR: this PR reintroduces compaction for the pools in the GC's major heap (i.e. small blocks of fewer than 128 words). For now it is explicit only, triggered by calls to Gc.compact.

This PR adds a parallel compactor for the shared pools that form part of the major heap. I've favoured simplicity where possible, though the implementation should be reasonably performant. runtime/shared_heap.c has documentation describing the algorithm:

https://github.com/ocaml/ocaml/blob/trunk/runtime/shared_heap.c#L960-L987

One thing to note is that the compactor works only on the current state of the major heap at the end of one major cycle. Due to the way the major GC works in OCaml 5.x, it may take several major cycles before unreachable blocks have been swept. This means that after a compaction there may still be unreachable blocks in the major heap. To ensure that only live data remains in the heap, users will need to do a full major followed by a compaction.

Items for discussion

Automatic compaction

I think it's probably worth waiting for a release before enabling automatic compaction but could be convinced otherwise.

Pool acquire/release

To enable actually releasing pools to the OS after a compaction, I've removed the batching mechanism. It wasn't clear from benchmarks on https://discuss.ocaml.org/t/ocaml-5-gc-releasing-memory-back-to-the-os/11293/16 that batching was beneficial.

An alternative suggested by @stedolan is to use MADV_DONTNEED on Linux instead of unmapping, and potentially keep batching.
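A minimal sketch of that alternative, assuming Linux; the pool size constant and function name below are illustrative, not the runtime's actual symbols:

#include <sys/mman.h>

#define POOL_BYTES (1 << 16)  /* illustrative pool size, not the runtime's */

/* Instead of munmap()ing an empty pool, keep the address range mapped and
   let the kernel reclaim the physical pages; the next access faults in
   fresh zero pages. */
static void release_pool_to_os(void *pool)
{
#if defined(__linux__)
  madvise(pool, POOL_BYTES, MADV_DONTNEED);
#else
  munmap(pool, POOL_BYTES);
#endif
}

Keeping the mapping around would also make it cheap to reuse the same address range for future pools, which is part of why batching might still be worth retaining.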

Benchmarks

Since there are no changes to the GC itself when compaction isn't forced, I've only included benchmarks from a small allocator test (github.com/sadiqj/allocator-test), which cycles between creating a large heap and throwing most of it away.

[Graph of RSS for default / full major / compact / full major + compact]

The graph shows four different runs of the test with different actions after throwing most of the data away: default is no action, a full major, a compact and a full major then compact.

For the reasons discussed earlier, full major then compaction results in much lower overall RSS.

Thanks

Thanks to @NickBarnes, @stedolan and @mshinwell for debugging help.

A ping for @damiendoligez and @kayceesrk for review. Thanks!

@NickBarnes (Contributor):

Why does the RSS grow without bound when compaction isn't on? I don't expect it to shrink, but we're still collecting; why isn't each new iteration re-using the blocks collected by the previous iteration?

@sadiqj (Contributor, Author) commented Apr 18, 2023

Why does the RSS grow without bound when compaction isn't on? I don't expect it to shrink, but we're still collecting; why isn't each new iteration re-using the blocks collected by the previous iteration?

That is curious. I think it's an artifact of the duration of the test. I'll set some longer runs going and see if it happens over a long period of time.

@kayceesrk kayceesrk self-assigned this Apr 19, 2023
@kayceesrk (Contributor):

I've marked myself as a reviewer, but I suspect that it may be a few weeks before I get through my current reviewing backlog.

@NickBarnes would you be able to review this code? It is in your area of expertise.

@kayceesrk (Contributor):

High-level comment: most of the new code is in runtime/shared_heap.c, which is now 1400+ lines of code. The compaction feature feels modular compared to the rest of shared_heap.c. Would it make sense to split this into its own file runtime/compaction.c?

@sadiqj (Contributor, Author) commented Apr 19, 2023

Would it make sense to split this into its own file runtime/compaction.c?

The main reason it's in runtime/shared_heap.c is that it manipulates a lot of the internal structures that aren't exposed (e.g. the pool struct, the various lists of pools based on their state, etc.).

@kayceesrk (Contributor):

The main reason it's in runtime/shared_heap.c is that it manipulates a lot of the internal structures that aren't exposed (e.g. the pool struct, the various lists of pools based on their state, etc.).

Ok to keep the functionality in shared_heap.c then.

@sadiqj (Contributor, Author) commented Apr 22, 2023

@NickBarnes so I ran that allocation test for longer and the results are pretty suspicious. Will investigate next week:
[Graphs: heap overhead and RSS]

@gasche (Member) commented Apr 22, 2023

Your test program is basically a loop that allocates a lot of working memory on each iteration (get_test_data () allocates about 800 MiB) but does not retain anything after each iteration.

So the live memory throughout the program should be bounded as long as the GC does enough collection work (in particular in the versions that explicitly ask for a full GC after each iteration), and this is what we observe. (Well, maybe the "default" version actually has higher and higher peaks, this is hard to tell from the current data.)

On the other hand, the RSS depends on our ability to return unused memory back to the OS, and what we observe is that the versions without compaction never do that. Is this the part that you find surprising? (I don't remember whether we are supposed to return pools to the OS when they are completely empty, which should be the case here after a full major cycle, or whether we don't even try that.)

(One aspect that is a bit strange is that the full_compact line in the RSS plot reads as if full_compact was always at exactly 0 bytes of RSS, which sounds impossible.)

@artempyanykh:

Very exciting! I'll need to re-run the benchmarks from the original discuss post now that compaction is also available.

@nojb (Contributor) commented Apr 25, 2023

I have a naïve question. The lack of compaction had an "upside" which was the possibility of having non-movable ("pinned") blocks in memory (good for FFI). I remember some discussion about having a pinned analog of bytes introduced. Is this idea completely dead if compaction is reintroduced? Or is there a way to salvage the idea even with compaction in the picture?

@sidkshatriya (Contributor):

@nojb

I have a naïve question.

I have a potentially even more naïve follow-up query/point. Sorry if this absolutely does not make sense. Let's say we have a bigarray. A bigarray is represented by a custom block. The first field of the custom block is a pointer to the custom operations, and then we have struct caml_ba_array:

----------------------------------------------------------
| header |  pointer to operations | struct caml_ba_array |
----------------------------------------------------------

Now struct caml_ba_array contains a data member which points to the memory location where the array data is stored on the heap.
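For reference, the struct is declared in caml/bigarray.h and, quoted roughly from memory, looks like this (the header is authoritative):

/* Roughly the definition in caml/bigarray.h: the custom block stores this
   struct inline, while 'data' points to storage allocated outside the
   movable OCaml heap.  intnat is available via caml/mlvalues.h. */
struct caml_ba_array {
  void * data;                  /* pointer to the array data */
  intnat num_dims;              /* number of dimensions */
  intnat flags;                 /* element kind, layout, allocation status */
  struct caml_ba_proxy * proxy; /* proxy for sub-arrays, or NULL */
  intnat dim[];                 /* size in each dimension */
};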

Now custom blocks are not scanned by the GC. So even if this custom block itself is "compacted" and moved, the place where the data of the bigarray is stored never changes and is "pinned".

So in other words, for FFI, even in the face of a compacting GC, as long as you are using a bigarray your array data is already pinned.

Is my understanding correct?

@nojb (Contributor) commented Apr 25, 2023

Is my understanding correct?

Yes, this is correct. This (bigarrays) is what you must use for FFI whenever you need non-moving memory. However, it has an important cost in terms of API, since you can no longer use bytes (or you need to copy data around), which is why a non-moving bytes would be interesting.
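For instance, a C stub can hand a bigarray's storage to an asynchronous API without copying. In this sketch only the Caml_ba_* accessors are the real bigarray FFI macros; submit_read and the stub name are hypothetical:

#include <stddef.h>
#include <caml/mlvalues.h>
#include <caml/bigarray.h>

void submit_read(char *buf, size_t len);  /* hypothetical async API */

value start_read_stub(value ba)  /* ba : (char, int8_unsigned_elt, c_layout) Bigarray.Array1.t */
{
  /* The payload of a bigarray lives outside the movable heap, so this
     pointer stays valid across minor/major GCs and compaction. */
  char *data = (char *) Caml_ba_data_val(ba);
  size_t len = (size_t) Caml_ba_array_val(ba)->dim[0];
  submit_read(data, len);
  return Val_unit;
}

With plain bytes the same pattern is unsafe, because the block may move while the asynchronous operation still holds the old pointer.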

@avsm (Member) commented Apr 25, 2023

@nojb wrote:

The lack of compaction had an "upside" which was the possibility of having non-movable ("pinned") blocks in memory (good for FFI). I remember some discussion about having a pinned analog of bytes introduced.

There is still hope! Values above 128 words (1K or so) are still not moved, as only smaller values are compacted. But for IO, you usually do need buffers larger than 1K to write into (they are usually 4K and page aligned), and so it should be compatible with this compactor. There's some relevant discussion in ocaml-multicore/eio#140. I'm very keen to have a go at eliminating Bigarray from our stack and seeing the benchmarking differences with just using bytes -- the main blocker is not having the efficient compiler primitives for reading uint8/16/24/32 that we currently have for Bigarrays.

@nojb (Contributor) commented Apr 25, 2023

the main blocker is not having the efficient compiler primitives for reading uint8/16/24/32 that we currently have for Bigarrays.

You mean this?

ocaml/stdlib/bytes.ml, lines 451 to 467 (at 0a7c5fe):
external unsafe_get_uint8 : bytes -> int -> int = "%bytes_unsafe_get"
external unsafe_get_uint16_ne : bytes -> int -> int = "%caml_bytes_get16u"
external get_uint8 : bytes -> int -> int = "%bytes_safe_get"
external get_uint16_ne : bytes -> int -> int = "%caml_bytes_get16"
external get_int32_ne : bytes -> int -> int32 = "%caml_bytes_get32"
external get_int64_ne : bytes -> int -> int64 = "%caml_bytes_get64"
external unsafe_set_uint8 : bytes -> int -> int -> unit = "%bytes_unsafe_set"
external unsafe_set_uint16_ne : bytes -> int -> int -> unit
= "%caml_bytes_set16u"
external set_int8 : bytes -> int -> int -> unit = "%bytes_safe_set"
external set_int16_ne : bytes -> int -> int -> unit = "%caml_bytes_set16"
external set_int32_ne : bytes -> int -> int32 -> unit = "%caml_bytes_set32"
external set_int64_ne : bytes -> int -> int64 -> unit = "%caml_bytes_set64"
external swap16 : int -> int = "%bswap16"
external swap32 : int32 -> int32 = "%bswap_int32"
external swap64 : int64 -> int64 = "%bswap_int64"

@avsm (Member) commented Apr 25, 2023

When did they appear?! No blockers then! Let the Bigarray destruction begin!

@nojb (Contributor) commented Apr 25, 2023

When did they appear?! No blockers then! Let the Bigarray destruction begin!

Some time ago in #1864

@avsm (Member) commented Apr 25, 2023

I checked through my notes, and I'd gotten two things confused:

  • using non-moving bytes as the basis for IO pages. We need to ensure that the OCaml header is accounted for in the interfaces.
  • using a directly malloced buffer as an abstract type, which we can guarantee is page aligned. The compiler primitives are missing for this. But I think that the bytes option is just easier and better than going down this route.

@sadiqj (Contributor, Author) commented May 1, 2023

I have a naïve question. The lack of compaction had an "upside" which was the possibility of having non-movable ("pinned") blocks in memory (good for FFI). I remember some discussion about having a pinned analog of bytes introduced. Is this idea completely dead if compaction is reintroduced? Or is there a way to salvage the idea even with compaction in the picture?

There are ways to support pinning with this compaction algorithm; it's just that they make fragmentation a lot more likely or compaction much more expensive.

As Anil pointed out, this only applies to allocations of less than ~1k bytes. It may be that we decide to make it part of the FFI that these won't move, or introduce a pin only for allocations of this size (which would be a no-op for now but would let us change that in the future).

@kayceesrk (Contributor):

Are there updates on the suspicious results observed earlier?

#12193 (comment)

@sadiqj (Contributor, Author) commented May 3, 2023

Are there updates on the suspicious results observed earlier?

Yep, finally figured out what it was. Fixed in ff9860f. Without compaction we were not re-using (or releasing) empty pools. With that fixed, the test now gives:

[RSS graph]

With compaction on its own, there's still a bunch of garbage yet to be swept, which limits how much memory we can reclaim. Full major never really goes back down because there's no release to the OS.

I should probably test what happens if we release back to the OS at the end of a full major.

@NickBarnes (Contributor) left a review:


I've split my review into two parts. This is the first, covering everything apart from the code between "Compaction start" and "Compaction end" in shared_heap.c. All this code looks semantically plausible except where marked - minor changes only.
The inconsistent code style is annoying, as commented: where do we place whitespace; where do we put braces; and can we please decide and then stick to it?

(File-level review comments on runtime/caml/major_gc.h, runtime/major_gc.c, runtime/shared_heap.c, runtime/gc_ctrl.c, runtime/caml/runtime_events.h, and otherlibs/runtime_events/runtime_events.ml)
@kayceesrk (Contributor):

I should probably test what happens if we release back to the OS at the end of a full major.

IIUC OCaml 4 releases memory only at the end of compaction. Are you planning to do this only as a test?

@NickBarnes (Contributor) left a second review:


I think I've only found one bug (line 1049). Generally great of course. Loads of stylistic quibbles and comment typos.

(File-level review comments on runtime/shared_heap.c)
NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request May 4, 2023
…12193: Rewrote compact_update_ephe_list, renamed compact_update_field to compact_update_value, added compact_update_value_at, added POOL_FIRST_BLOCK, POOL_END, POOL_BLOCKS macros.
@sadiqj (Contributor, Author) commented Nov 3, 2023

Being unable to get the macOS CI failure to reproduce, I ran precheck against this branch and it found a segfault on the same test on Alpine. After some trial and error I got it reproducing locally (it only happens when the test is restricted to a single core), and the bug is fixed in af5d2ac (the extra barrier introduced there was dropped in fca3650). macOS CI is green and that test runs cleanly on Alpine now too.

Essentially, if we pass an argument to caml_try_run_on_all_domains, every domain must make a copy of it before the last barrier in the STW section, because there's no guarantee that the leader won't finish before the other domains do.
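A rough sketch of the resulting rule; the payload, helper names, and callback signature are approximate illustrations, not the exact runtime API or the actual fix:

#include <caml/domain_state.h>

struct compact_request { int trigger; };  /* illustrative STW argument */

/* Shape of an STW callback handed to caml_try_run_on_all_domains: the
   leader may return (and release 'arg') as soon as it finishes its own
   part, so each domain takes a private copy before the final barrier. */
static void stw_compact(caml_domain_state *domain, void *arg,
                        int participating, caml_domain_state **others)
{
  struct compact_request req = *(struct compact_request *) arg;  /* copy first */

  /* ... final barrier: beyond this point 'arg' may already be freed ... */

  if (req.trigger) {
    /* ... perform this domain's share of the compaction using 'req' ... */
  }
  (void) domain; (void) participating; (void) others;
}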

Should be all green now!

@gasche (Member) commented Nov 3, 2023

I checked whether this tricky bug occurs in the rest of the runtime code, and did not find any instance.

@sadiqj (Contributor, Author) commented Nov 3, 2023

As a last item, I used @artempyanykh's allocation tester to allocate a large (1000 100 200 300 = ~11GB) heap. On Linux this results in a single large mapping, as the series of mmaps (which now don't have a hole in them for alignment) gets coalesced:

7fec46a3c000-7fef23600000 rw-p 00000000 00:00 0

With Transparent Huge Pages enabled on my machine (kernel 6.2.0), this results in the kernel initially creating 512MB worth of (2MB) huge pages, which then slowly grows over time. It hit about 9GB after 2 hours.

I think a follow-up item (suggested by @NickBarnes) is to test doing an madvise(MADV_COLLAPSE) on contiguous ranges of pools at the end of compaction, which would make the merging synchronous. If this works it could be an optional behaviour.
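A possible shape for that follow-up, assuming a kernel where MADV_COLLAPSE exists (Linux 6.1+); the range arguments and function name are illustrative:

#include <sys/mman.h>

/* After compaction, ask the kernel to collapse a contiguous run of pools
   into huge pages synchronously instead of waiting for khugepaged.
   Failure is non-fatal; the kernel can still merge the pages lazily. */
static void collapse_pool_range(void *start, size_t len)
{
#if defined(__linux__) && defined(MADV_COLLAPSE)
  (void) madvise(start, len, MADV_COLLAPSE);
#else
  (void) start;
  (void) len;
#endif
}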

I think we're ready to merge now.

@NickBarnes (Contributor):

Essentially, if we pass an argument to caml_try_run_on_all_domains, every domain must make a copy of it before the last barrier in the STW section, because there's no guarantee that the leader won't finish before the other domains do.

D'oh!

@dra27 (Member) commented Nov 3, 2023

One final paranoid run through precheck#911, but after that I'd say :shipit:

@sadiqj, @NickBarnes - are you both OK with the history being squashed?

@dra27 merged commit bdd8d96 into ocaml:trunk on Nov 6, 2023 (9 checks passed)
@sadiqj changed the title from "Make the GC compact again" to "Compaction" on Nov 7, 2023
@sadiqj deleted the parallel_compact branch on November 15, 2023