GC reachability traversal in revorder #2085

art-w · 2022-09-14T09:59:11Z

I've been looking at the GC reachability traversal. I still have some ideas but I would like to avoid going down too far the wrong path because I overlooked something... The quoted numbers are from the Tezos replay trace, running the GC pretty aggressively (so real life might behave differently? by how much?)

The GC traversal of the tree uses a hashtable to avoid revisiting the same objects again (because it's actually a DAG). However it only checks the hashtable after re-inserting the (offset,length) into the reachable file: The reachable file ends up with 28m pairs (~440Mb) that gets sorted/deduplicated in ~15s. On the Tezos replay trace, I'm seeing a lot of duplication as there are actually 14m real objects (6m contents + 8m nodes, with indistinct repetitions for both kinds). By checking the hashtable first before inserting, the store.X.reachable file nearly halves (~230Mb) and its sorting time too.
However, the hashtable now contains 14m entries rather than 8m (as only the nodes where kept before.) Collection of IO and GC improvements #2039 mentions that the GC reachability could traverse the disk in-order. By using a priority queue on the decreasing offsets, we can clear the hashtable from old entries that will not be visited again... But it's not very clear that the priority queue won't explose in size: I'm seeing it stabilize at max ~4m elements. I'm guessing that the worst case would be max |contents| |nodes| with an unlikely use of irmin that would lay out the tree in bfs order on the disk before GC.
At this point, the store.X.reachable gets almost produced in reverse sorted order thanks to the priority queue. Only the GC commit's parents are inserted at the wrong time, but this is easily fixed by including them in the in-order traversal. We can then skip sorting store.X.reachable and just reverse it to compute the store.X.mapping file directly. This removes the need for the large intermediate store.X.sorted file.
(Sorting with the priority queue during traversal is a in theory a bit more efficient than a general purpose sort, as it can exploit the backward edges characteristic of the irmin store)
Finally, the mapping file collapses consecutive (offset,length) ranges. This can now be done much earlier when producing the store.X.reachable, such that this file drops to the final size of store.X.mapping (which is much smaller.)

I believe that this PR should already be a bit faster (it at least saves a few Mb on disk), although the following "GC total duration" from the replay trace should be taken with a grain of salt as I'm seeing a lot of variations across multiple runs on my computer:

Before:  64.9s  130.7s  140.5s  151.5s  154.1s  151.6s  154.9s
 After:  34.8s   87.6s  104.5s  106.7s  110.1s  124.3s  118.5s

Some pending questions/notes:

How bad is the max memory consumption with the priority queue? (I'm hoping it should be about the same as before)
The in-disk-order traversal is not quite there yet, as the node's decode_bin does too much for our purpose (including arbitrary disk reads)
I was lazy and duplicated some routines in Mapping_file (which I can clean up), but I'm not sure if we should keep the original dead(ly interesting) code? (it would still be available in the git history if needed)
The store.X.reachable could be reversed in place to save a few Mb... Do we know how large it gets in practice?

codecov-commenter · 2022-09-14T15:42:40Z

Codecov Report

Merging #2085 (83984d6) into main (2f55642) will decrease coverage by 0.17%.
The diff coverage is 48.71%.

@@            Coverage Diff             @@
##             main    #2085      +/-   ##
==========================================
- Coverage   68.23%   68.05%   -0.18%     
==========================================
  Files         134      134              
  Lines       16096    16060      -36     
==========================================
- Hits        10983    10930      -53     
- Misses       5113     5130      +17

Impacted Files	Coverage Δ
src/irmin-pack/unix/inode.ml	`12.50% <0.00%> (-20.84%)`	⬇️
src/irmin-pack/unix/gc_worker.ml	`4.31% <2.50%> (-1.66%)`	⬇️
src/irmin-pack/unix/pack_store.ml	`81.67% <50.00%> (-1.31%)`	⬇️
src/irmin-pack/unix/pack_key.ml	`70.83% <58.33%> (-3.53%)`	⬇️
src/irmin-pack/unix/mapping_file.ml	`89.88% <92.85%> (-3.99%)`	⬇️
src/irmin-test/store.ml	`94.89% <0.00%> (-0.06%)`	⬇️
src/irmin-fs/unix/irmin_fs_unix.ml	`69.03% <0.00%> (+1.29%)`	⬆️
src/irmin-pack/layout.ml	`84.00% <0.00%> (+2.00%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

metanivek

Thanks! From what I know and my first pass review, I think this is a positive change. I'll also think more about your questions and possible implications.

Do you know how big the sorted file is? You say it is large, so it's very nice that this PR eliminates this disk usage!

I also would champion a clean up of the create_rev code that you copy-pasted but maybe in a follow up PR.

src/irmin-pack/unix/gc.ml

src/irmin-pack/unix/mapping_file_intf.ml

art-w · 2022-09-21T09:51:46Z

Do you know how big the sorted file is? You say it is large, so it's very nice that this PR eliminates this disk usage!

Oh sorry for the confusion, it's only "large" when compared to the final store.X.mapping file. The store.X.sorted size was ~440Mb at the start (+ same size for store.X.reachable which needs to exist at the same time), then half with deduplication, then 0 since we don't need it anymore. So this saves less than 1Gb in the end for the Tezos replay trace.

I went ahead to reverse the mapping file in place, not because it saves the last few Mb, but because it's one less temporary file that we need to clean up in case of failure (... and the code is a bit shorter/simpler from my perspective ^^')

irmaTS · 2022-09-21T10:00:43Z

I went ahead to reverse the mapping file in place, not because it saves the last few Mb, but because it's one less temporary file that we need to clean up in case of failure (... and the code is a bit shorter/simpler from my perspective ^^')

If this means that the mapping file used by main is modified by the GC worker, then this breaks the RW consistency.

art-w · 2022-09-21T10:10:38Z

It would break everything yes! But it is the new mapping file before it becomes visible and used by anyone :)

icristescu

great work! I only have some nitpicks, feel free to ignore them.

Related to the memory concerns: not only we store more entres in the hashtable, but also we used to store pairs of (offsets, unit). Also we don't know how much allocations the priority queue does. And this is memory used but the gc process, which we don't track yet in our benchmarks, I think? But I propose we go ahead with this PR, and we will see the impact when we benchmark a new release.

src/irmin-pack/unix/gc.ml

src/irmin-pack/unix/mapping_file.ml

Ngoguey42 · 2022-09-26T10:54:34Z

I would like to benchmark this PR before merging. I'm still working on the stats

Ngoguey42 · 2022-11-03T17:11:01Z

Here is the GC benchmark that compares main with this PR. I have not looked in detail
pp-gc.txt

icristescu · 2022-11-09T11:38:44Z

Here is the GC benchmark that compares main with this PR. I have not looked in detail
pp-gc.txt

Thanks, I had a quick look - these metrics are super nice to have btw, great work!

The worker total time is a bit down, but the maxrss is a bit up:

                       | with-gc_a  |    with-gc_b    |    pr-2085_a    |    pr-2085_b
total duration (wall)  |    5min42s |    5min45s 101% |    5min31s  97% |    5min32s  97%

initial maxrss (bytes) |   75.308 M |   75.312 M 100% |   75.176 M 100% |   77.160 M 102%
final maxrss (bytes)   | 1129.136 M | 1129.312 M 100% | 1274.100 M 113% | 1274.180 M 113%

I think we should go forward with this PR, the code is cleaner and the maxrss overhead is not so big, IMHO.

The diffs in time and maxrss come from where we expected it:

-- Worker's timings --
                               |           |      wall      | share |      sys       |      user      | %cpu | %sys | %user
mapping: of reachable          | with-gc_a | 13.306  s      |  3.9% |  0.503  s      | 12.311  s      |  96% | 3.8% |   93%
                               | with-gc_b | 13.300  s 100% |  3.9% |  0.769  s 153% | 12.022  s  98% |  96% | 5.8% |   90%
                               | pr-2085_a |  0.162  s 1.2% |  .05% |  0.027  s 5.3% |  0.036  s .29% |  39% |  17% |   22%
                               | pr-2085_b |  0.161  s 1.2% |  .05% |  0.029  s 5.7% |  0.028  s .22% |  35% |  18% |   17%
-- Worker's rusage stats per step --
                               |           |     maxrss     |    minflt    |    majflt    |    inblock     |    oublock     |    nvcsw     |   nivcsw
mapping: objects to reachable  | with-gc_a |   729_416      | 153_012      |       0      | 6_216_864      |   933_640      | 506_907      |  2_594     
                               | with-gc_b |   729_392 100% | 153_011 100% |       0  --  | 6_219_336 100% |   933_640 100% | 507_179 100% |  2_644 102%
                               | pr-2085_a | 1_131_032 155% | 254_503 166% |       0  --  | 7_081_456 114% |    82_072 8.8% | 475_903  94% |  2_686 104%
                               | pr-2085_b | 1_131_180 155% | 254_506 166% |       0  --  | 7_084_568 114% |    82_056 8.8% | 476_548  94% |  2_760 106%
mapping: of reachable          | with-gc_a | 1_129_136      | 127_831      | 118_376      |   406_744      | 1_016_928      |   1_093      |    427     
                               | with-gc_b | 1_129_312 100% | 126_266  99% | 118_368 100% |   404_752 100% | 1_016_928 100% |   1_042  95% |    426 100%
                               | pr-2085_a | 1_172_460 104% |     690 .54% |  10_355 8.7% |    13_672 3.4% |    72_632 7.1% |     226  21% |      5 1.2%
                               | pr-2085_b | 1_172_568 104% |     693 .54% |  10_363 8.8% |    15_632 3.8% |    70_680 7.0% |      95 8.7% |      8 1.9%

this PR does not track all the stats as main? as there are some stats n/a or set to 0
is it normal to have 0 in these columns?

appended_hashes        |          0 |          0  --  |          0  --  |          0  -- 
appended_offsets       |          0 |          0  --  |          0  --  |          0  -- 
total                  |  9_012_474 |  9_012_474 100% |  9_012_474 100% |  9_012_474 100%

Ngoguey42 · 2022-11-09T12:41:14Z

this PR does not track all the stats as main? as there are some stats n/a or set to 0

The stats only track give informations on main in Main thread during GC and Main's timings. The rest concerns the worker

is it normal to have 0 in these columns?

Yes this is normal. The worker doesn't encore anything.

art-w · 2022-11-09T13:32:40Z

Thanks a lot for the benchmarks, they are great! <3

this PR does not track all the stats as main? as there are some stats n/a or set to 0

Not sure which 0 are missing... I found these n/a values for the former prefix and the mapping file where the stats weren't recorded, but otherwise I believe the n/a are related to files that don't exist anymore :)

-- File sizes (bytes) --
                                    |  with-gc_a  |    with-gc_b     |    pr-2085_a     |    pr-2085_b
former suffix file size             | 21533.289 M | 21533.289 M 100% | 21533.289 M 100% | 21533.289 M 100%
new suffix file size                | 19092.716 M | 19102.575 M 100% | 19035.959 M 100% | 19038.785 M 100%
former prefix file size             |  1894.588 M |  1894.588 M 100% |         n/a  --  |         n/a  -- 
new prefix file size                |  1958.165 M |  1958.165 M 100% |  1958.165 M 100% |  1958.165 M 100%
former mapping file size            |    27.479 M |    27.479 M 100% |         n/a  --  |         n/a  -- 
new mapping file size               |    42.188 M |    42.188 M 100% |         n/a  --  |         n/a  --

art-w · 2022-12-07T14:58:38Z

Rebased and added two commits to make this PR work:

irmin-pack: gc reachability, optimize memory by keeping only the key
We can save quite a lot of memory during the GC traversal if we keep only the object offsets that have yet to be visited, and don't carry their hash and length (since we can recover that information from the offset)
irmin-pack: gc reachability, do not prefetch children hash/length
This is a bit dirty, but it makes the GC traversal a lot faster: since we don't care anymore about the hash/length of a node children, we can avoid doing random reads on the disk by not prefetching those informations. I didn't find a nice way to do this, so I introduced a new unsafe Pack_key.Offset constructor that should never appear outside of the GC traversal.

I've removed a bunch of optimizations that didn't have a big impact, the latest benchmarks looks like: (orange is still running to check what happens with more GC iterations, I'll update later)

metanivek

Nothing too new or scary in these changes and the memory saving is great! I don't think the solution you came up with for exposing these "optimized" keys is bad. It seems to fit with the node/pack store API.

src/irmin-pack/unix/gc_worker.ml

src/irmin-pack/unix/inode.ml

src/irmin-pack/unix/mapping_file_intf.ml

src/irmin-pack/unix/mapping_file.ml

src/irmin-pack/unix/pack_key_intf.ml

src/irmin-pack/unix/pack_key.ml

src/irmin-pack/unix/pack_store_intf.ml

metanivek · 2022-12-09T19:15:22Z

@art-w I know it's a tad annoying but would you mind applying the parts of ce23879 (#2085) to their appropriate commits that you're fixuping? (any changes worthy of a new commit(s) should of course be self-standing)

…offset

@art-w

…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.6.0) CHANGES: ### Changed - Improve GC reachability traversal to optimize memory, speed and remove the need for temporary files. (mirage/irmin#2085, @art-w)

@art-w

…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.6.0) CHANGES: ### Changed - **irmin-pack** - Improve GC reachability traversal to optimize memory, speed and remove the need for temporary files. (mirage/irmin#2085, @art-w)

@art-w

…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.6.0) CHANGES: ### Changed - **irmin-pack** - Improve GC reachability traversal to optimize memory, speed and remove the need for temporary files. (mirage/irmin#2085, @art-w)

art-w force-pushed the gc-reach branch from 2ce8382 to 0c5d14f Compare September 14, 2022 10:03

metanivek reviewed Sep 14, 2022

View reviewed changes

src/irmin-pack/unix/gc.ml Outdated Show resolved Hide resolved

src/irmin-pack/unix/mapping_file_intf.ml Outdated Show resolved Hide resolved

src/irmin-pack/unix/mapping_file_intf.ml Outdated Show resolved Hide resolved

icristescu approved these changes Sep 26, 2022

View reviewed changes

src/irmin-pack/unix/gc.ml Outdated Show resolved Hide resolved

src/irmin-pack/unix/gc.ml Outdated Show resolved Hide resolved

src/irmin-pack/unix/mapping_file.ml Outdated Show resolved Hide resolved

src/irmin-pack/unix/mapping_file.ml Outdated Show resolved Hide resolved

icristescu mentioned this pull request Sep 26, 2022

Collection of IO and GC improvements #2039

Open

icristescu mentioned this pull request Sep 27, 2022

Release 3.5 #2096

Closed

34 tasks

art-w force-pushed the gc-reach branch from 528bee3 to cae3fb8 Compare October 10, 2022 16:20

metanivek added this to the Irmin 3.5 milestone Oct 13, 2022

art-w force-pushed the gc-reach branch 2 times, most recently from ac3d14d to 55ab9b0 Compare December 7, 2022 13:06

metanivek reviewed Dec 7, 2022

View reviewed changes

art-w force-pushed the gc-reach branch 2 times, most recently from 663b498 to ce23879 Compare December 9, 2022 15:58

art-w force-pushed the gc-reach branch from ce23879 to 7653288 Compare December 16, 2022 19:00

art-w added 6 commits December 16, 2022 20:16

irmin-pack: no duplicates in gc reachability

99d0487

irmin-pack: gc reachability traverses disk in order

dce102e

irmin-pack: gc reachability outputs reachable file in rev order

79b621f

irmin-pack: gc reachability, refactor priority queue

6a22e61

irmin-pack: mapping_file simplified create for reverse sorted entries

202aea9

irmin-pack: mapping_file collapse consecutive entries asap

b8ff828

art-w added 3 commits December 16, 2022 20:16

irmin-pack: mapping_file checks order

0e4e8e0

irmin-pack: mapping file creation without temporary files

64a2760

irmin-pack: gc reachability, optimize memory by keeping only the key …

243ba5b

…offset

art-w force-pushed the gc-reach branch 2 times, most recently from 1d1e8d6 to c54bf0c Compare December 16, 2022 19:25

art-w and others added 2 commits December 19, 2022 11:06

irmin-pack: gc reachability, do not prefetch children hash/length

3dad6cb

Add changelog entry for GC optimizations

83984d6

metanivek force-pushed the gc-reach branch from c54bf0c to 83984d6 Compare December 19, 2022 16:08

metanivek merged commit b6c64d7 into mirage:main Dec 19, 2022

metanivek mentioned this pull request Dec 19, 2022

[new release] irmin project (3.6.0) ocaml/opam-repository#22725

Closed

metanivek mentioned this pull request Feb 9, 2023

[new release] Irmin project (3.6.0) ocaml/opam-repository#23280

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GC reachability traversal in revorder #2085

GC reachability traversal in revorder #2085

art-w commented Sep 14, 2022 •

edited

codecov-commenter commented Sep 14, 2022 •

edited

metanivek left a comment •

edited

art-w commented Sep 21, 2022

irmaTS commented Sep 21, 2022

art-w commented Sep 21, 2022

icristescu left a comment

Ngoguey42 commented Sep 26, 2022

Ngoguey42 commented Nov 3, 2022

icristescu commented Nov 9, 2022

Ngoguey42 commented Nov 9, 2022

art-w commented Nov 9, 2022

art-w commented Dec 7, 2022 •

edited

metanivek left a comment

metanivek commented Dec 9, 2022

GC reachability traversal in revorder #2085

GC reachability traversal in revorder #2085

Conversation

art-w commented Sep 14, 2022 • edited

codecov-commenter commented Sep 14, 2022 • edited

Codecov Report

metanivek left a comment • edited

Choose a reason for hiding this comment

art-w commented Sep 21, 2022

irmaTS commented Sep 21, 2022

art-w commented Sep 21, 2022

icristescu left a comment

Choose a reason for hiding this comment

Ngoguey42 commented Sep 26, 2022

Ngoguey42 commented Nov 3, 2022

icristescu commented Nov 9, 2022

Ngoguey42 commented Nov 9, 2022

art-w commented Nov 9, 2022

art-w commented Dec 7, 2022 • edited

metanivek left a comment

Choose a reason for hiding this comment

metanivek commented Dec 9, 2022

art-w commented Sep 14, 2022 •

edited

codecov-commenter commented Sep 14, 2022 •

edited

metanivek left a comment •

edited

art-w commented Dec 7, 2022 •

edited