Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC reachability traversal in revorder #2085

Merged
merged 11 commits into from Dec 19, 2022
Merged

GC reachability traversal in revorder #2085

merged 11 commits into from Dec 19, 2022

Conversation

art-w
Copy link
Contributor

@art-w art-w commented Sep 14, 2022

I've been looking at the GC reachability traversal. I still have some ideas but I would like to avoid going down too far the wrong path because I overlooked something... The quoted numbers are from the Tezos replay trace, running the GC pretty aggressively (so real life might behave differently? by how much?)

  • The GC traversal of the tree uses a hashtable to avoid revisiting the same objects again (because it's actually a DAG). However it only checks the hashtable after re-inserting the (offset,length) into the reachable file: The reachable file ends up with 28m pairs (~440Mb) that gets sorted/deduplicated in ~15s. On the Tezos replay trace, I'm seeing a lot of duplication as there are actually 14m real objects (6m contents + 8m nodes, with indistinct repetitions for both kinds). By checking the hashtable first before inserting, the store.X.reachable file nearly halves (~230Mb) and its sorting time too.

  • However, the hashtable now contains 14m entries rather than 8m (as only the nodes where kept before.) Collection of IO and GC improvements #2039 mentions that the GC reachability could traverse the disk in-order. By using a priority queue on the decreasing offsets, we can clear the hashtable from old entries that will not be visited again... But it's not very clear that the priority queue won't explose in size: I'm seeing it stabilize at max ~4m elements. I'm guessing that the worst case would be max |contents| |nodes| with an unlikely use of irmin that would lay out the tree in bfs order on the disk before GC.

  • At this point, the store.X.reachable gets almost produced in reverse sorted order thanks to the priority queue. Only the GC commit's parents are inserted at the wrong time, but this is easily fixed by including them in the in-order traversal. We can then skip sorting store.X.reachable and just reverse it to compute the store.X.mapping file directly. This removes the need for the large intermediate store.X.sorted file.
    (Sorting with the priority queue during traversal is a in theory a bit more efficient than a general purpose sort, as it can exploit the backward edges characteristic of the irmin store)

  • Finally, the mapping file collapses consecutive (offset,length) ranges. This can now be done much earlier when producing the store.X.reachable, such that this file drops to the final size of store.X.mapping (which is much smaller.)

I believe that this PR should already be a bit faster (it at least saves a few Mb on disk), although the following "GC total duration" from the replay trace should be taken with a grain of salt as I'm seeing a lot of variations across multiple runs on my computer:

Before:  64.9s  130.7s  140.5s  151.5s  154.1s  151.6s  154.9s
 After:  34.8s   87.6s  104.5s  106.7s  110.1s  124.3s  118.5s

Some pending questions/notes:

  • How bad is the max memory consumption with the priority queue? (I'm hoping it should be about the same as before)
  • The in-disk-order traversal is not quite there yet, as the node's decode_bin does too much for our purpose (including arbitrary disk reads)
  • I was lazy and duplicated some routines in Mapping_file (which I can clean up), but I'm not sure if we should keep the original dead(ly interesting) code? (it would still be available in the git history if needed)
  • The store.X.reachable could be reversed in place to save a few Mb... Do we know how large it gets in practice?

@codecov-commenter
Copy link

codecov-commenter commented Sep 14, 2022

Codecov Report

Merging #2085 (83984d6) into main (2f55642) will decrease coverage by 0.17%.
The diff coverage is 48.71%.

@@            Coverage Diff             @@
##             main    #2085      +/-   ##
==========================================
- Coverage   68.23%   68.05%   -0.18%     
==========================================
  Files         134      134              
  Lines       16096    16060      -36     
==========================================
- Hits        10983    10930      -53     
- Misses       5113     5130      +17     
Impacted Files Coverage Δ
src/irmin-pack/unix/inode.ml 12.50% <0.00%> (-20.84%) ⬇️
src/irmin-pack/unix/gc_worker.ml 4.31% <2.50%> (-1.66%) ⬇️
src/irmin-pack/unix/pack_store.ml 81.67% <50.00%> (-1.31%) ⬇️
src/irmin-pack/unix/pack_key.ml 70.83% <58.33%> (-3.53%) ⬇️
src/irmin-pack/unix/mapping_file.ml 89.88% <92.85%> (-3.99%) ⬇️
src/irmin-test/store.ml 94.89% <0.00%> (-0.06%) ⬇️
src/irmin-fs/unix/irmin_fs_unix.ml 69.03% <0.00%> (+1.29%) ⬆️
src/irmin-pack/layout.ml 84.00% <0.00%> (+2.00%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@metanivek metanivek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! From what I know and my first pass review, I think this is a positive change. I'll also think more about your questions and possible implications.

Do you know how big the sorted file is? You say it is large, so it's very nice that this PR eliminates this disk usage!

I also would champion a clean up of the create_rev code that you copy-pasted but maybe in a follow up PR.

src/irmin-pack/unix/gc.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/mapping_file_intf.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/mapping_file_intf.ml Outdated Show resolved Hide resolved
@art-w
Copy link
Contributor Author

art-w commented Sep 21, 2022

Do you know how big the sorted file is? You say it is large, so it's very nice that this PR eliminates this disk usage!

Oh sorry for the confusion, it's only "large" when compared to the final store.X.mapping file. The store.X.sorted size was ~440Mb at the start (+ same size for store.X.reachable which needs to exist at the same time), then half with deduplication, then 0 since we don't need it anymore. So this saves less than 1Gb in the end for the Tezos replay trace.

I went ahead to reverse the mapping file in place, not because it saves the last few Mb, but because it's one less temporary file that we need to clean up in case of failure (... and the code is a bit shorter/simpler from my perspective ^^')

@irmaTS
Copy link

irmaTS commented Sep 21, 2022

I went ahead to reverse the mapping file in place, not because it saves the last few Mb, but because it's one less temporary file that we need to clean up in case of failure (... and the code is a bit shorter/simpler from my perspective ^^')

If this means that the mapping file used by main is modified by the GC worker, then this breaks the RW consistency.

@art-w
Copy link
Contributor Author

art-w commented Sep 21, 2022

It would break everything yes! But it is the new mapping file before it becomes visible and used by anyone :)

Copy link
Contributor

@icristescu icristescu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work! I only have some nitpicks, feel free to ignore them.

Related to the memory concerns: not only we store more entres in the hashtable, but also we used to store pairs of (offsets, unit). Also we don't know how much allocations the priority queue does. And this is memory used but the gc process, which we don't track yet in our benchmarks, I think? But I propose we go ahead with this PR, and we will see the impact when we benchmark a new release.

src/irmin-pack/unix/gc.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/gc.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/mapping_file.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/mapping_file.ml Outdated Show resolved Hide resolved
@Ngoguey42
Copy link
Contributor

I would like to benchmark this PR before merging. I'm still working on the stats

@icristescu icristescu mentioned this pull request Sep 27, 2022
34 tasks
@metanivek metanivek added this to the Irmin 3.5 milestone Oct 13, 2022
@Ngoguey42
Copy link
Contributor

Here is the GC benchmark that compares main with this PR. I have not looked in detail
pp-gc.txt

@icristescu
Copy link
Contributor

Here is the GC benchmark that compares main with this PR. I have not looked in detail
pp-gc.txt

Thanks, I had a quick look - these metrics are super nice to have btw, great work!

The worker total time is a bit down, but the maxrss is a bit up:

                       | with-gc_a  |    with-gc_b    |    pr-2085_a    |    pr-2085_b
total duration (wall)  |    5min42s |    5min45s 101% |    5min31s  97% |    5min32s  97%

initial maxrss (bytes) |   75.308 M |   75.312 M 100% |   75.176 M 100% |   77.160 M 102%
final maxrss (bytes)   | 1129.136 M | 1129.312 M 100% | 1274.100 M 113% | 1274.180 M 113%

I think we should go forward with this PR, the code is cleaner and the maxrss overhead is not so big, IMHO.

The diffs in time and maxrss come from where we expected it:

-- Worker's timings --
                               |           |      wall      | share |      sys       |      user      | %cpu | %sys | %user
mapping: of reachable          | with-gc_a | 13.306  s      |  3.9% |  0.503  s      | 12.311  s      |  96% | 3.8% |   93%
                               | with-gc_b | 13.300  s 100% |  3.9% |  0.769  s 153% | 12.022  s  98% |  96% | 5.8% |   90%
                               | pr-2085_a |  0.162  s 1.2% |  .05% |  0.027  s 5.3% |  0.036  s .29% |  39% |  17% |   22%
                               | pr-2085_b |  0.161  s 1.2% |  .05% |  0.029  s 5.7% |  0.028  s .22% |  35% |  18% |   17%
-- Worker's rusage stats per step --
                               |           |     maxrss     |    minflt    |    majflt    |    inblock     |    oublock     |    nvcsw     |   nivcsw
mapping: objects to reachable  | with-gc_a |   729_416      | 153_012      |       0      | 6_216_864      |   933_640      | 506_907      |  2_594     
                               | with-gc_b |   729_392 100% | 153_011 100% |       0  --  | 6_219_336 100% |   933_640 100% | 507_179 100% |  2_644 102%
                               | pr-2085_a | 1_131_032 155% | 254_503 166% |       0  --  | 7_081_456 114% |    82_072 8.8% | 475_903  94% |  2_686 104%
                               | pr-2085_b | 1_131_180 155% | 254_506 166% |       0  --  | 7_084_568 114% |    82_056 8.8% | 476_548  94% |  2_760 106%
mapping: of reachable          | with-gc_a | 1_129_136      | 127_831      | 118_376      |   406_744      | 1_016_928      |   1_093      |    427     
                               | with-gc_b | 1_129_312 100% | 126_266  99% | 118_368 100% |   404_752 100% | 1_016_928 100% |   1_042  95% |    426 100%
                               | pr-2085_a | 1_172_460 104% |     690 .54% |  10_355 8.7% |    13_672 3.4% |    72_632 7.1% |     226  21% |      5 1.2%
                               | pr-2085_b | 1_172_568 104% |     693 .54% |  10_363 8.8% |    15_632 3.8% |    70_680 7.0% |      95 8.7% |      8 1.9%

  • this PR does not track all the stats as main? as there are some stats n/a or set to 0
  • is it normal to have 0 in these columns?
appended_hashes        |          0 |          0  --  |          0  --  |          0  -- 
appended_offsets       |          0 |          0  --  |          0  --  |          0  -- 
total                  |  9_012_474 |  9_012_474 100% |  9_012_474 100% |  9_012_474 100%

@Ngoguey42
Copy link
Contributor

this PR does not track all the stats as main? as there are some stats n/a or set to 0

The stats only track give informations on main in Main thread during GC and Main's timings. The rest concerns the worker

is it normal to have 0 in these columns?

Yes this is normal. The worker doesn't encore anything.

@art-w
Copy link
Contributor Author

art-w commented Nov 9, 2022

Thanks a lot for the benchmarks, they are great! <3

this PR does not track all the stats as main? as there are some stats n/a or set to 0

Not sure which 0 are missing... I found these n/a values for the former prefix and the mapping file where the stats weren't recorded, but otherwise I believe the n/a are related to files that don't exist anymore :)

-- File sizes (bytes) --
                                    |  with-gc_a  |    with-gc_b     |    pr-2085_a     |    pr-2085_b
former suffix file size             | 21533.289 M | 21533.289 M 100% | 21533.289 M 100% | 21533.289 M 100%
new suffix file size                | 19092.716 M | 19102.575 M 100% | 19035.959 M 100% | 19038.785 M 100%
former prefix file size             |  1894.588 M |  1894.588 M 100% |         n/a  --  |         n/a  -- 
new prefix file size                |  1958.165 M |  1958.165 M 100% |  1958.165 M 100% |  1958.165 M 100%
former mapping file size            |    27.479 M |    27.479 M 100% |         n/a  --  |         n/a  -- 
new mapping file size               |    42.188 M |    42.188 M 100% |         n/a  --  |         n/a  -- 

@art-w art-w force-pushed the gc-reach branch 2 times, most recently from ac3d14d to 55ab9b0 Compare December 7, 2022 13:06
@art-w
Copy link
Contributor Author

art-w commented Dec 7, 2022

Rebased and added two commits to make this PR work:

  • irmin-pack: gc reachability, optimize memory by keeping only the key
    We can save quite a lot of memory during the GC traversal if we keep only the object offsets that have yet to be visited, and don't carry their hash and length (since we can recover that information from the offset)
  • irmin-pack: gc reachability, do not prefetch children hash/length
    This is a bit dirty, but it makes the GC traversal a lot faster: since we don't care anymore about the hash/length of a node children, we can avoid doing random reads on the disk by not prefetching those informations. I didn't find a nice way to do this, so I introduced a new unsafe Pack_key.Offset constructor that should never appear outside of the GC traversal.

I've removed a bunch of optimizations that didn't have a big impact, the latest benchmarks looks like: (orange is still running to check what happens with more GC iterations, I'll update later)

tezos_trio

Copy link
Member

@metanivek metanivek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing too new or scary in these changes and the memory saving is great! I don't think the solution you came up with for exposing these "optimized" keys is bad. It seems to fit with the node/pack store API.

src/irmin-pack/unix/gc_worker.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/inode.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/mapping_file_intf.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/mapping_file.ml Show resolved Hide resolved
src/irmin-pack/unix/mapping_file.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/pack_key_intf.ml Show resolved Hide resolved
src/irmin-pack/unix/pack_key.ml Outdated Show resolved Hide resolved
src/irmin-pack/unix/pack_store_intf.ml Show resolved Hide resolved
@art-w art-w force-pushed the gc-reach branch 2 times, most recently from 663b498 to ce23879 Compare December 9, 2022 15:58
@metanivek
Copy link
Member

@art-w I know it's a tad annoying but would you mind applying the parts of ce23879 (#2085) to their appropriate commits that you're fixuping? (any changes worthy of a new commit(s) should of course be self-standing)

@art-w art-w force-pushed the gc-reach branch 2 times, most recently from 1d1e8d6 to c54bf0c Compare December 16, 2022 19:25
@metanivek metanivek merged commit b6c64d7 into mirage:main Dec 19, 2022
metanivek added a commit to metanivek/opam-repository that referenced this pull request Dec 19, 2022
…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.6.0)

CHANGES:

### Changed
  - Improve GC reachability traversal to optimize memory, speed and remove
    the need for temporary files. (mirage/irmin#2085, @art-w)
metanivek added a commit to metanivek/opam-repository that referenced this pull request Feb 9, 2023
…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.6.0)

CHANGES:

### Changed

- **irmin-pack**
  - Improve GC reachability traversal to optimize memory, speed and remove
    the need for temporary files. (mirage/irmin#2085, @art-w)
metanivek added a commit to metanivek/opam-repository that referenced this pull request Feb 10, 2023
…ils, irmin-test, irmin-pack, irmin-mirage, irmin-mirage-graphql, irmin-mirage-git, irmin-http, irmin-graphql, irmin-git, irmin-fs, irmin-containers, irmin-cli, irmin-chunk and irmin-bench (3.6.0)

CHANGES:

### Changed

- **irmin-pack**
  - Improve GC reachability traversal to optimize memory, speed and remove
    the need for temporary files. (mirage/irmin#2085, @art-w)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants