Compaction: sort on slices directly instead of kmerge #4839

Merged
merged 2 commits into main from arpad/pageserver_io_async_kmerge_slice on Aug 3, 2023

Conversation

@arpad-m (Member) commented Jul 28, 2023

Problem

The k-merge in pageserver compaction currently relies on iterators over the keys and also over the values. This approach does not support async code, because we are using iterators, and those don't support async in general. Also, the k-merge implementation we use doesn't support async either. Instead, as we already load all the keys into memory, we can just sort in memory.

Summary of changes

The PR can be read commit-by-commit, but most importantly, it:

  • Stops using kmerge in compaction, using slice sorting instead.
  • Makes `load_keys` and `load_val_refs` async, using `Handle::block_on` in the compaction code, as we don't want to turn the compaction function, called inside `spawn_blocking`, into an async fn (see the sketch below).
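A minimal sketch of the resulting shape, with stand-in types and invented signatures (`Key`, `Lsn`, `DeltaLayer`, and `compact` here are simplifications, and `anyhow` is assumed for errors; the real pageserver code differs):

```rust
use tokio::runtime::Handle;

type Key = u64; // stand-ins for the real pageserver types
type Lsn = u64;

struct DeltaLayer;

impl DeltaLayer {
    // Now an async fn, since it ultimately performs async I/O.
    async fn load_keys(&self) -> anyhow::Result<Vec<(Key, Lsn, u64)>> {
        Ok(Vec::new()) // stub
    }
}

// Runs inside spawn_blocking, so we bridge into async with Handle::block_on
// instead of turning the whole compaction function into an async fn.
fn compact(handle: &Handle, deltas: &[DeltaLayer]) -> anyhow::Result<Vec<(Key, Lsn, u64)>> {
    let mut all_keys = Vec::new();
    for layer in deltas {
        all_keys.extend(handle.block_on(layer.load_keys())?);
    }
    // All keys are already in memory, so sort the slice directly instead
    // of k-merging per-layer iterators. The sort is stable, so entries
    // for the same key keep their input order.
    all_keys.sort_by_key(|&(key, _lsn, _size)| key);
    Ok(all_keys)
}
```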

Builds on top of #4836, part of #4743

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires a public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist

@github-actions bot commented Jul 28, 2023

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_crafted_wal_end[wal_record_crossing_segment_followed_by_small_one]: debug

@koivunej changed the base branch from main to arpad/pageserver_io_async_kmerge July 31, 2023 11:23
@arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch 2 times, most recently from 374cef8 to 92784e3 on July 31, 2023 18:18
Base automatically changed from arpad/pageserver_io_async_kmerge to main August 1, 2023 11:38
@arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch from 92784e3 to 10d2740 on August 1, 2023 11:43
@arpad-m marked this pull request as ready for review August 1, 2023 11:44
@arpad-m requested review from a team as code owners August 1, 2023 11:44
@arpad-m requested review from save-buffer and koivunej and removed the request for a team August 1, 2023 11:44
@koivunej (Contributor) left a comment

I understand this already implements what I suggested; a single walk to get sizes and valuerefs. Looking good.

Always enjoy a decrease in:

  • the number of virtual calls
  • the amount of duplicated in-memory contents
  • the number of lines of code (LOC)

@arpad-m (Member, Author) commented Aug 1, 2023

> I understand this already implements what I suggested; a single walk to get sizes and valuerefs. Looking good.

Indeed, I have pushed a commit to merge the two functions, but now there is a failure in `tenant::tests::test_random_updates`, and I can reproduce it locally, so it seems to be valid. I am still investigating, but the `load_keys` function does some early returning when `last.0 == delta_key.key()` holds, while the `load_val_refs` function doesn't do that.

@arpad-m (Member, Author) commented Aug 1, 2023

Mhh yeah, if I print the sizes of the arrays, it seems that the key search yields far fewer items than the value search:

val refs len: 1000, becoming  1000
val refs len: 1000, becoming  2000
val refs len: 1000, becoming  3000
val refs len: 1000, becoming  4000
val refs len: 1000, becoming  5000
val refs len: 1000, becoming  6000
val refs len: 1000, becoming  7000
val refs len: 1000, becoming  8000
val refs len: 1000, becoming  9000
val refs len: 1000, becoming 10000
keys     len:  641, becoming   641
keys     len:  635, becoming  1276
keys     len:  653, becoming  1929
keys     len:  619, becoming  2548
keys     len:  627, becoming  3175
keys     len:  629, becoming  3804
keys     len:  631, becoming  4435
keys     len:  627, becoming  5062
keys     len:  625, becoming  5687
keys     len:  642, becoming  6329

The failure is gone if I remove the `return true` from the `load_keys` function, but that might mean that we now iterate over more keys than needed...
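For context, a hedged sketch of the grouping that makes the keys vec shorter (stand-in types and an invented helper name): consecutive b-tree entries for the same key are folded into one keys entry with the sizes summed, while the val-refs side keeps one entry per (key, lsn).

```rust
type Key = u64; // stand-ins for the real pageserver types
type Lsn = u64;

// Consecutive entries for the same key (differing only in lsn) collapse
// into one keys-vec entry whose size is the sum of the sub-sizes.
fn push_key(keys: &mut Vec<(Key, Lsn, u64)>, key: Key, lsn: Lsn, size: u64) {
    match keys.last_mut() {
        Some(last) if last.0 == key => last.2 += size, // same key: merge sizes
        _ => keys.push((key, lsn, size)),
    }
}
```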

@arpad-m (Member, Author) commented Aug 2, 2023

Okay, more investigation reveals that the code in question comes from #1927, more specifically the commit "Optimize stoage keys iterator" (link). It feels weird to me that we exit the search early, and also that we extend the length of the last key to cover the entire rest of the file.

@arpad-m (Member, Author) commented Aug 2, 2023

I've pushed a commit which I hope fixes the test failures... but I don't understand why this prior optimization worked. I think that, due to the reordering, it's not possible to reach a similarly optimized state where a chunk of the keys is dropped once we have a shared list (for `tenant::tests::test_random_updates`, it's roughly half, as observable above)... so I'm not sure whether having a shared list is worth it.

@arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch 2 times, most recently from 64cbebf to ab36f4a on August 2, 2023 20:54
@arpad-m (Member, Author) commented Aug 2, 2023

I've looked at what is done with the keys, and am now more confident that all the merging did was ensure that blobs for the same key (but different lsns) would end up in the same layer file. To preserve that, I've added a commit to coalesce the iterator. It won't do precisely the same coalescing though: the old approach would not merge key pairs from different files, and in the new vec we have no good way to distinguish files, so instead we just look at the size: if the combined key reaches the target file size, we don't merge. This might still create files larger than the target file size, but they will stay below double the target file size.
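A rough sketch of that coalescing (stand-in types and an invented function name; the real cut conditions in compaction are more involved):

```rust
type Key = u64; // stand-ins for the real pageserver types
type Lsn = u64;

// Merge adjacent entries for the same key in the sorted vec, but stop
// merging once the running size has reached the target file size, so
// the downstream code can cut a new output file. A merged entry can
// thus exceed the target size, but stays below double of it.
fn coalesce(sorted: Vec<(Key, Lsn, u64)>, target_file_size: u64) -> Vec<(Key, Lsn, u64)> {
    let mut out: Vec<(Key, Lsn, u64)> = Vec::new();
    for (key, lsn, size) in sorted {
        match out.last_mut() {
            Some(last) if last.0 == key && last.2 < target_file_size => {
                last.2 += size; // same key and still under the threshold: merge
            }
            _ => out.push((key, lsn, size)),
        }
    }
    out
}
```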

@arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch from dc9889b to 10d2740 on August 3, 2023 12:15
@arpad-m (Member, Author) commented Aug 3, 2023

I've pushed an earlier state of the PR (with two separate vecs) to this branch to enable faster merging. This will unblock some downstream refactors and separate the more involved changes from the ones that are just refactors. For the changes that were originally part of this PR, see #4888.

@arpad-m merged commit 416c14b into main on Aug 3, 2023
59 checks passed
@arpad-m deleted the arpad/pageserver_io_async_kmerge_slice branch August 3, 2023 13:30
arpad-m added a commit that referenced this pull request Aug 3, 2023
## Problem

`DiskBtreeReader::get` and `DiskBtreeReader::visit` both call `read_blk`
internally, which we would like to make async in the future. This PR
focuses on making the interface of these two functions `async`. There is
further work to be done in the form of making `visit` non-recursive,
similar to #4838. For that, see #4884.

Builds on top of #4839, part of
#4743

## Summary of changes

Make `DiskBtreeReader::get` and `DiskBtreeReader::visit` async functions
and `await` in the places that call these functions.
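
An illustrative sketch of the interface change with a stand-in type (the real `DiskBtreeReader` signatures carry more parameters):

```rust
use std::io;

struct DiskBtreeReader; // stand-in for the real reader type

impl DiskBtreeReader {
    // read_blk is the call we ultimately want to make async.
    async fn read_blk(&self, _blknum: u32) -> io::Result<Vec<u8>> {
        Ok(vec![0u8; 8192]) // stub block read
    }

    // get (and likewise visit) becomes an async fn and awaits read_blk...
    async fn get(&self, _key: &[u8]) -> io::Result<Option<u64>> {
        let _block = self.read_blk(0).await?;
        Ok(None) // stub lookup
    }
}

// ...and every call site gains an .await:
async fn lookup(reader: &DiskBtreeReader) -> io::Result<Option<u64>> {
    reader.get(b"some-key").await
}
```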
arpad-m added a commit that referenced this pull request Aug 4, 2023
## Problem

The functions `DeltaLayer::load_inner` and `ImageLayer::load_inner` call
`read_blk` internally, which we would like to turn into an async fn.

## Summary of changes

We switch from `once_cell`'s `OnceCell` implementation to the one in
`tokio` in order to be able to call an async `get_or_try_init` function.

Builds on top of #4839, part of #4743
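
A minimal sketch of the switch (stand-in types; the real loading code differs): `tokio::sync::OnceCell::get_or_try_init` is async, so the initializer itself can await.

```rust
use tokio::sync::OnceCell;

struct LoadedState; // stand-in for the parsed layer contents

struct ImageLayer {
    inner: OnceCell<LoadedState>,
}

impl ImageLayer {
    async fn load(&self) -> anyhow::Result<&LoadedState> {
        self.inner
            .get_or_try_init(|| async {
                // Async I/O (e.g. an async read_blk) is allowed here,
                // unlike with once_cell's synchronous initializer.
                Ok::<_, anyhow::Error>(LoadedState)
            })
            .await
    }
}
```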
arpad-m added a commit that referenced this pull request Aug 7, 2023
## Problem

PR #4839 didn't output the keys/values in lsn order: for a given key,
the lsns were kept in incoming file order. I think ordering by lsn is
expected.

## Summary of changes

We now also sort by `(key, lsn)`, like we did before #4839.
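
As a sketch over stand-in types, the fix amounts to including the lsn in the sort key:

```rust
type Key = u64; // stand-ins for the real pageserver types
type Lsn = u64;

fn sort_entries(entries: &mut [(Key, Lsn, u64)]) {
    // Sorting by (key, lsn) instead of key alone restores lsn order
    // within each key, independent of incoming file order.
    entries.sort_by_key(|&(key, lsn, _size)| (key, lsn));
}
```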
arpad-m added a commit that referenced this pull request Aug 16, 2023
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in #4839 (comment), we would
ideally traverse the b-tree just once during compaction.

After #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in #4839 (comment) and the following
comment: the key vec creation combines multiple entries for which the
lsn changes but the key stays the same into one, with the size being the
sum of the sub-sizes. In SQL, this would correspond to something like
`SELECT key, lsn, SUM(size) FROM b_tree GROUP BY key;` versus
`SELECT key, lsn, val_ref FROM b_tree;`. Therefore, the join operation
is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction for determining when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but due to the grouping via the key, regardless of the threshold it only
starts a new file when either a new key or a new delta file is
encountered.

The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for one detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now we don't have an easy
way to tell when a new input delta layer file is encountered, so we
cannot easily create multiple entries on that basis.

To prevent possibly infinite growth, our new grouping code compares the
combined size with a threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. This is a
tradeoff, however: if the threshold is too small, we risk putting
entries for the same key into multiple layer files, but if the threshold
is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.

We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.

Builds on top of #4839 .
koivunej added a commit that referenced this pull request Oct 26, 2023
…#4938)

Implement a new `struct Layer` abstraction which manages its download
state internally, requiring no LayerMap locking or rewriting to download
or evict, providing the property "you have a layer, you can read it".
The new `struct Layer` provides the ability to keep the file resident
via a RAII structure for new layers which still need to be uploaded.
The previous solution used `RemoteTimelineClient::wait_completion`,
which led to bugs like #5639. Eviction, and the final local deletion
after garbage collection, is done via an Arc'd value's `Drop`.

With a single `struct Layer`, the open-ended `trait Layer`, `trait
PersistentLayer` and `struct RemoteLayer` are removed, following the
observation that compaction could be simplified by simply not using any
of the traits in between: #4839.

The new `struct Layer` is a preliminary to removing
`Timeline::layer_removal_cs`, documented in #4745.

Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058,
#5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split-offs:
#5057, #5134.
arpad-m added a commit that referenced this pull request May 9, 2024
This PR does two things:

First, it fixes a bug in tiered compaction's k-merge implementation.
It ignored the lsn of a key during ordering, so multiple updates of the
same key could be read in arbitrary order, say from different layers.
For example, if the layers `[(a, 2), (b, 3)]` and `[(a, 1), (c, 2)]` are
in the heap, they might return `(a, 2)` before `(a, 1)`.

Ultimately, this change wasn't enough to fix the ordering issues in
#7296; in other words, there are likely still bugs in the k-merge. So,
as the second thing, we switch away from the k-merge to an in-memory
sort, similar to #4839, but leave the code around to be improved and
maybe switched back to later.

Part of #7296
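
For illustration, a minimal k-merge sketch (invented names and simplified types, not the tiered-compaction code) where the heap orders entries by `(key, lsn)`, so the example above comes out as `(a, 1)` before `(a, 2)`:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

type Key = char; // stand-ins matching the (a, 2)-style example above
type Lsn = u64;

fn kmerge(layers: Vec<Vec<(Key, Lsn)>>) -> Vec<(Key, Lsn)> {
    let mut iters: Vec<_> = layers.into_iter().map(|l| l.into_iter()).collect();
    // Reverse turns std's max-heap into a min-heap over (key, lsn, source).
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some((key, lsn)) = it.next() {
            heap.push(Reverse((key, lsn, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((key, lsn, i))) = heap.pop() {
        out.push((key, lsn));
        // Refill the heap from the iterator the popped entry came from.
        if let Some((k, l)) = iters[i].next() {
            heap.push(Reverse((k, l, i)));
        }
    }
    out
}
```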
a-masterov pushed a commit that referenced this pull request May 20, 2024