
Perform L0 compaction before creating new image layers #5950

Merged
merged 4 commits into neondatabase:main on Dec 4, 2023

Conversation

@ivaxer (Contributor) commented Nov 28, 2023

Problem

If too many L0 layers accumulate before compaction, the compaction process becomes slow because timeline::get is slow. As a result of this slowdown, the pageserver generates even more L0 layers for the next iteration, further exacerbating the poor performance.

Summary of changes

Perform L0 -> L1 compaction before creating new image layers. This simple change speeds up compaction time and timeline::get by up to 5x, because timeline::get is faster on top of L1 layers.
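
To make the new ordering concrete, here is a minimal, self-contained sketch of the idea; the types and method names below are illustrative stand-ins, not the actual pageserver API:

// Illustrative sketch only: stand-in types, not the real pageserver code.
struct Timeline;

impl Timeline {
    /// Merge the overlapping L0 delta layers into non-overlapping L1 layers.
    fn compact_level0(&self) { /* ... */ }

    /// Materialize image layers by reconstructing pages via timeline::get.
    fn create_image_layers(&self) { /* ... */ }
}

fn compact_iteration(timeline: &Timeline) {
    // Previously, image layers were created first, so every page
    // reconstruction had to search through many overlapping L0 layers.
    // Running the L0 -> L1 compaction first means the subsequent image
    // layer creation reads through a few non-overlapping L1 layers instead.
    timeline.compact_level0();
    timeline.create_image_layers();
}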

Layer map:

[image: draw-timeline-00007c93000032d6000076de00001d92-0000525d0000687100006cee000051c2-current]

Reconstruct data latency before:

[image: before]

and after:

[image: after]

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@ivaxer ivaxer requested a review from a team as a code owner November 28, 2023 13:06
@ivaxer ivaxer requested review from jcsp and removed request for a team November 28, 2023 13:06
@vipvap vipvap mentioned this pull request Nov 28, 2023
@hlinnaka (Contributor) commented:

Interesting. The idea with the current code is that if you're about to create image layers anyway, it's a waste of time and unnecessary write amplification to compact the deltas, because subsequent getpage requests will make use of the image layers and won't look at the deltas anymore. But if it greatly speeds up the image layer creation, maybe we should indeed do it as a quick fix.

Another approach would be to optimize the getpage requests during image layer creation. We currently reconstruct the pages with individual "retail" getpage calls, but if you know you're going to call getpage for every page, you could surely do it more efficiently. In other words, we could merge the WAL records from different L0 layers "on-the-fly", as part of image layer creation, similar to how the L0 compaction does.
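
As a rough, purely hypothetical sketch of that idea (the types below are made up for illustration and are not the actual pageserver structures), one pass over the L0 layers could group the WAL records per page and invoke WAL redo exactly once per page:

use std::collections::BTreeMap;

// Hypothetical stand-in types for illustration only.
type Key = u64;
type Lsn = u64;
type WalRecord = Vec<u8>;
type PageImage = Vec<u8>;

// Walk all L0 delta layers once, group the WAL records by page, and apply
// redo once per page, instead of issuing a retail getpage call per page
// that re-searches every delta layer.
fn build_images(
    delta_layers: &[BTreeMap<Key, Vec<(Lsn, WalRecord)>>],
    base_images: &BTreeMap<Key, PageImage>,
    redo: impl Fn(&PageImage, &[(Lsn, WalRecord)]) -> PageImage,
) -> BTreeMap<Key, PageImage> {
    // Accumulate the records for every key across all layers in one pass.
    let mut merged: BTreeMap<Key, Vec<(Lsn, WalRecord)>> = BTreeMap::new();
    for layer in delta_layers {
        for (key, records) in layer {
            merged.entry(*key).or_default().extend(records.iter().cloned());
        }
    }
    // Apply the accumulated records on top of the base image, once per page.
    merged
        .into_iter()
        .map(|(key, mut records)| {
            records.sort_by_key(|(lsn, _)| *lsn);
            let base = base_images.get(&key).cloned().unwrap_or_default();
            (key, redo(&base, &records))
        })
        .collect()
}

A real implementation would stream the layers in key order instead of materializing everything in memory; the point of the sketch is only the single-pass merge.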

@jcsp (Collaborator) commented Nov 28, 2023

In other words, we could merge the WAL records from different L0 layers "on-the-fly", as part of image layer creation, similar to how the L0 compaction does.

Yep, we'd like to do this at some point in the medium-term future: #4979

In the near term, the measurements in the PR description are quite persuasive... Conversely, we might need to think a bit and see if we can identify write patterns where this model becomes particularly painful.

@ivaxer (Contributor, Author) commented Nov 28, 2023

The idea with the current code is that if you're about to create image layers anyway, it's a waste of time and unnecessary write amplification to compact the deltas, because subsequent getpage requests will make use of the image layers and won't look at the deltas anymore.

@hlinnaka, thank you. Is it correct to assume that when a PITR horizon is set, we will perform L0 to L1 compaction anyway?

We currently reconstruct the pages with individual "retail" getpage calls, but if you know you're going to call getpage for every page, you could surely do it more efficiently.

Interesting. Btw, even after L0 -> L1 compaction, creating new images is a long process. It's CPU-bound on a single core, and disk usage is low:

iostat -xmdy 10 5 /dev/vda
Linux 5.15.0-88-generic (pageserver2) 	11/28/2023 	_x86_64_	(12 CPU)


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            106.50     13.11     0.00   0.00    0.19   126.05   30.30     12.87     8.00  20.89    6.47   434.79    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.22  43.28


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            105.00     12.99     0.00   0.00    0.20   126.69   29.90     12.84     5.50  15.54    1.67   439.83    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.07  42.24


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            105.90     13.06     0.00   0.00    0.21   126.24   32.60     12.84     5.40  14.21    1.55   403.41    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.07  43.00


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            106.20     13.09     0.00   0.00    0.22   126.25   37.80     12.86     7.40  16.37    1.74   348.48    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.09  43.20


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            106.10     13.08     0.00   0.00    0.20   126.27   33.80     12.86     7.60  18.36    1.62   389.61    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.08  43.44

Should compaction be reworked for multi-threading, or is there still potential for further optimization on a single core?

@jcsp (Collaborator) commented Nov 28, 2023

Should compaction be reworked for multi-threading, or is there still potential for further optimization on a single core?

We probably won't want to parallelize compaction itself:

  • When pageserver sharding is done (Epic: pageserver support for sharding phase 1 #5507 etc.), large databases will be broken up into smaller shards that compact separately.
  • I anticipate we can make one core go much faster, especially by using a more streaming approach to I/O, and pipelining reads so that we read the next pages' input data while we're busy walredo'ing the previous page.

If we got to the point where our I/O was all nice and efficient and the actual walredo (pure CPU) part became the bottleneck, then I would perhaps look at a localized worker pool of walredo processes (currently we have one per tenant), and/or make pure Rust implementations of certain common WAL operations (but I'm told this is harder than it sounds).
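
For what it's worth, here is a toy, entirely hypothetical sketch of that pipelining idea using only standard-library channels: a reader thread keeps a bounded read-ahead window of reconstruct data while the calling thread does the CPU-bound redo.

use std::sync::mpsc;
use std::thread;

// Hypothetical sketch: overlap I/O (reading reconstruct data for upcoming
// pages) with the CPU-bound WAL redo of the current page.
fn pipelined_image_creation<K, D, P>(
    keys: Vec<K>,
    read_reconstruct_data: impl Fn(&K) -> D + Send + 'static,
    walredo: impl Fn(D) -> P,
) -> Vec<P>
where
    K: Send + 'static,
    D: Send + 'static,
{
    // Bounded channel = bounded read-ahead window.
    let (tx, rx) = mpsc::sync_channel(8);
    let reader = thread::spawn(move || {
        for key in keys {
            // Reads run ahead of redo by at most the channel capacity.
            if tx.send(read_reconstruct_data(&key)).is_err() {
                break;
            }
        }
    });
    // Redo the pages in order as their input data becomes available.
    let images = rx.into_iter().map(walredo).collect();
    let _ = reader.join();
    images
}

In the pageserver itself this would presumably use async tasks rather than threads; the sketch only illustrates overlapping the reads with redo.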

github-actions bot commented Nov 28, 2023

2436 tests run: 2339 passed, 0 failed, 97 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage (full report)

  • functions: 54.5% (9288 of 17029 functions)
  • lines: 82.0% (53959 of 65790 lines)

The comment gets automatically updated with the latest test results
dd76c94 at 2023-12-04T11:47:32.842Z

@ivaxer (Contributor, Author) commented Nov 28, 2023

Btw, even after L0 -> L1 compaction, creating new images is a long process. It's CPU-bound on a single core, and disk usage is low:

My apologies, I mistakenly compared the release build (before) with the debug build (after) in the description 🙂. After L0 compaction, image creation works much better. Although it remains CPU-limited, it now performs more I/O:

$ iostat -xmdy 10 5 /dev/vda
Linux 5.15.0-88-generic (pageserver2) 	11/28/2023 	_x86_64_	(12 CPU)


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda           1314.49    158.81     0.00   0.00    0.17   123.71  270.73    119.48    25.17   8.51    6.97   451.90    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.11  99.90


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda           1156.90    143.07     0.00   0.00    0.16   126.64  253.40    141.08    23.40   8.45   10.30   570.11    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.79  99.96
[image: Screenshot 2023-11-28 at 22 34 35]

@koivunej (Member) left a comment


The change makes sense to me; I cannot see any reason not to merge this.

@ivaxer (Contributor, Author) commented Nov 30, 2023

The change makes sense to me; I cannot see any reason not to merge this.

Great! Glad to hear that.

I fixed the flaky test_issue_5878. I had used set_tenant_config with a single option, which reset the other options to their defaults. Because of this, the compaction loop could run concurrently with the manual compaction.

@koivunej , could you re-run CI please?

@koivunej koivunej enabled auto-merge (squash) December 4, 2023 12:20
@koivunej koivunej enabled auto-merge (squash) December 4, 2023 12:24
@koivunej koivunej merged commit eae49ff into neondatabase:main Dec 4, 2023
68 of 70 checks passed