
Perform L0 compaction before creating new image layers #5950

Merged
merged 4 commits into neondatabase:main on Dec 4, 2023

Conversation

@ivaxer (Contributor) commented Nov 28, 2023

Problem

If too many L0 layers accumulate before compaction, the compaction process becomes slow because timeline::get is slow. As a result of this slowdown, the pageserver generates even more L0 layers for the next iteration, further exacerbating the poor performance.

Summary of changes

Perform L0 -> L1 compaction before creating new image layers. This simple change speeds up compaction time and timeline::get by up to 5x, because timeline::get is faster on top of L1 layers.
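
To make the new ordering concrete, here is a minimal, self-contained sketch of the idea; the types and method names below are illustrative stand-ins, not the actual pageserver API:

// Illustrative sketch only: stand-in types, not the real pageserver code.
struct Timeline;

impl Timeline {
    /// Merge the overlapping L0 delta layers into non-overlapping L1 layers.
    fn compact_level0(&self) { /* ... */ }

    /// Materialize image layers by reconstructing pages via timeline::get.
    fn create_image_layers(&self) { /* ... */ }
}

fn compact_iteration(timeline: &Timeline) {
    // Previously, image layers were created first, so every page
    // reconstruction had to search through many overlapping L0 layers.
    // Running the L0 -> L1 compaction first means the subsequent image
    // layer creation reads through a few non-overlapping L1 layers instead.
    timeline.compact_level0();
    timeline.create_image_layers();
}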

Layer map:

[image: draw-timeline-00007c93000032d6000076de00001d92-0000525d0000687100006cee000051c2-current]

Reconstruct data latency before:

[image: before]

and after:

[image: after]

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@ivaxer ivaxer requested a review from a team as a code owner November 28, 2023 13:06
@ivaxer ivaxer requested review from jcsp and removed request for a team November 28, 2023 13:06
@vipvap vipvap mentioned this pull request Nov 28, 2023
@hlinnaka (Contributor) commented:

Interesting. The idea with the current code is that if you're about to create image layers anyway, it's a waste of time and unnecessary write amplification to compact the deltas, because subsequent getpage requests will make use of the image layers and won't look at the deltas anymore. But if it greatly speeds up the image layer creation, maybe we should indeed do it as a quick fix.

Another approach would be to optimize the getpage requests during image layer creation. We currently reconstruct the pages with individual "retail" getpage calls, but if you know you're going to call getpage for every page, you could surely do it more efficiently. In other words, we could merge the WAL records from different L0 layers "on-the-fly", as part of image layer creation, similar to how the L0 compaction does.
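
As a rough, purely hypothetical sketch of that idea (the types below are made up for illustration and are not the actual pageserver structures), one pass over the L0 layers could group the WAL records per page and invoke WAL redo exactly once per page:

use std::collections::BTreeMap;

// Hypothetical stand-in types for illustration only.
type Key = u64;
type Lsn = u64;
type WalRecord = Vec<u8>;
type PageImage = Vec<u8>;

// Walk all L0 delta layers once, group the WAL records by page, and apply
// redo once per page, instead of issuing a retail getpage call per page
// that re-searches every delta layer.
fn build_images(
    delta_layers: &[BTreeMap<Key, Vec<(Lsn, WalRecord)>>],
    base_images: &BTreeMap<Key, PageImage>,
    redo: impl Fn(&PageImage, &[(Lsn, WalRecord)]) -> PageImage,
) -> BTreeMap<Key, PageImage> {
    // Accumulate the records for every key across all layers in one pass.
    let mut merged: BTreeMap<Key, Vec<(Lsn, WalRecord)>> = BTreeMap::new();
    for layer in delta_layers {
        for (key, records) in layer {
            merged.entry(*key).or_default().extend(records.iter().cloned());
        }
    }
    // Apply the accumulated records on top of the base image, once per page.
    merged
        .into_iter()
        .map(|(key, mut records)| {
            records.sort_by_key(|(lsn, _)| *lsn);
            let base = base_images.get(&key).cloned().unwrap_or_default();
            (key, redo(&base, &records))
        })
        .collect()
}

A real implementation would stream the layers in key order instead of materializing everything in memory; the point of the sketch is only the single-pass merge.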

@jcsp (Collaborator) commented Nov 28, 2023

In other words, we could merge the WAL records from different L0 layers "on-the-fly", as part of image layer creation, similar to how the L0 compaction does.

Yep, we'd like to do this at some point in the medium-term future: #4979

In the near term, the measurements in the PR description are quite persuasive... Conversely, we might need to think a bit and see if we can identify write patterns where this model becomes particularly painful.

@ivaxer (Contributor, Author) commented Nov 28, 2023

The idea with the current code is that if you're about to create image layers anyway, it's a waste of time and unnecessary write amplification to compact the deltas, because subsequent getpage requests will make use of the image layers and won't look at the deltas anymore.

@hlinnaka, thank you. Is it correct to assume that when a PITR horizon is set, we will perform L0 to L1 compaction anyway?

We currently reconstruct the pages with individual "retail" getpage calls, but if you know you're going to call getpage for every page, you could surely do it more efficiently.

Interesting. Btw, even after L0 -> L1 compaction, creating new images is a long process. It's CPU-bound on a single core, and disk usage is low:

iostat -xmdy 10 5 /dev/vda
Linux 5.15.0-88-generic (pageserver2) 	11/28/2023 	_x86_64_	(12 CPU)


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            106.50     13.11     0.00   0.00    0.19   126.05   30.30     12.87     8.00  20.89    6.47   434.79    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.22  43.28


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            105.00     12.99     0.00   0.00    0.20   126.69   29.90     12.84     5.50  15.54    1.67   439.83    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.07  42.24


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            105.90     13.06     0.00   0.00    0.21   126.24   32.60     12.84     5.40  14.21    1.55   403.41    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.07  43.00


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            106.20     13.09     0.00   0.00    0.22   126.25   37.80     12.86     7.40  16.37    1.74   348.48    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.09  43.20


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda            106.10     13.08     0.00   0.00    0.20   126.27   33.80     12.86     7.60  18.36    1.62   389.61    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.08  43.44

Should compaction be reworked for multi-threading, or is there still potential for further optimization on a single core?

@jcsp (Collaborator) commented Nov 28, 2023

Should compaction be reworked for multi-threading, or is there still potential for further optimization on a single core?

We probably won't want to parallelize compaction itself:

  • When pageserver sharding is done (Epic: pageserver support for sharding phase 1 #5507 etc.), large databases will be broken up into smaller shards that compact separately.
  • I anticipate we can make one core go much faster, especially by using a more streaming approach to I/O, and pipelining reads so that we read the next pages' input data while we're busy walredo'ing the previous page.

If we got to the point where our I/O was all nice and efficient and the actual walredo (pure CPU) part became the bottleneck, then I would perhaps look at a localized worker pool of walredo processes (currently we have one per tenant), and/or make pure Rust implementations of certain common WAL operations (but I'm told this is harder than it sounds).
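
For what it's worth, here is a toy, entirely hypothetical sketch of that pipelining idea using only standard-library channels: a reader thread keeps a bounded read-ahead window of reconstruct data while the calling thread does the CPU-bound redo.

use std::sync::mpsc;
use std::thread;

// Hypothetical sketch: overlap I/O (reading reconstruct data for upcoming
// pages) with the CPU-bound WAL redo of the current page.
fn pipelined_image_creation<K, D, P>(
    keys: Vec<K>,
    read_reconstruct_data: impl Fn(&K) -> D + Send + 'static,
    walredo: impl Fn(D) -> P,
) -> Vec<P>
where
    K: Send + 'static,
    D: Send + 'static,
{
    // Bounded channel = bounded read-ahead window.
    let (tx, rx) = mpsc::sync_channel(8);
    let reader = thread::spawn(move || {
        for key in keys {
            // Reads run ahead of redo by at most the channel capacity.
            if tx.send(read_reconstruct_data(&key)).is_err() {
                break;
            }
        }
    });
    // Redo the pages in order as their input data becomes available.
    let images = rx.into_iter().map(walredo).collect();
    let _ = reader.join();
    images
}

In the pageserver itself this would presumably use async tasks rather than threads; the sketch only illustrates overlapping the reads with redo.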

github-actions bot commented Nov 28, 2023

2436 tests run: 2339 passed, 0 failed, 97 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage (full report)

  • functions: 54.5% (9288 of 17029 functions)
  • lines: 82.0% (53959 of 65790 lines)

The comment gets automatically updated with the latest test results
dd76c94 at 2023-12-04T11:47:32.842Z

@ivaxer (Contributor, Author) commented Nov 28, 2023

Btw, even after L0 -> L1 compaction, creating new images is a long process. It's CPU-bound on a single core, and disk usage is low:

My apologies, I mistakenly compared the release build (before) with the debug build (after) in the description 🙂. After L0 compaction, image creation works much better. Although it remains CPU-limited, it now performs more I/O:

$ iostat -xmdy 10 5 /dev/vda
Linux 5.15.0-88-generic (pageserver2) 	11/28/2023 	_x86_64_	(12 CPU)


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda           1314.49    158.81     0.00   0.00    0.17   123.71  270.73    119.48    25.17   8.51    6.97   451.90    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.11  99.90


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
vda           1156.90    143.07     0.00   0.00    0.16   126.64  253.40    141.08    23.40   8.45   10.30   570.11    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.79  99.96
[image: Screenshot 2023-11-28 at 22 34 35]

@koivunej (Member) left a comment


The change makes sense to me; I cannot see any reason not to merge this.

@ivaxer (Contributor, Author) commented Nov 30, 2023

The change makes sense to me; I cannot see any reason not to merge this.

Great! Glad to hear that.

I fixed the flaky test_issue_5878. I had used set_tenant_config with a single option, which reset the other options to their defaults. Because of this, the compaction loop could run concurrently with the manual compaction.

@koivunej , could you re-run CI please?

@koivunej koivunej enabled auto-merge (squash) December 4, 2023 12:20
@koivunej koivunej enabled auto-merge (squash) December 4, 2023 12:24
@koivunej koivunej merged commit eae49ff into neondatabase:main Dec 4, 2023
68 of 70 checks passed