
pageserver: Reduce tracing overhead in timeline::get #6115

Merged

Conversation

Contributor

@ivaxer ivaxer commented Dec 12, 2023

Problem

The compaction process (specifically the image layer reconstruction part) is lagging behind WAL ingest (at ~10-15 MB/s) for medium-sized tenants (30-50 GB). A CPU profile shows that a significant amount of time (see flamegraph) is spent in tracing::span::Span::new.

mainline (commit: 0ba4cae):
reconstruct-mainline-0ba4cae491c2

Summary of changes

By lowering the tracing level of the spans in get_value_reconstruct_data and get_or_maybe_download from info to debug, we can reduce the overhead of span creation in prod environments. On my system, this sped up the image reconstruction process by 60% (from 14,500 to 23,160 page reconstructions per second).

pr:
reconstruct-opt-2

create_image_layers() (it is CPU-bound on a single core here), mainline vs. PR:
image
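The change can be sketched as follows (illustrative only; the real signatures live in pageserver/src/tenant/storage_layer and differ from this). With `level = "debug"`, a subscriber filtered at INFO disables the span, so the Span::new cost disappears from the hot path:

```rust
use tracing::instrument;

impl Layer {
    // before: #[instrument(skip_all, fields(layer = %self))]
    //
    // An INFO-level span was created on every call, even when the layer
    // was already resident. At DEBUG level, an info-filtered subscriber
    // reports the span as disabled and span creation is nearly free.
    #[instrument(level = "debug", skip_all, fields(layer = %self))]
    async fn get_or_maybe_download(&self) -> Result<ResidentLayer, DownloadError> {
        // ... unchanged body ...
    }
}
```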

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@ivaxer ivaxer marked this pull request as ready for review December 12, 2023 15:48
@ivaxer ivaxer requested a review from a team as a code owner December 12, 2023 15:48
@ivaxer ivaxer requested review from problame and removed request for a team December 12, 2023 15:48
@bayandin bayandin added the approved-for-ci-run Changes are safe to trigger CI for the PR label Dec 12, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Dec 12, 2023
@vipvap vipvap mentioned this pull request Dec 12, 2023

github-actions bot commented Dec 12, 2023

2190 tests run: 2105 passed, 0 failed, 85 skipped (full report)


Flaky tests (1)

Postgres 15

Code coverage (full report)

  • functions: 55.1% (9680 of 17563 functions)
  • lines: 82.4% (55639 of 67543 lines)

The comment gets automatically updated with the latest test results.
ce692a6 at 2023-12-18T13:07:46.072Z ♻️

Contributor

@koivunej koivunej left a comment


I think it would be great to comment on why those are debug spans; also, I did not know plain "debug" would work, but this is a good find anyway.

I will add the comment when I reshuffle the spans to be on the public api of the layer in a follow-up PR later this week.

Contributor

@problame problame left a comment


My main concern is losing context if we log an error inside the code whose spans this PR demotes to debug.
Take for example:

tracing::error!("layer file download failed, and additionally failed to communicate this to caller: {e:?}");

Currently if we log something there it contains the layer= field.

My understanding is that we'll lose that layer field with this PR.

Of course this can be fixed by rigorously including a layer= field in all tracing::{info,warn,error}! usages in the call graph of get_or_maybe_download.
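What that would look like, as a sketch (the call site shown is hypothetical): with the enclosing span demoted, the layer identity has to be carried on each event explicitly.

```rust
// Sketch: without the info-level span's layer= field, each event in the
// call graph must attach the layer identity itself.
tracing::error!(
    layer = %self,
    "layer file download failed, and additionally failed to communicate this to caller: {e:?}"
);
```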

So I guess it's a trade-off between performance and convenience.

That being said, the performance gains are impressive.

I assume your workload didn't actually do any on-demand downloads.

Wondering if we can delay the creation of the span (remaining at info level) until we know we're off the hot path.

At least for get_or_maybe_download, it seems straightforward.
@koivunej , thoughts?

@koivunej
Contributor

koivunej commented Dec 13, 2023

Good point. The delayed span creation can be implemented by creating the span only at this line:

so the span is created only when the factory is called to actually download and initialize, not before.

The span needed there is:

#[tracing::instrument(skip_all, fields(layer=%self))]

so:

-}
+}.instrument(tracing::info_span!("get_or_maybe_download", layer=%self))
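Spelled out as a sketch (the function and helper names here are hypothetical, not the actual pageserver API): the already-resident fast path creates no span at all, and only the slow download future is wrapped in an info-level span via the Instrument combinator.

```rust
use tracing::Instrument;

// Hypothetical sketch of delayed span creation: `get_if_resident` and
// `download_and_init` stand in for the real internals.
async fn get_or_maybe_download(layer: &Layer) -> Result<ResidentLayer, DownloadError> {
    if let Some(resident) = layer.get_if_resident() {
        return Ok(resident); // hot path: no Span::new at all
    }
    // Cold path: the span (still at info level, so errors logged during
    // the download keep the layer= context) wraps only the download.
    layer
        .download_and_init()
        .instrument(tracing::info_span!("get_or_maybe_download", layer = %layer))
        .await
}
```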

@ivaxer
Contributor Author

ivaxer commented Dec 13, 2023

I assume your workload didn't actually do any ondemand downloads.

Yes; I expect this is typical when generating images in timeline::compact().

I'll try @koivunej suggestion, thanks.

@problame
Contributor

Can we do something similar for get_value_reconstruct_data?

@ivaxer ivaxer force-pushed the feature-reconstruction-tracing-perf branch from 312210b to 22cb367 Compare December 13, 2023 16:35
Lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug reduces the overhead
of span creation in non-debug environments.
@ivaxer ivaxer force-pushed the feature-reconstruction-tracing-perf branch from 22cb367 to 239d6b7 Compare December 15, 2023 07:36
Contributor

@problame problame left a comment


IIUC the contribution of get_value_reconstruct_data and get_or_maybe_download is roughly 50:50 in your benchmarks?

If so, let's get the get_or_maybe_download part merged as part of this PR and create a follow-up to improve the get_value_reconstruct_data.

Review comments on pageserver/src/tenant/storage_layer/layer.rs (outdated, resolved)
@ivaxer
Contributor Author

ivaxer commented Dec 15, 2023

IIUC the contribution of get_value_reconstruct_data and get_or_maybe_download is roughly 50:50 in your benchmarks?

Yes.

If so, let's get the get_or_maybe_download part merged as part of this PR

Ok.

and create a follow-up to improve the get_value_reconstruct_data.

OK. Span::new still takes >20% of the CPU time of Timeline::get_reconstruct_data. Any ideas on how to do this other than not creating a new span?

flamegraph:
compaction-revert-get-value
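One possibility (a sketch only, not something settled in this thread): declare the field as empty on an existing parent span and record into it, instead of creating a fresh span per value lookup.

```rust
// Sketch: reuse the parent span instead of creating a new one per call.
// This requires the parent to declare the field up front, e.g.
//   tracing::info_span!("get_reconstruct_data", key = tracing::field::Empty)
// and then on the hot path:
tracing::Span::current().record("key", tracing::field::display(&key));
```

Recording a field is much cheaper than Span::new, but it only keeps the last recorded value, so concurrent or nested lookups under one parent span would overwrite each other.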

@problame problame added the approved-for-ci-run Changes are safe to trigger CI for the PR label Dec 18, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Dec 18, 2023
@problame problame added the approved-for-ci-run Changes are safe to trigger CI for the PR label Dec 18, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Dec 18, 2023
problame added a commit that referenced this pull request Dec 18, 2023
Pre-merge `git merge --squash` of
#6115

Lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug reduces the overhead
of span creation in non-debug environments.
@problame problame enabled auto-merge (squash) December 18, 2023 12:41
@problame problame merged commit 33cb9a6 into neondatabase:main Dec 18, 2023
68 of 70 checks passed
@shanyp
Contributor

shanyp commented Dec 18, 2023

@ivaxer thanks for your contribution

koivunej added a commit that referenced this pull request Mar 20, 2024
Since #6115, with get_value_reconstruct_data and friends being used more often, we should not have needless INFO-level span creation near hot paths. In our prod configuration, INFO spans are always created, but in practice very rarely is anything at INFO level logged underneath.
`ResidentLayer::load_keys` is only used during compaction so it is not
that hot, but this aligns the access paths and their span usage.

The PR changes the span level to debug to align with the others, and adds the layer name to the error, which was previously missing.

Split off from #7030.