-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pageserver: Reduce tracing overhead in timeline::get #6115
pageserver: Reduce tracing overhead in timeline::get #6115
Conversation
2190 tests run: 2105 passed, 0 failed, 85 skipped (full report)Code coverage (full report)
The comment gets automatically updated with the latest test results
ce692a6 at 2023-12-18T13:07:46.072Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be great to comment on why those are debug spans, also I did not work plain "debug" would work, but this is a good find anyways.
I will add the comment when I reshuffle the spans to be on the public api of the layer in a follow-up PR later this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main concern I have is with losing context in case we log an error inside the code whose level you demote to debug in this PR.
Take for example
tracing::error!("layer file download failed, and additionally failed to communicate this to caller: {e:?}"); |
Currently if we log something there it contains the layer=
field.
My understanding is that we'll lose that layer
field with this PR.
Of course this can be fixed by rigoruously including layer=
field in all the tracing::{info,warn,error}!
usage in the call graph of get_or_maybe_download
.
So I guess it's a trade-off between performance and convenience.
That being said, the performance gains are impressive.
I assume your workload didn't actually do any ondemand downloads.
Wondering if we can delay the creation of the span (remaining at info level) until we know we're off the hot path.
At least for get_or_maybe_download, it seems straight forward.
@koivunej , thoughts?
Good point. The delayed span creation can be implemented by creating the span only at this line:
Span needed there is:
so: -}
+}.instrument(tracing::info_span!("get_or_maybe_download", layer=%self)) |
Yes, I hope this is typical when generating images in the I'll try @koivunej suggestion, thanks. |
Can we do something similar for get_value_reconstruct_data? |
312210b
to
22cb367
Compare
Lowering the tracing level in get_value_reconstruct_data and get_or_maybe_download from info to debug reduces the overhead of span creation in non-debug environments.
22cb367
to
239d6b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the contribution of get_value_reconstruct_data and get_or_maybe_download is roughly 50:50 in your benchmarks?
If so, let's get the get_or_maybe_download part merged as part of this PR and create a follow-up to improve the get_value_reconstruct_data.
Yes.
Ok.
Ok. |
Pre-merge `git merge --squash` of #6115 Lowering the tracing level in get_value_reconstruct_data and get_or_maybe_download from info to debug reduces the overhead of span creation in non-debug environments.
@ivaxer thanks for your contribution |
Since #6115 with more often used get_value_reconstruct_data and friends, we should not have needless INFO level span creation near hot paths. In our prod configuration, INFO spans are always created, but in practice, very rarely anything at INFO level is logged underneath. `ResidentLayer::load_keys` is only used during compaction so it is not that hot, but this aligns the access paths and their span usage. PR changes the span level to debug to align with others, and adds the layer name to the error which was missing. Split off from #7030.
Problem
Compaction process (specifically the image layer reconstructions part) is lagging behind wal ingest (at speed ~10-15MB/s) for medium-sized tenants (30-50GB). CPU profile shows that significant amount of time (see flamegraph) is being spent in
tracing::span::Span::new
.mainline (commit: 0ba4cae):
![reconstruct-mainline-0ba4cae491c2](https://private-user-images.githubusercontent.com/289788/289892120-ebfd262e-5c97-4858-80c7-664a1dbcc59d.svg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MDkyMzAsIm5iZiI6MTcxODkwODkzMCwicGF0aCI6Ii8yODk3ODgvMjg5ODkyMTIwLWViZmQyNjJlLTVjOTctNDg1OC04MGM3LTY2NGExZGJjYzU5ZC5zdmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjIwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyMFQxODQyMTBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lNzQ1MjU5ZmUwMzI5Yjc2MzJkZjcxYjYxYmY2ZGVhZDhmNWNhMTgxZTE3ZjcxYjRmZTk1NjY1MTNkMjUxYThhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.16fB7lFOfCbYyP1ZKQCEOIQjJFmVpLDUUpsbJ-oMZD4)
Summary of changes
By lowering the tracing level in get_value_reconstruct_data and get_or_maybe_download from info to debug, we can reduce the overhead of span creation in prod environments. On my system, this sped up the image reconstruction process by 60% (from 14500 to 23160 page reconstruction per sec)
pr:
![reconstruct-opt-2](https://private-user-images.githubusercontent.com/289788/289892172-563a159b-8f2f-4300-b0a1-6cd66e7df769.svg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5MDkyMzAsIm5iZiI6MTcxODkwODkzMCwicGF0aCI6Ii8yODk3ODgvMjg5ODkyMTcyLTU2M2ExNTliLThmMmYtNDMwMC1iMGExLTZjZDY2ZTdkZjc2OS5zdmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjIwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYyMFQxODQyMTBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01YmI2ZGIzNjdjOGRlNjdmOTkxMGUxM2Y5YjVlMWUyMWEzNTQwMGYxN2MzYzA4NTQ3Y2QyOTQ5Y2Q3OWFkZjZiJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.djS2tTmHrlrbykL3mevexUMhe0GyLnKyp818U8GysNk)
create_image_layers()
(it's 1 CPU bound here) mainline vs pr:Checklist before requesting a review
Checklist before merging