
Epic: fully eliminate latency impact of walredo spawning #6581

Closed · 6 of 7 tasks · Tracked by #6565
problame opened this issue Feb 2, 2024 · 1 comment
Assignees: problame
Labels: c/storage/pageserver (Component: storage: pageserver)


problame commented Feb 2, 2024

This is a spin-off from the epic "pageserver: spawning walredo process is slow" (#6565).

Problem

We currently spawn walredo processes lazily, so the first getpage request experiences a ~5ms slowdown, which is the average spawn latency.

If the compute issues getpage requests concurrently (parallel query execution, prefetch), all of these concurrent requests coalesce on the one ~5ms spawn, i.e., each of them experiences the full ~5ms slowdown.
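To make the coalescing concrete, here is a minimal sketch of the current lazy-spawn pattern, assuming a tokio-based manager; the types and names are illustrative, not the actual pageserver code:

```rust
use std::io;
use std::process::Stdio;

use tokio::process::{Child, Command};
use tokio::sync::OnceCell;

struct WalRedoManager {
    // Empty until the first getpage request needs walredo.
    process: OnceCell<Child>,
}

impl WalRedoManager {
    async fn apply_wal_records(&self) -> io::Result<()> {
        // The first caller spawns the process; every request that is
        // in flight during the spawn awaits the same initialization,
        // so all of them absorb the full ~5ms spawn latency.
        let _process: &Child = self
            .process
            .get_or_try_init(|| async {
                Command::new("postgres")
                    .arg("--wal-redo")
                    .stdin(Stdio::piped())
                    .stdout(Stdio::piped())
                    .spawn()
            })
            .await?;
        // ... write WAL records to stdin, read back the redone page ...
        Ok(())
    }
}
```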

We also shut down walredo processes after an idle period. If the compute is still up and issues a request afterwards, it experiences the ~5ms outlier(s) again.

On moderately busy pageservers (mid hundreds to low thousands of GetPage requests per second), the outliers introduced by this skew the p99.9X latencies.

There is also the impact of CLONE_VFORK (used by posix_spawn), which freezes the entire parent until the child execs. See #7320.

All of the above definitely contributes to tail latencies; the open questions are how badly, and whether it is easier to just fix the problem than to quantify it.

Metrics-only solution

Fix only the metrics, by counting getpage requests that needed a walredo spawn separately.
This goes in a similar direction as #3797.
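A hedged sketch of that split, using the prometheus crate directly (the actual pageserver wraps its metrics; the metric and label names here are made up for illustration):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Split the getpage latency histogram by whether the request had to
// spawn a walredo process, so spawn outliers stop polluting the
// common-case p99.9X numbers.
static GETPAGE_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_getpage_seconds", // hypothetical metric name
        "GetPage latency, split by whether a walredo spawn was required",
        &["walredo_spawned"]
    )
    .expect("failed to register metric")
});

fn observe_getpage(latency_secs: f64, spawned_walredo: bool) {
    let label = if spawned_walredo { "true" } else { "false" };
    GETPAGE_LATENCY
        .with_label_values(&[label])
        .observe(latency_secs);
}
```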

Partial solution

Spawn walredo during basebackup. This eliminates the outlier on cold starts, and probably makes our p99.9X getpage stats look better.

However, cold start plus first query is dominated by the cold start itself, so shaving ~5ms off the first-query part won't move the needle for actual UX.
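A sketch of what the basebackup hook could look like; `Tenant`, `WalRedoHandle`, and `ensure_spawned` are hypothetical stand-ins for the real pageserver types:

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the real pageserver types.
struct WalRedoHandle;
impl WalRedoHandle {
    async fn ensure_spawned(&self) -> anyhow::Result<()> {
        // ... spawn the walredo process if it is not already running ...
        Ok(())
    }
}
struct Tenant {
    walredo: WalRedoHandle,
}

async fn handle_basebackup(tenant: Arc<Tenant>) -> anyhow::Result<()> {
    // Overlap the ~5ms walredo spawn with streaming the basebackup,
    // instead of letting the first getpage request pay for it.
    let t = Arc::clone(&tenant);
    tokio::spawn(async move {
        if let Err(e) = t.walredo.ensure_spawned().await {
            eprintln!("eager walredo spawn failed: {e:#}");
        }
    });
    // ... stream the basebackup tarball as before ...
    Ok(())
}
```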

Full solution: pre-spawned pool

Hence, to take walredo process spawning completely off the critical path, we should maintain a pool of pre-spawned walredo processes for use by tenants.
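A minimal sketch of such a pool, assuming tokio; the API and all names are hypothetical, not a committed design:

```rust
use std::process::Stdio;

use tokio::process::{Child, Command};
use tokio::sync::mpsc;

/// Pool of pre-spawned walredo processes. Taking one is a channel
/// recv instead of a ~5ms spawn; a background task refills the pool
/// off the critical path.
pub struct WalRedoPool {
    ready: mpsc::Receiver<Child>,
}

impl WalRedoPool {
    pub fn new(capacity: usize) -> Self {
        let (tx, ready) = mpsc::channel(capacity);
        tokio::spawn(async move {
            loop {
                // Wait for room in the pool, so at most `capacity`
                // idle processes exist at any time.
                let Ok(permit) = tx.reserve().await else {
                    break; // the pool was dropped
                };
                let child = Command::new("postgres")
                    .arg("--wal-redo")
                    .stdin(Stdio::piped())
                    .stdout(Stdio::piped())
                    .spawn()
                    .expect("failed to spawn walredo process");
                permit.send(child);
            }
        });
        Self { ready }
    }

    /// Hand out a pre-spawned process; the refill task replaces it.
    pub async fn take(&mut self) -> Option<Child> {
        self.ready.recv().await
    }
}
```

A real implementation would also need health checks on idle children and per-tenant initialization after `take()`, but the latency win comes entirely from moving the spawn into the refill task.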

Note that, as a preliminary, #7320 should be done first to avoid API-level conflicts. However, the core ideas are orthogonal.

Work

Preliminaries


Partial solution: spawn walredo as part of basebackup

Full solution: pre-spawned pool

Related

problame self-assigned this Feb 2, 2024
problame added the label c/storage/pageserver Feb 2, 2024
problame added a commit that referenced this issue Feb 2, 2024
When we later introduce a global pool of pre-spawned walredo processes (#6581), this refactoring avoids plumbing a reference to the pool through to all the places where we create a broken tenant.

Builds atop the refactoring in #6583
problame added a commit that referenced this issue Feb 6, 2024
problame changed the title pre-spawn walredo processes → Epic: pre-spawn walredo processes Apr 4, 2024
problame changed the title Epic: pre-spawn walredo processes → Epic: lazy walredo process spawning impacts p99.9X getpage latencies Apr 4, 2024
problame changed the title Epic: lazy walredo process spawning impacts p99.9X getpage latencies → Epic: avoid latency of lazy walredo process spawning / skews p99.9X on moderately busy PSes Apr 4, 2024
problame changed the title Epic: avoid latency of lazy walredo process spawning / skews p99.9X on moderately busy PSes → Epic: fully eliminate latency impact of walredo spawning Apr 4, 2024
problame commented:

The latency impact is easily demonstrable, but not even a mid-term priority.
=> Closing this epic to avoid it becoming stale. See #7320 (comment).

problame closed this as not planned Apr 30, 2024