pageserver: spawning walredo process is slow #6565

Closed
6 tasks done
jcsp opened this issue Feb 1, 2024 · 1 comment
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged

Comments

@jcsp
Contributor

jcsp commented Feb 1, 2024

Problem

On some pageservers, spawning the walredo process takes more than 1s.

Investigation Results

DoD

  • walredo process spawning latency is predictable
  • acquisition of a walredo process for page reconstruction is < XXX milliseconds
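One way to check the "predictable latency" criterion is to sample spawn times and look at the tail rather than the mean. The following is a minimal sketch, not pageserver code: `sample_spawn_latency` and `percentile` are hypothetical helpers, and the program being spawned is whatever the caller passes in (a stand-in for the real walredo binary).

```rust
use std::io;
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

/// Hypothetical helper: spawn `program` `n` times and record how long
/// each `spawn()` call took. The child is reaped after every sample.
fn sample_spawn_latency(program: &str, n: usize) -> io::Result<Vec<Duration>> {
    let mut samples = Vec::with_capacity(n);
    for _ in 0..n {
        let start = Instant::now();
        let mut child = Command::new(program)
            .stdin(Stdio::null())
            .stdout(Stdio::null())
            .spawn()?;
        // Only the spawn itself is timed, not the child's runtime.
        samples.push(start.elapsed());
        child.wait()?;
    }
    Ok(samples)
}

/// Hypothetical helper: nearest-rank percentile, with `p` in 0.0..=100.0.
/// Returns `None` when there are no samples.
fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}
```

Comparing `percentile(&samples, 50.0)` with `percentile(&samples, 99.0)` gives a rough picture of how far the tail is from the typical case.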

Plan

Explore whether we can use posix_spawn; if so, ship to staging and observe whether it is a sufficient improvement. We can move the close_fds work into walredo startup, where we still trust the process.

If posix_spawn can't be used, implement a sidecar "spawner" process that pageserver asks to spawn walredo processes.

  • Option 1: extend the existing walredo C code to enter "template" mode.
  • Option 2: fork off a pageserver child process that will act as the spawner process

NB: we decided against a pool of pre-spawned walredo processes, as the amount of CPU wasted on the inefficient fork() call is significant.

Background Reading

Work

Solve The Issue

Follow-Ups

  1. hlinnaka

Spin-Offs (no need to complete before closing)

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Feb 1, 2024
problame added a commit that referenced this issue Feb 1, 2024
… code

The rust stdlib uses the efficient `posix_spawn` by default.
However, before this PR, pageserver used `pre_exec()` in our
`close_fds()` ext trait.

This PR moves the work that `close_fds()` did to the walredo C code.
I verified manually that we're now forking out the walredo process using
`posix_spawn`.

refs #6565
problame added a commit that referenced this issue Feb 1, 2024
…C code (#6574)

The rust stdlib uses the efficient `posix_spawn` by default.
However, before this PR, pageserver used `pre_exec()` in our
`close_fds()` ext trait.

This PR moves the work that `close_fds()` did to the walredo C code.
I verified manually using `gdb` that we're now forking out the walredo
process using `posix_spawn`.

refs #6565
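
The difference the commit message describes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pageserver code: `spawn_plain` and `spawn_with_pre_exec` are hypothetical names, and the no-op closure stands in for the work `close_fds()` used to do. As far as I understand the stdlib, registering any `pre_exec` closure makes it fall back to `fork` + `exec`, while a command without one is eligible for the `posix_spawn` path on platforms that support it.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Command, ExitStatus, Stdio};

/// No `pre_exec` hook: the stdlib is free to use `posix_spawn`
/// where the platform supports it.
fn spawn_plain(program: &str) -> io::Result<ExitStatus> {
    Command::new(program)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .spawn()?
        .wait()
}

/// With a `pre_exec` hook: the closure must run in the forked child
/// before `exec`, so the `posix_spawn` path is not available.
fn spawn_with_pre_exec(program: &str) -> io::Result<ExitStatus> {
    let mut cmd = Command::new(program);
    cmd.stdin(Stdio::null()).stdout(Stdio::null());
    // SAFETY: the closure does nothing. Real hooks may only perform
    // async-signal-safe operations between fork and exec.
    unsafe {
        cmd.pre_exec(|| Ok(()));
    }
    cmd.spawn()?.wait()
}
```

Both functions behave the same from the caller's point of view; which system calls are used underneath can be checked with a debugger or `strace`, as the commit message did with `gdb`.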
@problame problame changed the title pageserver: spawning walredo process is slow when pageserver has large virtual memory pageserver: spawning walredo process is slow Feb 1, 2024
@jcsp
Contributor Author

jcsp commented Feb 5, 2024

  • walredo fork changes will address basebackup pain
  • but spawning still takes tens of milliseconds, so the first getpage request after an idle period from a running database sees a latency spike; hence the motivation to use a pool.

@problame problame mentioned this issue Feb 5, 2024
1 task
@jcsp jcsp added the triaged bugs that were already triaged label Feb 8, 2024
@problame problame closed this as completed Apr 4, 2024
2 participants