refactor(walreceiver): eliminate task_mgr usage #7260
2730 tests run: 2592 passed, 0 failed, 138 skipped (full report). Code coverage collected from Rust tests only (full report). Results for c0d5cea at 2024-03-28T12:09:13.984Z.
Turns out that the old debug assertions that were relying on task_mgr were ineffective because `WalReceiverConnectionHandler` was never a task_mgr task. I switched them to check the `ctx` instead, and voilà, we see that walreceiver actually calls `wait_lsn` during ingest (kinda obvious). And that's fine, unless it would actually wait.
What I read between the lines of your review here is that you would like to preserve the "wait-for-task-finish" semantics that we had before this PR, correct? If so, I think keeping track of the spawned tasks using JoinHandle / JoinSet is a nice explicit way to do it (somewhere in the history of this PR, I had it half-done that way). The less explicit alternative of achieving the same thing is to not keep track of the spawned tasks explicitly, but use the approach pointed out in But, as also pointed out in the PR description:
Elaborating more on that: early walreceiver shutdown is the only use case where we can't just So, we concluded that we shouldn't spend the complexity budget on walreceiver if possible. EDIT: I realized that
One sensible argument in favor of continuing to have early walreceiver shutdown with proper "wait-for-task-finish" semantics is that, if the walreceiver tasks continue to write data past the remote_client.shutdown() call inside the However, that's a big if and frankly, it doesn't matter practically, because right after
We had a call on this. I agree that waiting for walreceiver after cancellation is not required; I cannot see how this could fail or what the worst-case outcome would be, probably nothing. I would still have preferred to keep the waiting, both to keep this PR more of a refactor and as a design that would be future-proof.
There are many ways of achieving a scalable design for starting and stopping tasks without task_mgr. My preference would have been JoinSets, CancellationTokens, and plain `async fn`s, without all of the unnecessary `struct`s.
We want to move the code base away from task_mgr.
This PR refactors the walreceiver code such that it doesn't use `task_mgr` anymore.
Background
As a reminder, there are three tasks in a Timeline that's ingesting WAL: `WalReceiverManager`, `WalReceiverConnectionHandler`, and `WalReceiverConnectionPoller`.
See the documentation in `task_mgr.rs` for how they interact.
Before this PR, cancellation was requested through `task_mgr::shutdown_token()` and `TaskHandle::shutdown`.
Wait-for-task-finish was implemented using a mixture of `task_mgr::shutdown_tasks` and `TaskHandle::shutdown`.
This drawing might help:
Changes
For cancellation, the entire WalReceiver task tree now has a `child_token()` of `Timeline::cancel`. The `TaskHandle` no longer is a cancellation root.
This means that `Timeline::cancel.cancel()` is propagated.
For wait-for-task-finish, all three tasks in the task tree hold the `Timeline::gate` open until they exit.
The downside of using the `Timeline::gate` is that we can no longer wait for just the walreceiver to shut down, which is particularly relevant for `Timeline::flush_and_shutdown`.
Effectively, it means that we might ingest more WAL while the `freeze_and_flush()` call is ongoing.
Also, drive-by-fix the assertions around task kinds in `wait_lsn`. The check for `WalReceiverConnectionHandler` was ineffective because that never was a task_mgr task, but a TaskHandle task. Refine the assertion to check whether we would wait, and only fail in that case.
Alternatives
I contemplated (ab-)using the `Gate` by having a separate `Gate` for `struct WalReceiver`.
All the child tasks would use that gate instead of `Timeline::gate`.
And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the `Timeline::gate`.
Then we could have a `WalReceiver::stop` function that closes the WalReceiver's gate, then drops the `WalReceiver::Option<GateGuard>`.
However, such a design would mean sharing the WalReceiver's `Gate` in an `Arc`, which seems awkward.
A proper abstraction would be to make gates hierarchical, analogous to CancellationToken.
In the end, @jcsp and I talked it over and we determined that it's not worth the effort at this time.
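For readers who want the chosen design (the one described under Changes) in concrete form, here is a minimal std-only sketch. It is not the pageserver code: `CancelToken`, `Gate`, `GateGuard`, and `run_demo` are hypothetical stand-ins written for this illustration, OS threads stand in for the async tasks, and the real `CancellationToken` and `Gate` types are richer than this. It only shows the shape: every task gets a child token of the timeline's token, holds a guard of the timeline's gate until it exits, and shutdown is "cancel the root, then close the gate".

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a hierarchical cancellation token:
/// a child is cancelled if its own flag or any ancestor's flag is set.
#[derive(Clone)]
struct CancelToken {
    flag: Arc<AtomicBool>,
    parent: Option<Box<CancelToken>>,
}

impl CancelToken {
    fn new() -> Self {
        CancelToken { flag: Arc::new(AtomicBool::new(false)), parent: None }
    }
    fn child_token(&self) -> Self {
        CancelToken {
            flag: Arc::new(AtomicBool::new(false)),
            parent: Some(Box::new(self.clone())),
        }
    }
    fn cancel(&self) {
        self.flag.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.flag.load(Ordering::SeqCst)
            || self.parent.as_ref().map_or(false, |p| p.is_cancelled())
    }
}

struct GateState {
    open_guards: usize,
    closed: bool,
}

/// Hypothetical stand-in for a gate: `close` blocks until every guard is dropped.
#[derive(Clone)]
struct Gate {
    inner: Arc<(Mutex<GateState>, Condvar)>,
}

struct GateGuard {
    inner: Arc<(Mutex<GateState>, Condvar)>,
}

impl Gate {
    fn new() -> Self {
        let state = GateState { open_guards: 0, closed: false };
        Gate { inner: Arc::new((Mutex::new(state), Condvar::new())) }
    }
    /// Returns `None` once the gate has been closed.
    fn enter(&self) -> Option<GateGuard> {
        let mut st = self.inner.0.lock().unwrap();
        if st.closed {
            return None;
        }
        st.open_guards += 1;
        Some(GateGuard { inner: Arc::clone(&self.inner) })
    }
    fn close(&self) {
        let mut st = self.inner.0.lock().unwrap();
        st.closed = true;
        while st.open_guards > 0 {
            st = self.inner.1.wait(st).unwrap();
        }
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        let mut st = self.inner.0.lock().unwrap();
        st.open_guards -= 1;
        self.inner.1.notify_all();
    }
}

/// Spawns three stand-in tasks, shuts the "timeline" down, and returns
/// how many tasks had finished by the time the gate was closed.
fn run_demo() -> usize {
    let timeline_cancel = CancelToken::new();
    let timeline_gate = Gate::new();
    // The walreceiver tree is a child of the timeline token, not its own root.
    let walreceiver_cancel = timeline_cancel.child_token();
    let finished = Arc::new(AtomicUsize::new(0));

    for _task in ["manager", "connection_handler", "connection_poller"] {
        let cancel = walreceiver_cancel.child_token();
        let guard = timeline_gate.enter().expect("gate is still open");
        let finished = Arc::clone(&finished);
        thread::spawn(move || {
            let _guard = guard; // keeps the timeline gate open until this task exits
            while !cancel.is_cancelled() {
                thread::sleep(Duration::from_millis(5)); // pretend to ingest WAL
            }
            finished.fetch_add(1, Ordering::SeqCst);
        });
    }

    timeline_cancel.cancel(); // propagates to all three tasks via the child tokens
    timeline_gate.close(); // waits for all guards, i.e. for all three tasks
    finished.load(Ordering::SeqCst)
}
```

In this sketch `run_demo()` returns 3, because `close()` cannot return before every guard has been dropped. It also makes the stated downside visible: with a single shared gate there is no way to wait for only the three walreceiver tasks while other guard holders keep the gate open.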
Refs
part of #7062