Timeline::shutdown can leave a dangling handle_walreceiver_connection tokio task #7062

Closed · 4 tasks done
problame opened this issue Mar 8, 2024 · 1 comment

Labels
c/storage/pageserver (Component: storage: pageserver) · t/bug (Issue Type: Bug) · triaged (bugs that were already triaged)

Comments


problame commented Mar 8, 2024

Problem

```diff
diff --git a/pageserver/src/tenant/timeline/walreceiver.rs b/pageserver/src/tenant/timeline/walreceiver.rs
index 2fab6722b..2a2f9ab04 100644
--- a/pageserver/src/tenant/timeline/walreceiver.rs
+++ b/pageserver/src/tenant/timeline/walreceiver.rs
@@ -236,6 +236,10 @@ impl<E: Clone> TaskHandle<E> {
     async fn shutdown(self) {
         if let Some(jh) = self.join_handle {
             self.cancellation.cancel();
+            // we can stop being polled before the jh completes.
+            // it can happen because connection_manager_loop_step is in a tokio::select!() statement together with task_mgr cancellation:
+            // https://github.com/neondatabase/neon/blob/fb518aea0db046817987a463b1556ad950e97f09/pageserver/src/tenant/timeline/walreceiver.rs#L84-L104
+            // the consequence of that will be that we leave a tokio task that executes handle_walreceiver_connection running
             match jh.await {
                 Ok(Ok(())) => debug!("Shutdown success"),
                 Ok(Err(e)) => error!("Shutdown task error: {e:?}"),
```

Found together with @koivunej while investigating #7051

i.e., rapid /location_config requests that transition a tenant from Single to Multi and back with changing generation numbers, so that the code takes the slow path.
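
For illustration, here is a self-contained sketch (not pageserver code; it only assumes `tokio` and `tokio-util` as dependencies) of the underlying mechanism: dropping the future that spawned, and is joining, a tokio task does not stop that task.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    let still_running = Arc::new(AtomicBool::new(true));

    // Stand-in for connection_manager_loop_step: it spawns an inner task
    // (the stand-in for handle_walreceiver_connection) and then awaits it.
    let manager = {
        let cancel = cancel.clone();
        let still_running = Arc::clone(&still_running);
        async move {
            let inner = tokio::spawn(async move {
                // Pretend to ingest WAL until the token is cancelled.
                cancel.cancelled().await;
                still_running.store(false, Ordering::SeqCst);
            });
            // If this future is dropped here (e.g. because it lost a
            // tokio::select!), `inner` keeps running: dropping a JoinHandle,
            // or the future that called tokio::spawn, does not cancel the task.
            let _ = inner.await;
        }
    };

    // Stand-in for the task_mgr select!: the "shutdown" branch wins, `manager`
    // is dropped without ever cancelling or awaiting `inner`, and the inner
    // task is left dangling.
    tokio::select! {
        _ = manager => {}
        _ = tokio::time::sleep(Duration::from_millis(1)) => {}
    }

    tokio::time::sleep(Duration::from_millis(10)).await;
    assert!(still_running.load(Ordering::SeqCst), "inner task still dangling");
}
```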

Consequences of the dangling task:

  • the dangling task can continue to ingest WAL
  • that means it can continue to try to
    • put / finish_write into the Timeline
      • which in turn can cause layers to be frozen & flushed
    • trigger on-demand downloads

We audited the codebase to ensure that, as of #7051, none of the above will succeed after Timeline::shutdown returns.

Solution Proposal

Don't cancel-by-drop the connection_manager_loop_step.
Use shutdown_token() instead and make the code responsive to it.
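
A minimal sketch of the proposed shape, assuming `task_mgr::shutdown_token()` hands out a `tokio_util::sync::CancellationToken` (as the PR below describes); the `u64` event channel is a hypothetical stand-in for the real connection-manager events:

```rust
use tokio_util::sync::CancellationToken;

async fn connection_manager_loop_step_sketch(
    cancel: CancellationToken,
    mut events: tokio::sync::mpsc::Receiver<u64>,
) {
    loop {
        tokio::select! {
            // Shutdown is an explicit, observed branch: the loop returns
            // cleanly and gets a chance to shut down any connection task it
            // owns, instead of being dropped mid-await by an enclosing select!.
            _ = cancel.cancelled() => {
                // tear down the active connection / TaskHandle here
                return;
            }
            maybe_event = events.recv() => {
                match maybe_event {
                    Some(_event) => { /* react to the event */ }
                    None => return, // all senders dropped
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    let (_tx, rx) = tokio::sync::mpsc::channel::<u64>(16);
    cancel.cancel(); // request shutdown; the loop exits on its next iteration
    connection_manager_loop_step_sketch(cancel, rx).await;
}
```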

Impl

problame added the t/bug (Issue Type: Bug) and c/storage/pageserver (Component: storage: pageserver) labels Mar 8, 2024
jcsp added the triaged (bugs that were already triaged) label Mar 21, 2024
problame self-assigned this Mar 25, 2024
problame added a commit that referenced this issue Mar 27, 2024
…eceiver_connection tokio task (#7235)

# Problem

As pointed out through doc-comments in this PR, `drop_old_connection` is
not cancellation-safe.

This means we can leave a `handle_walreceiver_connection` tokio task
dangling during Timeline shutdown.

More details described in the corresponding issue #7062.

# Solution

Don't cancel-by-drop the `connection_manager_loop_step` from the
`tokio::select!()` in the task_mgr task.
Instead, transform the code to use a `CancellationToken` ---
specifically, `task_mgr::shutdown_token()` --- and make code responsive
to it.

The `drop_old_connection()` is still not cancellation-safe and also
doesn't get a cancellation token, because there's no point inside the
function where we could return early if cancellation were requested
using a token.

We rely on `handle_walreceiver_connection` being sensitive to the
`TaskHandle`'s cancellation token (argument name: `cancellation`).
Currently it checks `cancellation` on each WAL message. It is
probably also sensitive to `Timeline::cancel`, because ultimately all
that `handle_walreceiver_connection` does is interact with the
`Timeline`.
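
As a hedged sketch of that per-message check (hypothetical names; not the actual `handle_walreceiver_connection` code):

```rust
use tokio_util::sync::CancellationToken;

struct WalMessage; // hypothetical stand-in for a decoded WAL message

// Hypothetical stand-in for reading the next message from the safekeeper connection.
async fn next_wal_message() -> Option<WalMessage> {
    None
}

async fn connection_loop_sketch(cancellation: CancellationToken) {
    while let Some(_msg) = next_wal_message().await {
        // Bail out between messages once shutdown has been requested, so the
        // task cannot keep ingesting WAL long after Timeline::shutdown.
        if cancellation.is_cancelled() {
            return;
        }
        // ... otherwise ingest the message into the Timeline ...
    }
}

#[tokio::main]
async fn main() {
    connection_loop_sketch(CancellationToken::new()).await;
}
```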

In summary, the above means that the following code (which is found in
`Timeline::shutdown`) now might **take longer**, but actually ensures
that all `handle_walreceiver_connection` tasks are finished:

```rust
task_mgr::shutdown_tasks(
    Some(TaskKind::WalReceiverManager),
    Some(self.tenant_shard_id),
    Some(self.timeline_id)
)
```

# Refs

refs #7062

problame commented Mar 28, 2024

Update:

problame added a commit that referenced this issue Apr 3, 2024
We want to move the code base away from task_mgr.

This PR refactors the walreceiver code such that it doesn't use
`task_mgr` anymore.

# Background

As a reminder, there are three tasks in a Timeline that's ingesting WAL:
`WalReceiverManager`, `WalReceiverConnectionHandler`, and
`WalReceiverConnectionPoller`.
See the documentation in `task_mgr.rs` for how they interact.

Before this PR, cancellation was requested through
`task_mgr::shutdown_token()` and `TaskHandle::shutdown`.

Wait-for-task-finish was implemented using a mixture of
`task_mgr::shutdown_tasks` and `TaskHandle::shutdown`.

This drawing might help:

![Task tree diagram](https://github.com/neondatabase/neon/assets/956573/b6be7ad6-ecb3-41d0-b410-ec85cb8d6d20)


# Changes

For cancellation, the entire WalReceiver task tree now has a
`child_token()` of `Timeline::cancel`. The `TaskHandle` is no longer a
cancellation root.
This means that cancelling `Timeline::cancel` propagates to the entire
task tree.
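
A minimal sketch of that hierarchy, using plain `tokio_util::sync::CancellationToken` (the variable names mirror, but are not, the pageserver fields):

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let timeline_cancel = CancellationToken::new(); // stand-in for Timeline::cancel
    let walreceiver_cancel = timeline_cancel.child_token(); // root of the walreceiver task tree
    let connection_cancel = walreceiver_cancel.child_token(); // per-connection token

    // Cancelling the parent propagates down the whole tree ...
    timeline_cancel.cancel();
    assert!(walreceiver_cancel.is_cancelled());
    assert!(connection_cancel.is_cancelled());

    // ... while cancelling a child does not affect its parent.
    let other_timeline_cancel = CancellationToken::new();
    let child = other_timeline_cancel.child_token();
    child.cancel();
    assert!(!other_timeline_cancel.is_cancelled());
}
```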

For wait-for-task-finish, all three tasks in the task tree hold the
`Timeline::gate` open until they exit.

The downside of using the `Timeline::gate` is that we can no longer wait
for just the walreceiver to shut down, which is particularly relevant
for `Timeline::flush_and_shutdown`.
Effectively, it means that we might ingest more WAL while the
`freeze_and_flush()` call is ongoing.

Also, drive-by-fix the assertions around task kinds in `wait_lsn`. The
check for `WalReceiverConnectionHandler` was ineffective because that
was never a task_mgr task, but a TaskHandle task. Refine the assertion
to check whether we would wait, and only fail in that case.

# Alternatives

I contemplated (ab-)using the `Gate` by having a separate `Gate` for
`struct WalReceiver`.
All the child tasks would use _that_ gate instead of `Timeline::gate`.
And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the
`Timeline::gate`.
Then we could have a `WalReceiver::stop` function that closes the
WalReceiver's gate, then drops the `WalReceiver::Option<GateGuard>`.

However, such a design would mean sharing the WalReceiver's `Gate` in an
`Arc`, which seems awkward.
A proper abstraction would be to make gates hierarchical, analogous to
CancellationToken.

In the end, @jcsp and I talked it over and we determined that it's not
worth the effort at this time.

# Refs

part of #7062
problame added a commit that referenced this issue Apr 3, 2024
…rom deletion code path (#7233)

This PR is fallout from work on #7062.

# Changes

- Unify the freeze-and-flush and hard shutdown code paths into a single
method `Timeline::shutdown` that takes the shutdown mode as an argument
(see the sketch after this list).
- Replace the `freeze_and_flush` bool arg in callers with that mode
argument, which makes them more expressive.
- Switch timeline deletion to use `Timeline::shutdown` instead of its
own slightly-out-of-sync copy.
- Remove usage of `task_mgr::shutdown_watcher` /
`task_mgr::shutdown_token` where possible
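
As referenced in the first bullet, a minimal sketch of that shape (hypothetical names; not the actual pageserver types):

```rust
// One shutdown entry point parameterized by mode instead of a bare bool.
enum ShutdownMode {
    /// Flush in-memory data to disk before shutting down.
    FreezeAndFlush,
    /// Shut down immediately; in-flight data is not flushed.
    Hard,
}

struct Timeline;

impl Timeline {
    async fn shutdown(&self, mode: ShutdownMode) {
        if matches!(mode, ShutdownMode::FreezeAndFlush) {
            // freeze and flush in-memory layers here
        }
        // common path: cancel tasks, wait for them to finish, etc.
    }
}

#[tokio::main]
async fn main() {
    // Callers now state their intent explicitly instead of passing `true`/`false`:
    Timeline.shutdown(ShutdownMode::FreezeAndFlush).await;
    Timeline.shutdown(ShutdownMode::Hard).await;
}
```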

# Future Work

Do we really need the `freeze_and_flush`?
If we could get rid of it, then there'd be no need for a specific
shutdown order.

Also, if you undo this patch's changes to `eviction_task.rs` and
enable `RUST_LOG=debug`, it's easy to see that we do leave some task
hanging that logs under the span `Connection{...}` at debug level. I think
it's a pre-existing issue; it's probably a broker client task.
problame closed this as completed Apr 4, 2024