Timeline::shutdown can leave a dangling handle_walreceiver_connection tokio task #7062

Closed · 4 tasks done
problame opened this issue Mar 8, 2024 · 1 comment

Labels
c/storage/pageserver (Component: storage: pageserver) · t/bug (Issue Type: Bug) · triaged (bugs that were already triaged)

Comments


problame commented Mar 8, 2024

Problem

```diff
diff --git a/pageserver/src/tenant/timeline/walreceiver.rs b/pageserver/src/tenant/timeline/walreceiver.rs
index 2fab6722b..2a2f9ab04 100644
--- a/pageserver/src/tenant/timeline/walreceiver.rs
+++ b/pageserver/src/tenant/timeline/walreceiver.rs
@@ -236,6 +236,10 @@ impl<E: Clone> TaskHandle<E> {
     async fn shutdown(self) {
         if let Some(jh) = self.join_handle {
             self.cancellation.cancel();
+            // we can stop being polled before the jh completes.
+            // it can happen because connection_manager_loop_step is in a tokio::select!() statement together with task_mgr cancellation:
+            // https://github.com/neondatabase/neon/blob/fb518aea0db046817987a463b1556ad950e97f09/pageserver/src/tenant/timeline/walreceiver.rs#L84-L104
+            // the consequence of that will be that we leave a tokio task that executes handle_walreceiver_connection running
             match jh.await {
                 Ok(Ok(())) => debug!("Shutdown success"),
                 Ok(Err(e)) => error!("Shutdown task error: {e:?}"),
```

Found together with @koivunej while investigating #7051

i.e., rapid /location_config requests that transition a tenant from Single to Multi and back with changing generation numbers, so that the code takes the slow path.
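
For illustration, here is a self-contained sketch (not pageserver code; it only assumes `tokio` and `tokio-util` as dependencies) of the underlying mechanism: dropping the future that spawned, and is joining, a tokio task does not stop that task.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    let still_running = Arc::new(AtomicBool::new(true));

    // Stand-in for connection_manager_loop_step: it spawns an inner task
    // (the stand-in for handle_walreceiver_connection) and then awaits it.
    let manager = {
        let cancel = cancel.clone();
        let still_running = Arc::clone(&still_running);
        async move {
            let inner = tokio::spawn(async move {
                // Pretend to ingest WAL until the token is cancelled.
                cancel.cancelled().await;
                still_running.store(false, Ordering::SeqCst);
            });
            // If this future is dropped here (e.g. because it lost a
            // tokio::select!), `inner` keeps running: dropping a JoinHandle,
            // or the future that called tokio::spawn, does not cancel the task.
            let _ = inner.await;
        }
    };

    // Stand-in for the task_mgr select!: the "shutdown" branch wins, `manager`
    // is dropped without ever cancelling or awaiting `inner`, and the inner
    // task is left dangling.
    tokio::select! {
        _ = manager => {}
        _ = tokio::time::sleep(Duration::from_millis(1)) => {}
    }

    tokio::time::sleep(Duration::from_millis(10)).await;
    assert!(still_running.load(Ordering::SeqCst), "inner task still dangling");
}
```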

Consequences of the dangling task:

  • the dangling task can continue to ingest WAL
  • that means it can continue to try to
    • put / finish_write into the Timeline
      • which in turn can cause layers to be frozen & flushed
    • trigger on-demand downloads

We audited the codebase to ensure that, as of #7051, none of the above will succeed after Timeline::shutdown returns.

Solution Proposal

Don't cancel-by-drop the connection_manager_loop_step.
Use shutdown_token() instead and make the code responsive to it.
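
A minimal sketch of the proposed shape, assuming `task_mgr::shutdown_token()` hands out a `tokio_util::sync::CancellationToken` (as the PR below describes); the `u64` event channel is a hypothetical stand-in for the real connection-manager events:

```rust
use tokio_util::sync::CancellationToken;

async fn connection_manager_loop_step_sketch(
    cancel: CancellationToken,
    mut events: tokio::sync::mpsc::Receiver<u64>,
) {
    loop {
        tokio::select! {
            // Shutdown is an explicit, observed branch: the loop returns
            // cleanly and gets a chance to shut down any connection task it
            // owns, instead of being dropped mid-await by an enclosing select!.
            _ = cancel.cancelled() => {
                // tear down the active connection / TaskHandle here
                return;
            }
            maybe_event = events.recv() => {
                match maybe_event {
                    Some(_event) => { /* react to the event */ }
                    None => return, // all senders dropped
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    let (_tx, rx) = tokio::sync::mpsc::channel::<u64>(16);
    cancel.cancel(); // request shutdown; the loop exits on its next iteration
    connection_manager_loop_step_sketch(cancel, rx).await;
}
```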

Impl

problame added the t/bug (Issue Type: Bug) and c/storage/pageserver (Component: storage: pageserver) labels Mar 8, 2024
jcsp added the triaged (bugs that were already triaged) label Mar 21, 2024
problame self-assigned this Mar 25, 2024
problame added a commit that referenced this issue Mar 27, 2024
…eceiver_connection tokio task (#7235)

# Problem

As pointed out through doc-comments in this PR, `drop_old_connection` is
not cancellation-safe.

This means we can leave a `handle_walreceiver_connection` tokio task
dangling during Timeline shutdown.

More details described in the corresponding issue #7062.

# Solution

Don't cancel-by-drop the `connection_manager_loop_step` from the
`tokio::select!()` in the task_mgr task.
Instead, transform the code to use a `CancellationToken` ---
specifically, `task_mgr::shutdown_token()` --- and make code responsive
to it.

The `drop_old_connection()` is still not cancellation-safe and also
doesn't get a cancellation token, because there's no point inside the
function where we could return early if cancellation were requested
using a token.

We rely on `handle_walreceiver_connection` being sensitive to the
`TaskHandle`'s cancellation token (argument name: `cancellation`).
Currently it checks `cancellation` on each WAL message. It is
probably also sensitive to `Timeline::cancel`, because ultimately all
that `handle_walreceiver_connection` does is interact with the
`Timeline`.
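
As a hedged sketch of that per-message check (hypothetical names; not the actual `handle_walreceiver_connection` code):

```rust
use tokio_util::sync::CancellationToken;

struct WalMessage; // hypothetical stand-in for a decoded WAL message

// Hypothetical stand-in for reading the next message from the safekeeper connection.
async fn next_wal_message() -> Option<WalMessage> {
    None
}

async fn connection_loop_sketch(cancellation: CancellationToken) {
    while let Some(_msg) = next_wal_message().await {
        // Bail out between messages once shutdown has been requested, so the
        // task cannot keep ingesting WAL long after Timeline::shutdown.
        if cancellation.is_cancelled() {
            return;
        }
        // ... otherwise ingest the message into the Timeline ...
    }
}

#[tokio::main]
async fn main() {
    connection_loop_sketch(CancellationToken::new()).await;
}
```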

In summary, the above means that the following code (which is found in
`Timeline::shutdown`) now might **take longer**, but actually ensures
that all `handle_walreceiver_connection` tasks are finished:

```rust
task_mgr::shutdown_tasks(
    Some(TaskKind::WalReceiverManager),
    Some(self.tenant_shard_id),
    Some(self.timeline_id)
)
```

# Refs

refs #7062

problame commented Mar 28, 2024

Update:

problame added a commit that referenced this issue Apr 3, 2024
We want to move the code base away from task_mgr.

This PR refactors the walreceiver code such that it doesn't use
`task_mgr` anymore.

# Background

As a reminder, there are three tasks in a Timeline that's ingesting WAL:
`WalReceiverManager`, `WalReceiverConnectionHandler`, and
`WalReceiverConnectionPoller`.
See the documentation in `task_mgr.rs` for how they interact.

Before this PR, cancellation was requested through
`task_mgr::shutdown_token()` and `TaskHandle::shutdown`.

Wait-for-task-finish was implemented using a mixture of
`task_mgr::shutdown_tasks` and `TaskHandle::shutdown`.

This drawing might help:

![Task tree diagram](https://github.com/neondatabase/neon/assets/956573/b6be7ad6-ecb3-41d0-b410-ec85cb8d6d20)


# Changes

For cancellation, the entire WalReceiver task tree now has a
`child_token()` of `Timeline::cancel`. The `TaskHandle` is no longer a
cancellation root.
This means that cancelling `Timeline::cancel` propagates to the entire
task tree.
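
A minimal sketch of that hierarchy, using plain `tokio_util::sync::CancellationToken` (the variable names mirror, but are not, the pageserver fields):

```rust
use tokio_util::sync::CancellationToken;

fn main() {
    let timeline_cancel = CancellationToken::new(); // stand-in for Timeline::cancel
    let walreceiver_cancel = timeline_cancel.child_token(); // root of the walreceiver task tree
    let connection_cancel = walreceiver_cancel.child_token(); // per-connection token

    // Cancelling the parent propagates down the whole tree ...
    timeline_cancel.cancel();
    assert!(walreceiver_cancel.is_cancelled());
    assert!(connection_cancel.is_cancelled());

    // ... while cancelling a child does not affect its parent.
    let other_timeline_cancel = CancellationToken::new();
    let child = other_timeline_cancel.child_token();
    child.cancel();
    assert!(!other_timeline_cancel.is_cancelled());
}
```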

For wait-for-task-finish, all three tasks in the task tree hold the
`Timeline::gate` open until they exit.

The downside of using the `Timeline::gate` is that we can no longer wait
for just the walreceiver to shut down, which is particularly relevant
for `Timeline::flush_and_shutdown`.
Effectively, it means that we might ingest more WAL while the
`freeze_and_flush()` call is ongoing.

Also, drive-by-fix the assertions around task kinds in `wait_lsn`. The
check for `WalReceiverConnectionHandler` was ineffective because that
was never a task_mgr task, but a TaskHandle task. Refine the assertion
to check whether we would wait, and only fail in that case.

# Alternatives

I contemplated (ab-)using the `Gate` by having a separate `Gate` for
`struct WalReceiver`.
All the child tasks would use _that_ gate instead of `Timeline::gate`.
And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the
`Timeline::gate`.
Then we could have a `WalReceiver::stop` function that closes the
WalReceiver's gate, then drops the `WalReceiver::Option<GateGuard>`.

However, such a design would mean sharing the WalReceiver's `Gate` in an
`Arc`, which seems awkward.
A proper abstraction would be to make gates hierarchical, analogous to
CancellationToken.

In the end, @jcsp and I talked it over and we determined that it's not
worth the effort at this time.

# Refs

part of #7062
problame added a commit that referenced this issue Apr 3, 2024
…rom deletion code path (#7233)

This PR is fallout from work on #7062.

# Changes

- Unify the freeze-and-flush and hard shutdown code paths into a single
method `Timeline::shutdown` that takes the shutdown mode as an argument
(see the sketch after this list).
- Replace the `freeze_and_flush` bool arg in callers with that mode
argument, which makes them more expressive.
- Switch timeline deletion to use `Timeline::shutdown` instead of its
own slightly-out-of-sync copy.
- Remove usage of `task_mgr::shutdown_watcher` /
`task_mgr::shutdown_token` where possible
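
As referenced in the first bullet, a minimal sketch of that shape (hypothetical names; not the actual pageserver types):

```rust
// One shutdown entry point parameterized by mode instead of a bare bool.
enum ShutdownMode {
    /// Flush in-memory data to disk before shutting down.
    FreezeAndFlush,
    /// Shut down immediately; in-flight data is not flushed.
    Hard,
}

struct Timeline;

impl Timeline {
    async fn shutdown(&self, mode: ShutdownMode) {
        if matches!(mode, ShutdownMode::FreezeAndFlush) {
            // freeze and flush in-memory layers here
        }
        // common path: cancel tasks, wait for them to finish, etc.
    }
}

#[tokio::main]
async fn main() {
    // Callers now state their intent explicitly instead of passing `true`/`false`:
    Timeline.shutdown(ShutdownMode::FreezeAndFlush).await;
    Timeline.shutdown(ShutdownMode::Hard).await;
}
```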

# Future Work

Do we really need the `freeze_and_flush`?
If we could get rid of it, then there'd be no need for a specific
shutdown order.

Also, if you undo this patch's changes to `eviction_task.rs` and
enable `RUST_LOG=debug`, it's easy to see that we do leave some task
hanging that logs under the span `Connection{...}` at debug level. I think
it's a pre-existing issue; it's probably a broker client task.
problame closed this as completed Apr 4, 2024