
fix: coalesce deleting a timeline #4194

Closed · wants to merge 29 commits

Conversation

@koivunej koivunej commented May 10, 2023

Supersedes the earlier #4159. In an effort to let us write simpler code, this particular function was giving us a lot of trouble, so I created utils::shared_retried::SharedRetried. Its implementation grew much nastier than I initially wanted, mainly to allow using task_mgr::spawn.

SharedRetried has some tests. Its features are a bit contrived because of how we can produce non-terminal or retried results from the future factory that is passed to try_restart:

  • Ok(()) is always terminal
  • Err(InnerDeleteTimelineError::DeletedGrewChildren) is terminal: normally, panics from this future would be non-terminal, but this intentional panicking-to-poison-the-tenant is caught as a separate case

"Terminal" results mean, that there will not be any more attempts. With the Ok(()) case, the request handler would most likely be the only one keeping a timeline alive, so no need to retry. After DeletedGrewChildren we have no way to retry.

The hope is that in the future we can write simpler management API request handlers which are not cancelled.
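
For illustration, a minimal sketch of the terminal vs. retried classification (only the DeletedGrewChildren variant comes from this PR; Transient and is_terminal are made-up names for the example):

    /// Hypothetical error type for the inner level of results.
    #[derive(Clone, Debug)]
    enum InnerDeleteTimelineError {
        /// Terminal: retrying can never succeed once the timeline grew children.
        DeletedGrewChildren,
        /// Non-terminal: for example a transient I/O failure; a later call may retry.
        Transient,
    }

    /// A terminal result is stored and handed out to all later callers;
    /// a non-terminal one lets the next caller restart the attempt.
    fn is_terminal(res: &Result<(), InnerDeleteTimelineError>) -> bool {
        match res {
            // Ok(()) is always terminal: the deletion completed.
            Ok(()) => true,
            Err(InnerDeleteTimelineError::DeletedGrewChildren) => true,
            Err(InnerDeleteTimelineError::Transient) => false,
        }
    }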

Reviewing:

  • start by looking at the tests of SharedRetried to understand how it's meant to be used
  • then look at the doc comment
  • then remember to turn on "hide whitespace" before looking at the delete_timeline changes



github-actions bot commented May 10, 2023

992 tests run: 949 passed, 2 failed, 41 skipped (full report)


Failures on Postgres 15

# Run failed on Postgres 15 tests locally:
DEFAULT_PG_VERSION=15 scripts/pytest -k "test_broken_timeline[debug-pg15]"
Flaky tests (3)

Postgres 15

  • test_close_on_connections_exit[endpoint]: ✅ debug

Postgres 14

  • test_close_on_connections_exit[project]: ✅ debug
  • test_close_on_connections_exit[endpoint]: ✅ debug
The comment gets automatically updated with the latest test results
69e9237 at 2023-05-22T08:36:01.349Z :recycle:


@koivunej koivunej marked this pull request as ready for review May 10, 2023 14:32
@koivunej koivunej requested review from a team as code owners May 10, 2023 14:32
@koivunej koivunej requested review from knizhnik and removed request for a team May 10, 2023 14:32

@hlinnaka
Contributor

This SharedRetryable abstraction feels overengineered to me.

If I understand correctly, the naive open-coded version of this would be:

    // `None` means the deletion hasn't been performed yet. `Some` means that it has, and
    // it contains the result.
    //
    // This would be a field in the Timeline struct.
    let deletion_completed: tokio::sync::Mutex<Option<Result<(), ()>>>
        = tokio::sync::Mutex::new(None);

    // Start the deletion, or if another task is already performing it, wait for it to finish.
    let res = {
        let mut deletion_guard = deletion_completed.lock().await;
        if let Some(res) = *deletion_guard {
            // another task performed the deletion already, just return its result
            res
        } else {
            // perform the deletion ...    
            let res = Ok(());

            // remember the result, if another task tries to perform deletion
            // concurrently
            *deletion_guard = Some(res);

            res
        }
    };
    
    println!("got result: {:?}", res);

What does the new abstraction bring us over this?

koivunej commented May 10, 2023

Yeah, add on top of that example:

  • support for spawning a single task, without cancellation, to do the work, and allowing 0..N other tasks to await that one task (this is most likely the overengineered-looking part); see the sketch after this comment
    • "without cancellation" meaning the work is not cancelled when the original attempt is cancelled
  • transient vs. persistent errors, with retry on transient ones

If you look at the superseded PR you can see a simpler, less split-up version of this. The tests also highlight the different cases.
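
For the first bullet, a minimal sketch of the coalescing idea: one detached task does the work and any number of callers await its result. This is not the PR's SharedRetried API; it uses tokio::spawn directly (which pageserver code cannot, hence task_mgr), the names Coalesced and run_once are made up, and it leaves out the retry-on-transient-error part:

    use std::sync::{Arc, Mutex};
    use tokio::sync::watch;

    #[derive(Clone, Default)]
    struct Coalesced {
        // `None` until the first caller starts the work; afterwards every caller
        // clones the same receiver and awaits the single spawned task.
        inner: Arc<Mutex<Option<watch::Receiver<Option<Result<(), String>>>>>>,
    }

    impl Coalesced {
        async fn run_once<F, Fut>(&self, work: F) -> Result<(), String>
        where
            F: FnOnce() -> Fut,
            Fut: std::future::Future<Output = Result<(), String>> + Send + 'static,
        {
            let mut rx = {
                let mut slot = self.inner.lock().unwrap();
                match &*slot {
                    Some(rx) => rx.clone(),
                    None => {
                        let (tx, rx) = watch::channel(None);
                        let fut = work();
                        // Detached task: cancelling any caller does not cancel the work.
                        tokio::spawn(async move {
                            let _ = tx.send(Some(fut.await));
                        });
                        *slot = Some(rx.clone());
                        rx
                    }
                }
            };
            // Wait until the spawned task publishes its result.
            loop {
                if let Some(res) = rx.borrow().clone() {
                    return res;
                }
                if rx.changed().await.is_err() {
                    return Err("worker task dropped without publishing a result".into());
                }
            }
        }
    }

A request handler would call coalesced.run_once(|| async { /* delete */ Ok(()) }).await and get the same result whether it started the attempt or joined one already in flight.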

@problame problame left a comment

I treated shared_retryable as a black box in this review. I can take another look, but I trust you did a good job there (plus, I reviewed the not-yet-librarified version earlier).

If you want me to look at specific pieces of it, e.g., task_mgr interaction, please start a conversation and mention me.

All comments added in this review are quite nitpicky, so if you feel strongly otherwise, write a justification and mark them as resolved.

Inline review comments on pageserver/src/tenant.rs and test_runner/regress/test_timeline_delete.py (some outdated; all resolved).
@koivunej
Contributor Author

Earlier @problame:

I treated shared_retryable as a black box in this review. I can take another look, but I trust you did a good job there (plus, I reviewed the not-yet-librarified version earlier).

Yeah, the tests at the bottom of the file should be the only ones. The readability could probably use some inlining, but I wanted to start adding that non-Result API as well, because I think it might become useful, and there were some clearly shared paths.

Important to note if you look at the impl:

  • try_restart is the only public method
  • the rest are #[cfg(test)], to avoid rewriting the boilerplate of "spawn if Some, then recv.await"
  • the _any methods are the infallible non-pub ones

If you want me to look at specific pieces of it, e.g., task_mgr interaction, please start a conversation and mention me.

I guess the most important part is to check that process_shutdown_on_error is false :) It is now tenant-scoped, as discussed. I can't see how I could go wrong there.

koivunej commented May 15, 2023

I was thinking of a number of simplifications for this over the weekend, but will leave them for later:

  • stop using the async mutex -- I am trying to set an example of defaulting to the safe mutex, and may have been considering an "OwnedMutexGuard held over await points" at some stage; in the last iteration it's not needed
  • the non-retryable version should be the core abstraction, with retrying built on top of that
    • I was initially approaching this from the direction of results, so this stuck
    • and failed to see the better direction

koivunej commented May 16, 2023

Had to rebase on top of main to get a new CI attempt. 4591698 was the previous HEAD.

Oops, restoring the review suggestion commits on top. Restored:

Even if they don't always appear in the commit listing on the main PR page (it is chronologically sorted).

koivunej and others added 3 commits May 16, 2023 15:49
Co-authored-by: Christian Schwarz <christian@neon.tech>
even though it's a std::io::Error

Co-authored-by: Christian Schwarz <christian@neon.tech>
koivunej and others added 2 commits May 17, 2023 11:13
@LizardWizzard LizardWizzard left a comment

The idea looks good to me. Some minor comments.

Also, as I said in one of our previous discussions, it may be worth considering baking the factory into the SharedRetryable instance, so it is not possible to run one instance with different futures. Maybe that's too restrictive and not needed. For tests it is definitely handier to be able to pass different futures every time.

Using some abstraction to spawn things and passing it in may simplify the return types, but that's another abstraction.

libs/utils/src/shared_retryable.rs (comment outdated; resolved)
/// future will never make progress.
///
/// This complication exists because on pageserver we cannot use `tokio::spawn` directly
/// at this time.
Contributor

Can we provide a closure that accepts the future and spawns it when called?

Maybe something like https://docs.rs/futures/0.3.28/futures/task/trait.Spawn.html can make it a bit clearer, though it needs another abstraction to be introduced (probably our own, not this particular one from futures).

Contributor Author

Spawner requires boxing at the caller, so I'd prefer not doing that; that is what SpawnExt does: https://docs.rs/futures-util/0.3.28/src/futures_util/task/spawn.rs.html#49-51

Discussed this also with @funbringer: we could have a trait+ZST for spawning on tokio or on a closure.

A nice thing that followed from not spawning deep in the call stack is that the easiest way now to accidentally "leak" the strong: Arc<Receiver<_>> requires std::mem::forget, which can hardly be accidental. It is probably not obvious looking at the latest iteration, but that's why I don't really want to revert back.

There are tests showing that dropping the future that needs to be spawned will produce a failure. I guess it could turn into a fast-looping situation, but that should really come up during local testing, I think.

I could add some #[must_use] to try_restart. Would that help? I didn't check if they already get an implicit #[must_use] from std::future::Future.

Contributor

Spawner requires boxing at the caller, so I'd prefer not doing that; that is what SpawnExt does:

Yep, that's why I wrote this: probably our own, not this particular one from futures.

we could have a trait+ZST for spawning on tokio or on a closure.

I was thinking about something of that nature.

My point is that internal spawning should simplify the signatures a lot, i.e. no need to return multiple Future<Output = ...> values.

In general it would be interesting to see how this decision affects the complexity of the whole primitive, but personally I don't have the capacity to run such an experiment, so I'm on board with the current approach.

I could add some #[must_use]

Having #[must_use] would be a plus.
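
For what it's worth, a minimal sketch of the trait + ZST idea discussed above (the names are hypothetical; a pageserver variant would delegate to task_mgr::spawn instead of tokio::spawn):

    use std::future::Future;

    /// Hypothetical spawning abstraction: the generic method avoids forcing the
    /// caller to box the future, unlike futures' Spawn/SpawnExt.
    trait Spawner {
        fn spawn<F>(&self, fut: F)
        where
            F: Future<Output = ()> + Send + 'static;
    }

    /// Zero-sized type that spawns directly onto the tokio runtime.
    struct TokioSpawner;

    impl Spawner for TokioSpawner {
        fn spawn<F>(&self, fut: F)
        where
            F: Future<Output = ()> + Send + 'static,
        {
            tokio::spawn(fut);
        }
    }

The trade-off is that the generic spawn method makes the trait non-object-safe, so SharedRetryable would have to be generic over the spawner rather than hold a dyn Spawner.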

Two further inline comments on libs/utils/src/shared_retryable.rs (outdated; resolved).
problame commented May 17, 2023

Also, as I said in one of our previous discussions, it may be worth considering baking the factory into the SharedRetryable instance, so it is not possible to run one instance with different futures. Maybe that's too restrictive and not needed. For tests it is definitely handier to be able to pass different futures every time.

I like that, can be a follow-up though.

@LizardWizzard LizardWizzard left a comment

I think the PR is good in its current shape. It solves the problem. Polishing can be done in follow-ups if needed.

@hlinnaka
Contributor

With the decision that we'll spawn tasks earlier, in the HTTP handler, this doesn't need to shield the underlying code from async cancellations anymore. That allows this PR to be greatly simplified, as it doesn't need to spawn tasks any more.

@problame problame left a comment

Re-read the non-shared_retryable.rs code plus the shared_retryable.rs module comment.

I don't see an immediate functional gain for the system as a whole from coalescing instead of just bailing out early if an operation is ongoing (scopeguard + bool flag; see the sketch below). But I'm OK if others feel that this is something we need.
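
For comparison, a minimal sketch of that bail-out-early alternative (it assumes the scopeguard crate; the names and the AtomicBool choice are illustrative, not this PR's code):

    use std::sync::atomic::{AtomicBool, Ordering};

    fn try_delete_timeline(deletion_ongoing: &AtomicBool) -> Result<(), &'static str> {
        // Claim the flag; if another task already holds it, bail out instead of coalescing.
        if deletion_ongoing.swap(true, Ordering::AcqRel) {
            return Err("timeline deletion already in progress");
        }
        // Reset the flag on every exit path, including early returns and panics.
        let _reset = scopeguard::guard((), |_| deletion_ongoing.store(false, Ordering::Release));

        // ... perform the actual deletion here ...

        Ok(())
    }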

Comment on lines +1571 to +1584
// Grab the layer_removal_cs lock, and actually perform the deletion.
//
// It can also happen if we race with tenant detach, because,
// it doesn't grab the layer_removal_cs lock.
// This lock prevents multiple concurrent delete_timeline calls from
// stepping on each other's toes, while deleting the files. It also
// prevents GC or compaction from running at the same time.
//
// For now, log and continue.
// warn! level is technically not appropriate for the
// first case because we should expect retries to happen.
// But the error is so rare, it seems better to get attention if it happens.
let tenant_state = self.current_state();
warn!(
timeline_dir=?local_timeline_directory,
?tenant_state,
"timeline directory not found, proceeding anyway"
);
// continue with the rest of the deletion
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.
Contributor

Suggested change
// Grab the layer_removal_cs lock, and actually perform the deletion.
//
// It can also happen if we race with tenant detach, because,
// it doesn't grab the layer_removal_cs lock.
// This lock prevents multiple concurrent delete_timeline calls from
// stepping on each other's toes, while deleting the files. It also
// prevents GC or compaction from running at the same time.
//
// For now, log and continue.
// warn! level is technically not appropriate for the
// first case because we should expect retries to happen.
// But the error is so rare, it seems better to get attention if it happens.
let tenant_state = self.current_state();
warn!(
timeline_dir=?local_timeline_directory,
?tenant_state,
"timeline directory not found, proceeding anyway"
);
// continue with the rest of the deletion
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.
// Grab the layer_removal_cs lock, and actually perform the deletion.
// This serializes timeline deletion with compaction, gc, and eviction.
// All these other tasks check timeline state after acquiring the lock
// and bail out if it is `TimelineState::Stopping`.
//
// Without the serialization + check for timeline state, these other
// tasks would make decision based on layers in the layer map that
// we (delete_timeline) are about to delete from disk below.
//
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.

Comment on lines +1610 to +1612
// This can happen if we're called a second time, e.g.,
// because of a previous failure/cancellation at/after
// failpoint timeline-delete-after-rm.
Contributor

Suggested change
// This can happen if we're called a second time, e.g.,
// because of a previous failure/cancellation at/after
// failpoint timeline-delete-after-rm.
// This can happen if we're called a second time, e.g.,
// because of a previous failure at/after
// failpoint timeline-delete-after-rm.

Comment on lines +1667 to +1668
// with SharedRetryable this should no longer happen
warn!("no other task should had dropped the Timeline");
Contributor

Suggested change
// with SharedRetryable this should no longer happen
warn!("no other task should had dropped the Timeline");
// SharedRetryable runs the delete timeline future to completion
// and coalesces concurrent delete requests.
// We remove the timeline from the hashmap as the last step.
// So, this should not happen.
warn!("timeline already gone from timelines hash map");

Comment on lines +1675 to +1676
// the panic has already been formatted by hook, don't worry about it
Err(InnerDeleteTimelineError::OtherPanic)
Contributor

Nobody will know what hook you're talking about here.

Suggested change
// the panic has already been formatted by hook, don't worry about it
Err(InnerDeleteTimelineError::OtherPanic)
// don't repeat the panic text here.
// the panic has already been logged by logging::replace_panic_hook_with_tracing_panic_hook
Err(InnerDeleteTimelineError::OtherPanic)

Comment on lines 1684 to +1686

// Remove the timeline from the map.
let mut timelines = self.timelines.lock().unwrap();
let children_exist = timelines
.iter()
.any(|(_, entry)| entry.get_ancestor_timeline_id() == Some(timeline_id));
// XXX this can happen because `branch_timeline` doesn't check `TimelineState::Stopping`.
// We already deleted the layer files, so it's probably best to panic.
// (Ideally, above remove_dir_all is atomic so we don't see this timeline after a restart)
if children_exist {
panic!("Timeline grew children while we removed layer files");
}
let removed_timeline = timelines.remove(&timeline_id);
if removed_timeline.is_none() {
// This can legitimately happen if there's a concurrent call to this function.
// T1 T2
// lock
// unlock
// lock
// unlock
// remove files
// lock
// remove from map
// unlock
// return
// remove files
// lock
// remove from map observes empty map
// unlock
// return
debug!("concurrent call to this function won the race");
let (recv, maybe_fut) = timeline.delete_self.try_restart(factory).await;

Contributor

While re-reading this PR, I found the try_restart name confusing.

How about coalesce, to emphasize what the SharedRetryable actually does?

Contributor

Why try_restart is confusing: because it doesn't actually do anything itself.

coalesce isn't ideal either, because it also suggests that it does something.

Maybe wrap_for_coalesced_retry?

Comment on lines +4 to +46
/// Container using which many request handlers can come together and join a single task to
/// completion instead of racing each other and their own cancellation.
///
/// In a picture:
///
/// ```text
/// SharedRetryable::try_restart Spawned task completes with only one concurrent attempt
/// \ /
/// request handler 1 ---->|--X
/// request handler 2 ---->|-------|
/// request handler 3 ---->|-------|
/// | |
/// v |
/// one spawned task \------>/
///
/// (X = cancelled during await)
/// ```
///
/// Implementation is cancel safe. Implementation and internal structure are hurt by the inability
/// to just spawn the task, but this is needed for `pageserver` usage. Within `pageserver`, the
/// `task_mgr` must be used to spawn the future because it will cause awaiting during shutdown.
///
/// Implementation exposes a fully decomposed [`SharedRetryable::try_restart`] which requires the
/// caller to do the spawning before awaiting for the result. If the caller is dropped while this
/// happens, a new attempt will be required, and all concurrent awaiters will see a
/// [`RetriedTaskPanicked`] error.
///
/// There is another "family of APIs" [`SharedRetryable::attempt_spawn`] for infallible futures. It is
/// just provided for completeness, and it does not have a fully decomposed version like
/// `try_restart`.
///
/// For `try_restart_*` family of APIs, there is a concept of two leveled results. The inner level
/// is returned by the executed future. It needs to be `Clone`. Most errors are not `Clone`, so
/// implementation advice is to log the happened error, and not propagate more than a label as the
/// "inner error" which will be used to build an outer error. The outer error will also have to be
/// convertable from [`RetriedTaskPanicked`] to absorb that case as well.
///
/// ## Example
///
/// A shared service value completes the infallible work once, even if called concurrently by
/// multiple cancellable tasks.
///
/// Example moved as a test `service_example`.
Contributor

Improvements I'd like to see here, so that people don't need to read the code in this module to understand what's going on.

  • Move the cancel-safety paragraph to a separate heading & section at the bottom.
  • Before explaining the two variants of the API, explain the concept of the trait Retryable for classifying the errors returned by the future that is being coalesced on.
  • When explaining the two API variants, start with the safe & good one first (attempt_spawn). Then explain in a separate paragraph why try_restart exists, how it is to be used, and what the pitfalls are (e.g. forgetting to spawn).

@problame problame dismissed their stale review May 22, 2023 09:15

See last review

@hlinnaka
Contributor

With the decision that we'll spawn tasks earlier, in the HTTP handler, this doesn't need to shield the underlying code from async cancellations anymore. That allows this PR to be greatly simplified, as it doesn't need to spawn tasks any more.

To elaborate, we do not need this PR. A simple Mutex like this should be enough to do request coalescing:

diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 8349e1993..d7bf96bfb 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -1421,6 +1421,13 @@ impl Tenant {
             timeline
         };
 
+        // Prevent two tasks from trying to delete the timeline at the same time
+        let mut delete_lock_guard = timeline.delete_lock.lock().await;
+        if *delete_lock_guard {
+            // already deleted by a concurrent task
+            return Ok(());
+        }
+
         // Now that the Timeline is in Stopping state, request all the related tasks to
         // shut down.
         //
@@ -1576,6 +1583,8 @@ impl Tenant {
         }
         drop(timelines);
 
+        *delete_lock_guard = true;
+
         Ok(())
     }
 
diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs
index c47f4444f..490cb6068 100644
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -226,6 +226,10 @@ pub struct Timeline {
 
     state: watch::Sender<TimelineState>,
 
+    /// Prevent two tasks from deleting the timeline at the same time. If 'true',
+    /// the timeline has already been deleted.
+    pub delete_lock: tokio::sync::Mutex<bool>,
+
     eviction_task_timeline_state: tokio::sync::Mutex<EvictionTaskTimelineState>,
 }
 
@@ -1421,6 +1425,7 @@ impl Timeline {
                 eviction_task_timeline_state: tokio::sync::Mutex::new(
                     EvictionTaskTimelineState::default(),
                 ),
+                delete_lock: tokio::sync::Mutex::new(false),
             };
             result.repartition_threshold = result.get_checkpoint_distance() / 10;
             result

Haven't tested this, just to give you the idea of what I mean.

@hlinnaka hlinnaka left a comment

@koivunej koivunej closed this May 25, 2023
hlinnaka added a commit that referenced this pull request May 27, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request dropped, because the
client disconnected, the timeline deletion stops half-way through the
operation. That needs to be fixed, too, but that's a separate story.

(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)