fix: coalesce deleting a timeline #4194
Conversation
992 tests run: 949 passed, 2 failed, 41 skipped (full report). Failures on Postgres 15.
The comment gets automatically updated with the latest test results.
69e9237 at 2023-05-22T08:36:01.349Z :recycle:
This SharedRetryable abstraction feels overengineered to me. If I understand correctly, the naive open-coded version of this would be:
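A minimal sketch of what such an open-coded version might look like (a reconstruction, not the reviewer's original snippet), assuming the `futures` crate's `Shared`; all names are hypothetical:

```rust
use std::sync::{Arc, Mutex};

use futures::future::{BoxFuture, FutureExt, Shared};

/// Hypothetical reconstruction: all callers clone and await one shared future.
#[derive(Clone, Default)]
struct NaiveCoalesce {
    slot: Arc<Mutex<Option<Shared<BoxFuture<'static, Result<(), String>>>>>>,
}

impl NaiveCoalesce {
    async fn run(
        &self,
        make_work: impl FnOnce() -> BoxFuture<'static, Result<(), String>>,
    ) -> Result<(), String> {
        let shared = {
            let mut slot = self.slot.lock().unwrap();
            match &*slot {
                Some(existing) => existing.clone(),
                None => {
                    let shared = make_work().shared();
                    *slot = Some(shared.clone());
                    shared
                }
            }
        };
        // Caveat: if every clone is dropped mid-poll (all requests cancelled),
        // the underlying future is dropped too, and a failed attempt stays
        // cached forever. Those are the gaps SharedRetryable tries to close.
        shared.await
    }
}
```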
What does the new abstraction bring us over this?
Yeah, add on top of that example:
If you look at the superseded PR you can see a simpler, less split-up version of this. The tests also highlight different cases.
I treated shared_retryable as a black box in this review. I can take another look, but I trust you did a good job there (plus, I reviewed the not-yet-librarified version earlier).
If you want me to look at specific pieces of it, e.g., the task_mgr interaction, please start a conversation and mention me.
All comments added in this review are quite nitpicky, so if you feel strongly otherwise, write a justification and mark them as resolved.
Earlier @problame:
Yeah, the tests at the bottom of the file should be the only ones. The readability could probably use some inlining, but I wanted to start adding that non-Result API as well, because I think it might become useful, and there were some clearly shared paths. Important to note if you look at the impl:
I guess the most important part is to check that the process_shutdown_on_error is false :) It is now tenant scoped as discussed. I can't see how I could go wrong there.
Was thinking of a number of simplifications for this over the weekend, but will leave them for later.
Documented in the module.
As in, you could run many of these futures with `tokio::select` within one task or `runtime.block_on`; it does not matter.
force-pushed from 0b42ba9 to 83457a8
Co-authored-by: Christian Schwarz <christian@neon.tech>
…even though it's a std::io::Error. Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
force-pushed from e5fdfae to 70c2717
The idea looks good to me. Some minor comments.
Also, as I said in one of our previous discussions, it may be worth considering baking the factory inside the SharedRetryable instance, so it is not possible to run one instance with different futures. Maybe that's too restrictive and not needed. For tests it is definitely handier to be able to pass different futures every time.
Using some abstraction to spawn things and passing it in may simplify return types, but that's another abstraction.
```rust
/// future will never make progress.
///
/// This complication exists because on pageserver we cannot use `tokio::spawn` directly
/// at this time.
```
Can we provide a closure that accepts the future and spawns it when called?
Maybe something like https://docs.rs/futures/0.3.28/futures/task/trait.Spawn.html can make it a bit clearer. Though it needs another abstraction to be introduced (probably our own, not this particular one from futures).
Spawner requires boxing at the caller, so I'd prefer not to do that; it is what SpawnExt does: https://docs.rs/futures-util/0.3.28/src/futures_util/task/spawn.rs.html#49-51
Discussed this also with @funbringer; we could have a trait+ZST for spawning on tokio or on a closure.
A nice thing that followed from not spawning deep in the call stack is that the easiest way to accidentally "leak" the strong `Arc<Receiver<_>>` now requires `std::mem::forget`, which can hardly be accidental. It is probably not obvious looking at the latest iteration, but that's why I don't really like to revert back.
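A minimal sketch of the trait+ZST idea mentioned above (names hypothetical, not the PR's actual API):

```rust
/// Hypothetical spawner abstraction: generic over the future type,
/// so no boxing is forced on the caller (unlike futures' Spawn/SpawnExt).
trait Spawner {
    fn spawn<F>(&self, fut: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static;
}

/// Zero-sized type that spawns onto the ambient tokio runtime.
struct TokioSpawner;

impl Spawner for TokioSpawner {
    fn spawn<F>(&self, fut: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        tokio::spawn(fut);
    }
}
```

Because `spawn` is generic, the trait is not object safe, and that is the point: callers pass the ZST as a generic parameter and the future is never boxed.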
There are tests to show that dropping the future that needs to be spawned will produce a failure. I guess it could become a fast-looping situation, but that should really come up during local testing, I think.
I could add some `#[must_use]` to `try_restart`. Would that help? I didn't check if they get an implicit `#[must_use]` from `std::future::Future`.
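For reference, `std::future::Future` is itself annotated `#[must_use]`, so functions returning `impl Future` already warn when the result is dropped; a function-level attribute can still add a more actionable message. A hypothetical sketch:

```rust
// Hypothetical illustration only. The fn-level attribute supplements the
// implicit #[must_use] that comes from returning `impl Future`.
#[must_use = "spawn this future, or the new attempt will never run"]
fn make_attempt() -> impl std::future::Future<Output = ()> {
    async {
        // ... the work that has to be spawned by the caller ...
    }
}

fn handler() {
    // warning: unused return value of `make_attempt` that must be used
    make_attempt();
}

fn main() {
    handler();
}
```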
> Spawner requires boxing at the caller, so I'd prefer not doing that, it is what SpawnExt does:

Yep, that's why I wrote this: "Probably our own, not this particular one from futures".

> we could have a trait+ZST for spawning on tokio or on a closure.

Was thinking about something of that nature.
My point is that internal spawning should simplify signatures a lot, i.e. no need to return multiple `Future<Output=...>`.
In general it would be interesting to see how this decision affects the complexity of the whole primitive, but personally I don't have the capacity to run such an experiment, so I'm on board with the current approach.

> I could add some `#[must_use]`

Having `#[must_use]` would be a plus.
I like that, can be a follow-up though.
I think the PR is good in its current shape. It solves the problem. Polishing can be done in follow-ups if needed.
With the decision that we'll spawn tasks earlier in the HTTP handler, this doesn't need to shield the underlying code from async cancellations anymore. That allows this PR to be greatly simplified, as it doesn't need to spawn tasks any more.
Re-read the non-shared_retryable.rs code plus the shared_retryable.rs module comment.
I don't see an immediate functional gain for the system as a whole from the coalescing, compared to just bailing out early if an operation is ongoing (scopeguard + bool flag; see the sketch below). But I'm OK if others feel that this is something we need.
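A minimal sketch of that "bail out early" alternative, assuming the `scopeguard` crate (names hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical sketch: a flag plus a scopeguard, instead of coalescing
/// concurrent callers onto one shared attempt.
struct DeletionFlag(AtomicBool);

impl DeletionFlag {
    fn try_begin(&self) -> Result<impl Drop + '_, &'static str> {
        if self.0.swap(true, Ordering::SeqCst) {
            // someone else is already deleting; the caller errors out
            return Err("deletion already in progress");
        }
        // clear the flag when the guard drops: on success, error, or
        // request-handler cancellation
        Ok(scopeguard::guard(&self.0, |flag| {
            flag.store(false, Ordering::SeqCst)
        }))
    }
}
```

With this, a second request gets an error instead of waiting for the first attempt's result; that is the trade-off versus coalescing, at the cost of clients having to retry.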
```rust
// Grab the layer_removal_cs lock, and actually perform the deletion.
//
// It can also happen if we race with tenant detach, because,
// it doesn't grab the layer_removal_cs lock.
// This lock prevents multiple concurrent delete_timeline calls from
// stepping on each other's toes, while deleting the files. It also
// prevents GC or compaction from running at the same time.
//
// For now, log and continue.
// warn! level is technically not appropriate for the
// first case because we should expect retries to happen.
// But the error is so rare, it seems better to get attention if it happens.
let tenant_state = self.current_state();
warn!(
    timeline_dir=?local_timeline_directory,
    ?tenant_state,
    "timeline directory not found, proceeding anyway"
);
// continue with the rest of the deletion
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.
```
Suggested change:

```rust
// Grab the layer_removal_cs lock, and actually perform the deletion.
// This serializes timeline deletion with compaction, gc, and eviction.
// All these other tasks check timeline state after acquiring the lock
// and bail out if it is `TimelineState::Stopping`.
//
// Without the serialization + check for timeline state, these other
// tasks would make decisions based on layers in the layer map that
// we (delete_timeline) are about to delete from disk below.
//
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.
```
```rust
// This can happen if we're called a second time, e.g.,
// because of a previous failure/cancellation at/after
// failpoint timeline-delete-after-rm.
```
Suggested change:

```rust
// This can happen if we're called a second time, e.g.,
// because of a previous failure at/after
// failpoint timeline-delete-after-rm.
```
```rust
// with SharedRetryable this should no longer happen
warn!("no other task should had dropped the Timeline");
```
Suggested change:

```rust
// SharedRetryable runs the delete timeline future to completion
// and coalesces concurrent delete requests.
// We remove the timeline from the hashmap as the last step,
// so this should not happen.
warn!("timeline already gone from timelines hash map");
```
```rust
// the panic has already been formatted by hook, don't worry about it
Err(InnerDeleteTimelineError::OtherPanic)
```
Nobody will know what hook you're talking about here.
Suggested change:

```rust
// don't repeat the panic text here;
// the panic has already been logged by logging::replace_panic_hook_with_tracing_panic_hook
Err(InnerDeleteTimelineError::OtherPanic)
```
```rust
// Remove the timeline from the map.
let mut timelines = self.timelines.lock().unwrap();
let children_exist = timelines
    .iter()
    .any(|(_, entry)| entry.get_ancestor_timeline_id() == Some(timeline_id));
// XXX this can happen because `branch_timeline` doesn't check `TimelineState::Stopping`.
// We already deleted the layer files, so it's probably best to panic.
// (Ideally, above remove_dir_all is atomic so we don't see this timeline after a restart)
if children_exist {
    panic!("Timeline grew children while we removed layer files");
}
let removed_timeline = timelines.remove(&timeline_id);
if removed_timeline.is_none() {
    // This can legitimately happen if there's a concurrent call to this function.
    //   T1                                      T2
    //   lock
    //   unlock
    //                                           lock
    //                                           unlock
    //   remove files
    //   lock
    //   remove from map
    //   unlock
    //   return
    //                                           remove files
    //                                           lock
    //                                           remove from map observes empty map
    //                                           unlock
    //                                           return
    debug!("concurrent call to this function won the race");

let (recv, maybe_fut) = timeline.delete_self.try_restart(factory).await;
```
While re-reading this PR, I found the `try_restart` name confusing.
How about `coalesce`, to emphasize what the `SharedRetryable` actually does?
Why is `try_restart` confusing: because it doesn't actually do anything.
`coalesce` isn't ideal either, because it suggests that it does something as well.
Maybe `wrap_for_coalesced_retry`?
```rust
/// Container using which many request handlers can come together and join a single task to
/// completion instead of racing each other and their own cancellation.
///
/// In a picture:
///
/// ```text
/// SharedRetryable::try_restart    Spawned task completes with only one concurrent attempt
///                  \                        /
/// request handler 1 ---->|--X
/// request handler 2 ---->|-------|
/// request handler 3 ---->|-------|
///                        |       |
///                        v       |
///       one spawned task \------>/
///
/// (X = cancelled during await)
/// ```
///
/// Implementation is cancel safe. Implementation and internal structure are hurt by the inability
/// to just spawn the task, but this is needed for `pageserver` usage. Within `pageserver`, the
/// `task_mgr` must be used to spawn the future because it will cause awaiting during shutdown.
///
/// Implementation exposes a fully decomposed [`SharedRetryable::try_restart`] which requires the
/// caller to do the spawning before awaiting the result. If the caller is dropped while this
/// happens, a new attempt will be required, and all concurrent awaiters will see a
/// [`RetriedTaskPanicked`] error.
///
/// There is another "family of APIs" [`SharedRetryable::attempt_spawn`] for infallible futures. It is
/// just provided for completeness, and it does not have a fully decomposed version like
/// `try_restart`.
///
/// For the `try_restart_*` family of APIs, there is a concept of two leveled results. The inner level
/// is returned by the executed future. It needs to be `Clone`. Most errors are not `Clone`, so the
/// implementation advice is to log the error that happened, and not propagate more than a label as the
/// "inner error", which will be used to build an outer error. The outer error will also have to be
/// convertible from [`RetriedTaskPanicked`] to absorb that case as well.
///
/// ## Example
///
/// A shared service value completes the infallible work once, even if called concurrently by
/// multiple cancellable tasks.
///
/// Example moved as a test `service_example`.
```
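To make the "two leveled results" paragraph concrete, here is a minimal sketch of the convention it describes; every name is hypothetical except `RetriedTaskPanicked`, for which a stand-in is defined locally:

```rust
/// Stand-in for the marker error named in the doc comment above.
struct RetriedTaskPanicked;

/// Inner result: what the coalesced future resolves with. Must be Clone,
/// because every concurrent waiter receives a copy; real io/anyhow errors
/// get logged inside the task and reduced to a label like this.
#[derive(Clone, Copy, Debug)]
enum InnerError {
    DirectoryGone,
}

/// Outer result: what a request handler returns to its caller.
#[derive(Debug)]
enum OuterError {
    DirectoryGone,
    TaskPanicked,
}

impl From<InnerError> for OuterError {
    fn from(e: InnerError) -> Self {
        match e {
            InnerError::DirectoryGone => OuterError::DirectoryGone,
        }
    }
}

// "convertible from RetriedTaskPanicked to absorb that case as well"
impl From<RetriedTaskPanicked> for OuterError {
    fn from(_: RetriedTaskPanicked) -> Self {
        OuterError::TaskPanicked
    }
}
```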
Improvements I'd like to see here, so that people don't need to read the code in this module to understand what's going on:
- Move the cancel safety paragraph to a separate heading & section at the bottom.
- Before explaining the two variants of the API, explain the concept of `trait Retryable` for classifying the errors returned by the future that is being coalesced on.
- When explaining the two API variants, start with the safe & good one first (`attempt_spawn`). Then explain in a separate paragraph why `try_restart` exists, how it is to be used, and what the pitfalls are (e.g. forgetting to spawn).
To elaborate, we do not need this PR. A simple Mutex like this should be enough to do request coalescing:
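A minimal sketch of what such Mutex-based coalescing might look like (a reconstruction, not the reviewer's original snippet), assuming `tokio::sync::Mutex`; names hypothetical:

```rust
use tokio::sync::Mutex;

/// Hypothetical reconstruction: an async Mutex serializes deletions,
/// and a recorded success short-circuits later callers.
struct DeleteOnce {
    done: Mutex<bool>,
}

impl DeleteOnce {
    async fn delete<F>(&self, do_delete: F) -> Result<(), String>
    where
        F: std::future::Future<Output = Result<(), String>>,
    {
        // concurrent callers queue here instead of racing each other
        let mut done = self.done.lock().await;
        if *done {
            return Ok(()); // already deleted; coalesced to a no-op
        }
        // note: if the awaiting handler is cancelled here, do_delete is
        // dropped mid-run; that is the async cancellation problem discussed
        // elsewhere in this thread
        do_delete.await?; // an Err leaves `done` false, so a retry can run
        *done = true;
        Ok(())
    }
}
```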
Haven't tested this, just to give you the idea of what I mean.
If the timeline is already being deleted, return an error. We used to notice the duplicate request and error out in persist_index_part_with_deleted_flag(), but it's better to detect it earlier. Add an explicit lock for the deletion.

Note: this doesn't do anything about the async cancellation problem (GitHub issue #3478): if the original HTTP request is dropped because the client disconnected, the timeline deletion stops half-way through the operation. That needs to be fixed, too, but that's a separate story. (This is a simpler replacement for PR #4194. I'm also working on the cancellation shielding, see PR #4314.)
Supersedes an earlier #4159. In an effort to let us write easier code, this particular function was giving us a lot of trouble, so I created `utils::shared_retried::SharedRetried`. Its implementation grew much nastier than I initially wanted, mainly for the ability to use `task_mgr::spawn`.

SharedRetried has some tests. It has a bit contrived features because of how we can produce non-terminal or retried results from the future factory that is passed to `try_restart`:
- `Ok(())` is always terminal
- `Err(InnerDeleteTimelineError::DeletedGrewChildren)` is terminal, because normally panics from this one future would be non-terminal, but this intentional panicking-to-poison-tenant is caught as a separate case

"Terminal" results mean that there will not be any more attempts (see the sketch below). With the `Ok(())` case, the request handler would most likely be the only one keeping a timeline alive, so there is no need to retry. After `DeletedGrewChildren` we have no way to retry.

The hope is that in the future we can write easier management API request handlers, which are not cancelled.
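A minimal sketch of that terminal/retryable classification (the enum variant `Other` is a hypothetical stand-in for the remaining error cases; the PR's real mechanism is the `Retryable` classification described above):

```rust
/// Hypothetical sketch of the terminal vs. retryable split.
#[derive(Clone, Copy, Debug)]
enum InnerDeleteTimelineError {
    DeletedGrewChildren,
    Other,
}

fn is_terminal(result: &Result<(), InnerDeleteTimelineError>) -> bool {
    match result {
        // success: nothing left to retry
        Ok(()) => true,
        // the tenant is intentionally poisoned; retrying cannot help
        Err(InnerDeleteTimelineError::DeletedGrewChildren) => true,
        // everything else allows a later caller to restart the attempt
        Err(InnerDeleteTimelineError::Other) => false,
    }
}
```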
Reviewing: start with `SharedRetried` to understand how it's meant to be used.