
fix: coalesce deleting a timeline #4194

Closed · wants to merge 29 commits

Conversation

@koivunej koivunej commented May 10, 2023

Supersedes the earlier #4159. In an effort to let us write simpler code, this particular function was giving us a lot of trouble, so I created utils::shared_retried::SharedRetried. Its implementation grew much nastier than I initially wanted, mainly to allow using task_mgr::spawn.

SharedRetried has some tests. Its features are a bit contrived because of how we can produce non-terminal or retried results from the future factory that is passed to try_restart:

  • Ok(()) is always terminal
  • Err(InnerDeleteTimelineError::DeletedGrewChildren) is terminal: normally, panics from this future would be non-terminal, but this intentional panicking-to-poison-the-tenant is caught as a separate case

"Terminal" results mean, that there will not be any more attempts. With the Ok(()) case, the request handler would most likely be the only one keeping a timeline alive, so no need to retry. After DeletedGrewChildren we have no way to retry.

The hope is that in the future we can write simpler management API request handlers which are not cancelled.
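
For illustration, a minimal sketch of the terminal vs. retried classification (only the DeletedGrewChildren variant comes from this PR; Transient and is_terminal are made-up names for the example):

    /// Hypothetical error type for the inner level of results.
    #[derive(Clone, Debug)]
    enum InnerDeleteTimelineError {
        /// Terminal: retrying can never succeed once the timeline grew children.
        DeletedGrewChildren,
        /// Non-terminal: for example a transient I/O failure; a later call may retry.
        Transient,
    }

    /// A terminal result is stored and handed out to all later callers;
    /// a non-terminal one lets the next caller restart the attempt.
    fn is_terminal(res: &Result<(), InnerDeleteTimelineError>) -> bool {
        match res {
            // Ok(()) is always terminal: the deletion completed.
            Ok(()) => true,
            Err(InnerDeleteTimelineError::DeletedGrewChildren) => true,
            Err(InnerDeleteTimelineError::Transient) => false,
        }
    }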

Reviewing:

  • start by looking at the tests of SharedRetried to understand how it's meant to be used
  • then look at the doc comment
  • then remember to turn on "hide whitespace" before looking at the delete_timeline changes



github-actions bot commented May 10, 2023

992 tests run: 949 passed, 2 failed, 41 skipped (full report)


Failures on Postgres 15

# Run failed on Postgres 15 tests locally:
DEFAULT_PG_VERSION=15 scripts/pytest -k "test_broken_timeline[debug-pg15]"
Flaky tests (3)

Postgres 15

  • test_close_on_connections_exit[endpoint]: ✅ debug

Postgres 14

  • test_close_on_connections_exit[project]: ✅ debug
  • test_close_on_connections_exit[endpoint]: ✅ debug
The comment gets automatically updated with the latest test results
69e9237 at 2023-05-22T08:36:01.349Z :recycle:


@koivunej koivunej marked this pull request as ready for review May 10, 2023 14:32
@koivunej koivunej requested review from a team as code owners May 10, 2023 14:32
@koivunej koivunej requested review from knizhnik and removed request for a team May 10, 2023 14:32

@hlinnaka
Contributor

This SharedRetryable abstraction feels overengineered to me.

If I understand correctly, the naive open-coded version of this would be:

    // `None` means the deletion hasn't been performed yet. `Some` means that it has, and
    // it contains the result.
    //
    // This would be a field in the Timeline struct.
    let deletion_completed: tokio::sync::Mutex<Option<Result<(), ()>>>
        = tokio::sync::Mutex::new(None);

    // Start the deletion, or if another task is already performing it, wait for it to finish.
    let res = {
        let mut deletion_guard = deletion_completed.lock().await;
        if let Some(res) = *deletion_guard {
            // another task performed the deletion already, just return its result
            res
        } else {
            // perform the deletion ...    
            let res = Ok(());

            // remember the result, if another task tries to perform deletion
            // concurrently
            *deletion_guard = Some(res);

            res
        }
    };
    
    println!("got result: {:?}", res);

What does the new abstraction bring us over this?

koivunej commented May 10, 2023

Yeah, add on top of that example:

  • support for spawning a single task, without cancellation, to do the work, and allowing 0..N other tasks to await that one task (this is most likely the overengineered-looking part); see the sketch after this comment
    • "without cancellation" meaning the work is not cancelled when the original attempt is cancelled
  • transient vs. persistent errors, with retry on transient ones

If you look at the superseded PR you can see a simpler, less split-up version of this. The tests also highlight the different cases.
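
For the first bullet, a minimal sketch of the coalescing idea: one detached task does the work and any number of callers await its result. This is not the PR's SharedRetried API; it uses tokio::spawn directly (which pageserver code cannot, hence task_mgr), the names Coalesced and run_once are made up, and it leaves out the retry-on-transient-error part:

    use std::sync::{Arc, Mutex};
    use tokio::sync::watch;

    #[derive(Clone, Default)]
    struct Coalesced {
        // `None` until the first caller starts the work; afterwards every caller
        // clones the same receiver and awaits the single spawned task.
        inner: Arc<Mutex<Option<watch::Receiver<Option<Result<(), String>>>>>>,
    }

    impl Coalesced {
        async fn run_once<F, Fut>(&self, work: F) -> Result<(), String>
        where
            F: FnOnce() -> Fut,
            Fut: std::future::Future<Output = Result<(), String>> + Send + 'static,
        {
            let mut rx = {
                let mut slot = self.inner.lock().unwrap();
                match &*slot {
                    Some(rx) => rx.clone(),
                    None => {
                        let (tx, rx) = watch::channel(None);
                        let fut = work();
                        // Detached task: cancelling any caller does not cancel the work.
                        tokio::spawn(async move {
                            let _ = tx.send(Some(fut.await));
                        });
                        *slot = Some(rx.clone());
                        rx
                    }
                }
            };
            // Wait until the spawned task publishes its result.
            loop {
                if let Some(res) = rx.borrow().clone() {
                    return res;
                }
                if rx.changed().await.is_err() {
                    return Err("worker task dropped without publishing a result".into());
                }
            }
        }
    }

A request handler would call coalesced.run_once(|| async { /* delete */ Ok(()) }).await and get the same result whether it started the attempt or joined one already in flight.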

@problame problame left a comment

I treated shared_retryable as a black box in this review. I can take another look, but I trust you did a good job there (plus, I reviewed the not-yet-librarified version earlier).

If you want me to look at specific pieces of it, e.g., task_mgr interaction, please start a conversation and mention me.

All comments added in this review are quite nitpicky, so if you feel strongly otherwise, write a justification and mark them as resolved.

Inline review comments on pageserver/src/tenant.rs and test_runner/regress/test_timeline_delete.py (some outdated; all resolved).
@koivunej
Contributor Author

Earlier @problame:

I treated shared_retryable as a black box in this review. I can take another look, but I trust you did a good job there (plus, I reviewed the not-yet-librarified version earlier).

Yeah, the tests at the bottom of the file should be the only ones. The readability could probably use some inlining, but I wanted to start adding that non-Result API as well, because I think it might become useful, and there were some clearly shared paths.

Important to note if you look at the impl:

  • try_restart is the only public method
  • the rest are #[cfg(test)], to avoid rewriting the boilerplate of "spawn if Some, then recv.await"
  • the _any methods are the infallible non-pub ones

If you want me to look at specific pieces of it, e.g., task_mgr interaction, please start a conversation and mention me.

I guess the most important part is to check that process_shutdown_on_error is false :) It is now tenant-scoped, as discussed. I can't see how I could go wrong there.

koivunej commented May 15, 2023

I was thinking of a number of simplifications for this over the weekend, but will leave them for later:

  • stop using the async mutex -- I am trying to set an example of defaulting to the safe mutex, and may have been considering an "OwnedMutexGuard held over await points" at some stage; in the last iteration it's not needed
  • the non-retryable version should be the core abstraction, with retrying built on top of that
    • I was initially approaching this from the direction of results, so this stuck
    • and failed to see the better direction

koivunej commented May 16, 2023

Had to rebase on top of main to get a new CI attempt. 4591698 was the previous HEAD.

Oops, restoring the review suggestion commits on top. Restored:

Even if they don't always appear in the commit listing on the main PR page (it is chronologically sorted).

koivunej and others added 3 commits May 16, 2023 15:49
Co-authored-by: Christian Schwarz <christian@neon.tech>
even though it's a std::io::Error

Co-authored-by: Christian Schwarz <christian@neon.tech>
koivunej and others added 2 commits May 17, 2023 11:13
@LizardWizzard LizardWizzard left a comment

The idea looks good to me. Some minor comments.

Also, as I said in one of our previous discussions, it may be worth considering baking the factory into the SharedRetryable instance, so it is not possible to run one instance with different futures. Maybe that's too restrictive and not needed. For tests it is definitely handier to be able to pass different futures every time.

Using some abstraction to spawn things and passing it in may simplify the return types, but that's another abstraction.

libs/utils/src/shared_retryable.rs (comment outdated; resolved)
/// future will never make progress.
///
/// This complication exists because on pageserver we cannot use `tokio::spawn` directly
/// at this time.
Contributor

Can we provide a closure that accepts the future and spawns it when called?

Maybe something like https://docs.rs/futures/0.3.28/futures/task/trait.Spawn.html can make it a bit clearer, though it needs another abstraction to be introduced (probably our own, not this particular one from futures).

Contributor Author

Spawner requires boxing at the caller, so I'd prefer not doing that; that is what SpawnExt does: https://docs.rs/futures-util/0.3.28/src/futures_util/task/spawn.rs.html#49-51

Discussed this also with @funbringer: we could have a trait+ZST for spawning on tokio or on a closure.

A nice thing that followed from not spawning deep in the call stack is that the easiest way now to accidentally "leak" the strong: Arc<Receiver<_>> requires std::mem::forget, which can hardly be accidental. It is probably not obvious looking at the latest iteration, but that's why I don't really want to revert back.

There are tests showing that dropping the future that needs to be spawned will produce a failure. I guess it could turn into a fast-looping situation, but that should really come up during local testing, I think.

I could add some #[must_use] to try_restart. Would that help? I didn't check if they already get an implicit #[must_use] from std::future::Future.

Contributor

Spawner requires boxing at the caller, so I'd prefer not doing that; that is what SpawnExt does:

Yep, that's why I wrote this: probably our own, not this particular one from futures.

we could have a trait+ZST for spawning on tokio or on a closure.

I was thinking about something of that nature.

My point is that internal spawning should simplify the signatures a lot, i.e. no need to return multiple Future<Output = ...> values.

In general it would be interesting to see how this decision affects the complexity of the whole primitive, but personally I don't have the capacity to run such an experiment, so I'm on board with the current approach.

I could add some #[must_use]

Having #[must_use] would be a plus.
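
For what it's worth, a minimal sketch of the trait + ZST idea discussed above (the names are hypothetical; a pageserver variant would delegate to task_mgr::spawn instead of tokio::spawn):

    use std::future::Future;

    /// Hypothetical spawning abstraction: the generic method avoids forcing the
    /// caller to box the future, unlike futures' Spawn/SpawnExt.
    trait Spawner {
        fn spawn<F>(&self, fut: F)
        where
            F: Future<Output = ()> + Send + 'static;
    }

    /// Zero-sized type that spawns directly onto the tokio runtime.
    struct TokioSpawner;

    impl Spawner for TokioSpawner {
        fn spawn<F>(&self, fut: F)
        where
            F: Future<Output = ()> + Send + 'static,
        {
            tokio::spawn(fut);
        }
    }

The trade-off is that the generic spawn method makes the trait non-object-safe, so SharedRetryable would have to be generic over the spawner rather than hold a dyn Spawner.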

Two further inline comments on libs/utils/src/shared_retryable.rs (outdated; resolved).
problame commented May 17, 2023

Also, as I said in one of our previous discussions, it may be worth considering baking the factory into the SharedRetryable instance, so it is not possible to run one instance with different futures. Maybe that's too restrictive and not needed. For tests it is definitely handier to be able to pass different futures every time.

I like that, can be a follow-up though.

@LizardWizzard LizardWizzard left a comment

I think the PR is good in its current shape. It solves the problem. Polishing can be done in follow-ups if needed.

@hlinnaka
Contributor

With the decision that we'll spawn tasks earlier, in the HTTP handler, this doesn't need to shield the underlying code from async cancellations anymore. That allows this PR to be greatly simplified, as it doesn't need to spawn tasks any more.

@problame problame left a comment

Re-read the non-shared_retryable.rs code plus the shared_retryable.rs module comment.

I don't see an immediate functional gain for the system as a whole from coalescing instead of just bailing out early if an operation is ongoing (scopeguard + bool flag; see the sketch below). But I'm OK if others feel that this is something we need.
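
For comparison, a minimal sketch of that bail-out-early alternative (it assumes the scopeguard crate; the names and the AtomicBool choice are illustrative, not this PR's code):

    use std::sync::atomic::{AtomicBool, Ordering};

    fn try_delete_timeline(deletion_ongoing: &AtomicBool) -> Result<(), &'static str> {
        // Claim the flag; if another task already holds it, bail out instead of coalescing.
        if deletion_ongoing.swap(true, Ordering::AcqRel) {
            return Err("timeline deletion already in progress");
        }
        // Reset the flag on every exit path, including early returns and panics.
        let _reset = scopeguard::guard((), |_| deletion_ongoing.store(false, Ordering::Release));

        // ... perform the actual deletion here ...

        Ok(())
    }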

Comment on lines +1571 to +1584
// Grab the layer_removal_cs lock, and actually perform the deletion.
//
// It can also happen if we race with tenant detach, because,
// it doesn't grab the layer_removal_cs lock.
// This lock prevents multiple concurrent delete_timeline calls from
// stepping on each other's toes, while deleting the files. It also
// prevents GC or compaction from running at the same time.
//
// For now, log and continue.
// warn! level is technically not appropriate for the
// first case because we should expect retries to happen.
// But the error is so rare, it seems better to get attention if it happens.
let tenant_state = self.current_state();
warn!(
timeline_dir=?local_timeline_directory,
?tenant_state,
"timeline directory not found, proceeding anyway"
);
// continue with the rest of the deletion
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.
Contributor

Suggested change
// Grab the layer_removal_cs lock, and actually perform the deletion.
//
// It can also happen if we race with tenant detach, because,
// it doesn't grab the layer_removal_cs lock.
// This lock prevents multiple concurrent delete_timeline calls from
// stepping on each other's toes, while deleting the files. It also
// prevents GC or compaction from running at the same time.
//
// For now, log and continue.
// warn! level is technically not appropriate for the
// first case because we should expect retries to happen.
// But the error is so rare, it seems better to get attention if it happens.
let tenant_state = self.current_state();
warn!(
timeline_dir=?local_timeline_directory,
?tenant_state,
"timeline directory not found, proceeding anyway"
);
// continue with the rest of the deletion
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.
// Grab the layer_removal_cs lock, and actually perform the deletion.
// This serializes timeline deletion with compaction, gc, and eviction.
// All these other tasks check timeline state after acquiring the lock
// and bail out if it is `TimelineState::Stopping`.
//
// Without the serialization + check for timeline state, these other
// tasks would make decision based on layers in the layer map that
// we (delete_timeline) are about to delete from disk below.
//
// Note that there are still other race conditions between
// GC, compaction and timeline deletion. GC task doesn't
// register itself properly with the timeline it's
// operating on. See
// https://github.com/neondatabase/neon/issues/2671
//
// No timeout here, GC & Compaction should be responsive to the
// `TimelineState::Stopping` change.

Comment on lines +1610 to +1612
// This can happen if we're called a second time, e.g.,
// because of a previous failure/cancellation at/after
// failpoint timeline-delete-after-rm.
Contributor

Suggested change
// This can happen if we're called a second time, e.g.,
// because of a previous failure/cancellation at/after
// failpoint timeline-delete-after-rm.
// This can happen if we're called a second time, e.g.,
// because of a previous failure at/after
// failpoint timeline-delete-after-rm.

Comment on lines +1667 to +1668
// with SharedRetryable this should no longer happen
warn!("no other task should had dropped the Timeline");
Contributor

Suggested change
// with SharedRetryable this should no longer happen
warn!("no other task should had dropped the Timeline");
// SharedRetryable runs the delete timeline future to completion
// and coalesces concurrent delete requests.
// We remove the timeline from the hashmap as the last step.
// So, this should not happen.
warn!("timeline already gone from timelines hash map");

Comment on lines +1675 to +1676
// the panic has already been formatted by hook, don't worry about it
Err(InnerDeleteTimelineError::OtherPanic)
Contributor

Nobody will know what hook you're talking about here.

Suggested change
// the panic has already been formatted by hook, don't worry about it
Err(InnerDeleteTimelineError::OtherPanic)
// don't repeat the panic text here.
// the panic has already been logged by logging::replace_panic_hook_with_tracing_panic_hook
Err(InnerDeleteTimelineError::OtherPanic)

Comment on lines 1684 to +1686

// Remove the timeline from the map.
let mut timelines = self.timelines.lock().unwrap();
let children_exist = timelines
.iter()
.any(|(_, entry)| entry.get_ancestor_timeline_id() == Some(timeline_id));
// XXX this can happen because `branch_timeline` doesn't check `TimelineState::Stopping`.
// We already deleted the layer files, so it's probably best to panic.
// (Ideally, above remove_dir_all is atomic so we don't see this timeline after a restart)
if children_exist {
panic!("Timeline grew children while we removed layer files");
}
let removed_timeline = timelines.remove(&timeline_id);
if removed_timeline.is_none() {
// This can legitimately happen if there's a concurrent call to this function.
// T1 T2
// lock
// unlock
// lock
// unlock
// remove files
// lock
// remove from map
// unlock
// return
// remove files
// lock
// remove from map observes empty map
// unlock
// return
debug!("concurrent call to this function won the race");
let (recv, maybe_fut) = timeline.delete_self.try_restart(factory).await;

Contributor

While re-reading this PR, I found the try_restart name confusing.

How about coalesce, to emphasize what the SharedRetryable actually does?

Contributor

Why try_restart is confusing: because it doesn't actually do anything itself.

coalesce isn't ideal either, because it also suggests that it does something.

Maybe wrap_for_coalesced_retry?

Comment on lines +4 to +46
/// Container using which many request handlers can come together and join a single task to
/// completion instead of racing each other and their own cancellation.
///
/// In a picture:
///
/// ```text
/// SharedRetryable::try_restart Spawned task completes with only one concurrent attempt
/// \ /
/// request handler 1 ---->|--X
/// request handler 2 ---->|-------|
/// request handler 3 ---->|-------|
/// | |
/// v |
/// one spawned task \------>/
///
/// (X = cancelled during await)
/// ```
///
/// Implementation is cancel safe. Implementation and internal structure are hurt by the inability
/// to just spawn the task, but this is needed for `pageserver` usage. Within `pageserver`, the
/// `task_mgr` must be used to spawn the future because it will cause awaiting during shutdown.
///
/// Implementation exposes a fully decomposed [`SharedRetryable::try_restart`] which requires the
/// caller to do the spawning before awaiting for the result. If the caller is dropped while this
/// happens, a new attempt will be required, and all concurrent awaiters will see a
/// [`RetriedTaskPanicked`] error.
///
/// There is another "family of APIs" [`SharedRetryable::attempt_spawn`] for infallible futures. It is
/// just provided for completeness, and it does not have a fully decomposed version like
/// `try_restart`.
///
/// For `try_restart_*` family of APIs, there is a concept of two leveled results. The inner level
/// is returned by the executed future. It needs to be `Clone`. Most errors are not `Clone`, so
/// implementation advice is to log the happened error, and not propagate more than a label as the
/// "inner error" which will be used to build an outer error. The outer error will also have to be
/// convertable from [`RetriedTaskPanicked`] to absorb that case as well.
///
/// ## Example
///
/// A shared service value completes the infallible work once, even if called concurrently by
/// multiple cancellable tasks.
///
/// Example moved as a test `service_example`.
Contributor

Improvements I'd like to see here, so that people don't need to read the code in this module to understand what's going on.

  • Move the cancel-safety paragraph to a separate heading & section at the bottom.
  • Before explaining the two variants of the API, explain the concept of the trait Retryable for classifying the errors returned by the future that is being coalesced on.
  • When explaining the two API variants, start with the safe & good one first (attempt_spawn). Then explain in a separate paragraph why try_restart exists, how it is to be used, and what the pitfalls are (e.g. forgetting to spawn).

@problame problame dismissed their stale review May 22, 2023 09:15

See last review

@hlinnaka
Contributor

With the decision that we'll spawn tasks earlier, in the HTTP handler, this doesn't need to shield the underlying code from async cancellations anymore. That allows this PR to be greatly simplified, as it doesn't need to spawn tasks any more.

To elaborate, we do not need this PR. A simple Mutex like this should be enough to do request coalescing:

diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 8349e1993..d7bf96bfb 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -1421,6 +1421,13 @@ impl Tenant {
             timeline
         };
 
+        // Prevent two tasks from trying to delete the timeline at the same time
+        let mut delete_lock_guard = timeline.delete_lock.lock().await;
+        if *delete_lock_guard {
+            // already deleted by a concurrent task
+            return Ok(());
+        }
+
         // Now that the Timeline is in Stopping state, request all the related tasks to
         // shut down.
         //
@@ -1576,6 +1583,8 @@ impl Tenant {
         }
         drop(timelines);
 
+        *delete_lock_guard = true;
+
         Ok(())
     }
 
diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs
index c47f4444f..490cb6068 100644
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -226,6 +226,10 @@ pub struct Timeline {
 
     state: watch::Sender<TimelineState>,
 
+    /// Prevent two tasks from deleting the timeline at the same time. If 'true',
+    /// the timeline has already been deleted.
+    pub delete_lock: tokio::sync::Mutex<bool>,
+
     eviction_task_timeline_state: tokio::sync::Mutex<EvictionTaskTimelineState>,
 }
 
@@ -1421,6 +1425,7 @@ impl Timeline {
                 eviction_task_timeline_state: tokio::sync::Mutex::new(
                     EvictionTaskTimelineState::default(),
                 ),
+                delete_lock: tokio::sync::Mutex::new(false),
             };
             result.repartition_threshold = result.get_checkpoint_distance() / 10;
             result

Haven't tested this, just to give you the idea of what I mean.

@hlinnaka hlinnaka left a comment

@koivunej koivunej closed this May 25, 2023
hlinnaka added a commit that referenced this pull request May 27, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request dropped, because the
client disconnected, the timeline deletion stops half-way through the
operation. That needs to be fixed, too, but that's a separate story.

(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)