tests: improve stability of test_deletion_queue_recovery #7325

Merged
merged 3 commits into from
Apr 5, 2024

Conversation

jcsp
Contributor

@jcsp jcsp commented Apr 5, 2024

Problem

As #6092 points out, this test was (ab)using a failpoint!() with 'pause', which was occasionally causing index uploads to get hung on a stuck executor thread, resulting in timeouts waiting for remote_consistent_lsn.

That is one of several failure modes, but by far the most frequent.

Summary of changes

  • Replace the failpoint! with a sleep_millis_async, which is not only async but also supports clean shutdown.
  • Improve debugging: log the consistent LSN when scheduling an index upload.
  • Tidy: remove an unnecessary checkpoint in the test code that came right after a call to last_flush_lsn_upload, which already does a checkpoint internally.
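
To illustrate the first bullet, here is a minimal sketch of a sleep that can be interrupted for clean shutdown. It is a std-only, thread-based analogue and not the pageserver's sleep_millis_async, whose real signature is not shown in this PR; the names sleep_or_cancel, cancel_channel and SleepOutcome are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

/// Outcome of a cancellable sleep (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum SleepOutcome {
    /// The full duration elapsed without a shutdown request.
    Elapsed,
    /// Shutdown was requested before the duration elapsed.
    Cancelled,
}

/// Create a (cancel handle, cancel receiver) pair.
pub fn cancel_channel() -> (Sender<()>, Receiver<()>) {
    mpsc::channel()
}

/// Sleep for `duration`, but wake early if the shutdown side sends a
/// message or drops its sender. Unlike a `pause` action, the caller
/// always gets control back when shutdown is requested.
pub fn sleep_or_cancel(duration: Duration, cancel: &Receiver<()>) -> SleepOutcome {
    match cancel.recv_timeout(duration) {
        // Nothing arrived in time: the sleep ran to completion.
        Err(RecvTimeoutError::Timeout) => SleepOutcome::Elapsed,
        // An explicit signal or a dropped sender both count as cancellation.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => SleepOutcome::Cancelled,
    }
}
```

A test can pass a very long duration to sleep_or_cancel and still shut down promptly by sending on (or dropping) the cancel handle, which is the property the plain 'pause' action lacks.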

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/test Area: related to testing a/tech_debt Area: related to tech debt labels Apr 5, 2024
@jcsp jcsp requested a review from a team as a code owner April 5, 2024 11:34
@jcsp jcsp requested a review from problame April 5, 2024 11:34

github-actions bot commented Apr 5, 2024

2754 tests run: 2630 passed, 0 failed, 124 skipped (full report)


Flaky tests (3)

Postgres 15

  • test_ts_of_lsn_api: debug

Postgres 14

  • test_deletion_queue_recovery[no-validate-keep]: release, debug

Code coverage* (full report)

  • functions: 28.0% (6396 of 22872 functions)
  • lines: 46.8% (45010 of 96138 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cdaf38e at 2024-04-05T12:24:44.075Z :recycle:

Contributor

@problame problame left a comment


Note there's also pausable_failpoint!, which lets you use the indefinite sleep action instead of a very high number like you did in this PR.

It is quite heavy though because it does a spawn_blocking.

};
/// Declare a failpoint that can use the `pause` failpoint action.
/// We don't want to block the executor thread, hence, spawn_blocking + await.
macro_rules! pausable_failpoint {
    ($name:literal) => {
        if cfg!(feature = "testing") {
            tokio::task::spawn_blocking({
                let current = tracing::Span::current();
                move || {
                    let _entered = current.entered();
                    tracing::info!("at failpoint {}", $name);
                    fail::fail_point!($name);
                }
            })
            .await
            .expect("spawn_blocking");
        }
    };
    ($name:literal, $cond:expr) => {
        if cfg!(feature = "testing") {
            if $cond {
                pausable_failpoint!($name)
            }
        }
    };
}
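
The idea behind the macro above (run the blocking failpoint somewhere else, then wait for it without parking the current worker) can be sketched with plain std threads. This is only an analogue of spawn_blocking + await, not tokio itself; offload and work_while_paused are hypothetical names, and a thread::sleep stands in for the paused failpoint.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `blocking` on a dedicated thread and return a receiver for its
/// result, so the caller's own thread is never parked inside `blocking`.
pub fn offload<T, F>(blocking: F) -> mpsc::Receiver<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the caller may have stopped waiting.
        let _ = tx.send(blocking());
    });
    rx
}

/// Simulate a worker that hits a paused failpoint but still has other
/// work to do. Returns how many units of other work were completed
/// while the pause was in effect.
pub fn work_while_paused(pause: Duration) -> u32 {
    // The sleep stands in for a failpoint configured with a blocking action.
    let done = offload(move || thread::sleep(pause));
    let mut ticks = 0;
    loop {
        match done.try_recv() {
            // Pause still in effect: the worker keeps making progress.
            Err(mpsc::TryRecvError::Empty) => {
                ticks += 1;
                thread::sleep(Duration::from_millis(1));
            }
            // Pause finished (or the helper thread went away): stop.
            _ => break,
        }
    }
    ticks
}
```

The cost noted in the review still applies to this sketch: every pass through the failpoint pays for handing work to another thread, even when no failpoint action is configured.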

Contributor

@problame problame left a comment


Hm, for this specific use case, I think you want to exercise the cancellation path?

It might make sense to have a failpoint_support::sleep_indefinitely_until_cancel, which does a 100-year sleep and waits for cancellation internally.
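
A rough sketch of what such a helper could look like, under stated assumptions: the function does not exist in this PR, CancelToken here is a hypothetical std-only stand-in for an async cancellation token, and the sketch drops the 100-year timer and simply waits on the cancel signal with no deadline.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical, std-only cancellation token (illustration only).
#[derive(Clone, Default)]
pub struct CancelToken {
    inner: Arc<(Mutex<bool>, Condvar)>,
}

impl CancelToken {
    pub fn new() -> Self {
        Self::default()
    }

    /// Request cancellation and wake every waiter.
    pub fn cancel(&self) {
        let (flag, cv) = &*self.inner;
        *flag.lock().unwrap() = true;
        cv.notify_all();
    }

    /// Report whether cancellation has been requested.
    pub fn is_cancelled(&self) -> bool {
        *self.inner.0.lock().unwrap()
    }

    /// Block with no deadline until `cancel` is called. The predicate
    /// loop guards against spurious condition-variable wakeups.
    pub fn sleep_indefinitely_until_cancel(&self) {
        let (flag, cv) = &*self.inner;
        let guard = flag.lock().unwrap();
        let _guard = cv.wait_while(guard, |cancelled| !*cancelled).unwrap();
    }
}
```

With a helper of this shape, the only way out of the sleep is the cancellation path, so a test that uses it exercises shutdown handling by construction.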

@jcsp
Contributor Author

jcsp commented Apr 5, 2024

> Note there's also pausable_failpoint!, which lets you use the indefinite sleep action instead of a very high number like you did in this PR.

Right, at some point we should pull that out into the failpoint_support location so that it's usable from more places (and make it support cancellation tokens like sleep_millis_async does).

@jcsp jcsp merged commit 534c099 into main Apr 5, 2024
56 checks passed
@jcsp jcsp deleted the jcsp/issue-6092-tests-deletion-queue-recovery branch April 5, 2024 17:01