Skip to content

Commit

Permalink
tests: improve stability of test_deletion_queue_recovery (#7325)
Browse files Browse the repository at this point in the history
## Problem

As #6092 points out, this
test was (ab)using a failpoint!() with 'pause', which was occasionally
causing index uploads to get hung on a stuck executor thread, resulting
in timeouts waiting for remote_consistent_lsn.

That is one of several failure modes, but by far the most frequent.

## Summary of changes

- Replace the failpoint! with a `sleep_millis_async`, which is not only
async but also supports clean shutdown.
- Improve debugging: log the consistent LSN when scheduling an index
upload
- Tidy: remove an unnecessary checkpoint in the test code, where
last_flush_lsn_upload had just been called (this does a checkpoint
internally)
  • Loading branch information
jcsp committed Apr 5, 2024
1 parent ec01292 commit 534c099
Show file tree
Hide file tree
Showing 3 changed files with 10 additions and 9 deletions.
7 changes: 5 additions & 2 deletions pageserver/src/control_plane_client.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ use pageserver_api::{
use serde::{de::DeserializeOwned, Serialize};
use tokio_util::sync::CancellationToken;
use url::Url;
use utils::{backoff, generation::Generation, id::NodeId};
use utils::{backoff, failpoint_support, generation::Generation, id::NodeId};

use crate::{
config::{NodeMetadata, PageServerConf},
Expand Down Expand Up @@ -210,7 +210,10 @@ impl ControlPlaneGenerationsApi for ControlPlaneClient {
.collect(),
};

fail::fail_point!("control-plane-client-validate");
failpoint_support::sleep_millis_async!("control-plane-client-validate-sleep", &self.cancel);
if self.cancel.is_cancelled() {
return Err(RetryForeverError::ShuttingDown);
}

let response: ValidateResponse = self.retry_http_forever(&re_attach_path, request).await?;

Expand Down
6 changes: 3 additions & 3 deletions pageserver/src/tenant/remote_timeline_client.rs
Original file line number Diff line number Diff line change
Expand Up @@ -593,14 +593,14 @@ impl RemoteTimelineClient {
upload_queue: &mut UploadQueueInitialized,
metadata: TimelineMetadata,
) {
let disk_consistent_lsn = upload_queue.latest_metadata.disk_consistent_lsn();

info!(
"scheduling metadata upload with {} files ({} changed)",
"scheduling metadata upload up to consistent LSN {disk_consistent_lsn} with {} files ({} changed)",
upload_queue.latest_files.len(),
upload_queue.latest_files_changes_since_metadata_upload_scheduled,
);

let disk_consistent_lsn = upload_queue.latest_metadata.disk_consistent_lsn();

let index_part = IndexPart::new(
upload_queue.latest_files.clone(),
disk_consistent_lsn,
Expand Down
6 changes: 2 additions & 4 deletions test_runner/regress/test_pageserver_generations.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,7 +111,6 @@ def churn(data):
last_flush_lsn_upload(
env, endpoint, tenant_id, timeline_id, pageserver_id=pageserver.id
)
ps_http.timeline_checkpoint(tenant_id, timeline_id)

# Compaction should generate some GC-elegible layers
for i in range(0, 2):
Expand Down Expand Up @@ -385,9 +384,8 @@ def test_deletion_queue_recovery(
if validate_before == ValidateBefore.NO_VALIDATE:
failpoints.append(
# Prevent deletion lists from being validated, we will test that they are
# dropped properly during recovery. 'pause' is okay here because we kill
# the pageserver with immediate=true
("control-plane-client-validate", "pause")
# dropped properly during recovery. This is such a long sleep as to be equivalent to "never"
("control-plane-client-validate", "return(3600000)")
)

ps_http.configure_failpoints(failpoints)
Expand Down

1 comment on commit 534c099

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2834 tests run: 2697 passed, 0 failed, 137 skipped (full report)


Code coverage* (full report)

  • functions: 28.0% (6404 of 22855 functions)
  • lines: 46.8% (45074 of 96211 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
534c099 at 2024-04-05T18:10:58.259Z :recycle:

Please sign in to comment.