
test_basebackup_with_high_slru_count fails due to attachment service reconcile race #7006

Closed
VladLazar opened this issue Mar 4, 2024 · 2 comments · Fixed by #7027
Assignees
Labels
a/benchmark Area: related to benchmarking c/storage/pageserver Component: storage: pageserver

Comments

@VladLazar
Contributor

https://neon-github-public-dev.s3.amazonaws.com/reports/main/8116822732/index.html

There's a weird interaction between the pageserver and attachment service.

  1. Attachment service restarts
  2. Startup reconciliation fails on the attachment service because the pageserver is not yet able to reply to requests.
  3. Attachment service starts background reconciliations
  4. Pageserver restarts
  5. Pageserver calls re-attach, which causes all of its tenants' generations to be incremented
  6. Reconcilers started at step 3 fail because the pageserver refuses the old generation they provide
  7. Another round of reconciliations starts for the tenant shards that failed at step 6
  8. Reconciliation works this time, shutting down the timelines used by the benchmark

It feels like we shouldn't reconcile a node until we have successfully fetched the location state from that pageserver (a rough sketch of such a gate follows at the end of this comment).
@jcsp what do you think?
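
A minimal sketch of that gate, for illustration only: `NodeId`, `Node`, `TenantShard`, `observed_state_fetched`, and `should_spawn_reconciler` are hypothetical names, not the actual attachment service types. The idea is simply to hold off reconciling any shard whose attached pageserver we have not yet successfully queried for its location state.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u64);

struct Node {
    /// True only after a successful location-config listing from this
    /// pageserver since the attachment service (re)started.
    observed_state_fetched: bool,
}

struct TenantShard {
    attached_to: Option<NodeId>,
    dirty: bool,
}

/// Only reconcile a shard once we know what its pageserver actually holds;
/// otherwise we risk acting on stale generations (steps 5-7 above).
fn should_spawn_reconciler(shard: &TenantShard, nodes: &HashMap<NodeId, Node>) -> bool {
    if !shard.dirty {
        return false;
    }
    match shard.attached_to {
        Some(node_id) => nodes
            .get(&node_id)
            .map(|n| n.observed_state_fetched)
            .unwrap_or(false),
        // A detached shard has no pageserver to query; reconcile as usual.
        None => true,
    }
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert(NodeId(1), Node { observed_state_fetched: false });
    let shard = TenantShard { attached_to: Some(NodeId(1)), dirty: true };
    // Location state not fetched yet -> hold off on reconciling this shard.
    assert!(!should_spawn_reconciler(&shard, &nodes));
}
```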

@VladLazar VladLazar added a/benchmark Area: related to benchmarking c/storage/pageserver Component: storage: pageserver labels Mar 4, 2024
@VladLazar VladLazar changed the title test_basebackup_with_high_slru_count shuts down timeline with inflight base-backup requests test_basebackup_with_high_slru_count fails due to attachment service reconcile race Mar 5, 2024
@VladLazar
Contributor Author

Had a chat about this with John, who happens to have a WIP branch for this. Assigned to him.

VladLazar added a commit that referenced this issue Mar 5, 2024
@VladLazar
Contributor Author

Once this is fixed, let's re-enable test_basebackup_with_high_slru_count, which was disabled in #7025.

VladLazar added a commit that referenced this issue Mar 5, 2024
@jcsp jcsp closed this as completed in #7027 Mar 7, 2024
jcsp added a commit that referenced this issue Mar 7, 2024
## Problem


Closes: #6847
Closes: #7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler from hanging on a stuck pageserver API call, and avoids
restarting an entire reconciliation when a single API call returns a
retryable error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we cancel any API calls in progress to that node and avoid
issuing any further API calls to it (a rough sketch follows after this
list).
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler
- In re-attach, if we have no observed state object for a tenant then
construct one with conf: None (which means "unknown"). Then in
Reconciler, implement a TODO for scanning such locations before running,
so that we will avoid spuriously incrementing a generation in the case
of a node that was offline while we started (this is the case that
tripped up #7006)
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token).
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of .expect() calls)
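
A rough sketch of the retry/cancellation mechanics from the first two bullets, assuming the tokio and tokio-util crates; `with_client_retries`, `set_offline`, and the parameter names are illustrative rather than the exact pageserver API:

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

pub struct Node {
    // Fields are private so that marking a node offline always goes
    // through `set_offline`, which also fires the cancellation token.
    cancel: CancellationToken,
}

impl Node {
    pub fn new() -> Self {
        Self { cancel: CancellationToken::new() }
    }

    /// Marking a node offline cancels in-flight API calls to it and
    /// prevents new ones from being issued.
    pub fn set_offline(&self) {
        self.cancel.cancel();
    }

    /// Wrap a single pageserver API call in timeout + retry, aborting
    /// early if the node is marked offline. Returns None if the node
    /// went offline or all attempts failed.
    pub async fn with_client_retries<T, E, F, Fut>(
        &self,
        mut call: F,
        per_call_timeout: Duration,
        max_attempts: u32,
    ) -> Option<T>
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, E>>,
    {
        for _ in 0..max_attempts {
            if self.cancel.is_cancelled() {
                return None; // node is offline; stop calling it
            }
            tokio::select! {
                _ = self.cancel.cancelled() => return None,
                res = tokio::time::timeout(per_call_timeout, call()) => {
                    match res {
                        Ok(Ok(value)) => return Some(value),
                        // Timeout or a retryable error (e.g. 503): retry this
                        // one call rather than the whole reconciliation.
                        Ok(Err(_)) | Err(_) => continue,
                    }
                }
            }
        }
        None
    }
}
```

Keeping the fields private means availability can only change through a method that also fires the cancellation token, which is what makes the "marking a node offline cancels its in-flight calls" guarantee hold.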

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
VladLazar added a commit that referenced this issue Mar 14, 2024