
test_basebackup_with_high_slru_count fails due to attachment service reconcile race #7006

Closed
VladLazar opened this issue Mar 4, 2024 · 2 comments · Fixed by #7027
Assignees
Labels
a/benchmark Area: related to benchmarking c/storage/pageserver Component: storage: pageserver

Comments

@VladLazar
Contributor

https://neon-github-public-dev.s3.amazonaws.com/reports/main/8116822732/index.html

There's a weird interaction between the pageserver and attachment service.

  1. Attachment service restarts
  2. Startup reconciliation fails on the attachment service because the pageserver is not yet able to reply to requests.
  3. Attachment service starts background reconciliations
  4. Pageserver restarts
  5. Pageserver calls re-attach, which causes all of its tenants' generations to be incremented
  6. Reconcilers started at step 3 fail because the pageserver refuses the old generation they provide
  7. Another round of reconciliations starts for the tenant shards that failed at step 6
  8. Reconciliation works this time, shutting down the timelines used by the benchmark

It feels like we shouldn't reconcile a node until we have successfully fetched the location state from that pageserver (a rough sketch of such a gate follows at the end of this comment).
@jcsp what do you think?
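
A minimal sketch of that gate, for illustration only: `NodeId`, `Node`, `TenantShard`, `observed_state_fetched`, and `should_spawn_reconciler` are hypothetical names, not the actual attachment service types. The idea is simply to hold off reconciling any shard whose attached pageserver we have not yet successfully queried for its location state.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u64);

struct Node {
    /// True only after a successful location-config listing from this
    /// pageserver since the attachment service (re)started.
    observed_state_fetched: bool,
}

struct TenantShard {
    attached_to: Option<NodeId>,
    dirty: bool,
}

/// Only reconcile a shard once we know what its pageserver actually holds;
/// otherwise we risk acting on stale generations (steps 5-7 above).
fn should_spawn_reconciler(shard: &TenantShard, nodes: &HashMap<NodeId, Node>) -> bool {
    if !shard.dirty {
        return false;
    }
    match shard.attached_to {
        Some(node_id) => nodes
            .get(&node_id)
            .map(|n| n.observed_state_fetched)
            .unwrap_or(false),
        // A detached shard has no pageserver to query; reconcile as usual.
        None => true,
    }
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert(NodeId(1), Node { observed_state_fetched: false });
    let shard = TenantShard { attached_to: Some(NodeId(1)), dirty: true };
    // Location state not fetched yet -> hold off on reconciling this shard.
    assert!(!should_spawn_reconciler(&shard, &nodes));
}
```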

@VladLazar VladLazar added a/benchmark Area: related to benchmarking c/storage/pageserver Component: storage: pageserver labels Mar 4, 2024
@VladLazar VladLazar changed the title test_basebackup_with_high_slru_count shuts down timeline with inflight base-backup requests test_basebackup_with_high_slru_count fails due to attachment service reconcile race Mar 5, 2024
@VladLazar
Contributor Author

Had a chat about this with John, who happens to have a WIP branch for this. Assigned to him.

VladLazar added a commit that referenced this issue Mar 5, 2024
@VladLazar
Contributor Author

Once this is fixed, let's re-enable test_basebackup_with_high_slru_count, which was disabled in #7025.

VladLazar added a commit that referenced this issue Mar 5, 2024
@jcsp jcsp closed this as completed in #7027 Mar 7, 2024
jcsp added a commit that referenced this issue Mar 7, 2024
## Problem


Closes: #6847
Closes: #7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler from hanging on a stuck pageserver API call, and avoids
restarting an entire reconciliation when a single API call returns a
retryable error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we cancel any API calls in progress to that node and avoid
issuing any further API calls to it (a rough sketch follows after this
list).
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler
- In re-attach, if we have no observed state object for a tenant then
construct one with conf: None (which means "unknown"). Then in
Reconciler, implement a TODO for scanning such locations before running,
so that we will avoid spuriously incrementing a generation in the case
of a node that was offline while we started (this is the case that
tripped up #7006)
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token).
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of .expect() calls)
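
A rough sketch of the retry/cancellation mechanics from the first two bullets, assuming the tokio and tokio-util crates; `with_client_retries`, `set_offline`, and the parameter names are illustrative rather than the exact pageserver API:

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

pub struct Node {
    // Fields are private so that marking a node offline always goes
    // through `set_offline`, which also fires the cancellation token.
    cancel: CancellationToken,
}

impl Node {
    pub fn new() -> Self {
        Self { cancel: CancellationToken::new() }
    }

    /// Marking a node offline cancels in-flight API calls to it and
    /// prevents new ones from being issued.
    pub fn set_offline(&self) {
        self.cancel.cancel();
    }

    /// Wrap a single pageserver API call in timeout + retry, aborting
    /// early if the node is marked offline. Returns None if the node
    /// went offline or all attempts failed.
    pub async fn with_client_retries<T, E, F, Fut>(
        &self,
        mut call: F,
        per_call_timeout: Duration,
        max_attempts: u32,
    ) -> Option<T>
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, E>>,
    {
        for _ in 0..max_attempts {
            if self.cancel.is_cancelled() {
                return None; // node is offline; stop calling it
            }
            tokio::select! {
                _ = self.cancel.cancelled() => return None,
                res = tokio::time::timeout(per_call_timeout, call()) => {
                    match res {
                        Ok(Ok(value)) => return Some(value),
                        // Timeout or a retryable error (e.g. 503): retry this
                        // one call rather than the whole reconciliation.
                        Ok(Err(_)) | Err(_) => continue,
                    }
                }
            }
        }
        None
    }
}
```

Keeping the fields private means availability can only change through a method that also fires the cancellation token, which is what makes the "marking a node offline cancels its in-flight calls" guarantee hold.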

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
VladLazar added a commit that referenced this issue Mar 14, 2024