test_basebackup_with_high_slru_count fails due to attachment service reconcile race #7006
Title changed from "test_basebackup_with_high_slru_count shuts down timeline with inflight base-backup requests" to "test_basebackup_with_high_slru_count fails due to attachment service reconcile race"
Had a chat about this with John, who happens to have a WIP branch for it. Assigned to him.
Once fixed, let's re-enable
jcsp added a commit that referenced this issue on Mar 7, 2024
## Problem

Closes: #6847
Closes: #7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents a reconciler getting hung on a pageserver API hang, and prevents reconcilers having to totally retry if one API call returns a retryable error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node offline we will cancel any API calls in progress to that node, and avoid issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't spawn a reconciler.
- In re-attach, if we have no observed state object for a tenant then construct one with `conf: None` (which means "unknown"). Then in `Reconciler`, implement a TODO for scanning such locations before running, so that we will avoid spuriously incrementing a generation in the case of a node that was offline while we started (this is the case that tripped up #7006).
- Refactoring: make `Node` contents private (and thereby guarantee that updates to availability mode reliably update the cancellation token).
- Refactoring: don't pass the whole map of nodes into `Reconciler` (and thereby remove a bunch of `.expect()` calls).

Some of this was discovered/tested with a new failure injection test that will come in a separate PR, once it is stable enough for CI.
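
To illustrate the first two bullets, here is a minimal synchronous sketch of retrying a call under a deadline while honouring a per-node cancellation flag. It is not the code from the commit: `CancelFlag`, `ApiError`, `CallOutcome` and `call_with_retry` are hypothetical names invented for this example, and it uses only the standard library. The deadline is checked between attempts, so unlike a real async timeout it cannot interrupt an attempt that is itself hung.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the cancellation token attached to a node.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    /// Called when the node is marked offline.
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical error classification: status codes are carried for illustration.
#[derive(Debug, PartialEq)]
pub enum ApiError {
    Retryable(u16), // e.g. 503
    Fatal(u16),
}

#[derive(Debug, PartialEq)]
pub enum CallOutcome<T> {
    Ok(T),
    Cancelled,
    TimedOut,
    Failed(ApiError),
}

/// Retry `op` until it succeeds, fails fatally, the node is cancelled,
/// or the overall deadline has passed.
pub fn call_with_retry<T>(
    cancel: &CancelFlag,
    deadline: Duration,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, ApiError>,
) -> CallOutcome<T> {
    let start = Instant::now();
    loop {
        // Never issue a call to a node that has been marked offline.
        if cancel.is_cancelled() {
            return CallOutcome::Cancelled;
        }
        if start.elapsed() >= deadline {
            return CallOutcome::TimedOut;
        }
        match op() {
            Ok(v) => return CallOutcome::Ok(v),
            // Retry in place instead of restarting the whole reconcile.
            Err(ApiError::Retryable(_)) => thread::sleep(backoff),
            Err(e) => return CallOutcome::Failed(e),
        }
    }
}
```

Sharing one flag per node means that marking the node offline stops every retry loop targeting it at its next check, which is the property the `Node` refactoring above is meant to guarantee.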
https://neon-github-public-dev.s3.amazonaws.com/reports/main/8116822732/index.html
There's a weird interaction between the pageserver and attachment service.
It feels like we shouldn't reconcile a node until we have successfully fetched the location state from that pageserver.
@jcsp what do you think?
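
The idea of holding off reconciliation until a pageserver's location state is known can be sketched roughly as follows. This is an illustration only, not the attachment service's actual code: `NodeId`, `LocationConf`, `ObservedLocation`, `Decision` and `decide` are hypothetical names, with `conf: None` standing for "state unknown".

```rust
use std::collections::HashMap;

/// Hypothetical node identifier.
pub type NodeId = u64;

/// Hypothetical location configuration; only the generation is modelled.
#[derive(Debug, Clone, PartialEq)]
pub struct LocationConf {
    pub generation: u32,
}

/// What we believe a node holds; `conf: None` means "unknown".
#[derive(Debug, Clone, PartialEq)]
pub struct ObservedLocation {
    pub conf: Option<LocationConf>,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Nothing useful can be done right now.
    Skip,
    /// Query these nodes for their real state before reconciling.
    ScanFirst(Vec<NodeId>),
    /// Observed state is known for all online nodes; safe to proceed.
    Reconcile,
}

pub fn decide(
    observed: &HashMap<NodeId, ObservedLocation>,
    dirty: &[NodeId],
    is_online: impl Fn(NodeId) -> bool,
) -> Decision {
    // If every dirty location is on an offline node, don't start a reconcile.
    // An empty `dirty` list also lands here, since `all` is true when empty.
    if dirty.iter().all(|n| !is_online(*n)) {
        return Decision::Skip;
    }
    // Online nodes whose state is unknown must be scanned first, so that a
    // generation is not incremented on the basis of a guess.
    let mut unknown: Vec<NodeId> = observed
        .iter()
        .filter(|(n, loc)| loc.conf.is_none() && is_online(**n))
        .map(|(n, _)| *n)
        .collect();
    unknown.sort_unstable();
    if unknown.is_empty() {
        Decision::Reconcile
    } else {
        Decision::ScanFirst(unknown)
    }
}
```

Returning a decision rather than acting directly keeps the check separate from any API calls, so it can be exercised without a running pageserver.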