
storage controller: graceful handling of attempts to reconcile with offline nodes #6847

Closed
Tracked by #6342
jcsp opened this issue Feb 20, 2024 · 0 comments · Fixed by #7027
Assignees
Labels
c/storage/controller Component: Storage Controller t/feature Issue type: feature, for new features or requests


jcsp commented Feb 20, 2024

Currently, for tenant shards attached to a node whose availability state is set to Offline, we demote that attachment to a secondary in the IntentState and schedule another node to be attached.

That works, in that reconciliation kicks in and attaches the shard to the new node, but the Reconciler will also try to configure the offline node into its new role as a secondary, and fail.

A more elegant handling would be to:

  • In Reconciler, skip trying to talk to nodes marked offline, and generate a specific error variant indicating that reconciliation succeeded apart from the offline nodes.
  • In maybe_reconcile, recognize the case where all node locations are up to date apart from those on offline nodes, and avoid spawning a reconcile task in these cases.
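The two points above can be sketched roughly as follows. This is an illustrative sketch only, not the actual storage controller code: the type and function names (`ReconcileOutcome`, `reconcile`, `NodeAvailability`) are invented for this example.

```rust
// Sketch: the Reconciler skips nodes marked offline instead of
// calling them and failing, and reports a distinct outcome so that
// maybe_reconcile can treat "up to date apart from offline nodes"
// differently from a hard failure. All names here are hypothetical.

#[derive(Debug, PartialEq)]
enum NodeAvailability {
    Active,
    Offline,
}

#[derive(Debug, PartialEq)]
enum ReconcileOutcome {
    /// All intended locations were configured.
    Done,
    /// Everything succeeded except the listed nodes, which are
    /// marked offline; no API calls were issued to them.
    DoneExceptOffline(Vec<u64>),
}

struct Node {
    id: u64,
    availability: NodeAvailability,
}

fn reconcile(nodes: &[Node]) -> ReconcileOutcome {
    let mut skipped = Vec::new();
    for node in nodes {
        if node.availability == NodeAvailability::Offline {
            // Skip: don't try to talk to a node we know is down.
            skipped.push(node.id);
            continue;
        }
        // ... issue the location configuration API call here ...
    }
    if skipped.is_empty() {
        ReconcileOutcome::Done
    } else {
        ReconcileOutcome::DoneExceptOffline(skipped)
    }
}

fn main() {
    let nodes = vec![
        Node { id: 1, availability: NodeAvailability::Active },
        Node { id: 2, availability: NodeAvailability::Offline },
    ];
    assert_eq!(
        reconcile(&nodes),
        ReconcileOutcome::DoneExceptOffline(vec![2])
    );
}
```

With an outcome like `DoneExceptOffline`, maybe_reconcile can recognize that the only remaining dirty locations are on offline nodes and decline to spawn another reconcile task.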
@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/controller Component: Storage Controller labels Feb 20, 2024
@jcsp jcsp self-assigned this Mar 5, 2024
@jcsp jcsp closed this as completed in #7027 Mar 7, 2024
jcsp added a commit that referenced this issue Mar 7, 2024
## Problem


Closes: #6847
Closes: #7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler getting hung on a pageserver API hang, and prevents
reconcilers having to totally retry if one API call returns a retryable
error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we will cancel any API calls in progress to that node, and avoid
issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler.
- In re-attach, if we have no observed state object for a tenant, then
construct one with `conf: None` (which means "unknown"). Then, in
Reconciler, implement a TODO for scanning such locations before running,
so that we avoid spuriously incrementing a generation in the case of a
node that was offline while we started (this is the case that tripped
up #7006).
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token).
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of `.expect()` calls).

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
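The cancellation-token change described above can be sketched as follows. This is not the actual Neon code: the real controller uses a `CancellationToken`, while this dependency-free sketch stands in an `AtomicBool`, and the names are illustrative.

```rust
// Sketch: a Node owns its cancellation state. Marking the node
// offline cancels in-flight API calls and blocks new ones. Keeping
// the fields private guarantees availability can't change without
// the cancellation state being updated alongside it.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

struct Node {
    id: u64,
    // Stand-in for a CancellationToken in this sketch.
    cancel: Arc<AtomicBool>,
}

impl Node {
    fn new(id: u64) -> Self {
        Node { id, cancel: Arc::new(AtomicBool::new(false)) }
    }

    /// Setting the node offline flips the cancellation flag, so
    /// API calls in progress (which poll the same flag) bail out.
    fn set_offline(&self) {
        self.cancel.store(true, Ordering::SeqCst);
    }

    fn call_api(&self) -> Result<(), String> {
        if self.cancel.load(Ordering::SeqCst) {
            return Err(format!("node {}: offline, call cancelled", self.id));
        }
        // ... perform the timed-out, retried HTTP request here ...
        Ok(())
    }
}

fn main() {
    let node = Node::new(1);
    assert!(node.call_api().is_ok());
    node.set_offline();
    assert!(node.call_api().is_err());
}
```

The point of the encapsulation is the invariant: because callers can only change availability through a method like `set_offline`, there is no code path that marks a node offline while leaving its cancellation state stale.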