
storage controller: graceful handling of attempts to reconcile with offline nodes #6847

Closed
Tracked by #6342
jcsp opened this issue Feb 20, 2024 · 0 comments · Fixed by #7027
Assignees
Labels
c/storage/controller Component: Storage Controller t/feature Issue type: feature, for new features or requests


jcsp commented Feb 20, 2024

Currently, for tenant shards attached to a node whose availability state is set to Offline, we demote that attachment to a secondary in the IntentState and schedule another node to be attached.

That works, in that reconciliation kicks in and attaches the shard to the new node, but the Reconciler will also try to configure the offline node into its new role as a secondary, and fail.

A more elegant handling would be to:

  • In Reconciler, skip trying to talk to nodes marked offline, and generate a specific error variant indicating that reconciliation succeeded apart from the offline nodes.
  • In maybe_reconcile, recognize the case where all node locations are up to date apart from those on offline nodes, and avoid spawning a reconcile task in these cases.
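The two points above can be sketched roughly as follows. This is an illustrative sketch only, not the actual storage controller code: the type and function names (`ReconcileOutcome`, `reconcile`, `NodeAvailability`) are invented for this example.

```rust
// Sketch: the Reconciler skips nodes marked offline instead of
// calling them and failing, and reports a distinct outcome so that
// maybe_reconcile can treat "up to date apart from offline nodes"
// differently from a hard failure. All names here are hypothetical.

#[derive(Debug, PartialEq)]
enum NodeAvailability {
    Active,
    Offline,
}

#[derive(Debug, PartialEq)]
enum ReconcileOutcome {
    /// All intended locations were configured.
    Done,
    /// Everything succeeded except the listed nodes, which are
    /// marked offline; no API calls were issued to them.
    DoneExceptOffline(Vec<u64>),
}

struct Node {
    id: u64,
    availability: NodeAvailability,
}

fn reconcile(nodes: &[Node]) -> ReconcileOutcome {
    let mut skipped = Vec::new();
    for node in nodes {
        if node.availability == NodeAvailability::Offline {
            // Skip: don't try to talk to a node we know is down.
            skipped.push(node.id);
            continue;
        }
        // ... issue the location configuration API call here ...
    }
    if skipped.is_empty() {
        ReconcileOutcome::Done
    } else {
        ReconcileOutcome::DoneExceptOffline(skipped)
    }
}

fn main() {
    let nodes = vec![
        Node { id: 1, availability: NodeAvailability::Active },
        Node { id: 2, availability: NodeAvailability::Offline },
    ];
    assert_eq!(
        reconcile(&nodes),
        ReconcileOutcome::DoneExceptOffline(vec![2])
    );
}
```

With an outcome like `DoneExceptOffline`, maybe_reconcile can recognize that the only remaining dirty locations are on offline nodes and decline to spawn another reconcile task.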
@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/controller Component: Storage Controller labels Feb 20, 2024
@jcsp jcsp self-assigned this Mar 5, 2024
@jcsp jcsp closed this as completed in #7027 Mar 7, 2024
jcsp added a commit that referenced this issue Mar 7, 2024
## Problem


Closes: #6847
Closes: #7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler getting hung on a pageserver API hang, and prevents
reconcilers having to totally retry if one API call returns a retryable
error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we will cancel any API calls in progress to that node, and avoid
issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler.
- In re-attach, if we have no observed state object for a tenant, then
construct one with `conf: None` (which means "unknown"). Then, in
Reconciler, implement a TODO for scanning such locations before running,
so that we avoid spuriously incrementing a generation in the case of a
node that was offline while we started (this is the case that tripped
up #7006).
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token).
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of `.expect()` calls).

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
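The cancellation-token change described above can be sketched as follows. This is not the actual Neon code: the real controller uses a `CancellationToken`, while this dependency-free sketch stands in an `AtomicBool`, and the names are illustrative.

```rust
// Sketch: a Node owns its cancellation state. Marking the node
// offline cancels in-flight API calls and blocks new ones. Keeping
// the fields private guarantees availability can't change without
// the cancellation state being updated alongside it.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

struct Node {
    id: u64,
    // Stand-in for a CancellationToken in this sketch.
    cancel: Arc<AtomicBool>,
}

impl Node {
    fn new(id: u64) -> Self {
        Node { id, cancel: Arc::new(AtomicBool::new(false)) }
    }

    /// Setting the node offline flips the cancellation flag, so
    /// API calls in progress (which poll the same flag) bail out.
    fn set_offline(&self) {
        self.cancel.store(true, Ordering::SeqCst);
    }

    fn call_api(&self) -> Result<(), String> {
        if self.cancel.load(Ordering::SeqCst) {
            return Err(format!("node {}: offline, call cancelled", self.id));
        }
        // ... perform the timed-out, retried HTTP request here ...
        Ok(())
    }
}

fn main() {
    let node = Node::new(1);
    assert!(node.call_api().is_ok());
    node.set_offline();
    assert!(node.call_api().is_err());
}
```

The point of the encapsulation is the invariant: because callers can only change availability through a method like `set_offline`, there is no code path that marks a node offline while leaving its cancellation state stale.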