Nexus needs reconciliation process for Sled failure / removal

Nexus currently stores information about Sled IDs within CRDB, including:

- Instances belonging to a particular Sled
- Zpools, datasets, and ([pending](https://github.com/oxidecomputer/omicron/pull/511)) regions belonging to a particular Sled
- (soon) Versioning and Inventory information for a sled.

If a sled irrecoverably fails - not just reboots, but is either destroyed or unplugged - Nexus needs to have a process for recovering all data which has been detached from the Sled.

This will likely involve:
- Identifying that all instances previously running on the Sled have faulted, if they do not have backups
- Migration of regions to new locations, if available
- Cleaning sled-specific information from the database
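
To make the three steps above concrete, here is a minimal sketch of what one reconciliation pass might look like. Everything in it is hypothetical: `Db` is an in-memory stand-in for the CRDB tables, and `Instance`, `Report`, and `reconcile_removed_sled` are illustrative names, not existing Nexus types or APIs. It also assumes the caller has already decided that the sled is permanently gone.

```rust
use std::collections::BTreeMap;

// Hypothetical identifier type; real sled IDs in Nexus are not u32.
type SledId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Faulted,
}

#[derive(Debug)]
struct Instance {
    sled: SledId,
    has_backup: bool,
    state: InstanceState,
}

// Hypothetical in-memory stand-in for the sled-related records kept in CRDB.
#[derive(Debug, Default)]
struct Db {
    sleds: Vec<SledId>,
    instances: BTreeMap<String, Instance>,
    regions: BTreeMap<String, SledId>,
}

// Summary of what one reconciliation pass did.
#[derive(Debug, Default, PartialEq, Eq)]
struct Report {
    faulted: Vec<String>,
    migrated: Vec<String>,
    stranded: Vec<String>,
}

fn reconcile_removed_sled(db: &mut Db, failed: SledId, target: Option<SledId>) -> Report {
    let mut report = Report::default();

    // 1. Instances on the failed sled that have no backup are marked faulted.
    //    Instances with backups are left untouched here; restoring them is
    //    out of scope for this sketch.
    for (name, inst) in db.instances.iter_mut() {
        if inst.sled == failed && !inst.has_backup {
            inst.state = InstanceState::Faulted;
            report.faulted.push(name.clone());
        }
    }

    // 2. Regions move to the target sled if one is available; otherwise they
    //    are reported as stranded and keep pointing at the failed sled.
    for (name, sled) in db.regions.iter_mut() {
        if *sled != failed {
            continue;
        }
        match target {
            Some(t) => {
                *sled = t;
                report.migrated.push(name.clone());
            }
            None => report.stranded.push(name.clone()),
        }
    }

    // 3. Remove the sled's own record from the database.
    db.sleds.retain(|s| *s != failed);

    report
}
```

Returning a `Report` rather than acting silently is one possible design choice: it lets the caller surface faulted instances and stranded regions to an operator.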

Additionally, it is important to note that "unplugging a sled" is unpredictable, and may occur at any time. There is a separate, critical question of distinguishing between a **temporary** vs **permanent** failure (we would want to treat a crash + reboot much differently from a long-term sled removal). However, while we're still making that decision, we need to have a policy for dealing with ongoing operations to that sled.

Arguably, the "deletion" operations are particularly nasty in this time period: we don't know if we can discard those requests ("the sled is gone, so we just need to clean the DB"), store them for later ("the sled is offline now, but if it comes back, we'll tell it to perform the deletion"), or fail those requests ("deletion is not possible while this sled is offline!").
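
The three options for deletion requests could be modeled as an explicit policy, sketched below. All names here (`OfflineDeletePolicy`, `DeleteOutcome`, `PendingDeletes`, `handle_delete_while_offline`) are hypothetical and exist only for illustration; the sketch does not pick which policy is correct, it just shows how each would behave.

```rust
// Hypothetical identifier type; real sled IDs in Nexus are not u32.
type SledId = u32;

// The three candidate policies for a delete request aimed at an offline sled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OfflineDeletePolicy {
    // "The sled is gone, so we just need to clean the DB."
    Discard,
    // "If it comes back, we'll tell it to perform the deletion."
    Defer,
    // "Deletion is not possible while this sled is offline!"
    Reject,
}

#[derive(Debug, PartialEq, Eq)]
enum DeleteOutcome {
    CleanedDbOnly,
    Queued,
    Rejected(String),
}

// Deletions waiting for a sled to come back (used only by `Defer`).
#[derive(Debug, Default)]
struct PendingDeletes {
    queue: Vec<(SledId, String)>,
}

fn handle_delete_while_offline(
    policy: OfflineDeletePolicy,
    sled: SledId,
    resource: &str,
    pending: &mut PendingDeletes,
) -> DeleteOutcome {
    match policy {
        // Nothing is sent to the sled; the caller only removes DB records.
        OfflineDeletePolicy::Discard => DeleteOutcome::CleanedDbOnly,
        // Remember the request so it can be replayed if the sled returns.
        OfflineDeletePolicy::Defer => {
            pending.queue.push((sled, resource.to_string()));
            DeleteOutcome::Queued
        }
        // Surface an error to the requester and change nothing.
        OfflineDeletePolicy::Reject => DeleteOutcome::Rejected(format!(
            "deletion of {} is not possible while sled {} is offline",
            resource, sled
        )),
    }
}
```

Only `Defer` carries state forward, so it is also the only option that needs a cleanup rule if the sled is later declared permanently removed.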

Nexus needs reconciliation process for Sled failure / removal #612
