Nexus needs reconciliation process for Sled failure / removal #612

@smklein

Nexus currently stores information about Sled IDs within CRDB, including:

  • Instances belonging to a particular Sled
  • Zpools, datasets, and (pending) regions belonging to a particular Sled
  • (soon) Versioning and Inventory information for a sled.

If a sled irrecoverably fails - not just reboots, but is either destroyed or unplugged - Nexus needs a process for recovering all data that was tied to that Sled.

This will likely involve:

  • Marking all instances previously running on the Sled as faulted, if they do not have backups
  • Migration of regions to new locations, if available
  • Cleaning sled-specific information from the database
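
To make the steps above concrete, here is a rough sketch of how a reconciliation pass might turn them into an ordered plan. Everything in it is hypothetical and for discussion only: `SledRecords`, `Action`, and `plan_sled_removal` are not existing Nexus types, the round-robin placement stands in for real region allocation, and holding back the database cleanup while any region is unplaced is an assumption, not a decided policy.

```rust
/// Hypothetical, simplified view of what the database records for one sled.
#[derive(Debug, Clone, Default)]
struct SledRecords {
    /// (instance name, whether a backup exists)
    instances: Vec<(String, bool)>,
    /// Names of regions stored on this sled's datasets.
    regions: Vec<String>,
}

/// Hypothetical actions a reconciliation pass could emit.
#[derive(Debug, PartialEq)]
enum Action {
    MarkInstanceFaulted { instance: String },
    MigrateRegion { region: String, to_sled: String },
    RegionUnplaced { region: String },
    DeleteSledRecords { sled: String },
}

/// Build an ordered plan for a sled that has been declared permanently gone.
fn plan_sled_removal(sled: &str, records: &SledRecords, spare_sleds: &[&str]) -> Vec<Action> {
    let mut plan = Vec::new();

    // 1. Instances without backups are marked as faulted. What happens to
    //    instances that do have backups is left out of this sketch.
    for (instance, has_backup) in &records.instances {
        if !*has_backup {
            plan.push(Action::MarkInstanceFaulted { instance: instance.clone() });
        }
    }

    // 2. Regions move to a new location if one is available
    //    (naive round-robin over the spare sleds).
    let mut all_placed = true;
    for (i, region) in records.regions.iter().enumerate() {
        if spare_sleds.is_empty() {
            all_placed = false;
            plan.push(Action::RegionUnplaced { region: region.clone() });
        } else {
            let target = spare_sleds[i % spare_sleds.len()];
            plan.push(Action::MigrateRegion {
                region: region.clone(),
                to_sled: target.to_string(),
            });
        }
    }

    // 3. Sled-specific rows are cleaned last, and (by assumption) only once
    //    nothing still depends on them.
    if all_placed {
        plan.push(Action::DeleteSledRecords { sled: sled.to_string() });
    }
    plan
}
```

Calling `plan_sled_removal("sled-a", &records, &["sled-b"])` returns the actions in the order listed above; with no spare sleds, the plan reports the unplaced regions and skips the cleanup step.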

Additionally, it is important to note that "unplugging a sled" is unpredictable, and may occur at any time. There is a separate, critical question of distinguishing between a temporary and a permanent failure (we would want to treat a crash + reboot much differently from a long-term sled removal). However, in the period of time while we're still making that decision, we need to have a policy for dealing with ongoing operations to that sled.

Arguably, the "deletion" operations are particularly nasty in this time period - we don't know if we can discard those requests ("the sled is gone, so we just need to clean the DB"), store them for later ("the sled is offline now, but if it comes back, we'll tell it to perform the deletion") or fail those requests ("deletion is not possible while this sled is offline!").
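
One way to frame the three options is as an explicit policy that the deletion path consults while a sled's fate is undecided. The sketch below is only illustrative: `SledState`, `OfflineDeletePolicy`, `DeleteOutcome`, and `handle_delete` are made-up names, and the in-memory `pending` list stands in for whatever durable storage a real "store for later" option would need.

```rust
/// Hypothetical view of a sled from Nexus's perspective.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SledState {
    Online,
    /// Unreachable, and we have not yet decided if this is permanent.
    Offline,
    /// Declared permanently gone.
    Removed,
}

/// The three candidate policies for deletions aimed at an offline sled.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OfflineDeletePolicy {
    /// "The sled is gone, so we just need to clean the DB."
    Discard,
    /// "If it comes back, we'll tell it to perform the deletion."
    Defer,
    /// "Deletion is not possible while this sled is offline!"
    Reject,
}

#[derive(Debug, PartialEq)]
enum DeleteOutcome {
    SentToSled,
    DbOnly,
    Queued,
    Rejected,
}

/// Decide what to do with a deletion request for `resource`.
/// `pending` collects deferred deletions to replay if the sled returns.
fn handle_delete(
    state: SledState,
    policy: OfflineDeletePolicy,
    pending: &mut Vec<String>,
    resource: &str,
) -> DeleteOutcome {
    match (state, policy) {
        // A reachable sled gets the request directly.
        (SledState::Online, _) => DeleteOutcome::SentToSled,
        // Once removal is certain, only the database needs cleaning.
        (SledState::Removed, _) => DeleteOutcome::DbOnly,
        // The undecided window is where the policy matters.
        (SledState::Offline, OfflineDeletePolicy::Discard) => DeleteOutcome::DbOnly,
        (SledState::Offline, OfflineDeletePolicy::Defer) => {
            pending.push(resource.to_string());
            DeleteOutcome::Queued
        }
        (SledState::Offline, OfflineDeletePolicy::Reject) => DeleteOutcome::Rejected,
    }
}
```

Writing the policy as a value rather than hard-coding one branch keeps the open question visible: the choice only affects the `Offline` arms, so it can change later without touching the online or removed paths.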
