Description
Nexus currently stores information about Sled IDs within CRDB, including:
- Instances belonging to a particular Sled
- Zpools, datasets, and (pending) regions belonging to a particular Sled
- (soon) Versioning and Inventory information for a sled.
If a sled irrecoverably fails - not just reboots, but is either destroyed or unplugged - Nexus needs to have a process for recovering all data which has been detached from the Sled.
This will likely involve:
- Marking all instances previously running on the Sled as faulted, if they do not have backups
- Migration of regions to new locations, if available
- Cleaning sled-specific information from the database
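As a rough sketch of how those three steps might be sequenced, the following builds a recovery plan for a failed sled. Every name here (`Instance`, `Region`, `RecoveryStep`, `plan_sled_recovery`) is hypothetical and does not correspond to existing Nexus code; the round-robin choice of target sled and the decision to purge sled records last are assumptions made only to keep the example small.

```rust
// Hypothetical types for illustration only; none of these exist in Nexus.
#[derive(Debug, Clone, PartialEq)]
struct Instance {
    name: String,
    has_backup: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct Region {
    id: u32,
}

#[derive(Debug, PartialEq)]
enum RecoveryStep {
    MarkInstanceFaulted(String),
    MigrateRegion { region_id: u32, target_sled: String },
    RegionAwaitingCapacity(u32),
    PurgeSledRecords(String),
}

/// Build an ordered list of recovery steps for a sled that is known to be gone.
fn plan_sled_recovery(
    sled_id: &str,
    instances: &[Instance],
    regions: &[Region],
    spare_sleds: &[String],
) -> Vec<RecoveryStep> {
    let mut steps = Vec::new();

    // Step 1: instances without backups cannot be restored elsewhere.
    for inst in instances.iter().filter(|i| !i.has_backup) {
        steps.push(RecoveryStep::MarkInstanceFaulted(inst.name.clone()));
    }

    // Step 2: migrate regions if any spare sled is available (assumed
    // round-robin placement); otherwise record that the region is waiting.
    for (i, region) in regions.iter().enumerate() {
        if spare_sleds.is_empty() {
            steps.push(RecoveryStep::RegionAwaitingCapacity(region.id));
        } else {
            let target = spare_sleds[i % spare_sleds.len()].clone();
            steps.push(RecoveryStep::MigrateRegion {
                region_id: region.id,
                target_sled: target,
            });
        }
    }

    // Step 3: clean sled-specific rows last (ordering is an assumption).
    steps.push(RecoveryStep::PurgeSledRecords(sled_id.to_string()));
    steps
}
```

Producing a plan rather than acting directly keeps the decision logic separate from database and network side effects, so it can be inspected before anything is executed.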
Additionally, it is important to note that "unplugging a sled" is unpredictable, and may occur at any time. There is a separate, critical question of distinguishing between a temporary and a permanent failure (we would want to treat a crash + reboot much differently from a long-term sled removal). However, in the period of time while we're still making that decision, we need to have a policy for dealing with ongoing operations to that sled.
Arguably, the "deletion" operations are particularly nasty in this time period - we don't know if we can discard those requests ("the sled is gone, so we just need to clean the DB"), store them for later ("the sled is offline now, but if it comes back, we'll tell it to perform the deletion") or fail those requests ("deletion is not possible while this sled is offline!").
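To make the three options concrete, here is a minimal sketch of a deletion handler parameterized by policy. All names (`SledStatus`, `OfflineDeletePolicy`, `DeleteOutcome`, `PendingDeletes`, `handle_delete`) are hypothetical, and the in-memory queue stands in for whatever durable storage a real implementation would need.

```rust
// Hypothetical types for illustration only; none of these exist in Nexus.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SledStatus {
    Online,
    Unreachable, // failure observed, not yet classified as temporary or permanent
    Removed,     // failure classified as permanent
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum OfflineDeletePolicy {
    Discard, // "the sled is gone, so we just need to clean the DB"
    Defer,   // "if it comes back, we'll tell it to perform the deletion"
    Reject,  // "deletion is not possible while this sled is offline!"
}

#[derive(Debug, PartialEq)]
enum DeleteOutcome {
    SentToSled,
    DbCleanedOnly,
    Deferred,
    Rejected,
}

/// In-memory stand-in for a durable queue of deferred deletions.
#[derive(Default)]
struct PendingDeletes {
    queue: Vec<String>,
}

fn handle_delete(
    status: SledStatus,
    policy: OfflineDeletePolicy,
    resource: &str,
    pending: &mut PendingDeletes,
) -> DeleteOutcome {
    match (status, policy) {
        // Healthy sled: forward the request as usual.
        (SledStatus::Online, _) => DeleteOutcome::SentToSled,
        // Known-permanent removal: only the database needs cleaning.
        (SledStatus::Removed, _) => DeleteOutcome::DbCleanedOnly,
        // Undecided window: the policy determines the behavior.
        (SledStatus::Unreachable, OfflineDeletePolicy::Discard) => {
            DeleteOutcome::DbCleanedOnly
        }
        (SledStatus::Unreachable, OfflineDeletePolicy::Defer) => {
            pending.queue.push(resource.to_string());
            DeleteOutcome::Deferred
        }
        (SledStatus::Unreachable, OfflineDeletePolicy::Reject) => {
            DeleteOutcome::Rejected
        }
    }
}
```

Only the `Unreachable` arms differ between policies, which reflects that the open question applies specifically to the window before the failure has been classified.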