Description
Depends on #6455 (and probably also #6490).
Per RFD 486:
An instance’s boot_on_fault discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.
We should implement that.
Potentially, we could attempt to schedule a new start saga for an instance as part of the update saga that transitions it to Failed. Regardless of whether we do that, there should definitely be an RPW that is responsible for periodically listing instances which are in the Failed state and have boot_on_fault disciplines indicating that they should be restarted, and for ensuring that a start saga is started for those instances. Update sagas which have transitioned an instance to Failed could then just activate that background task.
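
As a rough illustration of what one activation of that background task might do, here is a minimal sketch. Everything in it is hypothetical: `InstanceState`, `BootOnFault`, `Instance`, `needs_restart`, and `activate_restart_task` are stand-ins invented for this example, not Nexus's actual types or APIs, and the `start_saga` closure stands in for however a start saga would really be kicked off.

```rust
// Hypothetical stand-ins for illustration only; not Nexus's real data model.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceState {
    Running,
    Failed,
}

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum BootOnFault {
    // Per the RFD text quoted above, doing nothing is the default.
    #[default]
    DoNothing,
    Restart,
}

#[derive(Clone, Debug)]
struct Instance {
    id: u32,
    state: InstanceState,
    boot_on_fault: BootOnFault,
}

/// An instance is a restart candidate only if it is Failed *and* its
/// discipline asks for an automatic restart.
fn needs_restart(instance: &Instance) -> bool {
    instance.state == InstanceState::Failed
        && instance.boot_on_fault == BootOnFault::Restart
}

/// One activation of the (hypothetical) task: walk the listed instances and
/// try to start a saga for each restart candidate. Returns the IDs for which
/// the start succeeded. Failures are skipped rather than retried here,
/// because the next periodic activation will see the instance again.
fn activate_restart_task<F>(instances: &[Instance], mut start_saga: F) -> Vec<u32>
where
    F: FnMut(u32) -> Result<(), String>,
{
    let mut started = Vec::new();
    for instance in instances.iter().filter(|i| needs_restart(i)) {
        if start_saga(instance.id).is_ok() {
            started.push(instance.id);
        }
    }
    started
}
```

The point of the sketch is that the task only needs the current list of instances, not any memory of earlier activations. If a start attempt fails, or an update saga's activation is lost, the next periodic pass picks the instance up again, which is why the task is worth having even if update sagas also try to start instances directly.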