Nexus should restart Failed instances when boot_on_fault says to #6491

@hawkw

Depends on #6455 (and probably also #6490).

Per RFD 486:

An instance’s boot_on_fault discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.

We should implement that.

Potentially, we could attempt to schedule a new start saga for an instance as part of the update saga that transitions it to Failed. Regardless of whether we do that, though, there should definitely be an RPW that's responsible for periodically listing instances which are in the Failed state and have boot_on_fault disciplines indicating that they should be restarted, and for ensuring that a start saga is started for those instances. Update sagas which have transitioned an instance to Failed could just activate that background task.
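
To make the shape of that background task concrete, here is a rough sketch of a single activation. It is not Nexus code: `InstanceState`, `BootOnFault`, `Instance`, `StartSagaLauncher`, `RecordingLauncher`, `needs_restart`, and `activate` are all hypothetical names made up for illustration, and a plain slice stands in for the database query that would list instances.

```rust
/// Hypothetical instance states; only `Failed` matters for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InstanceState {
    Running,
    Stopped,
    Failed,
}

/// Hypothetical encoding of the `boot_on_fault` discipline:
/// do nothing (the default) or try to restart the instance.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BootOnFault {
    DoNothing,
    Restart,
}

/// Hypothetical, pared-down instance record.
#[derive(Clone, Debug)]
pub struct Instance {
    pub id: u64,
    pub state: InstanceState,
    pub boot_on_fault: BootOnFault,
}

/// Stand-in for whatever actually kicks off a start saga.
pub trait StartSagaLauncher {
    fn start_instance(&mut self, instance_id: u64);
}

/// Launcher that only records which instances it was asked to start.
pub struct RecordingLauncher {
    pub started: Vec<u64>,
}

impl StartSagaLauncher for RecordingLauncher {
    fn start_instance(&mut self, instance_id: u64) {
        self.started.push(instance_id);
    }
}

/// An instance is a restart candidate only if it is `Failed` *and* its
/// discipline asks for an automatic restart.
pub fn needs_restart(instance: &Instance) -> bool {
    instance.state == InstanceState::Failed
        && instance.boot_on_fault == BootOnFault::Restart
}

/// One activation of the hypothetical background task: scan the listed
/// instances and request a start saga for each restart candidate.
/// Returns how many start sagas were requested.
pub fn activate<L: StartSagaLauncher>(instances: &[Instance], launcher: &mut L) -> usize {
    let mut requested = 0;
    for instance in instances.iter().filter(|i| needs_restart(i)) {
        launcher.start_instance(instance.id);
        requested += 1;
    }
    requested
}
```

In this sketch, the periodic timer and an update saga that has just moved an instance to Failed would both end up calling `activate`. A real task would also have to deal with things the sketch ignores, such as an instance already having a start saga in flight.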

Labels

nexus: Related to nexus
