tracking issue: instance restart behavior during upgrade #9094

I'll start with the action plan and add a comment later with some discussion.

Short-term (R17):

- [x] reduce cooldown period from 1 hour to 5 minutes  (https://github.com/oxidecomputer/omicron/pull/9097)

These are longer-term goals that aren't specific enough to make tasks yet:

- Instances should not be restarted at all for system upgrades.  We have long planned to use live migration to avoid this.
- Whether we do live migration or use instance restarts, we could make allocation choices more intelligently to minimize the number of instance movements required (e.g., by preferring to move instances to sleds that have already been updated).  This is much harder than it sounds.  See [RFD 564](https://rfd.shared.oxide.computer/rfd/0564).
- Even when we have to restart instances to move them, we could do so in the same coordinated way that we plan to use for live migration.  (Roughly: we've discussed having the update system mark a sled as needing evacuation, avoid putting new instances there, and then wait for evacuation to happen.  The plan is to do that evacuation with live migration, but all of this could be done with ordinary VM restarts, too.)  This would leverage the same work and also make sure that we don't cool down instances when they fail _because_ of the upgrade.
- Instances should not be cooled down for "start" failures that can't be the instance's fault (e.g., failure to start on a sled due to the sled not having sync'd time, or not having U2 devices, etc.).  @jgallagher is filing a separate issue on this shortly.  This isn't really upgrade-related but we hit it during upgrade testing and it contributed to instance unavailability.
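
To make two of these ideas concrete, here is a rough sketch of (a) skipping the cooldown when a start failure is the sled's fault and (b) placement that avoids sleds marked for evacuation and prefers sleds that have already been updated. Everything here is hypothetical: `StartFailure`, `Sled`, `cooldown_for`, and `pick_sled` are names made up for illustration and are not Omicron's actual types or APIs. The only value taken from this issue is the 5-minute cooldown. Real placement has many more constraints (see RFD 564), so treat this as an illustration of the intent, not a design.

```rust
use std::time::Duration;

/// Hypothetical classification of why an instance failed to start.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartFailure {
    /// The sled had not synchronized time yet.
    SledTimeNotSynced,
    /// The sled had no U2 devices available.
    SledMissingU2,
    /// The sled was marked for evacuation as part of an upgrade.
    SledEvacuating,
    /// Anything else; this may be attributable to the instance itself.
    Other,
}

/// Cooldown period after the short-term change above (5 minutes).
const COOLDOWN: Duration = Duration::from_secs(5 * 60);

/// Returns how long to wait before the next restart attempt, or `None`
/// when the failure was the sled's fault and no cooldown should apply.
fn cooldown_for(failure: StartFailure) -> Option<Duration> {
    match failure {
        StartFailure::SledTimeNotSynced
        | StartFailure::SledMissingU2
        | StartFailure::SledEvacuating => None,
        StartFailure::Other => Some(COOLDOWN),
    }
}

/// Hypothetical, heavily simplified view of a sled for placement purposes.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Sled {
    id: u32,
    updated: bool,
    needs_evacuation: bool,
}

/// Picks a target sled: never one marked for evacuation, and an
/// already-updated sled if one exists. Ties go to the first candidate.
fn pick_sled(sleds: &[Sled]) -> Option<&Sled> {
    sleds
        .iter()
        .filter(|s| !s.needs_evacuation)
        // `false < true`, so updated sleds (key `false`) sort first.
        .min_by_key(|s| !s.updated)
}
```

With something like this, an instance that failed to start because the sled had no synchronized time would be retried without a cooldown, and an instance moved off a sled during an upgrade would land on an updated sled when one is available, so it would not have to move a second time.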
