I'll start with the action plan and add a comment later with some discussion.
Short-term (R17):
- Reduce the auto-restart cooldown period from 1 hour to 5 minutes (#9097, "[nexus] decrease auto-restart cooldown to 5 minutes").
These are longer-term goals that aren't specific enough to make tasks yet:
- Instances should not be restarted at all for system upgrades. We have long planned to use live migration to avoid this.
- Whether we use live migration or instance restarts, we could make allocation choices more intelligently to minimize the number of instance movements required (e.g., prefer to move instances to sleds that have already been updated). This is much harder than it sounds. See RFD 564.
- Even when we have to restart instances to move them, we could do so in the same coordinated way that we plan to use for live migration. (Roughly: we've discussed having the update system mark a sled as needing evacuation, avoid putting new instances there, and then wait for evacuation to happen. The plan is to do that evacuation with live migration, but all of this could also be done with ordinary VM restarts.) This would leverage the same work and also make sure that we don't cool down instances when they fail because of the upgrade.
- Instances should not be cooled down for "start" failures that can't be the instance's fault (e.g., failure to start on a sled because the sled has not synced time, has no U2 devices, etc.). @jgallagher is filing a separate issue on this shortly. This isn't really upgrade-related, but we hit it during upgrade testing and it contributed to instance unavailability.
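
To make the last two points concrete, here is a rough sketch of what a cooldown policy keyed on the failure cause might look like. Everything here is hypothetical and for discussion only: `StartFailure`, its variants, `AUTO_RESTART_COOLDOWN`, and `cooldown_for` are made-up names, not existing Nexus types, and the real classification of failures would need more care than this.

```rust
use std::time::Duration;

/// Hypothetical classification of why an instance failed to start.
/// These variants are illustrative only; they are not existing Nexus types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartFailure {
    /// The chosen sled had not synced time yet.
    SledTimeNotSynced,
    /// The chosen sled had no U2 devices available.
    SledMissingU2Devices,
    /// The instance was stopped because its sled was being evacuated for an upgrade.
    SledEvacuatedForUpgrade,
    /// A failure that may be attributable to the instance itself.
    InstanceFault,
}

/// The short-term cooldown from the action plan (5 minutes, down from 1 hour).
const AUTO_RESTART_COOLDOWN: Duration = Duration::from_secs(5 * 60);

/// Returns the cooldown to apply before the next auto-restart attempt,
/// or `None` if the failure could not have been the instance's fault.
fn cooldown_for(failure: StartFailure) -> Option<Duration> {
    match failure {
        // Sled-side or upgrade-driven failures: retry without a cooldown.
        StartFailure::SledTimeNotSynced
        | StartFailure::SledMissingU2Devices
        | StartFailure::SledEvacuatedForUpgrade => None,
        // Possibly the instance's fault: back off before restarting again.
        StartFailure::InstanceFault => Some(AUTO_RESTART_COOLDOWN),
    }
}
```

The match is deliberately exhaustive (no wildcard arm), so adding a new failure cause would force an explicit decision about whether it should trigger a cooldown.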