-
Notifications
You must be signed in to change notification settings - Fork 62
nexus: revamp live migration saga #2892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
175b467 to
fcc3e12
Compare
smklein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the new format - this saga has a lot fewer constraints, and your table-based commentary makes it very clear what's being coordinated between the sleds and CRDB.
Thank you for the changes!
61613c3 to
39a8dfd
Compare
luqmana
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Greg! I left a couple nits and some possible follow up but looks good.
jordanhendricks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for giving me time to get through this. A couple questions in review.
| // other information, like per-Propolis states, that's not relevant here and | ||
| // is ignored.) | ||
| // | ||
| // | Item | Source | Dest | CRDB | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this offline a bit: It would be nice to spell out a bit more here what the source/dest columns in this table represent (namely, that they are that sled's sled-agent's current state).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 321331e.
| + sim_migrate_prep | ||
|
|
||
| // This step sets the migration ID and destination Propolis ID fields in | ||
| // CRDB by asking the instance's current sled to paste them into its runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"paste them" is throwing me off a bit -- similar to the earlier comment, I am not sure where exactly this state is being set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 321331e.
| // In case 1, the instance should already be updated properly (or will be | ||
| // updated properly soon), and in cases 2 and 3 there's nothing that can | ||
| // reliably be done (the error may not be transient), so just swallow all | ||
| // errors here (but warn that they occurred). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the outcome here if we hit errors here that are not-transient? Is the migration stuck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I corrected the comment in 1af0867 and 5e45c22.
In general, the error handling story in the instance update paths seems very fragile to me. ISTM to work like this:
- When Nexus wants to change an instance's state, it issues a call to sled agent that returns a new instance runtime state on success or a sled agent error on failure. The returned result type is a Progenitor client result that wraps either the updated runtime state or a sled agent-produced HTTP error code.
- Instance state change results are handled in
Nexus::handle_instance_put_result. If a state change failed, this converts the Progenitor client result into an Omicron error type. That conversion preserves 503 Unavailable errors, turns any 400-level client error into 400 Bad Request, and turns everything else into 500 Internal Server Error. Any further information from the inner error type is lost. - Sled agent has a wide array of descriptive internal error types. However, its Dropshot error handlers need to return an HTTP error, and sled agent's error conversion routine turns almost everything into a 500 error, except for Propolis errors that bear their own statuses, which statuses get preserved.
This makes it really hard to get sled agent to express a specific instance state change error to Nexus: sled agent turns everything into a 500 regardless of the semantics of the internal error it encountered, but even if it didn't, handle_instance_put_result will just push everything through the From<progenitor_client::Error> impl, which throws away most individual error codes.
There's at least one bug in the new migration code that stems from all of this: setting or clearing migration IDs is supposed to return a 400-level error on a generation number conflict, but sled agent ignores the semantics of the instance::Error::Transition return type and just turns it into a 500 instead. This can cause instances to get marked as failed when they shouldn't be. I've tried to fix this in 28d526d.
Flesh out the live migration saga. See RFD 361 and the theory statements in instance_migrate.rs for more details about the design. The new saga will start migration, but (yet) more changes are required for it to succeed end-to-end, since sled agent still needs to be taught to send the correct updates when a migration ends. Remove the portion of the migration saga that prospectively reprograms OPTE V2P mappings using the instance's destination sled. A future commit will update V2P mappings in response to the end-of-migration instance update that records an instance's new home sled.
Add the ability to constrain sled selection to a specific set of desired target sleds. This uses a builder that can be extended in the future for other purposes (e.g. switching to "don't schedule to this sled," "schedule only to sleds having such-and-such property," and so forth). Use this to add a resource-reservation step to the migration saga. Also fix a simulated sled agent bug that made simulated unregistration not idempotent.
28d526d to
2b5106a
Compare
jordanhendricks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, greg! the updated comments look great.
Yet another PR--the seventh--in the live migration series. It may be helpful to refer to #2862 while reading parts of the saga, since that PR introduced some of the sled agent primitives the saga now uses.
The next few changes in this sequence will do the following:
Revamp the live migration saga to implement (more or less, anyway) the design described in RFD 361. See that RFD and the theory statements in
instance_migrate.rsfor more details.The new saga starts migrations, but more sled agent and Nexus changes are needed for migration to succeed end-to-end. Specifically, sled agent needs to be taught to send the correct updates when a migration ends, and Nexus needs to be taught to handle those updates; see below for more commentary.
Remove OPTE V2P mapping management from the migration saga proper. This will come back in a separate change that detects when an instance has moved between sleds and updates the mappings. This will fix the "can't roll back V2P mappings if the migration ultimately fails" issue identified in the old code's comments.
To maintain the shape of the existing external live migration API (which accepts an explicit destination sled), extend the sled resource reservation logic to accept a "must select from the following sleds" constraint. (This is not meant to be a final decision about the shape of the migration API; this is just meant to keep from piling still more changes into an already large PR.)
Also fix an idempotency bug in simulated sled agent's instance unregistration routine.