Skip to content

Conversation

@gjcolombo
Copy link
Contributor

@gjcolombo gjcolombo commented Apr 20, 2023

Yet another PR--the seventh--in the live migration series. It may be helpful to refer to #2862 while reading parts of the saga, since that PR introduced some of the sled agent primitives the saga now uses.

The next few changes in this sequence will do the following:

  • Implement sled agent's logic to observe that migrations have finished & push the appropriate updates to Nexus
  • Add Nexus logic to update V2P mappings when an instance's sled is about to change
  • Add sled agent/Nexus logic to drop Propolis resource reservations that are no longer needed

Revamp the live migration saga to implement (more or less, anyway) the design described in RFD 361. See that RFD and the theory statements in instance_migrate.rs for more details.

The new saga starts migrations, but more sled agent and Nexus changes are needed for migration to succeed end-to-end. Specifically, sled agent needs to be taught to send the correct updates when a migration ends, and Nexus needs to be taught to handle those updates; see below for more commentary.

Remove OPTE V2P mapping management from the migration saga proper. This will come back in a separate change that detects when an instance has moved between sleds and updates the mappings. This will fix the "can't roll back V2P mappings if the migration ultimately fails" issue identified in the old code's comments.

To maintain the shape of the existing external live migration API (which accepts an explicit destination sled), extend the sled resource reservation logic to accept a "must select from the following sleds" constraint. (This is not meant to be a final decision about the shape of the migration API; this is just meant to keep from piling still more changes into an already large PR.)

Also fix an idempotency bug in simulated sled agent's instance unregistration routine.

@gjcolombo gjcolombo marked this pull request as draft April 20, 2023 19:10
@gjcolombo gjcolombo marked this pull request as ready for review April 21, 2023 15:45
@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/7-its-saga-time branch from 175b467 to fcc3e12 Compare April 21, 2023 23:20
@smklein smklein self-assigned this Apr 26, 2023
@smklein smklein assigned gjcolombo and unassigned smklein Apr 26, 2023
@gjcolombo gjcolombo requested a review from smklein April 26, 2023 21:27
@gjcolombo gjcolombo removed their assignment Apr 26, 2023
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new format - this saga has a lot fewer constraints, and your table-based commentary makes it very clear what's being coordinated between the sleds and CRDB.

Thank you for the changes!

@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/7-its-saga-time branch from 61613c3 to 39a8dfd Compare April 27, 2023 15:56
Copy link
Contributor

@luqmana luqmana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Greg! I left a couple nits and some possible follow up but looks good.

Copy link
Contributor

@jordanhendricks jordanhendricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for giving me time to get through this. A couple questions in review.

// other information, like per-Propolis states, that's not relevant here and
// is ignored.)
//
// | Item | Source | Dest | CRDB |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this offline a bit: It would be nice to spell out a bit more here what the source/dest columns in this table represent (namely, that they are that sled's sled-agent's current state).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 321331e.

+ sim_migrate_prep

// This step sets the migration ID and destination Propolis ID fields in
// CRDB by asking the instance's current sled to paste them into its runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"paste them" is throwing me off a bit -- similar to the earlier comment, I am not sure where exactly this state is being set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 321331e.

Comment on lines 308 to 311
// In case 1, the instance should already be updated properly (or will be
// updated properly soon), and in cases 2 and 3 there's nothing that can
// reliably be done (the error may not be transient), so just swallow all
// errors here (but warn that they occurred).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the outcome here if we hit errors here that are not-transient? Is the migration stuck?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I corrected the comment in 1af0867 and 5e45c22.

In general, the error handling story in the instance update paths seems very fragile to me. ISTM to work like this:

  • When Nexus wants to change an instance's state, it issues a call to sled agent that returns a new instance runtime state on success or a sled agent error on failure. The returned result type is a Progenitor client result that wraps either the updated runtime state or a sled agent-produced HTTP error code.
  • Instance state change results are handled in Nexus::handle_instance_put_result. If a state change failed, this converts the Progenitor client result into an Omicron error type. That conversion preserves 503 Unavailable errors, turns any 400-level client error into 400 Bad Request, and turns everything else into 500 Internal Server Error. Any further information from the inner error type is lost.
  • Sled agent has a wide array of descriptive internal error types. However, its Dropshot error handlers need to return an HTTP error, and sled agent's error conversion routine turns almost everything into a 500 error, except for Propolis errors that bear their own statuses, which statuses get preserved.

This makes it really hard to get sled agent to express a specific instance state change error to Nexus: sled agent turns everything into a 500 regardless of the semantics of the internal error it encountered, but even if it didn't, handle_instance_put_result will just push everything through the From<progenitor_client::Error> impl, which throws away most individual error codes.

There's at least one bug in the new migration code that stems from all of this: setting or clearing migration IDs is supposed to return a 400-level error on a generation number conflict, but sled agent ignores the semantics of the instance::Error::Transition return type and just turns it into a 500 instead. This can cause instances to get marked as failed when they shouldn't be. I've tried to fix this in 28d526d.

gjcolombo added 13 commits May 1, 2023 15:32
Flesh out the live migration saga. See RFD 361 and the theory statements in
instance_migrate.rs for more details about the design.

The new saga will start migration, but (yet) more changes are required for
it to succeed end-to-end, since sled agent still needs to be taught to send the
correct updates when a migration ends.

Remove the portion of the migration saga that prospectively reprograms OPTE
V2P mappings using the instance's destination sled. A future commit will update
V2P mappings in response to the end-of-migration instance update that records
an instance's new home sled.
Add the ability to constrain sled selection to a specific set of desired target
sleds. This uses a builder that can be extended in the future for other
purposes (e.g. switching to "don't schedule to this sled," "schedule only to
sleds having such-and-such property," and so forth). Use this to add a
resource-reservation step to the migration saga.

Also fix a simulated sled agent bug that made simulated unregistration not
idempotent.
@gjcolombo gjcolombo force-pushed the gjcolombo/lets-migrate/7-its-saga-time branch from 28d526d to 2b5106a Compare May 1, 2023 15:33
Copy link
Contributor

@jordanhendricks jordanhendricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, greg! the updated comments look great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants