nexus: revamp live migration saga #2892

gjcolombo · 2023-04-20T18:52:39Z

Yet another PR--the seventh--in the live migration series. It may be helpful to refer to #2862 while reading parts of the saga, since that PR introduced some of the sled agent primitives the saga now uses.

The next few changes in this sequence will do the following:

- Implement sled agent's logic to observe that migrations have finished & push the appropriate updates to Nexus
- Add Nexus logic to update V2P mappings when an instance's sled is about to change
- Add sled agent/Nexus logic to drop Propolis resource reservations that are no longer needed

Revamp the live migration saga to implement (more or less, anyway) the design described in RFD 361. See that RFD and the theory statements in instance_migrate.rs for more details.

The new saga starts migrations, but more sled agent and Nexus changes are needed for migration to succeed end-to-end. Specifically, sled agent needs to be taught to send the correct updates when a migration ends, and Nexus needs to be taught to handle those updates; see below for more commentary.

Remove OPTE V2P mapping management from the migration saga proper. This will come back in a separate change that detects when an instance has moved between sleds and updates the mappings. This will fix the "can't roll back V2P mappings if the migration ultimately fails" issue identified in the old code's comments.

To maintain the shape of the existing external live migration API (which accepts an explicit destination sled), extend the sled resource reservation logic to accept a "must select from the following sleds" constraint. (This is not meant to be a final decision about the shape of the migration API; this is just meant to keep from piling still more changes into an already large PR.)

Also fix an idempotency bug in simulated sled agent's instance unregistration routine.

nexus/db-model/src/sled.rs

nexus/src/app/instance.rs

nexus/src/app/sagas/instance_migrate.rs

smklein

Thanks for the new format - this saga has a lot fewer constraints, and your table-based commentary makes it very clear what's being coordinated between the sleds and CRDB.

Thank you for the changes!

luqmana

Thanks Greg! I left a couple nits and some possible follow up but looks good.

nexus/src/app/instance.rs

nexus/src/app/sagas/instance_migrate.rs

jordanhendricks

Thanks for giving me time to get through this. A couple questions in review.

jordanhendricks · 2023-04-27T18:55:51Z

nexus/src/app/sagas/instance_migrate.rs

+// other information, like per-Propolis states, that's not relevant here and
+// is ignored.)
+//
+// | Item         | Source | Dest | CRDB |


Discussed this offline a bit: It would be nice to spell out a bit more here what the source/dest columns in this table represent (namely, that they are that sled's sled-agent's current state).

gjcolombo

Fixed in 321331e.

jordanhendricks · 2023-04-28T16:35:25Z

nexus/src/app/sagas/instance_migrate.rs

-        + sim_migrate_prep
+
+    // This step sets the migration ID and destination Propolis ID fields in
+    // CRDB by asking the instance's current sled to paste them into its runtime


"paste them" is throwing me off a bit -- similar to the earlier comment, I am not sure where exactly this state is being set

gjcolombo

Fixed in 321331e.

jordanhendricks · 2023-04-28T16:36:08Z

nexus/src/app/sagas/instance_migrate.rs

+    // In case 1, the instance should already be updated properly (or will be
+    // updated properly soon), and in cases 2 and 3 there's nothing that can
+    // reliably be done (the error may not be transient), so just swallow all
+    // errors here (but warn that they occurred).


What is the outcome if we hit errors here that are not transient? Is the migration stuck?

gjcolombo

I corrected the comment in 1af0867 and 5e45c22.

In general, the error handling story in the instance update paths seems very fragile to me. ISTM to work like this:

When Nexus wants to change an instance's state, it issues a call to sled agent that returns a new instance runtime state on success or a sled agent error on failure. The returned result type is a Progenitor client result that wraps either the updated runtime state or a sled agent-produced HTTP error code.

Instance state change results are handled in Nexus::handle_instance_put_result. If a state change failed, this converts the Progenitor client result into an Omicron error type. That conversion preserves 503 Unavailable errors, turns any 400-level client error into 400 Bad Request, and turns everything else into 500 Internal Server Error. Any further information from the inner error type is lost.

Sled agent has a wide array of descriptive internal error types. However, its Dropshot error handlers need to return an HTTP error, and sled agent's error conversion routine turns almost everything into a 500 error, except for Propolis errors that bear their own statuses, which statuses get preserved.

This makes it really hard to get sled agent to express a specific instance state change error to Nexus: sled agent turns everything into a 500 regardless of the semantics of the internal error it encountered, but even if it didn't, handle_instance_put_result will just push everything through the From<progenitor_client::Error> impl, which throws away most individual error codes.

There's at least one bug in the new migration code that stems from all of this: setting or clearing migration IDs is supposed to return a 400-level error on a generation number conflict, but sled agent ignores the semantics of the instance::Error::Transition return type and just turns it into a 500 instead. This can cause instances to get marked as failed when they shouldn't be. I've tried to fix this in 28d526d.
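To make the flattening concrete, here is a minimal sketch of the two conversions described above. Every type and function name in it is hypothetical and stands in for the real sled agent and Nexus code, which is more involved. The 409 status in the second sled agent function is an assumption based on the commit titled "add & make use of 409 Conflict error type", and the Nexus side models only the mapping as described in this comment.

```rust
/// Hypothetical stand-in for sled agent's internal instance errors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledAgentError {
    /// Generation-number conflict while setting or clearing migration IDs.
    Transition,
    /// An error forwarded from Propolis that carries its own HTTP status.
    Propolis(u16),
    /// Any other internal failure.
    Other,
}

/// Hypothetical coarse error classes that survive on the Nexus side.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NexusError {
    BadRequest,         // 400
    ServiceUnavailable, // 503
    InternalError,      // 500
}

/// Sled agent side, as described: only Propolis statuses are preserved,
/// everything else collapses to 500.
fn sled_agent_status_flattened(err: SledAgentError) -> u16 {
    match err {
        SledAgentError::Propolis(status) => status,
        _ => 500,
    }
}

/// Sled agent side with the conflict case reported as a 400-level error
/// (409 is an assumption, see the lead-in).
fn sled_agent_status_with_conflict(err: SledAgentError) -> u16 {
    match err {
        SledAgentError::Transition => 409,
        SledAgentError::Propolis(status) => status,
        SledAgentError::Other => 500,
    }
}

/// Nexus side, as described: keep 503, turn any other 4xx into 400,
/// and turn everything else into 500.
fn nexus_classify(status: u16) -> NexusError {
    match status {
        503 => NexusError::ServiceUnavailable,
        400..=499 => NexusError::BadRequest,
        _ => NexusError::InternalError,
    }
}
```

With the flattened conversion, a `Transition` conflict reaches Nexus as an internal error, which is the path that can mark an instance as failed. With the conflict-aware conversion it arrives as a client error instead, although the specific status code is still lost on the Nexus side.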

Flesh out the live migration saga. See RFD 361 and the theory statements in instance_migrate.rs for more details about the design. The new saga will start migration, but (yet) more changes are required for it to succeed end-to-end, since sled agent still needs to be taught to send the correct updates when a migration ends. Remove the portion of the migration saga that prospectively reprograms OPTE V2P mappings using the instance's destination sled. A future commit will update V2P mappings in response to the end-of-migration instance update that records an instance's new home sled.

Add the ability to constrain sled selection to a specific set of desired target sleds. This uses a builder that can be extended in the future for other purposes (e.g. switching to "don't schedule to this sled," "schedule only to sleds having such-and-such property," and so forth). Use this to add a resource-reservation step to the migration saga. Also fix a simulated sled agent bug that made simulated unregistration not idempotent.
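A minimal sketch of what a builder-based "must select from" constraint could look like. Only the name `must_select_from` comes from this PR's commit list. The types, the integer sled IDs, and the capacity-based selection function are illustrative assumptions and are not the actual Nexus reservation code.

```rust
use std::collections::BTreeSet;

/// Hypothetical sled identifier; the real code would use a UUID type.
type SledId = u32;

/// Constraints applied when choosing a sled for a reservation.
#[derive(Debug, Default, Clone)]
struct SledConstraints {
    /// If set, only sleds in this set may be selected.
    must_select_from: Option<BTreeSet<SledId>>,
}

/// Builder that leaves room for other constraint kinds later.
#[derive(Debug, Default)]
struct SledConstraintBuilder {
    constraints: SledConstraints,
}

impl SledConstraintBuilder {
    fn new() -> Self {
        Self::default()
    }

    /// Restrict selection to the given sleds (repeated calls accumulate).
    fn must_select_from(mut self, sleds: &[SledId]) -> Self {
        self.constraints
            .must_select_from
            .get_or_insert_with(BTreeSet::new)
            .extend(sleds.iter().copied());
        self
    }

    fn build(self) -> SledConstraints {
        self.constraints
    }
}

/// Pick the first candidate `(sled, free_capacity)` pair that satisfies the
/// constraints and has at least `needed` free capacity.
fn select_sled(
    candidates: &[(SledId, u32)],
    needed: u32,
    constraints: &SledConstraints,
) -> Option<SledId> {
    candidates
        .iter()
        .filter(|(id, _)| {
            constraints
                .must_select_from
                .as_ref()
                .map_or(true, |allowed| allowed.contains(id))
        })
        .find(|(_, free)| *free >= needed)
        .map(|(id, _)| *id)
}
```

In this sketch a migration to an explicit destination would build its constraints with `SledConstraintBuilder::new().must_select_from(&[dest]).build()`, so the reservation either lands on that sled or fails. An unconstrained caller passes the default constraints and gets the first sled with enough capacity.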

jordanhendricks

thanks, greg! the updated comments look great.

gjcolombo marked this pull request as draft April 20, 2023 19:10

gjcolombo requested review from jordanhendricks, luqmana and smklein April 21, 2023 15:45

gjcolombo marked this pull request as ready for review April 21, 2023 15:45

gjcolombo force-pushed the gjcolombo/lets-migrate/7-its-saga-time branch from 175b467 to fcc3e12 Compare April 21, 2023 23:20

smklein self-assigned this Apr 26, 2023

smklein reviewed Apr 26, 2023

smklein assigned gjcolombo and unassigned smklein Apr 26, 2023

gjcolombo requested a review from smklein April 26, 2023 21:27

gjcolombo removed their assignment Apr 26, 2023

smklein approved these changes Apr 27, 2023

gjcolombo force-pushed the gjcolombo/lets-migrate/7-its-saga-time branch from 61613c3 to 39a8dfd Compare April 27, 2023 15:56

luqmana reviewed Apr 27, 2023

jordanhendricks reviewed Apr 28, 2023

gjcolombo mentioned this pull request Apr 28, 2023

want mechanism to track & report on live migration statuses #2948

Closed

gjcolombo added 13 commits May 1, 2023 15:32

clippy i love you but you're bringing me down

4d2f664

split setting/clearing migration IDs into two operations

b0ef001

eliminate undo shenanigans & improve comments

e988665

switch error type & improve message

444b05c

clarify semantics of must_select_from

ce760e9

use try semantics instead of unwrapping

77863ca

add & make use of 409 Conflict error type

10cadec

improve saga comments

4c55cfc

fix comment in sim_clear_migration_ids

32ac89d

further correct comment

15b8377

don't convert migration ID conflicts into 500 errors

2b5106a

gjcolombo force-pushed the gjcolombo/lets-migrate/7-its-saga-time branch from 28d526d to 2b5106a Compare May 1, 2023 15:33

jordanhendricks approved these changes May 1, 2023

gjcolombo merged commit 6ff5453 into main May 1, 2023

gjcolombo deleted the gjcolombo/lets-migrate/7-its-saga-time branch May 1, 2023 16:57

gjcolombo mentioned this pull request May 26, 2023

sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238

Closed

This was referenced Jan 11, 2024

Gather and send instance information on migration #1783

Closed

Network interfaces need to be sent to the destination sled during instance migration #1459

Closed
