Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add migration table and explicit migration tracking in sled-agent #5859

Merged
merged 34 commits into from
Jun 12, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Jun 5, 2024

As part of ongoing work on improving instance lifecycle management (see #5749) , we intend to remove the InstanceRuntimeState tracking from sled-agent, and make Nexus the sole owner of instance records in CRDB, with a new instance-update saga taking over the responsibility of managing the instance's state transitions.

In order to properly manage the instance state machine, Nexus will need information about the status of active migrations that are currently only available to sled-agents. For example, if an instance is migrating, and a sled agent reports that the source VMM is Destroyed, Nexus doesn't presently have the capability to determine whether the source VMM was destroyed because the migration completed successfully, or that the source shut down prior to starting the migration, resulting in a failure.

In order for Nexus to correctly manage state updates during live migration, we introduce a new migration table to the schema for tracking the state of ongoing migrations. The instance-migrate saga creates a migration record when beginning a migration. The Nexus and sled-agent APIs are extended to include migration state updates from sled-agents to Nexus. In this branch, the migration table is (basically) write-only; Nexus doesn't really read from it, and just stuffs updates into it. In the future, however, this will be used by the instance-update saga.

It occurred to me that, in addition to using the migration table for instance updates, it might also be useful to add an OMDB command to look up the status of a migration using this table. However, I decided that made more sense as a follow-up change, as I'd like to get back to integrating this into #5749.

Fixes #2948

schema/crdb/dbinit.sql Outdated Show resolved Hide resolved
@hawkw hawkw changed the title WIP: migration table add miigration table and explicit migration tracking in sled-agent Jun 7, 2024
@hawkw hawkw changed the title add miigration table and explicit migration tracking in sled-agent add migration table and explicit migration tracking in sled-agent Jun 7, 2024
@hawkw hawkw marked this pull request as ready for review June 7, 2024 18:10
@hawkw hawkw requested a review from gjcolombo June 7, 2024 18:11
common/src/api/internal/nexus.rs Outdated Show resolved Hide resolved
common/src/api/internal/nexus.rs Show resolved Hide resolved
@hawkw hawkw requested a review from smklein June 10, 2024 17:11
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid, great idea to split out the source/destination columns so explicitly. Already seems easier to follow, and I think the omdb commands are a great idea.

schema/crdb/dbinit.sql Show resolved Hide resolved
Comment on lines +47 to +48
/// The state of the migration target VMM.
pub target_state: MigrationState,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to starting with the "state" being the same at source and destination, but I wonder if we'll eventually want different types here. Seems possible that there's a source-only or target-only state in the future.

(No change needed, just musing)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the intention for this table is to store pretty low-resolution data about the migration state, basically just whether it's completed or failed, so we're not including more detailed migration status reported by Propolis. So, we don't really expect these to include states that are specific to one side of the migration currently, although I suppose we could...

nexus/db-queries/src/db/datastore/instance.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/queries/instance.rs Outdated Show resolved Hide resolved
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
sled-agent/src/common/instance.rs Outdated Show resolved Hide resolved
sled-agent/src/common/instance.rs Outdated Show resolved Hide resolved
nexus/db-model/src/migration.rs Outdated Show resolved Hide resolved
ObservedMigrationStatus::NoMigration
| ObservedMigrationStatus::InProgress
| ObservedMigrationStatus::Pending => {}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to leave this comment on right-side line 382 but GitHub.

What happens in the following case?

  • migration saga calls set_migration_ids on a migration source
  • before the target VMM launches, the guest in the source shuts down
  • the source's sled reaches apply_propolis_observation and the call to clear_migration_ids

Does the source migration status just get stuck in Pending? (The migration writ large should eventually fail, provided Propolis pushes the updates we expect: the target VMM will start, try to connect to the source, and find that it's missing, which should cause that side of the migration to fail.)

If the answer's "yes," does that matter given that the target should eventually report migration failure, or should we do something to sanitize this in clear_migration_ids? (For example, what happens if the target sled crashes just after it launches the target Propolis, such that it never reports anything else, either?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, my thinking was that the target observing a failure here would be sufficient to handle this case, but your point that the target sled could also have crashed and might not see the source's failure... I'll see about having this fail the migration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, as of f6c9875 we should report failure immediately if a VMM fails/is destroyed while it's part of an in-progress migration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I suspect that in the long run we're probably going to need Nexus to deal with this in a health check task: if Nexus discovers that a VMM is unexpectedly gone, then the VMM's state changes and any migrations that it was a part of fail. But we'll address that when we add those checks.

nexus/db-queries/src/db/queries/instance.rs Show resolved Hide resolved
nexus/db-queries/src/db/queries/instance.rs Outdated Show resolved Hide resolved
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
nexus/src/app/sagas/instance_migrate.rs Outdated Show resolved Hide resolved
sled-agent/src/sim/instance.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member Author

hawkw commented Jun 10, 2024

Thanks @gjcolombo and @smklein for the very thorough reviews, I really appreciate it! 🖤

@hawkw hawkw requested review from gjcolombo and smklein June 11, 2024 20:18
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for going another lap on this!

sled-agent/src/common/instance.rs Outdated Show resolved Hide resolved
ObservedMigrationStatus::NoMigration
| ObservedMigrationStatus::InProgress
| ObservedMigrationStatus::Pending => {}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I suspect that in the long run we're probably going to need Nexus to deal with this in a health check task: if Nexus discovers that a VMM is unexpectedly gone, then the VMM's state changes and any migrations that it was a part of fail. But we'll address that when we add those checks.

@hawkw hawkw enabled auto-merge (squash) June 12, 2024 17:42
@hawkw hawkw merged commit 0a0db97 into main Jun 12, 2024
20 checks passed
@hawkw hawkw deleted the eliza/migration-table branch June 12, 2024 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

want mechanism to track & report on live migration statuses
3 participants