Perform instance state transitions in instance-update saga #5749

Conversation
Force-pushed from 943dfb2 to afd2dad.
**Note**: This change is part of the ongoing work on instance lifecycle management in PR #5749. It isn't strictly necessary on its own; it's just a component of the upcoming instance-update saga. However, I thought it would be easier to review if I factored this change out into a separate PR that can be reviewed and merged on its own.

The instance-update saga (see PR #5749) will only clean up after VMMs whose IDs appear in an `instance` record. When a live migration finishes (successfully or not), we want to allow a new migration to begin as soon as possible, which means we have to unlink the "unused" side of the migration --- the source if the migration succeeded, or the target if it failed --- from the instance, even though that VMM may not be fully destroyed yet. Once this happens, the instance-update saga will no longer be able to clean up these VMMs, so we'll need a separate task that cleans up these "abandoned" VMMs in the background.

This branch introduces an `abandoned_vmm_reaper` background task that's responsible for doing this. It queries the database to list VMMs which are:

- in the `Destroyed` state,
- not yet deleted (i.e. `time_deleted` IS NULL), and
- not pointed to by their corresponding instances (neither the `active_propolis_id` nor the `target_propolis_id` equals the VMM's ID).

For each VMM returned by this query, the `abandoned_vmm_reaper` task will:

- remove the `sled_resource` reservation for that VMM, and
- set `time_deleted` on the VMM record if it was not already set.

This cleanup process will be executed periodically in the background. Eventually, the background task will also be explicitly triggered by the instance-update saga when it knows it has abandoned a VMM.

As an aside, I noticed that the current implementation of `DataStore::vmm_mark_deleted` always unconditionally sets the `time_deleted` field on a VMM record, even if it's already set. This is "probably fine" for overall correctness: the VMM remains deleted, so the operation is still idempotent-ish. But it's not *great*, as it means that any queries for VMMs deleted before a certain timestamp may not be strictly correct, and we're updating the database more often than we really need to. So, I've changed it to only set `time_deleted` if the record's `time_deleted` is null, using `check_if_exists` so that the method still returns `Ok` if the record was already deleted --- the caller can inspect the returned `bool` to determine whether or not they were the actual deleter, but the query still doesn't fail.
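As a rough illustration of the selection criteria and the conditional soft-delete described above, here is a sketch in SQL. The table and column names (`vmm.instance_id`, `vmm.state`, and so on) are assumptions made for the example and may not match the actual schema; the real implementation is expressed as Diesel queries in the datastore.

```sql
-- Sketch: find "abandoned" VMMs (column names are assumptions) --- VMMs that
-- are destroyed, not yet soft-deleted, and no longer referenced by their
-- instance's active or target Propolis ID.
SELECT vmm.id
  FROM vmm
  JOIN instance ON instance.id = vmm.instance_id
 WHERE vmm.state = 'destroyed'
   AND vmm.time_deleted IS NULL
   AND instance.active_propolis_id IS DISTINCT FROM vmm.id
   AND instance.target_propolis_id IS DISTINCT FROM vmm.id;

-- Sketch: the conditional soft-delete described above ($1 is the VMM ID).
-- time_deleted is only set if it is still NULL, so repeated calls do not
-- rewrite the deletion timestamp.
UPDATE vmm
   SET time_deleted = now()
 WHERE id = $1
   AND time_deleted IS NULL;
```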
Force-pushed from afd2dad to 30a341d.
In order to simplify the instance lifecycle state machine and ensure that instance state updates are processed reliably, we intend to perform instance state updates in a saga (which will be added in PR #5749). This saga will require a notion of mutual exclusion between update sagas for the same instance, in order to avoid race conditions like the following:

1. Sagas `S1` and `S2` start at the same time and observe the same instance/VMM states, which indicate that the instance's active VMM has shut down.
2. `S1` clears all the resources/provisioning counters and marks the instance as `Stopped`.
3. The user restarts the instance.
4. `S2` clears the same instance provisioning counters again.

Presently, these races are avoided by the fact that instance state updates are performed partially in `sled-agent`, which serves as an "actor" with exclusive ownership over the state transition. Moving these state transitions to Nexus requires introducing mutual exclusion.

This commit adds a distributed lock on instance state transitions to the datastore. We add the following fields to the `instance` table:

- `updater_id`: the UUID of the saga currently holding the update lock on the instance (or `NULL` if no saga has locked the instance)
- `updater_gen`: a generation counter that is incremented each time the lock is acquired by a new saga

Using these fields, we can add new datastore methods that try to acquire an instance update lock by doing the following (see the sketch at the end of this comment):

1. Generate a UUID for the saga, `saga_lock_id`. This will be performed in the saga itself and isn't part of this PR.
2. Read the instance record and interpret the value of the `updater_id` field as follows:
   - `NULL`: the lock is not held; we can acquire it by incrementing the `updater_gen` field and setting the `updater_id` field to the saga's UUID.
   - `updater_id == saga_id`: the saga already holds the lock, and we can proceed with the update.
   - `updater_id != saga_id`: another saga holds the lock, and we can't proceed with the update. Fail the operation.
3. Attempt to write back the updated instance record with the generation incremented and the `updater_id` set to the saga's ID, conditional on the `updater_gen` field being equal to the value that was read in step 2. This is equivalent to the atomic compare-and-swap operation that one might use to implement a non-distributed lock in a single address space.
   - If this fails because the generation number is outdated, try again (i.e. go to step 2).
   - If this succeeds, the lock was acquired successfully.

Additionally, we add a method for unlocking an instance record by clearing the `updater_id` field and incrementing the `updater_gen`. This query is conditional on the `updater_id` field being equal to the saga's UUID, to prevent cases where we accidentally unlock an instance that was locked by a different saga.

Introducing this distributed lock is considered fairly safe, as it will only ever be acquired in a saga, and the reverse action for the saga action that acquires the lock will ensure that the lock is released if the saga unwinds. Essentially, this is equivalent to a RAII guard releasing a lock when a thread panics in a Rust program.

Presently, none of these methods are actually used. The saga that uses them will be added in PR #5749. I've factored this change out into its own PR so that we can merge the foundation needed for that branch. Hopefully this makes the diff a bit smaller and easier to review, as well as decreasing merge-conflict churn with the schema changes.
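To make the compare-and-swap flow above concrete, here is a rough SQL sketch of the lock-acquire and unlock queries. This is illustrative only: the actual datastore methods are written with Diesel, and the placeholders (`$1` for the instance ID, `$2` for the saga's lock UUID) plus the concrete generation value are assumptions for the example.

```sql
-- Sketch: try to acquire the updater lock. Assume we previously read the
-- record and observed updater_gen = 4, with updater_id either NULL or equal
-- to our own saga ID. The WHERE clause is the "compare" half of the
-- compare-and-swap: if zero rows are updated, re-read the record and retry
-- (or fail if another saga holds the lock).
UPDATE instance
   SET updater_id = $2,
       updater_gen = updater_gen + 1
 WHERE id = $1
   AND updater_gen = 4
   AND (updater_id IS NULL OR updater_id = $2);

-- Sketch: release the lock, conditional on actually holding it, so that we
-- never accidentally unlock an instance locked by a different saga.
UPDATE instance
   SET updater_id = NULL,
       updater_gen = updater_gen + 1
 WHERE id = $1
   AND updater_id = $2;
```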
Force-pushed from ad224c9 to 2dc69d1.
Force-pushed from ba55078 to a58c14e.
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely if these changes remain unmerged for a long time.
Force-pushed from ba551b0 to 8e7a1ef.
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely if these changes remain unmerged for a long time.

---

Depends on #5854 and should merge only after that PR.

As part of #5749, it is desirable to distinguish *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind because the instance's generation number has changed while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without also having to guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that's in the `SagaUnwound` state (the new VMM ID directly replaces the old ID; see the SQL sketch at the end of this comment).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens; the instance's state is still derived from its active VMM's state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for the purposes of deciding whether to remove a VMM ID from an instance.

This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set by the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and its active VMM, the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
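As a rough illustration of the first rule above (whether the active VMM slot may be reused), here is a hypothetical SQL sketch. The column name and the `'saga_unwound'` state literal are assumptions for the example; the real check is folded into the datastore's conditional update queries rather than a standalone SELECT.

```sql
-- Sketch: is it legal to start this instance, i.e. is its active VMM slot
-- either empty or occupied by a VMM left behind by an unwound saga?
-- $1 is the instance ID.
SELECT instance.id
  FROM instance
  LEFT JOIN vmm ON vmm.id = instance.active_propolis_id
 WHERE instance.id = $1
   AND (instance.active_propolis_id IS NULL OR vmm.state = 'saga_unwound');
```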
Force-pushed from 25be3c1 to 335ede9.
(#5859) As part of ongoing work on improving instance lifecycle management (see #5749), we intend to remove the `InstanceRuntimeState` tracking from `sled-agent` and make Nexus the sole owner of `instance` records in CRDB, with a new `instance-update` saga taking over the responsibility of managing the instance's state transitions. In order to properly manage the instance state machine, Nexus will need information about the status of active migrations that is currently only available to sled-agents. For example, if an instance is migrating and a sled-agent reports that the source VMM is `Destroyed`, Nexus presently has no way to determine whether the source VMM was destroyed because the migration completed successfully, or because the source shut down prior to starting the migration, causing it to fail.

In order for Nexus to correctly manage state updates during live migration, we introduce a new `migration` table to the schema for tracking the state of ongoing migrations. The `instance-migrate` saga creates a `migration` record when beginning a migration. The Nexus and sled-agent APIs are extended to include migration state updates from sled-agents to Nexus. In *this* branch, the `migration` table is (basically) write-only; Nexus doesn't really read from it, and just stuffs updates into it. In the future, however, this will be used by the `instance-update` saga.

It occurred to me that, in addition to using the migration table for instance updates, it might also be useful to add an OMDB command to look up the status of a migration using this table. However, I decided that made more sense as a follow-up change, as I'd like to get back to integrating this into #5749.

Fixes #2948
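For illustration, the new table might look roughly like the DDL sketch below. The column names and types here are assumptions made for the example, not the actual schema migration.

```sql
-- Hypothetical sketch of a table tracking live-migration state.
CREATE TABLE IF NOT EXISTS migration (
    id UUID PRIMARY KEY,              -- migration ID shared with the sled-agents
    source_propolis_id UUID NOT NULL, -- VMM the guest is migrating out of
    target_propolis_id UUID NOT NULL, -- VMM the guest is migrating into
    source_state TEXT NOT NULL,       -- e.g. 'pending', 'in_progress', 'completed', 'failed'
    target_state TEXT NOT NULL,
    source_gen INT8 NOT NULL,         -- generation numbers for ordering updates
    target_gen INT8 NOT NULL,
    time_created TIMESTAMPTZ NOT NULL,
    time_deleted TIMESTAMPTZ
);
```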
In order to implement the `instance-update` saga being developed in #5749, it's necessary to have a way to query the database for the state of an instance, its active VMM, its target VMM, and its active migration in a single, atomic query. This will be used as a snapshot of the instance state that the update saga will operate on.

This commit adds an implementation of such a query in the form of a `DataStore::instance_fetch_all` method, returning an `InstanceSnapshot` struct that bundles together all these records. We do this by `LEFT JOIN`ing the `instance` table with the `vmm` and `migration` tables on the `instance`'s various ID columns, with a query that should look something like:

```sql
SELECT * FROM instance
LEFT JOIN vmm
    ON ((instance.active_propolis_id = vmm.id)
        OR (instance.target_propolis_id = vmm.id))
LEFT JOIN migration
    ON (instance.migration_id = migration.id)
WHERE instance.id = instance_id
```

Ideally, we would instead do two subqueries that join `instance` with `vmm` on the active and destination IDs separately and return a single row which contains both VMMs in a way that can be distinguished in the result set, but I wasn't able to figure out how to do that in Diesel. Instead, the query may return two rows if the instance has both an active and a target VMM, and we iterate over the result set in Rust code to find both VMMs. This is a bit uglier, but it works fine.
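For what it's worth, the single-row alternative described above is straightforward to express in raw SQL by joining the `vmm` table twice under different aliases; the difficulty was expressing it in Diesel, not in SQL itself. A sketch (illustrative only, using a `$1` placeholder for the instance ID):

```sql
-- Sketch: join vmm twice so the active and target VMMs come back as distinct
-- column sets in a single row, rather than as up to two rows.
SELECT instance.*,
       active_vmm.*,
       target_vmm.*,
       migration.*
  FROM instance
  LEFT JOIN vmm AS active_vmm ON active_vmm.id = instance.active_propolis_id
  LEFT JOIN vmm AS target_vmm ON target_vmm.id = instance.target_propolis_id
  LEFT JOIN migration         ON migration.id  = instance.migration_id
 WHERE instance.id = $1;
```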
In #5831, I added the `updater_gen` and `updater_id` fields for the instance-updater lock to the `InstanceRuntimeState` struct, based purely on vibes: "They change at runtime, right? Therefore, they ought to be 'runtime state'...". It turns out that this is actually Super Ultra Wrong, because any query which writes an `InstanceRuntimeState` without paying attention to the lock fields can inadvertently clobber them, either releasing the lock or setting it back to a previous lock holder. These fields should be on the `Instance` struct instead, to avoid this kind of thing. This commit moves them to the correct place. I figured it would be nice to land this separately from #5749, purely for the sake of keeping that diff small.
This code is *really* gruesome, but it should, hopefully, make the unwinding behavior correct in the case where the child saga unwinds *from* the `become_updater` node without having successfully locked the instance record yet. Previously, the lock could be leaked in this case, since the child saga unwinding does not make the start saga unwind. Now, however, the start saga will unwind for any child-saga error *except* an "already locked" error, in which case we can just complete cleanly. I've also updated the unwinding tests to assert that the lock is either held by the (fake) start saga OR unlocked, since unwinding from the first action can leave it locked by the parent.
Since we're not calling 'em "snapshots" anymore, this should be clearer.
This makes @gjcolombo's change from #6278 a bit more consistent with the rest of the code, and in particular, ensures that we always present the user with the same instance state names. It also addresses the `SagaUnwound` behavior Greg pointed out in [this comment][1]. [1]: #5749 (comment)
Force-pushed from 1fe5ef5 to 67c424c.
Okay, folks: I'm going to enable auto-merge on this branch so that it can finally land! 🎉 Thanks to @gjcolombo, @jmpesp, and @smklein for all your suggestions and guidance, code reviews, and help with testing! I'm really excited to finally get this merged (and move on with my life...)!
The `omicron_common::api::internal::InstanceRuntimeState` type was once a key component of how Nexus and sled-agents communicated about instances. Now, however, it has outlived its usefulness: #5749 changed Nexus' `cpapi_instances_put` and sled-agent's `instance_get` APIs to operate exclusively on `VmmRuntimeState`s rather than `InstanceRuntimeState`, leaving the sled-agent almost completely unaware of `InstanceRuntimeState`.

At present, the `InstanceRuntimeState` type remains in the internal API just because it's used in the `InstanceEnsureBody` type, which is sent to sled-agent's `instance_ensure_registered` API when a new VMM is registered. However, looking at the code that `instance_ensure_registered` calls into in sled-agent, we see that the only thing from the `InstanceRuntimeState` that the sled-agent actually *uses* is the migration ID of the current migration (if the instance is migrating in). Soooo...we can just replace the whole `InstanceRuntimeState` there with a `migration_id: Option<Uuid>`. Once that's done, there are no longer any internal APIs in Nexus or in sled-agent that actually use `InstanceRuntimeState`, so the type can be removed from the internal API entirely. All of the code in Nexus that deals with instance runtime states has already been changed to operate exclusively on the `nexus_db_model` version of the struct, so removing the API type is actually quite straightforward.

In addition, I've refactored the sled-agent `instance_ensure_registered` code paths a little bit to pass around the `InstanceEnsureBody`, rather than unpacking its fields at the HTTP entrypoint and then passing them all through the chain of method calls that actually gets it to the instance-runner task. Just passing the struct around means we can remove a bunch of `#[allow(clippy::too_many_arguments)]` attributes. It also makes changing the contents of the instance-ensure payload a bit easier, as we'll no longer need to change the arguments list in the whole maze of tiny little functions, all alike, that pass around all the pieces of the `InstanceEnsureBody`.
A number of bugs relating to guest instance lifecycle management have been observed. These include:

- Instances stuck in the `Starting` or `Stopping` states, with no way to forcibly terminate them (#4004: Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state)
- Instances going to the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226: Revisit moving instances to Failed in `handle_instance_put_result`)

These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular:

- The `instance` record is currently split between Nexus and sled-agent(s). Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, its target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed.
- Some operations, like `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is not the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network-partitioned, or encounters a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed.

This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following changes:
- Nexus is now the sole owner of the `instance` record. Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly), and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update.
- When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record, so that fewer state transitions require changes to the `instance` record. PR #5854 (Split instance state enum into instance/VMM state enums) laid the groundwork for this change, but it's relevant here as well.
- All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga. I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death.
- Consistency of instance state changes is ensured by distributed locking. State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again and the sled-agent publishes that state change to a different Nexus, lots of bad things can happen, since the second state change may be performed from the previous initial state, when it should have a "happens-after" relationship with the other state transition. And some operations may contradict each other when performed concurrently.
To prevent these race conditions, this PR has the dubious honor of using the first distributed lock in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 ([nexus] add instance-updater lock) --- see that branch for more discussion of locking.
Background tasks are added to prevent missed updates. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas (a sketch of such a query appears at the end of this description).

Currently, the instance update saga runs in the following cases:
- The instance's active VMM is `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked.

The inner workings of the instance-update saga itself are fairly complex, and it has some interesting idiosyncrasies relative to the existing sagas. I've written up a lengthy comment that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message.
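As promised above, here is a rough sketch of the kind of query the `instance-updater` background task might run to find instances that need an update saga. The column names and the specific condition shown (active VMM destroyed, no updater lock held) are illustrative assumptions; the real task covers more cases than this one.

```sql
-- Sketch: instances whose active VMM has been destroyed but which are not
-- currently locked by a running instance-update saga.
SELECT instance.id
  FROM instance
  JOIN vmm ON vmm.id = instance.active_propolis_id
 WHERE vmm.state = 'destroyed'
   AND instance.updater_id IS NULL
   AND instance.time_deleted IS NULL;
```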