Perform instance state transitions in instance-update saga #5749

Conversation
Force-pushed from 943dfb2 to afd2dad.
**Note**: This change is part of the ongoing work on instance lifecycle management in PR #5749. It isn't strictly necessary on its own; it's just a component of the upcoming instance-update saga. However, I thought it would be easier to review if I factored this change out into a separate PR that can be reviewed and merged on its own.

The instance-update saga (see PR #5749) will only clean up after VMMs whose IDs appear in an `instance` record. When a live migration finishes (successfully or not), we want to allow a new migration to begin as soon as possible, which means we have to unlink the "unused" side of the migration --- the source if the migration succeeded, or the target if it failed --- from the instance, even though that VMM may not be fully destroyed yet. Once this happens, the instance-update saga will no longer be able to clean up these VMMs, so we'll need a separate task that cleans up these "abandoned" VMMs in the background.

This branch introduces an `abandoned_vmm_reaper` background task that's responsible for doing this. It queries the database to list VMMs which are:

- in the `Destroyed` state,
- not yet deleted (i.e. `time_deleted` IS NULL), and
- not pointed to by their corresponding instances (neither the `active_propolis_id` nor the `target_propolis_id` equals the VMM's ID).

For each VMM returned by this query, the `abandoned_vmm_reaper` task will:

- remove the `sled_resource` reservation for that VMM, and
- set `time_deleted` on the VMM record if it was not already set.

This cleanup process will be executed periodically in the background. Eventually, the background task will also be explicitly triggered by the instance-update saga when it knows it has abandoned a VMM.

As an aside, I noticed that the current implementation of `DataStore::vmm_mark_deleted` always unconditionally sets the `time_deleted` field on a VMM record, even if it's already set. This is "probably fine" for overall correctness: the VMM remains deleted, so the operation is still idempotent-ish. But it's not *great*, as it means that any queries for VMMs deleted before a certain timestamp may not be strictly correct, and we're updating the database more often than we really need to. So, I've changed it to only set `time_deleted` if the record's `time_deleted` is null, using `check_if_exists` so that the method still returns `Ok` if the record was already deleted --- the caller can inspect the returned `bool` to determine whether or not they were the actual deleter, but the query still doesn't fail.
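As a rough illustration of the selection criteria and the conditional soft-delete described above, here is a sketch in SQL. The table and column names (`vmm.instance_id`, `vmm.state`, and so on) are assumptions made for the example and may not match the actual schema; the real implementation is expressed as Diesel queries in the datastore.

```sql
-- Sketch: find "abandoned" VMMs (column names are assumptions) --- VMMs that
-- are destroyed, not yet soft-deleted, and no longer referenced by their
-- instance's active or target Propolis ID.
SELECT vmm.id
  FROM vmm
  JOIN instance ON instance.id = vmm.instance_id
 WHERE vmm.state = 'destroyed'
   AND vmm.time_deleted IS NULL
   AND instance.active_propolis_id IS DISTINCT FROM vmm.id
   AND instance.target_propolis_id IS DISTINCT FROM vmm.id;

-- Sketch: the conditional soft-delete described above ($1 is the VMM ID).
-- time_deleted is only set if it is still NULL, so repeated calls do not
-- rewrite the deletion timestamp.
UPDATE vmm
   SET time_deleted = now()
 WHERE id = $1
   AND time_deleted IS NULL;
```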
Force-pushed from afd2dad to 30a341d.
In order to simplify the instance lifecycle state machine and ensure that instance state updates are processed reliably, we intend to perform instance state updates in a saga (which will be added in PR #5749). This saga will require a notion of mutual exclusion between update sagas for the same instance, in order to avoid race conditions like the following:

1. Sagas `S1` and `S2` start at the same time and observe the same instance/VMM states, which indicate that the instance's active VMM has shut down.
2. `S1` clears all the resources/provisioning counters and marks the instance as `Stopped`.
3. The user restarts the instance.
4. `S2` clears the same instance provisioning counters again.

Presently, these races are avoided by the fact that instance state updates are performed partially in `sled-agent`, which serves as an "actor" with exclusive ownership over the state transition. Moving these state transitions to Nexus requires introducing mutual exclusion.

This commit adds a distributed lock on instance state transitions to the datastore. We add the following fields to the `instance` table:

- `updater_id`: the UUID of the saga currently holding the update lock on the instance (or `NULL` if no saga has locked the instance)
- `updater_gen`: a generation counter that is incremented each time the lock is acquired by a new saga

Using these fields, we can add new datastore methods that try to acquire an instance update lock by doing the following (see the sketch at the end of this comment):

1. Generate a UUID for the saga, `saga_lock_id`. This will be performed in the saga itself and isn't part of this PR.
2. Read the instance record and interpret the value of the `updater_id` field as follows:
   - `NULL`: the lock is not held; we can acquire it by incrementing the `updater_gen` field and setting the `updater_id` field to the saga's UUID.
   - `updater_id == saga_id`: the saga already holds the lock, and we can proceed with the update.
   - `updater_id != saga_id`: another saga holds the lock, and we can't proceed with the update. Fail the operation.
3. Attempt to write back the updated instance record with the generation incremented and the `updater_id` set to the saga's ID, conditional on the `updater_gen` field being equal to the value that was read in step 2. This is equivalent to the atomic compare-and-swap operation that one might use to implement a non-distributed lock in a single address space.
   - If this fails because the generation number is outdated, try again (i.e. go to step 2).
   - If this succeeds, the lock was acquired successfully.

Additionally, we add a method for unlocking an instance record by clearing the `updater_id` field and incrementing the `updater_gen`. This query is conditional on the `updater_id` field being equal to the saga's UUID, to prevent cases where we accidentally unlock an instance that was locked by a different saga.

Introducing this distributed lock is considered fairly safe, as it will only ever be acquired in a saga, and the reverse action for the saga action that acquires the lock will ensure that the lock is released if the saga unwinds. Essentially, this is equivalent to a RAII guard releasing a lock when a thread panics in a Rust program.

Presently, none of these methods are actually used. The saga that uses them will be added in PR #5749. I've factored this change out into its own PR so that we can merge the foundation needed for that branch. Hopefully this makes the diff a bit smaller and easier to review, as well as decreasing merge-conflict churn with the schema changes.
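To make the compare-and-swap flow above concrete, here is a rough SQL sketch of the lock-acquire and unlock queries. This is illustrative only: the actual datastore methods are written with Diesel, and the placeholders (`$1` for the instance ID, `$2` for the saga's lock UUID) plus the concrete generation value are assumptions for the example.

```sql
-- Sketch: try to acquire the updater lock. Assume we previously read the
-- record and observed updater_gen = 4, with updater_id either NULL or equal
-- to our own saga ID. The WHERE clause is the "compare" half of the
-- compare-and-swap: if zero rows are updated, re-read the record and retry
-- (or fail if another saga holds the lock).
UPDATE instance
   SET updater_id = $2,
       updater_gen = updater_gen + 1
 WHERE id = $1
   AND updater_gen = 4
   AND (updater_id IS NULL OR updater_id = $2);

-- Sketch: release the lock, conditional on actually holding it, so that we
-- never accidentally unlock an instance locked by a different saga.
UPDATE instance
   SET updater_id = NULL,
       updater_gen = updater_gen + 1
 WHERE id = $1
   AND updater_id = $2;
```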
Force-pushed from ad224c9 to 2dc69d1.
Force-pushed from ba55078 to a58c14e.
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely if these changes remain unmerged for a long time.
Force-pushed from ba551b0 to 8e7a1ef.
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely if these changes remain unmerged for a long time.

---

Depends on #5854 and should merge only after that PR.

As part of #5749, it is desirable to distinguish *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind because the instance's generation number has changed while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without also having to guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that's in the `SagaUnwound` state (the new VMM ID directly replaces the old ID; see the SQL sketch at the end of this comment).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens; the instance's state is still derived from its active VMM's state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for the purposes of deciding whether to remove a VMM ID from an instance.

This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set by the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and its active VMM, the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
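As a rough illustration of the first rule above (whether the active VMM slot may be reused), here is a hypothetical SQL sketch. The column name and the `'saga_unwound'` state literal are assumptions for the example; the real check is folded into the datastore's conditional update queries rather than a standalone SELECT.

```sql
-- Sketch: is it legal to start this instance, i.e. is its active VMM slot
-- either empty or occupied by a VMM left behind by an unwound saga?
-- $1 is the instance ID.
SELECT instance.id
  FROM instance
  LEFT JOIN vmm ON vmm.id = instance.active_propolis_id
 WHERE instance.id = $1
   AND (instance.active_propolis_id IS NULL OR vmm.state = 'saga_unwound');
```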
Force-pushed from 25be3c1 to 335ede9.
(#5859) As part of ongoing work on improving instance lifecycle management (see #5749), we intend to remove the `InstanceRuntimeState` tracking from `sled-agent` and make Nexus the sole owner of `instance` records in CRDB, with a new `instance-update` saga taking over the responsibility of managing the instance's state transitions. In order to properly manage the instance state machine, Nexus will need information about the status of active migrations that is currently only available to sled-agents. For example, if an instance is migrating and a sled-agent reports that the source VMM is `Destroyed`, Nexus presently has no way to determine whether the source VMM was destroyed because the migration completed successfully, or because the source shut down prior to starting the migration, causing it to fail.

In order for Nexus to correctly manage state updates during live migration, we introduce a new `migration` table to the schema for tracking the state of ongoing migrations. The `instance-migrate` saga creates a `migration` record when beginning a migration. The Nexus and sled-agent APIs are extended to include migration state updates from sled-agents to Nexus. In *this* branch, the `migration` table is (basically) write-only; Nexus doesn't really read from it, and just stuffs updates into it. In the future, however, this will be used by the `instance-update` saga.

It occurred to me that, in addition to using the migration table for instance updates, it might also be useful to add an OMDB command to look up the status of a migration using this table. However, I decided that made more sense as a follow-up change, as I'd like to get back to integrating this into #5749.

Fixes #2948
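For illustration, the new table might look roughly like the DDL sketch below. The column names and types here are assumptions made for the example, not the actual schema migration.

```sql
-- Hypothetical sketch of a table tracking live-migration state.
CREATE TABLE IF NOT EXISTS migration (
    id UUID PRIMARY KEY,              -- migration ID shared with the sled-agents
    source_propolis_id UUID NOT NULL, -- VMM the guest is migrating out of
    target_propolis_id UUID NOT NULL, -- VMM the guest is migrating into
    source_state TEXT NOT NULL,       -- e.g. 'pending', 'in_progress', 'completed', 'failed'
    target_state TEXT NOT NULL,
    source_gen INT8 NOT NULL,         -- generation numbers for ordering updates
    target_gen INT8 NOT NULL,
    time_created TIMESTAMPTZ NOT NULL,
    time_deleted TIMESTAMPTZ
);
```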
In order to implement the `instance-update` saga being developed in #5749, it's necessary to have a way to query the database for the state of an instance, its active VMM, its target VMM, and its active migration in a single, atomic query. This will be used as a snapshot of the instance state that the update saga will operate on.

This commit adds an implementation of such a query in the form of a `DataStore::instance_fetch_all` method, returning an `InstanceSnapshot` struct that bundles together all these records. We do this by `LEFT JOIN`ing the `instance` table with the `vmm` and `migration` tables on the `instance`'s various ID columns, with a query that should look something like:

```sql
SELECT * FROM instance
LEFT JOIN vmm
    ON ((instance.active_propolis_id = vmm.id)
        OR (instance.target_propolis_id = vmm.id))
LEFT JOIN migration
    ON (instance.migration_id = migration.id)
WHERE instance.id = instance_id
```

Ideally, we would instead do two subqueries that join `instance` with `vmm` on the active and destination IDs separately and return a single row which contains both VMMs in a way that can be distinguished in the result set, but I wasn't able to figure out how to do that in Diesel. Instead, the query may return two rows if the instance has both an active and a target VMM, and we iterate over the result set in Rust code to find both VMMs. This is a bit uglier, but it works fine.
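For what it's worth, the single-row alternative described above is straightforward to express in raw SQL by joining the `vmm` table twice under different aliases; the difficulty was expressing it in Diesel, not in SQL itself. A sketch (illustrative only, using a `$1` placeholder for the instance ID):

```sql
-- Sketch: join vmm twice so the active and target VMMs come back as distinct
-- column sets in a single row, rather than as up to two rows.
SELECT instance.*,
       active_vmm.*,
       target_vmm.*,
       migration.*
  FROM instance
  LEFT JOIN vmm AS active_vmm ON active_vmm.id = instance.active_propolis_id
  LEFT JOIN vmm AS target_vmm ON target_vmm.id = instance.target_propolis_id
  LEFT JOIN migration         ON migration.id  = instance.migration_id
 WHERE instance.id = $1;
```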
In #5831, I added the `updater_gen` and `updater_id` fields for the instance-updater lock to the `InstanceRuntimeState` struct, based purely on vibes: "They change at runtime, right? Therefore, they ought to be 'runtime state'...". It turns out that this is actually Super Ultra Wrong, because any query which writes an `InstanceRuntimeState` without paying attention to the lock fields can inadvertently clobber them, either releasing the lock or setting it back to a previous lock holder. These fields should be on the `Instance` struct instead, to avoid this kind of thing. This commit moves them to the correct place. I figured it would be nice to land this separately from #5749, purely for the sake of keeping that diff small.
This code is *really* gruesome, but it should, hopefully, make the unwinding behavior correct in the case where the child saga unwinds *from* the `become_updater` node without having successfully locked the instance record yet. Previously, the lock could be leaked in this case, since the child saga unwinding does not make the start saga unwind. Now, however, the start saga will unwind for any child-saga error *except* an "already locked" error, in which case we can just complete cleanly. I've also updated the unwinding tests to assert that the lock is either held by the (fake) start saga OR unlocked, since unwinding from the first action can leave it locked by the parent.
Since we're not calling 'em "snapshots" anymore, this should be clearer.
This makes @gjcolombo's change from #6278 a bit more consistent with the rest of the code, and in particular, ensures that we always present the user with the same instance state names. It also addresses the `SagaUnwound` behavior Greg pointed out in [this comment][1]. [1]: #5749 (comment)
Force-pushed from 1fe5ef5 to 67c424c.
Okay, folks: I'm going to enable auto-merge on this branch so that it can finally land! 🎉 Thanks to @gjcolombo, @jmpesp, and @smklein for all your suggestions and guidance, code reviews, and help with testing! I'm really excited to finally get this merged (and move on with my life...)!
The `omicron_common::api::internal::InstanceRuntimeState` type was once a key component of how Nexus and sled-agents communicated about instances. Now, however, it has outlived its usefulness: #5749 changed Nexus' `cpapi_instances_put` and sled-agent's `instance_get` APIs to operate exclusively on `VmmRuntimeState`s rather than `InstanceRuntimeState`, leaving the sled-agent almost completely unaware of `InstanceRuntimeState`.

At present, the `InstanceRuntimeState` type remains in the internal API just because it's used in the `InstanceEnsureBody` type, which is sent to sled-agent's `instance_ensure_registered` API when a new VMM is registered. However, looking at the code that `instance_ensure_registered` calls into in sled-agent, we see that the only thing from the `InstanceRuntimeState` that the sled-agent actually *uses* is the migration ID of the current migration (if the instance is migrating in). Soooo...we can just replace the whole `InstanceRuntimeState` there with a `migration_id: Option<Uuid>`. Once that's done, there are no longer any internal APIs in Nexus or in sled-agent that actually use `InstanceRuntimeState`, so the type can be removed from the internal API entirely. All of the code in Nexus that deals with instance runtime states has already been changed to operate exclusively on the `nexus_db_model` version of the struct, so removing the API type is actually quite straightforward.

In addition, I've refactored the sled-agent `instance_ensure_registered` code paths a little bit to pass around the `InstanceEnsureBody`, rather than unpacking its fields at the HTTP entrypoint and then passing them all through the chain of method calls that actually gets it to the instance-runner task. Just passing the struct around means we can remove a bunch of `#[allow(clippy::too_many_arguments)]` attributes. It also makes changing the contents of the instance-ensure payload a bit easier, as we'll no longer need to change the arguments list in the whole maze of tiny little functions, all alike, that pass around all the pieces of the `InstanceEnsureBody`.
A number of bugs relating to guest instance lifecycle management have been observed. These include:

- Instances stuck in the `Starting` or `Stopping` states, with no way to forcibly terminate them (#4004: Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state)
- Instances going to the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226: Revisit moving instances to Failed in `handle_instance_put_result`)

These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular:

- The `instance` record is currently split between Nexus and sled-agent(s). Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, its target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed.
- Some operations, like `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is not the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network-partitioned, or encounters a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed.

This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following changes:
- Nexus is now the sole owner of the `instance` record. Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly), and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update.
- When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record, so that fewer state transitions require changes to the `instance` record. PR #5854 (Split instance state enum into instance/VMM state enums) laid the groundwork for this change, but it's relevant here as well.
- All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga. I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death.
- Consistency of instance state changes is ensured by distributed locking. State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again and the sled-agent publishes that state change to a different Nexus, lots of bad things can happen, since the second state change may be performed from the previous initial state, when it should have a "happens-after" relationship with the other state transition. And some operations may contradict each other when performed concurrently.
To prevent these race conditions, this PR has the dubious honor of using the first distributed lock in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 ([nexus] add instance-updater lock) --- see that branch for more discussion of locking.
Background tasks are added to prevent missed updates. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas (a sketch of such a query appears at the end of this description).

Currently, the instance update saga runs in the following cases:
- The instance's active VMM is `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked.

The inner workings of the instance-update saga itself are fairly complex, and it has some interesting idiosyncrasies relative to the existing sagas. I've written up a lengthy comment that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message.
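As promised above, here is a rough sketch of the kind of query the `instance-updater` background task might run to find instances that need an update saga. The column names and the specific condition shown (active VMM destroyed, no updater lock held) are illustrative assumptions; the real task covers more cases than this one.

```sql
-- Sketch: instances whose active VMM has been destroyed but which are not
-- currently locked by a running instance-update saga.
SELECT instance.id
  FROM instance
  JOIN vmm ON vmm.id = instance.active_propolis_id
 WHERE vmm.state = 'destroyed'
   AND instance.updater_id IS NULL
   AND instance.time_deleted IS NULL;
```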