Perform instance state transitions in instance-update saga #5749

Merged: 234 commits merged from eliza/instance-updater into main on Aug 9, 2024

Conversation

hawkw (Member) commented May 13, 2024

A number of bugs relating to guest instance lifecycle management have been observed. These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular:

  • (Conceptual) ownership of the CRDB instance record is currently split between Nexus and sled-agent(s). Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, its target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can get out of sync. When an instance is migrating, its state is updated by two different sled-agents, which may generate state updates that conflict with each other. And, splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed.
  • Nexus doesn't ensure that instance state updates are processed reliably. Instance state transitions triggered by user actions, such as instance-start and instance-delete, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is not the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, is network-partitioned, or encounters a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed.

This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following changes:

  • Nexus is now the sole owner of the instance record. Sled-agents no longer have their own copies of an instance's InstanceRuntimeState, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the vmm and migration records (which are never modified by Nexus directly) and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update.

  • When an instance has an active VMM, its effective external state is determined primarily by the active vmm record, so that fewer state transitions require changes to the instance record. PR Split instance state enum into instance/VMM state enums #5854 laid the groundwork for this change, but it's relevant here as well (see the sketch after this list).

  • All updates to an instance record (and resources conceptually owned by that instance) are performed by a distributed saga. I've introduced a new instance-update saga, which is responsible for performing all changes to the instance record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death.

  • Consistency of instance state changes is ensured by distributed locking. State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a different Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it should have a "happens-after" relationship with the other state transition. And, some operations may contradict each other when performed concurrently.

    To prevent these race conditions, this PR has the dubious honor of using the first distributed lock in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR [nexus] add instance-updater lock #5831 --- see that branch for more discussion of locking.

  • Background tasks are added to prevent missed updates. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or simply drops the state update, we add a new instance-updater background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas (a query sketch follows the list of cases below).
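To illustrate the earlier point about deriving an instance's effective external state primarily from its active vmm record, here's a rough SQL sketch. This is purely conceptual: the column names are simplified and it is not the actual Nexus query.

```sql
-- Conceptual sketch only: table and column names are simplified, not the real
-- Nexus schema or query. When an instance has an active VMM, the VMM's state
-- determines the externally visible instance state; otherwise the instance
-- record's own state is used.
SELECT
    i.id,
    CASE
        WHEN v.id IS NULL THEN i.state  -- no active VMM: fall back to the instance record
        ELSE v.state                    -- active VMM present: its state wins
    END AS effective_state
FROM instance AS i
LEFT JOIN vmm AS v ON i.active_propolis_id = v.id
WHERE i.id = $1;
```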

Currently, the instance update saga runs in the following cases:

  • An instance's active VMM transitions to Destroyed, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked.
  • Either side of an instance's live migration reports that the migration has completed successfully.
  • Either side of an instance's live migration reports that the migration has failed.
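For a sense of what the instance-updater background task looks for, here's a conceptual SQL sketch covering the cases above. The state and column names are illustrative; this is not the actual datastore implementation.

```sql
-- Conceptual sketch only: state and column names are illustrative.
-- Find instances whose active VMM has been destroyed, or whose current
-- migration has reached a terminal state, and which therefore need an
-- instance-update saga to be started.
SELECT i.id
FROM instance AS i
LEFT JOIN vmm AS v ON i.active_propolis_id = v.id
LEFT JOIN migration AS m ON i.migration_id = m.id
WHERE i.time_deleted IS NULL
  AND (v.state = 'destroyed' OR m.state IN ('completed', 'failed'));
```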

The inner workings of the instance-update saga itself are fairly complex, and it has some interesting idiosyncrasies relative to the existing sagas. I've written up a lengthy comment that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message.

hawkw force-pushed the eliza/instance-updater branch 4 times, most recently from 943dfb2 to afd2dad on May 20, 2024
hawkw added a commit that referenced this pull request May 24, 2024
**Note**: This change is part of the ongoing work on instance lifecycle
management that I'm working on in PR #5749. It's not actually necessary
on its own; it's just a component of the upcoming instance updater saga.
However, I thought it would be easier to review if I factored out this
change into a separate PR that can be reviewed and merged on its own.

The instance update saga (see PR #5749) will only clean up after VMMs
whose IDs appear in an `instance` record. When a live migration finishes
(successfully or not), we want to allow a new migration to begin as soon
as possible, which means we have to unlink the “unused” side of the
migration --- the source if migration succeeded, or the target if it
failed --- from the instance, even though that VMM may not be fully
destroyed yet. Once this happens, the instance update saga will no
longer be able to clean up these VMMs, so we’ll need a separate task
that cleans up these "abandoned" VMMs in the background.

This branch introduces an `abandoned_vmm_reaper` background task that's
responsible for doing this. It queries the database to list VMMs which
are:

- in the `Destroyed` state
- not deleted yet (i.e. `time_deleted` IS NULL)
- not pointed to by their corresponding instances (neither the
  `active_propolis_id` nor the `target_propolis_id` equals the VMM's ID)
  
For any VMMs returned by this query, the `abandoned_vmm_reaper` task
will (sketched below):
- remove the `sled_resource` reservation for that VMM
- set the `time_deleted` on the VMM record if it was not already set.
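Conceptually, the reaper's work looks something like the following SQL. This is a sketch only: the real task goes through the datastore layer, and the `sled_resource` cleanup statement shown here is illustrative.

```sql
-- Conceptual sketch only; the actual task uses datastore methods, not raw SQL.
-- Find abandoned VMMs: destroyed, not yet soft-deleted, and no longer referenced
-- by their instance as either the active or the target Propolis.
SELECT vmm.id
FROM vmm
JOIN instance ON vmm.instance_id = instance.id
WHERE vmm.state = 'destroyed'
  AND vmm.time_deleted IS NULL
  AND (instance.active_propolis_id IS NULL OR instance.active_propolis_id != vmm.id)
  AND (instance.target_propolis_id IS NULL OR instance.target_propolis_id != vmm.id);

-- For each abandoned VMM: release its sled resource reservation and mark it
-- deleted (illustrative statements).
DELETE FROM sled_resource WHERE id = $1;
UPDATE vmm SET time_deleted = now() WHERE id = $1 AND time_deleted IS NULL;
```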

This cleanup process will be executed periodically in the background.
Eventually, the background task will also be explicitly triggered by the
instance update saga when it knows it has abandoned a VMM.

As an aside, I noticed that the current implementation of
`DataStore::vmm_mark_deleted` will always unconditionally set the
`time_deleted` field on a VMM record, even if it's already set. This is
"probably fine" for overall correctness: the VMM remains deleted, so the
operation is still idempotent-ish. But, it's not *great*, as it means
that any queries for VMMs deleted before a certain timestamp may not be
strictly correct, and we're updating the database more frequently than
we really need to. So, I've gone ahead and changed it to only set
`time_deleted` if the record's `time_deleted` is null, using
`check_if_exists` so that the method still returns `Ok` if the record
was already deleted --- the caller can inspect the returned `bool` to
determine whether or not they were the actual deleter, but the query
still doesn't fail.
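In SQL terms, the new behavior is roughly the following. This is a sketch; the real change is expressed through Diesel and `check_if_exists`.

```sql
-- Conceptual sketch only. Set time_deleted at most once; if the record is
-- already soft-deleted, leave the original deletion timestamp untouched.
UPDATE vmm
SET time_deleted = now()
WHERE id = $1
  AND time_deleted IS NULL;
-- Whether or not a row was updated, the datastore method still returns Ok; the
-- caller can inspect the returned bool to learn if it was the actual deleter.
```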
hawkw added a commit that referenced this pull request May 29, 2024
In order to simplify the instance lifecycle state machine and ensure
that instance state updates are processed reliably, we intend to perform
instance state updates in a saga (which will be added in PR #5749). This
saga will require a notion of mutual exclusion between update sagas for
the same instance, in order to avoid race conditions like the following:

1. Sagas `S1` and `S2` start at the same time and observe the same
   instance/VMM states, which indicate that the instance’s active VMM
   has shut down
2. `S1` clears all the resources/provisioning counters and marks the
   instance as `Stopped`
3. User restarts the instance
4. `S2` clears the same instance provisioning counters again

Presently, these races are avoided by the fact that instance state
updates are performed partially in `sled-agent`, which serves as an
"actor" with exclusive ownership over the state transition. Moving these
state transitions to Nexus requires introducing mutual exclusion.

This commit adds a distributed lock on instance state transitions to the
datastore. We add the following fields to the `instance` table:

- `updater_id`, which is the UUID of the saga currently holding the
  update lock on the instance (or `NULL` if no saga has locked the
  instance)
- `updater_gen`, a generation counter that is incremented each time the
  lock is acquired by a new saga
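Roughly, in DDL terms, the additions look like this. It's a sketch only; the actual schema migration may use different types and defaults.

```sql
-- Conceptual sketch only; column types and defaults are simplified.
ALTER TABLE instance
    ADD COLUMN updater_id UUID,                     -- saga holding the lock, or NULL
    ADD COLUMN updater_gen INT NOT NULL DEFAULT 0;  -- bumped on each lock/unlock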

Using these fields, we can add new datastore methods to try and acquire
an instance update lock by doing the following:

1. Generate a UUID for the saga, `saga_lock_id`. This will be performed
   in the saga itself and isn't part of this PR.
2. Read the instance record and interpret the value of the `updater_id`
  field as follows:
  - `NULL`: lock not held, we can acquire it by incrementing the
    `updater_gen` field and setting the `updater_id` field to the saga's
    UUID.
  - `updater_id == saga_id`: the saga already holds the lock, we can
    proceed with the update.
  - `updater_id != saga_id`: another saga holds the lock, we can't
    proceed with the update. Fail the operation.
3. Attempt to write back the updated instance record with the generation
   incremented and the `updater_id` set to the saga's ID, conditional on
   the `updater_gen` field being equal to the value that was read when
   reading the instance record. This is equivalent to the atomic
   compare-and-swap operation that one might use to implement a
   non-distributed lock in a single address space.
  - If this fails because the generation number is outdated, try again
    (i.e. goto (2)).
  - If this succeeds, the lock was acquired successfully.
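Steps (2) and (3) amount to a compare-and-swap; in SQL terms it looks something like this (a conceptual sketch, not the actual Diesel query):

```sql
-- Conceptual sketch only. $1 = this saga's saga_lock_id, $2 = instance ID,
-- $3 = the updater_gen value observed when the instance record was read.
UPDATE instance
SET updater_id = $1,
    updater_gen = updater_gen + 1
WHERE id = $2
  AND updater_gen = $3
  AND (updater_id IS NULL OR updater_id = $1);
-- Zero rows updated means either the generation moved (retry from step 2) or
-- another saga holds the lock (fail the operation).
```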

Additionally, we can add a method for unlocking an instance record by
clearing the `updater_id` field and incrementing the `updater_gen`. This
query is conditional on the `updater_id` field being equal to the saga's
UUID, to prevent cases where we accidentally unlock an instance that was
locked by a different saga.
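The unlock path is the mirror image (again a sketch, not the actual query):

```sql
-- Conceptual sketch only. $1 = instance ID, $2 = this saga's saga_lock_id.
UPDATE instance
SET updater_id = NULL,
    updater_gen = updater_gen + 1
WHERE id = $1
  AND updater_id = $2;
-- Zero rows updated means some other saga now holds the lock, so we must not
-- clobber it.
```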

Introducing this distributed lock is considered fairly safe, as it will
only ever be acquired in a saga, and the reverse action for the saga
action that acquires the lock will ensure that the lock is released if
the saga unwinds. Essentially, this is equivalent to a RAII guard
releasing a lock when a thread panics in a single-threaded Rust program.

Presently, none of these methods are actually used. The saga that uses
them will be added in PR #5749. I've factored out this change into its
own PR so that we can merge the foundation needed for that branch.
Hopefully this makes the diff a bit smaller and easier to review, as
well as decreasing merge conflict churn with the schema changes.
hawkw added a commit that referenced this pull request Jun 4, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
  `active_propolis_id` or `destination_propolis_id` (respectively) is
  either `NULL` or refers to a VMM that’s in the `SagaUnwound` state
  (the new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
  then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
  state, nothing special happens–the instance’s state is still derived
  from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
  VMMs for purposes of deciding whether to remove a VMM ID from an
  instance.
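As a sketch of the first rule, a start (or migrate) attempt might check eligibility with something like the following. This is illustrative SQL, not the actual implementation.

```sql
-- Conceptual sketch only. The instance's active Propolis slot is considered
-- free if it is NULL or points at a VMM left behind by an unwound saga.
SELECT instance.id
FROM instance
LEFT JOIN vmm ON instance.active_propolis_id = vmm.id
WHERE instance.id = $1
  AND (instance.active_propolis_id IS NULL OR vmm.state = 'saga_unwound');
```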

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set by the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and its active VMM,
the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
hawkw mentioned this pull request on Jun 4, 2024
hawkw added a commit that referenced this pull request Jun 12, 2024
…5859)

As part of ongoing work on improving instance lifecycle management (see
#5749), we intend to remove the `InstanceRuntimeState` tracking from
`sled-agent`, and make Nexus the sole owner of `instance` records in
CRDB, with a new `instance-update` saga taking over the responsibility
of managing the instance's state transitions.

In order to properly manage the instance state machine, Nexus will need
information about the status of active migrations that are currently
only available to sled-agents. For example, if an instance is migrating,
and a sled agent reports that the source VMM is `Destroyed`, Nexus
doesn't presently have the capability to determine whether the source
VMM was destroyed because the migration completed successfully, or
because the source shut down prior to starting the migration, resulting
in a failure.

In order for Nexus to correctly manage state updates during live
migration, we introduce a new `migration` table to the schema for
tracking the state of ongoing migrations. The `instance-migrate` saga
creates a `migration` record when beginning a migration. The Nexus and
sled-agent APIs are extended to include migration state updates from
sled-agents to Nexus. In *this* branch, the `migration` table is
(basically) write-only; Nexus doesn't really read from it, and just
stuffs updates into it. In the future, however, this will be used by the
`instance-update` saga.
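For a rough idea of what a `migration` record tracks, here's an illustrative sketch; the column names and types here are hypothetical, not the actual schema.

```sql
-- Conceptual sketch only; a real implementation would likely also carry
-- per-side generation numbers so updates from each sled-agent can be ordered.
CREATE TABLE migration (
    id UUID PRIMARY KEY,
    instance_id UUID NOT NULL,
    source_propolis_id UUID NOT NULL,   -- the VMM migrating out
    target_propolis_id UUID NOT NULL,   -- the VMM migrating in
    source_state TEXT NOT NULL,         -- as reported by the source sled-agent
    target_state TEXT NOT NULL,         -- as reported by the target sled-agent
    time_created TIMESTAMPTZ NOT NULL,
    time_deleted TIMESTAMPTZ
);
```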

It occurred to me that, in addition to using the migration table for
instance updates, it might also be useful to add an OMDB command to look
up the status of a migration using this table. However, I decided that
made more sense as a follow-up change, as I'd like to get back to
integrating this into #5749.

Fixes #2948
hawkw added a commit that referenced this pull request Jun 12, 2024
In order to implement the `instance-update` saga ongoing in #5749, it's
necessary to have a way to query the database for the state of an
instance, its active VMM, its target VMM, and active migration in a
single, atomic query. This will be used as a snapshot of the instance
state that the update saga will operate on.

This commit adds an implementation of such a query in the form of a
`DataStore::instance_fetch_all` method, returning an `InstanceSnapshot`
struct that bundles together all these records. We do this by `LEFT
JOIN`ing the `instance` table with the `vmm` and `migration` table on
the `instance`'s various ID columns, with a query that should look
something like:

```sql
SELECT * FROM instance
LEFT JOIN vmm
    ON ((instance.active_propolis_id = vmm.id)
    OR (instance.target_propolis_id = vmm.id))
LEFT JOIN migration ON (instance.migration_id = migration.id)
WHERE instance.id = instance_id
```

Ideally, we would instead do two subqueries that join `instance` with
`vmm` on the active and destination IDs separately and return a single
row which contains both in a way that can be distinguished in the result
set, but I wasn't able to figure out how to do that in Diesel. Instead,
the query may return two rows if the instance has both an active and a
target VMM, and we iterate over the result set in Rust code to find both
VMMs. This is a bit uglier, but it works fine.
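For comparison, the single-row variant described above would look roughly like this (a sketch of the query we'd have liked to express in Diesel):

```sql
-- Conceptual sketch of the single-row form: join the vmm table twice, once per
-- Propolis ID, so the active and target VMMs come back as distinct column sets.
SELECT instance.*, active_vmm.*, target_vmm.*, migration.*
FROM instance
LEFT JOIN vmm AS active_vmm ON instance.active_propolis_id = active_vmm.id
LEFT JOIN vmm AS target_vmm ON instance.target_propolis_id = target_vmm.id
LEFT JOIN migration ON instance.migration_id = migration.id
WHERE instance.id = $1;
```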
hawkw added a commit that referenced this pull request Jun 12, 2024
In #5831, I added the `updater_gen` and `updater_id` fields for the
instance-updater lock to the `InstanceRuntimeState` struct, based purely
on vibes: "They change at runtime, right? Therefore, they ought to be
'runtime state'...". It turns out that this is actually Super Ultra
Wrong, because any query which writes an `InstanceRuntimeState` without
paying attention to the lock fields can inadvertently clobber them,
either releasing the lock or setting it back to a previous lock holder.

These fields should be on the `Instance` struct instead, to avoid this
kind of thing. This commit moves them to the correct place. I figured it
would be nice to land this separately from #5749, purely for the sake of
keeping that diff small.
hawkw added 19 commits August 9, 2024 13:26
This code is *really* gruesome, but it should, hopefully, make the
unwinding behavior correct in the case where the child saga unwinds
*from* the `become_updater` node without having successfully locked the
instance record yet. Previously, the lock could be leaked, since the
child saga unwinding does not make the start saga unwind. Now, however, the
start saga will unwind for any child saga error *except* for an "already
locked" error, in which case we can just complete cleanly.

I've also updated the unwinding tests to assert that the lock is held
either by the (fake) start saga, OR unlocked, since unwinding from the
first action can leave it locked by the parent.
Since we're not calling 'em "snapshots" anymore, this should be clearer.
This makes @gjcolombo's change from #6278 a bit more consistent with the
rest of the code, and in particular, ensures that we always present the
user with the same instance state names. It also addresses the
`SagaUnwound` behavior Greg pointed out in [this comment][1].

[1]: #5749 (comment)
hawkw enabled auto-merge (squash) on August 9, 2024
hawkw (Member, Author) commented Aug 9, 2024

Okay, folks, I've been running omicron-stress against this branch on madrid for a couple hours now, and haven't yet seen any errors! Meanwhile, I've also been running migrations of an instance every 30 seconds and have managed 55 successful migrations so far. I haven't seen any signs of the issues that have previously popped up during testing.

Therefore, I'm going to enable auto-merge on this branch so that it can finally land! 🎉

Thanks to @gjcolombo, @jmpesp, and @smklein for all your suggestions and guidance, code reviews, and help with testing! I'm really excited to finally get this merged (and move on with my life...)!

hawkw merged commit 9b595e9 into main on Aug 9, 2024
23 checks passed
hawkw deleted the eliza/instance-updater branch on August 9, 2024
hawkw added a commit that referenced this pull request Sep 11, 2024
The `omicron_common::api::internal::InstanceRuntimeState` type was once
a key component of how Nexus and sled-agents communicated about
instances. Now, however, it's outlived its usefulness: #5749 changed
Nexus' `cpapi_instances_put` and sled-agent's `instance_get` APIs to
operate exclusively on `VmmRuntimeState`s rather than
`InstanceRuntimeState`, with the sled-agent almost completely unaware of
`InstanceRuntimeState`.

At present, the `InstanceRuntimeState` type remains in the internal API
just because it's used in the `InstanceEnsureBody` type, which is sent
to sled-agent's `instance_ensure_registered` API when a new VMM is
registered. However, looking at the code that
`instance_ensure_registered` calls into in sled-agent, we see that the
only thing from the `InstanceRuntimeState` that the sled-agent actually
*uses* is the migration ID of the current migration (if the instance is
migrating in). Soooo...we can just replace the whole
`InstanceRuntimeState` there with a `migration_id: Option<Uuid>`. Once
that's done, there are now no longer any internal APIs in Nexus or in
sled-agent that actually use `InstanceRuntimeState`, so the type can be
removed from the internal API entirely. All of the code in Nexus that
deals with instance runtime states already has changed to operate
exclusively on the `nexus_db_model` version of the struct, so removing
the API type is actually quite straightforward.

In addition, I've refactored the sled-agent `instance_ensure_registered`
code paths a little bit to pass around the `InstanceEnsureBody`, rather
than unpacking its fields at the HTTP entrypoint and then passing them
all through the chain of method calls that actually gets it to the
instance-runner task. Just passing the struct around means we can remove
a bunch of `#[allow(clippy::too_many_arguments)]` attributes. It also
makes changing the contents of the instance-ensure payload a bit easier,
as we'll no longer need to change the arguments list in the whole maze
of tiny little functions, all alike, that pass around all the pieces of
the `InstanceEnsureBody`.