Use Propolis ID as VM sled resource key #2840

gjcolombo · 2023-04-14T17:09:39Z

Using instance IDs to reserve VMM resources on sleds is not quite flexible enough, because a single instance can have multiple Propolis VMMs. Use the Propolis ID as the key for these instead.
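As a rough illustration of why the key matters, here is a minimal in-memory sketch. It is not the Nexus schema or datastore API: `SledResources`, `Reservation`, and the integer ID newtypes are hypothetical stand-ins (the real IDs are UUIDs). It assumes one instance can briefly own two VMMs, for example a migration source and target.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real UUID-based IDs.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct InstanceId(u32);
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct PropolisId(u32);

// One reservation per VMM; the owning instance is just data, not the key.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Reservation {
    instance: InstanceId,
    sled: u32,
}

#[derive(Default)]
struct SledResources {
    by_propolis: HashMap<PropolisId, Reservation>,
}

impl SledResources {
    // Keyed by Propolis ID, so a second VMM for the same instance
    // does not overwrite the first reservation.
    fn reserve(&mut self, propolis: PropolisId, instance: InstanceId, sled: u32) {
        self.by_propolis.insert(propolis, Reservation { instance, sled });
    }

    // Releasing one VMM leaves the instance's other reservations alone.
    fn release(&mut self, propolis: PropolisId) -> Option<Reservation> {
        self.by_propolis.remove(&propolis)
    }

    fn count_for(&self, instance: InstanceId) -> usize {
        self.by_propolis
            .values()
            .filter(|r| r.instance == instance)
            .count()
    }
}
```

Had the map been keyed by `InstanceId`, the second `reserve` call for the same instance would have replaced the first entry.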

Add a function to allow the deletion saga to fetch a previously-deleted instance record so that it can obtain this ID at the correct time. Simply returning the deleted record from the delete-record step is insufficient because the record-deleting step needs to be idempotent and, if it runs more than once, may not find any record to delete and return.
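The idempotency problem can be sketched with a toy soft-delete store. This is only an illustration under assumptions: `Datastore`, `InstanceRecord`, `delete`, and `fetch_deleted` here are hypothetical and use plain integers, and they are not the signatures of the real `instance_fetch_deleted` or the datastore delete call.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct InstanceRecord {
    propolis_id: u32, // stand-in for the Propolis UUID
    deleted: bool,    // stand-in for a "time deleted" column
}

#[derive(Default)]
struct Datastore {
    instances: HashMap<u32, InstanceRecord>,
}

impl Datastore {
    // Soft-delete. Only the call that actually flips the flag sees the
    // record; a replayed call finds nothing left to delete and returns None.
    fn delete(&mut self, id: u32) -> Option<InstanceRecord> {
        match self.instances.get_mut(&id) {
            Some(rec) if !rec.deleted => {
                rec.deleted = true;
                Some(rec.clone())
            }
            _ => None,
        }
    }

    // Separate lookup that returns the record only once it is soft-deleted,
    // so a later step gets the same answer however often delete() was run.
    fn fetch_deleted(&self, id: u32) -> Option<&InstanceRecord> {
        self.instances.get(&id).filter(|r| r.deleted)
    }
}
```

A step that relied on the return value of `delete` would lose the Propolis ID on replay, while a step that calls `fetch_deleted` afterwards does not.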

Tested: cargo test. Will also do some ad hoc VM creations/deletions on a test cluster before merging.

Fixes #2839.


nexus/db-queries/src/db/datastore/instance.rs

smklein · 2023-04-14T17:49:01Z

nexus/src/app/sagas/instance_delete.rs

+    // deleted and so cannot change anymore.
+    let instance = osagactx
+        .datastore()
+        .instance_fetch_deleted(&opctx, &params.authz_instance)


I get what we're doing here, but this is mildly spooky to me. I don't think we really reference deleted objects anywhere else, which means, for <UNNAMED RECORD GARBAGE COLLECTOR THAT DOESN'T EXIST YET>, we can safely "hard delete" any objects which have been "soft deleted" like this.

With this API, we now actually can't delete instances until their corresponding sagas have finished, which isn't clear from the record.

We can probably mitigate this by simply "ensuring that time deleted is really old when we do hard delete", but it seems arguably "more correct" to read this propolis UUID before we delete the record, and then confirm it hasn't changed when we actually mark the record as deleted.

Anyway. This doesn't need to block you, but we might want to consider filing a follow-up issue for an approach that doesn't depend on reading the deleted instance record.

We discussed this offline. It would be nice to be able to query the Propolis ID and then condition deletion on the ID not having changed: DELETE from the instance table WHERE the Propolis IDs match AND the instance isn't already deleted. The problem with this is that, for idempotency, the 'delete instance record' step needs to swallow errors in the case where the target record was already deleted, and (without a heavy hammer like a transaction) that step can't reason about whether the operation failed because of the "instance isn't deleted" filter or the "instance has the right Propolis ID" filter.
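To make the ambiguity concrete, here is a small sketch that models the conditional statement as a function returning the number of rows affected. `Row` and `conditional_delete` are hypothetical, and the model treats the delete as a soft delete on a single row; it is not the actual query.

```rust
#[derive(Clone, Debug)]
struct Row {
    propolis_id: u32, // stand-in for the Propolis UUID
    deleted: bool,
}

// Models "delete WHERE propolis_id = expected AND NOT deleted".
// Like a plain SQL statement, it reports only how many rows matched.
fn conditional_delete(row: &mut Row, expected: u32) -> usize {
    if row.propolis_id == expected && !row.deleted {
        row.deleted = true;
        1
    } else {
        0
    }
}
```

Calling this on an already-deleted row and calling it with a stale Propolis ID both yield `0`, so a step that swallows the "already deleted" case for idempotency would also swallow the "ID changed" case.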

However, there's a much better way to handle all of this. Instead of cleaning up Propolis resources during instance delete, we should clean them up when a Propolis stops, either due to a stop API request (#2315 again!) or due to a live migration. Since instance stop and live migration are likely to arrive much sooner than the logic to garbage-collect soft-deleted resources, we'll leave this in place for now to unblock more live migration work and clean it up when the appropriate "Propolis is gone" primitive is available.

I will add a TODO comment to this effect in this saga step.

Co-authored-by: Sean Klein <sean@oxide.computer>

gjcolombo requested a review from smklein April 14, 2023 17:09

smklein self-assigned this Apr 14, 2023

smklein approved these changes Apr 14, 2023


Fix typo

734ffea

Co-authored-by: Sean Klein <sean@oxide.computer>

smklein removed their assignment Apr 14, 2023

Add TODO comment

5234026

gjcolombo enabled auto-merge (squash) April 14, 2023 21:08

gjcolombo disabled auto-merge April 14, 2023 21:10

gjcolombo enabled auto-merge (squash) April 14, 2023 21:14

gjcolombo merged commit 8d11bf1 into main Apr 14, 2023

gjcolombo deleted the gjcolombo/propolis-id-as-resource-key branch April 14, 2023 21:42
