Use Propolis ID as VM sled resource key #2840

Merged: 3 commits into main on Apr 14, 2023

gjcolombo
Contributor

Using instance IDs to reserve VMM resources on sleds is not quite flexible enough, because a single instance can have multiple Propolis VMMs. Use the Propolis ID as the key for these instead.
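
To illustrate the keying change, here is a minimal in-memory sketch, not the actual Nexus schema or datastore code. The `InstanceId`, `PropolisId`, `SledId`, `Reservation`, and `SledResources` names are hypothetical stand-ins (the real IDs are UUIDs). It shows how keying reservations by Propolis ID lets one instance hold two reservations at once, for example during a live migration.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Hypothetical ID newtypes for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct InstanceId(pub u32);
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PropolisId(pub u32);
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SledId(pub u32);

/// Resources reserved on one sled for one VMM.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Reservation {
    pub sled: SledId,
    pub instance: InstanceId,
    pub memory_mib: u64,
}

/// Reservations keyed by Propolis ID rather than instance ID.
#[derive(Default)]
pub struct SledResources {
    by_propolis: HashMap<PropolisId, Reservation>,
}

impl SledResources {
    /// Reserve resources for a VMM. Returns `false` (and changes nothing)
    /// if this Propolis ID already holds a reservation.
    pub fn reserve(&mut self, propolis: PropolisId, r: Reservation) -> bool {
        match self.by_propolis.entry(propolis) {
            Entry::Occupied(_) => false,
            Entry::Vacant(slot) => {
                slot.insert(r);
                true
            }
        }
    }

    /// Release the reservation held by one VMM, if any.
    pub fn release(&mut self, propolis: PropolisId) -> Option<Reservation> {
        self.by_propolis.remove(&propolis)
    }

    /// Total memory currently reserved on a sled.
    pub fn reserved_on(&self, sled: SledId) -> u64 {
        self.by_propolis
            .values()
            .filter(|r| r.sled == sled)
            .map(|r| r.memory_mib)
            .sum()
    }
}
```

With an instance-ID key, the second `reserve` call for the same instance would collide with the first. With a Propolis-ID key, the source and target VMMs are tracked and released independently.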

Add a function to allow the deletion saga to fetch a previously-deleted instance record so that it can obtain this ID at the correct time. Simply returning the deleted record from the delete-record step is insufficient because the record-deleting step needs to be idempotent and, if it runs more than once, may not find any record to delete and return.
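
The idempotency concern can be sketched with a small in-memory model. This is an assumption-laden illustration, not the real datastore: `Datastore`, `InstanceRecord`, the integer IDs, and the logical timestamps are hypothetical, and `instance_fetch_deleted` here is a simplified stand-in for the function this PR adds (the real one takes an `opctx` and an authz object).

```rust
use std::collections::HashMap;

/// Hypothetical instance record; `time_deleted` is set on soft delete.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct InstanceRecord {
    pub propolis_id: u32,
    pub time_deleted: Option<u64>,
}

/// Hypothetical in-memory stand-in for the instance table.
#[derive(Default)]
pub struct Datastore {
    instances: HashMap<u32, InstanceRecord>,
}

impl Datastore {
    pub fn insert(&mut self, id: u32, propolis_id: u32) {
        self.instances
            .insert(id, InstanceRecord { propolis_id, time_deleted: None });
    }

    /// Soft-delete an instance. Only the call that actually performs the
    /// deletion gets the record back; a replayed call finds nothing live
    /// to delete and returns `None`.
    pub fn instance_delete(&mut self, id: u32, now: u64) -> Option<InstanceRecord> {
        let rec = self.instances.get_mut(&id)?;
        if rec.time_deleted.is_some() {
            return None;
        }
        rec.time_deleted = Some(now);
        Some(rec.clone())
    }

    /// Fetch a record only if it has already been soft-deleted. The result
    /// is the same no matter how many times the delete step ran.
    pub fn instance_fetch_deleted(&self, id: u32) -> Option<&InstanceRecord> {
        self.instances
            .get(&id)
            .filter(|rec| rec.time_deleted.is_some())
    }
}
```

A saga step that relied on the return value of `instance_delete` would lose the Propolis ID on replay, whereas a later step that calls `instance_fetch_deleted` sees the same record every time.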

Tested: cargo test. Will also do some ad hoc VM creations/deletions on a test cluster before merging.

Fixes #2839.

@gjcolombo gjcolombo requested a review from smklein April 14, 2023 17:09
@smklein smklein self-assigned this Apr 14, 2023
nexus/db-queries/src/db/datastore/instance.rs (outdated, resolved)
// deleted and so cannot change anymore.
let instance = osagactx
    .datastore()
    .instance_fetch_deleted(&opctx, &params.authz_instance)
Collaborator

I get what we're doing here, but this is mildly spooky to me. I don't think we really reference deleted objects anywhere else, which means, for <UNNAMED RECORD GARBAGE COLLECTOR THAT DOESN'T EXIST YET>, we can safely "hard delete" any objects which have been "soft deleted" like this.

With this API, we now actually can't delete instances until their corresponding sagas have finished, which isn't clear from the record.

We can probably mitigate this by simply "ensuring that time deleted is really old when we do hard delete", but it seems arguably "more correct" to read this propolis UUID before we delete the record, and then confirm it hasn't changed when we actually mark the record as deleted.

Anyway. This doesn't need to block you, but we might want to consider filing a follow-up issue that doesn't depend on the deleted instance.

Contributor Author

We discussed this offline. It would be nice to be able to query the Propolis ID and then condition deletion on the ID not having changed: DELETE from the instance table WHERE the Propolis IDs match AND the instance isn't already deleted. The problem with this is that, for idempotency, the 'delete instance record' step needs to swallow errors in the case where the target record was already deleted, and (without a heavy hammer like a transaction) that step can't reason about whether the operation failed because of the "instance isn't deleted" filter or the "instance has the right Propolis ID" filter.
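
A rough model of why the two filters are hard to tell apart, assuming the conditional delete only reports how many rows it affected. `Row` and `conditional_delete` are hypothetical names used for illustration; this is not the real query code.

```rust
/// Hypothetical row from the instance table.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Row {
    pub propolis_id: u32,
    pub deleted: bool,
}

/// Models a delete conditioned on both filters:
/// "Propolis ID matches" AND "instance is not already deleted".
/// Like a plain SQL statement, it reports only the affected-row count.
pub fn conditional_delete(row: &mut Row, expected_propolis_id: u32) -> usize {
    if row.propolis_id == expected_propolis_id && !row.deleted {
        row.deleted = true;
        1
    } else {
        0
    }
}
```

A replayed step against an already-deleted row and a first attempt against a row whose Propolis ID has changed both return `0`. The first case should be swallowed for idempotency and the second should be an error, but the count alone does not say which one occurred.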

However, there's a much better way to handle all of this. Instead of cleaning up Propolis resources during instance delete, we should clean them up when a Propolis stops, either due to a stop API request (#2315 again!) or due to a live migration. Since instance stop and live migration are likely to arrive much sooner than the logic to garbage-collect soft-deleted resources, we'll leave this in place for now to unblock more live migration work and clean it up when the appropriate "Propolis is gone" primitive is available.

I will add a TODO comment to this effect in this saga step.

Co-authored-by: Sean Klein <sean@oxide.computer>
@smklein smklein removed their assignment Apr 14, 2023
@gjcolombo gjcolombo enabled auto-merge (squash) April 14, 2023 21:08
@gjcolombo gjcolombo disabled auto-merge April 14, 2023 21:10
@gjcolombo gjcolombo enabled auto-merge (squash) April 14, 2023 21:14
@gjcolombo gjcolombo merged commit 8d11bf1 into main Apr 14, 2023
19 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/propolis-id-as-resource-key branch April 14, 2023 21:42
Successfully merging this pull request may close these issues.

Nexus should treat Propolises (not just instances) as sled services
2 participants