Use Propolis ID as VM sled resource key #2840
Using instance IDs to reserve VMM resources on sleds is not quite flexible enough, because a single instance can have multiple Propolis VMMs. Use the Propolis ID as the key for these instead.

Add a function to allow the deletion saga to fetch a previously-deleted instance record so that it can obtain this ID at the correct time. Simply returning the deleted record from the delete-record step is insufficient because the record-deleting step needs to be idempotent and, if it runs more than once, may not find any record to delete and return.

Tested: cargo test. Will do some ad hoc testing before merging.
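To illustrate why the delete step cannot simply hand back the record, here is a minimal in-memory sketch. Everything in it (`Datastore`, `InstanceRecord`, `soft_delete`, `fetch_deleted`, the `u64` IDs) is a hypothetical stand-in for illustration and not the actual datastore API from this change.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an instance row; not the real schema.
#[derive(Clone, Debug, PartialEq)]
pub struct InstanceRecord {
    pub propolis_id: u64,
    /// `Some(timestamp)` once the record has been soft-deleted.
    pub time_deleted: Option<u64>,
}

/// Hypothetical in-memory datastore keyed by instance ID.
#[derive(Default)]
pub struct Datastore {
    instances: HashMap<u64, InstanceRecord>,
}

impl Datastore {
    pub fn insert(&mut self, instance_id: u64, propolis_id: u64) {
        let record = InstanceRecord { propolis_id, time_deleted: None };
        self.instances.insert(instance_id, record);
    }

    /// Idempotent soft delete. Only the call that actually performs the
    /// deletion gets the record back; a replayed call finds nothing live
    /// to delete and returns `None`.
    pub fn soft_delete(&mut self, instance_id: u64, now: u64) -> Option<InstanceRecord> {
        let record = self.instances.get_mut(&instance_id)?;
        if record.time_deleted.is_some() {
            return None;
        }
        record.time_deleted = Some(now);
        Some(record.clone())
    }

    /// Fetch a record that has already been soft-deleted. This gives the
    /// same answer no matter how many times the delete step ran.
    pub fn fetch_deleted(&self, instance_id: u64) -> Option<&InstanceRecord> {
        self.instances
            .get(&instance_id)
            .filter(|record| record.time_deleted.is_some())
    }
}
```

In this sketch a replayed `soft_delete` returns `None`, so a later saga step that relied on its output would lose the Propolis ID, whereas `fetch_deleted` can be called in a separate step and stays stable across replays.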
    // deleted and so cannot change anymore.
    let instance = osagactx
        .datastore()
        .instance_fetch_deleted(&opctx, &params.authz_instance)
I get what we're doing here, but this is mildly spooky to me. I don't think we really reference deleted objects anywhere else, which means, for <UNNAMED RECORD GARBAGE COLLECTOR THAT DOESN'T EXIST YET>, we can safely "hard delete" any objects which have been "soft deleted" like this.
With this API, we now actually can't delete instances until their corresponding sagas have finished, which isn't clear from the record.
We can probably mitigate this by simply "ensuring that time deleted is really old when we do hard delete", but it seems arguably "more correct" to read this propolis UUID before we delete the record, and then confirm it hasn't changed when we actually mark the record as deleted.
Anyway. This doesn't need to block you, but we might want to consider filing a follow-up issue that doesn't depend on the deleted instance.
We discussed this offline. It would be nice to be able to query the Propolis ID and then condition deletion on the ID not having changed: DELETE from the instance table WHERE the Propolis IDs match AND the instance isn't already deleted. The problem with this is that, for idempotency, the 'delete instance record' step needs to swallow errors in the case where the target record was already deleted, and (without a heavy hammer like a transaction) that step can't reason about whether the operation failed because of the "instance isn't deleted" filter or the "instance has the right Propolis ID" filter.
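The ambiguity described above can be sketched as follows. This is only an illustration under assumed names: `Row` and `conditional_delete` are hypothetical, and the function merely mimics a filtered delete that reports an affected-row count, as a database would.

```rust
use std::collections::HashMap;

/// Hypothetical instance row for illustration only.
pub struct Row {
    pub propolis_id: u64,
    pub deleted: bool,
}

/// Mimics a delete conditioned on "Propolis ID matches AND not already
/// deleted". Like a SQL statement, it reports only how many rows it
/// touched, not which filter rejected the row.
pub fn conditional_delete(
    table: &mut HashMap<u64, Row>,
    instance_id: u64,
    expected_propolis_id: u64,
) -> usize {
    match table.get_mut(&instance_id) {
        Some(row) if !row.deleted && row.propolis_id == expected_propolis_id => {
            row.deleted = true;
            1
        }
        // Already deleted, wrong Propolis ID, or no such row: all look
        // identical to the caller.
        _ => 0,
    }
}
```

A replay after a successful delete and a first attempt against an instance whose Propolis ID changed both yield `0`, so a step that swallows "zero rows" for idempotency would also swallow the mismatch it was meant to detect.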
However, there's a much better way to handle all of this. Instead of cleaning up Propolis resources during instance delete, we should clean them up when a Propolis stops, either due to a stop API request (#2315 again!) or due to a live migration. Since instance stop and live migration are likely to arrive much sooner than the logic to garbage-collect soft-deleted resources, we'll leave this in place for now to unblock more live migration work and clean it up when the appropriate "Propolis is gone" primitive is available.
I will add a TODO comment to this effect in this saga step.
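As a rough sketch of the cleanup-on-stop idea, assuming a reservation table keyed by Propolis ID: the names `SledResources`, `Reservation`, and `on_propolis_stopped` are invented for illustration and do not correspond to any existing primitive in the codebase.

```rust
use std::collections::HashMap;

/// Hypothetical reservation held on a sled for one Propolis VMM.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Reservation {
    pub instance_id: u64,
    pub vcpus: u32,
}

/// Hypothetical reservation table keyed by Propolis ID rather than
/// instance ID, so one instance can hold several entries at once
/// (for example, migration source and target).
#[derive(Default)]
pub struct SledResources {
    by_propolis: HashMap<u64, Reservation>,
}

impl SledResources {
    pub fn reserve(&mut self, propolis_id: u64, reservation: Reservation) {
        self.by_propolis.insert(propolis_id, reservation);
    }

    /// Release the reservation for a Propolis that has stopped. Safe to
    /// call repeatedly; returns `true` only if something was released.
    pub fn on_propolis_stopped(&mut self, propolis_id: u64) -> bool {
        self.by_propolis.remove(&propolis_id).is_some()
    }

    pub fn holds(&self, propolis_id: u64) -> bool {
        self.by_propolis.contains_key(&propolis_id)
    }

    pub fn vcpus_reserved(&self) -> u32 {
        self.by_propolis.values().map(|r| r.vcpus).sum()
    }
}
```

Because the release is tied to a specific Propolis going away, stopping a migration source would free only that VMM's reservation and leave the target's intact, with no need to consult the (possibly soft-deleted) instance record.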
Tested: cargo test. Will also do some ad hoc VM creations/deletions on a test cluster before merging.
Fixes #2839.