Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Propolis ID as VM sled resource key #2840

Merged
merged 3 commits into from
Apr 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions nexus/db-queries/src/db/datastore/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,36 @@ impl DataStore {
Ok(db_instance)
}

/// Fetches information about a deleted instance. This can be used to
/// query the properties an instance had at the time it was deleted, which
/// can be useful when cleaning up a deleted instance.
pub async fn instance_fetch_deleted(
&self,
opctx: &OpContext,
authz_instance: &authz::Instance,
) -> LookupResult<Instance> {
opctx.authorize(authz::Action::Read, authz_instance).await?;

use db::schema::instance::dsl;
let instance = dsl::instance
.filter(dsl::id.eq(authz_instance.id()))
.filter(dsl::time_deleted.is_not_null())
.select(Instance::as_select())
.get_result_async(self.pool_authorized(opctx).await?)
.await
.map_err(|e| {
public_error_from_diesel_pool(
e,
ErrorHandler::NotFoundByLookup(
ResourceType::Instance,
LookupType::ById(authz_instance.id()),
),
)
})?;

Ok(instance)
}

// TODO-design It's tempting to return the updated state of the Instance
// here because it's convenient for consumers and by using a RETURNING
// clause, we could ensure that the "update" and "fetch" are atomic.
Expand Down
11 changes: 7 additions & 4 deletions nexus/src/app/sagas/instance_create.rs
Original file line number Diff line number Diff line change
Expand Up @@ -599,7 +599,10 @@ async fn sic_alloc_server(
let rss_ram = params.create_params.memory;
let reservoir_ram = omicron_common::api::external::ByteCount::from(0);

let instance_id = sagactx.lookup::<Uuid>("instance_id")?;
// Use the instance's Propolis ID as its resource key, since each unique
// Propolis consumes its own resources, and an instance can have multiple
// Propolises during a live migration.
let propolis_id = sagactx.lookup::<Uuid>("propolis_id")?;
let resources = db::model::Resources::new(
hardware_threads.into(),
rss_ram.into(),
Expand All @@ -609,7 +612,7 @@ async fn sic_alloc_server(
let resource = osagactx
.nexus()
.reserve_on_random_sled(
instance_id,
propolis_id,
db::model::SledResourceKind::Instance,
resources,
)
Expand All @@ -622,9 +625,9 @@ async fn sic_alloc_server_undo(
sagactx: NexusActionContext,
) -> Result<(), anyhow::Error> {
let osagactx = sagactx.user_data();
let instance_id = sagactx.lookup::<Uuid>("instance_id")?;
let propolis_id = sagactx.lookup::<Uuid>("propolis_id")?;

osagactx.nexus().delete_sled_reservation(instance_id).await?;
osagactx.nexus().delete_sled_reservation(propolis_id).await?;
Ok(())
}

Expand Down
20 changes: 19 additions & 1 deletion nexus/src/app/sagas/instance_delete.rs
Original file line number Diff line number Diff line change
Expand Up @@ -292,9 +292,27 @@ async fn sid_account_sled_resources(
&params.serialized_authn,
);

// Fetch the previously-deleted instance record to get its Propolis ID. It
// is safe to fetch the ID at this point because the instance is already
// deleted and so cannot change anymore.
//
// TODO(#2315): This prevents the garbage collection of soft-deleted
// instance records. A better method is to remove a Propolis's reservation
// once an instance no longer refers to it (e.g. when it has stopped or
// been removed from the instance's migration information) and then make
// this saga check that the instance has no active Propolises before it is
// deleted. This logic should be part of the logic needed to stop an
// instance and release its Propolis reservation; when that is added this
// step can be removed.
let instance = osagactx
.datastore()
.instance_fetch_deleted(&opctx, &params.authz_instance)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what we're doing here, but this is mildly spooky to me. I don't think we really reference deleted objects anywhere else, which means, for <UNNAMED RECORD GARBAGE COLLECTOR THAT DOESN'T EXIST YET>, we can safely "hard delete" any objects which have been "soft deleted" like this.

With this API, we now actually can't delete instances until their corresponding sagas have finished, which isn't clear from the record.

We can probably mitigate this by simply "ensuring that time deleted is really old when we do hard delete", but it seems arguably "more correct" to read this propolis UUID before we delete the record, and then confirm it hasn't changed when we actually mark the record as deleted.

Anyway. This doesn't need to block you, but we might want to consider filing a follow-up issue that doesn't depend on the deleted instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this offline. It would be nice to be able to query the Propolis ID and then condition deletion on the ID not having changed: DELETE from the instance table WHERE the Propolis IDs match AND the instance isn't already deleted. The problem with this is that, for idempotency, the 'delete instance record' step needs to swallow errors in the case where the target record was already deleted, and (without a heavy hammer like a transaction) that step can't reason about whether the operation failed because of the "instance isn't deleted" filter or the "instance has the right Propolis ID" filter.

However, there's a much better way to handle all of this. Instead of cleaning up Propolis resources during instance delete, we should clean them up when a Propolis stops, either due to a stop API request (#2315 again!) or due to a live migration. Since instance stop and live migration are likely to arrive much sooner than the logic to garbage-collect soft-deleted resources, we'll leave this in place for now to unblock more live migration work and clean it up when the appropriate "Propolis is gone" primitive is available.

I will add a TODO comment to this effect in this saga step.

.await
.map_err(ActionError::action_failed)?;

osagactx
.datastore()
.sled_reservation_delete(&opctx, params.instance.id())
.sled_reservation_delete(&opctx, instance.runtime().propolis_id)
.await
.map_err(ActionError::action_failed)?;
Ok(())
Expand Down