Let's suppose we have an instance in a state where it should not be deleted. Specifically, a case where sid_delete_instance_record, calling project_delete_instance, observes that the instance is in a state that does not permit deletion.
This should be reproducible in a scenario where the instance is "running", which is not one of the ok_to_delete_instance_states:

    let ok_to_delete_instance_states = vec![stopped, failed];

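To make the check concrete, here is a minimal sketch of the kind of state validation described above. The `InstanceState` enum and `check_ok_to_delete` function are hypothetical names made up for illustration; as far as I understand, the real check happens inside the datastore query issued by project_delete_instance, not in a standalone function like this.

```rust
/// Hypothetical, simplified set of instance states (not omicron's actual type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Starting,
    Running,
    Stopped,
    Failed,
}

/// Returns Ok only if the instance is in a state that permits deletion.
/// Assumption: mirrors the `ok_to_delete_instance_states` list quoted above.
fn check_ok_to_delete(state: InstanceState) -> Result<(), String> {
    let ok_to_delete_instance_states = [InstanceState::Stopped, InstanceState::Failed];
    if ok_to_delete_instance_states.contains(&state) {
        Ok(())
    } else {
        Err(format!("instance cannot be deleted in state {:?}", state))
    }
}

fn main() {
    let states = [
        InstanceState::Starting,
        InstanceState::Running,
        InstanceState::Stopped,
        InstanceState::Failed,
    ];
    for state in states {
        println!("{:?}: {:?}", state, check_ok_to_delete(state));
    }
}
```

With this sketch, a "running" instance is rejected, which is the failure the rest of this issue depends on.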
If we execute the instance delete saga on this instance, we'll execute the following actions:

    fn make_saga_dag(
        _params: &Self::Params,
        mut builder: steno::DagBuilder,
    ) -> Result<steno::Dag, super::SagaInitError> {
        builder.append(v2p_ensure_undo_action());
        builder.append(v2p_ensure_action());
        builder.append(delete_asic_configuration_action());
        builder.append(instance_delete_record_action());
        builder.append(delete_network_interfaces_action());
        builder.append(deallocate_external_ip_action());
        builder.append(virtual_resources_account_action());
        builder.append(sled_resources_account_action());
        Ok(builder.build()?)
    }

Here's what will happen:
- The V2P mappings will be deleted
- The ASIC configuration will be deleted
- The instance record cannot be deleted, because we aren't in a valid state. The saga will start unwinding from instance_delete_record_action.
- On the unwind path, the saga will try to re-create the V2P mappings via sid_v2p_ensure_undo.
This is problematic for a couple of reasons:
- The (temporary) destruction of the V2P mappings is an observable side effect of the failed instance deletion.
- As far as I can tell, the sid_delete_network_config function will delete all NAT mappings, and this destructive action will not be "undone".
This seems like it'll degrade the network functionality of the instance, even though it remains running.
It seems like we should delete the instance record first, before proceeding with the de-allocation of resources. This will validate the state of the instance before we actually perform destructive operations.
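To illustrate the difference the ordering makes, here is a rough, self-contained sketch that simulates both orderings against a running instance. It does not use steno; `World`, `Action`, `run_saga`, and the helper functions are all hypothetical stand-ins invented for this example. It also collapses the two V2P nodes into a single action with an undo, and models the network config deletion as having no undo, per my reading above.

```rust
/// Hypothetical stand-in for the state the saga touches; not omicron's types.
struct World {
    instance_running: bool,
    record_exists: bool,
    v2p_mappings: bool,
    nat_entries: bool,
    /// Number of times the V2P mappings were removed, even temporarily.
    v2p_disruptions: u32,
}

/// One saga node: a forward action plus the undo that runs during unwinding.
struct Action {
    name: &'static str,
    forward: fn(&mut World) -> Result<(), String>,
    undo: fn(&mut World),
}

fn no_undo(_world: &mut World) {}

fn delete_v2p(world: &mut World) -> Result<(), String> {
    world.v2p_mappings = false;
    world.v2p_disruptions += 1;
    Ok(())
}

fn restore_v2p(world: &mut World) {
    world.v2p_mappings = true;
}

/// Assumption: deletes NAT entries and has no undo.
fn delete_network_config(world: &mut World) -> Result<(), String> {
    world.nat_entries = false;
    Ok(())
}

/// Fails when the instance is not in a deletable state.
fn delete_record(world: &mut World) -> Result<(), String> {
    if world.instance_running {
        return Err("instance is running".to_string());
    }
    world.record_exists = false;
    Ok(())
}

/// Runs actions in order; on failure, undoes completed actions in reverse.
fn run_saga(actions: &[Action], world: &mut World) -> Result<(), String> {
    for (i, action) in actions.iter().enumerate() {
        if let Err(e) = (action.forward)(world) {
            for done in actions[..i].iter().rev() {
                (done.undo)(world);
            }
            return Err(format!("{} failed: {}", action.name, e));
        }
    }
    Ok(())
}

/// Ordering as described in this issue: destructive steps first.
fn current_order() -> Vec<Action> {
    vec![
        Action { name: "v2p", forward: delete_v2p, undo: restore_v2p },
        Action { name: "network_config", forward: delete_network_config, undo: no_undo },
        Action { name: "instance_record", forward: delete_record, undo: no_undo },
    ]
}

/// Proposed ordering: validate state by deleting the record first.
fn proposed_order() -> Vec<Action> {
    vec![
        Action { name: "instance_record", forward: delete_record, undo: no_undo },
        Action { name: "v2p", forward: delete_v2p, undo: restore_v2p },
        Action { name: "network_config", forward: delete_network_config, undo: no_undo },
    ]
}

fn running_instance() -> World {
    World {
        instance_running: true,
        record_exists: true,
        v2p_mappings: true,
        nat_entries: true,
        v2p_disruptions: 0,
    }
}

fn main() {
    for (label, actions) in [("current", current_order()), ("proposed", proposed_order())] {
        let mut world = running_instance();
        let result = run_saga(&actions, &mut world);
        println!(
            "{}: result={:?} record={} v2p={} nat={} v2p_disruptions={}",
            label, result, world.record_exists, world.v2p_mappings,
            world.nat_entries, world.v2p_disruptions
        );
    }
}
```

In this simulation, the current ordering fails with the NAT entries gone and one V2P disruption recorded, while the proposed ordering fails before touching anything. The sketch does not address what should happen if a later step fails after the record has already been deleted.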
FYI @jmpesp, @internet-diglett.
Code referenced above, at commit e4a5dd0:
- ok_to_delete_instance_states: omicron/nexus/db-queries/src/db/datastore/instance.rs, line 228
- make_saga_dag: omicron/nexus/src/app/sagas/instance_delete.rs, lines 69 to 82