Skip to content

Instance Deletion saga looks destructive, even if instance should not be deleted #2842

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
smklein opened this issue Apr 14, 2023 · 4 comments · Fixed by #4194
Closed

Instance Deletion saga looks destructive, even if instance should not be deleted #2842

smklein opened this issue Apr 14, 2023 · 4 comments · Fixed by #4194
Assignees
Labels
bug Something that isn't working. networking Related to the networking. nexus Related to nexus
Milestone

Comments

@smklein
Copy link
Collaborator

smklein commented Apr 14, 2023

Let's suppose we have an instance in a state where it should not be deleted. Specifically, a case where sid_delete_instance_record , calling project_delete_instance, observes that the instance cannot be running.

This should be reproducible in a scenario where the instance is "running" - it's not an ok_to_delete_instance_state:

let ok_to_delete_instance_states = vec![stopped, failed];

If we execute the instance delete saga on this instance, we'll execute the following actions:

fn make_saga_dag(
_params: &Self::Params,
mut builder: steno::DagBuilder,
) -> Result<steno::Dag, super::SagaInitError> {
builder.append(v2p_ensure_undo_action());
builder.append(v2p_ensure_action());
builder.append(delete_asic_configuration_action());
builder.append(instance_delete_record_action());
builder.append(delete_network_interfaces_action());
builder.append(deallocate_external_ip_action());
builder.append(virtual_resources_account_action());
builder.append(sled_resources_account_action());
Ok(builder.build()?)
}

Here's what will happen:

  • The V2P mappings will be deleted
  • The ASIC configuration will be deleted
  • The instance record cannot be deleted, because we aren't in a valid state. The saga will start unwinding from instance_delete_record_action.
  • On the unwind path, the V2P mappings will try to be re-created via sid_v2p_ensure_undo

This is problematic for a couple reasons:

  1. The (temporary) destruction of the V2P mappings are an observable side-effect of the failed instance deletion
  2. As far as I can tell, the sid_delete_network_config function will delete all NAT mappings, and this destructive action will not be "undone"

This seems like it'll degrade the network functionality of the instance, even though it remains running.

It seems like we should delete the instance record first, before proceeding with the de-allocation of resources. This will validate the state of the instance before we actually perform destructive operations.

FYI @jmpesp , @internet-diglett .

@smklein smklein added bug Something that isn't working. networking Related to the networking. nexus Related to nexus labels Apr 14, 2023
@smklein smklein added this to the MVP milestone Apr 14, 2023
@internet-diglett
Copy link
Contributor

As far as I can tell, the sid_delete_network_config function will delete all NAT mappings, and this destructive action will not be "undone"

Yes, this is the case. I'll try reordering the saga nodes and see if it's a trivial fix.

@internet-diglett
Copy link
Contributor

internet-diglett commented Apr 14, 2023

{"msg":"saga log event","v":0,"name":"test_actions_succeed_idempotently","level":20,"time":"2023-04-14T18:51:21.39154300
7Z","hostname":"workstation-1","pid":21150,"sec_id":"99aa6991-51a2-4eba-a812-ade594ade2bf","component":"SEC","component"
:"nexus","component":"ServerContext","name":"99aa6991-51a2-4eba-a812-ade594ade2bf","new_state":"N002 failed"}
{"msg":"recording saga event","v":0,"name":"test_actions_succeed_idempotently","level":20,"time":"2023-04-14T18:51:21.39
1578685Z","hostname":"workstation-1","pid":21150,"component":"SecStore","component":"nexus","component":"ServerContext",
"name":"99aa6991-51a2-4eba-a812-ade594ade2bf","event_type":"Failed(ActionFailed { source_error: Object {\"ObjectNotFound
\": Object {\"lookup_type\": Object {\"ById\": String(\"818d8bec-ec26-461d-be6f-fb93dd5658cc\")}, \"type_name\": String(
\"instance\")}} })","node_id":"2","saga_id":"c1bacde0-249d-43c9-a470-341b9a216bbf"}

Looks like there are some record lookups that get broken if we perform instance_delete_record_action() first.

It looks like deleting the v2p mappings depends on the instance record being present?

/// Ensure that the necessary v2p mappings for an instance are deleted
pub async fn delete_instance_v2p_mappings(
&self,
opctx: &OpContext,
instance_id: Uuid,
) -> Result<(), Error> {
// For every sled that isn't the sled this instance was allocated to, delete
// the virtual to physical mapping for each of this instance's NICs. If
// there isn't a V2P mapping, del_v2p should be a no-op.
let (.., authz_instance, db_instance) =
LookupPath::new(&opctx, &self.db_datastore)
.instance_id(instance_id)
.fetch_for(authz::Action::Read)
.await?;
let instance_nics = self
.db_datastore
.derive_guest_network_interface_info(&opctx, &authz_instance)
.await?;

Admittedly I could be misunderstanding this, as I'm still not 100% sure on how some of our queries work.

@jmpesp
Copy link
Contributor

jmpesp commented Apr 14, 2023

It looks like deleting the v2p mappings depends on the instance record being present?

Yeah, this is definitely true - for the reason you posted, and for calling derive_guest_network_interface_info.

For this to work, we'd have to cache the existing V2P entries (probably as the output of a saga node) before deleting the existing record. Problem with that is stale information (and this problem probably exists today) - nothing's locking the instance record so it would be possible to add or remove a network interface, and render that cached information stale.

@smklein
Copy link
Collaborator Author

smklein commented Apr 14, 2023

Ah, I was potentially going to suggest the generation number here, but that sounds like a bad fit. Network interfaces can be added to stopped instances, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working. networking Related to the networking. nexus Related to nexus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants