Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate saga forward and undo actions which create more than one item #1466

Closed
bnaecker opened this issue Jul 19, 2022 · 5 comments
Closed
Assignees
Labels
nexus Related to nexus

Comments

@bnaecker
Copy link
Collaborator

As part of the instance create saga, we create a variable number of network interfaces for the guest. In the saga, that's done by first creating all the UUIDs for the NICs, and then a single action which creates all the NICs. Because that single creation action performs multiple fallible operations, we attached the corresponding undo action to the preceding saga node, the one which creates the UUIDs. This is because the undo action for a node is only performed if the action completes successfully. Should the NIC-creation action fail partway through creating a list of NICs, they would not be unwound if the undo action were attached to that node. By attaching it to the previous UUID-creation node, which has a no-op undo, we can be sure to unwind any NICs we created, even if we failed partway through. This is all done here, with a detailed explanation.

I used the same strategy in #1458, when creating multiple ephemeral IP addresses for guests, since that has the same properties. During review, it became clear that this is pretty confusing, and definitely bad for refactoring since one could change the forward action, but not realize that it's strongly coupled to the preceding action. This issue tracks moving both these to the strategy used in the disk-creation steps. There, the loop is "unrolled" in to a set of action-undo pairs, all of which are in separate nodes in the saga. Each creates one disk, and each undo removes that disk. This requires that we can statically limit the number of possible nodes, which I think applies in all these cases. It also means we need to have logic inside each node for conditionally running it, if the number of items requested is less than the static limit.

@bnaecker bnaecker added the nexus Related to nexus label Jul 19, 2022
@bnaecker bnaecker self-assigned this Jul 19, 2022
@davepacheco
Copy link
Collaborator

oxidecomputer/steno#29 / oxidecomputer/steno#30 (not yet landed) can also improve this. With that change, we'll create a new DAG for each instance-create saga, and that DAG can have exactly the right number of nodes for creating NICs, and you can attach the appropriate undo action to each node. It's basically the same as the disk-creation example you mentioned but you don't have to statically limit the number of them or create extra ones to fill out that limit.

@bnaecker
Copy link
Collaborator Author

Thanks @davepacheco that's very good to know. Is it worth waiting for those to land before fixing any of this? The fix is pretty small, but it'd still be good to avoid churn.

@davepacheco
Copy link
Collaborator

I expect those Steno changes to become ready this week or next, but then it may be a bunch of work to update Omicron to use them. I'm not sure how urgent this is.

If I understand right, the change you're proposing here is a necessary part of using the new functionality anyway. Maybe that's a reason not to wait?

@bnaecker
Copy link
Collaborator Author

Ok, thanks Dave. It's not urgent in that the current implementation is not wrong, just confusing. But it will be harder to maintain. I might unravel some of it now, which maybe will help the integration with these Steno changes too, as you point out. Thanks!

@bnaecker
Copy link
Collaborator Author

Closed by #1474

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nexus Related to nexus
Projects
None yet
Development

No branches or pull requests

2 participants