
Provide a Way to Opt Out of Resource Adoption #7913

Open
spjmurray opened this issue Jan 12, 2023 · 14 comments
Labels
help wanted (Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.) · kind/feature (Categorizes issue or PR as related to a new feature.) · priority/backlog (Higher priority than priority/awaiting-more-evidence.) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@spjmurray

User Story

Say I deploy a cluster via Helm and ArgoCD. I then want to delete a MachineDeployment, its KubeadmConfigTemplate and OpenstackMachineTemplate. The problem is that while Argo detects it no longer needs to manage these, because they are no longer generated by Helm, it doesn't actually delete them, because CAPI has implicitly added owner references pointing at the Cluster. Argo thinks these resources have been implicitly created by controller-x, as per standard Deployment/StatefulSet/etc. semantics, so it doesn't touch them.

The way CAPI works essentially prevents deletion of MachineDeployments in this context and requires manual deletion, which is a pain in the butt.
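
For illustration, this is roughly the shape of the owner reference CAPI injects (names and the UID below are placeholders); its presence is what makes Argo treat the object as controller-created rather than as one of its own:

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: my-cluster-md-0            # placeholder name
  namespace: default
  ownerReferences:
  # Added by CAPI when it "adopts" the template; Argo then sees an owned object
  # and skips pruning it, per its usual child-resource semantics.
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: my-cluster               # placeholder name
    uid: 00000000-0000-0000-0000-000000000000
```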

Detailed Description

What I want is a way to opt out of this adoption, so that the provisioning application is in control of its lifecycle, not CAPI.

/kind feature

@k8s-ci-robot added the kind/feature and needs-triage labels on Jan 12, 2023
@fabriziopandini
Member

Hi @spjmurray, thanks for reporting the issue!

Have you considered fixing this in Argo? When I read the problem statement, it seems to me that the root cause is how Argo detects whether it has to manage objects or not. E.g. can Argo support an annotation or a configuration knob that instructs the system to ignore those specific owner references?

In my experience there are always advantages in fixing the root cause of a problem in the right layer instead of working around it in other layers, because that usually allows a more elegant and generic solution that can, e.g., solve similar problems in many other applications, not only CAPI.

I also have some general concerns about making CAPI behave differently depending on the tools used around it, because there are many tools out there, and if we start making specific changes for one, then what about the others?

But let's start by answering the first question, and possibly also get some other opinions on the issue.

@fabriziopandini
Member

fabriziopandini commented Jan 12, 2023

Oh, I saw you opened the issue in Argo while I was writing, great! 🚀

@spjmurray
Author

spjmurray commented Jan 13, 2023

@fabriziopandini well, it's not necessarily all Argo's problem. As a slight (but related) aside, I did note that when manually deleting the MD I expected the accompanying KCT and OSMT to be deleted too, as they could/should be considered "owned" by the MD. However, those are also owned/adopted by the Cluster resource, which I found kind of weird: I now need to look at the live MDs and the resources they point to (they have hash-based names generated by Helm, and I don't want to replicate that logic) in order to build an expected set so I know which ones I can delete, rather than letting cascading deletion do its job. That in itself would save about 200 lines of code 😉

I can see why the ownership has been done the way it has to improve the UX, and I certainly think cleanup on MD deletion would also enhance UX and make CAPI clusters a bit more dynamic.

@killianmuldoon
Contributor

I expected the accompanying KCT and OSMT to be deleted as they could/should be considered "owned" by the MD.

These templates can be used across different MDs (and other objects), so I think "Cluster" is the right level of ownership. The templates can also be created out of order, i.e. they can exist independently of any object using them, so basing deletion on MD ownership could be frustrating for users.
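
To make the sharing concrete, here is a rough sketch, trimmed to the fields relevant here, of how a MachineDeployment only points at the templates by reference (names are placeholders and the infrastructure API version depends on the provider release), so the same templates can back several MachineDeployments:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0                                 # placeholder
spec:
  clusterName: my-cluster
  template:
    spec:
      clusterName: my-cluster
      version: v1.26.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: shared-bootstrap-template    # may also be referenced by md-1
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6   # varies by CAPO release
        kind: OpenStackMachineTemplate
        name: shared-machine-template        # may also be referenced by md-1
```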

Can I ask how Argo actually deletes the MD? Is it running the equivalent of a delete call against the API? Is the MD deletion actually working or is it just the template deletion that isn't working?

@spjmurray
Author

@killianmuldoon it's not running the delete at all. It would if it could, but the owner reference that's been added by CAPI is keeping it alive. Take a ReplicaSet as an example: Argo shouldn't go pruning all the Pods, because they are owned by the RS. A similar thing is happening here: because the MD is now owned by the Cluster, Argo thinks it's been implicitly created by the Cluster and won't touch it.

Yeah, I'm just throwing around ideas regarding who owns what. In the context of Helm et al. it's not really a problem, as resource creation/deletion is ostensibly atomic.

I'm also looking into an interim solution in Argo where you can mark a resource as "totally owned by you, ignore what the underlying app is doing to it". Once I can find out where in the code the magic happens, that is!

@fabriziopandini
Member

/triage accepted
This is similar to the ongoing discussion on #4014 (we can eventually consider deduping/merging).

The TL;DR is that this behavior dates back to the beginning of the project, and we need to rebuild all the required knowledge before considering changes to it because, as we learned the hard way in recent cycles, the codebase is full of assumptions about owner refs 😞
Most probably this also requires a proposal/an API bump to signal the change to all consumers.

/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
This is similar to the ongoing discussion on #4014 (we can eventually consider deduping/merging).

The TL;DR is that this behavior dates back to the beginning of the project, and we need to rebuild all the required knowledge before considering changes to it because, as we learned the hard way in recent cycles, the codebase is full of assumptions about owner refs 😞
Most probably this also requires a proposal/an API bump to signal the change to all consumers.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the triage/accepted and help wanted labels and removed the needs-triage label on Mar 21, 2023
@javanthropus

My company is running into this and has so far used much the same workarounds discussed here to clean up unused resources. I also opened a Slack thread for discussion before learning about this ticket.

To summarize from the thread, two annotations could be added to these resources, depending on the need:

  • One to tell CAPI to completely disable management of ownerReferences for a given resource. In our case this would be applied to MachineDeployment resources, but it could be placed on any resource if desired.
  • One to tell CAPI to set the ownerReferences to the KubeadmControlPlane and MachineSet resources that directly depend on template resources such as KubeadmConfigTemplate, AWSMachineTemplate, OpenstackMachineTemplate, etc. This behavior change makes it possible to keep the templates around until they are no longer referenced, even if a tool such as Argo stops trying to apply them, which is important to keep MachineSet resources from becoming invalid due to a template resource being prematurely pruned.

It's important to note that in the case of KubeadmControlPlane, the ownerReferences on a template need to be removed as soon as the template is replaced in the KubeadmControlPlane spec. This would allow Argo to perform cleanup, assuming no other resources directly depend on the template.

Because the annotations are opt-in, they could be added without a major API bump, and they would allow those of us who need these changes now to test what could eventually become the default behavior in a future API version.
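
A minimal sketch of what the two opt-in annotations might look like; the annotation names are invented for illustration and do not exist in CAPI today:

```yaml
# Hypothetical annotation 1: ask CAPI not to manage ownerReferences on this
# resource at all, so the GitOps tool stays in control of its lifecycle.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0                                             # placeholder
  annotations:
    cluster.x-k8s.io/disable-owner-references: "true"    # invented name
---
# Hypothetical annotation 2: ask CAPI to point a template's ownerReferences at
# the MachineSet/KubeadmControlPlane that directly references it, instead of
# the Cluster, and to drop the reference once the template is no longer in use.
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: md-0-bootstrap                                   # placeholder
  annotations:
    cluster.x-k8s.io/owner-is-referencer: "true"         # invented name
```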

@fabriziopandini
Member

I'm a little bit perplexed about both options, because having two or three different owner reference chains for the same set of objects will be really confusing for users, and it will make not only the code more complex, but also the documentation and troubleshooting.

Also, have we considered the impact on clusterctl move, or on our E2E tests for owner references?

@javanthropus

I would honestly be surprised if most users are even aware of the owner references on these resources until they run into an edge case such as this. That said, the first annotation in my proposal does not cause additional owner references to be applied; it would prevent any owner references from being added to annotated resources at all.

The second annotation would change the owner reference of a resource such as AWSMachineTemplate to be the MachineSet or KubeadmControlPlane resource that directly references it rather than the Cluster resource as a whole. This could lead to a given AWSMachineTemplate, for instance, having multiple owner references, but this is also a supported feature of the owner references list, so I'm not sure why there would be concern about it. The net change is that there would be a DAG rather than a tree rooted at the Cluster resource.
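
For instance, under that proposal a template referenced by two MachineSets would carry both as owners (all values below are placeholders, and the API version varies by provider release), and it would only become prunable once the last owning reference is gone:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
  name: workers-template                     # placeholder
  ownerReferences:                           # a DAG: two owners instead of the single Cluster root
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineSet
    name: md-0-abc12                         # placeholder
    uid: 11111111-1111-1111-1111-111111111111
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineSet
    name: md-1-def34                         # placeholder
    uid: 22222222-2222-2222-2222-222222222222
```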

It's good that you bring up clusterctl move and the E2E tests. These would clearly need updates to account for these changes. There may be other tooling that needs to be updated as well; however, deploying these features as experimental annotations would facilitate wider testing and make it easier to discover these gaps. Being experimental, the feature could also be backed out without affecting a large number of users should an insurmountable issue be discovered.

The problem we're trying to solve here is that there are two different cases for a tool such as Argo to address if we do nothing about how CAPI works. In one case Argo needs to always ignore the owner references on a resource it was managing (the MachineDeployment). In the other case it needs to obey the owner references until some other resource that is not in the owner references is removed (the MachineSet) and/or no longer references them (the KubeadmControlPlane).

While the first case is easy enough to code up via general logic in Argo, the second case requires Argo to have intimate knowledge of how CAPI itself and its various resources work and are tied together. Since that logic does not belong in a general tool such as Argo, the result is that users must configure Argo either to risk prematurely deleting resources, resulting in invalid cluster configurations, or to never delete them, leading to a mess of unneeded resources over time.

@vincepri
Member

vincepri commented Aug 7, 2023

From a Kubernetes garbage collector perspective, the way Cluster API currently works is perfectly in line with what the GC expects: an object can have many owner references, but only one can be a controller. Given how Cluster API is wired, the deletion logic follows that pattern; Argo in this case should allow deletion to happen once the controller owner reference is deleted.
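
In ownerReferences terms, the rule described here looks roughly like this (a generic, trimmed example; names and UIDs are placeholders): an object may list several owners, at most one of them is marked controller: true, and the garbage collector only deletes the object once all of its owners are gone:

```yaml
metadata:
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineSet
    name: md-0-abc12                         # placeholder
    uid: 11111111-1111-1111-1111-111111111111
    controller: true                         # at most one owner may be the controller
    blockOwnerDeletion: true
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: my-cluster                         # placeholder
    uid: 22222222-2222-2222-2222-222222222222
```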

@vincepri
Member

vincepri commented Aug 7, 2023

/triage needs-information

@k8s-ci-robot added the triage/needs-information label on Aug 7, 2023
@javanthropus

Argo would do so, but the CAPI controller never removes the owner reference from resources such as MachineDeployment, AWSMachineTemplate, KubeadmConfigTemplate, and similar. My proposal asks that it never add the owner reference to MachineDeployment resources, and that it change the owner reference of a template to the MachineSet resource that actually needs it. K8s GC and Argo pruning should work well together that way.

@fabriziopandini
Member

/priority backlog

@k8s-ci-robot added the priority/backlog label on Apr 11, 2024
@fabriziopandini removed the triage/needs-information label on Apr 12, 2024