Failed to delete machinePool for unreachable cluster #10544
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority awaiting-more-evidence
cc @mboersma @willie-yao @Jont828
Note: an unreachable cluster is handled "gracefully" in the Machine controller; I would expect the same to happen in MachinePools too.
/assign
Thanks for the follow-up, @fabriziopandini. You mentioned handling an unreachable cluster "gracefully" in the MachinePool controller; would you elaborate on that? I was thinking we should add an unreachable state to the Cluster CR and give it a timeout for actions like force delete. Does that align with what you mean?
@fabriziopandini is referring to the code in the Machine controller, where node deletion is explicitly skipped on certain errors: cluster-api/internal/controllers/machine/machine_controller.go Lines 347 to 356 in 47ec791
A similar behaviour could probably be implemented for the MachinePool case.
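For illustration, the pattern there looks roughly like the following sketch (a paraphrase, not the exact code at the referenced lines; the sentinel error names are hypothetical):

```go
package sketch

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical sentinel errors, loosely modeled on the Machine controller's
// internal deletion-precondition errors; the real names may differ.
var (
	errClusterIsBeingDeleted      = errors.New("cluster is being deleted")
	errControlPlaneIsBeingDeleted = errors.New("control plane is being deleted")
)

// reconcileDeleteNode sketches the "graceful skip" pattern: when deleting the
// node is not allowed for a benign reason, log and continue with Machine
// deletion instead of returning an error that would block it forever.
func reconcileDeleteNode(ctx context.Context, deleteNodeAllowedErr error) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)
	if deleteNodeAllowedErr != nil {
		switch {
		case errors.Is(deleteNodeAllowedErr, errClusterIsBeingDeleted),
			errors.Is(deleteNodeAllowedErr, errControlPlaneIsBeingDeleted):
			// The workload cluster is going away anyway; skip node deletion
			// rather than getting stuck on an unreachable API server.
			log.Info("Skipping deletion of Kubernetes Node associated with Machine", "reason", deleteNodeAllowedErr.Error())
			return ctrl.Result{}, nil
		default:
			return ctrl.Result{}, deleteNodeAllowedErr
		}
	}
	// ...otherwise proceed with draining and deleting the node.
	return ctrl.Result{}, nil
}
```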
Agreed, that makes sense. The fix I pushed checks here whether the Cluster is being deleted; in that case there is no need to create a client.
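A rough sketch of that check, assuming the Cluster object is already available in the MachinePool controller's delete path (the helper name is hypothetical):

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// shouldSkipNodeCleanup sketches the pushed fix: when the owning Cluster is
// already being deleted, skip building a workload-cluster client entirely,
// since its API server may be gone or unreachable.
func shouldSkipNodeCleanup(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool) bool {
	log := ctrl.LoggerFrom(ctx)
	if !cluster.DeletionTimestamp.IsZero() {
		log.Info("Cluster is being deleted, skipping node cleanup for MachinePool", "machinepool", mp.Name)
		return true
	}
	return false
}
```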
Shouldn't the cluster deprovision process impose ordering and make sure MachinePools are gone before tearing down everything else?
It should, and it does.
Just to be 100% clear: I think it's okay to adjust the MachinePool controller so it is still able to go through deletion even if the workload cluster is unreachable (like the Machine controller). What we are not going to support are scenarios where someone deletes the control plane / kubeconfig before everything else (which is mentioned as one of the scenarios in the issue description: "or the cluster kubeConfig secret is deleted"). The only way Cluster deletion in CAPI works today is by deleting the Cluster object; the Cluster controller will then make sure everything is deleted in the correct order. Deletions in random order triggered by tools like Argo are not supported.
I agree. The most important thing is to keep the deletion order in place as mentioned, and to have each custom resource handle its own timeout, so that even with an accidental random-order deletion the CR doesn't get stuck (depending on the case).
What steps did you take and what happened?
When uninstalling/deprovisioning a cluster while the cluster is unreachable, or after the cluster kubeConfig secret has been deleted (it is managed via the controlPlaneRef), the MachinePool controller raises "failed to create cluster accessor" errors.
This issue was initially raised as cluster-api-provider-aws issue number 4936: failing to clean up the MachinePool CRs in a GitOps workflow using ArgoCD.
The workaround is to clean up the MachinePool CR by removing its finalizer manually, e.g. with `kubectl patch machinepool <name> --type=merge -p '{"metadata":{"finalizers":null}}'`.
What did you expect to happen?
I expect to be able to delete the MachinePool CRs without needing to remove the finalizer manually.
Cluster API version
v1.6.3
registry.k8s.io/cluster-api/cluster-api-controller:v1.6.3
Kubernetes version
Kubernetes Version: v1.25.7+eab9cc9
Anything else you would like to add?
Looking at the MachinePool controller's reconcileDelete, one fix could be to check whether the error returned at line 263 is a "failed to create cluster accessor" error (the same error as in the logs above) and, if so, tolerate it. The changes could look like the sketch below.
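A minimal, hedged sketch of this option (the surrounding reconcileDelete shape is simplified, and isClusterAccessorError is a hypothetical predicate, since the accessor error is not an exported sentinel; a typed error would be preferable to string matching):

```go
package sketch

import (
	"context"
	"strings"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// isClusterAccessorError is a hypothetical check for the "failed to create
// cluster accessor" error seen in the logs.
func isClusterAccessorError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "failed to create cluster accessor")
}

// reconcileDelete sketches the proposed fix: if node cleanup fails only
// because the workload cluster is unreachable, skip it and still remove the
// finalizer so the MachinePool CR can be deleted.
func reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool,
	deleteNodes func(context.Context, *clusterv1.Cluster, *expv1.MachinePool) error) error {
	log := ctrl.LoggerFrom(ctx)
	if err := deleteNodes(ctx, cluster, mp); err != nil {
		if !isClusterAccessorError(err) {
			return err
		}
		log.Info("Workload cluster is unreachable, skipping node cleanup", "err", err.Error())
	}
	controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
	return nil
}
```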
Alternatively, if the MachinePool has an infrastructureRef, the responsibility for deleting the node could be shifted to the infrastructure CR instead of the MachinePool (this may raise a backward-compatibility issue). The changes could look like the sketch below.
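A hedged sketch of this alternative (the delegation is illustrated by simply not calling node cleanup when an infrastructureRef is set; the function shape is hypothetical):

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// reconcileDeleteAlt sketches the alternative: when an infrastructure
// reference is set, leave node deletion to the infrastructure provider's CR
// and only remove the MachinePool finalizer here.
func reconcileDeleteAlt(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool,
	deleteNodes func(context.Context, *clusterv1.Cluster, *expv1.MachinePool) error) error {
	log := ctrl.LoggerFrom(ctx)
	if mp.Spec.Template.Spec.InfrastructureRef.Name != "" {
		// Delegate node cleanup to the infrastructure CR's own deletion flow.
		log.Info("MachinePool has an infrastructureRef, delegating node deletion to the infrastructure provider")
	} else if err := deleteNodes(ctx, cluster, mp); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
	return nil
}
```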
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.