Failed to delete machinePool for unreachable cluster #10544
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority awaiting-more-evidence
cc @mboersma @willie-yao @Jont828
Note: an unreachable cluster is handled "gracefully" in the Machine controller; I would expect the same to happen in MachinePools too.
/assign
Thanks for the follow-up, @fabriziopandini. You mentioned handling an unreachable cluster "gracefully" in the MachinePool controller; would you elaborate on that? I was thinking we should add an unreachable state to the Cluster CR and give it a timeout for actions like force delete. Does that align with what you mean?
@fabriziopandini is referring to the code in the Machine controller, where node deletion is explicitly skipped on certain errors: cluster-api/internal/controllers/machine/machine_controller.go Lines 347 to 356 in 47ec791
A similar behaviour could probably be implemented for the MachinePool case.
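For illustration, the pattern there looks roughly like the following sketch (a paraphrase, not the exact code at the referenced lines; the sentinel error names are hypothetical):

```go
package sketch

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical sentinel errors, loosely modeled on the Machine controller's
// internal deletion-precondition errors; the real names may differ.
var (
	errClusterIsBeingDeleted      = errors.New("cluster is being deleted")
	errControlPlaneIsBeingDeleted = errors.New("control plane is being deleted")
)

// reconcileDeleteNode sketches the "graceful skip" pattern: when deleting the
// node is not allowed for a benign reason, log and continue with Machine
// deletion instead of returning an error that would block it forever.
func reconcileDeleteNode(ctx context.Context, deleteNodeAllowedErr error) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)
	if deleteNodeAllowedErr != nil {
		switch {
		case errors.Is(deleteNodeAllowedErr, errClusterIsBeingDeleted),
			errors.Is(deleteNodeAllowedErr, errControlPlaneIsBeingDeleted):
			// The workload cluster is going away anyway; skip node deletion
			// rather than getting stuck on an unreachable API server.
			log.Info("Skipping deletion of Kubernetes Node associated with Machine", "reason", deleteNodeAllowedErr.Error())
			return ctrl.Result{}, nil
		default:
			return ctrl.Result{}, deleteNodeAllowedErr
		}
	}
	// ...otherwise proceed with draining and deleting the node.
	return ctrl.Result{}, nil
}
```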
Agreed, that makes sense. The fix I pushed checks here whether the Cluster is being deleted; in that case there is no need to create a client.
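A rough sketch of that check, assuming the Cluster object is already available in the MachinePool controller's delete path (the helper name is hypothetical):

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// shouldSkipNodeCleanup sketches the pushed fix: when the owning Cluster is
// already being deleted, skip building a workload-cluster client entirely,
// since its API server may be gone or unreachable.
func shouldSkipNodeCleanup(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool) bool {
	log := ctrl.LoggerFrom(ctx)
	if !cluster.DeletionTimestamp.IsZero() {
		log.Info("Cluster is being deleted, skipping node cleanup for MachinePool", "machinepool", mp.Name)
		return true
	}
	return false
}
```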
Shouldn't the cluster deprovision process impose ordering and make sure MachinePools are gone before tearing down everything else?
It should, and it does.
Just to be 100% clear: I think it's okay to adjust the MachinePool controller so it is still able to go through deletion even if the workload cluster is unreachable (like the Machine controller). What we are not going to support are scenarios where someone deletes the control plane / kubeconfig before everything else (which is mentioned as one of the scenarios in the issue description: "or the cluster kubeConfig secret is deleted"). The only way Cluster deletion in CAPI works today is by deleting the Cluster object; the Cluster controller will then make sure everything is deleted in the correct order. Deletions in random order triggered by tools like Argo are not supported.
I agree. The most important thing is to keep the deletion order in place as mentioned, and to have each custom resource handle its own timeout, so that even with an accidental random-order deletion the CR doesn't get stuck (depending on the case).
What steps did you take and what happened?
When uninstalling/deprovisioning a cluster while the cluster is unreachable, or after the cluster kubeConfig secret has been deleted (it is managed via the controlPlaneRef), the MachinePool controller raises "failed to create cluster accessor" errors.
This issue was initially raised as cluster-api-provider-aws issue number 4936: failing to clean up the MachinePool CRs in a GitOps workflow using ArgoCD.
The workaround is to clean up the MachinePool CR by removing its finalizer manually, e.g. with `kubectl patch machinepool <name> --type=merge -p '{"metadata":{"finalizers":null}}'`.
What did you expect to happen?
I expect to be able to delete the MachinePool CRs without needing to remove the finalizer manually.
Cluster API version
v1.6.3
registry.k8s.io/cluster-api/cluster-api-controller:v1.6.3
Kubernetes version
Kubernetes Version: v1.25.7+eab9cc9
Anything else you would like to add?
Looking at the MachinePool controller's reconcileDelete, one fix could be to check whether the error returned at line 263 is a "failed to create cluster accessor" error (the same error as in the logs above) and, if so, tolerate it. The changes could look like the sketch below.
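A minimal, hedged sketch of this option (the surrounding reconcileDelete shape is simplified, and isClusterAccessorError is a hypothetical predicate, since the accessor error is not an exported sentinel; a typed error would be preferable to string matching):

```go
package sketch

import (
	"context"
	"strings"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// isClusterAccessorError is a hypothetical check for the "failed to create
// cluster accessor" error seen in the logs.
func isClusterAccessorError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "failed to create cluster accessor")
}

// reconcileDelete sketches the proposed fix: if node cleanup fails only
// because the workload cluster is unreachable, skip it and still remove the
// finalizer so the MachinePool CR can be deleted.
func reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool,
	deleteNodes func(context.Context, *clusterv1.Cluster, *expv1.MachinePool) error) error {
	log := ctrl.LoggerFrom(ctx)
	if err := deleteNodes(ctx, cluster, mp); err != nil {
		if !isClusterAccessorError(err) {
			return err
		}
		log.Info("Workload cluster is unreachable, skipping node cleanup", "err", err.Error())
	}
	controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
	return nil
}
```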
Alternatively, if the MachinePool has an infrastructureRef, the responsibility for deleting the node could be shifted to the infrastructure CR instead of the MachinePool (this may raise a backward-compatibility issue). The changes could look like the sketch below.
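A hedged sketch of this alternative (the delegation is illustrated by simply not calling node cleanup when an infrastructureRef is set; the function shape is hypothetical):

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// reconcileDeleteAlt sketches the alternative: when an infrastructure
// reference is set, leave node deletion to the infrastructure provider's CR
// and only remove the MachinePool finalizer here.
func reconcileDeleteAlt(ctx context.Context, cluster *clusterv1.Cluster, mp *expv1.MachinePool,
	deleteNodes func(context.Context, *clusterv1.Cluster, *expv1.MachinePool) error) error {
	log := ctrl.LoggerFrom(ctx)
	if mp.Spec.Template.Spec.InfrastructureRef.Name != "" {
		// Delegate node cleanup to the infrastructure CR's own deletion flow.
		log.Info("MachinePool has an infrastructureRef, delegating node deletion to the infrastructure provider")
	} else if err := deleteNodes(ctx, cluster, mp); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(mp, expv1.MachinePoolFinalizer)
	return nil
}
```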
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.