Cluster stays in Failed state after Credentials rotations #2097

qeqar · 2024-05-24T08:24:13Z

/kind bug

The Cluster object stays in failed state after a credentials problem is resolved.

  failureMessage: |
      Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1beta1, Kind=OpenStackCluster with name "rotate-creds-cnrwq": failed to reconcile external network: failed to get external network: Unable to re-authenticate: Expected HTTP response code [200] when accessing [GET https://url/v2.0/networks/54258498-a513-47da-9369-1a644e4be692], but got 401 instead
      {"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}}: Resource not found: [POST https://url/v3/auth/tokens], error message: {"error":{"code":404,"message":"Could not find Application Credential: OLD ID.","title":"Not Found"}}

What steps did you take and what happened:

Delete the application credentials in use and wait for the failure to occur
Create new application credentials and update all resources
Verify that all actions are possible again, on the mgmt cluster or in the cluster

What did you expect to happen:
The Cluster returns to a healthy state once the credentials are fixed

Anything else you would like to add:
It looks like credentials errors are handled as terminal errors and not as transient errors.
This seems similar to a problem we had with the Machines too.

Code for the machine problem, as an idea for how to fix it: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/7a19fb6f038[…]a31df228877a446b9b99/controllers/openstackmachine_controller.go

Code in the OpenStack cluster controller with the problem: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/7a19fb6f038[…]a31df228877a446b9b99/controllers/openstackcluster_controller.go

Environment:

Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): v0.10.3
Cluster-API version: v1.7.3

qeqar · 2024-05-24T08:36:23Z

Just for reference, and to help anyone with a broken cluster:
A temporary manual fix is: kubectl patch cluster CLUSTERNAME --subresource=status --type='merge' -p '{"status":{"failureReason": null, "failureMessage": null}}'

huxcrux · 2024-05-24T08:42:06Z

I noticed yesterday that the same behavior seems to occur if your MGMT cluster has problems with DNS. I had ~10 clusters in Failed state after one of our DNS servers experienced problems (meaning failureMessage indicated a failed DNS lookup).

I have not had time to dig deeper; however, I can confirm the patch above works just fine :)

qeqar · 2024-05-24T09:34:07Z

OK, it looks like the same problem, and it probably has a similar solution. I will keep it in mind when trying to fix this one.

k8s-ci-robot added the kind/bug label May 24, 2024

qeqar mentioned this issue May 24, 2024

🐛 if the openstackcluster was ready, we don't want to set a terminalError #2099

Merged

k8s-ci-robot closed this as completed in #2099 Jun 7, 2024
