Cluster stays in Failed state after Credentials rotations #2097

Closed
qeqar opened this issue May 24, 2024 · 3 comments · Fixed by #2099
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@qeqar
Contributor

qeqar commented May 24, 2024

/kind bug

The Cluster object stays in the Failed state after a credentials problem has been resolved.

  failureMessage: |
      Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1beta1, Kind=OpenStackCluster with name "rotate-creds-cnrwq": failed to reconcile external network: failed to get external network: Unable to re-authenticate: Expected HTTP response code [200] when accessing [GET https://url/v2.0/networks/54258498-a513-47da-9369-1a644e4be692], but got 401 instead
      {"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}}: Resource not found: [POST https://url/v3/auth/tokens], error message: {"error":{"code":404,"message":"Could not find Application Credential: OLD ID.","title":"Not Found"}}

What steps did you take and what happened:

1. Delete the application credentials in use and wait for the failure to occur.
2. Create new application credentials and update all resources.
3. Verify that all actions are possible again, both on the management cluster and in the workload cluster.

What did you expect to happen:
The Cluster returns to a healthy state once the credentials are fixed.

Anything else you would like to add:
It looks like credential errors are handled as terminal errors rather than transient errors.
This seems similar to a problem we had with the Machines too.

Code for the machine problem as an idea how to fix: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/7a19fb6f038[…]a31df228877a446b9b99/controllers/openstackmachine_controller.go

Code in Openstack cluster controller with the problem: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/7a19fb6f038[…]a31df228877a446b9b99/controllers/openstackcluster_controller.go

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): v0.10.3
  • Cluster-API version: v1.7.3
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 24, 2024
@qeqar
Contributor Author

qeqar commented May 24, 2024

Just for reference, and to help anyone with a broken cluster, a manual temporary fix is:
kubectl patch cluster CLUSTERNAME --subresource=status --type='merge' -p'{"status":{"failureReason": null, "failureMessage": null}}'
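
If several clusters are affected, the same patch could be applied in a loop. The following is a rough Go sketch under a few assumptions: `kubectl` is on the PATH and already points at the management cluster, cluster names are passed as command-line arguments, and the helper names `buildPatch` and `patchArgs` are made up for this example. It builds the same JSON merge patch as the command above (a `null` value in a merge patch removes the field) and shells out to `kubectl` once per cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
)

// buildPatch returns the JSON merge patch that clears both failure fields.
func buildPatch() (string, error) {
	patch := map[string]any{
		"status": map[string]any{
			"failureReason":  nil,
			"failureMessage": nil,
		},
	}
	b, err := json.Marshal(patch)
	return string(b), err
}

// patchArgs assembles the kubectl arguments used in the manual fix.
func patchArgs(cluster, patch string) []string {
	return []string{
		"patch", "cluster", cluster,
		"--subresource=status",
		"--type=merge",
		"-p", patch,
	}
}

func main() {
	patch, err := buildPatch()
	if err != nil {
		fmt.Fprintln(os.Stderr, "building patch failed:", err)
		os.Exit(1)
	}
	// Each argument is treated as the name of one Cluster object.
	for _, name := range os.Args[1:] {
		cmd := exec.Command("kubectl", patchArgs(name, patch)...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintf(os.Stderr, "patching %s failed: %v\n", name, err)
		}
	}
}
```

This only clears the status fields; it does not address why the controller set them, so the failure can reappear if the underlying problem is still present.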

@huxcrux
Contributor

huxcrux commented May 24, 2024

I noticed yesterday that the same behavior seems to occur if your management cluster has DNS problems. I had ~10 clusters in the Failed state after one of our DNS servers experienced problems (the failureMessage indicated a failed DNS lookup).

I have not had time to dig deeper; however, I can confirm the patch above works just fine :)

@qeqar
Contributor Author

qeqar commented May 24, 2024

OK, it looks like the same problem, and it probably has a similar solution. I will keep it in mind when trying to fix this one.

Projects
Status: Done
3 participants