No Control Plane machines came into existence. #10356
Comments
/triage accepted
Thanks for reporting!

Would be good to add a link, either to a specific failed job or to k8s-triage filtered down to this failure, just to make it easier to find a failed job.

/priority important-soon
I did hit this issue in a local setup. However, it's quite hard to triage, because the machine had already been replaced by a new one (I guess because MHC was doing its thing) and the cluster then started successfully. I still have things around if there are ideas for filtering information.
I was able to hit it again and triage a bit. It turns out that the node itself came up, except for the parts that try to reach the load-balanced control-plane endpoint. TL;DR: the haproxy load balancer did not forward traffic to the control-plane node. My current theory is:
I was able to "fix" the issue in this case by again sending

I'm currently testing the following fix locally: reading back and comparing the config file in CAPD after writing it and before reloading haproxy.

Test setup:
So it only runs a single test. I used prowjob on kind to create a kind cluster and pod YAML, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, probably other things too). I then run the loop using

All code changes for my setup are here for reference: dfe9d5e. I did some optimisations, like packing all required images to
Fixes are merged; let's check in a week or so whether the error occurs again.
The merged fix did not help.
For reference, I did hit the same issue (the CAPD load balancer config not being active), as described in this comment, on a
I'll investigate this issue further.
Which jobs are flaking?
Which tests are flaking?
Since when has it been flaking?
Minor flakes with this error have been happening for a long time.
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.