Investigate self-hosted control-plane upgrades to exceed the 1 minute retryableOperationTimeout #7360
Comments
Maybe related to #5477
/triage accepted
Helpful new test to iterate on this: #7387
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What steps did you take and what happened:
While developing #7239, which introduced self-hosted cluster upgrades, I noticed that the 1-minute retryableOperationTimeout in cluster-api/test/framework/cluster_proxy.go (line 53 in 31ebd83) gets hit.
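The failure is a retry loop running out of its time budget. Below is a minimal, stdlib-only Go sketch of that retry-until-deadline pattern; retryUntil and failUntil are hypothetical helpers, not cluster-api code, and the durations are scaled down from 1 minute to milliseconds so the example runs instantly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryUntil calls op every interval until it succeeds or until timeout has
// elapsed since the first attempt. Hypothetical helper that mimics a
// "retry until retryableOperationTimeout" loop; it is not cluster-api code.
func retryUntil(timeout, interval time.Duration, op func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			// Budget exhausted: surface the last error together with the timeout.
			return fmt.Errorf("operation did not succeed within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// failUntil returns an op that fails until recoverAfter has passed since it
// was created, simulating an API server that is unreachable for a while.
func failUntil(recoverAfter time.Duration) func() error {
	start := time.Now()
	return func() error {
		if time.Since(start) < recoverAfter {
			return errors.New("connection refused")
		}
		return nil
	}
}

func main() {
	// Scaled down: 50ms stands in for the 1-minute budget.
	budget := 50 * time.Millisecond
	fmt.Println(retryUntil(budget, 5*time.Millisecond, failUntil(20*time.Millisecond)))
	fmt.Println(retryUntil(budget, 5*time.Millisecond, failUntil(200*time.Millisecond)))
}
```

With a 50ms budget, a simulated 20ms outage is absorbed by the retries, while a 200ms outage exhausts the budget and returns the last error wrapped with the timeout, which is how a disruption longer than retryableOperationTimeout would surface.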
This happens in GetControlPlaneMachinesByCluster (cluster-api/test/framework/machine_helpers.go, line 109 in 63a959a), where a 1-minute timeout is not enough for the call to succeed.
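One way to check whether the 1-minute budget is really too short is to probe the API server periodically during the upgrade (for example its /readyz endpoint through the load balancer) and measure the longest continuous failure window. The sketch below only covers the analysis of already-collected samples; the probe type, the sample values, and longestOutage are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// probe is one hypothetical health-check result.
type probe struct {
	at time.Duration // offset from the start of the upgrade
	ok bool          // whether the API server answered successfully
}

// longestOutage returns the longest window between a failed probe and the
// next successful one. Probes must be sorted by time. An outage still in
// progress at the last probe is measured up to that last probe.
func longestOutage(probes []probe) time.Duration {
	var longest, start time.Duration
	inOutage := false
	for _, p := range probes {
		switch {
		case !p.ok && !inOutage:
			// First failure of a new outage window.
			inOutage, start = true, p.at
		case p.ok && inOutage:
			// Recovery: close the window and keep the maximum.
			inOutage = false
			if d := p.at - start; d > longest {
				longest = d
			}
		}
	}
	if inOutage && len(probes) > 0 {
		if d := probes[len(probes)-1].at - start; d > longest {
			longest = d
		}
	}
	return longest
}

func main() {
	budget := time.Minute // the 1-minute retry budget discussed above
	probes := []probe{
		{0, true}, {10 * time.Second, false}, {40 * time.Second, false},
		{85 * time.Second, true}, {95 * time.Second, false}, {100 * time.Second, true},
	}
	outage := longestOutage(probes)
	fmt.Printf("longest outage %s, exceeds budget: %v\n", outage, outage >= budget)
}
```

The measured window is only as precise as the probe interval: it runs from the first failed probe to the first successful probe after it.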
Observation 1: According to the logs, this happens during the self-hosted cluster's control-plane upgrade, which seems to disrupt API server reachability. Possible causes are etcd members joining or leaving, HAProxy config reloads, or slow HAProxy health checks.
Observation 2: The CAPI controllers also experience leader election failures during self-hosted control-plane upgrades.
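The two observations fit the same picture: if the API server is unreachable for longer than a controller's lease renew deadline, the leader cannot renew its lease and steps down. The following is a deliberately simplified Go model of that behavior, not the client-go implementation; keepsLeadership is hypothetical, and the 10s renew deadline and 2s retry period are illustrative assumptions rather than values read from the CAPI deployment.

```go
package main

import (
	"fmt"
	"time"
)

// keepsLeadership is a simplified, hypothetical model of lease renewal: the
// leader tries to renew every retryPeriod, a renewal fails while the API
// server is unreachable in [outageStart, outageEnd), and leadership is lost
// once no renewal has succeeded for renewDeadline or longer. It ignores
// details of the real client-go implementation such as request latency.
func keepsLeadership(renewDeadline, retryPeriod, outageStart, outageEnd, horizon time.Duration) bool {
	lastSuccess := time.Duration(0)
	for now := retryPeriod; now <= horizon; now += retryPeriod {
		if now >= outageStart && now < outageEnd {
			// Renewal attempt failed; give up once the deadline is reached.
			if now-lastSuccess >= renewDeadline {
				return false
			}
			continue
		}
		lastSuccess = now
	}
	return true
}

func main() {
	renew, retry := 10*time.Second, 2*time.Second // illustrative assumptions
	fmt.Println("5s outage:", keepsLeadership(renew, retry, 20*time.Second, 25*time.Second, time.Minute))
	fmt.Println("30s outage:", keepsLeadership(renew, retry, 20*time.Second, 50*time.Second, time.Minute))
}
```

Under these assumptions a 5s outage is tolerated while a 30s outage costs the leader its lease, so observed leader election failures suggest disruptions lasting at least roughly the renew deadline.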
What did you expect to happen:
A 1-minute timeout to be enough for GetControlPlaneMachinesByCluster.
Anything else you would like to add:
This should be reproducible (as a flake) by resetting retryableOperationTimeout in cluster-api/test/framework/cluster_proxy.go (line 53 in 31ebd83) to 1 * time.Minute and running the self-hosted tests.
Follow-up from #7239 (comment).
/kind bug