Investigate self-hosted control-plane upgrades to exceed the 1 minute retryableOperationTimeout #7360

chrischdi · 2022-10-07T06:18:42Z

What steps did you take and what happened:

While developing #7239, which introduced self-hosted cluster upgrades, I noticed that the 1 minute timeout for

retryableOperationTimeout

cluster-api/test/framework/cluster_proxy.go

Line 53 in 31ebd83

retryableOperationTimeout = 3 * time.Minute

gets hit.

This happens in GetControlPlaneMachinesByCluster

cluster-api/test/framework/machine_helpers.go

Line 109 in 63a959a

Eventually(func() error {

where a timeout of 1 minute is not enough for the call to succeed.

Observation 1: According to the logs, this happens during the self-hosted cluster's control-plane upgrade, which seems to disrupt API server reachability. This could be due to etcd members joining or leaving, HAProxy config reloads, or slow HAProxy health checks.
Observation 2: The CAPI controllers also have leader election failures during self-hosted control-plane upgrades.

What did you expect to happen:

1 minute to be enough for GetControlPlaneMachinesByCluster.

Anything else you would like to add:

This should be reproducible (as a flake) by resetting

cluster-api/test/framework/cluster_proxy.go

Line 53 in 31ebd83

retryableOperationTimeout = 3 * time.Minute

to 1 * time.Minute and running the self-hosted tests.

Follow-up from #7239 (comment)

Environment:

Cluster-api version: main
minikube/kind version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

/kind bug

chrischdi · 2022-10-07T06:27:01Z

Maybe related to #5477

sbueringer · 2022-10-07T07:52:59Z

/triage accepted

chrischdi · 2022-10-11T16:10:13Z

Helpful new test to iterate on this: #7387

k8s-triage-robot · 2023-01-09T17:00:21Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-01-20T04:11:48Z

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot · 2024-02-19T05:08:34Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-03-20T05:59:00Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-03-20T05:59:04Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the kind/bug and needs-triage labels Oct 7, 2022

chrischdi mentioned this issue Oct 7, 2022

✨ adjust self-hosted e2e test to also upgrade the cluster #7239

Merged

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Oct 7, 2022

k8s-ci-robot added the lifecycle/stale label Jan 9, 2023

k8s-ci-robot added the needs-triage label and removed the triage/accepted label Jan 20, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 19, 2024

k8s-ci-robot closed this as not planned Mar 20, 2024
