Tolerate temporary errors from etcdserver #11401

dims · 2022-09-30T22:24:12Z

What this PR does / why we need it:

There are cases where the etcdserver is temporarily unavailable and the
errors that we get back from kube-apiserver reflect that. It looks like
we currently bail out immediately when these errors happen. We should
instead retry until the timeout is reached when this sort of error happens.

Fixes #9502
Fixes #7637

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

Special notes for your reviewer:

With this patch, temporary errors such as etcdserver leader changes are no longer treated as terminal. We continue to retry until the specified timeout.

Note that there are things that can be done on the k8s side; discussion is ongoing there as well:
kubernetes/kubernetes#112152

If applicable:

This PR does not change any functionality, so no updates to the documentation
Currently there aren't any unit tests in this package, but I added one for just the new method isServiceUnavailable
No changes to functionality, so no incompatibilities


dims · 2022-10-03T01:43:02Z

@hickeyma this is ready now!

mattfarina · 2022-10-03T14:11:15Z

@dims thanks for looking into the issues here.

Is the Kubernetes API supposed to be a leaky abstraction? Are clients expected to work with etcd? I'm asking about intent and mid-term intent. I'm wondering whether this code is something we will need to maintain long term or whether this is a short-term situation.

dims · 2022-10-03T14:55:56Z

@mattfarina we'll need a KEP upstream. I've asked some folks who were pushing for this earlier to do more in the 1.27 cycle (not 1.26), so until that KEP is discussed/reviewed/approved we will need this.

We'll also need this for as long as the versions of Kubernetes supported by Helm have the old-style leaky abstraction.

technosophos

This seems to be the appropriate stop-gap for this error.

pkg/kube/wait.go

hickeyma

Thanks @dims for tracking these issues down and providing this interim solution. It will be helpful to users.

dims · 2022-10-05T11:54:35Z

thanks @hickeyma !

cenkalti · 2022-10-18T14:19:43Z

Hey @technosophos @hickeyma. Unfortunately, this fix does not solve the issue. Can you take a look at my new fix? #11426

sruthiwander · 2023-02-23T03:00:00Z

@dims Can this be backported to 3.2 please?

dims · 2023-02-23T04:28:26Z

@sruthiwander not this one! it was reverted, you will need #11426

Also, https://github.com/helm/helm/releases/tag/v3.2.0 is practically ancient; I don't think the Helm maintainers will go back that far. See https://helm.sh/docs/topics/release_policy/

pull-request-size bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Sep 30, 2022

dims force-pushed the retry-for-issue-9502 branch 2 times, most recently from c2bbcb0 to 3ca36c0 Compare October 1, 2022 21:29

dims changed the title from "[WIP][IGNORE ME] Retry a few times, do not give up so soon!" to "Tolerate temporary errors from etcdserver" Oct 1, 2022

dims marked this pull request as ready for review October 1, 2022 21:36

dims force-pushed the retry-for-issue-9502 branch from 3ca36c0 to 0635219 Compare October 1, 2022 22:23

pull-request-size bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Oct 1, 2022

dims force-pushed the retry-for-issue-9502 branch 2 times, most recently from dc81bb9 to 86defa7 Compare October 2, 2022 00:10

dims force-pushed the retry-for-issue-9502 branch from 86defa7 to ebc79fa Compare October 2, 2022 00:12

technosophos approved these changes Oct 4, 2022


hakuna-matatah reviewed Oct 4, 2022


pkg/kube/wait.go (review thread resolved)

hickeyma approved these changes Oct 5, 2022


hickeyma added this to the 3.10.1 milestone Oct 5, 2022

hickeyma merged commit 2baf68f into helm:main Oct 5, 2022

cenkalti mentioned this pull request Oct 11, 2022

Fix handling of "leader changed" errors #11426

Merged

3 tasks

mattfarina mentioned this pull request Jan 12, 2023

Regression in 3.11.0-rc.1: Waits forever when installing chart that has hook #11721

Closed

yuxiang-zhang mentioned this pull request Feb 13, 2024

[Feature] Gracefully handle transient failures "leader changed" from control plane instances eksctl-io/eksctl#7550

Open
