Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tolerate temporary errors from etcdserver #11401

Merged
merged 1 commit into from
Oct 5, 2022

Conversation

dims
Copy link
Contributor

@dims dims commented Sep 30, 2022

What this PR does / why we need it:

There are cases when the etcdserver is temporarily unavailable and the
errors that we get back from kube-apiserver reflect that error. It looks
like we bail out immediately when these errors happen currently. We
should retry until timeout is reached when this sort of errors happen.

Fixes #9502
Fixes #7637

Signed-off-by: Davanum Srinivas davanum@gmail.com

Special notes for your reviewer:

With this patch, temporary errors like the etcdserver leader changes are not treated as terminal. We continue to retry until the specified timeout.

Note that there are things that can be done on the k8s side, discussion is going on there as well:
kubernetes/kubernetes#112152

If applicable:

  • This PR does not change any functionality, so no updates to the documentation
  • Currently there aren't any unit tests in this package, but i added one for just the new method isServiceUnavailable
  • No changes to functionality, so no incompatibilities

@pull-request-size pull-request-size bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 30, 2022
@dims dims force-pushed the retry-for-issue-9502 branch 2 times, most recently from c2bbcb0 to 3ca36c0 Compare October 1, 2022 21:29
@dims dims changed the title [WIP][IGNORE ME] Retry a few times, do not give up so soon! Tolerate temporary errors from etcdserver Oct 1, 2022
@dims dims marked this pull request as ready for review October 1, 2022 21:36
@pull-request-size pull-request-size bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 1, 2022
@dims dims force-pushed the retry-for-issue-9502 branch 2 times, most recently from dc81bb9 to 86defa7 Compare October 2, 2022 00:10
There are cases when the etcdserver is temporarily unavailable and the
errors that we get back from kube-apiserver reflect that error. It looks
like we bail out immediately when these errors happen currently. We
should retry until timeout is reached when this sort of errors happen.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@dims
Copy link
Contributor Author

dims commented Oct 3, 2022

@hickeyma this is ready now!

@mattfarina
Copy link
Collaborator

@dims thanks for looking into the issues here.

I the Kubernetes API supposed to be a leaky abstraction? Are clients expected to work with etcd? I'm asking about intent and mid-term intent. I'm wondering whether this code is something we will need to maintain long term or if this is a short term situation.

@dims
Copy link
Contributor Author

dims commented Oct 3, 2022

@mattfarina we'll need a KEP in upstream, i've requested some folks who were pushing for this earlier to do more in 1.27 cycle (not 1.26), So until that KEP is discussed/reviewed/approved we will need this.

we'll also need this until versions of kubernetes supported by helm has the old style leaky abstraction.

Copy link
Member

@technosophos technosophos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the appropriate stop-gap for this error.

Copy link
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dims for tracking these issues down and providing this interim solution. It will be helpful to users.

@hickeyma hickeyma added this to the 3.10.1 milestone Oct 5, 2022
@hickeyma hickeyma merged commit 2baf68f into helm:main Oct 5, 2022
@dims
Copy link
Contributor Author

dims commented Oct 5, 2022

thanks @hickeyma !

@hickeyma hickeyma added needs-pick Indicates that a PR needs to be cherry-picked into the next release candidate. picked Indicates that a PR has been cherry-picked into the next release candidate. and removed needs-pick Indicates that a PR needs to be cherry-picked into the next release candidate. labels Oct 5, 2022
@cenkalti
Copy link
Contributor

Hey @technosophos @hickeyma. Unfortunately, this fix does not solve the issue. Can you take a look at my new fix? #11426

@sruthiwander
Copy link

sruthiwander commented Feb 23, 2023

@dims Can this be backported to 3.2 please?

@dims
Copy link
Contributor Author

dims commented Feb 23, 2023

@sruthiwander not this one! it was reverted, you will need #11426

Also https://github.com/helm/helm/releases/tag/v3.2.0 is practically ancient, i don't know/think that helm maintainers will go back that far https://helm.sh/docs/topics/release_policy/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
picked Indicates that a PR has been cherry-picked into the next release candidate. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
7 participants