Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make connection lost retryable in webhook #117251

Merged
merged 1 commit into from
May 22, 2023
Merged

Conversation

linxiulei
Copy link
Contributor

When a http2 connection dies due to ping timeout, http2 client gets an error of "http2: client connection lost". This is similar to ConnectionReset case so it should be retryable.

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR improves resilience of webhook function in kube-apiserver by retrying requests failed on dead tcp connections.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @linxiulei. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 12, 2023
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 12, 2023
@wzshiming
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 13, 2023
@wzshiming
Copy link
Member

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 13, 2023
@linxiulei
Copy link
Contributor Author

/retest

@aojea
Copy link
Member

aojea commented Apr 26, 2023

/assign

@aojea
Copy link
Member

aojea commented Apr 29, 2023

I knew there was some history here #109022

@@ -62,7 +62,7 @@ type GenericWebhook struct {
// Otherwise it returns false for an immediate fail.
func DefaultShouldRetry(err error) bool {
// these errors indicate a transient error that should be retried.
if utilnet.IsConnectionReset(err) || apierrors.IsInternalError(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) {
if utilnet.IsConnectionReset(err) || utilnet.IsConnectionLost(err) || apierrors.IsInternalError(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) {
Copy link
Member

@aojea aojea May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only retryable operations should be those that are idempotent or those that the error indicate that there is no risk to retry, IsConnectionLost means that the ping frame was not answered and is a connection level error that impacts all in-flight requests
https://cs.opensource.google/go/x/net/+/master:http2/transport.go

It doesn't seem safe to retry on this error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too sure about if it's idempotent but I do see some similarity between IsConnectionReset and IsConnectionLost.

Anyway, is it safe to make it retryable just for webhook authentication since it's idempotent (ie. review token)? cc @liggitt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with IsConnectionReset, since this can be sent after the server processed some data.
Jordan know better the mechanics of the webhooks, if all operations are idempotent then +1 to retry, otherwise we should alos remove the IsConnectionReset IMHO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

authn and authz webhooks are safe to retry

admission webhooks are generally safe to retry, are already retried at a higher level when there's an etcd write conflict, and provide a uid in the review body so a webhook that cared about idempotence (because it had side effects it wanted to minimize) could detect a retry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea
Copy link
Member

aojea commented May 15, 2023

/lgtm
#117251 (comment)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2023
@k8s-ci-robot k8s-ci-robot requested a review from aojea May 15, 2023 15:19
@aojea
Copy link
Member

aojea commented May 15, 2023

/lgtm
/assign @liggitt

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 413b6ae0b9216528683e9a28f7ad1e046f3ca674

@@ -47,6 +48,11 @@ func IsConnectionReset(err error) bool {
return false
}

// Returns if the given err is "http2: client connection lost" error.
func IsHTTP2ConnectionLost(err error) bool {
return err != nil && strings.Contains(err.Error(), "http2: client connection lost")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're going to just match error strings, shouldn't we have a test that notices if the go string changes and breaks this method?

https://github.com/golang/net/blob/master/http2/transport.go#L1134

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish we could be more robust here :-/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neild how do you feel about exposing the error in golang?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish we could be more robust here :-/

hmm, the test setup would be painful to set up a http2 server and figure out a way to fail the ping health check. OTOH, I feel the possibility of change the error message in golang is extremely low.

Exposing the error in golang is much feasible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test setup would be painful to set up a http2 server and figure out a way to fail the ping health check.

we do have that test

func TestReconnectBrokenTCP(t *testing.T) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it makes things easier. PTAL!

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 19, 2023
@k8s-ci-robot k8s-ci-robot requested a review from liggitt May 19, 2023 19:16
@aojea
Copy link
Member

aojea commented May 20, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3627aff8f08166b94ba412ce5d07349bbd07e5f0

When a http2 connection dies due to ping timeout, http2 client gets an
error of "http2: client connection lost". This is similar to
ConnectionReset case so it should be retryable.

Signed-off-by: Eric Lin <exlin@google.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2023
@k8s-ci-robot k8s-ci-robot requested a review from aojea May 20, 2023 19:04
@linxiulei
Copy link
Contributor Author

/retest

@aojea
Copy link
Member

aojea commented May 21, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 0c2f8024859925eed910d52f069599325af53abb

@liggitt
Copy link
Member

liggitt commented May 22, 2023

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, linxiulei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2023
@k8s-ci-robot k8s-ci-robot merged commit d9df6b0 into kubernetes:master May 22, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants