
rest: retry on connection refused and apiserver shutdown #75368

Open

mfojtik wants to merge 3 commits into base: master from mfojtik:retry-on-errors
Conversation

mfojtik (Contributor) commented Mar 14, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

client-go already retries GET requests on connection reset. There are more transient errors we could retry on, instead of making users implement retries in their own code. Two examples I can think of are "connection refused" errors and "apiserver is shutting down" responses. Both should be transient, and I think we can safely retry them.

Does this PR introduce a user-facing change?:

NONE
caesarxuchao (Member) commented Mar 14, 2019

/assign @logicalhan

deads2k (Contributor) commented Mar 18, 2019

@smarterclayton this is in keeping with a pull you wrote a while back I think.

@kubernetes/sig-api-machinery-misc

```diff
 // Thus in case of "GET" operations, we simply retry it.
 // We are not automatically retrying "write" operations, as
 // they are not idempotent.
-if !net.IsConnectionReset(err) || r.verb != "GET" {
+if !net.IsConnectionReset(err) || !IsConnectionRefused(err) || !isAPIServerShutdown(err) || r.verb != "GET" {
```

liggitt (Member) commented Mar 18, 2019

this logic isn't right... this will not retry unless a single error is simultaneously a connection reset AND a connection refused AND an "apiserver is shutting down" error.

this also demonstrates the existing retry logic isn't exercised in tests :-/ (edit: the unit tests did fail)

mfojtik (Author, Contributor) commented Mar 18, 2019

it is exercised: the unit test failed, mea culpa.

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from 28a598d to c7c19b8 Mar 18, 2019

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Mar 18, 2019

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from c7c19b8 to d910cea Mar 18, 2019

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from d910cea to ecc4c87 Mar 18, 2019

```diff
@@ -38,6 +38,7 @@ func WithWaitGroup(handler http.Handler, longRunning apirequest.LongRunningReque
 	if !longRunning(req, requestInfo) {
 		if err := wg.Add(1); err != nil {
+			w.Header().Add("Retry-After", "1")
 			http.Error(w, "apiserver is shutting down.", http.StatusInternalServerError)
```

sttts (Contributor) commented Mar 18, 2019

can we also switch this to a Status object?

mfojtik (Author, Contributor) commented Mar 18, 2019

I would do this as a follow-up (it might require more plumbing, and there are more places in the code that return plain-text errors via http.Error()).

smarterclayton (Contributor) commented Mar 18, 2019

is 1 appropriate?

mfojtik (Author, Contributor) commented Mar 18, 2019

@smarterclayton my assumption is that in HA clusters there are two other endpoints client-go can reconnect to when it gets the apiserver shutdown error, so a faster retry should lead to faster success.

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from ecc4c87 to 4e49c60 Mar 18, 2019

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from 4e49c60 to 9fda5a4 Mar 19, 2019

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from 9fda5a4 to 9f4cf47 Mar 19, 2019

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch 2 times, most recently from 8ae760d to c32f834 Mar 19, 2019

mfojtik added some commits Mar 14, 2019

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from c32f834 to e291877 Mar 19, 2019

k8s-ci-robot (Contributor) commented Mar 19, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mfojtik
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: lavalamp

If they are not already assigned, you can assign the PR to them by writing /assign @lavalamp in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mfojtik (Contributor, Author) commented Mar 19, 2019

/retest

@mfojtik mfojtik force-pushed the mfojtik:retry-on-errors branch from e291877 to 35db800 Mar 20, 2019

tosss commented Mar 20, 2019

k8s-ci-robot (Contributor) commented Mar 20, 2019

@mfojtik: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-integration
Commit: 35db800
Details: link
Rerun command: /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
