Client should wait before retrying PATCH/PUT/POSTs in case of http 429 from server #222

Closed
shyamjvs opened this issue Jun 27, 2017 · 27 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@shyamjvs
Member

We found recently, while running 4000-node cluster tests, that the apiserver was saturated with requests and returning 429s, yet clients kept sending requests.
In particular, the kubelets and NPDs on the nodes were continually retrying PATCH/PUT requests on failure (leading to thousands of QPS of 429s), even though they are designed to send updates only once per minute. So this is most likely an issue with client-go.

Following from discussion in kubernetes/node-problem-detector#124

cc @Random-Liu @gmarek @kubernetes/sig-scalability-misc

@shyamjvs
Member Author

cc @yujuhong

@shyamjvs shyamjvs added the bug label Jun 27, 2017
@shyamjvs
Member Author

We don't want to eventually end up stuck in the risky state of getting nothing but 429s.

@lavalamp
Member

lavalamp commented Jun 27, 2017 via email

@yujuhong

Do the clients not have a configured rate limit?

The client had a rate limit, and I believe the QPS was still within that limit. Maybe the problem is that the limit needs to be even lower for a cluster of this scale?

BTW, I was wrong in the initial issue. The rest client does not retry non-GET requests. I'm not sure what's causing the high number of retries on 429 (both NPD's and the kubelet's retry loops have large intervals).
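For reference, the client-side rate limit in question is the one set on the REST client config. A minimal sketch, with illustrative QPS/Burst values (not recommendations); note that this limiter only caps the request rate and adds no backoff of its own on 429s:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as used by node agents such as the kubelet or NPD.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	// Client-side rate limit: at most 5 requests/second with bursts up to 10.
	// These values are illustrative. The limiter bounds how fast this client
	// sends requests, but it adds no extra backoff when the server answers 429.
	cfg.QPS = 5
	cfg.Burst = 10

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("clientset %T created with QPS=%v Burst=%v\n", clientset, cfg.QPS, cfg.Burst)
}
```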

@caesarxuchao
Member

As I understand it, this is not a release blocker, but we need to fix it.

@caesarxuchao caesarxuchao self-assigned this Jun 27, 2017
@caesarxuchao
Member

caesarxuchao commented Jun 27, 2017

@gmarek I vaguely remember you adding the backoffMgr. The history was lost when we moved the code to staging.

do you know if this comment is still true?
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/client.go#L138-L139

```go
// readExpBackoffConfig handles the internal logic of determining what the
// backoff policy is.  By default if no information is available, NoBackoff.
// TODO Generalize this see #17727 .
func readExpBackoffConfig() BackoffManager {
```

The env vars were empty on my 1.6.4 GKE node.
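For context, a minimal sketch of how that env-var path could be exercised. The KUBE_CLIENT_BACKOFF_BASE / KUBE_CLIENT_BACKOFF_DURATION names are quoted from memory of rest/client.go, so treat them as an assumption and check the linked source:

```go
package main

import (
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// The REST client builds an exponential, per-URL backoff manager only when
	// these variables parse as integers; otherwise it silently falls back to
	// NoBackoff, which matches what was observed on the GKE node above.
	// Variable names are from memory of rest/client.go; verify before relying on them.
	os.Setenv("KUBE_CLIENT_BACKOFF_BASE", "1")       // base backoff, in seconds
	os.Setenv("KUBE_CLIENT_BACKOFF_DURATION", "120") // cap on the backoff, in seconds

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if _, err := kubernetes.NewForConfig(cfg); err != nil {
		panic(err)
	}
}
```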

@caesarxuchao
Member

caesarxuchao commented Jun 28, 2017

As Yuju mentioned, the client is only going to retry if net.IsConnectionReset(err) && r.verb == "GET".

The code @yujuhong mentioned is wrapped in if err != nil, and when the server returns 429, err is nil (see https://github.com/golang/go/blob/master/src/net/http/client.go#L461-L464), so that check doesn't apply and many operations, non-GETs included, do get retried.

If the server returns 429, before retrying the client sleeps, in total, for:

  1. the amount of time demanded by the server (its Retry-After), plus
  2. the exponential backoff recorded by the backoff manager; though by default the backoff manager is a no-op, so this term is 0.

Anyway, I think the client is following the apiserver's demand, so there is no bug on the client side.
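A minimal, self-contained sketch of the sleep computation described above (illustrative only; the real logic lives in client-go's rest/request.go):

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDelay illustrates the two components described above: the wait demanded
// by the server via Retry-After, plus whatever the backoff manager adds
// (zero with the default NoBackoff manager).
func retryDelay(resp *http.Response, managerBackoff time.Duration) time.Duration {
	serverWait := time.Duration(0)
	if resp.StatusCode == http.StatusTooManyRequests {
		if s, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil {
			serverWait = time.Duration(s) * time.Second
		}
	}
	return serverWait + managerBackoff
}

func main() {
	resp := &http.Response{
		StatusCode: http.StatusTooManyRequests,
		Header:     http.Header{"Retry-After": []string{"1"}},
	}
	// With NoBackoff the second term is 0, so the client sleeps exactly 1s here.
	fmt.Println(retryDelay(resp, 0))
}
```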

@Random-Liu
Member

Anyway, I think the client is following the apiserver's demand, so there is no bug on the client side.

Shouldn't we configure the backoff manager?

@caesarxuchao
Member

IMO the backoff manager is a safety net. The apiserver should specify a longer Retry-After duration if its load is heavy.

Maybe we should configure the backoff manager. @gmarek, do you know what values we should use to initialize it?

@yujuhong

IMO the backoff manager is a safety net. The apiserver should specify a longer Retry-After duration if its load is heavy.

I remember I also checked this yesterday. Of course there is always a TODO in the code: https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.1/staging/src/k8s.io/apiserver/pkg/server/filters/maxinflight.go#L33
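For illustration, here is what that contract looks like on the server side in plain Go: a handler that sheds load with a 429 and a Retry-After header. This is a generic sketch, not the apiserver's max-in-flight filter, and the 10-second value is made up:

```go
package main

import "net/http"

// withLoadShedding rejects requests with 429 plus a Retry-After header whenever
// the overloaded() predicate says so; well-behaved clients are expected to wait
// at least that long before retrying.
func withLoadShedding(next http.Handler, overloaded func() bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if overloaded() {
			w.Header().Set("Retry-After", "10") // illustrative value, not a recommendation
			http.Error(w, "too many requests, please retry later", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	busy := func() bool { return false } // plug in a real load signal here
	http.ListenAndServe(":8080", withLoadShedding(http.DefaultServeMux, busy))
}
```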

@gmarek
Contributor

gmarek commented Jun 28, 2017

@caesarxuchao - that's a very good question, and it probably has a very complex answer: it depends. It's fine to back off for quite some time for Status updates (with the exception of NodeStatus, which serves as a heartbeat), but we probably don't want to back off too much for Spec updates. This should be figured out by @kubernetes/sig-api-machinery-bugs (@smarterclayton?). For now, can we put something non-zero there, e.g. 5ms?

(Disclaimer: I don't think I did anything around the backoff manager. :)

@lavalamp
Member

lavalamp commented Jun 28, 2017 via email

@caesarxuchao
Member

caesarxuchao commented Jun 28, 2017

We have a really common bug where people fail to read the entire contents of the response body,

It looks like it's taken care of: https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.1/staging/src/k8s.io/client-go/rest/request.go#L846-L848
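For anyone hitting that bug in their own clients, the usual Go-side fix is to drain and close the response body so the underlying connection can be reused; a minimal sketch:

```go
package main

import (
	"io"
	"io/ioutil"
	"net/http"
)

// drainAndClose reads whatever is left in the body and closes it, so the
// underlying connection can be returned to the pool and reused instead of
// being torn down.
func drainAndClose(resp *http.Response) {
	if resp == nil || resp.Body == nil {
		return
	}
	io.Copy(ioutil.Discard, resp.Body)
	resp.Body.Close()
}

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		return
	}
	defer drainAndClose(resp)
	// ... use resp.StatusCode / headers without necessarily reading the body ...
}
```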

@yujuhong

IMO folks making requests should understand what their requests are doing
and whether it makes sense to use the backoff manager. I am not comfortable
enabling it by default without thinking a lot more about it.

The backoff manager settings are not exposed through the REST client config; it gets them from environment variables. They are also not configurable based on the type of request. If we can make this more useful, perhaps we can remove the kubelet's own retry loop.

Given that the client retries up to 10 times (not configurable), I think we should lower the number of retries for node status updates in the kubelet. There is no point in sending the same status for a prolonged period of time when the kubelet could instead send a newer update (it generates one every 10s).

@caesarxuchao
Member

The exponential backoff manager was introduced in kubernetes/kubernetes#17529. Sorry @gmarek, I confused it with the throttler, which I believe you introduced ;)

@jayunit100, do you know why we used env vars rather than a config option to initialize the backoff manager?

I suggest that we do the following (sketched after this list):

  1. add a backoffManager to RESTClient and rest#Config
  2. deprecate readExpBackoffConfig after 2 releases
  3. move maxRetries to the backoffManager
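To make item 1 concrete, a hypothetical shape it could take; the field and type names below are invented for illustration and do not exist in client-go:

```go
// Hypothetical sketch of item 1 above; none of these fields or names exist in
// client-go today.
package rest

import "time"

// BackoffConfig is an illustrative replacement for the env-var driven
// readExpBackoffConfig: callers would set it on the Config instead of
// exporting environment variables on every node.
type BackoffConfig struct {
	Base       time.Duration // initial backoff step
	MaxBackoff time.Duration // upper bound on a single sleep
	MaxRetries int           // item 3: the retry budget moves here from the request
}

// Config shows only the proposed new field; the existing rest.Config fields are elided.
type Config struct {
	// Backoff, if non-nil, replaces the environment-variable lookup.
	Backoff *BackoffConfig
}
```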

@shyamjvs
Member Author

Linking issue kubernetes/kubernetes#47344 for tracking.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2018
@shyamjvs
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 30, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2018
@shyamjvs
Member Author

shyamjvs commented Apr 30, 2018 via email

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2018
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed bug labels Jun 5, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2018
@nikhita
Member

nikhita commented Sep 3, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 1, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
