Client should wait before retrying PATCH/PUT/POSTs in case of http 429 from server #222
Comments
cc @yujuhong
We don't want to end up in the risky state where clients are stuck receiving 429s indefinitely.
Do the clients not have a configured rate limit?
Yes, we should make clients respect the server's request if they aren't right now.
The client had a rate limit, and I believe the QPS was still within the limit. Maybe the problem is that the limit needs to be even lower for a cluster of this scale? BTW, I was wrong in the initial issue: the rest client does not retry non-GET requests. I am not sure what's causing the high number of retries on 429 (both NPD's and kubelet's retry loops have a large interval).
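For reference, the client-side QPS limit mentioned here is essentially a token bucket placed in front of outgoing requests. A minimal sketch (using golang.org/x/time/rate rather than client-go's own throttle, with made-up numbers):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow 5 requests/second with a burst of 10; the numbers are illustrative,
	// not the values any Kubernetes component actually uses.
	limiter := rate.NewLimiter(rate.Limit(5), 10)

	for i := 0; i < 3; i++ {
		// Wait blocks until the token bucket permits another request,
		// which is how a client-side QPS limit smooths out traffic.
		if err := limiter.Wait(context.Background()); err != nil {
			fmt.Println("limiter aborted:", err)
			return
		}
		fmt.Printf("request %d allowed at %s\n", i, time.Now().Format(time.StampMilli))
	}
}
```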
As I understand it, this is not a release blocker, but we need to fix it.
@gmarek I vaguely remember you added the backoffMgr. The history was lost when we moved the code to staging. Do you know if this comment is still true?
The env vars were empty on my 1.6.4 GKE node.
The code @yujuhong mentioned was wrapped in […]. If the server returns 429, before retrying, the client in total sleeps for […].
Anyway, I think the client is following the apiserver's demand, so there is no bug on the client side.
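A rough sketch of what "following the apiserver's demand" means in practice: sleep for the Retry-After the server sends with each 429 before retrying, up to a bounded number of attempts. This is illustrative only, not the actual rest/request.go code; the 10-retry cap is an assumption taken from the discussion below.

```go
package retrysketch

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// maxRetries mirrors the "client retries up to 10 times" mentioned later in
// this thread; treat it as an assumption, not client-go's exact constant.
const maxRetries = 10

// DoWithRetryAfter retries a request on 429, sleeping for the duration the
// server asks for in Retry-After before each retry. newReq rebuilds the
// request so a PATCH/PUT body can be re-sent on every attempt.
func DoWithRetryAfter(client *http.Client, newReq func() (*http.Request, error)) (*http.Response, error) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		req, err := newReq()
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		// Follow the apiserver's demand: wait at least as long as it asked.
		wait := time.Second // fallback when Retry-After is missing or unparseable
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, perr := strconv.Atoi(s); perr == nil && secs > 0 {
				wait = time.Duration(secs) * time.Second
			}
		}
		resp.Body.Close()
		time.Sleep(wait)
	}
	return nil, fmt.Errorf("still receiving 429 after %d retries", maxRetries)
}
```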
Shouldn't we configure the backoff manager?
IMO the backoff manager is a safety net. The apiserver should specify a longer Retry-After duration if its load is heavy. Maybe we should configure the backoff manager. @gmarek, do you know what numbers we should use to initialize it?
I remember I also checked this yesterday. Of course, there is always a TODO in the code: https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.1/staging/src/k8s.io/apiserver/pkg/server/filters/maxinflight.go#L33
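For context, the contract being discussed is simply the server attaching a Retry-After header to its 429 responses. A hedged sketch of that server side; the 1-second value is an assumption standing in for whatever hard-coded default the TODO refers to:

```go
package inflightsketch

import (
	"net/http"
	"strconv"
)

// retryAfterSeconds is an assumption, not necessarily what maxinflight.go uses.
const retryAfterSeconds = 1

// reject tells the client how long to back off before it tries again.
func reject(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Retry-After", strconv.Itoa(retryAfterSeconds))
	http.Error(w, "too many requests, please try again later", http.StatusTooManyRequests)
}
```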
@caesarxuchao - this is a very good question, which probably has a very complex answer: it depends. It's fine to back off for quite some time for Status updates (with the exception of NodeStatus, which serves as a heartbeat), but we probably don't want to back off too much for Spec updates. This should be figured out by @kubernetes/sig-api-machinery-bugs (@smarterclayton?). For now, can we put something non-zero there? E.g. 5ms? (Disclaimer: I don't think I did anything around the backoff mgr :)
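To make the 5ms suggestion concrete, here is a generic exponential-backoff calculation (not client-go's BackoffManager) showing that such a base stays negligible for the first few 429s and only becomes a real pause if the server keeps pushing back:

```go
package backoffsketch

import "time"

// BackoffForAttempt doubles the delay per consecutive failure, capped at max.
// With base = 5*time.Millisecond and max = 10*time.Second, attempts 0..5
// yield 5ms, 10ms, 20ms, 40ms, 80ms, 160ms.
func BackoffForAttempt(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}
```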
IMO folks making requests should understand what their requests are doing and whether it makes sense to use the backoff manager. I am not comfortable enabling it by default without thinking a lot more about it.
There is another potential problem that should be considered: is the client opening a new connection each time, or does it reuse an existing one? The latter is much more efficient. We have a really common bug where people fail to read the entire contents of the response body, which causes the Go library to not reuse the socket. Is it possible that is happening in this case?
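The connection-reuse pitfall described above looks like this in Go: unless the response body is fully read before it is closed, the transport may not put the connection back into its idle pool, and the next request pays for a new TCP (and possibly TLS) handshake. A small sketch of the safe pattern:

```go
package connreuse

import (
	"io"
	"io/ioutil"
	"net/http"
)

// getAndDiscard reads and closes the whole response body so the default
// Transport can return the underlying connection to its idle pool for reuse.
func getAndDiscard(client *http.Client, url string) (int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// Drain whatever is left of the body before Close.
	if _, err := io.Copy(ioutil.Discard, resp.Body); err != nil {
		return 0, err
	}
	return resp.StatusCode, nil
}
```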
It looks like it's taken care of: https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.1/staging/src/k8s.io/client-go/rest/request.go#L846-L848
The backoff manager settings are not exposed through the REST client config; it gets these from environment variables. They are also not configurable based on the type of request. If we can make this more useful, perhaps we can remove kubelet's own retry loop. Given that the client retries up to 10 times (not configurable), I think we should lower the number of retries for node status updates in kubelet. There is no point trying to send the same status for a prolonged period of time when kubelet could instead send a new update (every 10s).
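A sketch of that suggestion, capping kubelet's node-status retries and letting the next periodic sync send fresher data; the constant and function names are illustrative, not kubelet's actual identifiers:

```go
package statussketch

import (
	"fmt"
	"time"
)

// nodeStatusUpdateRetry is an illustrative cap, deliberately much lower than
// the REST client's 10 retries; the next sync (every 10s) will send a fresher
// status anyway.
const nodeStatusUpdateRetry = 2

// updateNodeStatus calls the (hypothetical) patchStatus function a bounded
// number of times and then gives up until the next sync loop iteration.
func updateNodeStatus(patchStatus func() error) error {
	var err error
	for i := 0; i < nodeStatusUpdateRetry; i++ {
		if err = patchStatus(); err == nil {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("giving up after %d attempts, will retry on next sync: %v", nodeStatusUpdateRetry, err)
}
```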
The exponential backoff manager was introduced in kubernetes/kubernetes#17529. Sorry @gmarek, I confused it with the throttler, which I believed was introduced by you ;) @jayunit100, do you know why we used env vars rather than a config to initialize the backoff manager? I suggest that we do these:
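For reference, the env-var-driven initialization being questioned presumably looks roughly like the sketch below; the variable names KUBE_CLIENT_BACKOFF_BASE and KUBE_CLIENT_BACKOFF_DURATION (in seconds) are assumptions about client-go's behavior, not confirmed here:

```go
package backoffenv

import (
	"os"
	"strconv"
	"time"
)

// backoffFromEnv reads the two variables assumed to drive the backoff manager.
// When they are unset — as observed on the GKE node above — ok is false and
// the caller falls back to a no-op backoff.
func backoffFromEnv() (base, max time.Duration, ok bool) {
	b, errB := strconv.ParseInt(os.Getenv("KUBE_CLIENT_BACKOFF_BASE"), 10, 64)
	d, errD := strconv.ParseInt(os.Getenv("KUBE_CLIENT_BACKOFF_DURATION"), 10, 64)
	if errB != nil || errD != nil {
		return 0, 0, false
	}
	return time.Duration(b) * time.Second, time.Duration(d) * time.Second, true
}
```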
Linking issue kubernetes/kubernetes#47344 for tracking.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We recently found, while running 4000-node cluster tests, that the apiserver was saturated with requests and returning 429s, but clients kept sending requests.
In particular, kubelets and NPDs on the nodes were continually retrying PATCH/PUT requests on failures (leading to thousands of QPS of 429s), even though they're designed to send updates only once per minute. So this is most likely an issue with client-go.
Following from discussion in kubernetes/node-problem-detector#124
cc @Random-Liu @gmarek @kubernetes/sig-scalability-misc