
Watch shouldn't break every 5 minutes #6513

Closed
wojtek-t opened this issue Apr 7, 2015 · 27 comments
Labels: priority/important-soon, sig/scalability
Milestone: v1.0

Comments

@wojtek-t (Member) commented Apr 7, 2015

Currently, the watch is broken every 5 minutes. It is then "recreated", which sometimes requires listing all elements under watch (if there were more than 1000 "uninteresting" etcd writes since then).

I dug deeper into it and it seems this is not related to etcd itself. What happens is:

  1. a component (e.g. the scheduler) starts a watch by sending an HTTP request to the apiserver
  2. the apiserver initiates a watch on etcd
  3. after 5 minutes, the request to the apiserver finishes (this is the part I don't fully understand - is it a timeout?)
  4. this triggers "canceling" of the request to etcd

cc @fgrzadkowski
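
For illustration, here is a minimal Go sketch (not the actual client code; `listAll`, `watchFrom`, and `errVersionTooOld` are hypothetical stand-ins) of the client-side pattern this behavior forces: reconnect when the watch closes, and fall back to a full re-list when the last-seen resource version has fallen out of etcd's ~1000-event window.

```go
package watchclient

import (
	"errors"
	"log"
	"time"
)

// errVersionTooOld is a hypothetical sentinel for "resource version is outside
// etcd's ~1000-event window", i.e. the watch can no longer be resumed from it.
var errVersionTooOld = errors.New("resource version too old")

// listAll and watchFrom are hypothetical stand-ins for the real client calls.
func listAll() (resourceVersion uint64, err error)     { return 0, nil }  // full LIST: expensive
func watchFrom(rv uint64) (lastSeen uint64, err error) { return rv, nil } // blocks until the watch closes

func watchForever() {
	rv, err := listAll() // the initial LIST establishes a starting resource version
	if err != nil {
		log.Fatal(err)
	}
	for {
		last, err := watchFrom(rv)
		switch {
		case err == nil:
			// The watch closed cleanly (e.g. the ~5 minute apiserver cutoff);
			// resume from the last event that was delivered.
			rv = last
		case errors.Is(err, errVersionTooOld):
			// Too many writes happened since rv, so etcd cannot replay them:
			// fall back to a full re-list.
			if rv, err = listAll(); err != nil {
				log.Printf("re-list failed: %v", err)
				time.Sleep(time.Second)
			}
		default:
			log.Printf("watch error: %v", err)
			time.Sleep(time.Second)
		}
	}
}
```

The re-list branch is the expensive part; the etcd change discussed below (etcd-io/etcd#2048) is about removing the need for it.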

@wojtek-t wojtek-t added this to the v1.0 milestone Apr 7, 2015
@wojtek-t wojtek-t added team/master, priority/important-soon, sig/scalability labels Apr 7, 2015
@wojtek-t wojtek-t self-assigned this Apr 7, 2015
@wojtek-t wojtek-t changed the title from "Watch shouldn't be break every 5 minutes" to "Watch shouldn't break every 5 minutes" Apr 7, 2015
@wojtek-t (Member, Author) commented Apr 7, 2015

I think that step 3 is because of:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cmd/kube-apiserver/app/server.go#L386

I think that we should change it so that all "Watch" requests have a much bigger timeout (without changing the timeout of all other GETs).
Is it possible to specify it at the client level by customizing its "transport" part?

What do you think about it?
It shouldn't conflict with #6207, because we're ignoring watches there, right?

cc @brendanburns @davidopp @lavalamp
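
As a rough sketch of the "customize the transport" idea above (an assumption, not the project's client code): with Go's net/http, the overall request timeout can be left unbounded for streaming watch requests while connection setup and time-to-first-header stay bounded.

```go
package watchclient

import (
	"net"
	"net/http"
	"time"
)

// newWatchClient builds an *http.Client suited to long-running watch requests:
// no overall request timeout (which would cut the stream off mid-watch), but
// bounded dial, TLS, and response-header waits so a dead server is still noticed.
func newWatchClient() *http.Client {
	return &http.Client{
		Timeout: 0, // 0 means no whole-request limit; a streaming watch needs this
		Transport: &http.Transport{
			DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second, // time allowed until the response headers arrive
		},
	}
}
```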

@ghost commented Apr 7, 2015

Yes, I can confirm that we're ignoring watches in #6207:

fs.StringVar(&s.LongRunningRequestRE, "long_running_request_regexp", "[.*\/watch$][^\/proxy.*]", "A regular expression matching long running requests which should be excluded from maximum inflight request handling.")

And yes, I think that watches should similarly be excluded from API server timeouts, but would like @bgrant0607 to comment on what the intended semantics and implementation of long-term watches are. Do they have any timeout? Is there intended to be a limit on the number of concurrent long-term watches open against the API server?
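
For context, a minimal sketch of the mechanism that flag configures (not the actual kube-apiserver implementation): requests whose path matches the long-running regexp bypass the max-in-flight limiter entirely.

```go
package inflight

import (
	"net/http"
	"regexp"
)

// maxInFlightLimit caps concurrent "normal" requests with a counting semaphore,
// but lets requests whose path matches longRunningRE (watches, proxies) through
// without consuming a slot, since they are expected to stay open for a long time.
func maxInFlightLimit(limit int, longRunningRE *regexp.Regexp, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if longRunningRE.MatchString(r.URL.Path) {
			next.ServeHTTP(w, r) // exempt from the limit
			return
		}
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "too many requests in flight", http.StatusTooManyRequests)
		}
	})
}
```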

@timothysc (Member) commented:

@wojtek-t could you keep me in the loop on new items created, and possibly label them as perf, because I think we are chasing the same things.

@bgrant0607 (Member) commented:

cc @smarterclayton

@wojtek-t (Member, Author) commented Apr 7, 2015

@timothysc will do

@smarterclayton (Contributor) commented:

I think we need to fix the etcd problem. Working around it on our end is just too painful, and doesn't help us (short of aggregating all watches on our end, which is a larger chunk of work).

My recommendation is to work with upstream etcd to propose the minimal change for etcd-io/etcd#2048 required to fix the issue on our end (a client option that requests that etcd indicate the window has closed and provide the latest etcd index at that time). That fixes the short-term problem in a way that benefits everyone.

@timothysc (Member) commented:

@wojtek-t FWIW, we had seen that when the cluster was in a stalled state, other actions would continue only after a watch was reset by a timeout, but we have not checked whether the recent etcd modifications re: #6059 have had an effect.

@wojtek-t (Member, Author) commented Apr 7, 2015

@smarterclayton I don't understand your comment - I didn't observe any problem at the etcd level; the watch on etcd doesn't break by itself.
The only problem I observe is the "scheduler -> apiserver" HTTP request that is timing out [the etcd issue that you mentioned is only a consequence of our internal request timing out].

@timothysc: sure - I will look into it as soon as the new etcd release is built

@smarterclayton (Contributor) commented:

@wojtek-t the problem you alluded to is being outside the window of watchable events. The disconnect would not be a problem if the client knew to ask for resource version X+10000. The disconnect is only a problem because the client only knows to ask for resource version X.

It is then "recreated", which sometimes requires listing all elements under watch (if there were more than 1000 "uninteresting" etcd writes since then).

This is only a problem because clients don't get an updated resource version after 1000 uninteresting writes. #2048 is intended to fix that problem.
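
To make that concrete, a sketch of what the #2048 behavior could enable on the client side (windowExceededError is an assumed shape, since the etcd change had not been designed yet): if the error carried the current index, the client could jump forward instead of re-listing.

```go
package watchclient

// windowExceededError is an assumed error shape for the etcd-io/etcd#2048 idea:
// when a watcher is too far behind, the server says so and reports the current index.
type windowExceededError struct {
	CurrentIndex uint64
}

func (e *windowExceededError) Error() string { return "watch window exceeded" }

// nextResourceVersion decides where a broken watch should resume from.
// With the proposed error the client jumps straight to CurrentIndex, avoiding
// the expensive full LIST that is required today.
func nextResourceVersion(lastSeen uint64, err error) (rv uint64, needRelist bool) {
	if we, ok := err.(*windowExceededError); ok {
		return we.CurrentIndex, false // resume from the index the server reported
	}
	if err != nil {
		return 0, true // unknown failure: fall back to a full re-list
	}
	return lastSeen, false // clean close: resume from the last delivered event
}
```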

@smarterclayton (Contributor) commented:

The 5 minute timeout is our maximum http.Server timeout. I don't think we should increase this, because it doesn't fix the problem from #2048, which affects every watcher.
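
For reference, the kind of configuration being discussed (an illustrative sketch, not the apiserver's exact code): http.Server's ReadTimeout and WriteTimeout apply to the whole exchange, so a streaming watch response is closed once WriteTimeout elapses even if events are still flowing.

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: http.DefaultServeMux,
		// ReadTimeout and WriteTimeout cover the whole exchange, not idle time:
		// a chunked watch response that is still streaming events is closed
		// once WriteTimeout (here 5 minutes) has elapsed.
		ReadTimeout:  5 * time.Minute,
		WriteTimeout: 5 * time.Minute,
	}
	_ = srv.ListenAndServe()
}
```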

@bgrant0607 (Member) commented:

cc @hchoudh

@wojtek-t (Member, Author) commented Apr 7, 2015

OK - I agree that the issue you mentioned is more important to fix.
However, I think that having a longer timeout for watches (or for some special requests) might be beneficial [or, in other words, having shorter timeouts for other requests].

@wojtek-t wojtek-t removed their assignment Apr 7, 2015
@smarterclayton (Contributor) commented:

On the other hand, dead clients are going to be consuming resources, especially now that we have limit pools. I do not want someone to be able to mount a DOS attack against the API server by opening watches and starving traffic. Having to reestablish their connection periodically is a part of ensuring that clients that exceed their rate limit are throttled (vs being able to open a watch forever).


@davidopp (Member) commented Apr 7, 2015

@smarterclayton can you explain this attack mechanism a bit more? Regardless of the watch timeout, can't someone successfully attack by running a lot of clients that keep renewing? Statistically, some non-attack clients will get through if the attackers keep having to renew, but if the attacker runs enough clients then I think they can keep that number close to zero.

@davidopp (Member) commented Apr 7, 2015

Also, can you send a pointer to the PR/issue that introduced "limit pools"?

@smarterclayton (Contributor) commented:

If you have a rate limiter based on source IP or client credentials, you could open a lot of watches at once and refresh them within your rate window. The longer watches are, the lower the rate that can be given to individual clients. On the other hand, if reestablishing a watch is cheap (because clients update their window), then a lower timeout turns over the connected clients more often.

@davidopp (Member) commented Apr 7, 2015

I see, so this helps if you have a rate limiter based on something that the attacker cannot easily generate a lot of. Do we have this in the system currently?

@smarterclayton (Contributor) commented:

We do not, although I would like to extend Brendan's initial work to that eventually.


@lavalamp (Member) commented Apr 7, 2015

We have a time limit in the apiserver; it could be lengthened. Note there's also a time limit in nginx.

I don't really think that a 5-minute timeout should be a problem. My thinking is that to get a substantial gain here, say 10x, we would have to extend this period to 50 minutes. That seems clearly too long to me.

@lavalamp (Member) commented Apr 7, 2015

Really, to be correct, clients need to re-list occasionally, and they have to be able to handle the watch closing at any time. I think the five-minute forced close is fine because it forces clients to do it right.

@lavalamp (Member) commented Apr 7, 2015

Essentially I think this behavior is by design, not a bug, and shouldn't adversely affect performance.

@timothysc (Member) commented:

@wojtek-t @lavalamp - the lumpiness in the response curve is what our old graph showed:

[image: pod startup rate graph]

It shows overall bursts of up to 30 pods/sec, and our goal was to eliminate the lumps.
What is interesting is the slope difference between the old run here and the new one in #6207.

@davidopp (Member) commented Apr 7, 2015

@lavalamp thinks the system should be able to handle a re-list every 5 min, so this is WAI.

@davidopp davidopp closed this as completed Apr 7, 2015
@smarterclayton (Contributor) commented:

I would clarify that in light of etcd #2048: the system should be able to handle a re-list every five minutes. It should not require all clients to re-list every five minutes. Re-listing is a client responsibility.


@wojtek-t (Member, Author) commented Apr 8, 2015

@smarterclayton @davidopp
I agree that fixing the original issue at the etcd level is a much better (and cleaner) solution. But IIUC, this is not a high priority for etcd and it will not happen any time soon. However, in a 100-node cluster, with each Kubelet having at least a watch on all pods, this may generate significant load.
So I still think that we might need to consider a temporary workaround for it in the near future.

cc @fgrzadkowski

@smarterclayton (Contributor) commented:

If it's a high priority for us, we should help them fix it. I suspect that it's a fairly easy fix once the API details are sorted out (the server can return 204 No Content with the updated etcd index when the watch window is exceeded; the client can recognize that and return an error; we read the error and return a typed error back to our clients, who update their resource watch version).
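
A sketch of what handling that 204 could look like on our side (the use of etcd's X-Etcd-Index header here is an assumption about how the index might be surfaced, not documented etcd behavior for this proposed case):

```go
package etcdwatch

import (
	"fmt"
	"net/http"
	"strconv"
)

// checkWatchWindow inspects an etcd v2 watch response for the proposed behavior:
// a 204 No Content reply meaning "your waitIndex has fallen out of the window",
// with the current index surfaced so the caller can catch up without a full LIST.
func checkWatchWindow(resp *http.Response) (resumeIndex uint64, windowClosed bool, err error) {
	if resp.StatusCode != http.StatusNoContent {
		return 0, false, nil // normal event stream; nothing special to do
	}
	idx, err := strconv.ParseUint(resp.Header.Get("X-Etcd-Index"), 10, 64)
	if err != nil {
		return 0, true, fmt.Errorf("window closed but no usable index: %v", err)
	}
	return idx, true, nil
}
```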


@gatici commented Jul 19, 2019

Dear all,
I am doing a big installation with Helm charts, and it is failing every time with the error below. I am using an RKE cluster. Is it possible to increase the timeout of the watch?
How can I do it?

Thanks,

  • chart: so-4.0.0
    description: 'Release "dev-so" failed post-install: watch closed before UntilWithoutRetry
    timeout'
    revision: 1
    status: FAILED
    updated: Fri Jul 19 06:14:40 2019
