Watch shouldn't break every 5 minutes #6513
I think that 3 is because of: I think that we should change it so that all "Watch" requests have a much bigger timeout (but don't change the timeout of any other GETs). What do you think about it?
Yes, I can confirm that we're ignoring watches in #6207:
And yes, I think that watches should similarly be excluded from API server timeouts, but would like @bgrant0607 to comment on what the intended semantics and implementation of long-term watches are. Do they have any timeout? Is there intended to be a limit on the number of concurrent long-term watches open against the API server?
@wojtek-t could you keep me in the loop on new items created, and possibly label them as perf, b/c I think we are chasing the same things.
@timothysc will do
I think we need to fix the etcd problem. Working around it on our end is just too painful, and doesn't help us (short of aggregating all watches on our end, which is a larger chunk of work). My recommendation is to work with upstream etcd to propose the minimal change for etcd-io/etcd#2048 required to fix the issue on our end (a client option that requests that etcd indicate the window has closed and provide the latest etcd index at that time). That fixes the short term problem in a way that everyone benefits.
@smarterclayton I don't understand your comment - I didn't observe any problem at the etcd level - the watch on etcd doesn't break itself

@timothysc: sure - I will look into it as soon as the new etcd release is built
@wojtek-t the problem you alluded to of being outside the window of watchable events. The disconnect would not be a problem if the client knows to ask for resource version X+10000. The disconnect is only a problem because the client only knows to ask for resource version X.
This is only a problem because clients don't get an updated resource version after 1000 uninteresting writes. #2048 is intended to fix that problem.
The 5 minute timeout is our maximum http.Server timeout. I don't think we should increase this, because it doesn't fix the problem from #2048, which affects every watcher.
cc @hchoudh |
OK - I agree that the issue you mentioned is more important to fix. |
On the other hand, dead clients are going to be consuming resources, especially now that we have limit pools. I do not want someone to be able to mount a DoS attack against the API server by opening watches and starving traffic. Having to reestablish their connection periodically is part of ensuring that clients that exceed their rate limit are throttled (vs. being able to open a watch forever).
@smarterclayton can you explain this attack mechanism a bit more? Regardless of watch timeout, can't someone successfully attack by running a lot of clients that keep renewing? Statistically some non-attack clients will get through if the attackers keep having to renew, but if the attacker runs enough clients then I think they can keep that number close to zero.
Also, can you send a pointer to the PR/issue that introduced "limit pools"? |
I see, so this helps if you have a rate limiter based on something that the attacker cannot easily generate a lot of. Do we have this in the system currently?
We do not, although I would like to extend Brendan's initial work to that eventually.
We have a time limit in apiserver; it could be lengthened. Note there's also a time limit in nginx. I don't really think that a 5 minute timeout should be a problem. My thinking is that to get a substantial gain here, say 10x, we would have to extend this period to 50 minutes. That seems clearly too long to me.
Really, to be correct, clients need to re-list occasionally, and they have to be able to handle the watch closing at any time. I think the five minute forced close is fine because it forces clients to do it right. |
Essentially I think this behavior is by design, not a bug, and shouldn't adversely affect performance. |
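The client behavior these comments call for can be sketched as a simple list-then-watch loop: list once to establish a consistent resource version, watch from there, and fall back to a full re-list whenever the server says the version is too old. This is a toy model, not Kubernetes' actual reflector; `event`, `listFn`, and `watchFn` are hypothetical stand-ins for the real client calls.

```go
package main

import "fmt"

// event models one watch notification. expired means the server reports
// that our resource version fell out of the retained window.
type event struct {
	resourceVersion uint64
	expired         bool
}

// syncLoop lists to get a fresh resourceVersion, then repeatedly watches
// from the last version it saw. Each outer iteration stands in for one
// watch connection that the server eventually closes (e.g. the 5-minute cut).
func syncLoop(listFn func() uint64, watchFn func(from uint64) []event, rounds int) uint64 {
	rv := listFn() // full re-list establishes a starting version
	for i := 0; i < rounds; i++ {
		for _, ev := range watchFn(rv) {
			if ev.expired {
				rv = listFn() // version too old: only recovery is a re-list
				break
			}
			rv = ev.resourceVersion // track progress so reconnects resume here
		}
	}
	return rv
}

func main() {
	list := func() uint64 { return 100 }
	watch := func(from uint64) []event {
		return []event{{resourceVersion: from + 1}, {resourceVersion: from + 2}}
	}
	// Two watch rounds advance the version without any re-list: 100 -> 102 -> 104.
	fmt.Println(syncLoop(list, watch, 2)) // 104
}
```

The key point from the thread is in the inner loop: as long as the client records the version of each event it sees, a closed watch is cheap to resume, and the expensive re-list is only needed when the window has actually been exceeded.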
@lavalamp thinks the system should be able to handle a re-list every 5 min so this is WAI. |
I would clarify that based on etcd 2048. The system should be able to handle a re-list every five minutes. It should not require all clients to re-list every five minutes. Re-listing is a client responsibility.
@smarterclayton @davidopp |
If it's a high priority for us we should help them fix it. I suspect that it's a fairly easy fix once the API details are sorted out (the server can return 204 No Content with the updated etcd index when the watch window is exceeded, the client can recognize that and return an error, we read the error and return a typed error back to clients, who update their resource watch version).
|
Currently Watch is broken every 5 minutes. It is then "recreated", which sometimes requires listing all elements under watch (if there were more than 1000 "uninteresting" etcd writes since then).
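The "more than 1000 uninteresting writes" condition comes from etcd retaining only a bounded window of recent events. A toy check for when resuming fails and a full re-list is required, assuming monotonically increasing indices; the window size is the figure quoted in this issue, not a tuned constant:

```go
package main

import "fmt"

// etcd keeps roughly the last 1000 events in its watch history. If more
// writes than that (even "uninteresting" ones to unrelated keys) happened
// since the last index we observed, resuming the watch from that index
// fails and the client must re-list everything under watch.
const watchWindow = 1000

func needsRelist(lastSeenIndex, currentIndex uint64) bool {
	return currentIndex-lastSeenIndex > watchWindow
}

func main() {
	fmt.Println(needsRelist(10000, 10500)) // false: still inside the window
	fmt.Println(needsRelist(10000, 12000)) // true: the window slid past us
}
```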
I dug deeper into it and it seems this is not related to etcd itself. What happens is:
cc @fgrzadkowski