-
Notifications
You must be signed in to change notification settings - Fork 38.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Randomize apiserver timeout #8574
Conversation
I think the proposal is great, I'm just wondering what the values of "minTimeout" and "maxTimeout" should be. Why did you choose 5 and 8 minutes - was it some guess or was it based on some data? |
In practice I've seen the connection break at < 5m. This is something I'm currently tracking down. That aside, I left the lower bound at whatever it is currently is. For the upper bound I observed a correlation between slow lists and > 30-50 concurrent lists, so I bucketed 100 nodes giving them each their own minute (didn't experiment more than that). I think anywhere between 6-8 should be Ok. |
I asked because my feeling is that maxTimeout = 10minutes = 2 * minTimeout would be more intuitive (we will be spreading the load "uniformly"). |
Yeah, not sure but I think this is easily changable and requires some experimentation, so I can set it to 10 or 4 depending on what other voters prefer. The upper limit is a balance between how paranoid we want to be about dropping watch events vs how likely it is to happen on some set of nodes. |
@@ -410,11 +420,12 @@ func (s *APIServer) Run(_ []string) error { | |||
} | |||
|
|||
if secureLocation != "" { | |||
timeout := time.Duration(MinTimeoutSecs+rand.Intn(MaxTimeoutSecs-MinTimeoutSecs)) * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we want the same timeout for all http servers within one Kubelet?
I think it definitely makes sense to differentiate it between Kubelets, but maybe within a Kubelet they should be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
This LGTM. |
These numbers seem fine at distributing load: http://play.golang.org/p/3FSbYtLmLT |
Sorry - my comment above doesn't make much sense. However - I'm afraid that I stopped understanding this PR. Basically, what you're doing with it is you are setting timeout at the apiserver level to be a random between 5 and 8 minutes. But it's still the same for all Kubelets. In other words, I think that instead of a burst after 5 minutes, we will have a burst after some random time between 5 and 8 minutes. [What we would like to do is to setup the timeout at the request level in each Kubelet separately]. |
Hold on, @wojtek-t I need an associated timeout in each watch server. |
78b6082
to
f75c6e2
Compare
No joken this time >.< |
Thanks - will take a look later today after my meetings. |
const ( | ||
// Maximum duration before timing out read/write requests | ||
// Set to a value larger than the timeouts in each watch server. | ||
ReadWriteTimeoutMins = time.Minute * 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove "Mins" from name, it's a time.Duration, not a count of minutes.
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain @ixdy. |
f75c6e2
to
aa6f55f
Compare
aa6f55f
to
8a5445d
Compare
Nits addressed, stil trying to run density on a 100 node cluster (it seems to be stuck polling for fluentd pods in some pre-test phase even on head today, will debug tomorrow) though e2e passed. |
ok to test |
LGTM |
Connections get a random timeout between min/max instead of a hard 5 minute timout.
@lavalamp @fgrzadkowski @wojtek-t