Rare 10-second timeouts from ResourceQuota admission controller #63608
@kubernetes/sig-api-machinery-bugs
That's weird, I'm not aware of a 10s timeout anywhere (there is a 60s one). @jpbetz does the etcd client have such a timeout?
FYI:
@MikeSpreitzer - can you try etcd without SSL/TLS? It could be
Yes, I will try that, the next time it is convenient to tear everything down and start over.
OK, I tried again with etcd NOT using TLS (neither client/server nor server/server). I still got a "timeout" error. The driver was at commit f3a4391de25867b2a3e6de184fe7577f600d00c0.
I am running kube release 1.10.3 now. I set the log verbosity (--v) to 5 on the api-server driven by the client, and ran another trial. Here are the errors logged by the driver (with timestamps in UTC):
I found this in the stderr of the api-server (the timestamp seems to be in UTC):
The etcd cluster should have no trouble with that write rate.
Around that time, the etcd server that was loaded by the kube api-server logged the following (in US CDT):
@MikeSpreitzer try tweaking
I added
The doc string for
However, making the client specify a timeout of 17 sec (see https://github.com/MikeSpreitzer/k8api-scaletest/tree/45958532ca0b661a567ed763362078d3722837f8/cmdriverclosed) warded off the problem --- according to my first and only trial so far.
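For reference, here is a minimal sketch of one way a client-go based driver can impose such a per-request timeout; the 17-second value and the helper name are illustrative assumptions, not necessarily how the linked driver wires it:

```go
package driver

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClientWithTimeout is a hypothetical helper: it builds a clientset whose
// requests are given up on client-side after 17 seconds instead of hanging.
func newClientWithTimeout(kubeconfigPath string) (*kubernetes.Clientset, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	// rest.Config.Timeout bounds every request made through this config.
	config.Timeout = 17 * time.Second
	return kubernetes.NewForConfig(config)
}
```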
However, my second trial was much more interesting. The api-server logged an error on create --- but the client did not! And the subsequent update and delete operations on the object in question got 404s.
Here is what the driver logged about that object:
Here is what the api-server logged around that time:
The etcd leader logged nothing unusual around that time.
@hzxuzhonghu has suggested the mysterious timeout might be due to https://github.com/kubernetes/kubernetes/blob/v1.10.3/plugin/pkg/admission/resourcequota/controller.go#L551. I do indeed have the
I tried removing
Why is the ResourceQuota admission controller causing these 10 second timeouts? Remember, they are rare: I can do hundreds of thousands of create operations in rapid fire, and get a single digit number of timeouts, while the other create operations complete in well under 10 seconds. See https://docs.google.com/document/d/1o1ygFQ2n7uQIMNRT5xmKPDV5gUbWP5pmtZTQA6V3u9I for an example.
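For context on where such a bare "timeout" error can come from: the linked line sits on the path where each admission check hands its request to a background quota worker and then waits for an answer. A simplified sketch of that waiter pattern (an assumed shape abbreviated from the v1.10 controller.go, not the verbatim source):

```go
package quotasketch

import (
	"errors"
	"time"
)

// admissionWaiter pairs one pending admission request with a channel the
// background quota worker closes after evaluating usage for that request.
type admissionWaiter struct {
	finished chan struct{}
	result   error
}

// evaluate hands the request to the worker (via enqueue) and waits for the
// answer. If the worker never completes the waiter -- for instance because it
// was dropped from the pending list -- the caller sees only this generic
// 10-second timeout, matching the rare errors reported here.
func evaluate(enqueue func(*admissionWaiter)) error {
	w := &admissionWaiter{finished: make(chan struct{})}
	enqueue(w)

	select {
	case <-w.finished:
		return w.result
	case <-time.After(10 * time.Second):
		return errors.New("timeout")
	}
}
```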
Here is another example (now with https://github.com/MikeSpreitzer/k8api-scaletest/tree/e3ca6afab243c90ce4e1fee7a09ee9cb6e163205/cmdriverclosed and the etcd cluster using TLS):
So I slowed down from 3000 to 2000 kube writes per second, but I still got a timeout.
And no, that is not a really tight cluster of data points at (17:35:25.870, 10.036) --- it really is just a single data point. No other operation took anywhere near that long.
Interestingly, there is a roughly 50 ms bubble of latency around the time of the start of the operation that timed out. As you can see from the graph, there are other bubbles of latency. All much smaller than 10 sec. Here are 36 consecutive lines of driver data from around the timeout, sorted by preop time.
Looking in https://github.com/kubernetes/kubernetes/blob/v1.10.3/plugin/pkg/admission/resourcequota/controller.go, I see something odd. In two places ---
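To make the suspected inconsistency concrete, here is a simplified sketch (an assumed shape, not the upstream source) of how getWork, completeWork, and the worker loop fit together; the comments mark where the empty-list fast path in getWork collides with the caller's unconditional deferred completeWork:

```go
package quotasketch

import (
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// admissionWaiter is the same idea as in the earlier sketch.
type admissionWaiter struct {
	finished chan struct{}
}

type evaluator struct {
	queue      workqueue.Interface
	lock       sync.Mutex
	work       map[string][]*admissionWaiter // pending waiters per namespace
	dirtyWork  map[string][]*admissionWaiter // waiters that arrived mid-evaluation
	inProgress map[string]bool
}

// getWork pops a namespace and its pending waiters. Its empty-list fast path
// already tells the queue the item is done and expects the caller NOT to call
// completeWork for that namespace.
func (e *evaluator) getWork() (string, []*admissionWaiter, bool) {
	item, shutdown := e.queue.Get()
	if shutdown {
		return "", nil, true
	}
	ns := item.(string)

	e.lock.Lock()
	defer e.lock.Unlock()
	waiters := e.work[ns]
	delete(e.work, ns)
	delete(e.dirtyWork, ns)

	if len(waiters) != 0 {
		e.inProgress[ns] = true
		return ns, waiters, false
	}
	e.queue.Done(ns) // fast path: finished with this namespace right here
	delete(e.inProgress, ns)
	return ns, nil, false
}

// completeWork assumes it is the one and only Done for the namespace and
// promotes any dirty waiters to the pending list.
func (e *evaluator) completeWork(ns string) {
	e.lock.Lock()
	defer e.lock.Unlock()
	e.queue.Done(ns)             // a second Done if the fast path already ran
	e.work[ns] = e.dirtyWork[ns] // can overwrite waiters added in the meantime
	delete(e.dirtyWork, ns)
	delete(e.inProgress, ns)
}

func (e *evaluator) doWork() {
	for {
		ns, waiters, quit := e.getWork()
		if quit {
			return
		}
		func() {
			// completeWork is deferred even when waiters is empty -- exactly the
			// case the fast path in getWork was meant to exclude. A waiter that
			// slips in between can be dropped, and its caller then sees only the
			// 10-second timeout.
			defer e.completeWork(ns)
			for _, w := range waiters {
				close(w.finished) // stand-in for the real quota evaluation
			}
		}()
	}
}
```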
So I made https://github.com/MikeSpreitzer/kubernetes/tree/quota-eval-fix to test the hypothesis that the double-call of
@MikeSpreitzer wanna file a WIP PR so we can see the exact changes?
Yeah, when I get a chance. Weekend now.
In plugin/pkg/admission/resourcequota/controller.go, getWork has an optimization that obviates --- and actually is inconsistent with --- a subsequent call on completeWork in the case where the returned list is empty. Fixes kubernetes#63608
This change simplifies the code in plugin/pkg/admission/resourcequota/controller.go by removing the optimization in getWork that required the caller to NOT call completeWork if getWork returns the empty list of work. BTW, the caller was not obeying that requirement; now the caller's behavior (which is unchanged) is right. Fixes kubernetes#63608
Automatic merge from submit-queue (batch tested with PRs 65152, 65199, 65179, 64598, 65216). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Remove optimization from getWork in resourcequota/controller.go

**What this PR does / why we need it**: This change simplifies the code in plugin/pkg/admission/resourcequota/controller.go by removing the optimization in getWork that required the caller to NOT call completeWork if getWork returns the empty list of work. BTW, the caller was not obeying that requirement; now the caller's behavior (which is unchanged) is right.

**Which issue(s) this PR fixes**: Fixes #63608

**Special notes for your reviewer**: This is a simpler alternative to #64377

**Release note**:
```release-note
NONE
```
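Continuing the sketch above, the merged change is easiest to read as the removal of that fast path, so every Get from the queue is matched by exactly one Done, inside the completeWork the caller always defers (again a simplified rendering, not the verbatim diff):

```go
// Sketch of getWork after the fix: no empty-list special case, so the caller's
// unconditional deferred completeWork is always the single matching Done.
func (e *evaluator) getWork() (string, []*admissionWaiter, bool) {
	item, shutdown := e.queue.Get()
	if shutdown {
		return "", nil, true
	}
	ns := item.(string)

	e.lock.Lock()
	defer e.lock.Unlock()
	waiters := e.work[ns]
	delete(e.work, ns)
	delete(e.dirtyWork, ns)
	e.inProgress[ns] = true // always marked; completeWork clears it later
	return ns, waiters, false
}
```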
@MikeSpreitzer very nice investigation you've done here!
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
I wrote a simple driver program (https://github.com/MikeSpreitzer/k8api-scaletest/blob/2f5a8414b5da4059bb2a2b33fcdb88f40ca2e07b/cmdriverclosed/main.go) that creates ConfigMap objects as fast as it can, and ran it like this:
cmdriverclosed --kubeconfig $myconfig -n 30103 -threads 10 -conns 10
In the CSV file produced, two of the creates recorded an error that simply says "timeout":
Here is an excerpt from the api-server's log:
A few minutes later I checked that my etcd cluster is healthy; it was:
I also looked at the output of systemctl status etcd on each etcd host, and each reported that etcd had been running for over a day.
What you expected to happen:
I expected no such problem, or at least a more informative error message.
How to reproduce it (as minimally and precisely as possible):
That test driver is pretty minimal.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
Kernel (e.g. uname -a): Linux mjs-api-dal10-b 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux