Pod status not reported promptly in integration test (potentially due to watch delay) #6651

Probably not the same as #6261?

To @yujuhong for triage

Comments
Shouldn't this be fixed by #6580?
@dchen1107, this has nothing to do with #6580. kubelet is trying to create the mirror pod over and over again because it still hasn't observed the mirror pod from the watch. It is probably the same as #6261.
I thought they might be different because this one has a bunch of "failed creating ... already exists" messages in it, but #6261 doesn't.
It is essentially the same problem: the watch was delayed, so kubelet kept seeing an old snapshot of the pods in which the mirror pod didn't exist. That's why it's flaky and only happens on Shippable (which I suspect has poorer performance) :( This gets worse every time a new test is added. I think we need to limit the number of concurrent tests in integration.go.
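For illustration, here is a minimal, self-contained Go sketch of the failure mode described above: kubelet's local view of pods (fed by the watch) goes stale, so every sync iteration retries the mirror pod creation and the API server rejects it with "already exists". The types and names below are hypothetical stand-ins, not the real kubelet code.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the API server's "already exists" error
// (the real code checks a typed API error; this is just an illustration).
var errAlreadyExists = errors.New(`pods "static-pod-mirror" already exists`)

// staleCache models kubelet's view of pods, populated from the watch stream.
// If the watch is delayed, the mirror pod never shows up here even though the
// API server has already accepted it.
type staleCache struct{ pods map[string]bool }

func (c *staleCache) has(name string) bool { return c.pods[name] }

// apiServer models the authoritative state: the first create succeeds,
// every retry fails with "already exists".
type apiServer struct{ pods map[string]bool }

func (s *apiServer) create(name string) error {
	if s.pods[name] {
		return errAlreadyExists
	}
	s.pods[name] = true
	return nil
}

func main() {
	cache := &staleCache{pods: map[string]bool{}} // watch updates never arrive
	server := &apiServer{pods: map[string]bool{}}

	// Each sync iteration, kubelet re-creates the mirror pod because its
	// (stale) cache still says it does not exist, producing the repeated
	// "failed creating ... already exists" log lines seen in the flake.
	for i := 0; i < 3; i++ {
		if !cache.has("static-pod-mirror") {
			if err := server.create("static-pod-mirror"); err != nil {
				fmt.Println("failed creating mirror pod:", err)
			}
		}
	}
}
```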
#6655 limits the number of concurrent tests on Travis/Shippable to 4. Let's see if this helps reduce the timeout flakes.
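As a rough illustration of what capping test concurrency looks like in Go, here is a sketch using a buffered-channel semaphore. The test functions and the wiring are hypothetical; #6655 is the authoritative change and may do this differently.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Hypothetical list of integration test functions.
	tests := []func(){
		func() { fmt.Println("running test 1") },
		func() { fmt.Println("running test 2") },
		func() { fmt.Println("running test 3") },
	}

	const maxConcurrent = 4 // matches the cap mentioned for Travis/Shippable
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup

	for _, test := range tests {
		wg.Add(1)
		sem <- struct{}{} // blocks if maxConcurrent tests are already in flight
		go func(t func()) {
			defer wg.Done()
			defer func() { <-sem }()
			t()
		}(test)
	}
	wg.Wait()
}
```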
Another example: https://app.shippable.com/builds/55275c45aa50c00b001ed2d9
Here is yet another example: https://app.shippable.com/builds/5528571f892aba0c00bd674c

Retitled the issue since limiting the number of concurrent tests didn't help. It could be that the server is heavily loaded and the test timed out, or that there was a delay/drop in kubelet's pod watch stream.
If, for whatever reason, a watch update for pod creation is dropped, kubelet would not notice the pod until the next re-list, which would certainly cause the test to time out. @lavalamp, would it be useful to shorten the re-listing interval for tests like this?
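For context, a reflector lists the pods once and then relies on the watch; a periodic re-list (or resync) is the fallback that lets it recover from dropped watch events. Below is a hedged sketch using present-day client-go package paths to show where such a period is wired in; the exact paths, signatures, and resync semantics have changed over the years (the code at the time lived under pkg/client/cache), so treat this as an assumption, not the original code.

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Hypothetical in-cluster config; a test harness would point this at the
	// integration test's API server instead.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", "default", fields.Everything())
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	// The last argument is the period being discussed: a shorter value for
	// tests would let the client refresh sooner after a missed watch event,
	// at the cost of more load on the API server.
	period := 10 * time.Second // hypothetical test value
	reflector := cache.NewReflector(lw, &v1.Pod{}, store, period)

	stop := make(chan struct{})
	go reflector.Run(stop) // list, then watch, populating `store`

	// ... a test would poll `store` here for the expected pod ...
	time.Sleep(30 * time.Second)
	close(stop)
}
```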
I am not quite sure how this works; let's chat about it tomorrow?
From this failure, we can see that kubelet was lightly loaded: the update channel was mostly empty while waiting. This corroborates the theory that the watch might be at fault.
cc: @quinton-hoole
The WIP reflector benchmark (#6697) may shed some light on this issue once it's integrated with the continuous e2e performance suite. I will keep this issue open for now to gather more data (since this is not reproducible locally at all), and to see if there are other potential causes.
The test has been flaky for a while due to a potential watch performance problem. Temporarily disable this test until we resolve kubernetes#6651. Note that there is extensive coverage of mirror pod creation/deletion via unit tests in kubelet_test.go.
The PR above disables the static pod test temporarily so that we can track down the root cause without spamming people with the failures. Note that other tests in the same file (integration.go) may still trigger the flaky failure, though less frequently.
@yujuhong Can this be closed yet?
No, the integration test is still flaky; see https://app.shippable.com/builds/5582458f3d07800f00077736 for example. Many tests in cmd/integration/integration.go rely on checking the pod status with a preset timeout. I disabled the test that failed most often, but other tests still fail from time to time.
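For reference, these status checks follow a poll-until-timeout pattern; a delayed watch or status update therefore surfaces as a test timeout rather than a product failure. Here is a minimal sketch of such a check using present-day client-go; the pod name, namespace, timeout, and config wiring are hypothetical, not the actual test code.

```go
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForPodRunning polls the API server until the named pod reports Running
// or the timeout expires.
func waitForPodRunning(client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.Poll(time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return pod.Status.Phase == v1.PodRunning, nil
	})
}

func main() {
	config, err := rest.InClusterConfig() // hypothetical; a test harness supplies its own config
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// With a fixed budget like 60s, any delay in the watch/status pipeline
	// shows up as a test timeout.
	if err := waitForPodRunning(client, "default", "static-pod", 60*time.Second); err != nil {
		fmt.Println("pod did not reach Running in time:", err)
	}
}
```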
This helps debug kubernetes#6651
The issue has been inactive for months, and there have not been any new reports. Closing.