Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pod status not reported promptly in integration test (potentially due to watch delay) #6651

Closed
lavalamp opened this issue Apr 9, 2015 · 17 comments · Fixed by #8080
Closed
Assignees
Labels
area/test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@lavalamp
Copy link
Member

lavalamp commented Apr 9, 2015

Probably not the same as #6261?

E0409 20:06:54.436466   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:06:57.231617   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:00.441072   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:03.830751   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:06.631939   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:08.999560   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:12.642264   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:15.431503   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:18.831765   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:21.036332   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:24.233363   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:27.660547   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:30.518706   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:33.000798   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:36.464078   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:39.449110   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:42.014692   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:45.637593   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:48.437592   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:51.240095   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:54.632542   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:07:57.430374   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:00.632996   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:03.041408   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:06.656540   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:09.841615   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:12.260723   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:15.634119   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:18.637260   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:21.254251   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:24.433427   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:27.639387   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
E0409 20:08:30.647808   32568 kubelet.go:1341] Failed creating a mirror pod "static-pod-from-spec-localhost_default": pods "static-pod-from-spec-localhost" already exists
I0409 20:08:32.367981   32568 integration.go:391] pod found: default/container-vm-guestbook-manifest-127
I0409 20:08:32.368364   32568 integration.go:391] pod found: default/container-vm-guestbook-pod-spec-localhost
I0409 20:08:32.368544   32568 integration.go:391] pod found: default/foo
I0409 20:08:32.368665   32568 integration.go:391] pod found: default/static-pod-from-spec-localhost
F0409 20:08:32.368827   32568 integration.go:394] static-pod-from-spec FAILED: mirror pod has not been created or is not running: timed out waiting for the condition

To @yujuhong for triage

@lavalamp lavalamp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/test-infra labels Apr 9, 2015
@dchen1107
Copy link
Member

Shouldn't this fixed by #6580?

@yujuhong
Copy link
Contributor

yujuhong commented Apr 9, 2015

@dchen1107, this has nothing to do with #6580. kubelet is trying to create the mirror pod over and over again because it still hasn't observed the mirror pod from watch. It is probably the same as # #6261.

@lavalamp
Copy link
Member Author

lavalamp commented Apr 9, 2015

I thought they might be different because this one has a bunch of "failed creating ... already exists" messages in it, but #6261 doesn't.

@yujuhong
Copy link
Contributor

yujuhong commented Apr 9, 2015

It is essentially the same problem -- the watch was delayed so kubelet kept seeing the old snapshot of the pods where the mirror didn't exist. That's why it's flaky and only happens on shippable (which I suspect has poorer performance) :(

This becomes worse every time a new test is added. I think we need to limit the number of concurrent tests in integration.go
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cmd/integration/integration.go#L981

@yujuhong yujuhong changed the title Another mirror pod flake Limit the number of concurrent tests in integration.go Apr 9, 2015
@yujuhong
Copy link
Contributor

#6655 limits the number of concurrent tests on travis/shippable to 4. Let's see if this helps reduce the timeout flakes.

@lavalamp
Copy link
Member Author

@yujuhong yujuhong changed the title Limit the number of concurrent tests in integration.go Pod status not reported promptly in integration test (potentially due to watch delay) Apr 10, 2015
@yujuhong
Copy link
Contributor

Here is yet another example: https://app.shippable.com/builds/5528571f892aba0c00bd674c

Retitled the issue since limiting the concurrent tests doesn't work. It could be that the server is heavily loaded and the test timed out, or there was a delay/drop in kubelet's pod watch stream.

@dchen1107
Copy link
Member

@yujuhong
Copy link
Contributor

If, for whatever reason, a watch update for pod creation is dropped, kubelet would need to wait until re-listing to notice that. This would certainly cause the test to time out. @lavalamp, would it be useful to shorten the re-listing interval for tests like this?

@lavalamp
Copy link
Member Author

I am not quite sure how this works--let's chat about it tomorrow?

@yujuhong
Copy link
Contributor

From this failulre, we can see that kubelet was lightly loaded as the update channel was mostly empty while waiting.

I0416 20:47:36.975546    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:37.976479    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:38.977073    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:39.978154    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:40.111029    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:40.978537    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:41.978930    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:42.724643    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:42.979593    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:43.980301    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:44.981798    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:45.982255    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
I0416 20:47:46.982765    2903 config.go:186] 0/50 updates in the channel; merge finished in 0.00 seconds.
F0416 20:47:47.207091    2903 integration.go:914] FAILED: pod never started running timed out waiting for the condition

This corroborates the theory that watch might be at fault.

@ghost
Copy link

ghost commented Apr 16, 2015

cc: quinton-hoole

@yujuhong
Copy link
Contributor

The WIP reflector benchmark (#6697) may shed some light on this issue once it's integrated with the continuous e2e performance suite.

I will keep this issue open for now to gather more data (since this is not reproducible locally at all), and to see if there are other potential causes.

yujuhong added a commit to yujuhong/kubernetes that referenced this issue May 11, 2015
The test has been flaky for a while due to the potential watch performance
problem. Temporarily disable this test until we resolve kubernetes#6651.

Note that there is extensive coverage of mirror pod creation/deletion via unit
tests in kubelet_test.go.
@yujuhong
Copy link
Contributor

The PR above disables the static pod test temporarily so that we can track down the root cause without spamming people with the failures. Note that other tests in the same file (integration.go) may still trigger the flaky failure, though less frequently.

@yujuhong yujuhong reopened this May 11, 2015
@ghost ghost added this to the v1.0-post milestone Jun 15, 2015
@ghost
Copy link

ghost commented Jun 15, 2015

@yujuhong Can this be closed yet?

@yujuhong
Copy link
Contributor

@yujuhong Can this be closed yet?

No, integration test is still flaky. See https://app.shippable.com/builds/5582458f3d07800f00077736 for example.

Many tests in cmd/intergration/intergration.go rely on checking the pod status with a preset timeout. I disabled one test which failed most often, but other tests still fail from time to time.

yujuhong added a commit to yujuhong/kubernetes that referenced this issue Jun 24, 2015
@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@saad-ali saad-ali added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Dec 15, 2015
@yujuhong
Copy link
Contributor

yujuhong commented Feb 9, 2016

The issue has been inactive for months, and there have not been any new reports. Closing.

@yujuhong yujuhong closed this as completed Feb 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants