Pod status not reported promptly in integration test (potentially due to watch delay) #6651

Probably not the same as #6261?

To @yujuhong for triage

Comments
Shouldn't this be fixed by #6580?
@dchen1107, this has nothing to do with #6580. kubelet is trying to create the mirror pod over and over again because it still hasn't observed the mirror pod from the watch. It is probably the same as #6261.
I thought they might be different because this one has a bunch of "failed creating ... already exists" messages in it, but #6261 doesn't.
It is essentially the same problem: the watch was delayed, so kubelet kept seeing an old snapshot of the pods in which the mirror pod didn't exist. That's why it's flaky and only happens on Shippable (which I suspect has poorer performance) :( This gets worse every time a new test is added. I think we need to limit the number of concurrent tests in integration.go.
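For illustration, here is a minimal, self-contained Go sketch of the failure mode described above: kubelet's local view of pods (fed by the watch) goes stale, so every sync iteration retries the mirror pod creation and the API server rejects it with "already exists". The types and names below are hypothetical stand-ins, not the real kubelet code.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the API server's "already exists" error
// (the real code checks a typed API error; this is just an illustration).
var errAlreadyExists = errors.New(`pods "static-pod-mirror" already exists`)

// staleCache models kubelet's view of pods, populated from the watch stream.
// If the watch is delayed, the mirror pod never shows up here even though the
// API server has already accepted it.
type staleCache struct{ pods map[string]bool }

func (c *staleCache) has(name string) bool { return c.pods[name] }

// apiServer models the authoritative state: the first create succeeds,
// every retry fails with "already exists".
type apiServer struct{ pods map[string]bool }

func (s *apiServer) create(name string) error {
	if s.pods[name] {
		return errAlreadyExists
	}
	s.pods[name] = true
	return nil
}

func main() {
	cache := &staleCache{pods: map[string]bool{}} // watch updates never arrive
	server := &apiServer{pods: map[string]bool{}}

	// Each sync iteration, kubelet re-creates the mirror pod because its
	// (stale) cache still says it does not exist, producing the repeated
	// "failed creating ... already exists" log lines seen in the flake.
	for i := 0; i < 3; i++ {
		if !cache.has("static-pod-mirror") {
			if err := server.create("static-pod-mirror"); err != nil {
				fmt.Println("failed creating mirror pod:", err)
			}
		}
	}
}
```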
#6655 limits the number of concurrent tests on Travis/Shippable to 4. Let's see if this helps reduce the timeout flakes.
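As a rough illustration of what capping test concurrency looks like in Go, here is a sketch using a buffered-channel semaphore. The test functions and the wiring are hypothetical; #6655 is the authoritative change and may do this differently.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Hypothetical list of integration test functions.
	tests := []func(){
		func() { fmt.Println("running test 1") },
		func() { fmt.Println("running test 2") },
		func() { fmt.Println("running test 3") },
	}

	const maxConcurrent = 4 // matches the cap mentioned for Travis/Shippable
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup

	for _, test := range tests {
		wg.Add(1)
		sem <- struct{}{} // blocks if maxConcurrent tests are already in flight
		go func(t func()) {
			defer wg.Done()
			defer func() { <-sem }()
			t()
		}(test)
	}
	wg.Wait()
}
```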
Another example: https://app.shippable.com/builds/55275c45aa50c00b001ed2d9
Here is yet another example: https://app.shippable.com/builds/5528571f892aba0c00bd674c

Retitled the issue since limiting the number of concurrent tests didn't help. It could be that the server is heavily loaded and the test timed out, or that there was a delay/drop in kubelet's pod watch stream.
If, for whatever reason, a watch update for pod creation is dropped, kubelet would not notice the pod until the next re-list, which would certainly cause the test to time out. @lavalamp, would it be useful to shorten the re-listing interval for tests like this?
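For context, a reflector lists the pods once and then relies on the watch; a periodic re-list (or resync) is the fallback that lets it recover from dropped watch events. Below is a hedged sketch using present-day client-go package paths to show where such a period is wired in; the exact paths, signatures, and resync semantics have changed over the years (the code at the time lived under pkg/client/cache), so treat this as an assumption, not the original code.

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Hypothetical in-cluster config; a test harness would point this at the
	// integration test's API server instead.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", "default", fields.Everything())
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	// The last argument is the period being discussed: a shorter value for
	// tests would let the client refresh sooner after a missed watch event,
	// at the cost of more load on the API server.
	period := 10 * time.Second // hypothetical test value
	reflector := cache.NewReflector(lw, &v1.Pod{}, store, period)

	stop := make(chan struct{})
	go reflector.Run(stop) // list, then watch, populating `store`

	// ... a test would poll `store` here for the expected pod ...
	time.Sleep(30 * time.Second)
	close(stop)
}
```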
I am not quite sure how this works; let's chat about it tomorrow?
From this failure, we can see that kubelet was lightly loaded: the update channel was mostly empty while waiting. This corroborates the theory that the watch might be at fault.
cc: @quinton-hoole
The WIP reflector benchmark (#6697) may shed some light on this issue once it's integrated with the continuous e2e performance suite. I will keep this issue open for now to gather more data (since this is not reproducible locally at all), and to see if there are other potential causes.
The test has been flaky for a while due to a potential watch performance problem. Temporarily disable this test until we resolve kubernetes#6651. Note that there is extensive coverage of mirror pod creation/deletion via unit tests in kubelet_test.go.
The PR above disables the static pod test temporarily so that we can track down the root cause without spamming people with the failures. Note that other tests in the same file (integration.go) may still trigger the flaky failure, though less frequently.
@yujuhong Can this be closed yet?
No, the integration test is still flaky; see https://app.shippable.com/builds/5582458f3d07800f00077736 for example. Many tests in cmd/integration/integration.go rely on checking the pod status with a preset timeout. I disabled the test that failed most often, but other tests still fail from time to time.
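For reference, these status checks follow a poll-until-timeout pattern; a delayed watch or status update therefore surfaces as a test timeout rather than a product failure. Here is a minimal sketch of such a check using present-day client-go; the pod name, namespace, timeout, and config wiring are hypothetical, not the actual test code.

```go
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForPodRunning polls the API server until the named pod reports Running
// or the timeout expires.
func waitForPodRunning(client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.Poll(time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return pod.Status.Phase == v1.PodRunning, nil
	})
}

func main() {
	config, err := rest.InClusterConfig() // hypothetical; a test harness supplies its own config
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// With a fixed budget like 60s, any delay in the watch/status pipeline
	// shows up as a test timeout.
	if err := waitForPodRunning(client, "default", "static-pod", 60*time.Second); err != nil {
		fmt.Println("pod did not reach Running in time:", err)
	}
}
```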
This helps debug kubernetes#6651
The issue has been inactive for months, and there have not been any new reports. Closing.