
Reboot e2e test timeout because of slow docker startup #9349

Closed
bprashanth opened this issue Jun 5, 2015 · 15 comments
Assignees
Labels
area/test · kind/flake · priority/backlog · sig/node

Comments

@bprashanth
Contributor

I've seen this fail on occasion on local e2e runs. It looks like what happens is:

  • Reboot; the kubelet logs show a bunch of binary-looking output (I'm assuming from the reboot)
I0605 22:10:50.673994 @^ x 100
  • SSH fails for a while; e2e logs:
INFO: Error while issuing ssh command: when running echo b
  • SSH works, but the kubelet isn't up (10:50 -> 11:50)
I0605 22:11:50.687609    2193 server.go:261] Using root directory: /var/lib/kubelet
I0605 22:11:50.688319    2193 manager.go:126] cAdvisor running in container: "/"
  • The kubelet is up, but docker isn't running
I0605 22:12:56.816171    2515 kubelet.go:2082] Query container info for pod "kibana-logging-v1-0cev8_default" failed with error (API error (500): engine is shutdown
...
E0605 22:14:15.630260    2515 kubelet.go:1644] Couldn't sync containers: dial unix /var/run/docker.sock: no such file or directory
  • Test times out

Here's a gist of a segment of my logs:
https://gist.github.com/bprashanth/c90bd10b359851a57cb1

and here are the describe events for the pod that the test complained about (elasticsearch-logging-v1-7w3d6):
https://gist.github.com/bprashanth/f380c2e80e5740c37520
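
To make the failure shape concrete, here's a minimal standalone sketch (not the actual reboot.go code; the helper name waitForNodeReady and the timing values are made up for illustration) of the fixed-budget polling involved: if docker only comes back a few minutes after the reboot, the budget runs out even though the node eventually recovers.

```go
// Minimal illustrative sketch, standard library only. waitForNodeReady stands
// in for whatever the real test checks per node (SSH reachability, kubelet up,
// docker responding); the numbers are invented to mirror the timeline above.
package main

import (
	"fmt"
	"time"
)

// waitForNodeReady polls ready() every interval until it returns true or the
// timeout elapses, and reports a "failed reboot test" style error on timeout.
func waitForNodeReady(ready func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ok, err := ready()
		if err != nil {
			fmt.Printf("node not ready yet: %v\n", err)
		}
		if ok {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("node failed reboot test: not ready within %v", timeout)
}

func main() {
	start := time.Now()
	// Simulate a node whose docker daemon only comes back ~3 minutes after reboot.
	dockerUpAfter := 3 * time.Minute
	ready := func() (bool, error) {
		if time.Since(start) < dockerUpAfter {
			return false, fmt.Errorf("dial unix /var/run/docker.sock: no such file or directory")
		}
		return true, nil
	}
	// With a 2-minute budget this times out; with a larger budget it would pass.
	if err := waitForNodeReady(ready, 5*time.Second, 2*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```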

@mforbes (just from git blame on the test; unsure if you care)

@bprashanth added the area/test and sig/node labels on Jun 5, 2015
@bgrant0607
Member

See also #9062, #8962, #9117

@bprashanth
Contributor Author

I swear I searched for the error message and found nothing. I think the reboot test involves a whole lot of stuff, and there's a good chance all three bugs are actually different, so retitling.

@bprashanth changed the title from "Reboot e2e test falky" to "Reboot e2e test timeout because of slow docker startup" on Jun 5, 2015
@brendandburns modified the milestone: v1.0-candidate on Jun 9, 2015
@goltermann modified the milestones: v1.0, v1.0-candidate on Jun 9, 2015
@goltermann added the priority/backlog label on Jun 9, 2015
@dchen1107
Member

One way to speed up docker's startup is to remove all containers before the reboot, for testing purposes. There isn't much we can do at the node level.
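
To make that concrete, here's a rough sketch of that pre-reboot cleanup (a hypothetical helper, not something the test does today; in practice it would run over SSH on each node right before triggering the reboot). It just uses the standard `docker ps -aq` / `docker rm -f` CLI invocations, so the daemon has less container state to recover on startup.

```go
// Rough sketch: force-remove every container on the node before rebooting.
// Assumes the docker CLI is on PATH on the node being cleaned up.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func removeAllContainers() error {
	// List every container ID, running or not.
	out, err := exec.Command("docker", "ps", "-aq").Output()
	if err != nil {
		return fmt.Errorf("listing containers: %v", err)
	}
	ids := strings.Fields(string(out))
	if len(ids) == 0 {
		return nil // nothing to clean up
	}
	// Force-remove them all; docker kills running containers before removal.
	args := append([]string{"rm", "-f"}, ids...)
	if out, err := exec.Command("docker", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("removing containers: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := removeAllContainers(); err != nil {
		fmt.Println(err)
	}
}
```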

@dchen1107 removed the sig/node label on Jun 11, 2015
@goltermann added the priority/important-soon label and removed the priority/backlog label on Jun 11, 2015
@brendandburns
Contributor

The reboot test shows no failures due to this bug in the last 30 runs. I'm de-prioritizing back to P2, and if it continues not to be flaky, I'm going to kick it out of 1.0.

@brendandburns self-assigned this on Jun 11, 2015
@brendandburns added the priority/backlog label and removed the priority/important-soon label on Jun 11, 2015
@brendandburns removed their assignment on Jun 25, 2015
@wojtek-t
Member

wojtek-t commented Jul 1, 2015

@jszczepkowski - can you please take a look?

@bprashanth
Contributor Author

FYI, I think reboot is part of the tests we're skipping (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/hack/jenkins/e2e.sh#L69), so if we don't think reboot is important/stable enough to run regularly, this probably isn't a 1.0 bug.

@bgrant0607 modified the milestones: v1.0, v1.1 on Sep 16, 2015
@brendandburns added the kind/flake label on Sep 25, 2015
@goltermann
Contributor

@quinton-hoole @fabioy Do we think this one is now fixed via #14772, or is there more to do?

@ghost

ghost commented Oct 2, 2015

We're actually still seeing occasional timeouts since #14772, so let's leave this one open to track those. For example:
/job/kubernetes-e2e-gce-reboot/5589/
Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Oct 2 10:11:24.229: Node e2e-reboot-minion-6ab8 failed reboot test.

@goltermann removed this from the v1.1 milestone on Oct 19, 2015
@ikehz
Contributor

ikehz commented Jan 21, 2016

These are still flaking occasionally:

  • Reboot [Disruptive] each node by dropping all inbound packets for a while and ensure they function afterwards (failed 1 time in the last 30 runs, stability 96%; latest run passed in 3 min 56 sec)
  • Reboot [Disruptive] each node by dropping all outbound packets for a while and ensure they function afterwards (no known failures, stability 100%; latest run passed in 3 min 57 sec)
  • Reboot [Disruptive] each node by ordering clean reboot and ensure they function upon restart (no known failures, stability 100%; latest run passed in 1 min 58 sec)
  • Reboot [Disruptive] each node by ordering unclean reboot and ensure they function upon restart (failed 1 time in the last 30 runs, stability 96%; latest run passed in 1 min 54 sec)
  • Reboot [Disruptive] each node by switching off the network interface and ensure they function upon switch on (no known failures, stability 100%; latest run passed in 3 min 59 sec)
  • Reboot [Disruptive] each node by triggering kernel panic and ensure they function upon restart (failed 1 time in the last 30 runs, stability 96%; latest run passed in 1 min 22 sec)

@bprashanth
Contributor Author

I think a lot of that is addressed in #19189 (comment)

@ikehz
Contributor

ikehz commented Jan 21, 2016

... except that both of those failures were during the TCP problems we saw yesterday, where a bunch of other tests were also failing.

@ikehz
Contributor

ikehz commented Jan 21, 2016

@dchen1107 Why did you assign this to me?

@ikehz
Contributor

ikehz commented Jan 21, 2016

Closing in favor of #19189.

@ikehz closed this as completed on Jan 21, 2016
@ikehz added the sig/node label and removed the area/test-infra label on Jan 21, 2016
@bprashanth
Contributor Author

Agree with the dupe. I don't think these were actually networking problems or nodes not coming up; the cases I saw last week and at the start of this week were because of cluster addons/static pods not coming up. I might be wrong, or might have taken a biased sample of the wrong logs.
