
Reboot e2e test timeout because of slow docker startup #9349

Closed
bprashanth opened this issue Jun 5, 2015 · 15 comments
Assignees
Labels
area/test · kind/flake · priority/backlog · sig/node

Comments

@bprashanth
Contributor

I've seen this fail on occasion on local e2e runs. It looks like what happens is:

  • Reboot; the kubelet logs show a bunch of binary-looking output (I'm assuming from the reboot)
I0605 22:10:50.673994 @^ x 100
  • SSH fails for a while; e2e logs:
INFO: Error while issuing ssh command: when running echo b
  • SSH works, but the kubelet isn't up (10:50 -> 11:50)
I0605 22:11:50.687609    2193 server.go:261] Using root directory: /var/lib/kubelet
I0605 22:11:50.688319    2193 manager.go:126] cAdvisor running in container: "/"
  • The kubelet is up, but docker isn't running
I0605 22:12:56.816171    2515 kubelet.go:2082] Query container info for pod "kibana-logging-v1-0cev8_default" failed with error (API error (500): engine is shutdown
...
E0605 22:14:15.630260    2515 kubelet.go:1644] Couldn't sync containers: dial unix /var/run/docker.sock: no such file or directory
  • Test times out

Here's a gist of a segment of my logs:
https://gist.github.com/bprashanth/c90bd10b359851a57cb1

and here are the describe events for the pod that the test complained about (elasticsearch-logging-v1-7w3d6):
https://gist.github.com/bprashanth/f380c2e80e5740c37520
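
To make the failure shape concrete, here's a minimal standalone sketch (not the actual reboot.go code; the helper name waitForNodeReady and the timing values are made up for illustration) of the fixed-budget polling involved: if docker only comes back a few minutes after the reboot, the budget runs out even though the node eventually recovers.

```go
// Minimal illustrative sketch, standard library only. waitForNodeReady stands
// in for whatever the real test checks per node (SSH reachability, kubelet up,
// docker responding); the numbers are invented to mirror the timeline above.
package main

import (
	"fmt"
	"time"
)

// waitForNodeReady polls ready() every interval until it returns true or the
// timeout elapses, and reports a "failed reboot test" style error on timeout.
func waitForNodeReady(ready func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ok, err := ready()
		if err != nil {
			fmt.Printf("node not ready yet: %v\n", err)
		}
		if ok {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("node failed reboot test: not ready within %v", timeout)
}

func main() {
	start := time.Now()
	// Simulate a node whose docker daemon only comes back ~3 minutes after reboot.
	dockerUpAfter := 3 * time.Minute
	ready := func() (bool, error) {
		if time.Since(start) < dockerUpAfter {
			return false, fmt.Errorf("dial unix /var/run/docker.sock: no such file or directory")
		}
		return true, nil
	}
	// With a 2-minute budget this times out; with a larger budget it would pass.
	if err := waitForNodeReady(ready, 5*time.Second, 2*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```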

@mforbes (just from git blame on the test; unsure if you care)

@bprashanth added the area/test and sig/node labels on Jun 5, 2015
@bgrant0607
Member

See also #9062, #8962, #9117

@bprashanth
Contributor Author

I swear I searched for the error message and found nothing. I think the reboot test involves a whole lot of stuff, and there's a good chance all three bugs are actually different, so retitling.

@bprashanth changed the title from "Reboot e2e test falky" to "Reboot e2e test timeout because of slow docker startup" on Jun 5, 2015
@brendandburns modified the milestone: v1.0-candidate on Jun 9, 2015
@goltermann modified the milestones: v1.0, v1.0-candidate on Jun 9, 2015
@goltermann added the priority/backlog label on Jun 9, 2015
@dchen1107
Member

One way to speed up docker's startup is to remove all containers before the reboot, for testing purposes. There isn't much we can do at the node level.
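
To make that concrete, here's a rough sketch of that pre-reboot cleanup (a hypothetical helper, not something the test does today; in practice it would run over SSH on each node right before triggering the reboot). It just uses the standard `docker ps -aq` / `docker rm -f` CLI invocations, so the daemon has less container state to recover on startup.

```go
// Rough sketch: force-remove every container on the node before rebooting.
// Assumes the docker CLI is on PATH on the node being cleaned up.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func removeAllContainers() error {
	// List every container ID, running or not.
	out, err := exec.Command("docker", "ps", "-aq").Output()
	if err != nil {
		return fmt.Errorf("listing containers: %v", err)
	}
	ids := strings.Fields(string(out))
	if len(ids) == 0 {
		return nil // nothing to clean up
	}
	// Force-remove them all; docker kills running containers before removal.
	args := append([]string{"rm", "-f"}, ids...)
	if out, err := exec.Command("docker", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("removing containers: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := removeAllContainers(); err != nil {
		fmt.Println(err)
	}
}
```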

@dchen1107 removed the sig/node label on Jun 11, 2015
@goltermann added the priority/important-soon label and removed the priority/backlog label on Jun 11, 2015
@brendandburns
Contributor

The reboot test shows no failures due to this bug in the last 30 runs. I'm de-prioritizing back to P2, and if it continues not to be flaky, I'm going to kick it out of 1.0.

@brendandburns self-assigned this on Jun 11, 2015
@brendandburns added the priority/backlog label and removed the priority/important-soon label on Jun 11, 2015
@brendandburns removed their assignment on Jun 25, 2015
@wojtek-t
Member

wojtek-t commented Jul 1, 2015

@jszczepkowski - can you please take a look?

@bprashanth
Contributor Author

FYI, I think reboot is part of the tests we're skipping (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/hack/jenkins/e2e.sh#L69), so if we don't think reboot is important/stable enough to run regularly, this probably isn't a 1.0 bug.

@bgrant0607 modified the milestones: v1.0, v1.1 on Sep 16, 2015
@brendandburns added the kind/flake label on Sep 25, 2015
@goltermann
Contributor

@quinton-hoole @fabioy Do we think this one is now fixed via #14772, or is there more to do?

@ghost

ghost commented Oct 2, 2015

We're actually still seeing occasional timeouts since #14772, so let's leave this one open to track those. For example:
/job/kubernetes-e2e-gce-reboot/5589/
Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Oct 2 10:11:24.229: Node e2e-reboot-minion-6ab8 failed reboot test.

@goltermann removed this from the v1.1 milestone on Oct 19, 2015
@ikehz
Contributor

ikehz commented Jan 21, 2016

These are still flaking occasionally:

  • Reboot [Disruptive] each node by dropping all inbound packets for a while and ensure they function afterwards (failed 1 time in the last 30 runs, stability 96%; latest run passed in 3 min 56 sec)
  • Reboot [Disruptive] each node by dropping all outbound packets for a while and ensure they function afterwards (no known failures, stability 100%; latest run passed in 3 min 57 sec)
  • Reboot [Disruptive] each node by ordering clean reboot and ensure they function upon restart (no known failures, stability 100%; latest run passed in 1 min 58 sec)
  • Reboot [Disruptive] each node by ordering unclean reboot and ensure they function upon restart (failed 1 time in the last 30 runs, stability 96%; latest run passed in 1 min 54 sec)
  • Reboot [Disruptive] each node by switching off the network interface and ensure they function upon switch on (no known failures, stability 100%; latest run passed in 3 min 59 sec)
  • Reboot [Disruptive] each node by triggering kernel panic and ensure they function upon restart (failed 1 time in the last 30 runs, stability 96%; latest run passed in 1 min 22 sec)

@bprashanth
Contributor Author

I think a lot of that is addressed in #19189 (comment)

@ikehz
Contributor

ikehz commented Jan 21, 2016

... except that both of those failures were during the TCP problems we saw yesterday, where a bunch of other tests were also failing.

@ikehz
Contributor

ikehz commented Jan 21, 2016

@dchen1107 Why did you assign this to me?

@ikehz
Contributor

ikehz commented Jan 21, 2016

Closing in favor of #19189.

@ikehz closed this as completed on Jan 21, 2016
@ikehz added the sig/node label and removed the area/test-infra label on Jan 21, 2016
@bprashanth
Contributor Author

Agree with the dupe. I don't think these were actually networking problems or nodes not coming up; the cases I saw last week and at the start of this week were because of cluster addons/static pods not coming up. I might be wrong, or might have taken a biased sample of the wrong logs.
