
Write tests for dead node evacuation and rebooted node population #7051

Closed
ghost opened this issue Apr 20, 2015 · 17 comments
Assignees: jszczepkowski
Labels: area/test, area/test-infra, priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)
Milestone: v1.0

Comments

@ghost

ghost commented Apr 20, 2015

#7028 refers. I don't think that we have integration or e2e tests for:

  1. When a node loses power, RC-managed pods on that node are rescheduled to a new node within a suitable time period (consistent with the system configuration).
  2. When a node regains power and boots up, it comes up healthy and can successfully host pods.
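
A rough sketch of how check 1 might be automated, assuming the test can shell out to kubectl against the cluster under test; the RC name, namespace, and timeout are illustrative placeholders, and the helper simply polls `kubectl get rc -o json` until the observed replica count matches the desired count again:

```go
package e2e

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// rcCounts holds just the ReplicationController fields the check needs.
type rcCounts struct {
	Spec struct {
		Replicas int `json:"replicas"`
	} `json:"spec"`
	Status struct {
		Replicas int `json:"replicas"`
	} `json:"status"`
}

// waitForRCRecovery polls the RC until status.replicas matches spec.replicas
// again, or the deadline expires. It shells out to kubectl so it runs against
// whatever cluster the local kubeconfig points at.
func waitForRCRecovery(name, namespace string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("kubectl", "get", "rc", name,
			"--namespace", namespace, "-o", "json").Output()
		if err == nil {
			var rc rcCounts
			if json.Unmarshal(out, &rc) == nil &&
				rc.Status.Replicas == rc.Spec.Replicas {
				return nil // all replicas are running again
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("rc %q did not recover within %v", name, timeout)
}
```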
@ghost ghost added area/test priority/backlog Higher priority than priority/awaiting-more-evidence. area/test-infra labels Apr 20, 2015
@ghost ghost added this to the v1.0 milestone Apr 20, 2015
@ghost
Author

ghost commented Apr 20, 2015

Note. It's not yet clear to me whether an integration or e2e test is more appropriate here. An integration test would run faster, if it's feasible.

@zmerlynn
Member

An integration test won't catch anything related to #5666 / #6930 / #7028.

@zmerlynn
Member

It's a fine test for shooting kubelet in the head and seeing what happens, but we need a physical reboot test as well. Unfortunately, until we get some of the packages baked into the node and fix a few things related to the way the release is staged in the release bucket (no md5 sums, for instance), it can be on the order of 5m for GCE. I'm loath to add that as an e2e today. As I mentioned in #7028 (comment), we could possibly create a "nightly" suite where we put long tests like that.

This is, incidentally, almost certainly another shell test. (Unless a priv'd docker container can reboot? That seems unlikely.)

@bgrant0607
Member

cc @gmarek

@gmarek
Contributor

gmarek commented Apr 28, 2015

To make sure we're on the same page: the v1.0 part of this rather big issue is creating a test (e2e or integration) which checks if the behavior of the cluster is correct when one machine goes away and later comes back? Reboot tests, etc. are for later, right?

@ghost
Author

ghost commented Apr 28, 2015

If by your distinction between "goes away and comes back" vs "reboots" you mean "unplanned disappearance and reappearance" vs "planned shutdown and startup, with all the necessary hooks invoked on shutdown", then yes. I think that a machine losing power or network connectivity, or crashing unexpectedly, and then booting and/or rejoining the network should be tested and work properly. That is, after all, one of the primary selling points of Kubernetes and Borg. I think that if planned shutdown and restart behave the same as unplanned disappearance and reappearance for v1.0, that's fine. We can build and test the clean shutdown hooks after v1.0.

As for specific tests, in order of priority, I'd suggest:

  1. Run an RC, kill one of the nodes (e.g. ssh into the node and cause a kernel panic), and check that the RC reports the missing replica and then converges back to the correct number of running replicas within a suitable timeframe (120 sec?).
  2. As above, but cause a temporary network disconnection on the node (e.g. by ssh'ing into the node and scheduling a cron job to down the network interface, or adding an iptables firewall rule to drop all inbound and outbound packets, sleeping for a few minutes, and then restoring network connectivity).
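
A minimal sketch of how this fault injection might be driven on GCE, assuming `gcloud compute ssh` access to the node; the node name, zone, and blackhole duration are placeholders, and the network cut is backgrounded with a short delay so the SSH session can return before the iptables rules take effect:

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// sshToNode runs a shell command on a GCE node via gcloud. For destructive
// commands the SSH session may be torn down before it exits cleanly, so
// callers may choose to ignore the returned error.
func sshToNode(node, zone, command string) error {
	return exec.Command("gcloud", "compute", "ssh", node,
		"--zone", zone, "--command", command).Run()
}

// triggerKernelPanic hard-kills a node via sysrq. The SSH connection is
// expected to drop, so the error is deliberately ignored.
func triggerKernelPanic(node, zone string) {
	_ = sshToNode(node, zone,
		"sudo sh -c 'echo 1 > /proc/sys/kernel/sysrq; echo c > /proc/sysrq-trigger'")
}

// blackholeNetwork drops all traffic on the node for the given number of
// seconds, then removes the rules again. The whole sequence is backgrounded
// on the node so the SSH command can return before connectivity is lost.
func blackholeNetwork(node, zone string, seconds int) error {
	cmd := fmt.Sprintf(
		"nohup sh -c 'sleep 10; sudo iptables -A INPUT -j DROP; sudo iptables -A OUTPUT -j DROP; "+
			"sleep %d; sudo iptables -D INPUT -j DROP; sudo iptables -D OUTPUT -j DROP' >/dev/null 2>&1 &",
		seconds)
	return sshToNode(node, zone, cmd)
}
```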

Q

@davidopp
Member

/subscribe

@zmerlynn
Member

As above, but cause a temporary network disconnection on the node (e.g. by ssh'ing into the node and scheduling a cron job to down the network interface, or adding an iptables firewall rule to drop all inbound and outbound packets, sleeping for a few minutes, and then restoring network connectivity).

Part of this is covered by the much maligned services.sh test. It is, in fact, why that test is such a hold-out, and why it will be a royal PITA to parallelize (if ever).

@davidopp davidopp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Apr 28, 2015
@gmarek gmarek assigned wojtek-t and unassigned gmarek May 4, 2015
@wojtek-t wojtek-t assigned jszczepkowski and unassigned wojtek-t May 4, 2015
@davidopp
Member

@jszczepkowski have you had a chance to look into this yet?

@jszczepkowski
Contributor

I'm currently working on a test which resizes the MIG for nodes (adds and removes nodes) and checks whether pods are rescheduled. I should be finishing this test soon.

The test for shutdown will be similar: instead of decreasing the MIG size, a node will be removed from the MIG and shut down.
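
For reference, the MIG manipulation such a test needs can be wrapped in a small helper like the one below (current `gcloud` syntax shown; 2015-era releases exposed this under `gcloud preview managed-instance-groups`, and the group name, zone, and size are placeholders):

```go
package e2e

import (
	"os/exec"
	"strconv"
)

// resizeMIG changes the size of the managed instance group that backs the
// cluster's nodes; the resize test shrinks or grows it and then waits for
// pods to be rescheduled onto the remaining or new nodes.
func resizeMIG(group, zone string, size int) error {
	return exec.Command("gcloud", "compute", "instance-groups", "managed",
		"resize", group, "--size", strconv.Itoa(size), "--zone", zone).Run()
}
```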

@ghost
Author

ghost commented May 11, 2015

@jszczepkowski note the distinction in my comment above between "unplanned disappearance" and "planned/orderly shutdown". I think what you're suggesting with Managed Instance Groups will test "orderly shutdown"? What we're looking for in this particular issue is "unplanned disappearance".

@davidopp
Member

@jszczepkowski any update on this?

@jszczepkowski
Contributor

The test which resizes the MIG is currently in review: #8243, and two review rounds are already done. As soon as it is merged, I'll start extending it to add the cases defined in this issue.

@mbforbes
Contributor

BTW, just so we split work well, @jszczepkowski: I'm just starting on an e2e test that will do a MIG "rolling-update" to the same template, which just tests deleting all of the instances and recreating them. This is to ensure sane behavior for node upgrades, which will use "rolling-updates". I'm not going to be doing a "hard" machine kill by causing a kernel panic, though, and that looks more like what @quinton-hoole is suggesting in this issue.

Also for those reading along, #8243 is really close :-)

jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue May 25, 2015
Added e2e test case which triggers kernel panic on a node and verifies it restarts correctly. Valid for gce and gke. Related to kubernetes#7051.
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue May 27, 2015
Added e2e test cases which trigger different types of node failures and verify they are correctly re-assimilated. Valid for gce and gke. Related to kubernetes#7051.
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue May 27, 2015
Added e2e test case which verifies if a node can return to cluster after longer network partition. Valid for gce. Related to kubernetes#7051.
@davidopp
Member

davidopp commented Jun 1, 2015

@jszczepkowski can you summarize what additional work there is for this issue? Is #8862 (and #8784) sufficient, or will there be additional tests?

@jszczepkowski
Contributor

@davidopp There will be no additional tests, #8862 and #8784 are sufficient. After #8862 is merged I will close this issue.

jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Jun 2, 2015
Added e2e test case which verifies if a node can return to cluster after longer network partition. Valid for gce. Finally fixes to kubernetes#7051.
@davidopp
Member

davidopp commented Jun 3, 2015

#8862 has been merged, closing.

@davidopp davidopp closed this as completed Jun 3, 2015