
Write tests for dead node evacuation and rebooted node population #7051

Closed
ghost opened this issue Apr 20, 2015 · 17 comments
Assignees: jszczepkowski
Labels: area/test, area/test-infra, priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)
Milestone: v1.0

Comments

@ghost

ghost commented Apr 20, 2015

#7028 refers. I don't think that we have integration or e2e tests for:

  1. When a node loses power, RC-managed pods on that node are rescheduled to a new node within a suitable time period (consistent with the system configuration).
  2. When a node regains power and boots up, it comes up healthy and can successfully host pods.
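
A rough sketch of how check 1 might be automated, assuming the test can shell out to kubectl against the cluster under test; the RC name, namespace, and timeout are illustrative placeholders, and the helper simply polls `kubectl get rc -o json` until the observed replica count matches the desired count again:

```go
package e2e

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// rcCounts holds just the ReplicationController fields the check needs.
type rcCounts struct {
	Spec struct {
		Replicas int `json:"replicas"`
	} `json:"spec"`
	Status struct {
		Replicas int `json:"replicas"`
	} `json:"status"`
}

// waitForRCRecovery polls the RC until status.replicas matches spec.replicas
// again, or the deadline expires. It shells out to kubectl so it runs against
// whatever cluster the local kubeconfig points at.
func waitForRCRecovery(name, namespace string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("kubectl", "get", "rc", name,
			"--namespace", namespace, "-o", "json").Output()
		if err == nil {
			var rc rcCounts
			if json.Unmarshal(out, &rc) == nil &&
				rc.Status.Replicas == rc.Spec.Replicas {
				return nil // all replicas are running again
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("rc %q did not recover within %v", name, timeout)
}
```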
@ghost ghost added area/test priority/backlog Higher priority than priority/awaiting-more-evidence. area/test-infra labels Apr 20, 2015
@ghost ghost added this to the v1.0 milestone Apr 20, 2015
@ghost
Author

ghost commented Apr 20, 2015

Note. It's not yet clear to me whether an integration or e2e test is more appropriate here. An integration test would run faster, if it's feasible.

@zmerlynn
Member

An integration test won't catch anything related to #5666 / #6930 / #7028.

@zmerlynn
Member

It's a fine test for shooting kubelet in the head and seeing what happens, but we need a physical reboot test as well. Unfortunately, until we get some of the packages baked into the node and fix a few things related to the way the release is staged in the release bucket (no md5 sums, for instance), it can be on the order of 5m for GCE. I'm loath to add that as an e2e today. As I mentioned in #7028 (comment), we could possibly create a "nightly" suite where we put long tests like that.

This is, incidentally, almost certainly another shell test. (Unless a priv'd docker container can reboot? That seems unlikely.)

@bgrant0607
Member

cc @gmarek

@gmarek
Contributor

gmarek commented Apr 28, 2015

To make sure we're on the same page: the v1.0 part of this rather big issue is creating a test (e2e or integration) which checks if the behavior of the cluster is correct when one machine goes away and later comes back? Reboot tests, etc. are for later, right?

@ghost
Author

ghost commented Apr 28, 2015

If by your distinction between "goes away and comes back" vs "reboots" you mean "unplanned disappearance and reappearance" vs "planned shutdown and startup, with all the necessary hooks invoked on shutdown", then yes. I think that a machine losing power or network connectivity, or crashing unexpectedly, and then booting and/or rejoining the network should be tested and work properly. That is, after all, one of the primary selling points of Kubernetes and Borg. I think that if planned shutdown and restart behave the same as unplanned disappearance and reappearance for v1.0, that's fine. We can build and test the clean shutdown hooks after v1.0.

As for specific tests, in order of priority, I'd suggest:

  1. Run an RC, kill one of the nodes (e.g. ssh into the node and cause a kernel panic), and check that the RC reports the missing replica and then converges back to the correct number of running replicas within a suitable timeframe (120 sec?).
  2. As above, but cause a temporary network disconnection on the node (e.g. by ssh'ing into the node and scheduling a cron job to down the network interface, or adding an iptables firewall rule to drop all inbound and outbound packets, sleeping for a few minutes, and then restoring network connectivity).
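
A minimal sketch of how this fault injection might be driven on GCE, assuming `gcloud compute ssh` access to the node; the node name, zone, and blackhole duration are placeholders, and the network cut is backgrounded with a short delay so the SSH session can return before the iptables rules take effect:

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// sshToNode runs a shell command on a GCE node via gcloud. For destructive
// commands the SSH session may be torn down before it exits cleanly, so
// callers may choose to ignore the returned error.
func sshToNode(node, zone, command string) error {
	return exec.Command("gcloud", "compute", "ssh", node,
		"--zone", zone, "--command", command).Run()
}

// triggerKernelPanic hard-kills a node via sysrq. The SSH connection is
// expected to drop, so the error is deliberately ignored.
func triggerKernelPanic(node, zone string) {
	_ = sshToNode(node, zone,
		"sudo sh -c 'echo 1 > /proc/sys/kernel/sysrq; echo c > /proc/sysrq-trigger'")
}

// blackholeNetwork drops all traffic on the node for the given number of
// seconds, then removes the rules again. The whole sequence is backgrounded
// on the node so the SSH command can return before connectivity is lost.
func blackholeNetwork(node, zone string, seconds int) error {
	cmd := fmt.Sprintf(
		"nohup sh -c 'sleep 10; sudo iptables -A INPUT -j DROP; sudo iptables -A OUTPUT -j DROP; "+
			"sleep %d; sudo iptables -D INPUT -j DROP; sudo iptables -D OUTPUT -j DROP' >/dev/null 2>&1 &",
		seconds)
	return sshToNode(node, zone, cmd)
}
```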

Q

@davidopp
Member

/subscribe

@zmerlynn
Member

As above, but cause a temporary network disconnection on the node (e.g. by ssh'ing into the node and scheduling a cron job to down the network interface, or adding an iptables firewall rule to drop all inbound and outbound packets, sleeping for a few minutes, and then restoring network connectivity).

Part of this is covered by the much maligned services.sh test. It is, in fact, why that test is such a hold-out, and why it will be a royal PITA to parallelize (if ever).

@davidopp davidopp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Apr 28, 2015
@gmarek gmarek assigned wojtek-t and unassigned gmarek May 4, 2015
@wojtek-t wojtek-t assigned jszczepkowski and unassigned wojtek-t May 4, 2015
@davidopp
Member

@jszczepkowski have you had a chance to look into this yet?

@jszczepkowski
Contributor

I'm currently working on a test which resizes the MIG for nodes (adds and removes nodes) and checks whether pods are rescheduled. I should be finishing this test soon.

The test for shutdown will be similar: instead of decreasing the MIG size, a node will be removed from the MIG and shut down.
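
For reference, the MIG manipulation such a test needs can be wrapped in a small helper like the one below (current `gcloud` syntax shown; 2015-era releases exposed this under `gcloud preview managed-instance-groups`, and the group name, zone, and size are placeholders):

```go
package e2e

import (
	"os/exec"
	"strconv"
)

// resizeMIG changes the size of the managed instance group that backs the
// cluster's nodes; the resize test shrinks or grows it and then waits for
// pods to be rescheduled onto the remaining or new nodes.
func resizeMIG(group, zone string, size int) error {
	return exec.Command("gcloud", "compute", "instance-groups", "managed",
		"resize", group, "--size", strconv.Itoa(size), "--zone", zone).Run()
}
```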

@ghost
Author

ghost commented May 11, 2015

@jszczepkowski note the distinction in my comment above between "unplanned disappearance" and "planned/orderly shutdown". I think what you're suggesting with Managed Instance Groups will test "orderly shutdown"? What we're looking for in this particular issue is "unplanned disappearance".

@davidopp
Member

@jszczepkowski any update on this?

@jszczepkowski
Contributor

The test which resizes the MIG is currently in review: #8243, and two review rounds are already done. As soon as it is merged, I'll start extending it to add the cases defined in this issue.

@mbforbes
Contributor

BTW, just so we split work well, @jszczepkowski: I'm just starting on an e2e test that will do a MIG "rolling-update" to the same template, which just tests deleting all of the instances and recreating them. This is to ensure sane behavior for node upgrades, which will use "rolling-updates". I'm not going to be doing a "hard" machine kill by causing a kernel panic, though, and that looks more like what @quinton-hoole is suggesting in this issue.

Also for those reading along, #8243 is really close :-)

jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue May 25, 2015
Added e2e test case which triggers kernel panic on a node and verifies it restarts correctly. Valid for gce and gke. Related to kubernetes#7051.
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue May 27, 2015
Added e2e test cases which trigger different types of node failures and verify they are correctly re-assimilated. Valid for gce and gke. Related to kubernetes#7051.
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue May 27, 2015
Added e2e test case which verifies if a node can return to cluster after longer network partition. Valid for gce. Related to kubernetes#7051.
@davidopp
Member

davidopp commented Jun 1, 2015

@jszczepkowski can you summarize what additional work there is for this issue? Is #8862 (and #8784) sufficient, or will there be additional tests?

@jszczepkowski
Contributor

@davidopp There will be no additional tests, #8862 and #8784 are sufficient. After #8862 is merged I will close this issue.

jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Jun 2, 2015
Added e2e test case which verifies if a node can return to cluster after longer network partition. Valid for gce. Finally fixes to kubernetes#7051.
@davidopp
Member

davidopp commented Jun 3, 2015

#8862 has been merged, closing.

@davidopp davidopp closed this as completed Jun 3, 2015