Write tests for dead node evacuation and rebooted node population #7051
#7028 refers. I don't think that we have integration or e2e tests for:

Comments
Note: it's not yet clear to me whether an integration or e2e test is more appropriate here. An integration test would run faster, if it's feasible.
It's a fine test for shooting kubelet in the head and seeing what happens, but we need a physical reboot test as well. Unfortunately, until we get some of the packages baked into the node and fix a few things related to the way the release is staged in the release bucket (no md5 sums, for instance), it can be on the order of 5m for GCE. I'm loathe to add that as an e2e today. As I mentioned in #7028 (comment), we could possibly create a "nightly" suite where we put long tests like that. This is, incidentally, almost certainly another shell test. (Unless a priv'd docker container can reboot? That seems unlikely.)
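For reference, the "shoot kubelet in the head" variant can be sketched in a few lines of shell. This is an illustration only, not the eventual e2e code; the node name, zone, and timings are placeholders for a disposable GCE test node.

```sh
# Sketch: kill kubelet on a GCE node and watch the node's Ready condition flap.
NODE=e2e-test-minion-1
ZONE=us-central1-b

gcloud compute ssh "${NODE}" --zone "${ZONE}" --command "sudo pkill -9 kubelet"

# The node should go NotReady within the node controller's grace period and
# return to Ready once the node's supervisor restarts kubelet.
for i in $(seq 1 30); do
  kubectl get nodes "${NODE}" | tail -n 1
  sleep 10
done
```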
cc @gmarek
To make sure we're on the same page: the v1.0 part of this rather big issue is creating a test (e2e or integration) which checks whether the cluster behaves correctly when one machine goes away and later comes back? Reboot tests, etc. are for later, right?
If by your distinction between "goes away and comes back" vs. "reboots" you mean "unplanned disappearance and reappearance" vs. "planned shutdown and startup, with all the necessary hooks invoked on shutdown", then yes. I think that a machine losing power or network connectivity, or crashing unexpectedly, and then booting and/or rejoining the network should be tested and work properly. That is, after all, one of the primary selling points of Kubernetes and Borg. If planned shutdown and restart behave the same as unplanned disappearance and reappearance for v1.0, that's fine; we can build and test the clean shutdown hooks after v1.0. As for specific tests, in order of priority, I'd suggest:
Q
/subscribe
Part of this is covered by the much maligned
@jszczepkowski have you had a chance to look into this yet?
I'm currently working on a test which resizes the MIG for nodes (adding and removing nodes) and checks whether pods are rescheduled. I should be finishing this test soon. The test for shutdown will be similar: instead of decreasing the MIG size, a node from the MIG will be removed and shut down.
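A rough sketch of that resize flow, assuming current gcloud syntax; the group name, zone, sizes, and wait are placeholders:

```sh
# Shrink the managed instance group by one node, wait for pods to be
# rescheduled, then grow it back.
GROUP=e2e-test-minion-group
ZONE=us-central1-b

gcloud compute instance-groups managed resize "${GROUP}" --zone "${ZONE}" --size 2
sleep 120                  # crude stand-in for "wait until the node count settles"
kubectl get pods -o wide   # pods from the removed node should be rescheduled

gcloud compute instance-groups managed resize "${GROUP}" --zone "${ZONE}" --size 3
kubectl get nodes          # the replacement node should register and go Ready
```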
@jszczepkowski note the distinction in my comment above between "unplanned disappearance" and "planned/orderly shutdown". I think what you're suggesting with Managed Instance Groups will test "orderly shutdown"? What we're looking for in this particular issue is "unplanned disappearance".
@jszczepkowski any update on this?
The test which resizes the MIG is currently in review (#8243); two review rounds are already done. As soon as it is merged, I'll start extending it to cover the cases defined in this issue.
BTW just so we split work well, @jszczepkowski I'm just starting on an e2e test that will do a "MIG rolling-update" to the same template, which just tests deleting all of the instances and recreating them. This is to ensure sane behavior for node upgrades, which will use rolling-updates. I'm not going to be doing a "hard" machine kill by causing a kernel panic, though, and that looks more like what @quinton-hoole is suggesting in this test. Also for those reading along, #8243 is really close :-)
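For those following along, a rolling update to the same template amounts to recreating each instance in place and letting the MIG bring up replacements. A sketch, assuming current gcloud syntax and placeholder names:

```sh
# Recreate every instance in the MIG one at a time; the group replaces each
# from the same template, which is what a no-op rolling update exercises.
GROUP=e2e-test-minion-group
ZONE=us-central1-b

for instance in $(gcloud compute instance-groups managed list-instances \
    "${GROUP}" --zone "${ZONE}" --format 'value(instance)'); do
  gcloud compute instance-groups managed recreate-instances "${GROUP}" \
      --zone "${ZONE}" --instances "${instance}"
done
```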
Added e2e test case which triggers a kernel panic on a node and verifies it restarts correctly. Valid for gce and gke. Related to kubernetes#7051.
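The usual way to force such a panic is via sysrq. A sketch, assuming sysrq is enabled on the node image and the node is disposable:

```sh
NODE=e2e-test-minion-1
ZONE=us-central1-b

# Trigger a kernel panic over SSH. The connection drops when the kernel dies,
# so detach the command and leave time to disconnect cleanly first.
gcloud compute ssh "${NODE}" --zone "${ZONE}" --command \
  "nohup sudo sh -c 'sleep 10 && echo c > /proc/sysrq-trigger' >/dev/null 2>&1 &"
# Afterwards, poll `kubectl get nodes` until the node reboots and goes Ready.
```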
Added e2e test cases which trigger different types of node failures and verify the nodes are correctly re-assimilated. Valid for gce and gke. Related to kubernetes#7051.
Added e2e test case which verifies that a node can return to the cluster after a longer network partition. Valid for gce. Related to kubernetes#7051.
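One way to simulate such a partition is to drop the node's traffic from the master for a while and then lift the rule. A sketch with placeholder names and timings:

```sh
NODE=e2e-test-minion-1
ZONE=us-central1-b
MASTER_IP=10.240.0.2   # placeholder; use the cluster master's internal IP

# Block traffic from the master for five minutes, then remove the rule.
gcloud compute ssh "${NODE}" --zone "${ZONE}" --command \
  "sudo iptables -I INPUT -s ${MASTER_IP} -j DROP && sleep 300 && \
   sudo iptables -D INPUT -s ${MASTER_IP} -j DROP"

# The node should be marked NotReady during the partition and rejoin the
# cluster (with its pods restored or rescheduled) once connectivity returns.
kubectl get nodes "${NODE}"
```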
@jszczepkowski can you summarize what additional work there is for this issue? Are #8862 (and #8784) sufficient, or will there be additional tests?
Added e2e test case which verifies that a node can return to the cluster after a longer network partition. Valid for gce. Finally fixes kubernetes#7051.
#8862 has been merged, closing.