Fixes to node shutdown e2e test #99805
Conversation
- Test was failing due to using `sleep infinity` inside the busybox container, which was going into a crash loop. `sleep infinity` isn't supported by the sleep version in busybox, so replace it with a `while true; sleep` loop.
- Replace usage of dbus message emitting from gdbus to dbus-send. The test was failing on Ubuntu, which doesn't have gdbus installed. dbus-send is installed on COS and Ubuntu, so use it instead.
- Replace check of pod phase with the test util function `PodRunningReady`, which checks both the phase as well as the pod ready condition.
- Add some more verbose logging to ease future debugging.
@bobbypage: In response to this:

> /sig node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -86,18 +88,18 @@ var _ = SIGDescribe("GracefulNodeShutdown [Serial] [NodeAlphaFeature:GracefulNodeShutdown
 framework.ExpectNoError(err)
 framework.ExpectEqual(len(list.Items), len(pods), "the number of pods is not as expected")

+ ginkgo.By("Verifying batch pods are running")
+ for _, pod := range list.Items {
This is strange. `CreateBatch` is already checking for readiness, isn't it?
It was a bit tricky; here's what I found and what I think the problem is:

- `CreateBatch` calls `CreateSync` for each pod
- `CreateSync` polls if the pod is ready and checks that the error returned from `e2epod.WaitTimeoutForPodReadyInNamespace` is nil
- `WaitTimeoutForPodReadyInNamespace` polls `podRunningAndReady`
- `podRunningAndReady` checks that the pod status phase is Running AND that the pod is ready

In our case, the pods were in crash loop backoff and failing to start due to the `sleep infinity` issue. The problem is that when the pod is not ready, `podRunningAndReady` returns `(false, nil)` (where `nil` is the error). Higher up the stack, `WaitTimeoutForPodReadyInNamespace` will poll until the timeout, but only fail the test if the error is not nil.

In the current case, the pods were in crash loop backoff and never became ready. So `podRunningAndReady` kept checking whether the pods reached the ready state (which they never did), `WaitTimeoutForPodReadyInNamespace` eventually hit the timeout, and the test ended up continuing (since the error was nil).

With the added check here, we'll ensure that if the pods fail to reach the `ready=true` condition, the test fails early rather than continuing (and then failing due to a different issue).
/priority important-soon

/assign @dchen1107

/retest
@@ -188,10 +194,10 @@ func getGracePeriodOverrideTestPod(name string, node string, gracePeriod int64,
 Args: []string{`
 _term() {
 echo "Caught SIGTERM signal!"
- sleep infinity
+ while true; do sleep 5; done
Another option would be to set it to a sufficiently high value.
Ack, that should work as well. I went with the `while true; sleep` loop since it more closely matches an "infinity" sleep and seems to be the common pattern in other tests (e.g. https://github.com/kubernetes/kubernetes/blob/bd2e557/test/e2e_node/eviction_test.go#L832).

Let me know if a high value is better; I think both work, no strong opinion there.
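For reference, the busybox-compatible script shape being discussed can be sketched as follows. The `trap _term TERM` line is an assumption (the diff only shows the handler body); the script text is held in a variable rather than executed, since the loop by design never exits.

```shell
# Sketch of the container script after the fix. busybox `sleep` only
# accepts numeric durations, so `sleep infinity` is emulated with a
# loop of short sleeps. Stored in a variable, not run, because the
# loop intentionally never terminates.
SCRIPT='
_term() {
  echo "Caught SIGTERM signal!"
  while true; do sleep 5; done
}
trap _term TERM
while true; do sleep 5; done
'
echo "$SCRIPT"
```

The `sleep 5` interval is arbitrary; any short numeric duration gives the same effective "sleep forever" behavior while staying within what busybox `sleep` accepts.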
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobbypage, mrunalp

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/test pull-kubernetes-e2e-kind

Latest run of the test is now green with these fixes! https://testgrid.k8s.io/sig-node-kubelet#kubelet-serial-gce-e2e-graceful-node-shutdown
What type of PR is this?
/kind bug
/kind testing
What this PR does / why we need it:
- Test was failing due to using `sleep infinity` inside the busybox container, which was going into a crash loop. `sleep infinity` isn't supported by the sleep version in busybox, so replace it with a `while true; sleep` loop.
- Replace usage of dbus message emitting from gdbus to dbus-send. The test was failing on Ubuntu, which doesn't have gdbus installed. dbus-send is installed on COS and Ubuntu, so use it instead.
- Replace check of pod phase with the test util function `PodRunningReady`, which checks both the phase as well as the pod ready condition.
- Add some more verbose logging to ease future debugging.
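The phase-plus-readiness check described in the third bullet can be sketched with a minimal, self-contained stand-in. This is not the real `testutils.PodRunningReady` (which operates on `*v1.Pod` from `k8s.io/api/core/v1`); the `Pod` and `Condition` types here are simplified stand-ins for illustration only.

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types involved; the real helper
// operates on *v1.Pod from k8s.io/api/core/v1.
type Condition struct {
	Type   string
	Status string
}

type Pod struct {
	Name       string
	Phase      string
	Conditions []Condition
}

// podRunningReady sketches the check being switched to: the pod must
// be in phase Running AND carry the Ready condition set to True.
// Checking the phase alone misses pods that are Running but
// crash-looping (Running, Ready=False).
func podRunningReady(p Pod) (bool, error) {
	if p.Phase != "Running" {
		return false, fmt.Errorf("pod %q is in phase %q, not Running", p.Name, p.Phase)
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true, nil
		}
	}
	return false, fmt.Errorf("pod %q is Running but not Ready", p.Name)
}

func main() {
	// A crash-looping pod reports phase Running with Ready=False,
	// which a phase-only check would wrongly accept.
	crashLooping := Pod{
		Name:  "batch-0",
		Phase: "Running",
		Conditions: []Condition{{Type: "Ready", Status: "False"}},
	}
	ok, err := podRunningReady(crashLooping)
	fmt.Println(ok, err)
}
```

This is why the phase-only check let crash-looping pods slip through: their phase was Running even though the Ready condition never became True.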
Tested by manually running the node e2e test as follows, which passed:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: