Fixes to node shutdown e2e test #99805

bobbypage · 2021-03-04T20:23:10Z

What type of PR is this?

/kind bug
/kind testing

What this PR does / why we need it:

- The test was failing because it used `sleep infinity` inside the busybox container, which sent the container into a crash loop. `sleep infinity` isn't supported by the sleep version in busybox, so replace it with a `while true; do sleep 5; done` loop.
- Emit the dbus message with dbus-send instead of gdbus. The test was failing on Ubuntu, which doesn't have gdbus installed. dbus-send is installed on both COS and Ubuntu.
- Replace the pod phase check with the test util function `PodRunningReady`, which checks both the phase and the pod ready condition.
- Add more verbose logging to ease future debugging.

Tested by manually running the node e2e test as follows; the run passed:

IMAGE_CONFIG="${GOPATH}/src/k8s.io/test-infra/jobs/e2e_node/image-config-serial.yaml"

GO111MODULE=on go run test/e2e_node/runner/remote/run_remote.go \
  --cleanup=true \
  --logtostderr '--vmodule=*=4' \
  --ssh-env=gce \
  --results-dir="${RESULTS_DIR}" \
  --project="${PROJECT}" \
  --zone="${ZONE}"  \
  '--ginkgo-flags=--nodes=1 --focus="\[NodeAlphaFeature:GracefulNodeShutdown\]" --skip=""' \
  '--test_args=--feature-gates=GracefulNodeShutdown=true --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"' \
  --test-timeout=30m \
  --instance-name-prefix="${UUID}" \
  --delete-instances=false \
  --image-config-file="${IMAGE_CONFIG}" \
  --gubernator=false \
  --hosts= \
  2>&1 | tee -i "${TMPDIR}/build-log.txt"

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

k8s-ci-robot · 2021-03-04T20:23:13Z

@bobbypage: The label(s) kind/testing cannot be applied, because the repository doesn't have them.

bobbypage · 2021-03-04T20:23:43Z

/sig node

bobbypage · 2021-03-04T20:23:56Z

/cc @wzshiming @SergeyKanzhelev @mrunalp

SergeyKanzhelev · 2021-03-04T20:43:20Z

test/e2e_node/node_shutdown_linux_test.go

@@ -86,18 +88,18 @@ var _ = SIGDescribe("GracefulNodeShutdown [Serial] [NodeAlphaFeature:GracefulNod
 			framework.ExpectNoError(err)
 			framework.ExpectEqual(len(list.Items), len(pods), "the number of pods is not as expected")

+			ginkgo.By("Verifying batch pods are running")
 			for _, pod := range list.Items {


This is strange. CreateBatch already checks for readiness, doesn't it?

bobbypage · 2021-03-04 (edited)

It was a bit tricky; here's what I found and what I think the problem is:

- CreateBatch calls CreateSync for each pod.
- CreateSync waits for the pod to become ready and checks that the error returned by e2epod.WaitTimeoutForPodReadyInNamespace is nil.
- WaitTimeoutForPodReadyInNamespace polls podRunningAndReady.
- podRunningAndReady checks that the pod status phase is Running AND that the pod is ready.

In our case, the pods were in crash loop backoff and failing to start due to the sleep infinity issue.

The problem is that when a pod is not ready, podRunningAndReady returns (false, nil) (where nil is the error). Higher up the stack, WaitTimeoutForPodReadyInNamespace polls until the timeout, but only fails the test if the error is not nil.

Because the pods were in crash loop backoff, podRunningAndReady kept reporting that they were not ready, so WaitTimeoutForPodReadyInNamespace eventually hit the timeout and the test continued (since the error was nil).

With the added check here, we ensure that if the pods fail to reach the ready=true condition, the test fails early rather than continuing (and then failing due to a different issue).
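
The failure mode described above can be illustrated with a small shell analogy. This is only a sketch and not the e2e framework's Go code: `poll`, `never_ready`, and the exit-status convention (0 = ready, 1 = not ready yet, anything higher = hard error) are assumptions made for this example. It shows a condition that keeps reporting "not ready" without an error, and how an explicit check of the final state after the wait surfaces the problem at the step that went wrong.

```shell
#!/bin/sh
# poll ATTEMPTS CMD...: run CMD up to ATTEMPTS times (hypothetical helper).
# Returns 0 as soon as CMD succeeds, CMD's own status if it reports a hard
# error (status > 1), and 1 if every attempt said "not ready yet".
poll() {
	attempts="$1"
	shift
	i=0
	while [ "$i" -lt "$attempts" ]; do
		if "$@"; then
			return 0
		else
			rc=$?
		fi
		# A status above 1 is treated as a hard error: stop polling.
		if [ "$rc" -gt 1 ]; then
			return "$rc"
		fi
		i=$((i + 1))
	done
	return 1
}

# Stand-in for a pod stuck in crash loop backoff: never ready, never an error.
never_ready() { return 1; }

# A caller that ignores the result carries on as if the pod were ready.
poll 3 never_ready || true
echo "continuing after wait"

# An explicit check right after the wait reports the real problem.
if ! never_ready; then
	echo "pod is not running and ready" >&2
fi
```

The explicit check is redundant when the wait helper already fails on timeout, but it is cheap and makes the test's precondition visible in the test body itself.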

SergeyKanzhelev · 2021-03-04T20:46:54Z

/priority important-soon
/triage accepted
/lgtm
/kind failing-test

SergeyKanzhelev · 2021-03-04T20:47:17Z

/assign @dchen1107

bobbypage · 2021-03-04T21:11:01Z

/retest

mrunalp · 2021-03-04T21:36:50Z

test/e2e_node/node_shutdown_linux_test.go

@@ -188,10 +194,10 @@ func getGracePeriodOverrideTestPod(name string, node string, gracePeriod int64,
 					Args: []string{`
 _term() {
 	echo "Caught SIGTERM signal!"
-	sleep infinity
+	while true; do sleep 5; done


Another option would be to set it to a sufficiently high value.

bobbypage · 2021-03-04 (edited)

Ack, that should work as well. I went with a `while true; do sleep 5; done` loop since it more closely matches an "infinite" sleep and seems to be the common pattern in other tests (e.g. https://github.com/kubernetes/kubernetes/blob/bd2e557/test/e2e_node/eviction_test.go#L832).

Let me know if a high value is better. I think both work, and I have no strong opinion there.
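
For reference, the pattern under discussion can be tried outside the test. The following is a minimal sketch rather than the test's actual container script: the wrapper function `stubborn_entrypoint` and its interval argument are hypothetical additions so the loop can be exercised with a short interval. It installs a SIGTERM handler that never returns, using a `while true` loop in place of `sleep infinity`.

```shell
#!/bin/sh
# stubborn_entrypoint [INTERVAL]: emulate a container that catches SIGTERM
# and then refuses to exit. INTERVAL is the sleep length in seconds
# (hypothetical parameter; the diff above hard-codes 5).
stubborn_entrypoint() {
	interval="${1:-5}"

	_term() {
		echo "Caught SIGTERM signal!"
		# Loop instead of `sleep infinity`, which the PR reports is not
		# supported by the busybox sleep.
		while true; do sleep "$interval"; done
	}
	trap _term TERM

	# Main loop: the shell runs the handler once the current sleep returns.
	while true; do sleep "$interval"; done
}
```

Running `stubborn_entrypoint 1 &` and then sending `kill -TERM "$!"` should print the message while the process keeps running until it receives SIGKILL. A POSIX shell defers the trap handler until the foreground `sleep` returns, so a short interval lets the handler react sooner than a single very long sleep would.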

mrunalp · 2021-03-04T22:09:14Z

/approve

k8s-ci-robot · 2021-03-04T22:09:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobbypage, mrunalp

Needs approval from an approver in each of these files:

~~test/e2e_node/OWNERS~~ [mrunalp]

mrunalp · 2021-03-04T22:09:33Z

/test pull-kubernetes-e2e-kind

bobbypage · 2021-03-05T02:37:14Z

Latest run of the test is now green with these fixes!

https://testgrid.k8s.io/sig-node-kubelet#kubelet-serial-gce-e2e-graceful-node-shutdown

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Mar 4, 2021

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 4, 2021

k8s-ci-robot requested review from SergeyKanzhelev and wzshiming March 4, 2021 20:23

k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 4, 2021

k8s-ci-robot requested review from mtaufen and sjenning March 4, 2021 20:24

SergeyKanzhelev reviewed Mar 4, 2021

k8s-ci-robot assigned SergeyKanzhelev Mar 4, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2021

k8s-ci-robot assigned dchen1107 Mar 4, 2021

mrunalp reviewed Mar 4, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2021

This was referenced Mar 4, 2021

GracefulNodeShutdown skip ubuntu kubernetes/test-infra#21151

Closed

[WIP] GracefulNodeShutdown e2e skip on systems without gdbus #99764

Closed

SergeyKanzhelev added this to Done in SIG Node CI/Test Board Mar 4, 2021

k8s-ci-robot merged commit 4293a63 into kubernetes:master Mar 5, 2021

k8s-ci-robot added this to the v1.21 milestone Mar 5, 2021

bobbypage mentioned this pull request Mar 6, 2021

Add golang env setup to node e2e #99874

Merged
