Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to node shutdown e2e test #99805

Merged
merged 1 commit into from Mar 5, 2021

Conversation

bobbypage
Copy link
Member

What type of PR is this?

/kind bug
/kind testing

What this PR does / why we need it:

  • Test was failing due to using sleep infinity inside the busybox
    container which was going into a crash loop. sleep infinity isn't
    supported by the sleep version in busybox, so replace it with a while true; sleep loop.

  • Replace usage of dbus message emitting from gdbus to dbus-send. The
    test was failing on ubuntu which doesn't have gdbus installed.
    dbus-send is installed on COS and Ubuntu, so use it instead.

  • Replace check of pod phase with the test util function PodRunningReady
    which checks both phase as well as pod ready condition.

  • Add some more verbose logging to ease future debugging.

Tested by manually running the node e2e test as follows which passed:

IMAGE_CONFIG="${GOPATH}/src/k8s.io/test-infra/jobs/e2e_node/image-config-serial.yaml"

GO111MODULE=on go run test/e2e_node/runner/remote/run_remote.go \
  --cleanup=true \
  --logtostderr '--vmodule=*=4' \
  --ssh-env=gce \
  --results-dir="${RESULTS_DIR}" \
  --project="${PROJECT}" \
  --zone="${ZONE}"  \
  '--ginkgo-flags=--nodes=1 --focus="\[NodeAlphaFeature:GracefulNodeShutdown\]" --skip=""' \
  '--test_args=--feature-gates=GracefulNodeShutdown=true --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"' \
  --test-timeout=30m \
  --instance-name-prefix="${UUID}" \
  --delete-instances=false \
  --image-config-file="${IMAGE_CONFIG}" \
  --gubernator=false \
  --hosts= \
  2>&1 | tee -i "${TMPDIR}/build-log.txt"

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

- Test was failing due to using `sleep infinity` inside the busybox
  container which was going into a crash loop. `sleep infinity` isn't
  supported by the sleep version in busybox, so replace it with a `while
  true; sleep loop`.

- Replace usage of dbus message emitting from gdbus to dbus-send. The
  test was failing on ubuntu which doesn't have gdbus installed.
  dbus-send is installed on COS and Ubuntu, so use it instead.

- Replace check of pod phase with the test util function `PodRunningReady`
  which checks both phase as well as pod ready condition.

- Add some more verbose logging to ease future debugging.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Mar 4, 2021
@k8s-ci-robot
Copy link
Contributor

@bobbypage: The label(s) kind/testing cannot be applied, because the repository doesn't have them.

In response to this:

What type of PR is this?

/kind bug
/kind testing

What this PR does / why we need it:

  • Test was failing due to using sleep infinity inside the busybox
    container which was going into a crash loop. sleep infinity isn't
    supported by the sleep version in busybox, so replace it with a while true; sleep loop.

  • Replace usage of dbus message emitting from gdbus to dbus-send. The
    test was failing on ubuntu which doesn't have gdbus installed.
    dbus-send is installed on COS and Ubuntu, so use it instead.

  • Replace check of pod phase with the test util function PodRunningReady
    which checks both phase as well as pod ready condition.

  • Add some more verbose logging to ease future debugging.

Tested by manually running the node e2e test as follows which passed:

IMAGE_CONFIG="${GOPATH}/src/k8s.io/test-infra/jobs/e2e_node/image-config-serial.yaml"

GO111MODULE=on go run test/e2e_node/runner/remote/run_remote.go \
 --cleanup=true \
 --logtostderr '--vmodule=*=4' \
 --ssh-env=gce \
 --results-dir="${RESULTS_DIR}" \
 --project="${PROJECT}" \
 --zone="${ZONE}"  \
 '--ginkgo-flags=--nodes=1 --focus="\[NodeAlphaFeature:GracefulNodeShutdown\]" --skip=""' \
 '--test_args=--feature-gates=GracefulNodeShutdown=true --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"' \
 --test-timeout=30m \
 --instance-name-prefix="${UUID}" \
 --delete-instances=false \
 --image-config-file="${IMAGE_CONFIG}" \
 --gubernator=false \
 --hosts= \
 2>&1 | tee -i "${TMPDIR}/build-log.txt"

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 4, 2021
@bobbypage
Copy link
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 4, 2021
@bobbypage
Copy link
Member Author

bobbypage commented Mar 4, 2021

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 4, 2021
@@ -86,18 +88,18 @@ var _ = SIGDescribe("GracefulNodeShutdown [Serial] [NodeAlphaFeature:GracefulNod
framework.ExpectNoError(err)
framework.ExpectEqual(len(list.Items), len(pods), "the number of pods is not as expected")

ginkgo.By("Verifying batch pods are running")
for _, pod := range list.Items {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is strange. CreateBatch is already checking for readiness, is it?

Copy link
Member Author

@bobbypage bobbypage Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a bit tricky, here's what I found and what I think the problem is:

In our case, the pods were in crash loop backoff and failing to start due to the sleep infinity issue.

The problem is that pod is not ready podRunningAndReady returns (false, nil) (where nil is the error). Higher up the stack WaitTimeoutForPodReadyInNamespace will poll until the timeout, but only fail the test if the error is not nil.

In the current case, the pods were in crash loop backoff and were not ready. So, basically podRunningAndReady kept checking that the pods got into ready state (which they were not) so eventually the WaitTimeoutForPodReadyInNamespace hit the timeout, and ended up continuing (since the error was nil).

With the added check here, we'll ensure that if the pods fail getting into ready=true condition, the test will fail early rather the continuing (and thus failing due to a different issue)

@SergeyKanzhelev
Copy link
Member

/priority important-soon
/triage accepted
/lgtm
/kind failing-test

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 4, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2021
@SergeyKanzhelev
Copy link
Member

/assign @dchen1107

@bobbypage
Copy link
Member Author

/retest

@@ -188,10 +194,10 @@ func getGracePeriodOverrideTestPod(name string, node string, gracePeriod int64,
Args: []string{`
_term() {
echo "Caught SIGTERM signal!"
sleep infinity
while true; do sleep 5; done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to set it to sufficiently high value.

Copy link
Member Author

@bobbypage bobbypage Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, that should work as well. I went with while true; sleep loop since it more closely matches "infinity" sleep and seems to be the common pattern in other tests (e.g. https://github.com/kubernetes/kubernetes/blob/bd2e557/test/e2e_node/eviction_test.go#L832).

Let me know if high value is better, I think both work, no strong opinion there.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 4, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobbypage, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2021
@mrunalp
Copy link
Contributor

mrunalp commented Mar 4, 2021

/test pull-kubernetes-e2e-kind

@bobbypage
Copy link
Member Author

Latest run of the test is now green with these fixes!

image

https://testgrid.k8s.io/sig-node-kubelet#kubelet-serial-gce-e2e-graceful-node-shutdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants