Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GracefulNodeShutdown e2e test #98658

Merged

Conversation

wzshiming
Copy link
Member

What type of PR is this?
/kind testing

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot
Copy link
Contributor

@wzshiming: The label(s) kind/testing cannot be applied, because the repository doesn't have them

In response to this:

What type of PR is this?
/kind testing

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2021
@wzshiming wzshiming force-pushed the add-e2e-for-graceful-node-shutdown branch 2 times, most recently from b175de6 to 309281e Compare February 1, 2021 10:25
@wzshiming
Copy link
Member Author

/assign @bobbypage

@wzshiming wzshiming force-pushed the add-e2e-for-graceful-node-shutdown branch from 309281e to a501393 Compare February 1, 2021 10:39
@wzshiming wzshiming changed the title [WIP] Add GracefulNodeShutdown e2e test Add GracefulNodeShutdown e2e test Feb 1, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 1, 2021
@wzshiming wzshiming force-pushed the add-e2e-for-graceful-node-shutdown branch from a501393 to 1ce42f8 Compare February 1, 2021 11:34
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs - Review in progress in SIG Node CI/Test Board Feb 1, 2021
"k8s.io/kubernetes/test/e2e/framework"
)

var _ = framework.KubeDescribe("[NodeFeature:GracefulNodeShutdown]", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might need some other tags, e.g. [Serial] since it modifies node status?

xref: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests

"k8s.io/kubernetes/test/e2e/framework"
)

var _ = framework.KubeDescribe("[NodeFeature:GracefulNodeShutdown]", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test naming convention appears to be

Feature [Label] [Label] [Label]

I believe the labels are used to filter which test suite the test will run in.

e.g. https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/eviction_test.go#L69

let's follow the convention: so maybe something like GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown

initialConfig.ShutdownGracePeriodCriticalPods = metav1.Duration{Duration: 10 * time.Second}
})

ginkgo.It("Normal shutdown", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ginkgo is suppose to follow BDD style, so maybe something like:

.It("should be able to gracefully shutdown pods with various grace periods")


var _ = framework.KubeDescribe("[NodeFeature:GracefulNodeShutdown]", func() {
f := framework.NewDefaultFramework("graceful-node-shutdown")
ginkgo.Context("Graceful node shutdown", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following BDD style: when gracefully shutting down

getGracePeriodOverrideTestPod("period-critical-120", nodeName, 120, true),
getGracePeriodOverrideTestPod("period-critical-5", nodeName, 5, true),
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add some verbose logging of the various main steps the test does to make debugging failure easier in future. The pattern seems to be to use ginkgo.By, e.g. https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/docker_test.go#L60

I added some suggestions of places to log some of the important steps in the test.

}
if critical {
pod.ObjectMeta.Annotations = map[string]string{
kubelettypes.ConfigSourceAnnotationKey: kubelettypes.FileSource,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what this is / why is it needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if critical {
pod.ObjectMeta.Namespace = kubeapi.NamespaceSystem
pod.ObjectMeta.Annotations = map[string]string{
kubelettypes.ConfigSourceAnnotationKey: kubelettypes.FileSource,
}
pod.Spec.PriorityClassName = scheduling.SystemNodeCritical
framework.ExpectEqual(kubelettypes.IsCriticalPod(pod), true, "pod should be a critical pod")
} else {
framework.ExpectEqual(kubelettypes.IsCriticalPod(pod), false, "pod should not be a critical pod")
}

kubelettypes.IsCriticalPod judged based on this key to determine whether this pod is critical

sleep 300
}
trap _term SIGTERM
sleep 300
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pod will exist if test somehow gets stuck and takes a while (> than 300s). maybe bump up from 300s to say 10mins or something like that? not sure what's reasonable time for test to execute :)

return err
}

func getNodeReadyStatus(f *framework.Framework) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to use getNode which already does this:

https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/e2e_node_suite_test.go#L314

I think all you need is something like this:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/e2e_node_suite_test.go#L273-L280

(or maybe above when you check the node status you can use waitForNodeReady directly and remove this function?

Copy link
Member Author

@wzshiming wzshiming Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getNode may not be used because it is not compatible with framework.Framework

waitForNodeReady is also not needed. this case not only requires waiting for node ready but also waiting for node not ready.

}

func emitSignalPrepareForShutdown(b bool) error {
cmd := "gdbus emit --system --object-path /org/freedesktop/login1 --signal org.freedesktop.login1.Manager.PrepareForShutdown " + strconv.FormatBool(b)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave a comment describing what this does:

// Emits a fake PrepareForShutdown dbus message on system dbus. Will cause kubelet to react to an active shutdown event.

"node did not become shutdown as expected",
)
} else {
time.Sleep(time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's get rid of the sleep and rely on gomega.Eventually

@bobbypage
Copy link
Member

Thanks so much for starting on the tests! I left some comments above.

Another thing that would be great to add is if we could also test somehow that the pod itself actually received the proper grace period when it was shutdown. I'm not sure the best way to test that, maybe something like the pod prints a timestamp every second upon getting SIGTERM and then getting the delta between first and last timestamp? Maybe there's a better way to check that... We can do that as followup, I think most important is to get basic test in place first and then we can evolve it :)

@wzshiming wzshiming force-pushed the add-e2e-for-graceful-node-shutdown branch 2 times, most recently from e9afe1d to 4cae803 Compare February 3, 2021 03:44
@bobbypage
Copy link
Member

Thanks for quick updates. Were you able to test this locally? I ran the test via e2e remote:

make test-e2e-node FOCUS="GracefulNodeShutdown" REMOTE=true DELETE_INSTANCES=false PARALLELISM=1 SKIP="" RUNTIME=remote CLEANUP=false

and it failed with a timeout:

• Failure [23.170 seconds]
[k8s.io] GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown]
/home/porterdavid_google_com/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:635
  when gracefully shutting down
  _output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_shutdown_linux_test.go:42
    should be able to handle a cancelled shutdown [It]
    _output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_shutdown_linux_test.go:124

    Timed out after 3.000s.
    Expected
        <*errors.errorString | 0xc001744bf0>: {
            s: "node did not become shutdown as expected",
        }
    to be nil

    _output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_shutdown_linux_test.go:134

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Feb 14, 2021
Copy link
Member

@oomichi oomichi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @oomichi

pod.Status.Phase == v1.PodRunning,
true,
"pod is not ready",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: framework.ExpectEqual(pod.Status.Phase, v1.PodRunning, "pod is not ready")

pod.Status.Phase == v1.PodRunning,
true,
"critical pod should not be shutdown",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: framework.ExpectEqual(pod.Status.Phase, v1.PodRunning, "critical pod should not be shutdown")

for _, pod := range list.Items {
if kubelettypes.IsCriticalPod(&pod) {
if pod.Status.Phase != v1.PodRunning {
return fmt.Errorf("critical pod should not be shutdown")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to output actual pod.Status.Phase in error message for debugging.

}
} else {
if pod.Status.Phase != v1.PodFailed || pod.Status.Reason != "Shutdown" {
return fmt.Errorf("pod should be shutdown")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: pod.Status.Phase


for _, pod := range list.Items {
if pod.Status.Phase != v1.PodFailed || pod.Status.Reason != "Shutdown" {
return fmt.Errorf("pod should be shutdown")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: pod.Status.Phase

@wzshiming wzshiming force-pushed the add-e2e-for-graceful-node-shutdown branch from 59237c8 to e367d2f Compare February 18, 2021 06:47
@wzshiming
Copy link
Member Author

@oomichi Thanks, updated

@wzshiming
Copy link
Member Author

/retest

@bobbypage
Copy link
Member

/lgtm
/assign @mrunalp

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2021
@SergeyKanzhelev
Copy link
Member

/triage accepted
/priority important-soon
(since the KEP is marked to go to beta in 1.21)

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 20, 2021
@SergeyKanzhelev
Copy link
Member

KEP: kubernetes/enhancements#2000

@wzshiming
Copy link
Member Author

/retest

@oomichi
Copy link
Member

oomichi commented Feb 21, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: oomichi, wzshiming

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2021
@pacoxu
Copy link
Member

pacoxu commented Feb 21, 2021

/retest

1 similar comment
@wzshiming
Copy link
Member Author

/retest

@pacoxu
Copy link
Member

pacoxu commented Feb 21, 2021

/retest
for flake 😓

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 89b9cf7 into kubernetes:master Feb 21, 2021
SIG Node CI/Test Board automation moved this from PRs - Review in progress to Done Feb 21, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 21, 2021
@SergeyKanzhelev
Copy link
Member

You'd need to update the tests definition to include this test into regular runs.

ci-kubernetes-node-kubelet-alpha repels Serial, ci-kubernetes-node-kubelet-serial doesn't include Alpha features.

We may start with the new tab specifically for this test. Or remove the Alpha repellant from Serial tab - it doesn't seem there are more Alpha features to exclude.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

8 participants