Add GracefulNodeShutdown e2e test #98658
Conversation
@wzshiming: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(force-pushed from b175de6 to 309281e)
/assign @bobbypage
(force-pushed from 309281e to a501393)
(force-pushed from a501393 to 1ce42f8)
"k8s.io/kubernetes/test/e2e/framework" | ||
) | ||
|
||
var _ = framework.KubeDescribe("[NodeFeature:GracefulNodeShutdown]", func() { |
this might need some other tags, e.g. [Serial], since it modifies node status?
"k8s.io/kubernetes/test/e2e/framework" | ||
) | ||
|
||
var _ = framework.KubeDescribe("[NodeFeature:GracefulNodeShutdown]", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test naming convention appears to be:
Feature [Label] [Label] [Label]
I believe the labels are used to filter which test suite the test will run in,
e.g. https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/eviction_test.go#L69
let's follow the convention, so maybe something like: GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown]
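A minimal sketch of what that suggested naming could look like in this test (the label set shown is the reviewer's suggestion, not necessarily the final one):

```go
var _ = framework.KubeDescribe("GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown]", func() {
	f := framework.NewDefaultFramework("graceful-node-shutdown")
	// ... test body ...
})
```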
		initialConfig.ShutdownGracePeriodCriticalPods = metav1.Duration{Duration: 10 * time.Second}
	})

	ginkgo.It("Normal shutdown", func() {
I think ginkgo is supposed to follow BDD style, so maybe something like:
.It("should be able to gracefully shutdown pods with various grace periods")
var _ = framework.KubeDescribe("[NodeFeature:GracefulNodeShutdown]", func() {
	f := framework.NewDefaultFramework("graceful-node-shutdown")
	ginkgo.Context("Graceful node shutdown", func() {
Following BDD style: when gracefully shutting down
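Putting the two naming suggestions together, a sketch of how the Context/It pair might read (the wording is the reviewers' suggestion, not the merged code):

```go
ginkgo.Context("when gracefully shutting down", func() {
	ginkgo.It("should be able to gracefully shutdown pods with various grace periods", func() {
		// ... create pods, emit the shutdown signal, verify pod phases ...
	})
})
```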
			getGracePeriodOverrideTestPod("period-critical-120", nodeName, 120, true),
			getGracePeriodOverrideTestPod("period-critical-5", nodeName, 5, true),
		}
let's add some verbose logging of the main steps the test performs, to make debugging failures easier in the future. The pattern seems to be to use ginkgo.By,
e.g. https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/docker_test.go#L60
I added some suggestions of places to log some of the important steps in the test.
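For instance, a rough sketch of what that logging could look like around the main steps (the step wording and the surrounding calls are illustrative, not lines from the PR):

```go
ginkgo.By("creating batch pods")
f.PodClient().CreateBatch(pods)

ginkgo.By("emitting shutdown signal")
err := emitSignalPrepareForShutdown(true)
framework.ExpectNoError(err)

ginkgo.By("verifying that non-critical pods are shutdown")
// ... poll pod phases here ...
```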
	}
	if critical {
		pod.ObjectMeta.Annotations = map[string]string{
			kubelettypes.ConfigSourceAnnotationKey: kubelettypes.FileSource,
curious what this is / why is it needed?
kubernetes/test/e2e_node/critical_pod_test.go, lines 152 to 162 in 7655bad:

	if critical {
		pod.ObjectMeta.Namespace = kubeapi.NamespaceSystem
		pod.ObjectMeta.Annotations = map[string]string{
			kubelettypes.ConfigSourceAnnotationKey: kubelettypes.FileSource,
		}
		pod.Spec.PriorityClassName = scheduling.SystemNodeCritical
		framework.ExpectEqual(kubelettypes.IsCriticalPod(pod), true, "pod should be a critical pod")
	} else {
		framework.ExpectEqual(kubelettypes.IsCriticalPod(pod), false, "pod should not be a critical pod")
	}
kubelettypes.IsCriticalPod uses this annotation key to judge whether the pod is critical.
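Roughly, the check treats a pod as critical when it is a static pod (which is what the FileSource annotation marks), a mirror pod, or carries system-critical priority. A simplified paraphrase for orientation, not the verbatim upstream code (the mirror-pod branch is omitted here):

```go
// Simplified paraphrase of kubelettypes.IsCriticalPod, not the exact source.
func isCriticalPod(pod *v1.Pod) bool {
	// Static pods, marked by the file config-source annotation, are critical.
	if src, ok := pod.Annotations[kubelettypes.ConfigSourceAnnotationKey]; ok && src == kubelettypes.FileSource {
		return true
	}
	// Pods at or above system-critical priority are critical as well.
	if pod.Spec.Priority != nil && *pod.Spec.Priority >= scheduling.SystemCriticalPriority {
		return true
	}
	return false
}
```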
	sleep 300
}
trap _term SIGTERM
sleep 300
the pod will exit if the test somehow gets stuck and takes a while (> 300s). maybe bump up from 300s to say 10 mins or something like that? not sure what's a reasonable time for the test to execute :)
	return err
}

func getNodeReadyStatus(f *framework.Framework) bool {
you should be able to use getNode, which already does this:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/e2e_node_suite_test.go#L314
I think all you need is something like this:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/e2e_node_suite_test.go#L273-L280
(or maybe above, when you check the node status, you can use waitForNodeReady directly and remove this function?)
getNode can't be used here because it is not compatible with framework.Framework. waitForNodeReady isn't enough either: this case requires waiting not only for the node to become ready, but also for it to become not-ready.
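A minimal sketch of what such a helper could look like, assuming the single-node e2e_node environment (the PR's actual implementation may differ):

```go
// getNodeReadyStatus returns whether the (single) test node currently
// reports a Ready condition of True.
func getNodeReadyStatus(f *framework.Framework) bool {
	nodeList, err := f.ClientSet.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	framework.ExpectNoError(err)
	// Assuming a single-node cluster, as in the e2e_node suite.
	framework.ExpectEqual(len(nodeList.Items), 1)
	for _, cond := range nodeList.Items[0].Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}
```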
}

func emitSignalPrepareForShutdown(b bool) error {
	cmd := "gdbus emit --system --object-path /org/freedesktop/login1 --signal org.freedesktop.login1.Manager.PrepareForShutdown " + strconv.FormatBool(b)
let's leave a comment describing what this does:
// Emits a fake PrepareForShutdown dbus message on system dbus. Will cause kubelet to react to an active shutdown event.
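With that comment in place, the whole helper might read like this (the exec call below is an illustrative assumption; the PR may run the command through a different helper):

```go
// emitSignalPrepareForShutdown emits a fake PrepareForShutdown dbus message
// on the system dbus. This causes kubelet to react to an active shutdown event.
func emitSignalPrepareForShutdown(b bool) error {
	cmd := "gdbus emit --system --object-path /org/freedesktop/login1 --signal org.freedesktop.login1.Manager.PrepareForShutdown " + strconv.FormatBool(b)
	return exec.Command("sh", "-c", cmd).Run()
}
```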
"node did not become shutdown as expected", | ||
) | ||
} else { | ||
time.Sleep(time.Second) |
let's get rid of the sleep and rely on gomega.Eventually
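Something along these lines, perhaps (the condition, timeout, and poll interval are illustrative assumptions):

```go
// Poll instead of sleeping a fixed second; succeeds as soon as the node
// reports not-ready, and fails only after the timeout.
gomega.Eventually(func() error {
	if getNodeReadyStatus(f) {
		return fmt.Errorf("node is still ready, expected it to become not-ready during shutdown")
	}
	return nil
}, 30*time.Second, time.Second).Should(gomega.Succeed())
```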
Thanks so much for starting on the tests! I left some comments above. Another thing that would be great to add is to also test somehow that the pod itself actually received the proper grace period when it was shut down. I'm not sure the best way to test that; maybe something like the pod prints a timestamp every second upon getting SIGTERM, and then we take the delta between the first and last timestamp? Maybe there's a better way to check that... We can do that as a followup; I think the most important thing is to get a basic test in place first, and then we can evolve it :)
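One possible shape for that follow-up, purely as a sketch (the probe script and helper name are assumptions, not code from the PR): the container logs a Unix timestamp every second after receiving SIGTERM, so the grace period it actually received is roughly the delta between the first and last logged timestamps.

```go
// gracePeriodProbeCommand is a hypothetical pod command: after SIGTERM it
// prints a timestamp every second until killed, so the test can estimate the
// grace period actually granted by diffing the first and last log lines.
func gracePeriodProbeCommand() []string {
	script := `
_term() {
  while true; do date +%s; sleep 1; done
}
trap _term TERM
# Background the sleep and wait, so the trap fires immediately on SIGTERM
# instead of after the sleep completes.
sleep 600 & wait
`
	return []string{"sh", "-c", script}
}
```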
(force-pushed from e9afe1d to 4cae803)
Thanks for the quick updates. Were you able to test this locally? I ran the test via e2e remote:
and it failed with a timeout:
/cc @oomichi
		pod.Status.Phase == v1.PodRunning,
		true,
		"pod is not ready",
	)
nit: framework.ExpectEqual(pod.Status.Phase, v1.PodRunning, "pod is not ready")
		pod.Status.Phase == v1.PodRunning,
		true,
		"critical pod should not be shutdown",
	)
nit: framework.ExpectEqual(pod.Status.Phase, v1.PodRunning, "critical pod should not be shutdown")
	for _, pod := range list.Items {
		if kubelettypes.IsCriticalPod(&pod) {
			if pod.Status.Phase != v1.PodRunning {
				return fmt.Errorf("critical pod should not be shutdown")
It is better to output the actual pod.Status.Phase in the error message for debugging.
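For example, something like this (the exact wording is just a suggestion):

```go
return fmt.Errorf("critical pod %q should not be shutdown, phase: %s", pod.Name, pod.Status.Phase)
```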
			}
		} else {
			if pod.Status.Phase != v1.PodFailed || pod.Status.Reason != "Shutdown" {
				return fmt.Errorf("pod should be shutdown")
ditto: pod.Status.Phase
	for _, pod := range list.Items {
		if pod.Status.Phase != v1.PodFailed || pod.Status.Reason != "Shutdown" {
			return fmt.Errorf("pod should be shutdown")
ditto: pod.Status.Phase
(force-pushed from 59237c8 to e367d2f)
@oomichi Thanks, updated
/retest
/lgtm
/triage accepted
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: oomichi, wzshiming
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
1 similar comment
/retest
/retest
/retest
Review the full test history for this PR. Silence the bot with an …
You'd need to update the test definitions to include this test in the regular runs: ci-kubernetes-node-kubelet-alpha excludes [Serial] tests, and ci-kubernetes-node-kubelet-serial doesn't include Alpha features. We may start with a new tab specifically for this test. Or remove the …
What type of PR is this?
/kind testing
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: