
DRA: e2e: test non-graceful node shutdown #120965

Merged

Conversation

@bart0sh bart0sh commented Oct 2, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Added a DRA e2e test that covers non-graceful node shutdown.

Which issue(s) this PR fixes:

Fixes #120421

Special notes for your reviewer:

see https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown for more details

This test takes 72 seconds to run:

Ran 1 of 7394 Specs in 72.083 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 7393 Skipped

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 2, 2023
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Oct 2, 2023
@bart0sh
Contributor Author

bart0sh commented Oct 2, 2023

/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 2, 2023
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Oct 2, 2023
@bart0sh
Contributor Author

bart0sh commented Oct 2, 2023

/assign @pohly

@bart0sh bart0sh force-pushed the PR122-DRA-unexpected-node-shutdown branch from 52e7b41 to 8188e66 Compare October 2, 2023 12:18
@bart0sh
Contributor Author

bart0sh commented Oct 2, 2023

/retest

@bart0sh
Contributor Author

bart0sh commented Oct 4, 2023

/retest

@bart0sh
Contributor Author

bart0sh commented Oct 5, 2023

/retest

gomega.Expect(stderr).To(gomega.BeEmpty())
framework.ExpectNoError(err)

ginkgo.By("remove out-of-service taint from the node " + nodeName)
Contributor

This is also redundant.

Contributor Author

why?

Contributor Author

same here. will remove. Thanks for the suggestion!

ginkgo.By("start node " + nodeName)
_, stderr, err = framework.RunCmd("docker", "start", nodeName)
gomega.Expect(stderr).To(gomega.BeEmpty())
framework.ExpectNoError(err)
Contributor

You can move all of this into the DeferCleanup above. A cleanup callback can fail just like the It callback, and it will mark the test as failed.

Contributor Author

This would increase the duration of the test to 480 seconds. I've mentioned this in the PR description.

Contributor Author

Forget about it. It seems to work correctly without it. I'll remove this.

Contributor Author

Without this, the test takes much longer. I've commented on it in the PR description. Now I've also added a comment to the test code.

Contributor

I wasn't suggesting to remove framework.RunCmd("docker", "start", nodeName), just to put it into a ginkgo.DeferCleanup. Why is that slower? It shouldn't be.

Contributor Author

ginkgo.DeferCleanup(framework.RunCmd, "docker", "start", nodeName) has been in the code from the very beginning.
However, without explicitly calling docker start at the end of the test, the test takes much longer.

Comment on lines 825 to 829
gomega.Eventually(ctx, func(ctx context.Context) (*resourcev1alpha2.ResourceClaim, error) {
return b.f.ClientSet.ResourceV1alpha2().ResourceClaims(b.f.Namespace.Name).Get(ctx, claim.Name, metav1.GetOptions{})
}).WithTimeout(f.Timeouts.PodDelete).Should(gomega.HaveField("Status.Allocation", (*resourcev1alpha2.AllocationResult)(nil)))
Contributor

You can wrap b.f.ClientSet.ResourceV1alpha2().ResourceClaims(b.f.Namespace.Name).Get with the helper in e2e/framework/get.go to get support for handling transient API server errors, something like:

Suggested change
gomega.Eventually(ctx, func(ctx context.Context) (*resourcev1alpha2.ResourceClaim, error) {
return b.f.ClientSet.ResourceV1alpha2().ResourceClaims(b.f.Namespace.Name).Get(ctx, claim.Name, metav1.GetOptions{})
}).WithTimeout(f.Timeouts.PodDelete).Should(gomega.HaveField("Status.Allocation", (*resourcev1alpha2.AllocationResult)(nil)))
gomega.Eventually(ctx, framework.GetObject(b.f.ClientSet.ResourceV1alpha2().ResourceClaims(b.f.Namespace.Name), claim.Name, metav1.GetOptions{})).WithTimeout(f.Timeouts.PodDelete).Should(gomega.HaveField("Status.Allocation", gomega.BeNil()))

I'm using gomega.BeNil because it expresses the intent a bit more concisely.

Contributor Author

After applying this, I've got this error:

type "k8s.io/client-go/kubernetes/typed/resource/v1alpha2".ResourceClaimInterface of b.f.ClientSet.ResourceV1alpha2().ResourceClaims(b.f.Namespace.Name) does not match framework.APIGetFunc[T] (cannot infer T) [CannotInferTypeArgs]
func (kubernetes.Interface).ResourceV1alpha2() v1alpha2.ResourceV1alpha2Interface

Any hints of fixing it?

Contributor

Add .Get to the first parameter of GetObject:

gomega.Eventually(ctx, framework.GetObject(b.f.ClientSet.ResourceV1alpha2().ResourceClaims(b.f.Namespace.Name).Get, claim.Name, metav1.GetOptions{})).WithTimeout(f.Timeouts.PodDelete).Should(gomega.HaveField("Status.Allocation", gomega.BeNil()))

Contributor Author

done, thanks for the suggestion!

SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to PRs Waiting on Author Oct 5, 2023
// This test covers aspects of non-graceful node shutdown handling by the DRA controller.
// More details about this can be found in the KEP:
// https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
var _ = ginkgo.Describe("[sig-node] [Serial] [Disruptive] [Slow] [Feature:DynamicResourceAllocation] DRA: handling non graceful node shutdown", func() {
Contributor

Does this have to be a new Describe? Can't it be a new It under some existing context?

Contributor Author

@bart0sh bart0sh Oct 5, 2023

No, it doesn't. I separated it because I wanted to show that this is a special test case: it's disruptive, it shuts down the test node, and it works only with kind. I think it deserves to be visually separated from the rest of the existing test cases.

Contributor

Not sure whether I agree with this rationale. It has a few downsides (some duplicated setup source code lines, the full test name).

What is the full test name?

Contributor

Let me be more explicit "DRA: handling non graceful node shutdown cluster must deallocate on non graceful node shutdown" just doesn't look right.

Contributor Author

Moved under an existing context. PTAL.

// Prevent builder tearDown from failing while waiting for unprepared resources
delete(b.driver.Nodes, nodeName)
ginkgo.By("stop node " + nodeName + " non gracefully")
ginkgo.DeferCleanup(framework.RunCmd, "docker", "start", nodeName)
Contributor

There's a dependency here on running on a kind cluster, which isn't expressed via a test tag. At least call it out in the test comment, as we don't have a feature tag defined for it.

Contributor Author

done

@bart0sh bart0sh force-pushed the PR122-DRA-unexpected-node-shutdown branch from a5e9e62 to 11317a4 Compare October 5, 2023 20:53
@bart0sh
Contributor Author

bart0sh commented Oct 6, 2023

/retest

@bart0sh bart0sh force-pushed the PR122-DRA-unexpected-node-shutdown branch 2 times, most recently from 38c3fa6 to f7d5193 Compare October 10, 2023 13:00
@bart0sh bart0sh moved this from PRs Waiting on Author to PRs - Needs Reviewer in SIG Node CI/Test Board Oct 10, 2023
@bart0sh bart0sh force-pushed the PR122-DRA-unexpected-node-shutdown branch from f7d5193 to debc814 Compare October 11, 2023 09:07
@bart0sh
Contributor Author

bart0sh commented Oct 11, 2023

/retest

test/e2e/dra/dra.go Outdated
@bart0sh bart0sh force-pushed the PR122-DRA-unexpected-node-shutdown branch from debc814 to 4b65e42 Compare October 19, 2023 07:28
@k8s-ci-robot k8s-ci-robot added the area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework label Oct 19, 2023
@bart0sh
Contributor Author

bart0sh commented Oct 19, 2023

After #121139 is merged, this PR is unblocked.
@pohly please review, thanks.

// This test covers aspects of non-graceful node shutdown handling by the DRA controller.
// More details about this can be found in the KEP:
// https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
ginkgo.Context("cluster", func() {
Contributor

The "cluster" context itself isn't about non-graceful shutdown. This comment belongs in front of the ginkgo.It with the test.

Regarding context: can you put the It under

ginkgo.Context("multiple nodes", func() {
	nodes := NewNodes(f, 2, 8)
	ginkgo.Context("with network-attached resources", func() {
		driver := NewDriver(f, nodes, networkResources)
		b := newBuilder(f, driver)

		ginkgo.It(...

There's no need for a new context. The advantage is that tests that use a similar setup are also grouped together and there's less code duplication.

Contributor Author

done, please review again

@bart0sh bart0sh force-pushed the PR122-DRA-unexpected-node-shutdown branch from 4b65e42 to fb9f2f5 Compare October 19, 2023 19:09
Contributor

@pohly pohly left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 20, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0ddea5aa7113cf66dbed026c0508b9726047ee88

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7b9d244 into kubernetes:master Oct 20, 2023
18 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to Done Oct 20, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Oct 20, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 20, 2023
Successfully merging this pull request may close these issues.

DRA: interaction with unexpected node shutdown KEP