e2e/storage: speed up kubelet commands #124028

Open
wants to merge 2 commits into
base: master

Conversation

huww98
Contributor

@huww98 huww98 commented Mar 22, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Speed up stopping the kubelet by not waiting for the Node to become NotReady; systemctl ensures the kubelet process has stopped before it returns. This should save 40s per case.

Since the stop command no longer waits for NotReady, the start command needs to wait for the next heartbeat to ensure we are checking the status reported by the new process.

Implement restart as stop followed by start, so that the heartbeat time can be read while the kubelet is down. The 30s sleep in KubeletCommand() is no longer needed; it is moved to the callers, since they still need it to ensure the volume does not disappear.

Dropped support for non-systemd systems.

Which issue(s) this PR fixes:

Special notes for your reviewer:

I think the non-systemd setup never worked before. Does a non-systemd system have Main PID in its output?

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 22, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

Hi @huww98. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 22, 2024
@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 22, 2024
@huww98 huww98 force-pushed the kubelet-speedup branch 2 times, most recently from f29097e to eefa97b Compare March 22, 2024 15:08
@bart0sh bart0sh added this to Triage in SIG Node CI/Test Board Mar 24, 2024
@carlory
Member

carlory commented Mar 25, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 25, 2024
@huww98 huww98 force-pushed the kubelet-speedup branch 2 times, most recently from b1e6e0b to deb699a Compare March 30, 2024 14:54
@bart0sh bart0sh moved this from Triage to Archive-it in SIG Node CI/Test Board Mar 31, 2024
@huww98
Contributor Author

huww98 commented May 13, 2024

@pohly PTAL again, thanks.

test/e2e/framework/node/wait.go (outdated, resolved)
test/e2e/framework/node/wait.go (outdated, resolved)
@@ -104,6 +96,9 @@ func TestKubeletRestartsAndRestoresMount(ctx context.Context, c clientset.Interf
ginkgo.By("Restarting kubelet")
KubeletCommand(ctx, KRestart, c, clientPod)

ginkgo.By("Wait 20s for the volume to become stable")
time.Sleep(20 * time.Second)
Contributor

Hmm. Magic timeouts have been a great (not!) source of test flakes. Can you elaborate on why 20 seconds were chosen? How certain is it that this value works reliably? How does it affect the runtime of tests?

Contributor Author

This sleep was moved from KubeletCommand(), which is called just above it. We are asserting that something does not happen, so we need a timeout.

I chose 20 seconds because of the original code in KubeletCommand():

framework.Logf("Noticed that kubelet PID is changed. Waiting for 30 Seconds for Kubelet to come back")
time.Sleep(30 * time.Second)

The sleep now starts after the node becomes ready, so (20s + WaitForNodeToBeReady) should be roughly equal to the original 30s.

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label May 16, 2024
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 16, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huww98
Once this PR has been reviewed and has the lgtm label, please assign pohly for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@huww98
Contributor Author

huww98 commented May 16, 2024

/remove-sig node
/remove-sig cloud-provider
I found that the impact of changing WaitConditionToBe() to panic on failure is too wide, so I reverted that change for this PR.

@k8s-ci-robot k8s-ci-robot removed sig/node Categorizes an issue or PR as relevant to SIG Node. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels May 16, 2024
@huww98 huww98 requested a review from pohly May 16, 2024 09:14
@huww98
Contributor Author

huww98 commented May 16, 2024

CreateVolume failed to create single zonal disk "pvc-cd42c4cd-45d0-480c-91bc-3357014114f6": failed to insert zonal disk: unknown Insert disk operation error: operation operation-1715853438060-6188f42a98737-9493b122-2ab6b2c5 failed (INTERNAL_ERROR): Internal error. Please try again or contact Google Support. (Code: 'Internal error')

OK, so let me try again

/retest

@@ -160,32 +161,38 @@ func WaitForNodeSchedulable(ctx context.Context, c clientset.Interface, name str
return false
}

func WaitForNodeHeartbeatAfter(ctx context.Context, c clientset.Interface, name string, after metav1.Time, timeout time.Duration) {
Contributor

Can you add documentation? Please don't just repeat the function name. Paraphrase it and explain the parameters.

"nodeName" instead of "name" would be more descriptive. "after" is also very generic.

Contributor Author

I think name is enough, given the function name and the package name as context. name is also consistent with the other functions in this package.

I've added docs to explain after. I think this should be enough. The type metav1.Time also gives the reader some context.

Contributor

@pohly pohly left a comment

Looks good to me overall, but I am not familiar with KubeletCommand and thus cannot tell whether the proposed changes are okay.

Perhaps someone who has used this and/or written it can chime in?

@huww98
Contributor Author

huww98 commented May 21, 2024

/cc @copejon
As the author of #44923, can you take a look?

@k8s-ci-robot
Contributor

@huww98: GitHub didn't allow me to request PR reviews from the following users: copejon.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @copejon
As the author of #44923 , can you take a look?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@huww98 huww98 changed the title from "e2e/storage: speed up kubectl commands" to "e2e/storage: speed up kubelet commands" May 21, 2024
@huww98
Contributor Author

huww98 commented May 27, 2024

/retest

Labels
area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants