-
Notifications
You must be signed in to change notification settings - Fork 38.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
e2e/storage: speed up kubelet commands #124028
base: master
Are you sure you want to change the base?
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @huww98. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
f29097e
to
eefa97b
Compare
/ok-to-test |
b1e6e0b
to
deb699a
Compare
@pohly PTAL again, thanks. |
@@ -104,6 +96,9 @@ func TestKubeletRestartsAndRestoresMount(ctx context.Context, c clientset.Interf | |||
ginkgo.By("Restarting kubelet") | |||
KubeletCommand(ctx, KRestart, c, clientPod) | |||
|
|||
ginkgo.By("Wait 20s for the volume to become stable") | |||
time.Sleep(20 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Magic timeouts have been great (not!) source of test flakes. Can you elaborate on why 20 seconds where chosen? How certain is it that this value works reliably? How does it affect the runtime of tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sleep is moved from KubeletCommand()
just above it. We are asserting that something is not happening, so we need a timeout.
I choose 20 seconds because the original code in KubeletCommand():
framework.Logf("Noticed that kubelet PID is changed. Waiting for 30 Seconds for Kubelet to come back")
time.Sleep(30 * time.Second)
The sleep now starts after node become ready, so (20s + WaitForNodeToBeReady) should roughly equal to the original 30s.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: huww98 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/remove-sig node |
CreateVolume failed to create single zonal disk "pvc-cd42c4cd-45d0-480c-91bc-3357014114f6": failed to insert zonal disk: unknown Insert disk operation error: operation operation-1715853438060-6188f42a98737-9493b122-2ab6b2c5 failed (INTERNAL_ERROR): Internal error. Please try again or contact Google Support. (Code: 'Internal error') OK, so let me try again /retest |
@@ -160,32 +161,38 @@ func WaitForNodeSchedulable(ctx context.Context, c clientset.Interface, name str | |||
return false | |||
} | |||
|
|||
func WaitForNodeHeartbeatAfter(ctx context.Context, c clientset.Interface, name string, after metav1.Time, timeout time.Duration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add documentation? Please don't just repeat the function name. Paraphrase it and explain the parameters.
"nodeName" instead of "name" would be more descriptive. "after" is also very generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think name
is enough, given the function name, package name as the context. name
is also consistent with other functions in this package.
I've added docs to explain after
. I think this should be enough. The type metav1.Time
can also provide reader some context.
Speed up stopping by not waiting for Node not ready, `systemctl` will ensure kubelet process stopped before return. This should save 40s per case. Since stop command does not wait for not ready, start command needs to wait for the next heartbeat to ensure we are checking status from new process. implement restart by stop then start, to get heartbeat time when kubelet is down. And we do not need to sleep 30s now. The sleep is moved to callers, since they still need them to ensure the volume does not disappear. Dropped support for non-systemd system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall, but I am not familiar with KubeletCommand
and thus cannot tell whether the proposed changes are okay.
Perhaps someone who has used this and/or written it can chime in?
@huww98: GitHub didn't allow me to request PR reviews from the following users: copejon. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/retest |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Speed up stopping by not waiting for Node not ready,
systemctl
will ensure kubelet process stopped before return. This should save 40s per case.Since stop command does not wait for not ready, start command needs to wait for the next heartbeat to ensure we are checking status from new process.
implement restart by stop then start, to get heartbeat time when kubelet is down. And we do not need to sleep 30s now. The sleep is moved to callers, since they still need them to ensure the volume does not disappear.
Dropped support for non-systemd system.
Which issue(s) this PR fixes:
Special notes for your reviewer:
I think the non-systemd setup never worked before. Do a non-systemd system have
Main PID
in its output?Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: