-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-11083: pao e2e: fix update test suit timeouts #626
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: yanirq The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
infra issues |
return false, fmt.Errorf("failed to execute command %q on node: %q; %w", cmd, node.Name, err) | ||
} | ||
return wait.PollImmediate(interval, timeout, func() (bool, error) { | ||
cmd := []string{"/bin/bash", "-c", "pidof stalld"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When stalld disabled, pidof stalld
will return with an error, this is why I added the || true
.
Not checking the error returned from ExecCommandOnNode
in the line below instead is not so great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there were some test cases where the the command itself with the condition actually failed. this is why we moved to a different kind of check (see the rest of the code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we're not checking the error returned from nodes.ExecCommandOnNode(cmd, node)
hence we don't know whether the command execution was successful.
I don't mind changing the executed command to something that doesn't contain pipes but we need to check the error anyway
} | ||
// we want stalld to run | ||
if err != nil { | ||
klog.Warningf("node=%q stalld_pid=%q is not a valid pid number: %v", node.Name, stalldPid, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave the warning since it doesn't affect the test result and provide some good indication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will spam the log and the error message was moved to a higher level for better visibility. Knowing the pid number itself does not provide too much helpful info in this particular case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would at least give us some indication about where the test get stuck and how much time it takes.
if !run { // we don't want stalld to run | ||
if err == nil { | ||
return false, fmt.Errorf("node=%q stalld_pid=%q stalld is running when it shouldn't", node.Name, stalldPid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this was definitely a mistake since we need to wait some time for TuneD to catch up, but I would replace the error with a warning
klog.Warningf("node=%q stalld_pid=%q stalld is still running when it shouldn't", node.Name, stalldPid)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rather have this manifested as an actual error (see higher level)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you can returned it at the end of this function instead of write it down multiple times
} | ||
return true, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it means that err != nil
.
It can happens for various reasons if we're not checking that the output received from ExecCommandOnNode
is OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we expect ExecCommandOnNode
command to return error anyway (keep me honest here) then we will need to introduce some error handling here by error type, e.g : if error was for the execution itself or the error came out from not having stalld running
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we expect ExecCommandOnNode command to return error anyway
No we don't that is my point exactly. if ExecCommandOnNode
return with an error, you don't know whether it's because stalld is not running (which is sometimes a good thing, meaning the test passes) or the execution itself failed.
need to introduce some error handling here by error type
We don't know what kind of errors might return so we perform such a check efficiently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that that was the exact behavior even before this patch , we would still return true at the end of the function and in the last observation we have this was always the case since the syntax with || true was always failing. we can however get the error message from execute command on node and print it out as a warning to provide more debuging info.
a5bbb7f
to
83ec467
Compare
/retest |
1 similar comment
/retest |
afterAll function should be skipped if tests are skipped to save testing time and irrelevant errors.
01f051c
to
5a68d4f
Compare
/test ci/prow/unit |
@yanirq: The specified target(s) for
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test all |
/retest |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch of the wg counter!
/lgtm |
By(fmt.Sprintf("Executing %q", cmd)) | ||
stalldPid, err := nodes.ExecCommandOnNode(cmd, node) | ||
ExpectWithOffset(1, err).ToNot(HaveOccurred(), "failed to execute command %q on node: %q; %w", cmd, node.Name, err) | ||
stalldPid, _ := nodes.ExecCommandOnNode(cmd, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still an issue, we can't ignore the error here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so I might bring back the pipe format that was non consistently failing maybe due to the wg issue we had and test it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an actual piping BTW, just an OR operator.
@jmencak Could you please elaborate when did you encounter issues with this ||
operator when it's part of a command executed by node.ExecCommandOnNode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry , I meant the OR operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jiri have noticed that using ||
was flaky - depending also on stalld running or not.
but since I have discovered an issue with the parrallel runs of go procedure not properly blocked by the wg.wait (fixed in the last commit here) this might have been the real issue.
I will test the previous use of ||
true first and will re-introduce it if successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error do we get?
In my case none. It was just stuck/hanged forever until I pressed Ctrl+C.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error coming out from node.ExecCommandOnNode(pidof stalld || true)
is: timed out waiting for the condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When stalld is not running eventually it will fail under:
cluster-node-tuning-operator/test/e2e/performanceprofile/functests/utils/pods/pods.go
Lines 189 to 197 in 947fd2d
func WaitForPodOutput(c *kubernetes.Clientset, pod *corev1.Pod, command []string) ([]byte, error) { | |
var out []byte | |
if err := wait.PollImmediate(15*time.Second, time.Minute, func() (done bool, err error) { | |
out, err = ExecCommandOnPod(c, pod, command) | |
if err != nil { | |
return false, err | |
} | |
return len(out) != 0, nil |
len(out)
Will always = 0 when running pidof stalld || true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so let's echo something instead:
pidof stalld || echo "stalld not running"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok , this could work. updated the PR
/hold |
util functions for checking stalld proccess existance fixed since they were existing the check too early due to hidden errors.
wg counter should be increased outside the go routine since increasing it inside the routine itself will not give a true state to be consumed by wg.wait()
/test e2e-aws-operator |
/retest-required |
@yanirq: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/lgtm |
@yanirq: Jira Issue OCPBUGS-11083: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-11083 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-4.13 |
@yanirq: #626 failed to apply on top of branch "release-4.13":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202305031028.p0.g10b668d.assembly.stream for distgit cluster-node-tuning-operator. |
* pao e2e: skip hugepages and numa tests properly afterAll function should be skipped if tests are skipped to save testing time and irrelevant errors. * pao e2e: fix stalld enablement checks util functions for checking stalld proccess existance fixed since they were existing the check too early due to hidden errors. * skip falky test OCPBUGS-12836 * fix wg counter in e2e pao update tests wg counter should be increased outside the go routine since increasing it inside the routine itself will not give a true state to be consumed by wg.wait()
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202305031815.p0.g10b668d.assembly.stream for distgit cluster-node-tuning-operator. |
* pao e2e: skip hugepages and numa tests properly afterAll function should be skipped if tests are skipped to save testing time and irrelevant errors. * pao e2e: fix stalld enablement checks util functions for checking stalld proccess existance fixed since they were existing the check too early due to hidden errors. * skip falky test OCPBUGS-12836 * fix wg counter in e2e pao update tests wg counter should be increased outside the go routine since increasing it inside the routine itself will not give a true state to be consumed by wg.wait()
* pao e2e: skip hugepages and numa tests properly afterAll function should be skipped if tests are skipped to save testing time and irrelevant errors. * pao e2e: fix stalld enablement checks util functions for checking stalld proccess existance fixed since they were existing the check too early due to hidden errors. * skip falky test OCPBUGS-12836 * fix wg counter in e2e pao update tests wg counter should be increased outside the go routine since increasing it inside the routine itself will not give a true state to be consumed by wg.wait()
This PR comes to fix :