OCPBUGS-11083: pao e2e: fix update test suit timeouts #626

yanirq · 2023-04-19T19:57:43Z

This PR fixes the following:

stalld flaky tests that occur due to race conditions
afterAll function should be skipped if tests are skipped, to save testing time and avoid irrelevant errors.
skip flaky update test due to OCPBUGS-12836

openshift-ci · 2023-04-19T19:59:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yanirq

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [yanirq]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yanirq · 2023-04-20T07:14:23Z

/retest

jlojosnegros

/lgtm

yanirq · 2023-04-20T08:56:28Z

infra issues
/test e2e-gcp-pao-updating-profile

Tal-or · 2023-04-23T16:28:27Z

test/e2e/performanceprofile/functests/utils/tuned/tuned.go

-			return false, fmt.Errorf("failed to execute command %q on node: %q; %w", cmd, node.Name, err)
-		}
+	return wait.PollImmediate(interval, timeout, func() (bool, error) {
+		cmd := []string{"/bin/bash", "-c", "pidof stalld"}


When stalld is disabled, pidof stalld returns an error, which is why I added the || true.
Not checking the error returned from ExecCommandOnNode in the line below is not a great alternative.

There were some test cases where the command itself, with the condition, actually failed. This is why we moved to a different kind of check (see the rest of the code).

The problem is that we're not checking the error returned from nodes.ExecCommandOnNode(cmd, node), hence we don't know whether the command execution was successful.

I don't mind changing the executed command to something that doesn't contain pipes, but we need to check the error anyway.

Tal-or · 2023-04-23T16:29:06Z

test/e2e/performanceprofile/functests/utils/tuned/tuned.go

 		}
 		// we want stalld to run
 		if err != nil {
-			klog.Warningf("node=%q stalld_pid=%q is not a valid pid number: %v", node.Name, stalldPid, err)


I would leave the warning since it doesn't affect the test result and provides a good indication.

This will spam the log, and the error message was moved to a higher level for better visibility. Knowing the pid number itself does not provide much helpful info in this particular case.

It would at least give us some indication of where the test gets stuck and how much time it takes.

Tal-or · 2023-04-23T16:30:57Z

test/e2e/performanceprofile/functests/utils/tuned/tuned.go

 		if !run { // we don't want stalld to run
 			if err == nil {
-				return false, fmt.Errorf("node=%q stalld_pid=%q stalld is running when it shouldn't", node.Name, stalldPid)


Yes, this was definitely a mistake since we need to wait some time for TuneD to catch up, but I would replace the error with a warning:
klog.Warningf("node=%q stalld_pid=%q stalld is still running when it shouldn't", node.Name, stalldPid)

I'd rather have this manifested as an actual error (see higher level).

Then you can return it at the end of this function instead of writing it down multiple times.

Tal-or · 2023-04-23T16:33:06Z

test/e2e/performanceprofile/functests/utils/tuned/tuned.go

 			}
+			return true, nil


Here it means that err != nil.
It can happen for various reasons if we're not checking that the output received from ExecCommandOnNode is OK.

Since we expect the ExecCommandOnNode command to return an error anyway (keep me honest here), we will need to introduce some error handling here by error type, e.g. whether the error came from the execution itself or from stalld not running.

> since we expect ExecCommandOnNode command to return error anyway

No, we don't; that is exactly my point. If ExecCommandOnNode returns an error, you don't know whether it's because stalld is not running (which is sometimes a good thing, meaning the test passes) or because the execution itself failed.

> need to introduce some error handling here by error type

We don't know what kinds of errors might be returned, so we can't perform such a check efficiently.

I would argue that this was the exact behavior even before this patch: we would still return true at the end of the function, and in our last observation this was always the case, since the syntax with || true was always failing. We can, however, get the error message from executing the command on the node and print it out as a warning to provide more debugging info.

yanirq · 2023-04-27T10:23:17Z

/retest

yanirq · 2023-04-28T15:40:33Z

/retest

yanirq · 2023-04-30T13:15:03Z

/test ci/prow/unit

openshift-ci · 2023-04-30T13:15:07Z

@yanirq: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-gcp-pao
/test e2e-gcp-pao-updating-profile
/test e2e-no-cluster
/test e2e-upgrade
/test images
/test unit
/test verify
/test vet

Use /test all to run all jobs.

In response to this:

/test ci/prow/unit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yanirq · 2023-04-30T13:42:37Z

/test all

yanirq · 2023-04-30T14:48:07Z

/retest
infra issues

yanirq · 2023-04-30T21:43:40Z

/retest

Tal-or

Nice catch of the wg counter!

Tal-or · 2023-05-02T06:21:45Z

/lgtm
But we still need to figure out the timeout issue

Tal-or · 2023-05-02T06:25:47Z

test/e2e/performanceprofile/functests/utils/tuned/tuned.go

 	By(fmt.Sprintf("Executing %q", cmd))
-	stalldPid, err := nodes.ExecCommandOnNode(cmd, node)
-	ExpectWithOffset(1, err).ToNot(HaveOccurred(), "failed to execute command %q on node: %q; %w", cmd, node.Name, err)
+	stalldPid, _ := nodes.ExecCommandOnNode(cmd, node)


This is still an issue; we can't ignore the error here.

OK, so I might bring back the pipe format, which was failing inconsistently, maybe due to the wg issue we had, and test it.

This is not actual piping, BTW, just an OR operator.
@jmencak Could you please elaborate on when you encountered issues with this || operator when it's part of a command executed by node.ExecCommandOnNode?

Sorry, I meant the OR operator.

Jiri noticed that using || was flaky, depending also on whether stalld was running or not.
But since I have discovered an issue with the parallel goroutines not being properly blocked by wg.Wait() (fixed in the last commit here), this might have been the real issue.
I will test the previous use of || true first and will re-introduce it if successful.

What error do we get?

In my case none. It was just stuck/hung forever until I pressed Ctrl+C.

The error coming out from node.ExecCommandOnNode(pidof stalld || true) is: timed out waiting for the condition

When stalld is not running, it will eventually fail under:

cluster-node-tuning-operator/test/e2e/performanceprofile/functests/utils/pods/pods.go

Lines 189 to 197 in 947fd2d

	func WaitForPodOutput(c *kubernetes.Clientset, pod *corev1.Pod, command []string) ([]byte, error) {
		var out []byte
		if err := wait.PollImmediate(15*time.Second, time.Minute, func() (done bool, err error) {
			out, err = ExecCommandOnPod(c, pod, command)
			if err != nil {
				return false, err
			}
			return len(out) != 0, nil

len(out) will always be 0 when running pidof stalld || true while stalld is not running,

so let's echo something instead:
pidof stalld || echo "stalld not running"

OK, this could work. Updated the PR.
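To illustrate why the sentinel helps: with `pidof stalld || echo "stalld not running"` the command always produces output, so the caller can tell "stalld is down" apart from "the command never ran". Below is a rough sketch of how such output could be interpreted. `parseStalldOutput` and `notRunningMarker` are hypothetical names for illustration, not helpers from this repository.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// notRunningMarker is the sentinel echoed when pidof finds no stalld process.
const notRunningMarker = "stalld not running"

// parseStalldOutput (hypothetical) interprets the output of
//   pidof stalld || echo "stalld not running"
// It returns an error only when the output is empty or unrecognized, i.e.
// when the command most likely did not execute as intended.
func parseStalldOutput(out string) (running bool, pids []int, err error) {
	out = strings.TrimSpace(out)
	if out == "" {
		return false, nil, fmt.Errorf("empty output: the command probably did not run")
	}
	if out == notRunningMarker {
		return false, nil, nil
	}
	// pidof prints one or more space-separated PIDs.
	for _, field := range strings.Fields(out) {
		pid, convErr := strconv.Atoi(field)
		if convErr != nil {
			return false, nil, fmt.Errorf("unexpected output %q: %w", out, convErr)
		}
		pids = append(pids, pid)
	}
	return true, pids, nil
}

func main() {
	for _, out := range []string{"1234\n", "stalld not running\n", ""} {
		running, pids, err := parseStalldOutput(out)
		fmt.Printf("out=%q running=%v pids=%v err=%v\n", out, running, pids, err)
	}
}
```

The caller can then keep checking the error returned by the execution itself and use the parsed result only for the stalld state.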

Tal-or · 2023-05-02T06:26:29Z

/hold
We still need to discuss #626 (comment)

yanirq · 2023-05-02T21:43:09Z

/test e2e-aws-operator

yanirq · 2023-05-02T23:15:10Z

/retest-required

openshift-ci · 2023-05-03T00:27:56Z

@yanirq: all tests passed!

Tal-or · 2023-05-03T07:45:59Z

/lgtm
/hold cancel
Thank you @yanirq!

openshift-ci-robot · 2023-05-03T07:49:53Z

@yanirq: Jira Issue OCPBUGS-11083: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-11083 has been moved to the MODIFIED state.

yanirq · 2023-05-03T09:42:41Z

/cherry-pick release-4.13

openshift-cherrypick-robot · 2023-05-03T09:43:27Z

@yanirq: #626 failed to apply on top of branch "release-4.13":

Applying: pao e2e: skip hugepages and numa tests properly
Using index info to reconstruct a base tree...
M	test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
CONFLICT (content): Merge conflict in test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 pao e2e: skip hugepages and numa tests properly
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

openshift-bot · 2023-05-03T11:02:11Z

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202305031028.p0.g10b668d.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

* pao e2e: skip hugepages and numa tests properly
  afterAll function should be skipped if tests are skipped, to save testing time and avoid irrelevant errors.
* pao e2e: fix stalld enablement checks
  util functions for checking stalld process existence were fixed since they were exiting the check too early due to hidden errors.
* skip falky test OCPBUGS-12836
* fix wg counter in e2e pao update tests
  wg counter should be increased outside the goroutine, since increasing it inside the goroutine itself will not give a true state to be consumed by wg.Wait()

openshift-bot · 2023-05-03T18:49:33Z

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202305031815.p0.g10b668d.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

openshift-ci bot requested review from jlojosnegros and kpouget April 19, 2023 19:59

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2023

jlojosnegros reviewed Apr 20, 2023

openshift-ci bot assigned jlojosnegros Apr 20, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2023

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2023

yanirq changed the title ~~pao e2e: skip hugepages and numa tests properly~~ pao e2e: fix update test suit timeouts Apr 20, 2023

yanirq mentioned this pull request Apr 20, 2023

pao e2e: fix stalld enablement checks #625

Closed

Tal-or reviewed Apr 23, 2023

yanirq force-pushed the numa_test_skip branch 4 times, most recently from a5bbb7f to 83ec467 Compare April 27, 2023 08:33

pao e2e: skip hugepages and numa tests properly

7cff7af

afterAll function should be skipped if tests are skipped, to save testing time and avoid irrelevant errors.

yanirq force-pushed the numa_test_skip branch 3 times, most recently from 01f051c to 5a68d4f Compare April 30, 2023 12:15

Tal-or reviewed May 1, 2023

yanirq changed the title ~~pao e2e: fix update test suit timeouts~~ OCPBUGS-11083: pao e2e: fix update test suit timeouts May 1, 2023

openshift-ci bot assigned Tal-or May 2, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2023

Tal-or reviewed May 2, 2023

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 2, 2023

yanirq force-pushed the numa_test_skip branch from bad8ce1 to 70a1829 Compare May 2, 2023 13:52

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 2, 2023

yanirq mentioned this pull request May 2, 2023

[release-4.13] OCPBUGS-11384: workload-hints: disable stalld when rt disabled #604

Merged

yanirq added 3 commits May 2, 2023 18:22

pao e2e: fix stalld enablement checks

0d5df02

util functions for checking stalld process existence were fixed since they were exiting the check too early due to hidden errors.

skip falky test OCPBUGS-12836

7fbe2a3

fix wg counter in e2e pao update tests

63b07c1

wg counter should be increased outside the goroutine, since increasing it inside the goroutine itself will not give a true state to be consumed by wg.Wait()

yanirq force-pushed the numa_test_skip branch from 70a1829 to 63b07c1 Compare May 2, 2023 15:23

openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 3, 2023

openshift-merge-robot merged commit 10b668d into openshift:master May 3, 2023

mrniranjan mentioned this pull request May 3, 2023

[release-4.13] OCPBUGS-11709: Backup and revert profile when hugepages test completes #612

Closed

yanirq mentioned this pull request May 3, 2023

[release-4.13] [manual] OCPBUGS-11336: pao e2e: fix update test suit timeouts #642

Merged
