OCPBUGS-11083: pao e2e: fix update test suit timeouts #626

Merged 4 commits on May 3, 2023

@yanirq yanirq commented Apr 19, 2023

This PR fixes:

  • flaky stalld tests that occur due to race conditions
  • the afterAll function should be skipped if tests are skipped, to save testing time and avoid irrelevant errors.
  • skip flaky update test due to OCPBUGS-12836

openshift-ci bot commented Apr 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yanirq


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2023
yanirq commented Apr 20, 2023

/retest

@jlojosnegros jlojosnegros left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2023
yanirq commented Apr 20, 2023

infra issues
/test e2e-gcp-pao-updating-profile

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2023
@yanirq yanirq changed the title from "pao e2e: skip hugepages and numa tests properly" to "pao e2e: fix update test suit timeouts" Apr 20, 2023
return false, fmt.Errorf("failed to execute command %q on node: %q; %w", cmd, node.Name, err)
}
return wait.PollImmediate(interval, timeout, func() (bool, error) {
cmd := []string{"/bin/bash", "-c", "pidof stalld"}
@Tal-or Tal-or Apr 23, 2023

When stalld is disabled, pidof stalld returns an error; this is why I added the || true.
Dropping that and not checking the error returned from ExecCommandOnNode in the line below is not so great.
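
To illustrate the || true point, here is a minimal sketch that runs scripts through a local "sh -c" using Go's os/exec. This is an assumption made for illustration only: the test suite goes through nodes.ExecCommandOnNode, which is not reproduced here, and the helper name shellExitCode is hypothetical. The sketch shows that appending || true to a failing command makes the shell exit with status 0, so any error that remains would point at the execution itself rather than at the command's result.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// shellExitCode runs script with the local "sh -c" and returns its exit code.
// A non-nil error means the shell could not be run at all; a non-zero exit
// code with a nil error means the script ran and reported failure.
// (Hypothetical helper; it does not mirror nodes.ExecCommandOnNode.)
func shellExitCode(script string) (int, error) {
	err := exec.Command("sh", "-c", script).Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return -1, err
}

func main() {
	// "false" stands in for a command such as pidof that finds nothing.
	for _, script := range []string{"false", "false || true"} {
		code, err := shellExitCode(script)
		fmt.Printf("%q: exit code %d, execution error: %v\n", script, code, err)
	}
}
```

With this shape, a caller can treat every non-nil error as a genuine execution failure, which is the property the review asks for.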

Contributor Author

There were some test cases where the command itself, with the condition, actually failed. This is why we moved to a different kind of check (see the rest of the code).

Contributor

The problem is that we're not checking the error returned from nodes.ExecCommandOnNode(cmd, node), hence we don't know whether the command execution was successful.

I don't mind changing the executed command to something that doesn't contain pipes, but we need to check the error anyway.

}
// we want stalld to run
if err != nil {
klog.Warningf("node=%q stalld_pid=%q is not a valid pid number: %v", node.Name, stalldPid, err)
@Tal-or Tal-or Apr 23, 2023

I would leave the warning, since it doesn't affect the test result and provides a good indication.

Contributor Author

This will spam the log, and the error message was moved to a higher level for better visibility. Knowing the pid number itself does not provide much helpful info in this particular case.

Contributor

It would at least give us some indication of where the test gets stuck and how much time it takes.

if !run { // we don't want stalld to run
if err == nil {
return false, fmt.Errorf("node=%q stalld_pid=%q stalld is running when it shouldn't", node.Name, stalldPid)
Contributor

Yes, this was definitely a mistake, since we need to wait some time for TuneD to catch up, but I would replace the error with a warning:
klog.Warningf("node=%q stalld_pid=%q stalld is still running when it shouldn't", node.Name, stalldPid)

Contributor Author

I would rather have this manifested as an actual error (see the higher level).

Contributor

Then you can return it at the end of this function instead of writing it down multiple times.
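
The trade-off discussed in this thread can be sketched as follows. In a poll loop such as wait.PollImmediate, returning an error from the condition stops polling immediately, while returning (false, nil) retries. The sketch below is a simplified, attempt-counting stand-in written with the standard library only; pollUntil, strictWait and tolerantWait are hypothetical names, not functions from the test suite. It contrasts failing on the first observation with remembering the last observation and returning it once at the end.

```go
package main

import (
	"errors"
	"fmt"
)

// pollUntil is a simplified, hypothetical stand-in for a polling helper:
// it calls cond up to maxAttempts times, retries on (false, nil), and
// aborts immediately when cond returns an error.
func pollUntil(maxAttempts int, cond func(attempt int) (bool, error)) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		done, err := cond(attempt)
		if err != nil {
			return attempt, err
		}
		if done {
			return attempt, nil
		}
	}
	return maxAttempts, errors.New("timed out waiting for the condition")
}

// strictWait returns an error as soon as stalld is seen running, which
// ends the poll before the node has had time to catch up.
func strictWait(stopsAtAttempt, maxAttempts int) (int, error) {
	return pollUntil(maxAttempts, func(attempt int) (bool, error) {
		if attempt < stopsAtAttempt {
			return false, fmt.Errorf("stalld is running when it shouldn't (attempt %d)", attempt)
		}
		return true, nil
	})
}

// tolerantWait keeps polling while stalld is still running and reports
// the last observation only if the attempts run out.
func tolerantWait(stopsAtAttempt, maxAttempts int) (int, error) {
	var lastObserved error
	attempts, err := pollUntil(maxAttempts, func(attempt int) (bool, error) {
		if attempt < stopsAtAttempt {
			lastObserved = fmt.Errorf("stalld is still running when it shouldn't (attempt %d)", attempt)
			return false, nil
		}
		return true, nil
	})
	if err != nil && lastObserved != nil {
		return attempts, fmt.Errorf("%w: %v", err, lastObserved)
	}
	return attempts, err
}

func main() {
	attempts, err := strictWait(3, 5)
	fmt.Printf("strict:   attempts=%d err=%v\n", attempts, err)
	attempts, err = tolerantWait(3, 5)
	fmt.Printf("tolerant: attempts=%d err=%v\n", attempts, err)
}
```

The tolerant variant still surfaces a real error when the wait runs out, but it is produced in one place and does not cut the wait short.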

}
return true, nil
Contributor

Here it means that err != nil.
It can happen for various reasons if we're not checking that the output received from ExecCommandOnNode is OK.

Contributor Author

Since we expect the ExecCommandOnNode command to return an error anyway (keep me honest here), we will need to introduce some error handling here by error type, e.g. whether the error came from the execution itself or from stalld not running.

@Tal-or Tal-or Apr 24, 2023

"since we expect ExecCommandOnNode command to return error anyway"

No, we don't; that is exactly my point. If ExecCommandOnNode returns an error, you don't know whether it's because stalld is not running (which is sometimes a good thing, meaning the test passes) or because the execution itself failed.

"need to introduce some error handling here by error type"

We don't know what kinds of errors might be returned, so we can't perform such a check efficiently.

Contributor Author

I would argue that this was the exact behavior even before this patch: we would still return true at the end of the function, and as far as we last observed this was always the case, since the syntax with || true was always failing. We can, however, take the error message from executing the command on the node and print it as a warning to provide more debugging info.

@yanirq yanirq force-pushed the numa_test_skip branch 4 times, most recently from a5bbb7f to 83ec467 Compare April 27, 2023 08:33
yanirq commented Apr 27, 2023

/retest

yanirq commented Apr 28, 2023

/retest

afterAll function should be skipped if tests are skipped
to save testing time and irrelevant errors.
@yanirq yanirq force-pushed the numa_test_skip branch 3 times, most recently from 01f051c to 5a68d4f Compare April 30, 2023 12:15
yanirq commented Apr 30, 2023

/test ci/prow/unit

openshift-ci bot commented Apr 30, 2023

@yanirq: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws-operator
  • /test e2e-aws-ovn
  • /test e2e-gcp-pao
  • /test e2e-gcp-pao-updating-profile
  • /test e2e-no-cluster
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test vet

Use /test all to run all jobs.

In response to this:

/test ci/prow/unit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yanirq commented Apr 30, 2023

/test all

yanirq commented Apr 30, 2023

/retest
infra issues

yanirq commented Apr 30, 2023

/retest

@Tal-or Tal-or left a comment

Nice catch of the wg counter!

@yanirq yanirq changed the title from "pao e2e: fix update test suit timeouts" to "OCPBUGS-11083: pao e2e: fix update test suit timeouts" May 1, 2023
Tal-or commented May 2, 2023

/lgtm
But we still need to figure out the timeout issue

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2023
By(fmt.Sprintf("Executing %q", cmd))
stalldPid, err := nodes.ExecCommandOnNode(cmd, node)
ExpectWithOffset(1, err).ToNot(HaveOccurred(), "failed to execute command %q on node: %q; %w", cmd, node.Name, err)
stalldPid, _ := nodes.ExecCommandOnNode(cmd, node)
Contributor

This is still an issue, we can't ignore the error here

Contributor Author

OK, so I might bring back the pipe format, which was failing inconsistently (maybe due to the wg issue we had), and test it.

Contributor

This is not actual piping, BTW, just an OR operator.
@jmencak Could you please elaborate on when you encountered issues with this || operator when it's part of a command executed by node.ExecCommandOnNode?

Contributor Author

Sorry, I meant the OR operator.

@yanirq yanirq May 2, 2023

Jiri has noticed that using || was flaky, depending also on whether stalld was running or not.
But since I have discovered an issue with the parallel runs of goroutines not being properly blocked by wg.Wait (fixed in the last commit here), this might have been the real issue.
I will test the previous use of || true first and will re-introduce it if successful.

Contributor

"What error do we get?"

In my case, none. It was just stuck (hung) forever until I pressed Ctrl+C.

@yanirq yanirq May 2, 2023

The error coming out from node.ExecCommandOnNode(pidof stalld || true) is: timed out waiting for the condition

Contributor Author

When stalld is not running, it will eventually fail in:

func WaitForPodOutput(c *kubernetes.Clientset, pod *corev1.Pod, command []string) ([]byte, error) {
	var out []byte
	if err := wait.PollImmediate(15*time.Second, time.Minute, func() (done bool, err error) {
		out, err = ExecCommandOnPod(c, pod, command)
		if err != nil {
			return false, err
		}
		return len(out) != 0, nil

len(out) will always be 0 when running pidof stalld || true, so the condition never returns true and the poll times out.
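
The effect of the len(out) != 0 condition can be reproduced with a small standard-library sketch. pollForOutput below is a hypothetical, simplified stand-in for the loop quoted above (it is not the project's WaitForPodOutput and does not use the wait package), and simulate feeds it a fake command whose output is fixed. The intervals are shortened to milliseconds for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimedOut carries the same message as the error reported in this thread.
var errTimedOut = errors.New("timed out waiting for the condition")

// pollForOutput calls fetch immediately and then once per interval until
// fetch returns non-empty output, fetch returns an error, or the timeout
// elapses. (Hypothetical helper for illustration only.)
func pollForOutput(interval, timeout time.Duration, fetch func() ([]byte, error)) ([]byte, error) {
	deadline := time.Now().Add(timeout)
	for {
		out, err := fetch()
		if err != nil {
			return nil, err
		}
		if len(out) != 0 {
			return out, nil
		}
		if time.Now().Add(interval).After(deadline) {
			return nil, errTimedOut
		}
		time.Sleep(interval)
	}
}

// simulate polls a fake command that always succeeds and prints output.
func simulate(output string) ([]byte, error) {
	return pollForOutput(5*time.Millisecond, 30*time.Millisecond, func() ([]byte, error) {
		return []byte(output), nil
	})
}

func main() {
	// A successful command with empty output, like "pidof stalld || true"
	// when stalld is not running, never satisfies the condition.
	_, err := simulate("")
	fmt.Println("empty output:", err)

	out, err := simulate("1234\n")
	fmt.Printf("non-empty output: %q, err=%v\n", out, err)
}
```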

Contributor

so let's echo something instead:
pidof stalld || echo "stalld not running"

Contributor Author

OK, this could work. Updated the PR.
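
One way the caller could interpret the output of pidof stalld || echo "stalld not running" is sketched below. This is an assumption about how such output might be parsed, not the code that was merged; parseStalldOutput and notRunningSentinel are hypothetical names. Because the command now always prints something, the result falls into one of three cases: a PID, the sentinel text, or unexpected output that should be treated as an error.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// notRunningSentinel is the text echoed when pidof finds no stalld process.
const notRunningSentinel = "stalld not running"

// parseStalldOutput interprets the combined output of
// `pidof stalld || echo "stalld not running"`.
// (Hypothetical helper; the merged change may parse the output differently.)
func parseStalldOutput(out string) (pid int, running bool, err error) {
	trimmed := strings.TrimSpace(out)
	if trimmed == notRunningSentinel {
		return 0, false, nil
	}
	// pidof can print several PIDs separated by spaces; take the first one.
	fields := strings.Fields(trimmed)
	if len(fields) == 0 {
		return 0, false, fmt.Errorf("empty output, cannot tell whether stalld is running")
	}
	pid, err = strconv.Atoi(fields[0])
	if err != nil {
		return 0, false, fmt.Errorf("unexpected output %q: %w", out, err)
	}
	return pid, true, nil
}

func main() {
	for _, out := range []string{"1234\n", "stalld not running\n", "bash: pidof: command not found\n"} {
		pid, running, err := parseStalldOutput(out)
		fmt.Printf("%q: pid=%d running=%v err=%v\n", out, pid, running, err)
	}
}
```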

Tal-or commented May 2, 2023

/hold
We still need to discuss #626 (comment)

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 2, 2023
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 2, 2023
yanirq added 3 commits May 2, 2023 18:22
util functions for checking stalld process existence
fixed since they were exiting the check too early due
to hidden errors.
wg counter should be increased outside the goroutine
since increasing it inside the goroutine itself will not
give a true state to be consumed by wg.Wait()
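
The WaitGroup point in the commit message above can be sketched with the standard library. When wg.Add(1) is called inside the goroutine, wg.Wait() may run before any goroutine has been scheduled, see a zero counter, and return early. Calling Add before the go statement avoids that. The function name runOnNodes and the node names are hypothetical; this is not the test suite's code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runOnNodes runs check once per node concurrently and returns how many
// checks had completed by the time Wait returned.
// (Hypothetical helper illustrating the Add-before-go pattern.)
func runOnNodes(nodes []string, check func(node string)) int {
	var wg sync.WaitGroup
	var completed int64
	for _, node := range nodes {
		// Increment the counter before starting the goroutine, so that
		// Wait below cannot observe a zero counter while work is pending.
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			check(name)
			atomic.AddInt64(&completed, 1)
		}(node)
	}
	wg.Wait()
	return int(atomic.LoadInt64(&completed))
}

func main() {
	nodes := []string{"worker-0", "worker-1", "worker-2"}
	done := runOnNodes(nodes, func(node string) {
		// Placeholder for a per-node check.
		_ = node
	})
	fmt.Printf("completed %d of %d checks\n", done, len(nodes))
}
```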
yanirq commented May 2, 2023

/test e2e-aws-operator

yanirq commented May 2, 2023

/retest-required

openshift-ci bot commented May 3, 2023

@yanirq: all tests passed!

Tal-or commented May 3, 2023

/lgtm
/hold cancel
Thank you @yanirq!

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 3, 2023
@openshift-merge-robot openshift-merge-robot merged commit 10b668d into openshift:master May 3, 2023
@openshift-ci-robot

@yanirq: Jira Issue OCPBUGS-11083: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-11083 has been moved to the MODIFIED state.

yanirq commented May 3, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@yanirq: #626 failed to apply on top of branch "release-4.13":

Applying: pao e2e: skip hugepages and numa tests properly
Using index info to reconstruct a base tree...
M	test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
CONFLICT (content): Merge conflict in test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 pao e2e: skip hugepages and numa tests properly
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202305031028.p0.g10b668d.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

dagrayvid pushed a commit to dagrayvid/cluster-node-tuning-operator that referenced this pull request May 3, 2023
* pao e2e: skip hugepages and numa tests properly

afterAll function should be skipped if tests are skipped
to save testing time and irrelevant errors.

* pao e2e: fix stalld enablement checks

util functions for checking stalld process existence
fixed since they were exiting the check too early due
to hidden errors.

* skip flaky test OCPBUGS-12836

* fix wg counter in e2e pao update tests

wg counter should be increased outside the go routine
since increasing it inside the routine itself will not
give a true state to be consumed by wg.wait()
@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202305031815.p0.g10b668d.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this pull request May 23, 2023
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this pull request Jun 13, 2023
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.