
e2e:cpuloadbalance: deflake the test #730

Closed
wants to merge 5 commits

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Jul 25, 2023

This PR contains a bunch of improvements to deflake the test.
Most of the commits are cosmetic, except for the commit that verifies
the sched domains before the test begins.

@openshift-ci openshift-ci bot requested review from dagrayvid and jmencak July 25, 2023 13:12
@openshift-ci
Contributor

openshift-ci bot commented Jul 25, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Tal-or
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Tal-or
Contributor Author

Tal-or commented Jul 25, 2023

This PR is supposed to deflake the test, or at least make the failure consistent.
Either way, this is better than the flakiness we have right now.

Return a map from the `getCPUswithLoadBalanceDisabled` function
and then remove the nested loop from the check.

This is done only to make the test clearer and
does not contain an actual fix.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
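The map-returning refactor described in the commit above can be sketched roughly as follows. This is a minimal illustration with hypothetical names and hard-coded data (the real helper parses node state, and its exact signature may differ): returning a set-like map turns the nested-loop membership test into a single lookup.

```go
package main

import "fmt"

// Hypothetical sketch of the refactor: the helper returns a set of CPU IDs
// (as a map) instead of a slice, so callers can test membership directly.
func getCPUsWithLoadBalanceDisabled() map[int]struct{} {
	// In the real test this is discovered on the node; hard-coded here.
	return map[int]struct{}{2: {}, 3: {}}
}

func allPodCPUsDisabled(podCPUs []int) bool {
	disabled := getCPUsWithLoadBalanceDisabled()
	for _, cpu := range podCPUs {
		// A single map lookup replaces the inner loop over a slice.
		if _, ok := disabled[cpu]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allPodCPUsDisabled([]int{2, 3})) // true
	fmt.Println(allPodCPUsDisabled([]int{2, 4})) // false: cpu 4 still balanced
}
```

Besides dropping the O(n*m) nested loop, the map makes the intent (set membership) explicit at the call site.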
The test pod is the only GU pod that requests
cpu-load-balancing to be disabled.

This means that all the CPUs on the system should have
cpu-load-balancing enabled before the test starts.

We should verify that before the test begins and bail out
early if it doesn't hold.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
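The precondition check described above can be sketched like this (hypothetical function names, not the suite's actual API): before the test starts, the set of CPUs with load balancing disabled should be empty; otherwise the test bails out early instead of failing later with a confusing result.

```go
package main

import "fmt"

// Hypothetical precondition sketch: since the test pod is the only GU pod
// requesting cpu-load-balancing to be disabled, no CPU should be outside
// its sched domain before the test begins.
func verifyAllCPUsLoadBalanced(disabled map[int]struct{}) error {
	if len(disabled) > 0 {
		return fmt.Errorf("%d cpu(s) already have load balancing disabled", len(disabled))
	}
	return nil
}

func main() {
	// Empty set: the precondition holds and the test may proceed.
	if err := verifyAllCPUsLoadBalanced(map[int]struct{}{}); err != nil {
		fmt.Println("bail out early:", err)
		return
	}
	fmt.Println("precondition holds")
}
```

In a Ginkgo suite this kind of guard would typically live at the start of the `It` (or in a `BeforeEach`) and skip or fail fast when the environment is already dirty.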
The pod gets deleted during the test and there's only a single `It` in
the node spec anyway, so the `AfterEach` is not needed.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
After the pod gets deleted, all CPUs should be back in the sched domain,
so the check can be simpler.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@@ -343,14 +350,14 @@ var _ = Describe("[rfe_id:27363][performance] CPU Management", Ordered, func() {
return true
} else {
for _, podcpu := range podCpus.ToSlice() {
for _, cpu := range cpusNotinSchedulingDomains {
if !strings.Contains(cpu, fmt.Sprint(podcpu)) {
Contributor Author

@Tal-or Tal-or Jul 25, 2023


This line checks whether podcpu is not a substring of cpu.
podcpu is a CPU id, for example: 3.
cpu is a single line of the /proc/schedstat output, for example:
[ cpu0 0 0 0 0 0 0 68186178574807 516377247436 331031573 cpu1 0 0 0 0 0 0 75970491002822 375072790117 330836684]
The problem with this check is that the string "3" can appear somewhere in this line even though the line refers only to cpu0 and cpu1 (for example inside the counter 331031573), so the substring test gives the wrong answer.
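The false match can be demonstrated with a short, self-contained sketch. The token-based alternative shown here (matching the whole `cpuN` field) is illustrative only, not necessarily the fix the PR adopts:

```go
package main

import (
	"fmt"
	"strings"
)

// A flattened /proc/schedstat line can mention several CPUs plus large
// counters. Naively checking strings.Contains(line, fmt.Sprint(cpuID))
// is wrong: the digit "3" matches inside counters such as "331031573".
func naiveCheck(line string, cpuID int) bool {
	return strings.Contains(line, fmt.Sprint(cpuID))
}

// Matching the whole "cpuN" token avoids false positives.
func tokenCheck(line string, cpuID int) bool {
	target := fmt.Sprintf("cpu%d", cpuID)
	for _, field := range strings.Fields(line) {
		if field == target {
			return true
		}
	}
	return false
}

func main() {
	line := "cpu0 0 0 0 0 0 0 68186178574807 516377247436 331031573 " +
		"cpu1 0 0 0 0 0 0 75970491002822 375072790117 330836684"
	fmt.Println(naiveCheck(line, 3)) // true: "3" appears in the counters
	fmt.Println(tokenCheck(line, 3)) // false: there is no "cpu3" token
}
```

The naive check reports cpu 3 as present even though the line only describes cpu0 and cpu1, which is exactly the flakiness source discussed in this thread.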

Contributor


this deserves to be a comment in the code

Contributor Author

@Tal-or Tal-or Aug 1, 2023


I don't think the test is going to pass.
I think this patch just surfaces that the issue is real and consistent.
IOW, this test should never have passed, because the kernel doesn't behave as we expected.

Contributor


Right, there's probably little point in documenting this even in the fixing patch.

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/retest

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

[performance]Hugepages [rfe_id:27354]Huge pages support for container workloads [It] [test_id:27477][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] Huge pages support for container workloads
/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/hugepages.go:123
  [FAILED] Unexpected error:
      <*fmt.wrapError | 0xc0004d3ca0>: {
          msg: "failed to run command [cat /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes]: output \"cat: /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes: No such file or directory\\r\\r\\n\"; error \"\"; command terminated with exit code 1",
          err: <exec.CodeExitError>{
              Err: <*errors.errorString | 0xc000342870>{
                  s: "command terminated with exit code 1",
              },
              Code: 1,
          },
      }
      failed to run command [cat /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes]: output "cat: /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes: No such file or directory\r\r\n"; error ""; command terminated with exit code 1

We need to check whether this is a separate issue.

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/test e2e-gcp-pao

2 similar comments
@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/test e2e-gcp-pao

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/test e2e-gcp-pao

Contributor

@ffromani ffromani left a comment


/lgtm

nice work!

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

/test e2e-gcp-pao

@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2023

@Tal-or: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-pao | 01ebf8d | link | true | /test e2e-gcp-pao |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ffromani
Contributor

ffromani commented Aug 8, 2023

please do NOT rebase until #729, #750 and #752 are merged

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2023
@openshift-merge-robot
Contributor

PR needs rebase.


@yanirq
Contributor

yanirq commented Aug 9, 2023

@Tal-or we can rebase now

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

Agreed with @mrniranjan to remove the check at the end of the test until we figure out why the kernel doesn't put the CPUs back in the sched domain.

@Tal-or Tal-or closed this Aug 17, 2023
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

4 participants