
e2e:cpuloadbalance: deflake the test #730

Closed
wants to merge 5 commits

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Jul 25, 2023

This PR contains a bunch of improvements to deflake the test.
Most of the commits are cosmetic, except for the commit that verifies
the sched domains before the test begins.

@openshift-ci openshift-ci bot requested review from dagrayvid and jmencak July 25, 2023 13:12
@openshift-ci
Contributor

openshift-ci bot commented Jul 25, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Tal-or
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Tal-or
Contributor Author

Tal-or commented Jul 25, 2023

This PR is supposed to deflake the test, or at least make the failure consistent.
Either way, this is better than the flakiness we have right now.

Return a map from the `getCPUswithLoadBalanceDisabled` function
and then remove the nested loop from the check.

This is done only to make the test clearer and
does not contain an actual fix.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
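The map-returning refactor described in the commit above can be sketched roughly as follows. This is a minimal illustration with hypothetical names and hard-coded data (the real helper parses node state, and its exact signature may differ): returning a set-like map turns the nested-loop membership test into a single lookup.

```go
package main

import "fmt"

// Hypothetical sketch of the refactor: the helper returns a set of CPU IDs
// (as a map) instead of a slice, so callers can test membership directly.
func getCPUsWithLoadBalanceDisabled() map[int]struct{} {
	// In the real test this is discovered on the node; hard-coded here.
	return map[int]struct{}{2: {}, 3: {}}
}

func allPodCPUsDisabled(podCPUs []int) bool {
	disabled := getCPUsWithLoadBalanceDisabled()
	for _, cpu := range podCPUs {
		// A single map lookup replaces the inner loop over a slice.
		if _, ok := disabled[cpu]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allPodCPUsDisabled([]int{2, 3})) // true
	fmt.Println(allPodCPUsDisabled([]int{2, 4})) // false: cpu 4 still balanced
}
```

Besides dropping the O(n*m) nested loop, the map makes the intent (set membership) explicit at the call site.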
The test pod is the only GU pod that requests
cpu-load-balancing to be disabled.

This means that all the CPUs on the system should have
cpu-load-balancing enabled before the test starts.

We should verify that before the test begins and bail out
early if it doesn't hold.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
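The precondition check described above can be sketched like this (hypothetical function names, not the suite's actual API): before the test starts, the set of CPUs with load balancing disabled should be empty; otherwise the test bails out early instead of failing later with a confusing result.

```go
package main

import "fmt"

// Hypothetical precondition sketch: since the test pod is the only GU pod
// requesting cpu-load-balancing to be disabled, no CPU should be outside
// its sched domain before the test begins.
func verifyAllCPUsLoadBalanced(disabled map[int]struct{}) error {
	if len(disabled) > 0 {
		return fmt.Errorf("%d cpu(s) already have load balancing disabled", len(disabled))
	}
	return nil
}

func main() {
	// Empty set: the precondition holds and the test may proceed.
	if err := verifyAllCPUsLoadBalanced(map[int]struct{}{}); err != nil {
		fmt.Println("bail out early:", err)
		return
	}
	fmt.Println("precondition holds")
}
```

In a Ginkgo suite this kind of guard would typically live at the start of the `It` (or in a `BeforeEach`) and skip or fail fast when the environment is already dirty.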
The pod gets deleted during the test and there's only a single `It` in
the node spec anyway, so the `AfterEach` is not needed.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
After the pod gets deleted, all CPUs should be back in the sched domain,
so the check can be simpler.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@@ -343,14 +350,14 @@ var _ = Describe("[rfe_id:27363][performance] CPU Management", Ordered, func() {
return true
} else {
for _, podcpu := range podCpus.ToSlice() {
for _, cpu := range cpusNotinSchedulingDomains {
if !strings.Contains(cpu, fmt.Sprint(podcpu)) {
Contributor Author

@Tal-or Tal-or Jul 25, 2023


This line checks whether podcpu is not a substring of cpu.
podcpu is a CPU id, for example: 3.
cpu is a single line of the /proc/schedstat output, for example:
[ cpu0 0 0 0 0 0 0 68186178574807 516377247436 331031573 cpu1 0 0 0 0 0 0 75970491002822 375072790117 330836684]
The problem with this check is that the string "3" can appear somewhere in this line even though the line refers only to cpu0 and cpu1 (for example inside the counter 331031573), so the substring test gives the wrong answer.
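The false match can be demonstrated with a short, self-contained sketch. The token-based alternative shown here (matching the whole `cpuN` field) is illustrative only, not necessarily the fix the PR adopts:

```go
package main

import (
	"fmt"
	"strings"
)

// A flattened /proc/schedstat line can mention several CPUs plus large
// counters. Naively checking strings.Contains(line, fmt.Sprint(cpuID))
// is wrong: the digit "3" matches inside counters such as "331031573".
func naiveCheck(line string, cpuID int) bool {
	return strings.Contains(line, fmt.Sprint(cpuID))
}

// Matching the whole "cpuN" token avoids false positives.
func tokenCheck(line string, cpuID int) bool {
	target := fmt.Sprintf("cpu%d", cpuID)
	for _, field := range strings.Fields(line) {
		if field == target {
			return true
		}
	}
	return false
}

func main() {
	line := "cpu0 0 0 0 0 0 0 68186178574807 516377247436 331031573 " +
		"cpu1 0 0 0 0 0 0 75970491002822 375072790117 330836684"
	fmt.Println(naiveCheck(line, 3)) // true: "3" appears in the counters
	fmt.Println(tokenCheck(line, 3)) // false: there is no "cpu3" token
}
```

The naive check reports cpu 3 as present even though the line only describes cpu0 and cpu1, which is exactly the flakiness source discussed in this thread.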

Contributor


this deserves to be a comment in the code

Contributor Author

@Tal-or Tal-or Aug 1, 2023


I don't think the test is going to pass.
I think this patch just surfaces that the issue is real and consistent.
IOW, this test should never have passed, because the kernel doesn't behave as we expected.

Contributor


Right, there's probably little point in documenting this even in the fixing patch.

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/retest

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

[performance]Hugepages [rfe_id:27354]Huge pages support for container workloads [It] [test_id:27477][crit:high][vendor:cnf-qe@redhat.com][level:acceptance] Huge pages support for container workloads
/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/hugepages.go:123
  [FAILED] Unexpected error:
      <*fmt.wrapError | 0xc0004d3ca0>: {
          msg: "failed to run command [cat /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes]: output \"cat: /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes: No such file or directory\\r\\r\\n\"; error \"\"; command terminated with exit code 1",
          err: <exec.CodeExitError>{
              Err: <*errors.errorString | 0xc000342870>{
                  s: "command terminated with exit code 1",
              },
              Code: 1,
          },
      }
      failed to run command [cat /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes]: output "cat: /rootfs/sys/fs/cgroup/hugetlb/hugetlb.1GB.usage_in_bytes: No such file or directory\r\r\n"; error ""; command terminated with exit code 1

We need to check whether this is a separate issue.

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/test e2e-gcp-pao

2 similar comments
@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/test e2e-gcp-pao

@Tal-or
Contributor Author

Tal-or commented Jul 26, 2023

/test e2e-gcp-pao

Contributor

@ffromani ffromani left a comment


/lgtm

nice work!

@Tal-or
Contributor Author

Tal-or commented Aug 2, 2023

/test e2e-gcp-pao

@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2023

@Tal-or: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-pao | 01ebf8d | link | true | /test e2e-gcp-pao |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ffromani
Contributor

ffromani commented Aug 8, 2023

please do NOT rebase until #729, #750 and #752 are merged

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2023
@openshift-merge-robot
Contributor

PR needs rebase.


@yanirq
Contributor

yanirq commented Aug 9, 2023

@Tal-or we can rebase now

@Tal-or
Contributor Author

Tal-or commented Aug 17, 2023

Agreed with @mrniranjan to remove the check at the end of the test until we figure out why the kernel doesn't put the CPUs back in the sched domain.

@Tal-or Tal-or closed this Aug 17, 2023
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

4 participants