Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NO-JIRA: e2e: set performance profile cpus using env vars #909

Merged
merged 2 commits into from Feb 1, 2024

Conversation

shajmakh
Copy link
Contributor

@shajmakh shajmakh commented Jan 9, 2024

We've been observing lately that some tests that involve disabling load balancing are failing (like 32646) because the expected result does not have specific anticipated CPUs. After investigation, it turns out that one factor is the profile configuration of the CPU distribution.

PAO functional tests configure fixed CPU values under the PP. This is considered misconfiguration, especially when the system has more than 4 CPUs, and there is no guarantee that the functionality of the performance profile controller will work adequately with not all cpus reflected in the CPU section in the PP.

To resolve this complication, we are introducing new environment variables RESERVED_CPU_SET, ISOLATED_CPU_SET, OFFLINED_CPU_SET, should be set the profile would use them instead of the defaults.

@shajmakh shajmakh marked this pull request as draft January 9, 2024 11:46
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2024
@shajmakh shajmakh marked this pull request as ready for review January 9, 2024 11:56
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2024
@openshift-ci openshift-ci bot requested a review from kpouget January 9, 2024 11:58
@shajmakh
Copy link
Contributor Author

shajmakh commented Jan 9, 2024

/cc @ffromani @mrniranjan @Tal-or

@shajmakh
Copy link
Contributor Author

shajmakh commented Jan 9, 2024

Note: The README.md file is generally outdated and needs to be updated with the new suites and instructions on adding and running them, yet this can be addressed later as it is outside of this PR scope.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at first I was a bit taken aback by the approach here but after a more careful review and some thinking I'm growing to like it.
Do you have already a sketch about how the changes will be used e.g in openshift/release flows? (e.g. which changes to prow config will be needed?)

report.xml Outdated Show resolved Hide resolved
test/e2e/performanceprofile/functests/0_config/config.go Outdated Show resolved Hide resolved
shajmakh added a commit to shajmakh/release that referenced this pull request Jan 9, 2024
Following
openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile and
considering the u/s CI runs on AWS-generated clusters with constant
CPU settings, according to the PPC CPU calculation using must-gather
data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should environment settings
change.

Signed-off-by: shajmakh <shajmakh@redhat.com>
@shajmakh
Copy link
Contributor Author

shajmakh commented Jan 9, 2024

at first I was a bit taken aback by the approach here but after a more careful review and some thinking I'm growing to like it. Do you have already a sketch about how the changes will be used e.g in openshift/release flows? (e.g. which changes to prow config will be needed?)

Thank you for reviewing this.
yes here is the changes for cnf-features-deploy u/s CI:
openshift/release#47439
another similar change is needed for NTO lane.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

so this is the minimal solution to adapt the builtin hardcoded profile to different environments setting up the cpu setup correctly and consuming all CPUs.
Many other tradeoffs are possible ranging from pre-building a profile and letting the tests consume it (which should already be possible today) to fully automatically compute the right baseline CPU allocation autodetecting from the cluster at each run

where to draw the line is exactly one of the many cases about finding the correct tradeoff.

I think this PR can unblock us and enable us to further explore the solution space if we can and want (we really should FWIW) so I'm fine with this initial step.

Copy link
Contributor

openshift-ci bot commented Jan 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 9, 2024
Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ffromani
Copy link
Contributor

ffromani commented Jan 9, 2024

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2024
@shajmakh
Copy link
Contributor Author

shajmakh commented Jan 9, 2024

/hold
failure is legit, working on a fix

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2024
@ffromani
Copy link
Contributor

ffromani commented Jan 9, 2024

/hold failure is legit, working on a fix

right, sorry I missed it. Also, it seems we have a clear case of #881 also affecting our suites:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-node-tuning-operator/909/pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao/1744705071402192896

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2024
shajmakh added a commit to shajmakh/release that referenced this pull request Jan 9, 2024
Following
openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO u/s CI is using ipi
deployment on vms and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should environment settings change
in gcp.

Signed-off-by: shajmakh <shajmakh@redhat.com>
@ffromani
Copy link
Contributor

ffromani commented Jan 9, 2024

/hold failure is legit, working on a fix

right, sorry I missed it. Also, it seems we have a clear case of #881 also affecting our suites: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-node-tuning-operator/909/pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao/1744705071402192896

fixed in #911

Copy link
Contributor

openshift-ci bot commented Jan 31, 2024

@shajmakh: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test e2e-aws-operator
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-techpreview
  • /test e2e-gcp-pao
  • /test e2e-gcp-pao-updating-profile
  • /test e2e-gcp-pao-workloadhints
  • /test e2e-hypershift
  • /test e2e-no-cluster
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test vet

The following commands are available to trigger optional jobs:

  • /test lint

Use /test all to run all jobs.

In response to this:

infra issue
/retest ci/prow/e2e-gcp-pao-updating-profile

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shajmakh
Copy link
Contributor Author

ci/prow/e2e-gcp-pao passing is already a good sign.
/retest e2e-gcp-pao-updating-profile

Copy link
Contributor

openshift-ci bot commented Jan 31, 2024

@shajmakh: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test e2e-aws-operator
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-techpreview
  • /test e2e-gcp-pao
  • /test e2e-gcp-pao-updating-profile
  • /test e2e-gcp-pao-workloadhints
  • /test e2e-hypershift
  • /test e2e-no-cluster
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test vet

The following commands are available to trigger optional jobs:

  • /test lint

Use /test all to run all jobs.

In response to this:

ci/prow/e2e-gcp-pao passing is already a good sign.
/retest e2e-gcp-pao-updating-profile

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shajmakh
Copy link
Contributor Author

/test e2e-gcp-pao-updating-profile

@ffromani
Copy link
Contributor

ffromani commented Feb 1, 2024

/retitle NO-JIRA: e2e: set performance profile cpus using env vars

@openshift-ci openshift-ci bot changed the title e2e: set performance profile cpus using env vars NO-JIRA: e2e: set performance profile cpus using env vars Feb 1, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 1, 2024
@openshift-ci-robot
Copy link
Contributor

@shajmakh: This pull request explicitly references no jira issue.

In response to this:

We've been observing lately that some tests that involve disabling load balancing are failing (like 32646) because the expected result does not have specific anticipated CPUs. After investigation, it turns out that one factor is the profile configuration of the CPU distribution.

PAO functional tests configure fixed CPU values under the PP. This is considered misconfiguration, especially when the system has more than 4 CPUs, and there is no guarantee that the functionality of the performance profile controller will work adequately with not all cpus reflected in the CPU section in the PP.

To resolve this complication, we are introducing new environment variables RESERVED_CPU_SET, ISOLATED_CPU_SET, OFFLINED_CPU_SET, should be set the profile would use them instead of the defaults.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@shajmakh
Copy link
Contributor Author

shajmakh commented Feb 1, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2024
@shajmakh
Copy link
Contributor Author

shajmakh commented Feb 1, 2024

we need this in all supported versions
/cherry-pick release-4.15

@openshift-cherrypick-robot

@shajmakh: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

we need this in all supported versions
/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1e3a67e and 2 for PR HEAD cc40929 in total

Copy link
Contributor

openshift-ci bot commented Feb 1, 2024

@shajmakh: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 64ed3fe into openshift:master Feb 1, 2024
15 checks passed
@openshift-cherrypick-robot

@shajmakh: new pull request created: #934

In response to this:

we need this in all supported versions
/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202402020041.p0.g64ed3fe.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

shajmakh added a commit to shajmakh/release that referenced this pull request Feb 6, 2024
Following
openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile and
considering the u/s CI runs on AWS-generated clusters with constant
CPU settings, according to the PPC CPU calculation using must-gather
data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should environment settings
change.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 6, 2024
Following
openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO u/s CI is using ipi
deployment on vms and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should environment settings change
in gcp.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 29, 2024
Following
openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO u/s CI is using ipi
deployment on vms and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should environment settings change
in gcp.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 29, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile.
Considering the u/s CI runs on BM-node clusters with constant CPU settings (0-111), according to the PPC CPU calculation using must-gather data, we set the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should cluster nodes' settings
change.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 29, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should node cpu settings change.
in gcp.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 29, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should node cpu settings change.
in gcp.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 29, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should node cpu settings change.
in gcp.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Feb 29, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should node cpu settings change.
in gcp.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Mar 1, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should node cpu settings change.
in gcp.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Apr 4, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile.
Considering the u/s CI runs on BM-node clusters with constant CPU settings (0-111), according to the PPC CPU calculation using must-gather data, we set the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should cluster nodes' settings
change.

Signed-off-by: shajmakh <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Apr 4, 2024
Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU
specifications for the performance profile that will be used as a base
for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs.

Note: this will need to be maintained should node cpu settings change.
in gcp.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants