NO-JIRA: e2e: set performance profile cpus using env vars #909

shajmakh · 2024-01-09T11:46:14Z

We've been observing lately that some tests that involve disabling load balancing are failing (like 32646) because the expected result does not have specific anticipated CPUs. After investigation, it turns out that one factor is the profile configuration of the CPU distribution.

PAO functional tests configure fixed CPU values under the PP. This is considered misconfiguration, especially when the system has more than 4 CPUs, and there is no guarantee that the functionality of the performance profile controller will work adequately with not all cpus reflected in the CPU section in the PP.

To resolve this complication, we are introducing new environment variables RESERVED_CPU_SET, ISOLATED_CPU_SET, OFFLINED_CPU_SET, should be set the profile would use them instead of the defaults.

shajmakh · 2024-01-09T12:02:14Z

/cc @ffromani @mrniranjan @Tal-or

shajmakh · 2024-01-09T12:05:27Z

Note: The README.md file is generally outdated and needs to be updated with the new suites and instructions on adding and running them, yet this can be addressed later as it is outside of this PR scope.

ffromani

at first I was a bit taken aback by the approach here but after a more careful review and some thinking I'm growing to like it.
Do you have already a sketch about how the changes will be used e.g in openshift/release flows? (e.g. which changes to prow config will be needed?)

report.xml

test/e2e/performanceprofile/functests/0_config/config.go

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile and considering the u/s CI runs on AWS-generated clusters with constant CPU settings, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should environment settings change. Signed-off-by: shajmakh <shajmakh@redhat.com>

shajmakh · 2024-01-09T13:14:12Z

at first I was a bit taken aback by the approach here but after a more careful review and some thinking I'm growing to like it. Do you have already a sketch about how the changes will be used e.g in openshift/release flows? (e.g. which changes to prow config will be needed?)

Thank you for reviewing this.
yes here is the changes for cnf-features-deploy u/s CI:
openshift/release#47439
another similar change is needed for NTO lane.

ffromani

/approve

so this is the minimal solution to adapt the builtin hardcoded profile to different environments setting up the cpu setup correctly and consuming all CPUs.
Many other tradeoffs are possible ranging from pre-building a profile and letting the tests consume it (which should already be possible today) to fully automatically compute the right baseline CPU allocation autodetecting from the cluster at each run

where to draw the line is exactly one of the many cases about finding the correct tradeoff.

I think this PR can unblock us and enable us to further explore the solution space if we can and want (we really should FWIW) so I'm fine with this initial step.

openshift-ci · 2024-01-09T13:30:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ffromani]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ffromani

/lgtm

ffromani · 2024-01-09T16:12:00Z

/retest-required

shajmakh · 2024-01-09T16:39:31Z

/hold
failure is legit, working on a fix

ffromani · 2024-01-09T16:51:44Z

/hold failure is legit, working on a fix

right, sorry I missed it. Also, it seems we have a clear case of #881 also affecting our suites:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-node-tuning-operator/909/pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao/1744705071402192896

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile that will be used as a base for the functional tests. Given that the NTO u/s CI is using ipi deployment on vms and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should environment settings change in gcp. Signed-off-by: shajmakh <shajmakh@redhat.com>

ffromani · 2024-01-09T17:44:17Z

/hold failure is legit, working on a fix

right, sorry I missed it. Also, it seems we have a clear case of #881 also affecting our suites: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-node-tuning-operator/909/pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao/1744705071402192896

fixed in #911

openshift-ci · 2024-01-31T14:22:42Z

@shajmakh: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-techpreview
/test e2e-gcp-pao
/test e2e-gcp-pao-updating-profile
/test e2e-gcp-pao-workloadhints
/test e2e-hypershift
/test e2e-no-cluster
/test e2e-upgrade
/test images
/test unit
/test verify
/test vet

The following commands are available to trigger optional jobs:

/test lint

Use /test all to run all jobs.

In response to this:

infra issue
/retest ci/prow/e2e-gcp-pao-updating-profile

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

shajmakh · 2024-01-31T14:23:52Z

ci/prow/e2e-gcp-pao passing is already a good sign.
/retest e2e-gcp-pao-updating-profile

openshift-ci · 2024-01-31T14:24:04Z

@shajmakh: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-techpreview
/test e2e-gcp-pao
/test e2e-gcp-pao-updating-profile
/test e2e-gcp-pao-workloadhints
/test e2e-hypershift
/test e2e-no-cluster
/test e2e-upgrade
/test images
/test unit
/test verify
/test vet

The following commands are available to trigger optional jobs:

/test lint

Use /test all to run all jobs.

In response to this:

ci/prow/e2e-gcp-pao passing is already a good sign.
/retest e2e-gcp-pao-updating-profile

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

shajmakh · 2024-01-31T14:24:27Z

/test e2e-gcp-pao-updating-profile

ffromani · 2024-02-01T09:26:17Z

/retitle NO-JIRA: e2e: set performance profile cpus using env vars

openshift-ci-robot · 2024-02-01T09:26:24Z

@shajmakh: This pull request explicitly references no jira issue.

In response to this:

We've been observing lately that some tests that involve disabling load balancing are failing (like 32646) because the expected result does not have specific anticipated CPUs. After investigation, it turns out that one factor is the profile configuration of the CPU distribution.

PAO functional tests configure fixed CPU values under the PP. This is considered misconfiguration, especially when the system has more than 4 CPUs, and there is no guarantee that the functionality of the performance profile controller will work adequately with not all cpus reflected in the CPU section in the PP.

To resolve this complication, we are introducing new environment variables RESERVED_CPU_SET, ISOLATED_CPU_SET, OFFLINED_CPU_SET, should be set the profile would use them instead of the defaults.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

shajmakh · 2024-02-01T10:03:39Z

/hold cancel

shajmakh · 2024-02-01T10:04:38Z

we need this in all supported versions
/cherry-pick release-4.15

openshift-cherrypick-robot · 2024-02-01T10:04:41Z

@shajmakh: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

we need this in all supported versions
/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2024-02-01T15:28:48Z

/retest-required

Remaining retests: 0 against base HEAD 1e3a67e and 2 for PR HEAD cc40929 in total

openshift-ci · 2024-02-01T17:32:50Z

@shajmakh: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-cherrypick-robot · 2024-02-01T17:34:33Z

@shajmakh: new pull request created: #934

In response to this:

we need this in all supported versions
/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-bot · 2024-02-02T01:23:45Z

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202402020041.p0.g64ed3fe.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile and considering the u/s CI runs on AWS-generated clusters with constant CPU settings, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should environment settings change. Signed-off-by: shajmakh <shajmakh@redhat.com>

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile that will be used as a base for the functional tests. Given that the NTO u/s CI is using ipi deployment on vms and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should environment settings change in gcp. Signed-off-by: shajmakh <shajmakh@redhat.com>

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile. Considering the u/s CI runs on BM-node clusters with constant CPU settings (0-111), according to the PPC CPU calculation using must-gather data, we set the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should cluster nodes' settings change. Signed-off-by: shajmakh <shajmakh@redhat.com>

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile that will be used as a base for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should node cpu settings change. in gcp. Signed-off-by: shajmakh <shajmakh@redhat.com>

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile that will be used as a base for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should node cpu settings change. in gcp. Signed-off-by: Shereen Haj <shajmakh@redhat.com>

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile. Considering the u/s CI runs on BM-node clusters with constant CPU settings (0-111), according to the PPC CPU calculation using must-gather data, we set the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should cluster nodes' settings change. Signed-off-by: shajmakh <shajmakh@redhat.com>

Following openshift/cluster-node-tuning-operator#909, we now can provide the CPU specifications for the performance profile that will be used as a base for the functional tests. Given that the NTO and cnf-features-deploy u/s CI is using vm-node clusters and that the deployment settings are constant, according to the PPC CPU calculation using must-gather data, we export the result in environment variables indicating reserved and isolated CPUs. Note: this will need to be maintained should node cpu settings change. in gcp. Signed-off-by: Shereen Haj <shajmakh@redhat.com>

shajmakh marked this pull request as draft January 9, 2024 11:46

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2024

openshift-ci bot requested review from jlojosnegros and Tal-or January 9, 2024 11:47

shajmakh force-pushed the adaptive-pp-by-input branch from 8a1ceb9 to 031e6f5 Compare January 9, 2024 11:56

shajmakh marked this pull request as ready for review January 9, 2024 11:56

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2024

openshift-ci bot requested a review from kpouget January 9, 2024 11:58

shajmakh mentioned this pull request Jan 9, 2024

WIP: performance: e2e: configure adaptive cpu in profile #884

Closed

ffromani reviewed Jan 9, 2024

View reviewed changes

report.xml Outdated Show resolved Hide resolved

test/e2e/performanceprofile/functests/0_config/config.go Outdated Show resolved Hide resolved

shajmakh force-pushed the adaptive-pp-by-input branch from 031e6f5 to 9dd5e63 Compare January 9, 2024 12:57

shajmakh mentioned this pull request Jan 9, 2024

use env vars for reserved and isolated cpus in NTO and cnf-features-deploy openshift/release#47439

Closed

ffromani reviewed Jan 9, 2024

View reviewed changes

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 9, 2024

ffromani reviewed Jan 9, 2024

View reviewed changes

openshift-ci bot assigned ffromani Jan 9, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2024

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2024

shajmakh force-pushed the adaptive-pp-by-input branch from 9dd5e63 to 1d1907d Compare January 9, 2024 17:01

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2024

openshift-ci bot changed the title ~~e2e: set performance profile cpus using env vars~~ NO-JIRA: e2e: set performance profile cpus using env vars Feb 1, 2024

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 1, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2024

openshift-merge-bot bot merged commit 64ed3fe into openshift:master Feb 1, 2024
15 checks passed

openshift-cherrypick-robot mentioned this pull request Feb 1, 2024

NO-JIRA: [release-4.15] e2e: set performance profile cpus using env vars #934

Open

shajmakh mentioned this pull request Feb 6, 2024

teclo5g: use env vars for reserved and isolated cpus openshift/release#48512

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NO-JIRA: e2e: set performance profile cpus using env vars #909

NO-JIRA: e2e: set performance profile cpus using env vars #909

shajmakh commented Jan 9, 2024

shajmakh commented Jan 9, 2024

shajmakh commented Jan 9, 2024

ffromani left a comment

shajmakh commented Jan 9, 2024

ffromani left a comment

openshift-ci bot commented Jan 9, 2024

ffromani left a comment

ffromani commented Jan 9, 2024

shajmakh commented Jan 9, 2024

ffromani commented Jan 9, 2024

ffromani commented Jan 9, 2024

openshift-ci bot commented Jan 31, 2024

shajmakh commented Jan 31, 2024

openshift-ci bot commented Jan 31, 2024

shajmakh commented Jan 31, 2024

ffromani commented Feb 1, 2024

openshift-ci-robot commented Feb 1, 2024

shajmakh commented Feb 1, 2024

shajmakh commented Feb 1, 2024

openshift-cherrypick-robot commented Feb 1, 2024

openshift-ci-robot commented Feb 1, 2024

openshift-ci bot commented Feb 1, 2024

openshift-cherrypick-robot commented Feb 1, 2024

openshift-bot commented Feb 2, 2024

NO-JIRA: e2e: set performance profile cpus using env vars #909

NO-JIRA: e2e: set performance profile cpus using env vars #909

Conversation

shajmakh commented Jan 9, 2024

shajmakh commented Jan 9, 2024

shajmakh commented Jan 9, 2024

ffromani left a comment

Choose a reason for hiding this comment

shajmakh commented Jan 9, 2024

ffromani left a comment

Choose a reason for hiding this comment

openshift-ci bot commented Jan 9, 2024

ffromani left a comment

Choose a reason for hiding this comment

ffromani commented Jan 9, 2024

shajmakh commented Jan 9, 2024

ffromani commented Jan 9, 2024

ffromani commented Jan 9, 2024

openshift-ci bot commented Jan 31, 2024

shajmakh commented Jan 31, 2024

openshift-ci bot commented Jan 31, 2024

shajmakh commented Jan 31, 2024

ffromani commented Feb 1, 2024

openshift-ci-robot commented Feb 1, 2024

shajmakh commented Feb 1, 2024

shajmakh commented Feb 1, 2024

openshift-cherrypick-robot commented Feb 1, 2024

openshift-ci-robot commented Feb 1, 2024

openshift-ci bot commented Feb 1, 2024

openshift-cherrypick-robot commented Feb 1, 2024

openshift-bot commented Feb 2, 2024