
OCPBUGS-24581: rps: fix mask update for SR-IOV devices #877

Merged
merged 2 commits into from Dec 16, 2023

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Dec 12, 2023

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).
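A minimal sketch of how a script could dispatch on these two scenarios (the function names and the `SYSFS_ROOT` parameter are illustrative assumptions for this example, not the exact set-rps-mask.sh merged in this PR):

```shell
# Illustrative sketch of the two RPS-mask scenarios.
# SYSFS_ROOT is parameterized so the logic can be exercised outside /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Scenario 1: a single queue was added; write the mask to that queue.
# The write may fail if the device was renamed between the udev ADD event
# and this invocation, so the error is swallowed (return, not exit, to keep
# the function composable) and the MOVE handler gets to retry later.
set_rps_mask_for_queue() {
    local queue_path="$1" mask="$2"
    echo "${mask}" 2>/dev/null > "${SYSFS_ROOT}${queue_path}/rps_cpus" || return 0
}

# Scenario 2: the device was moved (renamed); refresh every rx queue under
# the device's new name.
set_rps_mask_for_dev() {
    local dev="$1" mask="$2" q
    for q in "${SYSFS_ROOT}/class/net/${dev}/queues/"rx-*; do
        [ -d "${q}" ] || continue
        echo "${mask}" 2>/dev/null > "${q}/rps_cpus" || :
    done
}
```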

The journal was spammed with warnings about the systemd unit name
not being escaped.

We use systemd-escape in order to escape the unit name properly.
`--path` tells systemd-escape that the input is a valid file path name.
`--template` tells systemd-escape to insert the escaped string into a unit name template.

`%c` expands to the output of `PROGRAM`.
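Putting these pieces together, the pair of udev rules could look roughly like the sketch below. The rule matches, the script path, and the `update-rps@.service` template name are illustrative assumptions, not the exact rules merged here:

```
# Illustrative udev rules, not the PR's exact rule file.
# ADD: queues appear; PROGRAM escapes the device path into a unit name
# (e.g. systemd-escape --template=update-rps@.service --path /sys/devices/...
# yields update-rps@sys-devices-....service), and %c expands to that output.
ACTION=="add", SUBSYSTEM=="net", PROGRAM="/usr/bin/systemd-escape --template=update-rps@.service --path /sys%p", ENV{SYSTEMD_WANTS}+="%c"

# MOVE: the device was renamed (e.g. an SR-IOV VF); re-trigger the mask
# update so queues that failed during ADD get a second chance.
ACTION=="move", SUBSYSTEM=="net", PROGRAM="/usr/bin/systemd-escape --template=update-rps@.service --path /sys%p", ENV{SYSTEMD_WANTS}+="%c"
```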

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2023
@openshift-ci-robot
Contributor

@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Contributor Author

Tal-or commented Dec 12, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2023
@openshift-ci-robot
Contributor

@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@ffromani ffromani left a comment

inline comments about implementation

# the 'echo' command might fail if the device path the queue belongs to has changed
# this can happen when SR-IOV devices are renamed
exit 0
Contributor

The idea is ok but forcing an exit 0 will make the function non-composable. The calling program will always exit here, won't it? If we want to swallow the error from echo, I'd either use return 0 or echo "${mask}" 2> /dev/null > "/sys${queue_path}${queue_num}/rps_cpus" || :
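The composability concern can be seen in a toy reproduction (the function names here are hypothetical):

```shell
# exit 0 inside a function terminates the calling script (or subshell),
# so statements after the call never run; return 0 only ends the function.
swallow_with_exit()   { exit 0; }
swallow_with_return() { return 0; }

after_exit=$(swallow_with_exit; echo "reached")     # subshell dies at exit 0
after_return=$(swallow_with_return; echo "reached") # echo still runs
```

Hence the suggestion: swallow the echo failure with `|| :` or `return 0`, and let the caller decide whether to terminate.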

Contributor

We should return 0 in the main flow context (at the end of the file) instead of the function I think

Contributor Author

We don't want to exit 0 in the main flow because we still want to catch errors if the second function is being called

Contributor Author

The set_net_dev_rps one

Contributor

ok, we do want to swallow the error of echo on line 12 though, right?

Contributor Author

Yes, and you're right, we should return 0 and not exit. It would basically have the same effect but will keep the function composable.

# replace '/' with '-'
queue_num="${queue_num/\//-}"

# replace x2d with hyphen (-) which is an escaped character
Contributor

does systemd-escape support reverse functionality (un-escape) and if so, do we want to use it?

Contributor Author

@Tal-or Tal-or Dec 13, 2023

It supports reverse.
The original systemd-escape output is
`\x2d`, but the backslash gets removed when we pass the argument to the script. Due to that, it failed to revert this part.
Besides that, when we call the script we invoke it under the systemd unit and pass the path argument with `%I`, which basically performs the unescaping for us and is cheaper than spawning another process for the systemd-escape call here.
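The manual unescaping discussed here reduces to a bash parameter expansion. A sketch, assuming the backslash has already been stripped so the escape arrives as a literal `x2d` (the function name is illustrative):

```shell
# systemd-escape renders '-' as '\x2d'; by the time the argument reaches
# the script the backslash is gone, so we map the literal 'x2d' back to '-'.
unescape_queue_num() {
    local s="$1"
    printf '%s\n' "${s//x2d/-}"
}
```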

Contributor

It surely is, but it's also asymmetric: we do the escaping using systemd-escape, but the unescaping ourselves. We save resources, but we need to ensure consistency ourselves.
Now: this can very much be a path we want to choose, but let's make sure it's intentional, agreed, explicit.

Contributor Author

Yes, I agree it's kind of odd, and if the de-escaping were perfect I would definitely choose this path.
But I really didn't understand why the backslash gets omitted. It's probably related to how the systemd unit passes the argument.

Contributor

possibly true. Considering the nature of this bug I won't block for this, even though I think it would deserve a followup

@Tal-or
Contributor Author

Tal-or commented Dec 13, 2023

/retest

@Tal-or Tal-or force-pushed the fix_rps_for_sriov_devs branch 2 times, most recently from fe78d34 to 5594691 Compare December 14, 2023 14:18
@Tal-or Tal-or changed the title OCPBUGS-24581: rps: fix mask update for SRI-OV devices OCPBUGS-24581: rps: fix mask update for SR-IOV devices Dec 14, 2023
@openshift-ci-robot
Contributor

@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ffromani
Contributor

/approve

we want this fix

@ffromani
Contributor

Getting there. A comment inside, and we need to update manual_machineconfig.yaml.

@Tal-or Tal-or force-pushed the fix_rps_for_sriov_devs branch 2 times, most recently from 202d813 to 4c26b48 Compare December 14, 2023 15:37
Contributor

openshift-ci bot commented Dec 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2023
@ffromani
Contributor

/lgtm
/hold

to give time to other reviewers

the indentation was lost, but not blocking for now

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2023
@MarSik
Contributor

MarSik commented Dec 15, 2023

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2023
@ffromani
Contributor

/lgtm

@ffromani
Contributor

/hold cancel

we have a second review now

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2023
@ffromani
Contributor

/retest-required

flakes + infra issues (crio overloaded?!?)

@ffromani
Contributor

/retest

@ffromani
Contributor

ffromani commented Dec 15, 2023

/retest-required

(the failed lane didn't restart)

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 778695d and 2 for PR HEAD 55a1c6d in total

@ffromani
Contributor

ok, let's hold the retests. It's either an infra issue or the rules are actually causing excessive load. Investigating.

@ffromani
Contributor

ffromani commented Dec 15, 2023

/retest-required

we strongly suspect it's a flake/infra issue because the lane passed previously without relevant updates being done

@ffromani
Contributor

/retest-required

we strongly suspect it's a flake because the lane passed previously without relevant updates being done

 Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/5: FROM image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec AS builder
Trying to pull image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec...
error: build error: creating build container: parsing image configuration: fetching blob: StatusCode: 400, 

@ffromani
Contributor

/test e2e-gcp-pao-workloadhints

@ffromani
Contributor

/test e2e-gcp-pao-workloadhints

 Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/5: FROM image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec AS builder
Trying to pull image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec...
error: build error: creating build container: parsing image configuration: fetching blob: StatusCode: 400, 
INFO[2023-12-15T13:11:12Z] Building cluster-node-tuning-operator-us-tuned 
INFO[2023-12-15T13:11:12Z] Found existing build "cluster-node-tuning-operator-us-tuned-amd64" 
INFO[2023-12-15T13:11:12Z] Build cluster-node-tuning-operator-us-tuned-amd64 succeeded after 4m10s 
INFO[2023-12-15T13:11:12Z] Image ci-op-fjyqjw1h/pipeline:cluster-node-tuning-operator-us-tuned created  for-build=cluster-node-tuning-operator-us-tuned
INFO[2023-12-15T13:11:12Z] Tagging cluster-node-tuning-operator-us-tuned into stable 
INFO[2023-12-15T13:11:12Z] Ran for 3s                                   
ERRO[2023-12-15T13:11:12Z] Some steps failed:                           
ERRO[2023-12-15T13:11:12Z] 
  * could not run steps: step cluster-node-tuning-operator failed: error occurred handling build cluster-node-tuning-operator-amd64: the build cluster-node-tuning-operator-amd64 failed after 1m9s with reason DockerBuildFailed: Dockerfile build strategy has failed.
Writing manifest to image destination
Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/5: FROM image-registry.openshift-image-regist...51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec AS builder
Trying to pull image-registry.openshift-image-registry.svc...a089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec...
error: build error: creating build container: parsing image configuration: fetching blob: StatusCode: 400, 
INFO[2023-12-15T13:11:12Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image' 

@MarSik
Contributor

MarSik commented Dec 15, 2023

/cherry-pick release-4.15

@openshift-cherrypick-robot

@MarSik: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ffromani
Contributor

/test e2e-gcp-pao-workloadhints

@MarSik
Contributor

MarSik commented Dec 15, 2023

/retest-required

@ffromani
Contributor

/retest-required

@MarSik
Contributor

MarSik commented Dec 15, 2023

/retest-required

1 similar comment
@EinatGlottmann

/retest-required

@Tal-or
Contributor Author

Tal-or commented Dec 16, 2023

/test e2e-aws-ovn

Contributor

openshift-ci bot commented Dec 16, 2023

@Tal-or: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit c9a93af into openshift:master Dec 16, 2023
14 checks passed
@openshift-ci-robot
Contributor

@Tal-or: Jira Issue OCPBUGS-24581: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-24581 has been moved to the MODIFIED state.

In response to this:

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@MarSik: new pull request created: #886

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202312162050.p0.gc9a93af.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.
