
OCPBUGS-24581: rps: fix mask update for SR-IOV devices #877

Merged
merged 2 commits into from Dec 16, 2023

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Dec 12, 2023

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).
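A minimal sketch of how a script could dispatch on these two scenarios (the function names and the `SYSFS_ROOT` parameter are illustrative assumptions for this example, not the exact set-rps-mask.sh merged in this PR):

```shell
# Illustrative sketch of the two RPS-mask scenarios.
# SYSFS_ROOT is parameterized so the logic can be exercised outside /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Scenario 1: a single queue was added; write the mask to that queue.
# The write may fail if the device was renamed between the udev ADD event
# and this invocation, so the error is swallowed (return, not exit, to keep
# the function composable) and the MOVE handler gets to retry later.
set_rps_mask_for_queue() {
    local queue_path="$1" mask="$2"
    echo "${mask}" 2>/dev/null > "${SYSFS_ROOT}${queue_path}/rps_cpus" || return 0
}

# Scenario 2: the device was moved (renamed); refresh every rx queue under
# the device's new name.
set_rps_mask_for_dev() {
    local dev="$1" mask="$2" q
    for q in "${SYSFS_ROOT}/class/net/${dev}/queues/"rx-*; do
        [ -d "${q}" ] || continue
        echo "${mask}" 2>/dev/null > "${q}/rps_cpus" || :
    done
}
```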

The journal was spammed with warnings about the systemd unit name
not being escaped.

We use systemd-escape in order to escape the unit name properly.
`--path` tells systemd-escape that the input is a valid file path name.
`--template` tells systemd-escape to insert the escaped string into a unit name template.

`%c` expands to the output of `PROGRAM`.
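Putting these pieces together, the pair of udev rules could look roughly like the sketch below. The rule matches, the script path, and the `update-rps@.service` template name are illustrative assumptions, not the exact rules merged here:

```
# Illustrative udev rules, not the PR's exact rule file.
# ADD: queues appear; PROGRAM escapes the device path into a unit name
# (e.g. systemd-escape --template=update-rps@.service --path /sys/devices/...
# yields update-rps@sys-devices-....service), and %c expands to that output.
ACTION=="add", SUBSYSTEM=="net", PROGRAM="/usr/bin/systemd-escape --template=update-rps@.service --path /sys%p", ENV{SYSTEMD_WANTS}+="%c"

# MOVE: the device was renamed (e.g. an SR-IOV VF); re-trigger the mask
# update so queues that failed during ADD get a second chance.
ACTION=="move", SUBSYSTEM=="net", PROGRAM="/usr/bin/systemd-escape --template=update-rps@.service --path /sys%p", ENV{SYSTEMD_WANTS}+="%c"
```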

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2023
@openshift-ci-robot
Contributor

@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Contributor Author

Tal-or commented Dec 12, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2023
@openshift-ci-robot
Contributor

@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@ffromani ffromani left a comment

inline comments about implementation

# the 'echo' command might fail if the device path the queue belongs to has changed
# this can happen when SR-IOV devices are renamed
exit 0
Contributor

The idea is ok but forcing an exit 0 will make the function non-composable. The calling program will always exit here, won't it? If we want to swallow the error from echo, I'd either use return 0 or echo "${mask}" 2> /dev/null > "/sys${queue_path}${queue_num}/rps_cpus" || :
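The composability concern can be seen in a toy reproduction (the function names here are hypothetical):

```shell
# exit 0 inside a function terminates the calling script (or subshell),
# so statements after the call never run; return 0 only ends the function.
swallow_with_exit()   { exit 0; }
swallow_with_return() { return 0; }

after_exit=$(swallow_with_exit; echo "reached")     # subshell dies at exit 0
after_return=$(swallow_with_return; echo "reached") # echo still runs
```

Hence the suggestion: swallow the echo failure with `|| :` or `return 0`, and let the caller decide whether to terminate.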

Contributor

We should return 0 in the main flow context (at the end of the file) instead of the function I think

Contributor Author

We don't want to exit 0 in the main flow because we still want to catch errors if the second function is being called

Contributor Author

The set_net_dev_rps one

Contributor

ok, we do want to swallow the error of echo on line 12 though, right?

Contributor Author

Yes, and you're right, we should return 0 and not exit. It would basically have the same effect but will keep the function composable.

# replace '/' with '-'
queue_num="${queue_num/\//-}"

# replace x2d with hyphen (-) which is an escaped character
Contributor

does systemd-escape support reverse functionality (un-escape) and if so, do we want to use it?

Contributor Author

@Tal-or Tal-or Dec 13, 2023

It supports reverse.
The original systemd-escape output is
`\x2d`, but the backslash gets removed when we pass the argument to the script. Due to that, it failed to revert this part.
Besides that, when we call the script we invoke it under the systemd unit and pass the path argument with `%I`, which basically performs the unescaping for us and is cheaper than spawning another process for the systemd-escape call here.
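The manual unescaping discussed here reduces to a bash parameter expansion. A sketch, assuming the backslash has already been stripped so the escape arrives as a literal `x2d` (the function name is illustrative):

```shell
# systemd-escape renders '-' as '\x2d'; by the time the argument reaches
# the script the backslash is gone, so we map the literal 'x2d' back to '-'.
unescape_queue_num() {
    local s="$1"
    printf '%s\n' "${s//x2d/-}"
}
```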

Contributor

It surely is, but it's also asymmetric: we do the escaping using systemd-escape, but the unescaping ourselves. We save resources, but we need to ensure consistency ourselves.
Now: this can very much be a path we want to choose, but let's make sure it's intentional, agreed, explicit.

Contributor Author

Yes, I agree it's kind of odd, and if the de-escaping were perfect I would definitely choose this path.
But I really didn't understand why the backslash gets omitted. It's probably related to how the systemd unit passes the argument.

Contributor

possibly true. Considering the nature of this bug I won't block for this, even though I think it would deserve a followup

@Tal-or
Contributor Author

Tal-or commented Dec 13, 2023

/retest

@Tal-or Tal-or force-pushed the fix_rps_for_sriov_devs branch 2 times, most recently from fe78d34 to 5594691 Compare December 14, 2023 14:18
@Tal-or Tal-or changed the title OCPBUGS-24581: rps: fix mask update for SRI-OV devices OCPBUGS-24581: rps: fix mask update for SR-IOV devices Dec 14, 2023
@openshift-ci-robot
Contributor

@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ffromani
Contributor

/approve

we want this fix

@ffromani
Contributor

Getting there. A comment inside, and we need to update manual_machineconfig.yaml.

@Tal-or Tal-or force-pushed the fix_rps_for_sriov_devs branch 2 times, most recently from 202d813 to 4c26b48 Compare December 14, 2023 15:37
Contributor

openshift-ci bot commented Dec 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2023
@ffromani
Contributor

/lgtm
/hold

to give time to other reviewers

the indentation was lost, but not blocking for now

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2023
@MarSik
Contributor

MarSik commented Dec 15, 2023

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2023
@ffromani
Contributor

/lgtm

@ffromani
Contributor

/hold cancel

we have a second review now

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2023
@ffromani
Contributor

/retest-required

flakes + infra issues (crio overloaded?!?)

@ffromani
Contributor

/retest

@ffromani
Contributor

ffromani commented Dec 15, 2023

/retest-required

(the failed lane didn't restart)

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 778695d and 2 for PR HEAD 55a1c6d in total

@ffromani
Contributor

ok, let's hold the retests. It's either an infra issue or the rules are actually causing excessive load. Investigating.

@ffromani
Contributor

ffromani commented Dec 15, 2023

/retest-required

we strongly suspect it's a flake/infra issue because the lane passed previously without relevant updates being done

@ffromani
Contributor

/retest-required

we strongly suspect it's a flake because the lane passed previously without relevant updates being done

 Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/5: FROM image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec AS builder
Trying to pull image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec...
error: build error: creating build container: parsing image configuration: fetching blob: StatusCode: 400, 

@ffromani
Contributor

/test e2e-gcp-pao-workloadhints

@ffromani
Contributor

/test e2e-gcp-pao-workloadhints

 Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/5: FROM image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec AS builder
Trying to pull image-registry.openshift-image-registry.svc:5000/ci-op-fjyqjw1h/pipeline@sha256:d336e16fa089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec...
error: build error: creating build container: parsing image configuration: fetching blob: StatusCode: 400, 
INFO[2023-12-15T13:11:12Z] Building cluster-node-tuning-operator-us-tuned 
INFO[2023-12-15T13:11:12Z] Found existing build "cluster-node-tuning-operator-us-tuned-amd64" 
INFO[2023-12-15T13:11:12Z] Build cluster-node-tuning-operator-us-tuned-amd64 succeeded after 4m10s 
INFO[2023-12-15T13:11:12Z] Image ci-op-fjyqjw1h/pipeline:cluster-node-tuning-operator-us-tuned created  for-build=cluster-node-tuning-operator-us-tuned
INFO[2023-12-15T13:11:12Z] Tagging cluster-node-tuning-operator-us-tuned into stable 
INFO[2023-12-15T13:11:12Z] Ran for 3s                                   
ERRO[2023-12-15T13:11:12Z] Some steps failed:                           
ERRO[2023-12-15T13:11:12Z] 
  * could not run steps: step cluster-node-tuning-operator failed: error occurred handling build cluster-node-tuning-operator-amd64: the build cluster-node-tuning-operator-amd64 failed after 1m9s with reason DockerBuildFailed: Dockerfile build strategy has failed.
Writing manifest to image destination
Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/5: FROM image-registry.openshift-image-regist...51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec AS builder
Trying to pull image-registry.openshift-image-registry.svc...a089260e51b888c7acbd69728c1eac3a7e858f0e1755e396d81302ec...
error: build error: creating build container: parsing image configuration: fetching blob: StatusCode: 400, 
INFO[2023-12-15T13:11:12Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image' 

@MarSik
Contributor

MarSik commented Dec 15, 2023

/cherry-pick release-4.15

@openshift-cherrypick-robot

@MarSik: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ffromani
Contributor

/test e2e-gcp-pao-workloadhints

@MarSik
Contributor

MarSik commented Dec 15, 2023

/retest-required

@ffromani
Contributor

/retest-required

@MarSik
Contributor

MarSik commented Dec 15, 2023

/retest-required

1 similar comment
@EinatGlottmann

/retest-required

@Tal-or
Contributor Author

Tal-or commented Dec 16, 2023

/test e2e-aws-ovn

Contributor

openshift-ci bot commented Dec 16, 2023

@Tal-or: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit c9a93af into openshift:master Dec 16, 2023
14 checks passed
@openshift-ci-robot
Contributor

@Tal-or: Jira Issue OCPBUGS-24581: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-24581 has been moved to the MODIFIED state.

In response to this:

SR-IOV devices get moved (renamed) upon their creation.
This causes the set-rps-mask.sh to fail since the queue's path has changed.

We should add an additional udev rule to act upon the move of a physical device
and set the rps mask correctly.

The set-rps-mask.sh script has been modified to handle these two
different scenarios:

  1. when queues are being added.
  2. when a net device is moved.

In addition, we fail silently (exit 0) when queues fail to get updated
for the reason mentioned above.
The queues that failed to get updated are expected to be updated
by the instance that gets triggered after the device move (renaming).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@MarSik: new pull request created: #886

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202312162050.p0.gc9a93af.assembly.stream for distgit cluster-node-tuning-operator.
All builds following this will include this PR.
