OCPBUGS-24581: rps: fix mask update for SR-IOV devices #877
The journal is spammed by warnings that the systemd unit name is not escaped. We use systemd-escape to escape the unit name properly. `--path` tells systemd-escape that the input is a valid file path name. `--template` tells systemd-escape to insert the escaped string into a unit name template. %c accepts the output of `PROGRAM`.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
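A minimal sketch of the escaping described above, assuming systemd-escape is installed. The queue path and the template name `update-rps@.service` are placeholders chosen for illustration, not values taken from this PR.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration only (not taken from the PR).
queue_path="/devices/virtual/net/eth-0/queues/rx-0"
template="update-rps@.service"

if command -v systemd-escape >/dev/null 2>&1; then
  # --path treats the input as a file path; --template inserts the
  # escaped string into the instance part of the unit name template.
  unit_name="$(systemd-escape --path --template="${template}" "${queue_path}")"
  echo "unit name: ${unit_name}"

  # --unescape reverses the escaping of the bare instance string.
  escaped="$(systemd-escape --path "${queue_path}")"
  restored="$(systemd-escape --unescape --path "${escaped}")"
  echo "restored: ${restored}"
else
  echo "systemd-escape not found; skipping"
fi
```

The round trip only works while the backslashes in the escaped string survive, which is why quoting matters wherever the value is passed along.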
@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
inline comments about implementation
# the 'echo' command might failed if the device path which the queue belongs to has changes
# this can happen in case of SRI-OV devices renaming
exit 0
The idea is ok, but forcing an `exit 0` will make the function non-composable. The calling program will always exit here, won't it? If we want to swallow the error from echo, I'd either use `return 0` or `echo "${mask}" 2> /dev/null > "/sys${queue_path}${queue_num}/rps_cpus" || :`
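The difference between the two options can be sketched as follows. `set_queue_mask` and the target path are hypothetical stand-ins for illustration, not the actual function from `set-rps-mask.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the script's mask-writing function.
set_queue_mask() {
  local mask="$1" target="$2"
  # Swallow a failed write (for example, the path vanished after a
  # device rename) and hand control back to the caller. An 'exit 0'
  # here would terminate the whole script instead.
  echo "${mask}" 2>/dev/null > "${target}" || return 0
}

# The write fails because the directory does not exist, yet the
# script keeps running and can still call other functions.
set_queue_mask "f" "/nonexistent/dir/rps_cpus"
echo "still running, status=$?"
```

With `return 0` the last line prints `still running, status=0`; with `exit 0` inside the function it would never be reached.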
We should return 0 in the main flow context (at the end of the file) instead of the function I think
We don't want to exit 0 in the main flow because we still want to catch errors if the second function is being called
The `set_net_dev_rps` one
ok, we do want to swallow the error of `echo` on line 12 though, right?
Yes, and you're right, we should return 0 and not exit. It would basically have the same effect but will keep the function composable.
# replace '/' with '-'
queue_num="${queue_num/\//-}"

# replace x2d with hyphen (-) which is an escaped character
Does `systemd-escape` support reverse functionality (un-escape) and if so, do we want to use it?
It supports reverse.
The original systemd-escape output is `\x2d`, but the backslash gets removed when we pass the argument to the script, so the unescape failed to revert this part.
Besides that, we invoke the script from the systemd unit and pass the path argument with `%I`, which basically performs the unescaping for us and is cheaper than spawning another process for a systemd-escape call here.
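A rough sketch of the manual fix-up discussed here, using bash parameter expansion so no extra process is spawned. The function name and the sample path are hypothetical, and the snippet is not the PR's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore hyphens whose escape sequence lost its
# backslash on the way to the script ("\x2d" arrived as "x2d").
restore_hyphens() {
  local s="$1"
  # ${var//pattern/replacement} replaces every occurrence in-process.
  printf '%s\n' "${s//x2d/-}"
}

restore_hyphens "/devices/virtual/net/ensx2d1/queues/rxx2d0"
```

This prints `/devices/virtual/net/ens-1/queues/rx-0`. Note that a device name containing a literal `x2d` would be rewritten as well, which is one consequence of doing the unescaping by hand.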
it surely is, but it's also asymmetric. We do escaping using systemd-escape, we do unescaping ourselves. We save resources, we need to ensure the consistency ourselves.
Now: this can very much be a path we want to choose, but let's make sure it's intentional, agreed, explicit.
Yes, I agree it's kind of odd, and if the unescaping were perfect I would definitely choose that path.
But I really didn't understand why the backslash gets omitted. It's probably related to how the systemd unit passes the argument.
possibly true. Considering the nature of this bug I won't block for this, even though I think it would deserve a followup
/retest |
fe78d34 to 5594691
@Tal-or: This pull request references Jira Issue OCPBUGS-24581, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker.
5594691 to d4759d6
/approve we want this fix |
getting there. A comment inside, and need to update |
202d813 to 4c26b48
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani, Tal-or The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm to give time to other reviewers; the indentation was lost, but not blocking for now
/retest-required |
/lgtm |
/hold cancel we have second review now |
/retest-required flakes + infra issues (crio overloaded?!?) |
/retest |
/retest-required (the failed lane didn't restart) |
ok, let's hold the retests. It's either infra issue or the rules are actually causing excessive load. Investigating. |
/retest-required we have strong suspects it's a flake/infra issue because the lane passed previously without relevant updates being done |
/test e2e-gcp-pao-workloadhints |
/cherry-pick release-4.15 |
@MarSik: once the present PR merges, I will cherry-pick it on top of release-4.15 in a new PR and assign it to you.
/test e2e-gcp-pao-workloadhints |
/retest-required |
/retest- required |
/retest-required |
/retest-required |
/test e2e-aws-ovn |
@Tal-or: all tests passed!
@Tal-or: Jira Issue OCPBUGS-24581: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-24581 has been moved to the MODIFIED state.
@MarSik: new pull request created: #886
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202312162050.p0.gc9a93af.assembly.stream for distgit cluster-node-tuning-operator. |
SR-IOV devices get moved (renamed) upon their creation.
This causes `set-rps-mask.sh` to fail since the queue's path has changed.
We should add an additional udev rule to act upon the move of a physical device and set the RPS mask correctly.
The `set-rps-mask.sh` script has been modified to act upon those two different scenarios:
In addition, we fail silently (exit 0) when queues fail to get updated for the reason mentioned above.
The queues that failed to get updated are expected to be updated by the instance that gets triggered after the device move (renaming).