-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check_cmd_dmesg() Reason Strings Cause Problems #143
Labels
Projects
Milestone
Comments
mej
added a commit
that referenced
this issue
Sep 18, 2023
When using `check_cmd_dmesg()` directly (as written in `scripts/lbnl_cmd.nhc`) with a negated match string, the default behavior of `check_cmd_output()` (which `check_cmd_dmesg()` wraps) used for error reporting causes the "Reason" field to contain not only the match string that was found (and shouldn't have been) but also the **_line number_** where the match was found. In the case of `dmesg` output, the line number is almost completely useless; moreover, it prevents Slurm and other schedulers/RMs from being able to group all the affected nodes together -- because the line numbers almost always differ! Granted that users/admins can override the default failure message generation behavior (via `-M` entries, all of which are passed directly to `check_cmd_output()`), but in the specific case of `check_cmd_dmesg()`, I think the default behavior should suppress the line numbers and use a simpler, more concise message instead. This changeset adds a bit of pre-processing to `check_cmd_dmesg()`. Each "mstr/message" pair" is examined, and for each match string (`-m` argument) that doesn't have a corresponding message (`-M` argument) that overrides the default will have a new default provided to it that omits the extraneous information. Addresses #143.
mej
added a commit
that referenced
this issue
Sep 19, 2023
When using `check_cmd_dmesg()` directly (as written in `scripts/lbnl_cmd.nhc`) with a negated match string, the default behavior of `check_cmd_output()` (which `check_cmd_dmesg()` wraps) used for error reporting causes the "Reason" field to contain not only the match string that was found (and shouldn't have been) but also the **_line number_** where the match was found. In the case of `dmesg` output, the line number is almost completely useless; moreover, it prevents Slurm and other schedulers/RMs from being able to group all the affected nodes together -- because the line numbers almost always differ! Granted that users/admins can override the default failure message generation behavior (via `-M` entries, all of which are passed directly to `check_cmd_output()`), but in the specific case of `check_cmd_dmesg()`, I think the default behavior should suppress the line numbers and use a simpler, more concise message instead. This changeset does exactly that by adding a bit of pre-processing to the command-line arguments passed to `check_cmd_dmesg()` before passing them on to `check_cmd_output()`. Each match string (`-m` argument) that doesn't already have a corresponding message (`-M` argument) to override the default will have a new default provided to it that omits the extraneous information. In other words, any `-m`_`mstr`_ that already has a matching `-M`_`message`_ will be passed on to `check_cmd_output()` exactly as it is; any `-m`_`mstr`_ that _**lacks**_ a corresponding `-M`_`message`_ — or that has an **empty** _`message`_ as a placeholder — will be assigned a new `-M`_`message`_ that gets passed to `check_cmd_output()` without any line number or other dynamic information. Fixes #143.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
When using
check_cmd_dmesg()
directly (as written inscripts/lbnl_cmd.nhc
) with a negated match string, the default behavior ofcheck_cmd_output()
(whichcheck_cmd_dmesg()
wraps) used for error reporting causes the "Reason" field to contain not only the match string that was found (and shouldn't have been) but also the line number where the match was found. In the case ofdmesg
output, the line number is almost completely useless; moreover, it prevents Slurm and other schedulers/RMs from being able to group all the affected nodes together -- because the line numbers almost always differ!Granted that users/admins can override the default failure message generation behavior (via
-M
entries, all of which are passed directly tocheck_cmd_output()
), but in the specific case ofcheck_cmd_dmesg()
, I think the default behavior should suppress the line numbers and use a simpler, more concise message instead.The text was updated successfully, but these errors were encountered: