Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check_cmd_dmesg() Reason Strings Cause Problems #143

Closed
mej opened this issue Sep 8, 2023 · 0 comments · Fixed by #144
Closed

check_cmd_dmesg() Reason Strings Cause Problems #143

mej opened this issue Sep 8, 2023 · 0 comments · Fixed by #144
Assignees
Labels
enhancement UI/UX boost Improvements in the user interface/experience
Milestone

Comments

@mej
Copy link
Owner

mej commented Sep 8, 2023

When using check_cmd_dmesg() directly (as written in scripts/lbnl_cmd.nhc) with a negated match string, the default behavior of check_cmd_output() (which check_cmd_dmesg() wraps) used for error reporting causes the "Reason" field to contain not only the match string that was found (and shouldn't have been) but also the line number where the match was found. In the case of dmesg output, the line number is almost completely useless; moreover, it prevents Slurm and other schedulers/RMs from being able to group all the affected nodes together -- because the line numbers almost always differ!

Granted that users/admins can override the default failure message generation behavior (via -M entries, all of which are passed directly to check_cmd_output()), but in the specific case of check_cmd_dmesg(), I think the default behavior should suppress the line numbers and use a simpler, more concise message instead.

@mej mej changed the title check_cmd_dmesg() check_cmd_dmesg() Reason Strings Cause Problems Sep 8, 2023
@mej mej self-assigned this Sep 8, 2023
@mej mej added enhancement UI/UX boost Improvements in the user interface/experience labels Sep 8, 2023
@mej mej added this to Triage / TODO in NHC 1.5 Release via automation Sep 8, 2023
@mej mej added this to the 1.5 Release milestone Sep 8, 2023
mej added a commit that referenced this issue Sep 18, 2023
When using `check_cmd_dmesg()` directly (as written in `scripts/lbnl_cmd.nhc`)
with a negated match string, the default behavior of
`check_cmd_output()` (which `check_cmd_dmesg()` wraps) used for error
reporting causes the "Reason" field to contain not only the match
string that was found (and shouldn't have been) but also the **_line
number_** where the match was found.  In the case of `dmesg` output,
the line number is almost completely useless; moreover, it prevents
Slurm and other schedulers/RMs from being able to group all the
affected nodes together -- because the line numbers almost always
differ!

Granted that users/admins can override the default failure message
generation behavior (via `-M` entries, all of which are passed
directly to `check_cmd_output()`), but in the specific case of
`check_cmd_dmesg()`, I think the default behavior should suppress the
line numbers and use a simpler, more concise message instead.

This changeset adds a bit of pre-processing to `check_cmd_dmesg()`.
Each "mstr/message" pair" is examined, and for each match string (`-m`
argument) that doesn't have a corresponding message (`-M` argument)
that overrides the default will have a new default provided to it that
omits the extraneous information.

Addresses #143.
mej added a commit that referenced this issue Sep 19, 2023
When using `check_cmd_dmesg()` directly (as written in `scripts/lbnl_cmd.nhc`)
with a negated match string, the default behavior of
`check_cmd_output()` (which `check_cmd_dmesg()` wraps) used for error
reporting causes the "Reason" field to contain not only the match
string that was found (and shouldn't have been) but also the **_line
number_** where the match was found.  In the case of `dmesg` output,
the line number is almost completely useless; moreover, it prevents
Slurm and other schedulers/RMs from being able to group all the
affected nodes together -- because the line numbers almost always
differ!

Granted that users/admins can override the default failure message
generation behavior (via `-M` entries, all of which are passed
directly to `check_cmd_output()`), but in the specific case of
`check_cmd_dmesg()`, I think the default behavior should suppress the
line numbers and use a simpler, more concise message instead.

This changeset does exactly that by adding a bit of pre-processing to
the command-line arguments passed to `check_cmd_dmesg()` before
passing them on to `check_cmd_output()`.  Each match string (`-m`
argument) that doesn't already have a corresponding message (`-M`
argument) to override the default will have a new default provided to
it that omits the extraneous information.  In other words, any
`-m`_`mstr`_ that already has a matching `-M`_`message`_ will be
passed on to `check_cmd_output()` exactly as it is; any `-m`_`mstr`_
that _**lacks**_ a corresponding `-M`_`message`_ — or that has an
**empty** _`message`_ as a placeholder — will be assigned a new
`-M`_`message`_ that gets passed to `check_cmd_output()` without any
line number or other dynamic information.

Fixes #143.
@mej mej linked a pull request Sep 19, 2023 that will close this issue
@mej mej closed this as completed Sep 20, 2023
NHC 1.5 Release automation moved this from Triage / TODO to Completed / Merged Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement UI/UX boost Improvements in the user interface/experience
Projects
NHC 1.5 Release
  
Completed / Merged
1 participant