External Match doesn't work for me #135

Closed
Aelmazaty opened this issue May 2, 2023 · 3 comments

Comments

@Aelmazaty

Hello,
I'm trying to group NHC checks per partition instead of per node.
I have some nodes in a non-standard partition and need to run different checks on them. Unfortunately, they follow the same naming convention as the standard nodes, so wildcards cannot be used.
I learned about the external match feature here: https://github.com/mej/nhc/blob/9c4a38c0c9f48f92005c9120ca88145c33841dac/scripts/common.nhc#LL296

I've tried to add the following to /etc/sysconfig/nhc:
NHC_MCHECK_DELIM=( [0]="@" )
NHC_MCHECK_COMMAND=(
[0]="sinfo -p %m --format="%n" | grep -v HOSTNAMES | fgrep -w %h"
)

and in "nhc.conf" I use @sra@ as "sra" is the partition name.

However, this doesn't seem to work. According to the logs, "mcheck_external()" is never called; the match string seems to be treated as a glob instead:
">[{L2/S0/D5/R1}@common.nhc:290:mcheck_glob()]> dbg 'Glob match check: hl-codon-113-01 does not match @sra@'"

Any hints on how to make this work?
Thanks

@mej
Owner

mej commented Jun 1, 2023

I have been digging into this off and on, in between coding and other work, since I first saw your post, and I'm afraid I can't say definitively that I know what's wrong. But I'll share my efforts so far, and hopefully we can get this fixed for you anyway!

First off, you didn't mention the specific version of NHC you're using. Everything in this post is based on the current dev branch here on GitHub (specifically, commit dc10825); I did not test prior releases.

In order to get a clear view of what was going on, I started out by running nhc in trace mode with a very brief, very contrived configuration. It changed a bit over the course of all my testing, but in the end, I wound up with just 4 lines (the first two being for debugging/tracking purposes):

### test.conf
    *    || declare -p NHC_MCHECK_DELIM NHC_MCHECK_COMMAND
    *    || set | fgrep NHC_MCHECK_
  @gpu@  || echo "GPU node"
 !@gpu@  || echo "Not a GPU node"

For expedience, I opted to put the external match settings on the command line, at least initially, using essentially the same settings you provided above (save the partition name, of course), and I ran nhc for both a GPU and a non-GPU node. The commands I used were:

nhc -avl - -c test.conf HOSTNAME=some-gpu-node    NHC_MCHECK_DELIM[0]=@ NHC_MCHECK_COMMAND[0]='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'
nhc -avl - -c test.conf HOSTNAME=some-nongpu-node NHC_MCHECK_DELIM[0]=@ NHC_MCHECK_COMMAND[0]='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'

The above settings/commands work perfectly, so I'd be curious to hear whether or not they work on your system. You may have noticed the single- and the double-quoting of the sinfo command; careful quoting is essential due to the multiple layers of shell interpretation involved.
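To illustrate the quoting layers mentioned above, here is a minimal sketch. It assumes the command-line assignment is evaluated a second time after the invoking shell has already stripped the outer single quotes; the eval below is only a stand-in for that second evaluation and is not taken from the NHC source.

```shell
#!/bin/sh
# Layer 1: the invoking shell removes the outer single quotes and passes
# the inner double-quoted string along untouched.
arg='"sinfo -hp %m --format=\"%n\" | fgrep -qw %h"'
printf 'after layer 1: %s\n' "$arg"

# Layer 2 (assumption: a second evaluation happens; eval stands in for it).
# The double quotes are consumed and \" collapses to a plain ".
eval "val=$arg"
printf 'after layer 2: %s\n' "$val"
```

After the second layer, the value is the bare pipeline with only the quotes around %n left, which is the form the external match command needs.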

Unfortunately, when trying to move the external match settings from the command line to the global config in /etc/sysconfig/nhc, hilarity ensued; i.e., I started seeing exactly the behavior you described, with mcheck() trying to treat it as a glob instead of an external match. I threw numerous instances of that declare -p command (1st line of the config) all over the place to try and figure out where things were going astray. Come to find out, when setting the variables using export or declare, the values vanished once nhcmain_load_sysconfig() returned. Long story short, since the sysconfig file is sourced inside a function, any variables you declare there become local to that function, just as if you had used local, and they disappear when the function returns.
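The scoping behavior described above can be reproduced outside NHC. The following is a minimal sketch, assuming bash; the temporary file stands in for /etc/sysconfig/nhc, and load_cfg, WITH_DECLARE, and PLAIN_ASSIGN are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Temporary file standing in for /etc/sysconfig/nhc (hypothetical contents).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
declare -a WITH_DECLARE=( [0]=@ )
PLAIN_ASSIGN=( [0]=@ )
EOF

# Hypothetical loader: sources the file from inside a function, so any
# 'declare' in the file creates a variable local to load_cfg.
load_cfg() {
    . "$cfg"
}

load_cfg
rm -f "$cfg"

# Only the plain assignment is still visible after the function returns.
echo "WITH_DECLARE: ${WITH_DECLARE[0]:-<unset>}"
echo "PLAIN_ASSIGN: ${PLAIN_ASSIGN[0]:-<unset>}"
```

In bash, declare -g would create a global from inside a function, but the plain assignment form is the simpler choice here.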

So I was able to get it to work reliably using exactly these two lines:

NHC_MCHECK_DELIM=( [0]=@ )
NHC_MCHECK_COMMAND=( [0]='sinfo -hp %m --format="%n" | fgrep -qw %h' )

Can you try exactly those settings and see if they work for you?
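For readers unfamiliar with the placeholders in that command, here is a rough sketch of how such a template could be expanded and run. It assumes %m stands for the text between the delimiters (the partition name) and %h for the node's hostname; run_check, the stubbed sinfo, and the node names are hypothetical, and this is not NHC's actual implementation.

```shell
#!/usr/bin/env bash
# Stub standing in for Slurm's sinfo so the sketch runs without a cluster.
# It ignores its arguments and always lists two hypothetical GPU nodes.
sinfo() { printf '%s\n' gpu-node-01 gpu-node-02; }

template='sinfo -hp %m --format="%n" | fgrep -qw %h'

# run_check is a hypothetical helper: substitute the placeholders, then
# evaluate the resulting pipeline; its exit status is the match result.
run_check() {
    local match=$1 host=$2 cmd
    cmd=${template//"%m"/$match}   # pattern quoted so % is taken literally
    cmd=${cmd//"%h"/$host}
    eval "$cmd" 2>/dev/null
}

for host in gpu-node-01 cpu-node-01; do
    if run_check gpu "$host"; then
        echo "$host: matches @gpu@"
    else
        echo "$host: does not match @gpu@"
    fi
done
```

The -q and -w options matter in this sketch: -q keeps the pipeline silent so only the exit status is used, and -w avoids a partial hostname matching a longer one.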

@mej mej self-assigned this Jun 1, 2023
@mej mej added question need info Additional information required from user or community labels Jun 1, 2023
@Aelmazaty
Author

Thanks a lot!
I can confirm that the solution you suggested works on version 1.4.3-1.

@mej
Owner

mej commented Jun 6, 2023

Awesome! I'm very glad to hear you got it working. :)

I'll go ahead and close this, but let me know if you run into any other issues!

@mej mej closed this as completed Jun 6, 2023