check_ps_unauth_users() killing interactive SLURM jobs #15

Closed
bbbbbrie opened this issue Apr 28, 2016 · 9 comments

@bbbbbrie (Contributor) commented Apr 28, 2016

I ran into trouble with the SLURM implementation of check_ps_unauth_users() in NHC release 1.4.2: it kills interactive jobs. (Jobs submitted via sbatch are left alone.)

Undesired/unexpected behavior
check_ps_unauth_users: foo's "sleep" process is unauthorized. (PID 12347)
check_ps_unauth_users: foo's "/bin/bash" process is unauthorized. (PID 12372)

Upon closer inspection, this appeared to be a result of how the list of users with currently running jobs was calculated:

STAT_OUT=$(${STAT_CMD:-/usr/bin/stat} ${STAT_FMT_ARGS:--c} %U $JOBFILE_PATH/job*/slurm_script)

Details
Job files like slurm_script are not created when interactive jobs are launched. Instead, there is a file with the node's hostname and job ID as a part of the filename:

|-- compute-0-2_1084.4294967294
|-- cred_state
|-- cred_state.old
`-- job01084
    `-- slurm_script
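
For illustration (the spool path below is hypothetical; the real one comes from the site's SLURM configuration), this is roughly what the existing check ends up running on a node with only an interactive job:

# Hypothetical spool path. With only an interactive job on the node, no
# job*/slurm_script file exists, so the glob matches nothing and the job
# owner never shows up in STAT_OUT; all of that user's processes then look
# unauthorized.
stat -c %U /var/spool/slurmd/job*/slurm_script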

Potential solution
I successfully addressed this locally using squeue, which can be configured to report just usernames:

STAT_OUT=$(squeue -w localhost --noheader -o %u)

This should report the username of all users with jobs running on the local node. (If a user is running jobs but not on this node, any processes she has on localhost are unauthorized.)
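
For example, piping the output through sort -u gives a de-duplicated list, one username per line, when a user has several jobs on the node:

# sort -u collapses duplicate usernames when a user has more than one job
# running on this node.
squeue -w localhost --noheader -o %u | sort -u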

This has been tested with SLURM 15.08.7.

Please let me know if I have overlooked something or if you have any questions.

Thanks!

(NHC is awesome; thank you!)

mej added this to the 1.4.3 Release milestone May 5, 2016
mej self-assigned this May 5, 2016
@mej (Owner) commented May 5, 2016

Thanks, Brie!

This is, at least in part, a known issue. Parallel/MPI jobs have a similar issue, in that the jobscript only exists on the job's head node.

I've been hesitant to invoke SLURM commands from within NHC since it is often spawned from SLURM itself, and I certainly don't want to risk creating any sort of deadlock situation. I have not, however, had a chance to discuss with @jette or @dannyauble whether or not this is safe; it may be perfectly okay in SLURM.

Do you happen to know if squeue talks to the local slurmd or the master node's slurmctld? Do you run NHC via SLURM or via cron?

@jette commented May 5, 2016

Hi Michael,

squeue only talks with the master node's slurmctld.

There are very few slurm commands that communicate directly with the local slurmd (e.g. slurm_job_step_get_pids). Also the slurmd is extensively multi-threaded and I do not believe that you could create a deadlock situation from NHC (if you did, I would consider that a Slurm bug).

@mej (Owner) commented May 5, 2016

Okay, great, thanks @jette! That's exactly what I was hoping for.

@bbbbbrie: What this means is basically that your fix is dead-on correct. I'll just need to abstract out the command and parameters into variables, as I always do, to give sites the flexibility to tweak things if need be (e.g., sites like ours where -w localhost gives an error and -w <nodename> must be used instead).
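
A minimal sketch of that kind of abstraction (the variable names here are illustrative, not necessarily what NHC will actually use):

# Illustrative names only; sites could override either variable in their config.
SQUEUE_CMD="${SQUEUE_CMD:-squeue}"
SQUEUE_ARGS="${SQUEUE_ARGS:--w $HOSTNAME -h -o %u}"
STAT_OUT=$($SQUEUE_CMD $SQUEUE_ARGS)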

My only remaining concern would be one of concurrency at scale. Since SLURM (wisely) runs NHC on all nodes simultaneously, we can assume that they would all be running the squeue inquiry at roughly the same time as well. If that causes delays at scale, we might have to pre-generate the user <-> node mapping in advance. (Not that I think it will...I just like to plan ahead!) :-)

I'll look at implementing this, or if you'd prefer, feel free to go ahead and send a Pull Request against the dev branch. Thanks again for the report! And thanks Moe for your insights!

@jette commented May 5, 2016

The squeue commands will be processed in parallel by the slurmctld (head node) daemon. At some point scalability will become a concern, but we don't see problems today with thousands of nodes. I would recommend adding filter options to squeue if possible (e.g. "--user=mej", "--state=running", etc.), which can reduce the amount of data the slurm daemons and commands need to process and improve scalability.
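
For instance, limiting the query to running jobs on the local node keeps the controller's reply small (node-name handling is site-dependent, as noted above):

# Restricting the query to running jobs on this node reduces the amount of
# data the controller has to return.
squeue -w "$HOSTNAME" --states=running --noheader -o %u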

@bbbbbrie (Contributor, Author)

Thank you both for your feedback!

@mej I am submitting a Pull Request containing some of the details discussed here. I am using -w nodename as that syntax works where -w localhost fails (and continues to work where -w localhost succeeds). I have abstracted the squeue command and arguments into variables as you suggested. I will defer to your judgment on style for this.

This has been tested with NHC 1.4.2 on clusters running SLURM 14.11 and 15.08.

Thank you and please let me know if you have any questions!

(Please feel free to close this when appropriate.)

mej pushed a commit that referenced this issue May 31, 2016
This fixes #15 by using squeue instead of stat to obtain the list of authorized users
@OleHolmNielsen

We have a new CentOS 7.2 cluster running the Slurm 16.05 batch system. We have experienced the very same problem reported above! I look forward to an updated NHC version, and in the meantime I'll have to comment out any check_ps_unauth_users checks in nhc.conf.

@mej (Owner) commented Oct 29, 2016

@OleHolmNielsen: The mej:dev branch has this fix already. Are you able to build from that branch? I'm in the process of changing jobs right now, so I'm not sure when I'll be back in a position to get 1.4.3 into beta. I'm working on getting it figured out, though, and I'll see what I can do if you aren't able to build your own RPMs from the development tree.

@OleHolmNielsen

Hi Michael,

Thanks for the info, and good luck with your new job!

I retrieved the https://github.com/mej/nhc/tree/dev zip-file, but I'm not experienced in building RPMs from GitHub. Could you kindly add a few instructions to the README file? This is what I figured out:

unzip nhc-dev.zip
mv nhc-dev lbnl-nhc-1.4.3
cd lbnl-nhc-1.4.3
./autogen.sh
cd ..
tar czvf ~/rpmbuild/SOURCES/lbnl-nhc-1.4.3.tar.gz lbnl-nhc-1.4.3
rpmbuild -ta ~/rpmbuild/SOURCES/lbnl-nhc-1.4.3.tar.gz

Would you agree that this sounds correct?

I'll be going to SC16 in Salt Lake City with a group of sysadmins from Denmark. Are you going to be there?

Best regards,
Ole

mej added this to Pending in NHC 1.4.3 Release Oct 30, 2018
@mej (Owner) commented May 22, 2021

This is fixed in the dev branch which is about to be released. Closing this issue.

mej closed this as completed May 22, 2021