check_ps_unauth_users() killing interactive SLURM jobs #15

Closed
bbbbbrie opened this issue Apr 28, 2016 · 9 comments

@bbbbbrie (Contributor) commented Apr 28, 2016

I ran into trouble with the SLURM implementation of check_ps_unauth_users() in NHC release 1.4.2: it kills interactive jobs. (Jobs submitted via sbatch are left alone.)

Undesired/unexpected behavior
check_ps_unauth_users: foo's "sleep" process is unauthorized. (PID 12347)
check_ps_unauth_users: foo's "/bin/bash" process is unauthorized. (PID 12372)

Upon closer inspection, this appeared to be a result of how the list of users with currently running jobs was calculated:

STAT_OUT=$(${STAT_CMD:-/usr/bin/stat} ${STAT_FMT_ARGS:--c} %U $JOBFILE_PATH/job*/slurm_script)

Details
Job files like slurm_script are not created when interactive jobs are launched. Instead, there is a file with the node's hostname and job ID as a part of the filename:

|-- compute-0-2_1084.4294967294
|-- cred_state
|-- cred_state.old
`-- job01084
    `-- slurm_script
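
For illustration (the spool path below is hypothetical; the real one comes from the site's SLURM configuration), this is roughly what the existing check ends up running on a node with only an interactive job:

# Hypothetical spool path. With only an interactive job on the node, no
# job*/slurm_script file exists, so the glob matches nothing and the job
# owner never shows up in STAT_OUT; all of that user's processes then look
# unauthorized.
stat -c %U /var/spool/slurmd/job*/slurm_script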

Potential solution
I successfully addressed this locally using squeue, which can be configured to report just usernames:

STAT_OUT=$(squeue -w localhost --noheader -o %u)

This should report the username of all users with jobs running on the local node. (If a user is running jobs but not on this node, any processes she has on localhost are unauthorized.)
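
For example, piping the output through sort -u gives a de-duplicated list, one username per line, when a user has several jobs on the node:

# sort -u collapses duplicate usernames when a user has more than one job
# running on this node.
squeue -w localhost --noheader -o %u | sort -u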

This has been tested with SLURM 15.08.7.

Please let me know if I have overlooked something or if you have any questions.

Thanks!

(NHC is awesome; thank you!)

mej added this to the 1.4.3 Release milestone May 5, 2016
mej self-assigned this May 5, 2016
@mej (Owner) commented May 5, 2016

Thanks, Brie!

This is, at least in part, a known issue. Parallel/MPI jobs have a similar issue, in that the jobscript only exists on the job's head node.

I've been hesitant to invoke SLURM commands from within NHC since it is often spawned from SLURM itself, and I certainly don't want to risk creating any sort of deadlock situation. I have not, however, had a chance to discuss with @jette or @dannyauble whether or not this is safe; it may be perfectly okay in SLURM.

Do you happen to know if squeue talks to the local slurmd or the master node's slurmctld? Do you run NHC via SLURM or via cron?

@jette commented May 5, 2016

Hi Michael,

squeue only talks with the master node's slurmctld.

There are very few slurm commands that communicate directly with the local slurmd (e.g. slurm_job_step_get_pids). Also the slurmd is extensively multi-threaded and I do not believe that you could create a deadlock situation from NHC (if you did, I would consider that a Slurm bug).

@mej (Owner) commented May 5, 2016

Okay, great, thanks @jette! That's exactly what I was hoping for.

@bbbbbrie: What this means is basically that your fix is dead-on correct. I'll just need to abstract out the command and parameters into variables, as I always do, to give sites the flexibility to tweak things if need be (e.g., sites like ours where -w localhost gives an error and -w <nodename> must be used instead).
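
A minimal sketch of that kind of abstraction (the variable names here are illustrative, not necessarily what NHC will actually use):

# Illustrative names only; sites could override either variable in their config.
SQUEUE_CMD="${SQUEUE_CMD:-squeue}"
SQUEUE_ARGS="${SQUEUE_ARGS:--w $HOSTNAME -h -o %u}"
STAT_OUT=$($SQUEUE_CMD $SQUEUE_ARGS)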

My only remaining concern would be one of concurrency at scale. Since SLURM (wisely) runs NHC on all nodes simultaneously, we can assume that they would all be running the squeue inquiry at roughly the same time as well. If that causes delays at scale, we might have to pre-generate the user <-> node mapping in advance. (Not that I think it will...I just like to plan ahead!) :-)

I'll look at implementing this, or if you'd prefer, feel free to go ahead and send a Pull Request against the dev branch. Thanks again for the report! And thanks Moe for your insights!

@jette commented May 5, 2016

The squeue commands will be processed in parallel by the slurmctld (head node) daemon. At some point scalability will become a concern, but we don't see problems today with thousands of nodes. I would recommend adding filter options to squeue if possible (e.g. "--user=mej", "--state=running", etc.), which can reduce the amount of data the slurm daemons and commands need to process and improve scalability.
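
For instance, limiting the query to running jobs on the local node keeps the controller's reply small (node-name handling is site-dependent, as noted above):

# Restricting the query to running jobs on this node reduces the amount of
# data the controller has to return.
squeue -w "$HOSTNAME" --states=running --noheader -o %u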

@bbbbbrie (Contributor, Author)

Thank you both for your feedback!

@mej I am submitting a Pull Request containing some of the details discussed here. I am using -w nodename as that syntax works where -w localhost fails (and continues to work where -w localhost succeeds). I have abstracted the squeue command and arguments into variables as you suggested. I will defer to your judgment on style for this.

This has been tested with NHC 1.4.2 on clusters running SLURM 14.11 and 15.08.

Thank you and please let me know if you have any questions!

(Please feel free to close this when appropriate.)

mej pushed a commit that referenced this issue May 31, 2016
This fixes #15 by using squeue instead of stat to obtain the list of authorized users
@OleHolmNielsen

We have a new CentOS 7.2 cluster running the Slurm 16.05 batch system. We have experienced the very same problem reported above! I look forward to an updated NHC version, and in the meantime I'll have to comment out any check_ps_unauth_users checks in nhc.conf.

@mej (Owner) commented Oct 29, 2016

@OleHolmNielsen: The mej:dev branch has this fix already. Are you able to build from that branch? I'm in the process of changing jobs right now, so I'm not sure when I'll be back in a position to get 1.4.3 into beta. I'm working on getting it figured out, though, and I'll see what I can do if you aren't able to build your own RPMs from the development tree.

@OleHolmNielsen

Hi Michael,

Thanks for the info, and good luck with your new job!

I retrieved the https://github.com/mej/nhc/tree/dev zip-file, but I'm not experienced in building RPMs from GitHub. Could you kindly add a few instructions to the README file? This is what I figured out:

unzip nhc-dev.zip
mv nhc-dev lbnl-nhc-1.4.3
cd lbnl-nhc-1.4.3
./autogen.sh
cd ..
tar czvf ~/rpmbuild/SOURCES/lbnl-nhc-1.4.3.tar.gz lbnl-nhc-1.4.3
rpmbuild -ta ~/rpmbuild/SOURCES/lbnl-nhc-1.4.3.tar.gz

Would you agree that this sounds correct?

I'll be going to SC16 in Salt Lake City with a group of sysadmins from Denmark. Are you going to be there?

Best regards,
Ole

mej added this to Pending in NHC 1.4.3 Release Oct 30, 2018
@mej (Owner) commented May 22, 2021

This is fixed in the dev branch which is about to be released. Closing this issue.

mej closed this as completed May 22, 2021