check_ps_unauth_users() killing interactive SLURM jobs #15
Comments
Thanks, Brie! This is, at least in part, a known issue. Parallel/MPI jobs have a similar issue, in that the jobscript only exists on the job's head node. I've been hesitant to invoke SLURM commands from within NHC since it is often spawned from SLURM itself, and I certainly don't want to risk creating any sort of deadlock situation. I have not, however, had a chance to discuss with @jette or @dannyauble whether or not this is safe; it may be perfectly okay in SLURM. Do you happen to know if invoking squeue from within NHC is safe?
Hi Michael,

squeue only talks with the master node's slurmctld. There are very few Slurm commands that communicate directly with the compute nodes.

(Replying to Michael Jennings' comment of 2016-05-04 22:58.)
Okay, great, thanks @jette! That's exactly what I was hoping for.

@bbbbbrie: What this means is basically that your fix is dead-on correct. I'll just need to abstract out the command and parameters into variables, as I always do, just to give sites the flexibility of tweaking things if need be (e.g., for sites like ours for whom …).

My only remaining concern would be one of concurrency at scale. Since SLURM (wisely) runs NHC on all nodes simultaneously, we can assume that they would all be running the squeue command at roughly the same time.

I'll look at implementing this, or if you'd prefer, feel free to go ahead and send a Pull Request against the dev branch.
The squeue commands will be processed in parallel by the slurmctld (head node) daemon. At some point scalability will become a concern, but we don't see problems today with thousands of nodes. I would recommend adding filter options to squeue if possible (e.g. "--user=mej", "--state=running", etc.), which can reduce the amount of data the Slurm daemons and commands need to process and improve scalability.
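The effect of the filtering advice above can be illustrated with a small simulation. A real squeue would apply --state=running server-side inside slurmctld; here awk plays that role on canned USER/STATE output, so the snippet runs without a Slurm installation (the usernames are made up):

```shell
# Canned two-column squeue-like output: USER STATE (hypothetical users).
squeue_out='alice RUNNING
bob PENDING
carol RUNNING'

# With a --state=running filter, only the RUNNING rows are ever generated
# and shipped to the client; awk simulates that server-side cut here.
RUNNING_USERS=$(printf '%s\n' "$squeue_out" | awk '$2 == "RUNNING" { print $1 }')
echo "$RUNNING_USERS"
```

The fewer rows slurmctld has to format and transmit, the less load each of the simultaneous per-node NHC invocations adds.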
Thank you both for your feedback! @mej, I am submitting a Pull Request containing some of the details discussed here. I am using squeue as described in the issue. This has been tested with NHC 1.4.2 on clusters running SLURM 14.11 and 15.08.

Thank you, and please let me know if you have any questions! (Please feel free to close this when appropriate.)
This fixes #15 by using squeue instead of stat to obtain the list of authorized users
We have a new CentOS 7.2 cluster running the Slurm 16.05 batch system. We have experienced the very same problem reported above! I look forward to an updated NHC version, and in the meantime I'll have to comment out any check_ps_unauth_users checks in nhc.conf.
@OleHolmNielsen: The mej:dev branch has this fix already. Are you able to build from that branch? I'm in the process of changing jobs right now, so I'm not sure when I'll be back in a position to get 1.4.3 into beta. I'm working on getting it figured out, though, and I'll see what I can do if you aren't able to build your own RPMs from the development tree.
Hi Michael,

(On 10/29/2016 12:56 PM, Michael Jennings wrote:)

Thanks for the info, and good luck with your new job! I retrieved the https://github.com/mej/nhc/tree/dev zip-file, and I'm unpacking it with unzip nhc-dev.zip. Would you agree that this sounds correct?

I'll be going to SC16 in Salt Lake City with a group of sysadmins from …

Best regards,
This is fixed in the dev branch, which is about to be released. Closing this issue.
I experienced trouble with the SLURM implementation of check_ps_unauth_users() in release 1.4.2 of NHC killing interactive jobs. (Jobs submitted via sbatch are left alone.)
Undesired/unexpected behavior
```
check_ps_unauth_users: foo's "sleep" process is unauthorized. (PID 12347)
check_ps_unauth_users: foo's "/bin/bash" process is unauthorized. (PID 12372)
```
Upon closer inspection, this appeared to be a result of how the list of users with currently running jobs was calculated:
```
STAT_OUT=$(${STAT_CMD:-/usr/bin/stat} ${STAT_FMT_ARGS:--c} %U $JOBFILE_PATH/job*/slurm_script)
```
Details
Job files like slurm_script are not created when interactive jobs are launched. Instead, there is a file with the node's hostname and job ID as a part of the filename:
```
|-- compute-0-2_1084.4294967294
|-- cred_state
|-- cred_state.old
`-- job01084
    `-- slurm_script
```
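The consequence of this layout can be reproduced without Slurm. Below is a throwaway mock of the spool directory (the paths and job IDs are invented; only the stat -c %U invocation mirrors the real check): a batch job's directory contains slurm_script, an interactive job's does not, so the glob silently skips the interactive user.

```shell
# Build a temporary mock of the Slurm spool directory (not the real one).
SPOOL=$(mktemp -d)
mkdir -p "$SPOOL/job00001" "$SPOOL/job00002"
touch "$SPOOL/job00001/slurm_script"      # batch job: slurm_script exists
# job00002 simulates an interactive job: no slurm_script is created.

# Same shape as the stat invocation above: %U prints each file's owner.
STAT_OUT=$(stat -c %U "$SPOOL"/job*/slurm_script 2>/dev/null)
echo "$STAT_OUT"                          # only the batch job's owner appears

rm -rf "$SPOOL"
```

The interactive job's owner never makes it into STAT_OUT, so check_ps_unauth_users treats that user's processes as unauthorized.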
Potential solution
I successfully addressed this locally using squeue, which can be configured to report just usernames:

```
STAT_OUT=$(squeue -w localhost --noheader -o %u)
```

This should report the usernames of all users with jobs running on the local node. (If a user is running jobs, but none on this node, any processes she has on localhost are unauthorized.)
This has been tested with SLURM 15.08.7.
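As a sketch of how the command and parameters might be abstracted into overridable variables, in the spirit discussed above: the names SQUEUE_CMD and SQUEUE_ARGS are illustrative, not NHC's actual variables, and a fake squeue function stands in so the snippet runs without Slurm.

```shell
# Mock squeue: prints one username per running job step, ignoring its
# arguments. A real deployment would drop this function so that
# SQUEUE_CMD resolves to the actual binary.
squeue() { printf 'alice\nbob\nalice\n'; }

# Site-overridable command and argument list (hypothetical variable names).
SQUEUE_CMD="${SQUEUE_CMD:-squeue}"
SQUEUE_ARGS="${SQUEUE_ARGS:--w $(hostname) --noheader --state=running -o %u}"

# Deduplicate: a user with several jobs on the node appears once per job.
AUTH_USERS=$($SQUEUE_CMD $SQUEUE_ARGS | sort -u)
echo "$AUTH_USERS"
```

Sites could then point SQUEUE_CMD at a wrapper, or adjust SQUEUE_ARGS, without patching the check itself.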
Please let me know if I have overlooked something or if you have any questions.
Thanks!
(NHC is awesome; thank you!)