SLURM: launch all processes via slurmd #3999
Conversation
I don't personally have an issue with making this change. However, this is a big change in behavior for SLURM users and it does have consequences - i.e., there is now another daemon on the head node impacting performance. I would much prefer to make this optional instead of required, with the default being the old behavior, until we give users a chance to know this is coming and adapt.
rhc54 left a comment
As I said in my separate comment, I'd prefer that this behavior be optional instead of required, as not every installation will agree with it.
Sounds like a good idea. Maybe a configure option? Or should it be an MCA parameter for the ras framework?
I would do it as an MCA param in the RAS component so it can be overridden (in either direction) by a user, especially if some problem turns up that we missed and/or didn't anticipate.
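For context, framework-level MCA parameters in Open MPI are registered through the MCA variable system. The sketch below is only illustrative - the variable name `launch_orted_on_hn` and its placement in the ras base are assumptions for this example, not the names actually used in this PR.

```c
/* Illustrative sketch only: the variable name and registration location are
 * assumed, not taken from the actual diff. Intended to live inside the ORTE
 * ras framework base within the Open MPI tree. */
#include <stdbool.h>
#include "opal/mca/base/mca_base_var.h"

/* storage for the (hypothetical) ras_base_launch_orted_on_hn parameter */
static bool orte_ras_base_launch_orted_on_hn = false;

static void ras_base_register_launch_param(void)
{
    /* Registers "ras_base_launch_orted_on_hn" so a user can flip it in either
     * direction at run time, e.g. "mpirun --mca ras_base_launch_orted_on_hn 1 ..." */
    (void) mca_base_var_register("orte", "ras", "base", "launch_orted_on_hn",
                                 "If true, launch a separate orted on the head node "
                                 "instead of having the HNP fork/exec local ranks",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                 OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
                                 &orte_ras_base_launch_orted_on_hn);
}
```

Registering the setting at the framework (base) level rather than inside a single component would let any RAS component consult the same value, and users could override it without rebuilding if a missed problem turns up.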
Just to document the reasons behind the caution. There has been considerable research done regarding the impact of daemons sitting on compute nodes. Even daemons that block still contribute to jitter on the node, thus slowing down the application procs on that node. mpirun executing on a compute node where application procs are running causes additional impact. Every process that generates stdout/err causes mpirun to wake up - ditto for the passing of stdin. The result is that procs on the head node wind up being measurably slower than their peers.

For embarrassingly parallel applications, this doesn't have too much impact other than slightly lengthening time to solution. Experiments have shown up to a 3% impact in that regard. However, more complex applications, especially those utilizing collectives, see an increased likelihood of application failure as the impacted procs continue to lag behind. Thus, the direction has been to reduce and/or eliminate daemons from the compute nodes. This is one of the objectives of the PMIx effort - integration with the RM provides access to the info and services that otherwise would require mpirun and its daemons.

Adding another daemon to the head node will increase the impact on the application in both time to solution and probability of failure. This is why it is important to allow users to "opt out". Hope the explanation helps document the reason for caution.
It turns out that the approach of having the HNP do the fork/exec of MPI ranks on the head node in a SLURM environment introduces problems when users/sysadmins want to use the SLURM scancel tool or the sbatch --signal option to signal a job. This commit disables use of the HNP fork/exec procedure when a job is launched into a SLURM-controlled allocation. Update NEWS with a blurb about the new ras framework MCA parameter. Related to open-mpi#3998. Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
Force-pushed from 860e3cb to d08be74
@rhc54 check now. I added a ras MCA parameter that controls whether or not a separate orted is used on the head node. In non-Cray SLURM environments it defaults to false, so we keep the current behavior. On Cray systems it defaults to true to handle the RDMA cookie problem we know about. If you think a different name would be better, let me know.
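To make the platform-dependent default concrete, here is a rough sketch of how such a default could be selected; the guard macro, variable, and function names are assumptions for illustration, not the actual code from this commit.

```c
/* Rough sketch only: guard and names below are illustrative assumptions. */
#include <stdbool.h>

#if defined(CRAY_SLURM_ENV)   /* assumed compile-time guard for Cray builds */
static bool orte_ras_base_launch_orted_on_hn = true;   /* Cray: separate orted on the head node */
#else
static bool orte_ras_base_launch_orted_on_hn = false;  /* elsewhere: keep the HNP fork/exec behavior */
#endif

/* Later, when the head node's local procs are launched (hypothetical hook): */
void launch_head_node_procs(void)
{
    if (orte_ras_base_launch_orted_on_hn) {
        /* have slurmd start an orted on the head node and let it fork/exec the ranks */
    } else {
        /* previous behavior: the HNP fork/execs its local ranks directly */
    }
}
```

Either way, a user could still flip the setting at run time with the usual `--mca` override if the site default turns out to be wrong for them.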
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/110f9ce2b92191f0becc3c6a68dd9d24
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/28378baa2ae11866011964686b1025c6
Note that
Thanks for the heads-up, @jjhursey
Thanks - looks good
bot:retest