
Conversation

@hppritcha
Member

It turns out that the approach of having the HNP do the
fork/exec of MPI ranks on the head node in a SLURM environment
introduces problems when users/sysadmins want to use the SLURM
scancel tool or the sbatch --signal option to signal a job.

This commit disables use of the HNP fork/exec procedure when
a job is launched into a SLURM controlled allocation.

related to #3998

Signed-off-by: Howard Pritchard hppritcha@gmail.com
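For context, the two SLURM signalling paths mentioned above can be exercised roughly as follows; the job ID, signal, and script name are illustrative, not taken from this PR:

```shell
# Ask SLURM to deliver SIGUSR1 to job 12345. With the HNP
# fork/exec'ing ranks itself on the head node, such signals
# do not reach the MPI processes the HNP spawned directly.
scancel --signal=USR1 12345

# Or request an automatic warning signal at submission time:
# send SIGUSR1 60 seconds before the job's time limit expires.
sbatch --signal=USR1@60 job_script.sh
```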

@hppritcha hppritcha requested review from hjelmn and rhc54 August 1, 2017 19:37
@rhc54
Contributor

rhc54 commented Aug 1, 2017

I don't personally have an issue with making this change. However, this is a big change in behavior for SLURM users that does have consequences - i.e., there is now another daemon on the head node impacting performance. I would much prefer to make this optional instead of required, with the default being the old behavior until we give users a chance to know this is coming and adapt.

Contributor

@rhc54 rhc54 left a comment


As I said in my separate comment, I'd prefer this behavior be optional rather than required, as not every installation will agree with it.

@hppritcha
Member Author

hppritcha commented Aug 1, 2017

Sounds like a good idea. Maybe a configure option? Or should it be an MCA parameter for the ras framework?

@rhc54
Contributor

rhc54 commented Aug 1, 2017

I would do it as an MCA param in the RAS component so it can be overridden (in either direction) by a user, especially if some problem turns up that we missed and/or didn't anticipate.

@rhc54
Contributor

rhc54 commented Aug 2, 2017

Just to document the reasons behind the caution. There has been considerable research done regarding the impact of daemons sitting on compute nodes. Even daemons that block still contribute to jitter on the node, thus slowing down the application procs on that node.

mpirun executing on a compute node where application procs are running causes additional impact. Every process that generates stdout/err causes mpirun to wake up - ditto for the passing of stdin. The result is that procs on the head node wind up being measurably slower than their peers.

For embarrassingly parallel applications, this doesn't have too much impact other than slightly lengthening time to solution. Experiments have shown up to a 3% impact in that regard. However, more complex applications, especially those utilizing collectives, see an increased likelihood of application failure as the impacted procs continue to lag behind.

Thus, the direction has been to reduce and/or eliminate daemons from the compute nodes. This is one of the objectives of the PMIx effort - integration with the RM provides access to the info and services that otherwise would require mpirun and its daemons.

Adding another daemon to the head node will increase the impact on the application in both time to solution and probability of failure. This is why it is important to allow users to "opt out".

Hope the explanation helps document the reason for caution.

It turns out that the approach of having the HNP do the
fork/exec of MPI ranks on the head node in a SLURM environment
introduces problems when users/sysadmins want to use the SLURM
scancel tool or the sbatch --signal option to signal a job.

This commit disables use of the HNP fork/exec procedure when
a job is launched into a SLURM controlled allocation.

Update NEWS with a blurb about the new ras framework MCA parameter.

related to open-mpi#3998

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
@hppritcha hppritcha force-pushed the topic/slurmd_controls_them_all branch from 860e3cb to d08be74 Compare August 2, 2017 20:57
@hppritcha
Member Author

@rhc54 check now. Added a ras MCA parameter that controls whether or not a separate orted is used on the head node. In non-Cray SLURM environments it defaults to false, so we keep the current behavior. On Cray systems it defaults to true to handle the RDMA cookie problem we know about. If you think a different name would be better, let me know.
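For reference, a ras MCA parameter like this can be inspected and overridden either on the mpirun command line or via the environment; the parameter name used below is an assumption for illustration (check ompi_info for the name actually added by this PR):

```shell
# List the ras framework's MCA parameters at all verbosity
# levels to find the new one (name below is assumed).
ompi_info --param ras all --level 9

# Override on the mpirun command line...
mpirun --mca ras_base_launch_orted_on_hn true -n 4 ./a.out

# ...or via the environment, using the OMPI_MCA_<param> form.
export OMPI_MCA_ras_base_launch_orted_on_hn=true
```

The environment form is handy under sbatch, where the setting propagates to every mpirun in the batch script.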

@ibm-ompi

ibm-ompi commented Aug 2, 2017

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/110f9ce2b92191f0becc3c6a68dd9d24

@ibm-ompi

ibm-ompi commented Aug 2, 2017

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/28378baa2ae11866011964686b1025c6

@jjhursey
Member

jjhursey commented Aug 2, 2017

Note that master is broken so all CI is going to break on topo_treematch until it is fixed 😞

@hppritcha
Member Author

Thanks for the heads up @jjhursey

@rhc54
Contributor

rhc54 commented Aug 3, 2017

Thanks - looks good

@jjhursey
Member

jjhursey commented Aug 3, 2017

bot:retest

@hppritcha hppritcha merged commit 897c627 into open-mpi:master Aug 3, 2017
@hppritcha hppritcha deleted the topic/slurmd_controls_them_all branch May 2, 2018 02:59