
Conversation

@hppritcha
Member

It turns out that the approach of having the HNP do the
fork/exec of MPI ranks on the head node in a SLURM environment
introduces problems when users/sysadmins want to use the SLURM
scancel tool or the sbatch --signal option to signal a job.

This commit disables use of the HNP fork/exec procedure when
a job is launched into a SLURM controlled allocation.

related to #3998

Signed-off-by: Howard Pritchard hppritcha@gmail.com
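For context, the two SLURM signalling paths mentioned above can be exercised roughly as follows; the job ID, signal, and script name are illustrative, not taken from this PR:

```shell
# Ask SLURM to deliver SIGUSR1 to job 12345. With the HNP
# fork/exec'ing ranks itself on the head node, such signals
# do not reach the MPI processes the HNP spawned directly.
scancel --signal=USR1 12345

# Or request an automatic warning signal at submission time:
# send SIGUSR1 60 seconds before the job's time limit expires.
sbatch --signal=USR1@60 job_script.sh
```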

@hppritcha hppritcha requested review from hjelmn and rhc54 August 1, 2017 19:37
@rhc54
Contributor

rhc54 commented Aug 1, 2017

I don't personally have an issue with making this change. However, this is a big change in behavior for SLURM users that does have consequences - i.e., there is now another daemon on the head node impacting performance. I would much prefer to make this optional instead of required, with the default being the old behavior until we give users a chance to know this is coming and adapt.

Contributor

@rhc54 rhc54 left a comment


As I said in my separate comment, I'd prefer this behavior be optional rather than required, as not every installation will agree with it.

@hppritcha
Member Author

hppritcha commented Aug 1, 2017

Sounds like a good idea. Maybe a configure option? Or should it be an MCA parameter for the ras framework?

@rhc54
Contributor

rhc54 commented Aug 1, 2017

I would do it as an MCA param in the RAS component so it can be overridden (in either direction) by a user, especially if some problem turns up that we missed and/or didn't anticipate.

@rhc54
Contributor

rhc54 commented Aug 2, 2017

Just to document the reasons behind the caution. There has been considerable research done regarding the impact of daemons sitting on compute nodes. Even daemons that block still contribute to jitter on the node, thus slowing down the application procs on that node.

mpirun executing on a compute node where application procs are running causes additional impact. Every process that generates stdout/err causes mpirun to wake up - ditto for the passing of stdin. The result is that procs on the head node wind up being measurably slower than their peers.

For embarrassingly parallel applications, this doesn't have too much impact other than slightly lengthening time to solution. Experiments have shown up to a 3% impact in that regard. However, more complex applications, especially those utilizing collectives, see an increased likelihood of application failure as the impacted procs continue to lag behind.

Thus, the direction has been to reduce and/or eliminate daemons from the compute nodes. This is one of the objectives of the PMIx effort - integration with the RM provides access to the info and services that otherwise would require mpirun and its daemons.

Adding another daemon to the head node will increase the impact on the application in both time to solution and probability of failure. This is why it is important to allow users to "opt out".

Hope the explanation helps document the reason for caution.

It turns out that the approach of having the HNP do the
fork/exec of MPI ranks on the head node in a SLURM environment
introduces problems when users/sysadmins want to use the SLURM
scancel tool or the sbatch --signal option to signal a job.

This commit disables use of the HNP fork/exec procedure when
a job is launched into a SLURM controlled allocation.

Update NEWS with a blurb about the new ras framework MCA parameter.

related to open-mpi#3998

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
@hppritcha hppritcha force-pushed the topic/slurmd_controls_them_all branch from 860e3cb to d08be74 Compare August 2, 2017 20:57
@hppritcha
Member Author

@rhc54 check now. Added a ras MCA parameter that controls whether or not a separate orted is used on the head node. In non-Cray SLURM environments it defaults to false, so we keep the current behavior. On Cray systems it defaults to true to handle the RDMA cookie problem we know about. If you think a different name would be better, let me know.
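For reference, a ras MCA parameter like this can be inspected and overridden either on the mpirun command line or via the environment; the parameter name used below is an assumption for illustration (check ompi_info for the name actually added by this PR):

```shell
# List the ras framework's MCA parameters at all verbosity
# levels to find the new one (name below is assumed).
ompi_info --param ras all --level 9

# Override on the mpirun command line...
mpirun --mca ras_base_launch_orted_on_hn true -n 4 ./a.out

# ...or via the environment, using the OMPI_MCA_<param> form.
export OMPI_MCA_ras_base_launch_orted_on_hn=true
```

The environment form is handy under sbatch, where the setting propagates to every mpirun in the batch script.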

@ibm-ompi

ibm-ompi commented Aug 2, 2017

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/110f9ce2b92191f0becc3c6a68dd9d24

@ibm-ompi

ibm-ompi commented Aug 2, 2017

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/28378baa2ae11866011964686b1025c6

@jjhursey
Member

jjhursey commented Aug 2, 2017

Note that master is broken so all CI is going to break on topo_treematch until it is fixed 😞

@hppritcha
Member Author

Thanks for the heads up @jjhursey

@rhc54
Contributor

rhc54 commented Aug 3, 2017

Thanks - looks good

@jjhursey
Member

jjhursey commented Aug 3, 2017

bot:retest

@hppritcha hppritcha merged commit 897c627 into open-mpi:master Aug 3, 2017
@hppritcha hppritcha deleted the topic/slurmd_controls_them_all branch May 2, 2018 02:59