rhc54
Contributor

@rhc54 rhc54 commented Jan 12, 2017

This allows the system to dynamically detect the number of available processors and set the #slots accordingly.

Signed-off-by: Ralph Castain rhc@open-mpi.org

… specifies the #slots. This allows the system to dynamically detect the number of available processors and set the #slots accordingly.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@rhc54 rhc54 requested a review from ggouaillardet January 12, 2017 21:23
@rhc54 rhc54 added the bug label Jan 12, 2017
@rhc54 rhc54 added this to the v2.0.2 milestone Jan 12, 2017
@hppritcha
Member

Do we need to change any documentation for this change in behavior?

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@rhc54
Contributor Author

rhc54 commented Jan 13, 2017

Good point - I have updated the man page

@hppritcha
Member

@jsquyres what do you think about this kind of change in behavior within a release stream?
I recall a lot of discussion about behavior of --host vs --hostfile etc.
at one of our recent F2F's:
https://github.com/open-mpi/ompi/wiki/Meeting-2016-02
just wanting to make sure we are keeping with the original decisions.

@hppritcha
Member

Here's the spreadsheet that we worked on at the Dallas F2F 2/16:

https://docs.google.com/spreadsheets/d/1poOwNKtYxnDnpF7-D15lmcFVtLRNrmf4_hu3Obtu95M/edit

@rhc54
Contributor Author

rhc54 commented Jan 16, 2017

@hppritcha Hate to tell you, but that spreadsheet isn't accurate - at least, that isn't the current behavior.

@ggouaillardet
Contributor

@hppritcha @rhc54 imho, this specific case is kind of undocumented/unspecified.

on one hand, we specify a slot-list, and on the other hand, we specify a host.

for example, what if the slot-list contains 12 slots but we specify a different number of slots with the --host option ? should we try to handle this in a "nice" way ? or should we simply abort because of an inconsistency in the options requested by the user ?

in this very specific case, e.g. mpirun --slot-list 0:0-5,1:0-5 --host xxx, yet another option could be to set the ORTE_NODE_FLAG_SLOTS_GIVEN flag, and set the number of slots to 12 (i.e. the number of slots in the list)
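
To make the two options above concrete, here is a minimal Python sketch (not ORTE code). It assumes every slot-list entry has the form `socket:core` or `socket:first-last`, as in the example command; `count_slots` and `reconcile` are hypothetical helper names, and taking the smaller value in the non-strict case is only one possible reading of a "nice" policy.

```python
def count_slots(slot_list: str) -> int:
    """Count the cores named in a slot-list such as '0:0-5,1:0-5'.

    Assumption: each comma-separated entry is 'socket:core' or
    'socket:first-last'. Anything else is rejected rather than guessed at.
    """
    total = 0
    for entry in slot_list.split(","):
        socket, sep, cores = entry.partition(":")
        if not sep or not socket.strip().isdigit():
            raise ValueError(f"unsupported slot-list entry: {entry!r}")
        first, dash, last = cores.partition("-")
        lo = int(first)
        hi = int(last) if dash else lo
        if hi < lo:
            raise ValueError(f"descending core range in entry: {entry!r}")
        total += hi - lo + 1
    return total


def reconcile(slot_list: str, host_slots=None, strict=True) -> int:
    """Pick the slot count for a host given a slot-list.

    host_slots is None when --host carried no explicit count.
    strict=True aborts on a mismatch; strict=False falls back to the
    smaller value (an illustrative policy, not an agreed one).
    """
    listed = count_slots(slot_list)
    if host_slots is None or host_slots == listed:
        return listed
    if strict:
        raise ValueError(
            f"slot-list names {listed} slots but --host requested {host_slots}"
        )
    return min(listed, host_slots)


print(count_slots("0:0-5,1:0-5"))  # 12
```

With no count on `--host`, `reconcile("0:0-5,1:0-5")` returns the 12 slots from the list, which matches the third option described above.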

@rhc54
Contributor Author

rhc54 commented Jan 16, 2017

I've said this before, but I'll reiterate here - I am no longer supporting the -host/-hostfile code. We keep coming up with every imaginable corner-case, and the complexity of trying to handle all of them - while preserving existing behavior - is insane. Clear evidence: the current master behavior no longer mirrors @jsquyres's spreadsheet.

This patch makes the branch follow the master and fixes the user's problem, but the result isn't the behavior in the spreadsheet. Someone else can figure out what they want to do.

@jladd-mlnx
Member

bot:mellanox:retest

@hppritcha
Member

hppritcha commented Jan 24, 2017

Here's what we have at the moment across several releases for the --host/--hostfile behavior where host_foo has a single entry foo. The table below shows how many processes are launched as a function of mpirun command line options and Open MPI release:

| Release | mpirun --host foo | mpirun --host foo:2 | mpirun --hostfile host_foo | mpirun --hostfile host_foo -np 1 |
|---|---|---|---|---|
| 1.8.8 | one process | doesn't work | nslot processes | one process |
| 1.10.0 | one process | doesn't work | nslot processes | one process |
| 1.10.5 | one process | doesn't work | nslot processes | one process |
| 1.10.x | one process | doesn't work | nslot processes | one process |
| 2.0.0 | one process | two processes | nslot processes | one process |
| 2.0.1 | one process | two processes | nslot processes | one process |
| 3.0.0a1 | nslot processes | two processes | nslot processes | one process |

where nslot is the number of cores on the host foo
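
For anyone who wants to use the table programmatically, the same data can be written as a small Python lookup. This is only a transcription sketch: the names `COMMANDS`, `BEHAVIOR`, and `expected_procs` are hypothetical, `None` stands for a "doesn't work" cell, and the core count of foo has to be supplied by the caller.

```python
NSLOT = "nslot"  # placeholder for "number of cores on host foo"

# Column order follows the table above.
COMMANDS = (
    "mpirun --host foo",
    "mpirun --host foo:2",
    "mpirun --hostfile host_foo",
    "mpirun --hostfile host_foo -np 1",
)

# One row per release; None marks a "doesn't work" cell.
BEHAVIOR = {
    "1.8.8":   (1, None, NSLOT, 1),
    "1.10.0":  (1, None, NSLOT, 1),
    "1.10.5":  (1, None, NSLOT, 1),
    "1.10.x":  (1, None, NSLOT, 1),
    "2.0.0":   (1, 2, NSLOT, 1),
    "2.0.1":   (1, 2, NSLOT, 1),
    "3.0.0a1": (NSLOT, 2, NSLOT, 1),
}


def expected_procs(release: str, command: str, nslot: int):
    """Return the process count from the table, or None if the option
    combination does not work on that release."""
    cell = BEHAVIOR[release][COMMANDS.index(command)]
    return nslot if cell == NSLOT else cell


# On a hypothetical 8-core host foo:
print(expected_procs("2.0.1", "mpirun --host foo", 8))    # 1
print(expected_procs("3.0.0a1", "mpirun --host foo", 8))  # 8
```

The two printed values show the change in `--host foo` between the 2.0.x releases and 3.0.0a1 that this thread is about.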

@hppritcha
Member

Discussed at the devel F2F and decided to keep the behavior of 1.10.x series (modulo the new --host foo:X). So closing this PR.

@hppritcha hppritcha closed this Jan 24, 2017
@jsquyres
Member

@jjhursey
Member

Food for thought: It would be nice if we had a system regularly running a handful of -host and -hostfile runs and comparing the output to confirm that we stay in compliance with what we intend to do. Maybe a CI test on a specific system (keeping to a specific system makes it easy to set up without handling the generic system problem).
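
A rough Python sketch of what such a check could look like, under stated assumptions: each case launches `hostname` so that one output line corresponds to one process, the host foo and hostfile host_foo from the earlier comment exist on the test system, and `INTENDED` is only a reading of the "keep 1.10.x, plus `--host foo:X`" decision. The helper names `count_procs` and `check` are hypothetical.

```python
import subprocess

# Assumed reading of the intended behavior; adjust to the agreed matrix.
INTENDED = {
    "mpirun --host foo": 1,
    "mpirun --host foo:2": 2,
    "mpirun --hostfile host_foo -np 1": 1,
}


def count_procs(command: str, run=subprocess.run) -> int:
    """Run `<command> hostname` and count output lines (one per process).

    `run` is injectable so the comparison logic can be exercised
    without an MPI installation.
    """
    result = run(
        command.split() + ["hostname"],
        capture_output=True,
        text=True,
        check=True,
    )
    return len(result.stdout.splitlines())


def check(cases: dict, run=subprocess.run) -> list:
    """Return (command, expected, actual) for every case that deviates."""
    failures = []
    for command, expected in cases.items():
        actual = count_procs(command, run)
        if actual != expected:
            failures.append((command, expected, actual))
    return failures
```

A CI job on the fixed test system could call `check(INTENDED)` and fail the build whenever the returned list is non-empty.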

@rhc54 rhc54 deleted the cmr20x/host branch January 25, 2017 05:30