Skip to content

Conversation

@jsquyres
Copy link
Member

  • OOB TCP: be more explicit about version mismatches
  • BTL TCP: be more explicit about connection errors
  • show_help: improve show_help cascading message effects

Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said in the email thread, I don't think the show help "enable/disable" is really useful, at least so far as I am concerned. I can't use a binary decision like that as multiple jobs may be running concurrently.

However, you are welcome to move ahead if you want.

@jsquyres
Copy link
Member Author

Yeah, I have no doubt that this won't apply to all situations. The thought was that it could apply to at least some situations:

For example, for the OOB TCP show_help message I added in this PR, using the "disable"-like functionality is the difference between:

$ mpirun --host mpi001,mpi002 --mca btl tcp,vader,self ring_c
--------------------------------------------------------------------------
Open MPI detected a mismatch in versions between two processes.  This
typically means that you executed "mpirun" (or "mpiexec") from one
version of Open MPI on on node, but your default path on one of the
other nodes upon which you launched found a different version of Open
MPI.

Open MPI only supports running exactly the same version between all
processes in a single job.

This will almost certainly cause unpredictable behavior, and may end
up aborting your job.

  Local host:             savbu-usnic-a
  Local process name:     [[4366,0],0]
  Local Open MPI version: 4.0.0a1
  Peer host:              something
  Peer process name:      [[4366,0],2]
  Peer Open MPI version:  bogus
--------------------------------------------------------------------------

and

$ mpirun --host mpi001,mpi002 --mca btl tcp,vader,self ring_c
--------------------------------------------------------------------------
Open MPI detected a mismatch in versions between two processes.  This
typically means that you executed "mpirun" (or "mpiexec") from one
version of Open MPI on on node, but your default path on one of the
other nodes upon which you launched found a different version of Open
MPI.

Open MPI only supports running exactly the same version between all
processes in a single job.

This will almost certainly cause unpredictable behavior, and may end
up aborting your job.

  Local host:             savbu-usnic-a
  Local process name:     [[4366,0],0]
  Local Open MPI version: 4.0.0a1
  Peer host:              something
  Peer process name:      [[4366,0],2]
  Peer Open MPI version:  bogus
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[15141,0],0] on node savbu-usnic-a
  Remote daemon: [[15141,0],2] on node mpi002

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   savbu-usnic-a
  target node:  mpi001

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

@jsquyres
Copy link
Member Author

@rhc54 and I talked on the phone. Seems like this is generally a good idea, but there's definitely some corner cases that it doesn't cover well (e.g., from the orted in a DVM).

Still need to do a little more testing / fixing on this branch; setting opal_show_help_enable(false) after the new TCP BTL show_help messages doesn't appear to work properly. Investigating...

@jsquyres
Copy link
Member Author

jsquyres commented Oct 7, 2017

@rhc54 I finally pushed some updates based on what we talked about before. I ended up re-working the interplay between OPAL and ORTE show_help systems; I think it's a bit simpler now than it was previously.

@bosilca @bwbarrett @mohanasudhan I updated some TCP BTL user messages a bit. Could you guys have a look?

@rhc54
Copy link
Contributor

rhc54 commented Oct 8, 2017

I confess I'm still stuck on this one as it will clearly cause problems for the DVM. I just don't know how we can safely turn off all show_help messages just because some developer felt that theirs should be the last one out, and then leave it off from then on. Any kind of "run-thru" operation, not just the DVM, will be impacted.

Although it would be too big a change, you ideally would want something that was at least job-specific - i.e., don't emit any more show_help messages for this specific job. Not entirely sure how to get there without a lot of editing. I'm just not convinced this solution isn't going to cause more trouble than it solves.

At the very least, we need an option to turn it off so those projects that can't operate this way (e.g., the DVM, ULFM) can continue to function.

@bosilca
Copy link
Member

bosilca commented Oct 9, 2017

get_peer_name only works for IPv4. Most changes in the TCP BTL are superficial and might have some potential benefit.

I share @rhc54 concern regarding why are we doing this enable/disable thing?

@rhc54
Copy link
Contributor

rhc54 commented Oct 11, 2017

We had planned to discuss this on today's telecon, but ran out of time. @jsquyres indicated he and I might iterate some more offline before bringing it back to the group.

Just to ensure we don't miss it, @gpaulsen would you please add it to next week's agenda?

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This variable is only used when IPv6 support is enabled; ensure to
protect it in an #if OPAL_ENABLE_IPv6 check.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Convert some verbose messages to opal_show_help() messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When an error occurs deep in a call stack, it's common to call
opal_show_help() to display the error.  But as the error propagates up
the call stack, it is not uncommon for the upper layers to also
display their own show_help messages.

New functions opal_show_help_final() and opal_show_vhelp_final() will
ensure that the show_help message emitted will be the *last* one.
Specifically: future calls to opal_show_*help() will not emit any
future messages.  This helps clarify help messages to the end user:
they can see the last relevant message, and then no more (vs. a series
of less-specific error messages as the error propagates up the stack).

Implementing this scheme entailed a minor revamp to slightly simplify
how the interplay between the OPAL and ORTE show_help subsystems
worked.

Previously, opal_show_help and opal_show_vhelp were function pointers.
They initially pointed to OPAL functions (that just ended up calling
opal_output()), and were replaced by ORTE functions during orte_init()
(and replaced with their OPAL counterparts during opal_finalize()).

The new implementation does away with these function pointers.  All
the opal_show_*help functions are actually functions (vs. function
pointers).  If emitting is enabled (i.e., if a _final() function was
not previously invoked), these functions will invoke a single back-end
function to actually emit show_help messages.

This single back-end function will either be the default OPAL function
(that calls opal_output()), or an upper layer can register its own
back-end function (e.g., ORTE registers its function during
orte_init() that does all the aggregation to orterun, etc., and
unregisters its function during orte_finalize()).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Be more explicit about version mismatch between ORTE processes.  Use
opal_show_help_final() to ensure that it is the *last* show_help
message shown.  Using _final will show the following to the user:

```
$ mpirun --host mpi001,mpi002 --mca btl tcp,vader,self ring_c
--------------------------------------------------------------------------
Open MPI detected a mismatch in versions between two processes.  This
typically means that you executed "mpirun" (or "mpiexec") from one
...etc.
--------------------------------------------------------------------------
```

As opposed to showing two *more* show_help messages after the above
message:

```
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
...etc.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
...etc.
--------------------------------------------------------------------------
```

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@bwbarrett
Copy link
Member

I'm not sure where I fall on the show help change, but I definitely feel like we're combining too many things in this PR. I'd much rather see the OOB/BTL messages in one PR and the cascade change in a second. Otherwise, we lose the OOB/BTL changes on a revert if the cascade change causes impact we didn't expect.

@hppritcha
Copy link
Member

@jsquyres do you still want to work on this, or should we close?

@jsquyres
Copy link
Member Author

As cited above, there are problems with the concept of a "last" error. I give up -- closing.

The TCP OOB/BTL commits were salvaged to #4942.

@jsquyres jsquyres closed this Mar 21, 2018
@jsquyres jsquyres deleted the pr/misc-updates branch March 21, 2018 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants