-
Notifications
You must be signed in to change notification settings - Fork 932
Miscellaneous show_help updates #4199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jsquyres
commented
Sep 12, 2017
- OOB TCP: be more explicit about version mismatches
- BTL TCP: be more explicit about connection errors
- show_help: improve show_help cascading message effects
rhc54
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I said in the email thread, I don't think the show help "enable/disable" is really useful, at least so far as I am concerned. I can't use a binary decision like that as multiple jobs may be running concurrently.
However, you are welcome to move ahead if you want.
|
Yeah, I have no doubt that this won't apply to all situations. The thought was that it could apply to at least some situations: For example, for the OOB TCP show_help message I added in this PR, using the "disable"-like functionality is the difference between: and |
|
@rhc54 and I talked on the phone. Seems like this is generally a good idea, but there's definitely some corner cases that it doesn't cover well (e.g., from the Still need to do a little more testing / fixing on this branch; setting |
9ea09b5 to
fbd24a5
Compare
|
@rhc54 I finally pushed some updates based on what we talked about before. I ended up re-working the interplay between OPAL and ORTE show_help systems; I think it's a bit simpler now than it was previously. @bosilca @bwbarrett @mohanasudhan I updated some TCP BTL user messages a bit. Could you guys have a look? |
fbd24a5 to
e27b86c
Compare
|
I confess I'm still stuck on this one as it will clearly cause problems for the DVM. I just don't know how we can safely turn off all show_help messages just because some developer felt that theirs should be the last one out, and then leave it off from then on. Any kind of "run-thru" operation, not just the DVM, will be impacted. Although it would be too big a change, you ideally would want something that was at least job-specific - i.e., don't emit any more show_help messages for this specific job. Not entirely sure how to get there without a lot of editing. I'm just not convinced this solution isn't going to cause more trouble than it solves. At the very least, we need an option to turn it off so those projects that can't operate this way (e.g., the DVM, ULFM) can continue to function. |
|
get_peer_name only works for IPv4. Most changes in the TCP BTL are superficial and might have some potential benefit. I share @rhc54 concern regarding why are we doing this enable/disable thing? |
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This variable is only used when IPv6 support is enabled; ensure to protect it in an #if OPAL_ENABLE_IPv6 check. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Convert some verbose messages to opal_show_help() messages. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When an error occurs deep in a call stack, it's common to call opal_show_help() to display the error. But as the error propagates up the call stack, it is not uncommon for the upper layers to also display their own show_help messages. New functions opal_show_help_final() and opal_show_vhelp_final() will ensure that the show_help message emitted will be the *last* one. Specifically: future calls to opal_show_*help() will not emit any future messages. This helps clarify help messages to the end user: they can see the last relevant message, and then no more (vs. a series of less-specific error messages as the error propagates up the stack). Implementing this scheme entailed a minor revamp to slightly simplify how the interplay between the OPAL and ORTE show_help subsystems worked. Previously, opal_show_help and opal_show_vhelp were function pointers. They initially pointed to OPAL functions (that just ended up calling opal_output()), and were replaced by ORTE functions during orte_init() (and replaced with their OPAL counterparts during opal_finalize()). The new implementation does away with these function pointers. All the opal_show_*help functions are actually functions (vs. function pointers). If emitting is enabled (i.e., if a _final() function was not previously invoked), these functions will invoke a single back-end function to actually emit show_help messages. This single back-end function will either be the default OPAL function (that calls opal_output()), or an upper layer can register its own back-end function (e.g., ORTE registers its function during orte_init() that does all the aggregation to orterun, etc., and unregisters its function during orte_finalize()). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Be more explicit about version mismatch between ORTE processes. Use opal_show_help_final() to ensure that it is the *last* show_help message shown. Using _final will show the following to the user: ``` $ mpirun --host mpi001,mpi002 --mca btl tcp,vader,self ring_c -------------------------------------------------------------------------- Open MPI detected a mismatch in versions between two processes. This typically means that you executed "mpirun" (or "mpiexec") from one ...etc. -------------------------------------------------------------------------- ``` As opposed to showing two *more* show_help messages after the above message: ``` -------------------------------------------------------------------------- ORTE has lost communication with a remote daemon. ...etc. -------------------------------------------------------------------------- -------------------------------------------------------------------------- ORTE does not know how to route a message to the specified daemon located on the indicated node: ...etc. -------------------------------------------------------------------------- ``` Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
e27b86c to
1baea9e
Compare
|
I'm not sure where I fall on the show help change, but I definitely feel like we're combining too many things in this PR. I'd much rather see the OOB/BTL messages in one PR and the cascade change in a second. Otherwise, we lose the OOB/BTL changes on a revert if the cascade change causes impact we didn't expect. |
|
@jsquyres do you still want to work on this, or should we close? |
|
As cited above, there are problems with the concept of a "last" error. I give up -- closing. The TCP OOB/BTL commits were salvaged to #4942. |