-
Notifications
You must be signed in to change notification settings - Fork 932
Print friendly message when dynamics disabled #989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print friendly message when dynamics disabled #989
Conversation
|
@rhc54 @hppritcha @bosilca Comments? One thought I have is that there are runtime environments where the MPI dynamics are disabled -- can we use this functionality when we know we are in those environments? A minor problem is that I currently have this functionality in the OMPI layer -- not in the ORTE later; we'd need to figure out some way to expose "hey, this runtime doesn't support MPI dynamics" up to the MPI layer so that someone can call |
|
This is really weird. I had already put protection in the MPI C bindings that returned "not-supported" if, for example, pmix.lookup was NULL as that is the mechanism we chose to indicate that the PMI in use doesn't support that capability. Surprisingly, some PMIs support different combinations of these things, and so you cannot just set one global that declares them all to be unavailable. I have no idea what happened to that protection code - looks like it has somehow been removed? Most sad - it needs to be there or we'll segfault if someone tries to use those functions in a direct launch scenario. As an example, Slurm supports spawn, but none of the other dynamic ops. So I think you have to split this problem into two parts. There are some MPI transports that will not support connect/accept (which means comm_spawn as well), and we need a flag for that purpose. We also the protection for opal_pmix APIs being NULL to indicate that the PMI doesn't support it. No RTE-level checks are required - this all now flows thru the opal_pmi framework. |
|
I'm okay with this. As to your question, the problem with relying on the RTE is that it might itself be able to support spawn/connect/accept, but the underlying mtl that is being used cannot. So I agree with rhc54 - we have different scenarios tthat need to be addressed. rhc54 - I think it would be a better design to not rely on null function pointers, but rather check I had a lot of frustrations getting the cray pmix work sync'd up with the pmix integration with aborts by orte daemons until I added in stub functions that returned these types of error codes. Then I got reasonable error messages back that could help in tracking down missing checks for nulls. |
|
@hppritcha I'm very puzzled by your comment about aborts in the orte daemons. The orte daemons are expressly prohibited from using the cray pmix plugin: https://github.com/open-mpi/ompi/blob/master/orte/mca/ess/base/ess_base_std_orted.c#L514 So how could you ever see an issue in the orted? |
|
@hppritcha @rhc54 Good comments (I can't address what happened to the PMIx protection, of course...). This is just a strawman PR, BTW - I'm totally open to suggestions and requests for edits/changes. How do you want to proceed here? Should COMM_SPAWN and friends (at the MPI API level) do a similar show_help (i.e., show the "you're in an environment where dynamics are not supported" message) when they get back ERR_NOT_SUPPORTED from any of the DPM functions? |
|
Probably a quick call to decide on approach would help the most. As I said, we have two issues to consider:
For example, if I want to do a lookup_nb and get a "not supported" return, does that mean the embedded PMI code doesn't support the non-blocking version? Or does it mean that the host RM doesn't support lookup at all? Without some way of telling the difference, I wind up calling the blocking version just to get another rejection. This wouldn't be a disaster, except that I do need to be careful then and check the error return to ensure it isn't something other than "not supported". Just gets a little verbose as compared to a quick NULL check. Looking to the future, we are going to be extending the opal_pmix definition. If we follow the "always return something" approach, that means every component will have to be extended to provide "no-op" functions. I personally don't care as I won't be doing that work - but it begs the question of whether or not we want to require such effort. Like I said, I'm open on this question - I honestly can't understand how @hppritcha observed what he stated, and wonder if we are trying to solve something that was caused by some other mistake. |
|
I recall that, when working off of rhc54's pmix branch, there were cases where there were unprotected I did not go back later to check with the final big pmix merge whether or not the code had changed enough that defining stub functions was no longer necessary for either avoiding the problem, or producing a helpful job launch failure message. @jsquyres I think modulo rhc54's caveats about functions like |
07e4f7e to
463dbfc
Compare
|
@rhc54 New commits pushed, which include your patch. I actually have double protection in there for PMIX: it checks for the NULL (i.e., your patch), and it also checks for OPAL_ERR_NOT_SUPPORTED as a return from the dpm functions. |
|
👍 |
|
@bosilca More changes went in. Are you still happy? |
ompi/mpi/c/comm_accept.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Why are we returning the MPI_ERR_SPAWN errorcode instead OF OMPI_ERR_NOT_IMPLEMENTED or OMPI_ERR_NOT_SUPPORTED?
- Can we have the same nice output message as below ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Good point, should be ERR_NOT_IMPLEMENTED. I'll update.
ompi_mpi_dynamics_is_enabled()will call show_help() if dynamics are not enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean “not supported”, yes?
On Oct 13, 2015, at 2:29 PM, Jeff Squyres notifications@github.com wrote:
In ompi/mpi/c/comm_accept.c #989 (comment):
@@ -89,6 +92,10 @@ int MPI_Comm_accept(const char *port_name, MPI_Info info, int root,
}
}
- if (!ompi_mpi_dynamics_is_enabled(FUNC_NAME)) {
return OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_SPAWN, FUNC_NAME);- }
/* parse info object. no prefedined values for this function in MPI-2
Good point, should be ERR_NOT_IMPLEMENTED. I'll update.
ompi_mpi_dynamics_is_enabled() will call show_help() if dynamics are not enabled.
—
Reply to this email directly or view it on GitHub https://github.com/open-mpi/ompi/pull/989/files#r41928358.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhc54 Yes, I mean "not supported". I just double checked the code, and I used "not supported" in all relevant places (not "not implemented").
(found your line comment in the outdated diff on github :-) )
463dbfc to
8c95833
Compare
ompi/mpi/c/comm_join.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to the others process management functions please raise OMPI_ERR_NOT_SUPPORTED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh; can't believe I missed that -- thx.
8c95833 to
69abe3d
Compare
|
Test FAILed. |
Add ompi_mpi_dynamics_disable() function to disable MPI dynamic process functionality (i.e., such that if MPI_COMM_SPAWN/etc. are invoked, you'll get a show_help error explaining that MPI dynamic process functionality is disabled in this environment -- instead of a potentially-cryptic network or hardware error). Fixes open-mpi#984
Disable the MPI dynamic process functionality when these components are selected to be used.
69abe3d to
889d80a
Compare
|
bot:retest |
|
Looks good. Thanks for checking both panic cases- null function pointer as well as not supported error return value |
…cs-disabled Print friendly message when dynamics disabled
Fix CUDA in v2.x
Two commits:
If merged, this will close #984.