Skip to content

Conversation

@jsquyres
Copy link
Member

@jsquyres jsquyres commented Oct 6, 2015

Two commits:

  1. New infrastructure to gracefully disable MPI dynamic process functionality
  2. Have the MXM MTL and Yalla PML use this functionality

If merged, this will close #984.

@jsquyres jsquyres added this to the Open MPI v2.0.0 milestone Oct 6, 2015
@jsquyres
Copy link
Member Author

jsquyres commented Oct 6, 2015

@rhc54 @hppritcha @bosilca Comments?

One thought I have is that there are runtime environments where the MPI dynamics are disabled -- can we use this functionality when we know we are in those environments? A minor problem is that I currently have this functionality in the OMPI layer -- not in the ORTE later; we'd need to figure out some way to expose "hey, this runtime doesn't support MPI dynamics" up to the MPI layer so that someone can call ompi_mpi_dynamics_disabled() ... or something like that. Thoughts?

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2015

This is really weird. I had already put protection in the MPI C bindings that returned "not-supported" if, for example, pmix.lookup was NULL as that is the mechanism we chose to indicate that the PMI in use doesn't support that capability. Surprisingly, some PMIs support different combinations of these things, and so you cannot just set one global that declares them all to be unavailable. I have no idea what happened to that protection code - looks like it has somehow been removed? Most sad - it needs to be there or we'll segfault if someone tries to use those functions in a direct launch scenario.

As an example, Slurm supports spawn, but none of the other dynamic ops.

So I think you have to split this problem into two parts. There are some MPI transports that will not support connect/accept (which means comm_spawn as well), and we need a flag for that purpose. We also the protection for opal_pmix APIs being NULL to indicate that the PMI doesn't support it.

No RTE-level checks are required - this all now flows thru the opal_pmi framework.

@hppritcha
Copy link
Member

I'm okay with this. As to your question, the problem with relying on the RTE is that it might itself be able to support spawn/connect/accept, but the underlying mtl that is being used cannot.

So I agree with rhc54 - we have different scenarios tthat need to be addressed.

rhc54 - I think it would be a better design to not rely on null function pointers, but rather check
if the return value from the function (if non-NULL) is OPAL_ERR_NOT_SUPPORTED.

I had a lot of frustrations getting the cray pmix work sync'd up with the pmix integration with aborts by orte daemons until I added in stub functions that returned these types of error codes. Then I got reasonable error messages back that could help in tracking down missing checks for nulls.

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2015

@hppritcha I'm very puzzled by your comment about aborts in the orte daemons. The orte daemons are expressly prohibited from using the cray pmix plugin:

https://github.com/open-mpi/ompi/blob/master/orte/mca/ess/base/ess_base_std_orted.c#L514

So how could you ever see an issue in the orted?

@bosilca
Copy link
Member

bosilca commented Oct 8, 2015

@jsquyres 👍

@jsquyres
Copy link
Member Author

jsquyres commented Oct 8, 2015

@hppritcha @rhc54 Good comments (I can't address what happened to the PMIx protection, of course...). This is just a strawman PR, BTW - I'm totally open to suggestions and requests for edits/changes.

How do you want to proceed here? Should COMM_SPAWN and friends (at the MPI API level) do a similar show_help (i.e., show the "you're in an environment where dynamics are not supported" message) when they get back ERR_NOT_SUPPORTED from any of the DPM functions?

@rhc54
Copy link
Contributor

rhc54 commented Oct 8, 2015

Probably a quick call to decide on approach would help the most. As I said, we have two issues to consider:

  1. when the underlying transports cannot support dynamic addition of procs. Your method of setting a global flag is fine here, but it shouldn't apply to ops like MPI_Lookup as that has nothing to do with this limitation.
  2. when the underlying PMI doesn't support some operation. This is a little trickier as we require support for PMI_Lookup and friends for MPI_Comm_spawn. I don't have a fanatical commitment to the NULL checks, but I am concerned about checking for return codes. How do I tell the difference between the embedded PMI support not containing a function, vs the host RM not supporting that type of request?

For example, if I want to do a lookup_nb and get a "not supported" return, does that mean the embedded PMI code doesn't support the non-blocking version? Or does it mean that the host RM doesn't support lookup at all? Without some way of telling the difference, I wind up calling the blocking version just to get another rejection.

This wouldn't be a disaster, except that I do need to be careful then and check the error return to ensure it isn't something other than "not supported". Just gets a little verbose as compared to a quick NULL check.

Looking to the future, we are going to be extending the opal_pmix definition. If we follow the "always return something" approach, that means every component will have to be extended to provide "no-op" functions. I personally don't care as I won't be doing that work - but it begs the question of whether or not we want to require such effort.

Like I said, I'm open on this question - I honestly can't understand how @hppritcha observed what he stated, and wonder if we are trying to solve something that was caused by some other mistake.

@hppritcha
Copy link
Member

I recall that, when working off of rhc54's pmix branch, there were cases where there were unprotected
(in the sense of not checking for NULL) uses of the pmix interface by, I think, the hnp. Rather than getting something useful back, I'd just get an immediate segfault. I believe I saw a similar problem with the orted's starting up. I found that by defining the missing cray pmix functions and returning an error, it helped a lot. The orte startup code would actually fail with a useful error message.

I did not go back later to check with the final big pmix merge whether or not the code had changed enough that defining stub functions was no longer necessary for either avoiding the problem, or producing a helpful job launch failure message.

@jsquyres I think modulo rhc54's caveats about functions like MPI_Lookup we are saying your approach is good.

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2015

@jsquyres here are my changes for the PMIx support check - probably easier if you put them all into one commit.

jeff.txt

@jsquyres jsquyres force-pushed the pr/friendly-message-when-dynamics-disabled branch 2 times, most recently from 07e4f7e to 463dbfc Compare October 13, 2015 20:16
@jsquyres
Copy link
Member Author

@rhc54 New commits pushed, which include your patch. I actually have double protection in there for PMIX: it checks for the NULL (i.e., your patch), and it also checks for OPAL_ERR_NOT_SUPPORTED as a return from the dpm functions.

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2015

👍

@jsquyres
Copy link
Member Author

@bosilca More changes went in. Are you still happy?
@hppritcha Comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why are we returning the MPI_ERR_SPAWN errorcode instead OF OMPI_ERR_NOT_IMPLEMENTED or OMPI_ERR_NOT_SUPPORTED?
  2. Can we have the same nice output message as below ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Good point, should be ERR_NOT_IMPLEMENTED. I'll update.
  2. ompi_mpi_dynamics_is_enabled() will call show_help() if dynamics are not enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean “not supported”, yes?

On Oct 13, 2015, at 2:29 PM, Jeff Squyres notifications@github.com wrote:

In ompi/mpi/c/comm_accept.c #989 (comment):

@@ -89,6 +92,10 @@ int MPI_Comm_accept(const char *port_name, MPI_Info info, int root,
}
}

  • if (!ompi_mpi_dynamics_is_enabled(FUNC_NAME)) {
  •    return OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_SPAWN, FUNC_NAME);
    
  • }

/* parse info object. no prefedined values for this function in MPI-2
Good point, should be ERR_NOT_IMPLEMENTED. I'll update.
ompi_mpi_dynamics_is_enabled() will call show_help() if dynamics are not enabled.

Reply to this email directly or view it on GitHub https://github.com/open-mpi/ompi/pull/989/files#r41928358.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhc54 Yes, I mean "not supported". I just double checked the code, and I used "not supported" in all relevant places (not "not implemented").

(found your line comment in the outdated diff on github :-) )

@jsquyres jsquyres force-pushed the pr/friendly-message-when-dynamics-disabled branch from 463dbfc to 8c95833 Compare October 13, 2015 21:37
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the others process management functions please raise OMPI_ERR_NOT_SUPPORTED.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh; can't believe I missed that -- thx.

@jsquyres jsquyres force-pushed the pr/friendly-message-when-dynamics-disabled branch from 8c95833 to 69abe3d Compare October 14, 2015 10:22
@lanl-ompi
Copy link
Contributor

Test FAILed.

Add ompi_mpi_dynamics_disable() function to disable MPI dynamic
process functionality (i.e., such that if MPI_COMM_SPAWN/etc. are
invoked, you'll get a show_help error explaining that MPI dynamic
process functionality is disabled in this environment -- instead of a
potentially-cryptic network or hardware error).

Fixes open-mpi#984
Disable the MPI dynamic process functionality when these components
are selected to be used.
@jsquyres jsquyres force-pushed the pr/friendly-message-when-dynamics-disabled branch from 69abe3d to 889d80a Compare October 14, 2015 20:43
@jsquyres
Copy link
Member Author

bot:retest

@hppritcha
Copy link
Member

Looks good. Thanks for checking both panic cases- null function pointer as well as not supported error return value

jsquyres added a commit that referenced this pull request Oct 14, 2015
…cs-disabled

Print friendly message when dynamics disabled
@jsquyres jsquyres merged commit 8307330 into open-mpi:master Oct 14, 2015
@jsquyres jsquyres deleted the pr/friendly-message-when-dynamics-disabled branch October 14, 2015 23:52
jsquyres added a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add "MPI dynamics are not supported in this environment" functionality

5 participants