Using PMIx_Get() with timeout not working? #128

Closed
ashleypittman opened this issue Aug 16, 2016 · 4 comments

Comments

@ashleypittman
Contributor

We're using a couple of newish keys in our client, and occasionally we see PMIx_Get() block, presumably because the key isn't found, due to either an old server or a bug. In this case we'd like to detect the failure and abort cleanly. We call a fence after Put, so all gets should complete immediately.

I tried passing an info array into PMIx_Get() with PMIX_TIMEOUT set to either 0 or 1 as a PMIX_INT; however, this does not change the behaviour unless we attach and then detach a debugger, in which case PMIx_Get() returns -107.

I would expect a return code of PMIX_ERR_TIMEOUT (-16) in this case. I would also like clarification on whether 0 is a valid timeout value here, given that I expect the key to exist.

@rhc54
Contributor

rhc54 commented Aug 17, 2016

Should have this fixed here: open-mpi/ompi#1981

@rhc54
Contributor

rhc54 commented Aug 18, 2016

Here is the process I'm assuming you follow:

· Each proc PMIx_Put some data
· Each proc PMIx_Commit the data that it put
· Each proc PMIx_Fence and blocks until that completes
· Each proc calls PMIx_Get on the data for some other process, passing in PMIX_TIMEOUT=0 to indicate that it wants an immediate error return if the data isn’t found either in itself or on its local PMIx server. You don’t want the server to ask the RM to try and fetch it.

If the data was not in the client, nor available on the local server, then you will get a "not found" error back. This means that either the proc never put/commit it, or the fence wasn’t told to “collect all data”.
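To make the reasoning above concrete, here is a minimal toy model of the put / commit / fence / get sequence. It does not call PMIx at all: every `model_*` name, the `MODEL_*` constants, and the one-server-per-process simplification are assumptions made purely for illustration, and the real library may behave differently.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_NPROCS        2
#define MODEL_SUCCESS       0
#define MODEL_ERR_NOT_FOUND (-1)   /* hypothetical code, not a PMIx constant */

/* Hypothetical per-process state; assumes one local server per process. */
typedef struct {
    bool staged;                   /* value written by model_put, still client-side */
    bool committed;                /* value pushed to the local server */
    int  value;
    bool visible[MODEL_NPROCS];    /* whose data this proc's server holds */
} model_proc_t;

static model_proc_t procs[MODEL_NPROCS];

void model_reset(void) { memset(procs, 0, sizeof(procs)); }

/* Stage a value in the client. */
void model_put(int rank, int value)
{
    procs[rank].value = value;
    procs[rank].staged = true;
}

/* Push staged data to the local server. */
void model_commit(int rank)
{
    if (procs[rank].staged) {
        procs[rank].committed = true;
    }
}

/* Fence: committed data reaches other servers only if collect_data is set. */
void model_fence(bool collect_data)
{
    for (int r = 0; r < MODEL_NPROCS; r++) {
        for (int peer = 0; peer < MODEL_NPROCS; peer++) {
            procs[r].visible[peer] =
                procs[peer].committed && (collect_data || peer == r);
        }
    }
}

/* Get with "timeout 0" semantics: answer from local knowledge only. */
int model_get_immediate(int rank, int peer, int *out)
{
    if (!procs[rank].visible[peer]) {
        return MODEL_ERR_NOT_FOUND;
    }
    *out = procs[peer].value;
    return MODEL_SUCCESS;
}
```

In this model, rank 1 gets "not found" for rank 0's data if rank 0 skipped the commit, or if the fence ran without collecting data; only put, commit, and a collecting fence together make the get succeed.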

An alternative approach would be to let PMIX_TIMEOUT=0 mean "never time out", and instead pass a new boolean PMIX_IMMEDIATE key to indicate that you want an immediate response. I'm open to either solution, so feel free to suggest your preference.
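The difference between the two proposals can be sketched as two small decision functions. This is only an illustration of the semantics under discussion: the type and function names are hypothetical, and treating a missing timeout as "wait forever" is an assumption based on the blocking behaviour reported above.

```c
#include <stdbool.h>

/* Hypothetical outcome of interpreting the caller's directives. */
typedef enum {
    WAIT_FOREVER,
    RETURN_IMMEDIATELY,
    WAIT_WITH_DEADLINE
} wait_policy_t;

/* Option 1: a timeout of 0 means "return an error right away". */
wait_policy_t policy_timeout_zero_is_immediate(bool has_timeout, int timeout_secs)
{
    if (!has_timeout) {
        return WAIT_FOREVER;          /* assumption: no directive means block */
    }
    return timeout_secs == 0 ? RETURN_IMMEDIATELY : WAIT_WITH_DEADLINE;
}

/* Option 2: a timeout of 0 means "never time out"; a separate boolean
 * flag requests the immediate response. */
wait_policy_t policy_with_immediate_flag(bool has_timeout, int timeout_secs,
                                         bool immediate)
{
    if (immediate) {
        return RETURN_IMMEDIATELY;
    }
    if (!has_timeout || timeout_secs == 0) {
        return WAIT_FOREVER;
    }
    return WAIT_WITH_DEADLINE;
}
```

Under option 1 the caller cannot explicitly ask for "no timeout" by value; under option 2 that request is spelled as 0, at the cost of one extra key for the immediate case.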

@ashleypittman
Contributor Author

As per email, we're not calling PMIx_Commit(); however, we are passing PMIX_COLLECT_DATA to PMIx_Fence(). Even after changing the code to call PMIx_Commit(), we still see the same behaviour.

@rhc54
Contributor

rhc54 commented Aug 18, 2016

Just to clarify: are you getting the immediate response back that the data isn't found? So this issue is fixed, but the desired behavior to get the data that was pushed remains?

Or are you saying that the immediate response behavior is still broken?

rhc54 pushed a commit to rhc54/openpmix that referenced this issue Jan 3, 2018
Check for PMIX_TIMEOUT directives in the fence, get, and connect operations. Have the local PMIx server enforce the timeout, notifying the requesting proc when timeout hits. If a proc terminates (either finalize or die) before a collective it participates in has completed, then notify all other participants as the collective will never complete.

NOTE: this still leaves a few "gaps":
* only local participants are informed of a collective failure due to a proc going away. Need to decide on a way of notifying remote participants, most likely via the event notification system

* if a proc terminates prior to the collective being defined (i.e., another participant having called it), then the server loses all knowledge that the terminating proc existed and the collective "succeeds". This -might- be okay and possibly desirable for resilience since an abnormal termination is something we notify about, but still a behavior worth noting.

Fixes openpmix#128
Fixes openpmix#96

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@rhc54 rhc54 closed this as completed in #627 Jan 3, 2018
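
The server-side enforcement described in the commit message can be pictured with a small model: the server keeps a deadline per pending request and, on each pass, expires the ones that are overdue so their requesters can be notified. This is a sketch under stated assumptions, not the code from the referenced commit; the struct, enum, and function names are hypothetical, and time is reduced to a plain integer.

```c
#include <stddef.h>

/* Hypothetical request states tracked by the local server. */
typedef enum { REQ_PENDING, REQ_COMPLETE, REQ_TIMED_OUT } req_state_t;

typedef struct {
    long        deadline;   /* absolute time in seconds; negative = no timeout */
    req_state_t state;
} model_request_t;

/* Mark every pending request whose deadline has passed as timed out.
 * Returns the number of requesters that would be notified on this pass. */
size_t model_enforce_timeouts(model_request_t *reqs, size_t n, long now)
{
    size_t notified = 0;
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].state == REQ_PENDING &&
            reqs[i].deadline >= 0 &&
            now >= reqs[i].deadline) {
            reqs[i].state = REQ_TIMED_OUT;
            notified++;
        }
    }
    return notified;
}
```

Requests without a deadline, and requests that already completed, are left untouched, so calling the function repeatedly never notifies the same requester twice.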