Using PMIx_Get() with timeout not working? #128
Comments
Should have this fixed here: open-mpi/ompi#1981
Here is the process I'm assuming you follow:
· Each proc PMIx_Put some data

If the data was not in the client, nor available on the local server, then you will get a "not found" error back. This means that either the proc never put/committed it, or the fence wasn't told to "collect all data".

An alternative approach would be to let PMIX_TIMEOUT=0 mean "never timeout", and instead pass a new boolean PMIX_IMMEDIATE key to indicate that you want an immediate response. I'm open to either solution, so feel free to suggest your preference.
As per email, we're not calling PMIx_Commit(); however, we are passing PMIX_COLLECT_DATA to PMIx_Fence(). Even after changing the code to call PMIx_Commit(), we're still seeing the same behaviour.
Just to clarify: are you getting the immediate response back that the data isn't found? So this issue is fixed, but the desired behavior to get the data that was pushed remains? Or are you saying that the immediate response behavior is still broken?
Check for PMIX_TIMEOUT directives in the fence, get, and connect operations. Have the local PMIx server enforce the timeout, notifying the requesting proc when timeout hits. If a proc terminates (either finalize or die) before a collective it participates in has completed, then notify all other participants as the collective will never complete.

NOTE: this still leaves a few "gaps":

* only local participants are informed of a collective failure due to a proc going away. Need to decide on a way of notifying remote participants, most likely via the event notification system

* if a proc terminates prior to the collective being defined (i.e., another participant having called it), then the server loses all knowledge that the terminating proc existed and the collective "succeeds". This -might- be okay and possibly desirable for resilience since an abnormal termination is something we notify about, but still a behavior worth noting.

Fixes openpmix#128
Fixes openpmix#96

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
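The server-side enforcement described in this commit message can be pictured with a small standalone model. This is only a sketch of the idea, not the actual PMIx server code: `pending_req_t`, `expire_requests`, and the `MODEL_*` constants are hypothetical, the convention that a deadline of 0 means "no timeout" is an assumption, and -16 is simply the PMIX_ERR_TIMEOUT value quoted in this issue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values for this model only. */
#define MODEL_PENDING       1
#define MODEL_ERR_TIMEOUT (-16)  /* value quoted in the issue for PMIX_ERR_TIMEOUT */

/* One outstanding fence/get/connect request from a local proc. */
typedef struct {
    int  requester;  /* rank of the local proc that is waiting */
    long deadline;   /* absolute time in seconds; 0 = no timeout (assumption) */
    int  status;     /* MODEL_PENDING until the request is resolved */
} pending_req_t;

/* Resolve every still-pending request whose deadline has passed by
 * marking it as timed out. Returns how many requests expired on this
 * call; a real server would notify each requester at this point. */
size_t expire_requests(pending_req_t *reqs, size_t n, long now)
{
    size_t expired = 0;
    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].status != MODEL_PENDING) {
            continue;  /* already answered or already timed out */
        }
        if (reqs[i].deadline > 0 && now >= reqs[i].deadline) {
            reqs[i].status = MODEL_ERR_TIMEOUT;
            expired++;
        }
    }
    return expired;
}
```

In this model the client never has to police its own clock: the server periodically calls something like `expire_requests` and the blocked caller is released with a timeout status rather than waiting indefinitely.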
We're using a couple of newish keys in our client, and occasionally we're seeing PMIx_Get() block, presumably because the key isn't found, due to either an old server or a bug. What we'd like to do in this case is detect it and abort correctly. We call a fence after the Put, so all gets should complete immediately.
I tried passing an info array into PMIx_Get() with PMIX_TIMEOUT set to either 0 or 1 as a PMIX_INT; however, this does not change the behaviour unless we attach and then detach with a debugger, in which case PMIx_Get() returns -107.
I would expect a return code of PMIX_ERR_TIMEOUT (-16) in this case. I would also like clarification on whether 0 is a valid timeout value to use here, given that I expect the key to exist.