Howard Pritchard edited this page Jun 25, 2018 · 1 revision

6/25/18 webex notes

Reviewing Ralph Castain's PMIx group extensions

Attending: Ralph, Dan, Howard

Dan has some questions about the fault-tolerance nature of these proposed functions. Ralph says some of this language is related to race conditions that arise when trying to create a group. Dan is thinking that the default for events should be false, i.e. processes do not get them unless they ask. Ralph is okay with this and will change the event solicitation keys to default to false.

Currently, all processes wanting to create a group need to call PMIx_Group_construct. There are blocking and non-blocking variants of this call. Discussed using invite/join versus having all processes call PMIx_Group_construct.

Where would we call this function? Discussed the mapping of this API to MPI Sessions. It really should be called at the comm-from-group function level.

The invite/join seems to map to connect/accept.

Ralph discusses motivation for the invite/join functionality.

Discussion of possible race conditions. This probably needs to bake some more. One case is a partially complete group construct. Could we leverage PMIx_Init to register an event handler/callback to begin with? Ralph thinks this provides much more flexibility. Discussed callbacks versus event handlers. Dan suggests the callback should always be invoked, but errors would go to the event handler (as would other events). Ralph talks about how event handlers are currently implemented in PMIx.

Decided we'd like the error cases routed through the event handler mechanism. This simplifies the callback function code. Idea: if an event handler returns an "I can't handle this" status to PMIx, PMIx would stop raising further error events for the given group operation. Ralph says this is how PMIx currently handles errors forwarded to event handlers.
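A toy model of the routing rule above, written in plain Python rather than against the PMIx API. Everything here (ToyDispatcher, HANDLED, CANNOT_HANDLE, the string error names) is made up for illustration; it only shows the idea that once a handler says it cannot handle an error, no further error events are raised for that group operation, while other operations are unaffected.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Hypothetical status codes for this sketch; these are not PMIx constants.
HANDLED = 0
CANNOT_HANDLE = 1


@dataclass
class ToyDispatcher:
    """Toy stand-in for the library side of error routing."""
    handler: Callable[[str, str], int]
    suppressed: Set[str] = field(default_factory=set)
    delivered: List[Tuple[str, str]] = field(default_factory=list)

    def raise_error(self, group_op: str, error: str) -> bool:
        """Route an error for group_op to the event handler.

        Returns True if the handler was invoked, False if error events
        for this group operation were already switched off.
        """
        if group_op in self.suppressed:
            return False
        self.delivered.append((group_op, error))
        if self.handler(group_op, error) == CANNOT_HANDLE:
            # Stop raising further error events for this operation only.
            self.suppressed.add(group_op)
        return True
```

With a handler that gives up on a "proc-died" error, a later error for the same group operation is dropped, while a different group operation still reaches the handler.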

Dan asks about the context in which the event handler is invoked. Ralph says the callback/event handler is invoked in the PMIx progress thread. The thread can't block itself while it's in the event handler, so it is illegal to invoke a blocking PMIx function inside one of these event handlers. Dan was thinking of a failing group construct: the event handler is invoked if a process dies, and then tries to invite another process to join. Ralph says the event handler could initiate a non-blocking invite, but not a blocking invite. It could alternatively use the thread-shifting trick. Dan's idea of a blocking group create, with event handlers doing patch-up under the covers to handle failures, is likely what we'd use in an MPI Sessions implementation.
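A minimal sketch of the thread-shifting trick mentioned above, again as a plain Python toy rather than PMIx code. The names ThreadShifter, blocking_invite and on_proc_failed are hypothetical, and blocking_invite only sleeps to stand in for a blocking call. The point is that the handler, running on the progress thread, just queues the work and returns; a separate thread performs the blocking operation.

```python
import queue
import threading
import time


class ThreadShifter:
    """Runs deferred work on its own thread so handlers can return at once."""

    def __init__(self):
        self._work = queue.Queue()  # unbounded, so put() does not block
        self.results = []
        self._thread = threading.Thread(
            target=self._run, name="worker", daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:  # sentinel: shut down
                break
            func, arg = item
            self.results.append((threading.current_thread().name, func(arg)))

    def shift(self, func, arg):
        """Hand func(arg) to the worker thread and return immediately."""
        self._work.put((func, arg))

    def close(self):
        """Drain queued work, then stop the worker thread."""
        self._work.put(None)
        self._thread.join()


def blocking_invite(replacement):
    # Hypothetical stand-in for a blocking invite; just sleeps briefly.
    time.sleep(0.01)
    return f"invited {replacement}"


def on_proc_failed(shifter, replacement):
    """Event handler: runs on the progress thread, so it must not block."""
    shifter.shift(blocking_invite, replacement)
    return threading.current_thread().name
```

The handler returns as soon as the work is queued, and the recorded thread name shows that the blocking call ran on the worker thread, not on the thread that invoked the handler.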
