hints for implementing PMIx in Flux resource manager #60

Closed
garlick opened this issue Mar 14, 2016 · 26 comments

@garlick
Contributor

garlick commented Mar 14, 2016

I'm tentatively interested in implementing PMIx in the Flux resource manager (still in early development).

Should I treat this repo as a reference implementation and try to build a new implementation of PMIx from scratch using the public headers, or treat libpmix as an external dependency and implement the RM end of things with help from this implementation, as seems to have been done in slurm?

Are there compliance tests of sorts here, or is launching OpenMPI built for PMIx the acid test that an implementation works at this point?

Just looking for a very high level roadmap of the porting/implementing process to get me started in the right direction. Thanks!

@rhc54
Contributor

rhc54 commented Mar 14, 2016

Hi Jim

I would recommend treating PMIx as an external dependency. I think you'll find that the server convenience library makes integration really easy and fast. Basically, all you have to do is:

  1. call "server_init" to start up the convenience library, providing a struct of callback functions we use when the client requests support from Flux
  2. call "register_nspace" for each job prior to launching any local procs for that job. You provide a set of information to tell us about the job - you can find the list of desired info here
  3. call "register_client" for each process prior to starting it
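
To make those steps concrete, here is a condensed sketch against the PMIx 1.x-era server API (later releases add info arguments to some of these calls). The nspace name "FLUX-JOB", the particular info keys, and the flux_pmix_start wrapper are illustrative placeholders rather than anything Flux or PMIx mandates, and the host callback implementations are omitted:

#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <pmix_server.h>

/* Host RM callback table; real callback implementations (fence_nb, abort,
 * etc.) are omitted from this sketch, an example fence_nb appears later
 * in this thread. */
static pmix_server_module_t flux_module = {
    .fence_nb = NULL,
};

int flux_pmix_start (uint32_t size, uid_t uid, gid_t gid)
{
    /* 1. start the server convenience library */
    if (PMIx_server_init (&flux_module) != PMIX_SUCCESS)
        return -1;

    /* 2. describe the job before launching any of its local procs
     *    (completion callbacks omitted for brevity) */
    pmix_info_t *info;
    PMIX_INFO_CREATE (info, 2);
    (void)strncpy (info[0].key, PMIX_UNIV_SIZE, PMIX_MAX_KEYLEN);
    info[0].value.type = PMIX_UINT32;
    info[0].value.data.uint32 = size;
    (void)strncpy (info[1].key, PMIX_LOCAL_SIZE, PMIX_MAX_KEYLEN);
    info[1].value.type = PMIX_UINT32;
    info[1].value.data.uint32 = size;      /* all procs local in this sketch */
    PMIx_server_register_nspace ("FLUX-JOB", size, info, 2, NULL, NULL);
    PMIX_INFO_FREE (info, 2);

    /* 3. register each local client before starting it */
    for (uint32_t rank = 0; rank < size; rank++) {
        pmix_proc_t proc;
        (void)strncpy (proc.nspace, "FLUX-JOB", PMIX_MAX_NSLEN);
        proc.rank = rank;
        PMIx_server_register_client (&proc, uid, gid, NULL, NULL, NULL);
    }
    return 0;
}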

You'll see a "test" directory in the code - the various "client" tests in there can be used to check your server. You can also use the programs in the "examples" directory, or look at the reference implementation of a PMIx RM server in OpenMPI itself, in the orte/orted/pmix directory, as we use it when running under mpiexec.

HTH - happy to talk to you next week when I'm there if that would help.

@garlick
Contributor Author

garlick commented Mar 14, 2016

Thanks Ralph, will do!

@garlick
Contributor Author

garlick commented May 5, 2016

API question:

pmix_status_t PMIx_server_register_client(const pmix_proc_t *proc,
                                          uid_t uid, gid_t gid,
                                          void *server_object,
                                          pmix_op_cbfunc_t cbfunc, void *cbdata);

is there some way for me to retrieve server_object from server callbacks other than connected(), finalized(), and abort()? For example, how can I get this from fencenb()?

The API seems to be encouraging me to define RM/program state globally but maybe I've missed the secret handshake for passing state through.

@rhc54
Contributor

rhc54 commented May 5, 2016

The only reason we explicitly pass/return a server_object in those specific APIs is that they refer to a single process, and so the object can relate to something associated with that proc. In all cases, you can pass in a void*, and it will be returned to you in the callback function.

We just extended the single-proc APIs so you could get back both the proc object and a broader RM state object without having to package them together just for this call.

HTH

@garlick
Contributor Author

garlick commented May 5, 2016

Sorry, I mean in this callback (as one example):

pmix_status_t pmix_server_fencenb_fn(const pmix_proc_t procs[], size_t nprocs,
                                     const pmix_info_t info[], size_t ninfo,
                                     char *data, size_t ndata,
                                     pmix_modex_cbfunc_t cbfunc, void *cbdata);

how can I retrieve something I passed in to avoid having to define it globally? Not necessarily server_object.

@rhc54
Contributor

rhc54 commented May 5, 2016

I think there may be some general confusion here as there is nothing the host RM can "pass in" to a fence operation. When a client calls "fence_nb", a message is sent to the local PMIx server indicating that this client called "fence_nb". The local PMIx server looks at the procs that are going to participate in this collective (as given by the client), and sets up a tracker to count contributions from any other participating local clients.

Once all the participating local clients have contributed, then the local PMIx server calls the host RM's pmix_server_fencenb_fn, passing in the array of participating procs, etc. The cbdata passed is the void* that the local PMIx server wants to have passed back when the RM completes the collective and calls the given pmix_modex_cbfunc_t function.

When the host RM completes the collective and calls the given modex_cbfunc, then it passes in the cbdata it was given by the local PMIx server. It also can pass its own callback function and void* that the modex cbfunc will call (and return) when the local PMIx server is done with notifying the local clients that the collective was completed.
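
For illustration only, here is a small sketch of that hand-off under the assumption that the host RM parks the callback in its own request object while it runs a cross-node collective; fence_req, flux_fencenb, start_collective, and collective_done are hypothetical names, not part of the PMIx API:

#include <stdlib.h>
#include <pmix_server.h>

struct fence_req {                   /* hypothetical RM-side tracker */
    pmix_modex_cbfunc_t cbfunc;      /* handed to us by the PMIx server */
    void *cbdata;                    /* opaque; must be returned unchanged */
};

/* host RM fence_nb handler: stash the callback, kick off the RM's own
 * cross-node exchange (not shown), and return without blocking */
static pmix_status_t flux_fencenb (const pmix_proc_t procs[], size_t nprocs,
                                   const pmix_info_t info[], size_t ninfo,
                                   char *data, size_t ndata,
                                   pmix_modex_cbfunc_t cbfunc, void *cbdata)
{
    struct fence_req *req = calloc (1, sizeof (*req));
    if (!req)
        return PMIX_ERROR;
    req->cbfunc = cbfunc;
    req->cbdata = cbdata;
    /* start_collective (req, data, ndata);   hypothetical RM machinery */
    return PMIX_SUCCESS;
}

/* called by the RM when its collective completes */
static void collective_done (struct fence_req *req, char *data, size_t ndata)
{
    /* hand the PMIx server back its own cbdata, exactly as it was given */
    if (req->cbfunc)
        req->cbfunc (PMIX_SUCCESS, data, ndata, req->cbdata, NULL, NULL);
    free (req);
}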

HTH

@garlick
Contributor Author

garlick commented May 5, 2016

Right, it's the tracker state that I've got no way to store between calls without some global state to refer to. I was hoping I could say "here's some flux state" at some point, and then in callbacks, get it passed back to me much as the PMIx server thread is getting its own state back in cbdata above.

@rhc54
Contributor

rhc54 commented May 5, 2016

But there is no interaction between the host RM and the local PMIx server during the course of the collective. The local PMIx server only up-calls the host RM once, when it has collected all the calls from the local participants. Then there is no further interaction until the host RM calls down to tell us that the collective is complete.

So there is no way you could pass us the tracker - you don't call us any time during the collective.

@garlick
Contributor Author

garlick commented May 5, 2016

Sorry, I'm really not getting my point across. My RM code has no global state. From within the RM fence_nb callback, I can't store info in a tracker because I have no way to make RM calls or update state from that function without a pointer to my RM context.

This is what I mean when I say the PMIx API seems to require me to define my own state globally. This seems to be what all the examples/tests in the PMIx code base do, and what the slurm pmix plugin does.

@rhc54
Contributor

rhc54 commented May 6, 2016

I think I'm beginning to grok - apologies for the density of the old brain. So let me ask you this: how do you perform the collective across the nodes that have participants in the fence? When you receive a message, how do you know which fence it belongs to?

You could use the "put" and "get" routines to store/fetch your object, marking the data as "local" and defining an appropriate key for it. For example, you could take the proc array from the fence call and print it in some fashion to create a string key (or just generate an arbitrary one), and then store your tracker object with:

PMIx_Put(PMIX_LOCAL, "unique string", *tracker);

You could then "tag" the fence messages with the identifier, and retrieve your tracker whenever you wanted using the PMIx_Get function. The data will just stay local on the PMIx server.

Alternatively, we could perform the collective for you. Several members have proposed offering this capability as an optional feature, and maybe that's what would help here? You might still need to store some kind of object for other purposes (e.g., an allocation request), but it would take care of the collective use-case.

@rhc54
Contributor

rhc54 commented May 6, 2016

Note that the "tracker" object has to be wrapped inside of a pmix_value_t, but that's no problem - the value_t has a pointer object field for just this purpose.

@garlick
Contributor Author

garlick commented May 6, 2016

That might be the ticket - thanks! Sorry if I wasn't being very clear.

I didn't mean Flux has no global state, just that its state is not accessible via global pointers. We have a distributed KVS with a collective fence operation that we'll use initially for this. So it's a relatively minor point I'm asking about - how to get hold of my own context again from a PMIx callback.

When you receive a message, how do you know which fence it belongs to?

I'm assuming that Flux context along with the nspace:rank from the callback will be sufficient to identify the program and rank participating in the PMI fence.

@rhc54
Contributor

rhc54 commented May 6, 2016

Ah, okay - thanks for clarifying!

@garlick
Contributor Author

garlick commented May 6, 2016

Another question: does fencenb() get called once per rank or once per set of ranks on the local node?

What does it mean in fencenb() when procs[0].rank == 2147483646 (INT32_MAX - 1)?

@rhc54
Contributor

rhc54 commented May 6, 2016

On the client side, every rank calls it. The host RM, however, only gets called once when all the local clients have called.

The value you showed is PMIX_RANK_WILDCARD, which indicates that all procs in that nspace will be participating.

@garlick
Contributor Author

garlick commented May 6, 2016

Got it - thanks!

@garlick
Contributor Author

garlick commented May 12, 2016

@rhc54 - Darn, I was hoping that the fix to #85 was going to be something that explained why my prototype pmix server hangs when I scale up to about size=16, especially since my "client" is using the PMI-1 interfaces in libpmix.so. It works fine at smaller sizes.

I'm getting the connected() callback in my server from the right number of clients, and they are all blocked in a call to PMI_Barrier(), or rather polling in PMIx_Fence(), but I never get the fence_nb() callback on the server end.

My "server" is a program used to start flux on a single node, so it is structured very much like pmix_test. Any quick thoughts on how to find out what's going on in the pmix service thread? Backtrace looks like it's just waiting for events.

(gdb) bt
#0  0x00007f8ad8a16153 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f8ad7140138 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#2  0x00007f8ad712b38a in event_base_loop ()
   from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#3  0x00007f8ad8cfd1a7 in progress_engine (obj=0x25eef10)
    at src/util/progress_threads.c:52
#4  0x00007f8ad8f5b6fa in start_thread (arg=0x7f8ad5972700)
    at pthread_create.c:333
#5  0x00007f8ad8a15b5d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I do note that I can still make the libpmix tests hang as in #85 by scaling up to around 256 clients. That fails for pmi-1, pmi-2, and pmi-x. I can open a separate issue on that if you like, though I'm guessing that is not related to this problem as I can't get close to that scale.

I've tried linking my server against both pmix-1.1.4rc2 and latest master bde4da1, same result. By the way, using my PMI v1 wire protocol server I can launch Flux at about size=512 on my desktop, so I don't think it's something unrelated.

This is most likely my problem, so feel free to ignore this message. Just wanted to give status and make sure you didn't have some useful hint off the top of your head. Thanks.

@rhc54
Contributor

rhc54 commented May 12, 2016

Have you tried setting the PMIX_DEBUG envar to get debug output? It will tell you what is happening on both client and server sides. The verbosity level is set by the value of the envar - if you get up to 20, you're going to get barraged by all the data packing/unpacking calls, so you might want to stay under 10.

It sounds like something may be getting lost in a thread-shift - I'm not sure that the pmix_test program was fully debugged for that purpose. You might look at the test/simple/simptest.c code for an alternative implementation of a little server, or you can see the full reference implementation by checking out the OMPI master and looking in the orte/orted/pmix directory.

@garlick
Contributor Author

garlick commented May 12, 2016

Good suggestion. Well, in a hanging 16 rank test at PMIX_DEBUG=2, it seems pmix_server_fence() gets called 16 times ("recvd fence with 1 procs") but the fence never completes (there is no "fence complete"). So that narrows it down I guess.

@rhc54
Contributor

rhc54 commented May 12, 2016

FWIW: once the PMIx server has recvd all the local client inputs, it will call up to your server. I suspect your server isn't realizing that it needs to thread-shift to its own thread, and then call back down into the PMIx server with the "fence globally complete" response. If you don't thread-shift, then you'll block.

@garlick
Contributor Author

garlick commented May 12, 2016

Well, I'm not thread shifting, but only because all clients are connected to one server, so by the time fencenb() is called the fence is already complete. I call the cbfunc() directly from within the fencenb() RM callback to signal completion, which I assume is correct since that's what the tests do?

/* PMIx will gather data from clients and aggregate into 'data'.
 * Since all procs are local in this instance, we simply make the 'cbfunc'
 * call with what was passed to us and we're done.
 */
pmix_status_t x_fencenb (const pmix_proc_t procs[], size_t nprocs,
                         const pmix_info_t info[], size_t ninfo,
                         char *data, size_t ndata,
                         pmix_modex_cbfunc_t cbfunc, void *cbdata)
{
    struct x_server *xs = x_global_state;
    if (xs->verbose)
        msg ("%s", __FUNCTION__);
    assert (nprocs == 1);
    assert (procs[0].rank == PMIX_RANK_WILDCARD); /* all ranks participating */
    if (cbfunc)
        cbfunc (PMIX_SUCCESS, data, ndata, cbdata, NULL, NULL);
    return PMIX_SUCCESS;
}

I don't get that debug message either, incidentally...

@rhc54
Contributor

rhc54 commented May 12, 2016

Hmmm...so the PMIx server isn't executing the up-call? I'd check to see that the registration correctly identified that there are 16 local clients. Might need to add some further debug to see what the PMIx server thinks it has - I thought we output that (recvd x of N clients), but maybe not?

@garlick
Contributor Author

garlick commented May 12, 2016

Ah! Making all the PMIx_server_register_client() calls before launching any processes seems to have made the problem go away.

I noticed that in a successful run, tracing showed all the registrations completed before connection, and in a hanging run, connections came in early.

@garlick
Contributor Author

garlick commented May 12, 2016

I do note that I can still make the libpmix tests hang as in #85 by scaling up to around 256 clients. That fails for pmi-1, pmi-2, and pmi-x. I can open a separate issue on that if you like, though I'm guessing that is not related to this problem as I can't get close to that scale.

It seems pmix_test also registers processes as it launches them. When I "fixed" that, I was able to scale much higher, e.g.:

$ ./pmix_test -n 800 -e ./pmix_client
pmix_test.c:main: Test finished OK!

Upwards of that, it looks like I run out of file descriptors.

Not sure if this is the correct way to do things, or if this exposes a race in the PMIx server code.

@garlick
Contributor Author

garlick commented May 12, 2016

Looking a bit closer at the failing logs, it seems the clients have not completed registration before the first client enters the fence:

[jimbo:59144] recvd FENCE
[jimbo:59144] recvd fence with 1 procs
[jimbo:59144] get_tracker called with 1 procs
[jimbo:59144] new_tracker called with 1 procs
[jimbo:59144] get_tracker called with 1 procs
[jimbo:59144] adding new tracker with 1 procs
[jimbo:59144] new_tracker: all clients not registered nspace FLUX-START
[snip]
[jimbo:59144] pmix:server setup_fork for nspace FLUX-START rank 12
[jimbo:59144] pmix:server register client FLUX-START:12
[jimbo:59144] pmix:server _register_client for nspace FLUX-START rank 12

@rhc54
Contributor

rhc54 commented Mar 29, 2018

We can reopen this when Flux starts looking at PMIx again.

rhc54 closed this as completed Mar 29, 2018