hints for implementing PMIx in Flux resource manager #60
Hi Jim, I would recommend treating PMIx as an external dependency. I think you'll find that the server convenience library makes integration really easy and fast. Basically, all you have to do is:
You'll see a "test" directory in the code - the various "client" tests in there can be used to check your server, and the programs in the "examples" directory are useful as well. You can also see the reference implementation of a PMIx RM server in OpenMPI itself, in the orte/orted/pmix directory, as we use it when running under mpiexec. HTH - happy to talk to you next week when I'm there if that would help.
Thanks Ralph, will do!
API question:
is there some way for me to retrieve my own context from within a callback? The API seems to be encouraging me to define RM/program state globally, but maybe I've missed the secret handshake for passing state through.
We just extended the single-proc APIs so you could get back both the proc object and a broader RM state object without having to package them together just for this call. HTH
Sorry, I mean in this callback (as one example):
how can I retrieve something I passed in, to avoid having to define it globally?
I think there may be some general confusion here, as there is nothing the host RM can "pass in" to a fence operation. When a client calls "fence_nb", a message is sent to the local PMIx server indicating that this client called "fence_nb". The local PMIx server looks at the procs that are going to participate in this collective (as given by the client), and sets up a tracker to count contributions from any other participating local clients. Once all the participating local clients have contributed, the local PMIx server calls up to the host RM's fence function. When the host RM completes the collective and calls the given modex_cbfunc, it passes in the cbdata it was given by the local PMIx server. It can also pass its own callback function and void* that the modex cbfunc will call when the local PMIx server has finished notifying the local clients that the collective is complete. HTH
Right, it's the tracker state that I've got no way to store between calls without some global state to refer to. I was hoping I could say "here's some flux state" at some point, and then in callbacks get it passed back to me, much as the PMIx server thread is getting its own state back in cbdata above.
But there is no interaction between the host RM and the local PMIx server during the course of the collective. The local PMIx server only up-calls the host RM once, when it has collected all the calls from the local participants. Then there is no further interaction until the host RM calls down to tell us that the collective is complete. So there is no way you could pass us the tracker - you don't call us at any time during the collective.
Sorry, I'm really not getting my point across. My RM code has no global state. From within the RM fence_nb callback, I can't store info in a tracker because I have no way to make RM calls or update state from that function without a pointer to my RM context. This is what I mean by the PMIx API seeming to require me to define my own state globally. This seems to be what all the examples/tests in the PMIx code base do, and what the slurm pmix plugin does.
I think I'm beginning to grok - apologize for the density of the old brain. So let me ask you this: how do you perform the collective across the nodes that have participants in the fence? When you receive a message, how do you know which fence it belongs to?

You could use the "put" and "get" routines to store/fetch your object, marking the data as "local" and defining an appropriate key for it. For example, you could take the proc array from the fence call and print it in some fashion to create a string key (or just generate an arbitrary one), and then store your tracker object with:

PMIx_Put(PMIX_LOCAL, "unique string", *tracker);

You could then "tag" the fence messages with the identifier, and retrieve your tracker whenever you wanted using the PMIx_Get function. The data will just stay local on the PMIx server.

Alternatively, we could perform the collective for you. Several members have proposed offering this capability as an optional feature, and maybe that's what would help here? You might still need to store some kind of object for other purposes (e.g., an allocation request), but it would take care of the collective use-case.
Note that the "tracker" object has to be wrapped inside of a pmix_value_t, but that's no problem - the value_t has a pointer object field for just this purpose.
That might be the ticket - thanks! Sorry if I wasn't being very clear. I didn't mean Flux has no global state, just that its state is not accessible via global pointers. We have a distributed KVS with a collective fence operation that we'll use initially for this. So it's a relatively minor point I'm asking about - how to get hold of my own context again from a PMIx callback.
I'm assuming that Flux context along with the nspace:rank from the callback will be sufficient to identify the program and rank participating in the PMI fence.
Ah, okay - thanks for clarifying!
Another question: does every rank call the fence, or is it called only once? And what does the wildcard rank value mean in the proc array?
On the client side, every rank calls it. The host RM, however, only gets called once, when all the local clients have called. The value you showed is the wildcard rank, indicating that all ranks are participating.
Got it - thanks!
@rhc54 - Darn, I was hoping that the fix to #85 was going to be something that explained why my prototype pmix server hangs when I scale up to about size=16, especially since my "client" is using the PMI-1 interfaces. My "server" is a program used to start flux on a single node.
I do note that I can still make the libpmix tests hang as in #85 by scaling up to around 256 clients. That fails for pmi-1, pmi-2, and pmi-x. I can open a separate issue on that if you like, though I'm guessing that is not related to this problem as I can't get close to that scale. I've tried linking my server against both pmix-1.1.4rc2 and latest master bde4da1, same result. By the way, using my PMIv1 wire protocol server I can launch Flux at about size=512 on my desktop, so I don't think it's something unrelated. This is most likely my problem, so feel free to ignore this message. Just wanted to give status and make sure you didn't have some useful hint off the top of your head. Thanks.
Have you tried setting the PMIX_DEBUG envar to get debug output? It will tell you what is happening on both client and server sides. The verbosity level is set by the value of the envar - if you get up to 20, you're going to get barraged by all the data packing/unpacking calls, so you might want to stay under 10. It sounds like something may be getting lost in a thread-shift - I'm not sure that the pmix_test program was fully debugged for that purpose. You might look at the test/simple/simptest.c code for an alternative implementation of a little server, or you can see the full reference implementation by checking out the OMPI master and looking in the orte/orted/pmix directory.
Good suggestion. Well, in a hanging 16 rank test at PMIX_DEBUG=2, it seems the server-side fence debug output never appears.
FWIW: once the PMIx server has recvd all the local client inputs, it will call-up to your server. I suspect your server isn't realizing that it needs to thread-shift to its own thread, and callback down with the "fence globally complete" response into the PMIx server. If you don't thread-shift, then you'll block.
Well, I'm not thread shifting, but only because all clients are connected to one server, so any call to the completion callback can be made directly from the handler:

```c
/* PMIx will gather data from clients and aggregate into 'data'.
 * Since all procs are local in this instance, we simply make the 'cbfunc'
 * call with what was passed to us and we're done.
 */
pmix_status_t x_fencenb (const pmix_proc_t procs[], size_t nprocs,
                         const pmix_info_t info[], size_t ninfo,
                         char *data, size_t ndata,
                         pmix_modex_cbfunc_t cbfunc, void *cbdata)
{
    struct x_server *xs = x_global_state;
    if (xs->verbose)
        msg ("%s", __FUNCTION__);
    assert (nprocs == 1);
    assert (procs[0].rank == PMIX_RANK_WILDCARD); /* all ranks participating */
    if (cbfunc)
        cbfunc (PMIX_SUCCESS, data, ndata, cbdata, NULL, NULL);
    return PMIX_SUCCESS;
}
```

I don't get that debug message either, incidentally...
Hmmm...so the PMIx server isn't executing the up-call? I'd check to see that the registration correctly identified that there are 16 local clients. Might need to add some further debug to see what the PMIx server thinks it has - I thought we output that (recvd x of N clients), but maybe not?
Ah! It looks like all the client registrations need to complete before the clients connect. I noticed that in a successful run, tracing showed all the registrations completed before connection, and in a hanging run, connections came in early.
It seems to work once I do that; upwards of that scale, it looks like I run out of file descriptors. Not sure if this is the correct way to do things, or if this exposes a race in the PMIx server code.
Looking a bit closer at the failing logs, it seems the clients have not completed registration before the first client enters the fence.
We can reopen this when Flux starts looking at PMIx again.
I'm tentatively interested in implementing PMIx in the Flux resource manager (still in early development).
Should I treat this repo as a reference implementation and try to build a new implementation of PMIx from scratch using the public headers, or treat libpmix as an external dependency and implement the RM end of things with help from this implementation, as seems to have been done in slurm?
Are there compliance tests of sorts here, or is launching OpenMPI built for PMIx the acid test that an implementation works at this point?
Just looking for a very high level roadmap of the porting/implementing process to get me started in the right direction. Thanks!