Up the submodule pointers for PMIx and PRRTE #12353
Conversation
Thanks for the PR. I'm running AWS CI.
This PR failed AWS internal CI. Seeing a lot of failures.
I'm running it again.
No immediate ideas - it is working fine for me, and obviously passed all your standard CIs. I'd have to know more about the particular setup to provide suggestions on what you might try.
@wenduwan Any update on this? I'm still unable to reproduce any problems.
I ran our tests again with many failures as shown above. I haven't had a chance to look into that yet. A quick glance shows
Finally I got some time to look into this. The issue happens on 2 nodes during MPI_Init.
Ignore
@wenduwan Pushed the latest state of the master branches - there have been a number of fixes since this was originally created. Converted to "draft" to ensure nobody merges this by mistake.
Unfortunately the same tests are still failing...
Afraid you aren't giving me much to work off of here 😞 I did see this from your printout above:
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>
You mentioned that
@rhc54 In our CI we build applications separately for MPI with debug vs non-debug, so this shouldn't be an issue. @hppritcha I wonder if someone on your side could quickly verify this PR with OMB/IMB on 2 nodes?
You don't need to run a bloody OMB test when things are failing in MPI_Init - just run MPI "hello". I can reproduce it with
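(For anyone following along, a minimal MPI "hello" of the kind mentioned above might look like the sketch below. This is illustrative only, not the exact reproducer used in this thread; launching it across two nodes exercises the MPI_Init path that is failing here.)

```c
/* Minimal MPI "hello" - an illustrative sketch, not the exact reproducer
 * used in this PR. Running it across two nodes exercises MPI_Init. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
```

With Open MPI this could be launched across two nodes with something like `mpirun -np 2 --host node1,node2 ./hello` (the hostnames here are placeholders).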
Okay, I tracked it down and fixed it. Hopefully okay now!
Thank you Ralph. The test has passed.
Hooray! Anyone have an idea on what mpi4py is complaining about?
Probably related to #12384
I've got the alternative "spawn" code working, but the MPI message between the parent and a child process seems to be hanging or isn't getting through. I checked the modex recv calls and the btl/tcp connection info is all getting correctly transferred between all the procs (both parent and child). So I'm a little stumped. Is there a simple way to trace the MPI send/recv procedure to see where the hang might be? Since the modex gets completed, I'm thinking that maybe the communicator isn't being fully constructed (since I eliminated the "nextcid" code), or perhaps the communicator doesn't have consistent ordering of procs in it.
Where is the alternate spawn code? I can take a look at this later this week to see what's going wrong.
It is in the
Everything works fine, but the child rank=0 hangs. Help is appreciated! Minus the message, this runs through thousands of comm_spawn loops without a problem.
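(To help readers follow the discussion, the pattern being exercised is roughly the one sketched below. This is an illustrative reduction, not Lisandro's actual mpi4py test: the parent spawns a child job and sends a single message across the resulting intercommunicator, and the report above is that child rank 0 hangs waiting for that message even though the modex completes.)

```c
/* Illustrative sketch of the parent/child exchange under discussion
 * (not the actual test): parent spawns one child and sends a token
 * across the intercommunicator; the child receives it from the parent. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn one copy of this program and send a token to child rank 0. */
        int token = 42;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 0, intercomm);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: receive the token from parent rank 0 - the step reported to hang. */
        int token = -1;
        MPI_Recv(&token, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        printf("child received %d\n", token);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```

The real test reportedly wraps this in a loop of many comm_spawn iterations; without the parent-to-child message, the loop is said to run thousands of times without a problem.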
Test against OMPI CI
Signed-off-by: Ralph Castain <rhc@pmix.org>
Updated the submodule pointers to track PMIx/PRRTE master changes
@hppritcha The branch has been renamed
Just FYI: when running Lisandro's test on a single node, I get the following error message on the first iteration:
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: rhc-node01
PID: 500
--------------------------------------------------------------------------
The rest of the iterations run silently. The communicator local and remote groups both look correct, so I'm not sure where OMPI is looking for the peer.
@hppritcha Current status: I have fixed a couple of bugs in the group construct operation regarding modex info storage, and I can now run Lisandro's test to completion. However, the test fails again if I increase the size of the child job. So it appears there is still some communication issue once we get a child job larger than 1 process. No error or warning messages are being printed, so I'm not sure where to start looking. Let me know when you have time to look at this and I'll be happy to assist.
BTW: I updated the PMIx/PRRTE submodule pointers so they include the required support for the new dpm connect/accept/spawn method
@hppritcha I believe I know the cause of this last problem and am going to work on it. Meantime, I have opened a PR with the current status so we can see how it performs in CI: #12398