
Up the submodule pointers for PMIx and PRRTE #12353

Closed

rhc54 wants to merge 2 commits

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Feb 20, 2024

Test against OMPI CI

@wenduwan
Contributor

Thanks for the PR. I'm running AWS CI.

@wenduwan
Contributor

This PR failed AWS internal CI. We're seeing a lot of failures like the following:

mpirun --wdir . -n 72 --hostfile hostfile --map-by ppr:36:node --timeout 1800 -x PATH  mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Scatterv -npmin 72 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn36.txt
INFO     root:utils.py:507 mpirun output:
--------------------------------------------------------------------------
It looks like MPI runtime init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during RTE init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  num local peers
  --> Returned "Bad parameter" (-5) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)
[ip-172-31-22-21:73523] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
.....

@wenduwan
Contributor

I'm running it again.

@rhc54
Contributor Author

rhc54 commented Feb 22, 2024

No immediate ideas - it is working fine for me, and obviously passed all your standard CIs. I'd have to know more about the particular setup to suggest what you might try.

@rhc54
Contributor Author

rhc54 commented Feb 27, 2024

@wenduwan Any update on this? I'm still unable to reproduce any problems.

@wenduwan
Contributor

I ran our tests again and saw many failures like those shown above. I haven't had a chance to look into them yet.

A quick glance shows that building with --enable-debug fixes those failures.

@wenduwan
Contributor

I finally got some time to look into this.

The issue happens on 2 nodes during MPI_Init:

...
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: components_open: component tcp open function successful
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component self
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component self returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component smcuda returned failure
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: component smcuda closed
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: unloading component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component ofi
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component ofi returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component sm returned failure
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: component sm closed
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: unloading component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component tcp
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl:tcp: 0xebbde0: if eth0 kidx 2 cnt 0 addr 172.31.12.182 IPv4 bw 100 lt 100
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl: tcp: exchange: 0 2 IPv4 172.31.12.182
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component tcp returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component ofi returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component sm
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component sm returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component tcp
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl:tcp: 0x26138f0: if eth0 kidx 2 cnt 0 addr 172.31.4.77 IPv4 bw 100 lt 100
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl: tcp: exchange: 0 2 IPv4 172.31.4.77
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component tcp returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: bml: Using self btl for send to [[54950,1],0] on node ip-172-31-12-182
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <garbled output>
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-172-31-4-77:00000] *** An error occurred in MPI_Init
[ip-172-31-4-77:00000] *** reported by process [3601203201,1]
[ip-172-31-4-77:00000] *** on a NULL communicator
[ip-172-31-4-77:00000] *** Unknown error
[ip-172-31-4-77:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-4-77:00000] ***    and MPI will try to terminate your MPI job as well)

@wenduwan
Contributor

wenduwan commented Feb 28, 2024

Ignore

@rhc54
Contributor Author

rhc54 commented Feb 28, 2024

@wenduwan Pushed the latest state of the master branches - there have been a number of fixes since this was originally created.

Converted to "draft" to ensure nobody merges this by mistake.

@rhc54 rhc54 marked this pull request as draft February 28, 2024 23:13
@wenduwan
Contributor

Unfortunately the same tests are still failing...

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

I'm afraid you aren't giving me much to work with here 😞 I did see this in your printout above:

[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>

You mentioned that --enable-debug made the problems go away? If so, that is suspiciously similar to what we see when someone mixes debug and non-debug libraries across nodes: the pack/unpack pairing gets out of sync and things go haywire.

@wenduwan
Contributor

@rhc54 In our CI we build the applications separately against the debug and non-debug MPI builds, so this shouldn't be an issue.

@hppritcha I wonder if someone on your side could quickly verify this PR with OMB/IMB on 2 nodes?

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

You don't need to run a bloody OMB test when things are failing in MPI_Init - just run MPI "hello". I can reproduce it with --disable-debug, but the error appears to be in the pml "checker" logic. Might be the first thing pulled up from the modex, so I'll take a look there.
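For illustration, a minimal MPI "hello" reproducer of the kind suggested above might look like the sketch below (an illustration only, not the actual test used in either CI):

/* hello.c -- minimal reproducer: any failure here is in MPI_Init itself */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Launched across two nodes (for example, mpirun -n 2 --hostfile hostfile --map-by node ./hello), this exercises the same MPI_Init path that is failing in the logs above.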

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

Okay, I tracked it down and fixed it. Hopefully okay now!

@wenduwan
Contributor

Thank you Ralph. The test has passed.

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

Hooray! Anyone have an idea on what mpi4py is complaining about?

@hppritcha
Member

Probably related to #12384

@rhc54
Contributor Author

rhc54 commented Mar 2, 2024

I've got the alternative "spawn" code working, but the MPI message between the parent and a child process seems to be hanging or isn't getting through. I checked the modex recv calls, and the btl/tcp connection info is all getting correctly transferred between all the procs (both parent and child). So I'm a little stumped.

Is there a simple way to trace the MPI send/recv procedure to see where the hang might be? Since the modex completes, I'm thinking that maybe the communicator isn't being fully constructed (since I eliminated the "nextcid" code), or perhaps the communicator doesn't have a consistent ordering of procs in it.

@hppritcha
Member

where is the alternate spawn code? I can take a look at this later this week to see about what's going wrong.

@rhc54
Contributor Author

rhc54 commented Mar 4, 2024

It is in the topic/dpm branch of my OMPI fork: https://git@github.com/rhc54/ompi. Here is the commit message:

Add a second method for doing connect/accept

    The "old" method relies on PMIx publish/lookup followed by
    a call to PMIx_Connect. It then does a "next cid" method
    to get the next communicator ID, which has multiple algorithms
    including one that calls PMIx_Group.

    Simplify this by using PMIx_Group_construct in place of
    PMIx_Connect, and have it return the next communicator ID.
    This is more scalable and reliable than the prior method.

    Retain the "old" method for now as this is new code. Create
    a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch
    between the two approaches. Default it to "true" for now
    for ease of debugging.

    NOTE: this includes an update to the submodule pointers
    for PMIx and PRRTE to obtain the required updates to
    those code bases.
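
For context, the PMIx-level operation that commit message describes might look roughly like the sketch below. It follows the PMIx v4/v5 group APIs; the group name, participant list, and error handling are illustrative, and the attribute usage should be checked against the installed PMIx headers.

/* group_cid.c -- sketch: construct a PMIx group over a set of procs and ask
 * the runtime to assign a context ID, which can then serve as the next
 * communicator ID (replacing publish/lookup + PMIx_Connect + "nextcid"). */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, procs[2];
    pmix_info_t directive;
    pmix_info_t *results = NULL;
    size_t nresults = 0, n;
    bool assign = true;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* participants: ranks 0 and 1 of our own namespace (illustrative only) */
    PMIX_LOAD_PROCID(&procs[0], myproc.nspace, 0);
    PMIX_LOAD_PROCID(&procs[1], myproc.nspace, 1);

    /* ask the runtime to assign a context ID to the new group */
    PMIX_INFO_LOAD(&directive, PMIX_GROUP_ASSIGN_CONTEXT_ID, &assign, PMIX_BOOL);

    /* blocking collective: every listed participant must make this call */
    if (PMIX_SUCCESS == PMIx_Group_construct("example-grp", procs, 2,
                                             &directive, 1, &results, &nresults)) {
        for (n = 0; n < nresults; n++) {
            if (0 == strcmp(results[n].key, PMIX_GROUP_CONTEXT_ID)) {
                printf("group constructed; context ID returned by the runtime\n");
            }
        }
        PMIx_Group_destruct("example-grp", NULL, 0);
    }

    if (NULL != results) {
        PMIX_INFO_FREE(results, nresults);
    }
    PMIX_INFO_DESTRUCT(&directive);
    PMIx_Finalize(NULL, 0);
    return 0;
}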

Everything works fine, but the child rank=0 hangs in the MPI_Recv call waiting for the message from the parent. I can't find the cause of the hang. I've printed out the contents of the communicators and they look fine, and I've checked that we aren't waiting on connection endpoints (at least, I'm not seeing it).

Help is appreciated! Minus the message, this runs through thousands of comm_spawn loops without a problem.
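
As an illustration of the pattern described above, a stripped-down parent/child pair might look like the following sketch; the executable name "./child" and the message contents are hypothetical and not taken from the actual test:

/* parent.c -- spawn one child and send it a message over the intercommunicator */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    char msg[] = "hello child";

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, intercomm);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

/* child.c -- the MPI_Recv below is the call described above as hanging
 * even though the modex and endpoint exchange appear complete */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, parent, MPI_STATUS_IGNORE);
    printf("child got: %s\n", buf);
    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}

If a reduced case like this also hangs, running it with pml and btl verbosity turned up (e.g. btl_base_verbose, as in the help text quoted earlier) may help show whether the child's receive is blocked on communicator construction or on endpoint wire-up.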

Test against OMPI CI

Signed-off-by: Ralph Castain <rhc@pmix.org>
@rhc54
Contributor Author

rhc54 commented Mar 4, 2024

Updated the submodule pointers to track PMIx/PRRTE master changes

@rhc54
Contributor Author

rhc54 commented Mar 4, 2024

@hppritcha The branch has been renamed to topic/dpm2 to avoid a conflict with a pre-existing branch. Sorry for the confusion.

@rhc54
Contributor Author

rhc54 commented Mar 8, 2024

Just FYI: when running Lisandro's test on a single node, I get the following error message on the first iteration:

--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: rhc-node01
  PID:        500
--------------------------------------------------------------------------

The rest of the iterations run silently. The communicator local and remote groups both look correct, so I'm not sure where OMPI is looking for the peer.

@rhc54
Contributor Author

rhc54 commented Mar 9, 2024

@hppritcha Current status: I have fixed a couple of bugs in the group construct operation regarding modex info storage, and I can now run Lisandro's test to completion. However, if I increase maxnp to greater than 1, then the MPI_Barrier hangs.

So it appears there is still some communication issue once we get a child job larger than one process. No error or warning messages are being printed, so I'm not sure where to start looking.

Let me know when you have time to look at this and I'll be happy to assist.
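
For reference, a rough C approximation of that kind of spawn-and-barrier loop is sketched below (the actual test is the mpi4py one referenced earlier; the loop bound and the child program are illustrative, with the child calling MPI_Comm_get_parent and a matching MPI_Barrier instead of the MPI_Recv in the earlier sketch):

/* spawn_barrier.c -- spawn progressively larger child jobs and barrier with
 * them; the hang described above appears once the child job size exceeds 1 */
#include <mpi.h>

#define MAXNP 3   /* illustrative upper bound */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int np = 1; np <= MAXNP; np++) {
        MPI_Comm intercomm;
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, np, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Barrier(intercomm);   /* collective over both groups of the intercommunicator */
        MPI_Comm_disconnect(&intercomm);
    }
    MPI_Finalize();
    return 0;
}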

@rhc54
Contributor Author

rhc54 commented Mar 9, 2024

BTW: I updated the PMIx/PRRTE submodule pointers so they include the required support for the new dpm connect/accept/spawn method.

@rhc54
Contributor Author

rhc54 commented Mar 10, 2024

@hppritcha I believe I know the cause of this last problem and am going to work on it. Meantime, I have opened a PR with the current status so we can see how it performs in CI: #12398

@rhc54 rhc54 closed this Mar 13, 2024
@rhc54 rhc54 deleted the topic/up branch March 13, 2024 15:36