
Up the submodule pointers for PMIx and PRRTE #12353

Closed

rhc54 wants to merge 2 commits

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Feb 20, 2024

Test against OMPI CI

@wenduwan
Contributor

Thanks for the PR. I'm running AWS CI.

@wenduwan
Contributor

This PR failed AWS internal CI. We're seeing a lot of failures like the following:

mpirun --wdir . -n 72 --hostfile hostfile --map-by ppr:36:node --timeout 1800 -x PATH  mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Scatterv -npmin 72 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn36.txt
INFO     root:utils.py:507 mpirun output:
--------------------------------------------------------------------------
It looks like MPI runtime init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during RTE init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  num local peers
  --> Returned "Bad parameter" (-5) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)
[ip-172-31-22-21:73523] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
.....

@wenduwan
Contributor

I'm running it again.

@rhc54
Contributor Author

rhc54 commented Feb 22, 2024

No immediate ideas - it is working fine for me, and obviously passed all your standard CIs. I'd have to know more about the particular setup to suggest what you might try.

@rhc54
Contributor Author

rhc54 commented Feb 27, 2024

@wenduwan Any update on this? I'm still unable to reproduce any problems.

@wenduwan
Contributor

I ran our tests again and saw many failures like those shown above. I haven't had a chance to look into them yet.

A quick glance shows that building with --enable-debug fixes those failures.

@wenduwan
Contributor

I finally got some time to look into this.

The issue happens on 2 nodes during MPI_Init:

...
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: components_open: component tcp open function successful
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component self
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component self returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component smcuda returned failure
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: component smcuda closed
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: unloading component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component ofi
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component ofi returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component sm returned failure
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: component sm closed
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: unloading component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component tcp
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl:tcp: 0xebbde0: if eth0 kidx 2 cnt 0 addr 172.31.12.182 IPv4 bw 100 lt 100
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl: tcp: exchange: 0 2 IPv4 172.31.12.182
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component tcp returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component ofi returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component sm
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component sm returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component tcp
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl:tcp: 0x26138f0: if eth0 kidx 2 cnt 0 addr 172.31.4.77 IPv4 bw 100 lt 100
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl: tcp: exchange: 0 2 IPv4 172.31.4.77
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component tcp returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: bml: Using self btl for send to [[54950,1],0] on node ip-172-31-12-182
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <garbled output>
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-172-31-4-77:00000] *** An error occurred in MPI_Init
[ip-172-31-4-77:00000] *** reported by process [3601203201,1]
[ip-172-31-4-77:00000] *** on a NULL communicator
[ip-172-31-4-77:00000] *** Unknown error
[ip-172-31-4-77:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-4-77:00000] ***    and MPI will try to terminate your MPI job as well)

@wenduwan
Contributor

wenduwan commented Feb 28, 2024

Ignore

@rhc54
Contributor Author

rhc54 commented Feb 28, 2024

@wenduwan Pushed the latest state of the master branches - there have been a number of fixes since this was originally created.

Converted to "draft" to ensure nobody merges this by mistake.

@rhc54 rhc54 marked this pull request as draft February 28, 2024 23:13
@wenduwan
Contributor

Unfortunately the same tests are still failing...

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

I'm afraid you aren't giving me much to work with here 😞 I did see this in your printout above:

[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>

You mentioned that --enable-debug made the problems go away? If so, that is suspiciously similar to what we see when someone mixes debug and non-debug libraries across nodes: the pack/unpack pairing gets out of sync and things go haywire.

@wenduwan
Contributor

@rhc54 In our CI we build the applications separately against the debug and non-debug MPI builds, so this shouldn't be an issue.

@hppritcha I wonder if someone on your side could quickly verify this PR with OMB/IMB on 2 nodes?

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

You don't need to run a bloody OMB test when things are failing in MPI_Init - just run MPI "hello". I can reproduce it with --disable-debug, but the error appears to be in the pml "checker" logic. Might be the first thing pulled up from the modex, so I'll take a look there.
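For illustration, a minimal MPI "hello" reproducer of the kind suggested above might look like the sketch below (an illustration only, not the actual test used in either CI):

/* hello.c -- minimal reproducer: any failure here is in MPI_Init itself */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Launched across two nodes (for example, mpirun -n 2 --hostfile hostfile --map-by node ./hello), this exercises the same MPI_Init path that is failing in the logs above.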

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

Okay, I tracked it down and fixed it. Hopefully okay now!

@wenduwan
Contributor

Thank you Ralph. The test has passed.

@rhc54
Contributor Author

rhc54 commented Feb 29, 2024

Hooray! Anyone have an idea on what mpi4py is complaining about?

@hppritcha
Member

Probably related to #12384

@rhc54
Contributor Author

rhc54 commented Mar 2, 2024

I've got the alternative "spawn" code working, but the MPI message between the parent and a child process seems to be hanging or isn't getting through. I checked the modex recv calls, and the btl/tcp connection info is all getting correctly transferred between all the procs (both parent and child). So I'm a little stumped.

Is there a simple way to trace the MPI send/recv procedure to see where the hang might be? Since the modex completes, I'm thinking that maybe the communicator isn't being fully constructed (since I eliminated the "nextcid" code), or perhaps the communicator doesn't have a consistent ordering of procs in it.

@hppritcha
Member

where is the alternate spawn code? I can take a look at this later this week to see about what's going wrong.

@rhc54
Contributor Author

rhc54 commented Mar 4, 2024

It is in the topic/dpm branch of my OMPI fork: https://git@github.com/rhc54/ompi. Here is the commit message:

Add a second method for doing connect/accept

    The "old" method relies on PMIx publish/lookup followed by
    a call to PMIx_Connect. It then does a "next cid" method
    to get the next communicator ID, which has multiple algorithms
    including one that calls PMIx_Group.

    Simplify this by using PMIx_Group_construct in place of
    PMIx_Connect, and have it return the next communicator ID.
    This is more scalable and reliable than the prior method.

    Retain the "old" method for now as this is new code. Create
    a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch
    between the two approaches. Default it to "true" for now
    for ease of debugging.

    NOTE: this includes an update to the submodule pointers
    for PMIx and PRRTE to obtain the required updates to
    those code bases.
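
For context, the PMIx-level operation that commit message describes might look roughly like the sketch below. It follows the PMIx v4/v5 group APIs; the group name, participant list, and error handling are illustrative, and the attribute usage should be checked against the installed PMIx headers.

/* group_cid.c -- sketch: construct a PMIx group over a set of procs and ask
 * the runtime to assign a context ID, which can then serve as the next
 * communicator ID (replacing publish/lookup + PMIx_Connect + "nextcid"). */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, procs[2];
    pmix_info_t directive;
    pmix_info_t *results = NULL;
    size_t nresults = 0, n;
    bool assign = true;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* participants: ranks 0 and 1 of our own namespace (illustrative only) */
    PMIX_LOAD_PROCID(&procs[0], myproc.nspace, 0);
    PMIX_LOAD_PROCID(&procs[1], myproc.nspace, 1);

    /* ask the runtime to assign a context ID to the new group */
    PMIX_INFO_LOAD(&directive, PMIX_GROUP_ASSIGN_CONTEXT_ID, &assign, PMIX_BOOL);

    /* blocking collective: every listed participant must make this call */
    if (PMIX_SUCCESS == PMIx_Group_construct("example-grp", procs, 2,
                                             &directive, 1, &results, &nresults)) {
        for (n = 0; n < nresults; n++) {
            if (0 == strcmp(results[n].key, PMIX_GROUP_CONTEXT_ID)) {
                printf("group constructed; context ID returned by the runtime\n");
            }
        }
        PMIx_Group_destruct("example-grp", NULL, 0);
    }

    if (NULL != results) {
        PMIX_INFO_FREE(results, nresults);
    }
    PMIX_INFO_DESTRUCT(&directive);
    PMIx_Finalize(NULL, 0);
    return 0;
}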

Everything works fine, but the child rank=0 hangs in the MPI_Recv call waiting for the message from the parent. I can't find the cause of the hang. I've printed out the contents of the communicators and they look fine, and I've checked that we aren't waiting on connection endpoints (at least, I'm not seeing it).

Help is appreciated! Minus the message, this runs through thousands of comm_spawn loops without a problem.
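
As an illustration of the pattern described above, a stripped-down parent/child pair might look like the following sketch; the executable name "./child" and the message contents are hypothetical and not taken from the actual test:

/* parent.c -- spawn one child and send it a message over the intercommunicator */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    char msg[] = "hello child";

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, intercomm);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

/* child.c -- the MPI_Recv below is the call described above as hanging
 * even though the modex and endpoint exchange appear complete */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, parent, MPI_STATUS_IGNORE);
    printf("child got: %s\n", buf);
    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}

If a reduced case like this also hangs, running it with pml and btl verbosity turned up (e.g. btl_base_verbose, as in the help text quoted earlier) may help show whether the child's receive is blocked on communicator construction or on endpoint wire-up.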

Test against OMPI CI

Signed-off-by: Ralph Castain <rhc@pmix.org>
@rhc54
Contributor Author

rhc54 commented Mar 4, 2024

Updated the submodule pointers to track PMIx/PRRTE master changes

@rhc54
Contributor Author

rhc54 commented Mar 4, 2024

@hppritcha The branch has been renamed to topic/dpm2 to avoid a conflict with a pre-existing branch. Sorry for the confusion.

@rhc54
Contributor Author

rhc54 commented Mar 8, 2024

Just FYI: when running Lisandro's test on a single node, I get the following error message on the first iteration:

--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: rhc-node01
  PID:        500
--------------------------------------------------------------------------

The rest of the iterations run silently. The communicator local and remote groups both look correct, so I'm not sure where OMPI is looking for the peer.

@rhc54
Contributor Author

rhc54 commented Mar 9, 2024

@hppritcha Current status: I have fixed a couple of bugs in the group construct operation regarding modex info storage, and I can now run Lisandro's test to completion. However, if I increase maxnp to greater than 1, then the MPI_Barrier hangs.

So it appears there is still some communication issue once we get a child job larger than one process. No error or warning messages are being printed, so I'm not sure where to start looking.

Let me know when you have time to look at this and I'll be happy to assist.
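
For reference, a rough C approximation of that kind of spawn-and-barrier loop is sketched below (the actual test is the mpi4py one referenced earlier; the loop bound and the child program are illustrative, with the child calling MPI_Comm_get_parent and a matching MPI_Barrier instead of the MPI_Recv in the earlier sketch):

/* spawn_barrier.c -- spawn progressively larger child jobs and barrier with
 * them; the hang described above appears once the child job size exceeds 1 */
#include <mpi.h>

#define MAXNP 3   /* illustrative upper bound */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int np = 1; np <= MAXNP; np++) {
        MPI_Comm intercomm;
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, np, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Barrier(intercomm);   /* collective over both groups of the intercommunicator */
        MPI_Comm_disconnect(&intercomm);
    }
    MPI_Finalize();
    return 0;
}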

@rhc54
Contributor Author

rhc54 commented Mar 9, 2024

BTW: I updated the PMIx/PRRTE submodule pointers so they include the required support for the new dpm connect/accept/spawn method.

@rhc54
Contributor Author

rhc54 commented Mar 10, 2024

@hppritcha I believe I know the cause of this last problem and am going to work on it. Meantime, I have opened a PR with the current status so we can see how it performs in CI: #12398

@rhc54 rhc54 closed this Mar 13, 2024
@rhc54 rhc54 deleted the topic/up branch March 13, 2024 15:36