
ompi 1.4.3 hangs in IMB Gather when np >= 64 & msgsize > 4k #125

Closed
ompiteam opened this issue Oct 1, 2014 · 20 comments

ompiteam commented Oct 1, 2014

As reported on ompi-devel in the following email thread:

http://www.open-mpi.org/community/lists/devel/2011/01/8852.php

ompi 1.4.3 hangs in IMB/Gather when np >= 64. This is being seen mainly on x86_64 systems with Mellanox ConnectX HCAs. Current workarounds are either to use rdmacm or to set mpi_preconnect_mpi so that all possible connections are established at job launch rather than on demand. The hang also seems to be sensitive to the selection of the collective algorithm.
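
For reference, a rough sketch of how those two workarounds would be selected on the mpirun command line in the 1.4 series (the "..." stands for the rest of the job's options; treat this as illustrative, not a verified recipe):

mpirun --mca btl_openib_cpc_include rdmacm ... IMB-MPI1 Gather

to have the openib BTL use RDMA CM for connection setup, or

mpirun --mca mpi_preconnect_mpi 1 ... IMB-MPI1 Gather

to establish all connections during MPI_Init rather than on demand.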

This hang has not been seen in 1.5, nor with other MPIs (e.g., Intel).

This has been seen on multiple clusters: Doron's cluster and a couple of IBM iDataplex clusters.

ompiteam commented Oct 1, 2014

Imported from trac issue 2714. Created by bbenton on 2011-02-08T10:52:29, last modified: 2012-02-21T13:40:41

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2011-02-25 09:08:16:

Is this ticket related to #2722?

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-05-31 11:59:21:

Given the increased number of incidents with this, moving it back as a blocker for 1.4.4.

ompiteam commented Oct 1, 2014

Trac comment by samuel on 2011-05-31 13:55:39:

I couldn't reproduce the IMB/Gather hang on the following system setup:

Intel(R) Xeon(R) CPU X5550 @ 2.67GHz

$ uname -iopmvrs
Linux 2.6.18-103chaos #1 SMP Tue Oct 19 16:43:10 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

OFED 1.4.1ish - we don't run a vanilla OFED stack.

Open MPI 1.4.3 with SM BTL memory barrier patch (see #2619).

IMB 3.2

gcc 4.1.2 20080704 (Red Hat 4.1.2-48) - 32 iterations @ 72 rank processes on 9 nodes.

Intel 11.1.072 - 32 iterations @ 72 rank processes on 9 nodes.

Some potentially relevant MCA parameters that we run with:
coll_sync_priority = 100
coll_sync_barrier_before = 1000
coll_hierarch_priority = 90
oob_tcp_if_include = ib0
oob_tcp_peer_retries = 1000
oob_tcp_disable_family = IPv6
oob_tcp_listen_mode = listen_thread
oob_tcp_sndbuf = 32768
oob_tcp_rcvbuf = 32768

Hope that helps,

Samuel Gutierrez
Los Alamos National Laboratory

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-05-31 22:49:23:

Adding Chris to cc:

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-06-07 11:32:20:

We have been able to reproduce this at IBM over eHCAs on a couple of our P6/IH servers (32-core systems). Interestingly, we have seen this on 1.5.x as well as 1.4.x, although it does not seem to happen as frequently on the 1.5.x series.

Here is the flow of the gather code:

ompi_coll_tuned_gather_intra_dec_fixed()
  if block_size < 6000 
    ompi_coll_tuned_gather_intra_basic_linear()
  if block_size > 6000 
    ompi_coll_tuned_gather_intra_linear_sync()

The call to ompi_coll_tuned_gather_intra_linear_sync() results in additional, on-the-fly connections. If you force it to stay with ompi_coll_tuned_gather_intra_basic_linear(), things work fine. This can be done via the following MCA parameters:

coll_tuned_use_dynamic_rules 1 
coll_tuned_gather_algorithm 1
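
For example, a full command line might look like the following (a sketch only; the host file and process count are placeholders):

mpirun -np 64 --hostfile <host_file> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm 1 IMB-MPI1 Gather

With these settings the tuned component bypasses its fixed decision function and always uses the basic linear gather (algorithm 1).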

Chris Yeoh has looked into the state of things during one of the system hangs. Here is his analysis:

Ranks 0-25 think the root process is rank 28. Rank 26 thinks the root is
26, ranks 27-61 think it is 27, and ranks 62-63 think it is 26.

Looking at where the root processes in
ompi_coll_tuned_gather_intra_linear_sync() think they are when talking
to non-root ranks: rank 26 thinks i == 62 and rank 27 thinks i == 26.
I.e., rank 26 is waiting for rank 62 and rank 27 is waiting for rank 26.

I realise now that there is no synchronisation between tests of
MPI_Gather, so the above makes sense. The gather where rank 26 was root
appears to have a lost message from rank 62, and then everything piles
up when rank 27, acting as the root, can't get a message from 26 because
it is still stuck in a blocking send to 62. Rank 62 is stuck in:

#6  0x0000040000976b2c in ompi_coll_tuned_gather_intra_linear_sync
  (sbuf=0x102d7430, scount=8192, sdtype=0x40000324048,
  rbuf=0x400088b0010, rcount=8192, rdtype=0x40000324048, root=26,
 comm=0x101654a0, module=0x10182a80, first_segment_size=1024)
 at ../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:247

which is

  ret = MCA_PML_CALL(recv(sbuf, 0, MPI_BYTE, root,
                                 MCA_COLL_BASE_TAG_GATHER,
                                 comm, MPI_STATUS_IGNORE));

i.e., 62 thinks it never received the send from 26, although rank 26 is
stuck in

#9  0x0000040000976e90 in ompi_coll_tuned_gather_intra_linear_sync
  (sbuf=0x1027c250, scount=8192, sdtype=0x40000324048,
  rbuf=0x40007c10010, rcount=8192, rdtype=0x40000324048, root=26,
  comm=0x101677c0, module=0x101947f0, first_segment_size=1024)
  at ../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:301
            /* send sync message */
            ret = MCA_PML_CALL(send(rbuf, 0, MPI_BYTE, i,
                                    MCA_COLL_BASE_TAG_GATHER,
                                    MCA_PML_BASE_SEND_STANDARD, comm));

i.e., the blocking send to 62. So it indicates the connection between 62
and 27 is bad, not just a completion going missing in a race.

I think I don't quite understand the semantics of the blocking send,
because a closer look at the backtraces for ranks 26 and 27 seems to
show that rank 26 is stuck at

#9  0x0000040000976e90 in ompi_coll_tuned_gather_intra_linear_sync
  (sbuf=0x1027c250, scount=8192, sdtype=0x40000324048,
  rbuf=0x40007c10010, rcount=8192, rdtype=0x40000324048, root=26,
  comm=0x101677c0, module=0x101947f0, first_segment_size=1024)
  at ../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:301

and rank 27 is stuck at

#9  0x0000040000976fa8 in ompi_coll_tuned_gather_intra_linear_sync
  (sbuf=0x1027c290, scount=8192, sdtype=0x40000324048,
  rbuf=0x40007d20010, rcount=8192, rdtype=0x40000324048, root=27,
  comm=0x10167800, module=0x10194830, first_segment_size=1024)
  at ../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:314

which is actually further along and rank 26 couldn't possibly have
posted the receives that 27 is expecting as the root.

Putting the two things above together, I'm wondering if it's possible
that, because of the number of processes in the gather, there is a race
such that the following occurs:

  1. When the MPI_Gather where rank 26 is root occurs, rank 27
     successfully communicates with rank 26 and starts the new iteration
     of the test.
  2. Rank 27, acting as root, starts talking with ranks 1-25 and
     successfully communicates with them.
  3. Rank 27 starts up the new connection that it needs with rank 26.
  4. Rank 26, meanwhile, has successfully communicated with ranks 28-61.
  5. Rank 26 starts up the new connection it needs with rank 62.
  6. "Somehow" there is some cross talk such that the send from rank 62
     gets to rank 27 instead, and the completion event goes back to the
     wrong place.

So rank 27 ends up progressing further but gets stuck on the second
send to rank 26. Rank 62 never sees the send from 26, and 26 is still
stuck in the send it thought went to 62 but went to 27 instead.

Perhaps someone more familiar with this code (George?) can take a look and provide their insights into the state of things as well as the analysis above.

ompiteam commented Oct 1, 2014

Trac comment by kliteyn on 2011-06-09 10:02:07:

My $0.02:

I was able to reproduce a hang with Allgather, but couldn't reproduce it with Gather.
Frankly, I don't know if this is the same problem.
Anyway, I can see it stuck with np=64 and with the biggest msgsize=4MB.
It happens reliably even with a single iteration:

mpirun -np 64 --hostfile <host_file> --mca orte_base_help_aggregate 0 --mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,self IMB-MPI1 -npmin 64 -iter 1 -msglen <msglen_file> Allgather

msglen_file contains just a single size - "4194304"

All 64 ranks are waiting at the same place:

-----------------
[0-63] (64 processes)
-----------------
main() at ?:?
  IMB_init_buffers_iter() at ?:?
    IMB_allgather() at ?:?
      PMPI_Allgather() at pallgather.c:114
        ompi_coll_tuned_allgather_intra_dec_fixed() at coll_tuned_decision_fixed.c:571
          ompi_coll_tuned_allgather_intra_neighborexchange() at coll_tuned_allgather.c:597
            ompi_coll_tuned_sendrecv() at coll_tuned_util.h:60
              ompi_coll_tuned_sendrecv_actual() at coll_tuned_util.c:55
                ompi_request_default_wait_all() at request/req_wait.c:262
                  opal_condition_wait() at ../opal/threads/condition.h:99

I also see this happening with --mca coll basic, not only with tuned.
What's interesting, and what also hints that this might be another issue (more specifically, it may be the same as https://svn.open-mpi.org/trac/ompi/ticket/2627), is the fact that using --mca mpi_leave_pinned 0 makes the problem go away.

ompiteam commented Oct 1, 2014

Trac comment by kliteyn on 2011-06-09 10:40:40:

BTW, I get the same problem with the 1.5 branch.

ompiteam commented Oct 1, 2014

Trac comment by rolfv on 2011-06-09 11:04:11:

I am probably stating the obvious here, but this must be related to the fact that some of the first messages we are sending are large. This means that we are using the PUT protocol from the PML layer: the sending side pins the memory, then sends a PUT message to the receiving side. Presumably, that message gets queued up on the receiving side, then gets popped off, and the receiver attempts to pin its memory. I wonder if that fails, and then we get the hang. By setting mpi_leave_pinned=0, we use the RNDV protocol instead of PUT, which means we are not actually pinning any memory. (I believe I submitted a bug against that.)
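
One way to test that hypothesis (a sketch, reusing kliteyn's Allgather reproducer from above) would be to rerun it with leave-pinned disabled and see whether the hang disappears:

mpirun -np 64 --hostfile <host_file> --mca btl openib,self --mca mpi_leave_pinned 0 IMB-MPI1 -npmin 64 -iter 1 -msglen <msglen_file> Allgather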

One other thing to note is that IMB always does a warmup with a 4 Mbyte message prior to starting a test. So, I would think you may see this hang with IMB regardless of the message size that it is testing.

In terms of debugging, we would need to look at each process via a debugger and see where they are. We would have to look at the send and receive requests and try to figure out what happened. Another far-fetched idea is to take a look at the btl_openib_failover.c file. There is a debugging function in there that will dump out all the internal queues from the openib BTL. This lets us see if there is a message stuck in the BTL, perhaps because the connection never completed.

ompiteam commented Oct 1, 2014

Trac comment by kliteyn on 2011-06-14 08:14:30:

The setup on which this issue was reproduced has been upgraded to RHEL6, and the problem disappeared...

I've been able to reproduce what is probably the same problem on another cluster - this time it's IMB Alltoall with 4MB messages and np >= 44, but it's on a very remote setup where I don't have many privileges, so I'm trying to build another setup in-house to debug it.

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-06-14 12:23:28:

I think that we might be chasing two problems. In particular, I was able to re-create the gather hang with mpi_leave_pinned = 0 (this was with np=64 & it hung with a message size of 8K).

ompiteam commented Oct 1, 2014

Trac comment by kliteyn on 2011-06-16 08:04:36:

OK, then you're right - what I'm seeing is irrelevant for this particular issue (it's probably more relevant for https://svn.open-mpi.org/trac/ompi/ticket/2627).

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-07-25 13:37:57:

Update: This defect remains elusive to recreate. We have not been able to recreate it here at IBM for some weeks now. Also, Brock Palen (via off-list email) has indicated that he has been unable to reproduce this after a "power event" forced him to restart his environment from scratch.

My current recommendation for a workaround on IB fabrics is to force the use of ompi_coll_tuned_gather_intra_basic_linear() via:

--mca coll_tuned_use_dynamic_rules 1  --mca coll_tuned_gather_algorithm 1
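
The same settings can also be made the default for all jobs by putting them in an MCA parameter file, e.g. (assuming the standard per-user location):

coll_tuned_use_dynamic_rules = 1
coll_tuned_gather_algorithm = 1

in $HOME/.openmpi/mca-params.conf.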

ompiteam commented Oct 1, 2014

Trac comment by samuel on 2011-08-16 10:42:59:

Hi,

We updated one of our test clusters and are now experiencing a hang in IMB Alltoall. Can you ask the user to send us the following output?

cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

We think that this hang is related to a memory registration limitation due to this setting (at least on our machine). We'll have more data later today.

We are going to up this value to 5 - its current value on our cluster is 0. This seems to limit the amount of memory that can be registered via calls to ibv_reg_mr, for example, to just under 2GB.
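
For anyone else hitting this, a minimal sketch of one way to make such a change persistent (the exact modprobe.d file name and driver reload procedure vary by distribution, so treat this as an illustration rather than the exact steps we used):

echo "options mlx4_core log_mtts_per_seg=5" >> /etc/modprobe.d/mlx4_core.conf

followed by reloading the mlx4_core driver (or rebooting) and re-checking /sys/module/mlx4_core/parameters/log_mtts_per_seg.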

Sam

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-08-16 10:53:49:

I'm not sure that the Alltoall issue is the same as this gather hang. I think it is more related to some of the issues with resource exhaustion and the registration cache, such as https://svn.open-mpi.org/trac/ompi/ticket/2155, https://svn.open-mpi.org/trac/ompi/ticket/2157, and https://svn.open-mpi.org/trac/ompi/ticket/2295.

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2011-08-16 10:56:34:

Brock set us up (Chris Yeoh and Brad Benton) with accounts at UMich. Hopefully, we'll be able to better pursue this problem in the UMich environment. Meanwhile, as discussed during the Aug 9 telecon, since we have a viable workaround (selecting a different gather algorithm), I'm resetting this defect to "critical" and moving it to 1.4.5 so that we can get 1.4.4 out.

ompiteam commented Oct 1, 2014

Trac comment by pasha on 2011-08-16 11:27:21:

Replying to [comment:13 samuel]:


BTW, please see the following document, which describes tuning these limits:
http://www.ibm.com/developerworks/wikis/display/hpccentral/Using+RDMA+with+pagepool+larger+than+8GB

ompiteam commented Oct 1, 2014

Trac comment by samuel on 2011-08-16 12:32:58:

Sorry for hijacking the thread.

Thanks Pasha. We updated the setting and everything seems to be working as expected.

Sam

ompiteam commented Oct 1, 2014

Trac comment by bbenton on 2012-02-21 13:40:41:

Milestone Open MPI 1.4.5 deleted

jsquyres (Member) commented:

This seems to have been fixed forever ago.

yosefe pushed a commit to yosefe/ompi that referenced this issue Mar 5, 2015
Update hwloc to 1.9.1, bringing the newer version over from the master
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Nov 28, 2018
Implement process set name support