master: Getting TCP BTL complaints about peer unexpectedly closing connection #4263

@jsquyres

Description

In MTT (and manual) runs of Open MPI on master, I'm getting complaints from the TCP BTL about the peer unexpectedly closing the connection while it was trying to read a blocking message.

Note that I have a lot of IP interfaces on my Linux machines. Here's an example:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: lom0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 60:73:5c:68:c4:8a brd ff:ff:ff:ff:ff:ff
    inet 10.0.8.19/16 brd 10.0.255.255 scope global lom0
    inet6 fe80::6273:5cff:fe68:c48a/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8b brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8c brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8d brd ff:ff:ff:ff:ff:ff
6: vic20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:20:00 brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.19/16 brd 10.10.255.255 scope global vic20
    inet6 fe80::2657:20ff:fe13:2000/64 scope link 
       valid_lft forever preferred_lft forever
7: vic21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:21:00 brd ff:ff:ff:ff:ff:ff
    inet 10.2.0.19/16 brd 10.2.255.255 scope global vic21
    inet6 fe80::2657:20ff:fe13:2100/64 scope link 
       valid_lft forever preferred_lft forever
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:50:00 brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.19/16 brd 10.3.255.255 scope global eth6
    inet6 fe80::2657:20ff:fe13:5000/64 scope link 
       valid_lft forever preferred_lft forever
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 24:57:20:13:51:00 brd ff:ff:ff:ff:ff:ff

Running 32 copies of the Intel MPI_Isend_rtoa_c test across two 16-core machines (mpi019
and mpi020), each with the same Linux IP interfaces, I sometimes see output like this:

$ mpirun --mca btl tcp,vader,self --np 4 --npernode 2 --mca oob tcp  ./MPI_Isend_rtoa_c
MPITEST info  (0): Starting MPI_Isend_rtoa: Root Isends TO All test
[mpi020][[29976,1],3][../../../../../opal/mca/btl/tcp/btl_tcp.c:549:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
--------------------------------------------------------------------------
WARNING: Open MPI failed to handshake with a connecting peer MPI
process over TCP.  This should not happen.

Your Open MPI job may now fail.

  Local host: mpi020
  PID:        28579
  Message:    did not receive entire connect ACK from peer
--------------------------------------------------------------------------
MPITEST_results: MPI_Isend_rtoa: Root Isends TO All all tests PASSED (19656)

If I increase NP / PPN, these warning messages show up more frequently.
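For example, the full 32-copy run mentioned above is essentially the same command scaled up to fill both 16-core machines:

$ mpirun --mca btl tcp,vader,self --np 32 --npernode 16 --mca oob tcp ./MPI_Isend_rtoa_c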

I've uploaded the output from a --mca btl_base_verbose 100 run with NP=4, PPN=2 to a gist.
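That verbose run was essentially the same invocation as above with the verbosity flag added, i.e. roughly:

$ mpirun --mca btl tcp,vader,self --np 4 --npernode 2 --mca oob tcp --mca btl_base_verbose 100 ./MPI_Isend_rtoa_c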

@bwbarrett Is this due to recent changes in the connectivity stuff in the TCP BTL that you guys did, perchance?
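In the meantime, one way to check whether the large number of interfaces is a factor would be to pin both the TCP BTL and the TCP OOB to a single interface via btl_tcp_if_include / oob_tcp_if_include (lom0 below is just an example; any UP interface that both nodes share should do):

$ mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include lom0 --mca oob tcp --mca oob_tcp_if_include lom0 --np 4 --npernode 2 ./MPI_Isend_rtoa_c

If the warnings go away with only one interface in play, that would suggest the multi-interface handling is involved rather than the handshake itself.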
