In MTT (and manual) runs of Open MPI on master, I'm getting complaints from the TCP BTL about the peer unexpectedly closing the connection while it was trying to read a blocking message.
Note that I have a lot of IP interfaces on my Linux machines. Here's an example:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: lom0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 60:73:5c:68:c4:8a brd ff:ff:ff:ff:ff:ff
    inet 10.0.8.19/16 brd 10.0.255.255 scope global lom0
    inet6 fe80::6273:5cff:fe68:c48a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8b brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8c brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8d brd ff:ff:ff:ff:ff:ff
6: vic20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:20:00 brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.19/16 brd 10.10.255.255 scope global vic20
    inet6 fe80::2657:20ff:fe13:2000/64 scope link
       valid_lft forever preferred_lft forever
7: vic21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:21:00 brd ff:ff:ff:ff:ff:ff
    inet 10.2.0.19/16 brd 10.2.255.255 scope global vic21
    inet6 fe80::2657:20ff:fe13:2100/64 scope link
       valid_lft forever preferred_lft forever
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:50:00 brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.19/16 brd 10.3.255.255 scope global eth6
    inet6 fe80::2657:20ff:fe13:5000/64 scope link
       valid_lft forever preferred_lft forever
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 24:57:20:13:51:00 brd ff:ff:ff:ff:ff:ff
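One way to check whether the large number of interfaces is a factor would be to pin the TCP BTL and the OOB to a single up interface via the if_include MCA parameters. This is just a sketch using eth6 from the listing above:
$ mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth6 --mca oob_tcp_if_include eth6 --np 4 --npernode 2 --mca oob tcp ./MPI_Isend_rtoa_c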
Running 32 copies of the Intel MPI_Isend_rtoa_c test across two 16-core machines (mpi019 and mpi020), each with the same Linux IP interfaces, I sometimes see output like this (the example shown here is from an NP=4, PPN=2 run):
$ mpirun --mca btl tcp,vader,self --np 4 --npernode 2 --mca oob tcp ./MPI_Isend_rtoa_c
MPITEST info (0): Starting MPI_Isend_rtoa: Root Isends TO All test
[mpi020][[29976,1],3][../../../../../opal/mca/btl/tcp/btl_tcp.c:549:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
--------------------------------------------------------------------------
WARNING: Open MPI failed to handshake with a connecting peer MPI
process over TCP. This should not happen.
Your Open MPI job may now fail.
  Local host: mpi020
  PID: 28579
  Message: did not receive entire connect ACK from peer
--------------------------------------------------------------------------
MPITEST_results: MPI_Isend_rtoa: Root Isends TO All all tests PASSED (19656)
If I increase NP / PPN, the frequency of seeing those warning messages increases.
I've uploaded the output from a --mca btl_base_verbose 100 run with NP=4, PPN=2 to a gist.
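For reference, the verbose run is just the same command line as above with the verbosity parameter added, i.e. something like:
$ mpirun --mca btl tcp,vader,self --mca btl_base_verbose 100 --np 4 --npernode 2 --mca oob tcp ./MPI_Isend_rtoa_c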
@bwbarrett Is this due to recent changes in the connectivity stuff in the TCP BTL that you guys did, perchance?