In MTT (and manual) runs of Open MPI on master, I'm getting complaints from the TCP BTL about the peer unexpectedly closing the connection while it was trying to read a blocking message.
Note that I have a lot of IP interfaces on my Linux machines. Here's an example:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: lom0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 60:73:5c:68:c4:8a brd ff:ff:ff:ff:ff:ff
    inet 10.0.8.19/16 brd 10.0.255.255 scope global lom0
    inet6 fe80::6273:5cff:fe68:c48a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8b brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8c brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 60:73:5c:68:c4:8d brd ff:ff:ff:ff:ff:ff
6: vic20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:20:00 brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.19/16 brd 10.10.255.255 scope global vic20
    inet6 fe80::2657:20ff:fe13:2000/64 scope link
       valid_lft forever preferred_lft forever
7: vic21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:21:00 brd ff:ff:ff:ff:ff:ff
    inet 10.2.0.19/16 brd 10.2.255.255 scope global vic21
    inet6 fe80::2657:20ff:fe13:2100/64 scope link
       valid_lft forever preferred_lft forever
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 24:57:20:13:50:00 brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.19/16 brd 10.3.255.255 scope global eth6
    inet6 fe80::2657:20ff:fe13:5000/64 scope link
       valid_lft forever preferred_lft forever
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 24:57:20:13:51:00 brd ff:ff:ff:ff:ff:ff
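One way to check whether the large number of interfaces is a factor would be to pin the TCP BTL and the OOB to a single up interface via the if_include MCA parameters. This is just a sketch using eth6 from the listing above:
$ mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth6 --mca oob_tcp_if_include eth6 --np 4 --npernode 2 --mca oob tcp ./MPI_Isend_rtoa_c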
Running 32 copies of the Intel MPI_Isend_rtoa_c test across two 16-core machines (mpi019 and mpi020), each with the same Linux IP interfaces, I sometimes see output like this (the example shown here is from an NP=4, PPN=2 run):
$ mpirun --mca btl tcp,vader,self --np 4 --npernode 2 --mca oob tcp ./MPI_Isend_rtoa_c
MPITEST info (0): Starting MPI_Isend_rtoa: Root Isends TO All test
[mpi020][[29976,1],3][../../../../../opal/mca/btl/tcp/btl_tcp.c:549:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
--------------------------------------------------------------------------
WARNING: Open MPI failed to handshake with a connecting peer MPI
process over TCP. This should not happen.
Your Open MPI job may now fail.
  Local host: mpi020
  PID: 28579
  Message: did not receive entire connect ACK from peer
--------------------------------------------------------------------------
MPITEST_results: MPI_Isend_rtoa: Root Isends TO All all tests PASSED (19656)
If I increase NP / PPN, the frequency of seeing those warning messages increases.
I've uploaded the output from a --mca btl_base_verbose 100 run with NP=4, PPN=2 to a gist.
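For reference, the verbose run is just the same command line as above with the verbosity parameter added, i.e. something like:
$ mpirun --mca btl tcp,vader,self --mca btl_base_verbose 100 --np 4 --npernode 2 --mca oob tcp ./MPI_Isend_rtoa_c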
@bwbarrett Is this due to recent changes in the connectivity stuff in the TCP BTL that you guys did, perchance?