@artpol84
Contributor

No description provided.

@artpol84 artpol84 added this to the v2.0.2 milestone Nov 30, 2016
@artpol84 artpol84 changed the title Oob/v2.0.x/msg drop oob/v2.0.x/msg drop Nov 30, 2016
artpol84 and others added 3 commits December 1, 2016 06:46
The problem was observed with direct modex used together with the recursive doubling
algorithm (used for collective ID calculation prior to d52a2d0). The algorithm is
pairwise in nature, so simultaneous counter-connections are highly likely.

The following scenario exposes the issue:
* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and receives the ack from `x`;
* `y` sees that it has already started connecting and that `y` > `x`, so it rejects the incoming connection;
* `x` detects that its connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when it tries to
read the message header using `tcp_peer_recv_blocking()`, which calls `mca_oob_tcp_peer_close()`;
that call effectively flushes all the messages in `peer->send_queue`;
* `y` sends the ack to `x` and the connection is established, but all the messages queued for the peer
at `x` are lost (except the front one in `peer->send_msg`).

This commit introduces a "nack" function that is used on the `y` side to tell `x` that `y` has
priority and that `x`'s connection should be closed. This way `x` no longer has to guess why a
connection was unexpectedly closed.
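
The tie-break described above can be sketched roughly as follows. This is a simplified
model and not the actual Open MPI code: the types `peer_t`, `peer_state_t` and
`connect_reply_t`, and the functions `handle_incoming_connect()`, `handle_connect_reply()`
and `handle_unexpected_close()`, are hypothetical names introduced only for illustration.
The state kept after a nack is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one peer entry (not the real OOB/TCP structs). */
typedef enum { PEER_CLOSED, PEER_CONNECTING, PEER_CONNECTED } peer_state_t;
typedef enum { REPLY_ACK, REPLY_NACK } connect_reply_t;

typedef struct {
    unsigned     rank;        /* rank of the remote process */
    peer_state_t state;       /* our connection state towards that process */
    size_t       queued_msgs; /* messages waiting in the send queue */
} peer_t;

/* Receiving side: decide how to answer an incoming connection request.
 * If we are already connecting to this peer and our rank is higher,
 * we have priority and explicitly reject the request with a nack. */
static connect_reply_t handle_incoming_connect(unsigned my_rank, const peer_t *peer)
{
    assert(peer != NULL);
    if (peer->state == PEER_CONNECTING && my_rank > peer->rank) {
        return REPLY_NACK;
    }
    return REPLY_ACK;
}

/* Initiating side: react to the explicit reply from the remote process.
 * On a nack only our own connection attempt is dropped; the queued
 * messages are kept for the connection that the peer is establishing
 * (assumed here to leave the peer in the "connecting" state). */
static void handle_connect_reply(peer_t *peer, connect_reply_t reply)
{
    assert(peer != NULL);
    if (reply == REPLY_NACK) {
        peer->state = PEER_CONNECTING; /* wait for the peer's connection */
    } else {
        peer->state = PEER_CONNECTED;
    }
}

/* The behavior that caused the message loss: an unexplained close is
 * treated as a failure and the send queue is flushed. */
static void handle_unexpected_close(peer_t *peer)
{
    assert(peer != NULL);
    peer->queued_msgs = 0;
    peer->state = PEER_CLOSED;
}
```

In this model, rank `y` answers `x` with `REPLY_NACK`, so `x` goes through
`handle_connect_reply()` and keeps its queue, instead of reaching
`handle_unexpected_close()` and dropping it.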

(cherry-picked from ada93e0)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry-picked from 30ff8be)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Plug coverity defect CID 1396541.
(cherry-picked from bf79e83)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
@artpol84 artpol84 force-pushed the oob/v2.0.x/msg_drop branch from 07a8068 to 91bb88e Compare November 30, 2016 23:47
@artpol84
Contributor Author

@jjhursey If I follow the link, I see that the test has passed. Can you check, please?

@artpol84
Contributor Author

artpol84 commented Dec 1, 2016

bot:ibm:retest

@jsquyres
Member

jsquyres commented Dec 2, 2016

@rhc54 Can you review? Thanks.

@jsquyres
Member

jsquyres commented Dec 2, 2016

@hppritcha Good to go.

@jsquyres jsquyres merged commit b124a71 into open-mpi:v2.0.x Dec 3, 2016