@artpol84
Contributor

The problem was observed with direct modex combined with the recursive doubling
algorithm (used for collective ID calculation prior to d52a2d0). The algorithm
is pairwise in nature, so counter-connections are highly likely.

The following scenario exposed the issue:

  • ranks x and y want to communicate with each other, x < y;
  • rank x initiates the connection and sends the ack;
  • rank y starts to connect() and gets the ack from x;
  • y sees that it has already started connecting and that y > x, so it rejects the incoming connection;
  • x sees that its connection was rejected in mca_oob_tcp_peer_recv_connect_ack() when it tries to
    read the message header using tcp_peer_recv_blocking(), which calls mca_oob_tcp_peer_close();
    this effectively flushes all the messages in peer->send_queue;
  • y sends the ack to x and the connection is established, but all the messages queued for the peer
    at x have vanished (except the front one in peer->send_msg).

This commit introduces a "nack" function that is used on the y side to tell x that y has
priority and that x's connection should be closed. This avoids "guessing" about an unexpectedly
closed connection.
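
The tie-break described above can be illustrated with a small, self-contained sketch. This is not the Open MPI code: every name below (`peer_t`, `reply_t`, `answer_incoming_connect`, `handle_connect_reply`, `handle_unexpected_close`, the `PEER_*` states) is hypothetical, and the send queue is reduced to a counter. It also assumes that an explicit nack lets x keep its queued messages, which is what the description implies but does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reply codes; the real OOB/TCP wire format is not shown here. */
typedef enum { REPLY_ACK, REPLY_NACK } reply_t;

/* Hypothetical connection states for one peer. */
typedef enum { PEER_IDLE, PEER_CONNECTING, PEER_AWAITING_PEER, PEER_CONNECTED } peer_state_t;

typedef struct {
    int          my_rank;    /* rank of the local process */
    int          peer_rank;  /* rank of the remote process */
    peer_state_t state;
    size_t       queued;     /* stand-in for the messages in a send queue */
} peer_t;

/* Side receiving a connection request (y in the scenario): if we are already
 * connecting to the same peer and we have the higher rank, our own connection
 * wins, so tell the peer explicitly instead of just closing the socket. */
reply_t answer_incoming_connect(const peer_t *p)
{
    if (p->state == PEER_CONNECTING && p->my_rank > p->peer_rank) {
        return REPLY_NACK;
    }
    return REPLY_ACK;
}

/* Side that initiated the connection (x in the scenario): a nack means
 * "close your socket, mine has priority". Assumption: the queue is kept,
 * because the peer's connection is about to replace ours. */
void handle_connect_reply(peer_t *p, reply_t reply)
{
    assert(p->state == PEER_CONNECTING);
    if (reply == REPLY_NACK) {
        p->state = PEER_AWAITING_PEER;  /* queued messages are left untouched */
    } else {
        p->state = PEER_CONNECTED;
    }
}

/* Behaviour the scenario complains about: without a nack, x can only see an
 * unexpected close and treats it as a failure, flushing pending messages. */
void handle_unexpected_close(peer_t *p)
{
    p->queued = 0;
    p->state  = PEER_IDLE;
}
```

With the explicit reply, the initiating side can tell "I lost the tie-break" apart from "the peer went away", so only the second case needs to discard pending messages.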

(cherry-picked from ada93e0)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>

@rhc54
Contributor

rhc54 commented Nov 30, 2016

Please include 30ff8be

@artpol84
Contributor Author

@rhc54 - I'm not sure whose "sign-off" should be there. I'll change it if it's not correct.

@rhc54
Contributor

rhc54 commented Nov 30, 2016

It doesn't matter, so long as someone's is there.

@artpol84
Contributor Author

@jjhursey looks like an fs failure.

@jjhursey
Member

bot:ibm:retest

artpol84 and others added 2 commits December 1, 2016 06:45
The problem was observed with direct modex combined with the recursive doubling
algorithm (used for collective ID calculation prior to d52a2d0). The algorithm
is pairwise in nature, so counter-connections are highly likely.

The following scenario exposed the issue:
* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and gets the ack from `x`;
* `y` sees that it has already started connecting and that `y` > `x`, so it rejects the incoming connection;
* `x` sees that its connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when it tries to
read the message header using `tcp_peer_recv_blocking()`, which calls `mca_oob_tcp_peer_close()`;
this effectively flushes all the messages in peer->send_queue;
* `y` sends the ack to `x` and the connection is established, but all the messages queued for the peer
at `x` have vanished (except the front one in peer->send_msg).

This commit introduces a "nack" function that is used on the `y` side to tell `x` that `y` has
priority and that `x`'s connection should be closed. This avoids "guessing" about an unexpectedly
closed connection.

(cherry-picked from ada93e0)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry-picked from 30ff8be)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Plug coverity defect CID 1396541.
(cherry-picked from bf79e83)

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
@jsquyres
Member

jsquyres commented Dec 5, 2016

@rhc54 Can you review? Thanks.

@jsquyres jsquyres modified the milestones: v2.x, v2.1.0 Dec 5, 2016
@jsquyres
Member

jsquyres commented Dec 7, 2016

@hppritcha Good to go.

@artpol84
Contributor Author

artpol84 commented Dec 8, 2016

@hppritcha any updates on this? We would like to get it in so we won't see MTT failures.
