oob/v2.x/msg drop #2477
Merged
Please include 30ff8be
@rhc54 - I'm not sure whose "sign-off" should be there. I'll change it if it's not correct.
It doesn't matter, so long as someone's is there
@jjhursey looks like an fs failure.
bot:ibm:retest
The problem was observed for direct modex used with the recursive doubling algorithm (used for collective ID calculation prior to d52a2d0), which is pairwise in nature, so counter-connections are highly likely. The following scenario exposed the issue:

* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and gets the ack from `x`;
* `y` identifies that it has already started connecting and that `y` > `x`, so it rejects the incoming connection;
* `x` sees that its connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when trying to read the message header using `tcp_peer_recv_blocking()`, which calls `mca_oob_tcp_peer_close()`, which in turn effectively flushes all the messages in `peer->send_queue`;
* `y` sends the ack to `x` and the connection is established, but all the messages queued for the peer at `x` have vanished (except the front one in `peer->send_msg`).

This commit introduces a "nack" function that is used on the `y` side to tell `x` that `y` has priority and that `x`'s connection should be closed. This avoids "guessing" on an unexpectedly closed connection.

(cherry-picked from ada93e0)
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
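The collision scenario described in the commit message can be modelled with a small toy program. This is only a sketch under stated assumptions, not the code from this PR: the types `peer_t`, `reply_t`, `peer_state_t` and the functions `answer_incoming()`, `handle_silent_close()`, `handle_reply()` and `simulate_collision()` are all hypothetical names made up for illustration, no real sockets are involved, and only the send queue is counted (the message already held in `peer->send_msg` is left out of the model).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reply codes; these are not the real OOB/TCP wire values. */
typedef enum { REPLY_ACK, REPLY_NACK } reply_t;

/* Hypothetical, simplified peer states. */
typedef enum { PEER_CLOSED, PEER_CONNECTING, PEER_CONNECTED } peer_state_t;

typedef struct {
    int my_rank;
    int peer_rank;
    peer_state_t state;
    int queued;              /* messages waiting in the send queue */
} peer_t;

/* Receiving side: a rank that is already connecting and has the higher
 * rank rejects the incoming connection; otherwise it accepts it. */
static reply_t answer_incoming(const peer_t *p)
{
    if (p->state == PEER_CONNECTING && p->my_rank > p->peer_rank) {
        return REPLY_NACK;
    }
    return REPLY_ACK;
}

/* Initiating side without an explicit nack: the rejection looks like an
 * unexpected close, so the queued messages are flushed. */
static void handle_silent_close(peer_t *p)
{
    assert(p->state == PEER_CONNECTING);
    p->queued = 0;
    p->state = PEER_CLOSED;
}

/* Initiating side with an explicit reply: on a nack the queue is kept and
 * the peer keeps waiting for the other side's connection to complete. */
static void handle_reply(peer_t *p, reply_t r)
{
    assert(p->state == PEER_CONNECTING);
    if (r == REPLY_ACK) {
        p->state = PEER_CONNECTED;
    }
    /* REPLY_NACK: stay in PEER_CONNECTING, leave p->queued untouched. */
}

/* Runs the x < y collision once and returns how many messages are still
 * queued at x after the connection is finally established. */
static int simulate_collision(bool use_nack, int queued_at_x)
{
    peer_t x = { .my_rank = 0, .peer_rank = 1,
                 .state = PEER_CONNECTING, .queued = queued_at_x };
    peer_t y = { .my_rank = 1, .peer_rank = 0,
                 .state = PEER_CONNECTING, .queued = 0 };

    reply_t r = answer_incoming(&y);      /* y rejects x's connection */
    if (r == REPLY_NACK && !use_nack) {
        handle_silent_close(&x);          /* rejection seen as a bare close */
    } else {
        handle_reply(&x, r);              /* rejection seen as a nack */
    }

    /* y's own connection then completes on both sides. */
    x.state = PEER_CONNECTED;
    y.state = PEER_CONNECTED;
    return x.queued;
}
```

In this model, `simulate_collision(false, 3)` leaves nothing queued at `x`, while `simulate_collision(true, 3)` keeps all three messages, which is the difference the explicit nack is meant to make.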
(cherry-picked from 30ff8be) Signed-off-by: Artem Polyakov <artpol84@gmail.com>
9e36780 to 94a1d42
Plug coverity defect CID 1396541. (cherry-picked from bf79e83) Signed-off-by: Artem Polyakov <artpol84@gmail.com>
94a1d42 to f508887
@rhc54 Can you review? Thanks.
rhc54 approved these changes Dec 7, 2016
@hppritcha Good to go.
@hppritcha any updates on this? We would like to get it in so we won't see MTT failures.
This was referenced Dec 10, 2016