New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/sockets: Fix #3217, bug in buffered receive path #3223

Merged
merged 1 commit into from Aug 16, 2017

Conversation

Projects
None yet
2 participants
@epaulson10

epaulson10 commented Aug 16, 2017

This bug was found by testing GASNet, which has some tests that very
rigourously stress messaging endpoints. The behaviour was that tests would
sporadically hang when put under heavy loads.

The root cause is found in the provider's handling of previously buffered
messages (ones that came in without a matching receive buffer). When the
handler grabbed a posted buffer to place a buffered message into, it was not
marking that receive buffer as free once it was done with it, which causes
buffers posted with FI_MULTI_RECV to become unusable in the future, which
leads to the provider running out of receive buffers.

After fixing the root cause, there was also an issue in calculating the
position in which to place a message in such a multi-recv buffer, which this
patch also fixes.

Signed-off-by: Erik Paulson erik.r.paulson@intel.com

@epaulson10

This comment has been minimized.

Show comment
Hide comment
@epaulson10

epaulson10 Aug 16, 2017

Travis failed fi_multi_ep on the linux GCC build, but passed it on the Clang Linux build. Here was the output:

- name:   fi_multi_ep -e msg -p "sockets"

  result: Fail

  time:   90

  server_cmd: fi_multi_ep -e msg -p "sockets" -s 127.0.0.1

  server_stdout: |

    transmit(): common/shared.c:1950, ret=-11 (Resource temporarily unavailable)

  client_cmd: fi_multi_ep -e msg -p "sockets" -s 127.0.0.1 127.0.0.1

  client_stdout: |

    PASSED multi ep

This test passes when I run it on my Linux machine using GCC.

epaulson10 commented Aug 16, 2017

Travis failed fi_multi_ep on the linux GCC build, but passed it on the Clang Linux build. Here was the output:

- name:   fi_multi_ep -e msg -p "sockets"

  result: Fail

  time:   90

  server_cmd: fi_multi_ep -e msg -p "sockets" -s 127.0.0.1

  server_stdout: |

    transmit(): common/shared.c:1950, ret=-11 (Resource temporarily unavailable)

  client_cmd: fi_multi_ep -e msg -p "sockets" -s 127.0.0.1 127.0.0.1

  client_stdout: |

    PASSED multi ep

This test passes when I run it on my Linux machine using GCC.

@shefty

This comment has been minimized.

Show comment
Hide comment
@shefty

shefty Aug 16, 2017

Contributor

Ignore multi_ep tests. It fails frequently for any PR. I haven't been able to isolate the cause of the failure. The test itself looks like it should be okay, but it's a new test.

Contributor

shefty commented Aug 16, 2017

Ignore multi_ep tests. It fails frequently for any PR. I haven't been able to isolate the cause of the failure. The test itself looks like it should be okay, but it's a new test.

Erik Paulson
prov/sockets: Fix #3217, bug in buffered receive path
This bug was found by testing GASNet, which has some tests that very
rigourously stress messaging endpoints. The behaviour was that tests would
sporadically hang when put under heavy loads.

The root cause is found in the provider's handling of previously buffered
messages (ones that came in without a matching receive buffer). When the
handler grabbed a posted buffer to place a buffered message into, it was not
marking that receive buffer as free once it was done with it, which causes
buffers posted with FI_MULTI_RECV to become unusable in the future, which
leads to the provider running out of receive buffers.

After fixing the root cause, there was also an issue in calculating the
position in which to place a message in such a multi-recv buffer, which this
patch also fixes.

Signed-off-by: Erik Paulson <erik.r.paulson@intel.com>

@shefty shefty merged commit 30084d2 into ofiwg:master Aug 16, 2017

2 of 3 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
DCO All commits have a DCO sign-off from the author
Signed-off-by checker This commit is signed off
@shefty

This comment has been minimized.

Show comment
Hide comment
@shefty

shefty Aug 16, 2017

Contributor

thanks - also adding to v1.4.x branch

Contributor

shefty commented Aug 16, 2017

thanks - also adding to v1.4.x branch

@epaulson10

This comment has been minimized.

Show comment
Hide comment
@epaulson10

epaulson10 Dec 12, 2017

@shefty I think this needs to get picked to v1.5.x.

epaulson10 commented Dec 12, 2017

@shefty I think this needs to get picked to v1.5.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment