
Sockets: The provider is losing messages when FI_MULTI_RECV space runs low. #3217

Closed
epaulson10 opened this Issue Aug 15, 2017 · 7 comments


epaulson10 commented Aug 15, 2017

First off, this issue was reported by LBNL when testing GASNet and is described in detail on their Bugzilla here: https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3597

Second, here are the hints being used in GASNet:

hints->caps                           = FI_RMA | FI_MSG | FI_MULTI_RECV;
hints->mode                           = FI_CONTEXT;
hints->addr_format                    = FI_FORMAT_UNSPEC;
hints->tx_attr->op_flags              = FI_DELIVERY_COMPLETE;
hints->ep_attr->type                  = FI_EP_RDM;
hints->domain_attr->threading         = FI_THREAD_DOMAIN;
hints->domain_attr->control_progress  = FI_PROGRESS_MANUAL;
hints->domain_attr->resource_mgmt     = FI_RM_ENABLED;
hints->domain_attr->av_type           = FI_AV_TABLE;

Also note that this test is running in a single thread.
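For reference, the multi-recv buffers are posted roughly like the sketch below. This is only an illustration of the API as I understand it, not GASNet's actual code; the buffer size, threshold, and function name are placeholders.

#include <stdlib.h>
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_endpoint.h>

/* Post one 1MB buffer as a shared (multi-recv) receive buffer on an
 * already-enabled FI_EP_RDM endpoint. The sizes are placeholders. */
static ssize_t post_multi_recv(struct fid_ep *ep, void *context)
{
    size_t buf_len = 1 << 20;   /* 1MB, matching GASNet's default */
    size_t min_free = 8192;     /* free space left before the provider retires the buffer */
    void *buf = malloc(buf_len);
    int ret;

    if (!buf)
        return -FI_ENOMEM;

    struct iovec iov = { .iov_base = buf, .iov_len = buf_len };
    struct fi_msg msg = {
        .msg_iov   = &iov,
        .iov_count = 1,
        .addr      = FI_ADDR_UNSPEC,   /* accept messages from any peer */
        .context   = context,
    };

    /* Normally set once at endpoint setup: how much free space must remain
     * before the provider reports the buffer as consumed via the
     * FI_MULTI_RECV completion flag. */
    ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                    &min_free, sizeof(min_free));
    if (ret)
        return ret;

    /* FI_MULTI_RECV lets many incoming messages land in this one buffer. */
    return fi_recvmsg(ep, &msg, FI_MULTI_RECV);
}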

The test that hangs due to lost messages is testam, which fires a large number of messages at once. The failure is sporadic and occurs at different points: sometimes during the test warmup, sometimes in the test body, and sometimes after the test has run. To reproduce it, I loop the test like so:
$ while FI_PROVIDER=sockets ./contrib/gasnetrun_ofi -np 2 ./testam 10000 8192 F; do :; done

The specific test above (selected by the 'F' argument) is a two-way flood test, meaning both nodes send messages to each other. Examining the processes, I see that one process has finished sending and receiving all of its messages (10000, as specified on the command line). The other process has successfully sent all of its messages but has received only a portion of them. In the run on my screen right now, the stuck node has received only 5790 of the 10,000 messages.

I believe this has to do with how the sockets provider handles buffered messages when the FI_MULTI_RECV buffers in GASNet run out of space. The first indication is the log output I get. Sometimes (but not always), I get this log dump:

libfabric:sockets:cq:_sock_cq_write():181<warn> Not enough space in CQ                        
...            
libfabric:sockets:cq:_sock_cq_write():181<warn> Not enough space in CQ                        
libfabric:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit 
...
libfabric:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit 

By default, GASNet allocates 8 multi-recv buffers, each 1MB in size, split between two receiving endpoints. If I greatly increase the size or number of the receive buffers (make each 8MB, or post 8 times as many 1MB buffers), I can't reproduce the error. If I reduce the number of buffers to two (the minimum, since there are two MSG endpoints in GASNet/ofi), the test won't function at all, even though it does function under those circumstances with other providers such as psm2.
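For context, the receive-side pattern I am assuming looks roughly like this (hypothetical helper names, not GASNet's code): every completion is drained from the CQ, and when a completion carries FI_MULTI_RECV the buffer it came from is considered exhausted and gets reposted.

#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical application hooks; stand-ins for GASNet's real handlers. */
extern void handle_message(void *buf, size_t len);
extern int repost_buffer(struct fid_ep *ep, void *context);

/* Drain the receive CQ. A completion with FI_MULTI_RECV set means the
 * multi-recv buffer identified by op_context has been released by the
 * provider and must be reposted before more messages can be received. */
static void poll_and_repost(struct fid_cq *cq, struct fid_ep *ep)
{
    struct fi_cq_data_entry comp;

    while (fi_cq_read(cq, &comp, 1) > 0) {
        if (comp.flags & FI_RECV)
            handle_message(comp.buf, comp.len);
        if (comp.flags & FI_MULTI_RECV)
            repost_buffer(ep, comp.op_context);
    }
    /* fi_cq_read returns -FI_EAGAIN when the CQ is simply empty. */
}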

shefty commented Aug 15, 2017

From the log it appears that the provider is unable to write the completion, which results in lost events. This may be an issue with the resource management support.


epaulson10 commented Aug 15, 2017

> From the log it appears that the provider is unable to write the completion, which results in lost events.
> This may be an issue with the resource management support.

If my reading of the code is correct, this message means that sockets will malloc additional memory to hold the completion entry. I sometimes get this message (by itself) on successful runs. I can also completely squelch this message and still get a hang by setting the CQ size to a number larger than the number of messages I need to send.
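To be clear, the CQ size I am referring to is the depth requested at fi_cq_open time; a minimal sketch, where the value 16384 is only an example and not GASNet's real sizing logic:

#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_domain.h>

/* Request a CQ deep enough that the provider never hits the
 * "Not enough space in CQ" overflow path. 16384 is just an example
 * value larger than the number of in-flight messages in this test. */
static int open_big_cq(struct fid_domain *domain, struct fid_cq **cq)
{
    struct fi_cq_attr cq_attr = {
        .size   = 16384,                /* > messages outstanding at once */
        .format = FI_CQ_FORMAT_DATA,    /* completions read as fi_cq_data_entry */
    };

    return fi_cq_open(domain, &cq_attr, cq, NULL);
}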

epaulson10 commented Aug 15, 2017

Another test I just did was adding simple counters to track the number of posted multi-recv buffers at any given time. At the point of buffer consumption (when FI_MULTI_RECV is set on a completion entry), I added a check for whether the number of posted buffers ever hits zero. Unless I configure the minimum number of buffers, it never does. So there is always a buffer available when this hang happens; however, for some reason, increasing the number of available buffers stops it from happening. I'm wondering whether the space inside a buffer is getting consumed to the point where there is not enough left, yet the buffer is not being marked as depleted.
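The instrumentation was nothing fancier than the sketch below (hypothetical names; the real checks live inside GASNet's completion handling):

#include <assert.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Count currently posted FI_MULTI_RECV buffers: incremented whenever a
 * buffer is (re)posted, decremented whenever a completion reports one as
 * consumed. The assert only fires if the receiver runs completely dry. */
static int num_posted_bufs;

static void note_buffer_posted(void)
{
    num_posted_bufs++;
}

static void note_completion(const struct fi_cq_data_entry *comp)
{
    if (comp->flags & FI_MULTI_RECV) {
        num_posted_bufs--;
        assert(num_posted_bufs > 0);    /* never tripped, except at the 2-buffer minimum */
    }
}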


shefty commented Aug 15, 2017

Do you always or only sometimes see this message:

libfabric:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit 

epaulson10 commented Aug 15, 2017

Only sometimes, and when it appears it is accompanied by the message about the CQ not having enough space.

epaulson10 commented Aug 16, 2017

@shefty and I believe we have found the solution to this bug, but after tracking it down I don't have the mental fortitude left to patch and test it. I will do that tomorrow.

Note to self, the relevant diff is:

diff --git a/prov/sockets/src/sock_progress.c b/prov/sockets/src/sock_progress.c
index feeb94b..ecdac80 100644
--- a/prov/sockets/src/sock_progress.c
+++ b/prov/sockets/src/sock_progress.c
@@ -1160,14 +1160,15 @@ static int sock_pe_progress_buffered_rx(struct sock_rx_ctx *rx_ctx)
                datatype_sz = (rx_buffered->flags & FI_ATOMIC) ?
                        ofi_datatype_size(rx_buffered->rx_op.atomic.datatype) : 0;
                offset = 0;
+                // rem: size of the message
                rem = rx_buffered->iov[0].iov.len;
                rx_ctx->buffered_len -= rem;
                used_len = rx_posted->used;
                pe_entry.data_len = 0;
                pe_entry.buf = 0L;
                for (i = 0; i < rx_posted->rx_op.dest_iov_len && rem > 0; i++) {
-                       if (used_len >= rx_posted->rx_op.dest_iov_len) {
-                               used_len -= rx_posted->rx_op.dest_iov_len;
+                       if (used_len >= rx_posted->iov[i].iov.len) {
+                               used_len -= rx_posted->iov[i].iov.len;
                                continue;
                        }
@@ -1221,11 +1225,15 @@ static int sock_pe_progress_buffered_rx(struct sock_rx_ctx *rx_ctx)

                if (rem) {
                        SOCK_LOG_DBG("Not enough space in posted recv buffer\n");
                        sock_pe_report_rx_error(&pe_entry, rem, FI_ETRUNC);
                } else {
                        sock_pe_report_recv_completion(&pe_entry);
                }

+                rx_posted->is_busy = 0;
+
                dlist_remove(&rx_buffered->entry);
                sock_rx_release_entry(rx_buffered);

bonachea commented Aug 16, 2017

Thank you both for your quick response and great work!

epaulson10 pushed a commit to epaulson10/libfabric that referenced this issue Aug 16, 2017

Erik Paulson
prov/sockets: Fix ofiwg#3217, bug in buffered receive path
This bug was found by testing GASNet, which has some tests that very
rigorously stress messaging endpoints. The behaviour was that tests would
sporadically hang when put under heavy load.

The root cause is in the provider's handling of previously buffered
messages (ones that arrived without a matching receive buffer). When the
handler grabbed a posted buffer to place a buffered message into, it did
not mark that receive buffer as free once it was done with it, which
caused buffers posted with FI_MULTI_RECV to become unusable and eventually
left the provider with no receive buffers.

After fixing the root cause, there was also an issue in calculating the
position at which to place a message within such a multi-recv buffer,
which this patch also fixes.

Signed-off-by: Erik Paulson <erik.r.paulson@intel.com>

shefty closed this in 6c5d131 Aug 16, 2017

shefty added a commit that referenced this issue Aug 16, 2017

Merge pull request #3223 from epaulson10/master
prov/sockets: Fix #3217, bug in buffered receive path

shefty added a commit to shefty/libfabric that referenced this issue Aug 16, 2017

prov/sockets: Fix ofiwg#3217, bug in buffered receive path

shefty added a commit that referenced this issue Aug 17, 2017

Merge pull request #3225 from shefty/v1.4.x
prov/sockets: Fix #3217, bug in buffered receive path

epaulson10 pushed a commit to epaulson10/libfabric that referenced this issue Jan 30, 2018

Erik Paulson
prov/sockets: Fix ofiwg#3217, bug in buffered receive path
(cherry picked from commit 6c5d131)

shefty added a commit that referenced this issue Jan 30, 2018

Merge pull request #3746 from epaulson10/v1.5.x
Cherry-pick to v1.5.x: prov/sockets: Fix #3217, bug in buffered receive path