
Possible bug in RDP TX timeout #109

Closed
marikka opened this issue Sep 15, 2017 · 10 comments

@marikka

marikka commented Sep 15, 2017

The conditions on these two lines seem contradictory:

while (csp_rdp_seq_after(conn->rdp.snd_nxt, conn->rdp.snd_una + (uint16_t)conn->rdp.window_size)) {

if (csp_rdp_seq_before(conn->rdp.snd_nxt - conn->rdp.snd_una, conn->rdp.window_size * 2))

In the condition within the csp_rdp_send function, the window size is not multiplied by two.
In the condition within the csp_rdp_check_timeouts function, the window size is multiplied by two.

The effect is that the program can end up in a loop where the timeout here never happens:

return CSP_ERR_TIMEDOUT;
Therefore the while loop never exits.

@johandc

johandc commented Sep 23, 2017

Thank you for the report. I will look into this before the 1.5 release.

@johandc

johandc commented Aug 21, 2018

Hi again, sorry for the wait.

I believe this is not a bug, but I might have an explanation for your issue:

The TX task is stopped at the point where the number of unacknowledged packets equals the window size, and it is woken again when that number drops to half the window size. What needs to happen in order to break out of the wait is simply to transmit the packets and receive an ACK from the counterpart.

However, that's where I have found a related issue. Maybe your issue could be related to window size and queue lengths. But first a bit of reasoning:
In TX mode, we know exactly what's in the TX queue and can control it exactly, so the entire buffer could be utilized for outgoing traffic. For incoming data, however, we don't know how many frames are already "in the air" towards us. So we should not utilize the entire buffer in RX mode, because otherwise we would have to drop the next incoming frame.

So we have a similar check in RX mode, where an ACK would only be sent if there is more than half the buffer available.

Setting the queue length to 10 and the window size to 4 works fine. But increasing the window size to 5 now gives this problem where the receiver will not automatically ack because there is never room for half the window size or more.

Consequently we can do two things:
a) Increase the queue length of the receiver, so it can hold 2 * window_size packets.
b) Decrease the window_size overhead in the RX check from a factor of 2 to perhaps a factor of 1.

Either way, we should do a check on the RDP options in order to ensure the parameters match the queue length limitations.

I am working on a patch that will be released in v1.5 to solve this.

@johandc johandc closed this as completed Aug 21, 2018
@perwulff

I think it makes sense to have the same check "for waking TX" as the one used in csp_rdp_send(); it seems like a simple fix, and why wake the TX task if it's not allowed to send more messages?
csp_rdp_send() already exceeds the window size by 1, which is another bug.
We are currently testing this fix in GomSpace.

@johandc

johandc commented Aug 21, 2018

It's a balance whether to wake the TX task for each available slot, or when the TX queue is at half full.

In the event that delayed acks are not used, the first case would lead to more context switches and overhead, whereas the second (and current) choice could lead to a TX buffer underflow in the case where the wakeup period is longer than it takes to ack half a window_size.

If delayed acks are used, where we only ack for > window_size/2, once the TX queue is flushed, there will most likely be more than enough room for window_size/2. So it will not make any difference.

What would really make a difference would be to put the check into csp_rdp_new_packet. We might as well wake the TX task right away when an ACK is received, rather than just setting the ack number and waiting for the timeout_check function.

johandc pushed a commit to spaceinventor/libcsp that referenced this issue Aug 21, 2018
When deciding whether to ack a packet, we previously would consider the
RX queue as available if there was enough room for window_size * 2
packets.

However with a reduced conn_queue_length (the default setting is now 10 in
our setup), this would reduce the maximum allowed window size to 4.

Considering the fact that it is wasteful to reserve this much buffer
space, we have reduced the number to only a single time the window_size.
This allows the window size to approach the conn_queue_length.

The user should be mindful that using a window_size of 10 with a
queue_length of 10 could result in dropped packets, since some packets
could already be in the air. So setting the window size a bit below the
queue length is always recommended.

This is partially in response to issue libcsp#109
@johandc

johandc commented Aug 21, 2018

Just read this through again and would like to add a few remarks:

  1. To Per: The code currently does not wake the TX task if it's not allowed to send messages. In fact, it waits with the wakeup until there is room for at least a half window size again. So it's the opposite.
  2. To marikka: You say that the while loop never exits. I assume this was your bug. But in fact it should automatically exit when the connection is closed by a timeout. This flushes the entire TX queue and accepts no more packets, thus closing the userspace task as well.

@perwulff

We have seen the problem Marikka refers to, and it's easily reproduced by a simple test where task A sends messages to task B, and task B doesn't do any receiving.
The problem is caused by the differing checks, as Marikka describes. The test also shows that task A actually sends window size + 1 messages before deadlocking. It deadlocks because TX is wrongly woken and therefore never times out.
The simple fix is to use the same check, so the TX task isn't woken unless it can send data or the connection is closing (the while loop in csp_rdp_send() is also missing a check on the connection state).

@johandc

johandc commented Aug 21, 2018

Aha, thanks for clarifying, now I see the issue.

I was blinded by the similar issue with the RX queue length I had just worked around. But I see now that the check on the TX side should have been window size divided by two, not multiplied, to work as I had expected. The current timeout check is completely invalid. I agree on that.

This is a very, very old issue, at least 7 or 8 years. The deadlock seems to have been added 2 years ago, when the check in the RDP send function was changed from an if to a while statement.

I agree with the fix.

@johandc johandc reopened this Aug 21, 2018
@johandc

johandc commented Aug 22, 2018

Just to add to the confusion: I just did a test like the one you mentioned. It seems I cannot reproduce the problem, because of the check in the line above the TX check:

if (conn->rdp.state == RDP_OPEN)
	if (csp_queue_size(conn->rdp.tx_queue) < (int)conn->rdp.window_size)
		if (csp_rdp_seq_before(conn->rdp.snd_nxt - conn->rdp.snd_una, conn->rdp.window_size * 2))
			csp_bin_sem_post(&conn->rdp.tx_wait);

The check for queue_size < window_size overrules the bad check in line 565, and therefore it still works in my example. No deadlock here. I'm trying to work out how you arrive at the deadlock situation; for it to occur, the queue must not be full, and snd_nxt must be greater than snd_una + window. The latter might be the case at the very beginning, if snd_una were initialized to zero while snd_nxt were initialized to a random number. Let me check that out.

/* Setup TX seq. */
conn->rdp.snd_iss = (uint16_t)rand();
conn->rdp.snd_nxt = conn->rdp.snd_iss + 1;
conn->rdp.snd_una = conn->rdp.snd_iss;

Edit: It seems that snd_nxt and snd_una are initialized correctly. So I cannot explain why your TX task would sleep, or why the timeout wakes it. I still agree that this should be cleaned up and corrected, but I still cannot reproduce the deadlock.

@perwulff

perwulff commented Aug 22, 2018

Sorry for not pushing the fix; I will try to do so very soon.
During additional testing of the fix, the logs showed some strange behaviour when the receiver's queue becomes full and the receiver "tries not to" acknowledge received packets. This doesn't seem to work exactly as expected/intended (I think).

@perwulff perwulff self-assigned this Dec 11, 2019
perwulff pushed a commit that referenced this issue Jan 15, 2020
…ome cases, the connection would switch to CLOSED immediately.

RDP: Fixed connection leak if a RST segment was received on a closed connection.
RDP: Ensure connection is closed from both userspace and protocol, before closing completely (preventing undetermined behaviour).
RDP: Added support for fast close of a connection (skipping the CLOSE_WAIT period), but only if both ends agree on close.
RDP: Fixed issue "Possible bug in RDP TX timeout" (#109), see issue on github for further details.
perwulff pushed a commit that referenced this issue Jan 16, 2020
- Ensure connection is kept in CLOSE_WAIT for a period of time. In some cases, the connection would switch to CLOSED immediately. (#138)
- Fixed connection leak, if a RST segment was received on a closed connection.
- Ensure connection is closed from both userspace and protocol, before closing completely (preventing undetermined behaviour).
- Added support for fast close of a connection (skipping the CLOSE_WAIT period), but only if both ends agree on close.
- Fixed issue "Possible bug in RDP TX timeout" (#109), see issue on github for further details.
@perwulff

This issue has been fixed in #138 .
