
Possible bug in RDP TX timeout #109

Closed
marikka opened this issue Sep 15, 2017 · 10 comments

@marikka

marikka commented Sep 15, 2017

The conditions on these two lines seem contradictory:

while (csp_rdp_seq_after(conn->rdp.snd_nxt, conn->rdp.snd_una + (uint16_t)conn->rdp.window_size)) {

if (csp_rdp_seq_before(conn->rdp.snd_nxt - conn->rdp.snd_una, conn->rdp.window_size * 2))

In the condition within the csp_rdp_send function, the window size is not multiplied by two.
In the condition within the csp_rdp_check_timeouts function, the window size is multiplied by two.

The effect is that the program can end up in a loop where the timeout here never happens:

return CSP_ERR_TIMEDOUT;
Therefore the while loop never exits.

@johandc

johandc commented Sep 23, 2017

Thank you for the report. I will look into this before the 1.5 release.

@johandc

johandc commented Aug 21, 2018

Hi again, sorry for the wait.

I believe this is not a bug, but I might have an explanation for your issue:

The TX task is stopped at the point where the number of unacknowledged packets equals the window size, and it is woken again when that number drops to half the window size. What needs to happen in order to break out of the wait is simply to transmit the packets and receive an ACK from the counterpart.

However, that's where I have found a related issue. Maybe your issue could be related to window size and queue lengths. But first a bit of reasoning:
In TX mode, we know exactly what's in the TX queue and can control it exactly, so the entire buffer could be utilized for outgoing traffic. For incoming data, however, we don't know how many frames are already "in the air" towards us. So we should not utilize the entire buffer in RX mode, because otherwise we would have to drop the next incoming frame.

So we have a similar check in RX mode, where an ACK would only be sent if there is more than half the buffer available.

Setting the queue length to 10 and the window size to 4 works fine. But increasing the window size to 5 now gives this problem where the receiver will not automatically ack because there is never room for half the window size or more.

Consequently we can do two things:
a) Increase the queue length of the receiver, so it can hold 2 * window_size packets.
b) Decrease the window_size overhead in the RX check from a factor of 2 to perhaps a factor of 1.

Either way, we should do a check on the RDP options in order to ensure the parameters match the queue length limitations.

I am working on a patch that will be released in v1.5 to solve this.

@johandc johandc closed this as completed Aug 21, 2018
@perwulff

I think it makes sense to have the same check "for waking TX" as the one used in csp_rdp_send(); it seems like a simple fix, and why wake the TX task if it's not allowed to send more messages?
csp_rdp_send() already exceeds the window size by 1, which is another bug.
We are currently testing this fix in GomSpace.

@johandc

johandc commented Aug 21, 2018

It's a balance whether to wake the TX task for each available slot, or when the TX queue is at half full.

In the event that delayed acks are not used, the first case would lead to more context switches and overhead, whereas the second (and current) choice could lead to a TX buffer underflow in the case where the wakeup period is longer than it takes to ack half a window_size.

If delayed acks are used, where we only ack for > window_size/2, once the TX queue is flushed, there will most likely be more than enough room for window_size/2. So it will not make any difference.

What would really make a difference would be to put the check into csp_rdp_new_packet. We might as well wake the TX task right away when an ACK is received, rather than just setting the ack number and waiting for the timeout_check function.

johandc pushed a commit to spaceinventor/libcsp that referenced this issue Aug 21, 2018
When deciding whether to ack a packet, we previously would consider the
RX queue as available if there was enough room for window_size * 2
packets.

However with a reduced conn_queue_length (the default setting is now 10 in
our setup), this would reduce the maximum allowed window size to 4.

Considering the fact that it is wasteful to reserve this much buffer
space, we have reduced the number to only a single time the window_size.
This allows the window size to approach the conn_queue_length.

The user should be mindful that using a window_size of 10 with a
queue_length of 10 could result in dropped packets, since some packets
could already be in the air. So setting the window size a bit below the
queue length is always recommended.

This is partially in response to issue libcsp#109
@johandc

johandc commented Aug 21, 2018

Just read this through again and would like to add a few remarks:

  1. To Per: The code currently does not wake the TX task if it's not allowed to send messages. In fact, it waits with the wakeup until there is room for at least a half window size again. So it's the opposite.
  2. To marikka: You say that the while loop never exits. I assume this was your bug. But in fact it should automatically exit when the connection is closed by a timeout. This flushes the entire TX queue and accepts no more packets, thus closing the userspace task as well.

@perwulff

We have seen the problem Marikka refers to, and it's easily reproduced by a simple test where task A sends messages to task B, and task B doesn't do any receiving.
The problem is caused by the differing checks, as Marikka describes. The test also shows that task A actually sends window size + 1 messages before deadlocking. It deadlocks because TX is wrongly woken and therefore never times out.
The simple fix is to use the same check, so the TX task isn't woken unless it can send data or the connection is closing (the while loop in csp_rdp_send() is also missing a check on the connection state).

@johandc

johandc commented Aug 21, 2018

Aha, thanks for clarifying, now I see the issue.

I was blinded by the similar issue with the RX queue length I had just worked around. But I see now that the check on the TX side should have been window size divided by two, not multiplied, to work as I had expected. The current timeout check is completely invalid. I agree on that.

This is a very, very old issue, at least 7 or 8 years. The deadlock seems to have been added 2 years ago, when the check in the RDP send function was changed from an if to a while statement.

I agree with the fix.

@johandc johandc reopened this Aug 21, 2018
@johandc

johandc commented Aug 22, 2018

Just to add to the confusion: I just did a test like the one you mentioned. It seems I cannot reproduce the problem, because of the check in the line above the TX check:

if (conn->rdp.state == RDP_OPEN)
	if (csp_queue_size(conn->rdp.tx_queue) < (int)conn->rdp.window_size)
		if (csp_rdp_seq_before(conn->rdp.snd_nxt - conn->rdp.snd_una, conn->rdp.window_size * 2))
			csp_bin_sem_post(&conn->rdp.tx_wait);

The check for queue_size < window_size overrules the bad check in line 565, and therefore it still works in my example. No deadlock here. I'm trying to work out how you arrive at the deadlock situation; for it to occur, the queue must not be full, and snd_nxt must be greater than snd_una + window. The latter might be the case at the very beginning, if snd_una were initialized to zero while snd_nxt were initialized to a random number. Let me check that out.

/* Setup TX seq. */
conn->rdp.snd_iss = (uint16_t)rand();
conn->rdp.snd_nxt = conn->rdp.snd_iss + 1;
conn->rdp.snd_una = conn->rdp.snd_iss;

Edit: It seems that snd_nxt and snd_una are initialized correctly. So I cannot explain why your TX task would sleep, or why the timeout wakes it. I still agree that this should be cleaned up and corrected, but I still cannot reproduce the deadlock.

@perwulff

perwulff commented Aug 22, 2018

Sorry for not pushing the fix; I will try to do so very soon.
During additional testing of the fix, the logs showed some strange behaviour when the receiver's queue becomes full and the receiver "tries not to" acknowledge received packets. This doesn't seem to work exactly as expected/intended (I think).

@perwulff perwulff self-assigned this Dec 11, 2019
perwulff pushed a commit that referenced this issue Jan 15, 2020
…ome cases, the connection would switch to CLOSED immediately.

RDP: Fixed connection leak if a RST segment was received on a closed connection.
RDP: Ensure connection is closed from both userspace and protocol, before closing completely (preventing undetermined behaviour).
RDP: Added support for fast close of a connection (skipping the CLOSE_WAIT period), but only if both ends agree on close.
RDP: Fixed issue "Possible bug in RDP TX timeout" (#109), see issue on github for further details.
perwulff pushed a commit that referenced this issue Jan 16, 2020
- Ensure connection is kept in CLOSE_WAIT for a period of time. In some cases, the connection would switch to CLOSED immediately. (#138)
- Fixed connection leak, if a RST segment was received on a closed connection.
- Ensure connection is closed from both userspace and protocol, before closing completely (preventing undetermined behaviour).
- Added support for fast close of a connection (skipping the CLOSE_WAIT period), but only if both ends agree on close.
- Fixed issue "Possible bug in RDP TX timeout" (#109), see issue on github for further details.
@perwulff

This issue has been fixed in #138 .
