Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stuck at retransmitting #224

Closed
pothos opened this issue Jun 1, 2018 · 19 comments
Closed

Stuck at retransmitting #224

pothos opened this issue Jun 1, 2018 · 19 comments
Labels

Comments

@pothos
Copy link
Contributor

pothos commented Jun 1, 2018

This code here https://github.com/m-labs/smoltcp/blob/master/src/socket/tcp.rs#L913 should either reset local_rx_dup_acks or let it count from 255 to 0 with a overflowing add or stay at 255 with a saturation. What is the desired behavior?

                        self.local_rx_dup_acks += 1;
                        if self.local_rx_dup_acks == 3 {
                        }

thread 'main' panicked at 'attempt to add with overflow', smoltcp/src/socket/tcp.rs:913:25

@pothos
Copy link
Contributor Author

pothos commented Jun 1, 2018

Actually master stopped working for me, I try to find out if related to the fast retrainsmit code.

@pothos
Copy link
Contributor Author

pothos commented Jun 1, 2018

Yes, both endpoints are stuck in retransmitting…

@whitequark
Copy link
Contributor

cc @barskern

@barskern
Copy link
Contributor

barskern commented Jun 1, 2018

@pothos Do you think you would be able to write a test which mimics this behavior? Or explain me how it happened so that I could write a test?

I would think that the desired behavior is to do a saturating addition, and change the condition from if self.local_rx_dup_acks == 3 to if self.local_rx_dup_acks >= 3. This would trigger a fast retransmit for every segment which is a duplicate ACK when we have recived more than 3 duplicate ACKs. However I did not find documentation that indicates that this is appropriate behavior for a TCP-implementation.

@whitequark
Copy link
Contributor

This would trigger a fast retransmit for every segment which is a duplicate ACK when we have recived more than 3 duplicate ACKs.

I'm not sure that this is right. Per my understanding, with the modified code, if e.g. first packet in a burst of 8 gets lost, the other 7 will arrive, and 4 of them will trigger retransmit of the entire burst. Is that correct?

@barskern
Copy link
Contributor

barskern commented Jun 2, 2018

Per my understanding, with the modified code, if e.g. first packet in a burst of 8 gets lost, the other 7 will arrive, and 4 of them will trigger retransmit of the entire burst. Is that correct?

This is the reason I previously used == 3. However if we recive a burst of pacakges and the socket is not polled until all 8 packages have recived, I think there will only be 1 actual retransmit. This is because the way the socket retransmits is by setting an immidiate timeout, not directly sending a packet. Or thats what I think.. Perhaps we could find a way to test this theory.

This is such a edge case though. What is the probablility that we lose a packet and then lose the exact same package again in the retransmit, when all the other packages arrived. I think this is why I can't find a document about it online.

@barskern
Copy link
Contributor

barskern commented Jun 2, 2018

After some more research, I found that what should happen after three duplicate ACKs is recived is that the socket should enter "Fast recovery" and every following duplicate ACK only increases the congestion window.

After the fast retransmit algorithm sends what appears to be the
missing segment, the "fast recovery" algorithm governs the
transmission of new data until a non-duplicate ACK arrives. RFC 5681, page 9.

Hence since there hasn't been implemented Congestion Control in this library, I think we just have to "ignore" further duplicate ACKs and wait for timeout to retransmit again.

@pothos
Copy link
Contributor Author

pothos commented Jun 4, 2018

Hello,
sorry for the late reply, the overflow was the first thing I could find as bug in my problem. Now I shortened the logs to a sane size so I can upload them. Basically, the code overreacts and somehow both sides are stuck retransmitting and never leave the poll call again (might be related to #226). I think there is more than one case where the retransmission should not happen.

In these logs, the client sends data.
logclient.txt
logserver.txt
tcpdumpclient.txt
tcpdumpserver.txt

@pothos
Copy link
Contributor Author

pothos commented Jun 4, 2018

Ah, I applied the proposed patch with a saturating add, btw. So the numbers jumping back again should not be caused by overflows.

@barskern
Copy link
Contributor

barskern commented Jun 4, 2018

@pothos I found a bug in the fast retransmit code based on your logs!

When there is content in a package, one should not count the ACK as a duplicate ACK! New PR incoming 👍

@pothos
Copy link
Contributor Author

pothos commented Jun 5, 2018

@barskern Yes, that's one point, but also the retransmit should only be scheduled if there is data to be sent out. In the tcpdump in the end the packets of both sides just have length 0.

@barskern
Copy link
Contributor

barskern commented Jun 5, 2018

@pothos yes that is correct. I will look into that. I think I have to move the detection of duplicate ACKs to a different branch of the bigger match within the process function. Currently it will only look at the ACK field and respond based on that.

@whitequark
Copy link
Contributor

Thanks for debugging this!

@pothos
Copy link
Contributor Author

pothos commented Jun 7, 2018

@barskern Maybe a look on this code here can help? https://github.com/google/netstack/blob/master/tcpip/transport/tcp/snd.go e.g. see checkDuplicateAck and the variables saved for fast recovery (except sending rate)

@barskern
Copy link
Contributor

barskern commented Jun 9, 2018

@pothos Could you run the same test that generated this issue in the first place? I'm excited to see if it works now ⚡

m-labs-homu pushed a commit that referenced this issue Jun 9, 2018
Improve detection of duplicate ACKs by checking that the segment does
not contain data.

Move implementation from the 'reject unacceptable acknowledgements'
to later in the process function, because a duplicate ACK is not an
'unacceptable' acknowledgement but rather a condition to update state.

Implement more tests to verify the operation of the fast retransmit
implementation.

Related issue: #224

Closes: #225
Approved by: whitequark
@pothos pothos changed the title Debug build panic local_rx_dup_acks overflow Stuck at retransmitting Jun 12, 2018
@pothos
Copy link
Contributor Author

pothos commented Jun 12, 2018

Hi,
sorry, but no, two endpoints can still be stuck sending the same packets.

@barskern
Copy link
Contributor

Could you upload debug logs again?

@pothos
Copy link
Contributor Author

pothos commented Jun 12, 2018

I've found the issues and fixed it but did not write a test yet: #233

pothos added a commit to pothos/smoltcp that referenced this issue Jun 12, 2018
This fixes entering a loop when both endpoints
are stuck in retransmission.

smoltcp-rs#224
@pothos
Copy link
Contributor Author

pothos commented Jun 18, 2018

Hi,
can somebody review the PR to get master working again?
Thanks

m-labs-homu pushed a commit that referenced this issue Jun 18, 2018
This fixes entering a loop when both endpoints
are stuck in retransmission.

#224

Closes: #233
Approved by: whitequark
m-labs-homu pushed a commit that referenced this issue Jun 18, 2018
This fixes entering a loop when both endpoints
are stuck in retransmission.

#224

Closes: #233
Approved by: whitequark
@pothos pothos closed this as completed Jun 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants