Pion <-> usrsctp under 10% packet loss gets an abort #104
This seems to be triggered by the size of the message we are sending; if I drop it to a smaller size, the crash goes away. The crash seems to be around DATA chunk fragmenting. When we send messages at ~3500 bytes I don't see the abort anymore.
This is the full logic of the assert we are hitting
It seems to be saying
I will work on getting a capture of what we are sending and see what I can find.
Here is a
@enobufs this looks promising! This happens right after a storm of
I think the new forward TSN flushes any caching. Then when the next data comes in it is confused, because it goes to do a lookup and we have a chunk that the
I am figuring out the best way to make sure when
@Sean-Der I think you are on the right track! Indeed, forwarding the TSN to a fragment with (B,E)=(0,0) doesn't make sense; it should forward to the last fragment, the TSN whose E bit is 1.
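As a rough illustration of what "forward to the last fragment" means in code, here is a Go sketch. The names (chunkPayloadData, endingFragment, forwardTSNFor) are hypothetical and not pion/sctp's actual API, and TSN wrap-around (serial number arithmetic) is ignored for brevity.

```go
package sketch

// chunkPayloadData is a stand-in for an outbound DATA chunk.
type chunkPayloadData struct {
	tsn            uint32
	endingFragment bool // the E bit of the DATA chunk
}

// forwardTSNFor returns the TSN a FORWARD-TSN chunk should advance to when the
// given fragment is abandoned: walk forward through the in-flight chunks
// (assumed ordered by TSN) until the fragment whose E bit is set.
func forwardTSNFor(abandoned *chunkPayloadData, inflight []*chunkPayloadData) uint32 {
	newCumulativeTSN := abandoned.tsn
	if abandoned.endingFragment {
		return newCumulativeTSN // already the last fragment (E=1)
	}
	for _, c := range inflight {
		if c.tsn <= abandoned.tsn {
			continue
		}
		newCumulativeTSN = c.tsn
		if c.endingFragment {
			break
		}
	}
	return newCumulativeTSN
}
```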
@enobufs That's good to hear :) Any suggestions on best way to refactor/change the API? This is what I have so far
Ah... one thing I realize may make things complicated: the rest of the fragments that also need to be abandoned may still be in the pendingQueue. When sendPayloadData() is called, the chunks passed in are first written into the pendingQueue. The chunks in the pending queue do not have a TSN assigned yet (so that urgent messages can be reordered - that is why the pendingQueue has ordered and unordered queues).

Pulling a chunk from the pendingQueue is done by peek(), then pop(). When peek() returns a chunk that is a fragment of a message (E bit being 0), the following peek() calls stick with the ordered or unordered queue, whichever the earlier chunk came from, until pop() returns the last fragment (using the pendingQueue.selected flag in the code). So this concern is taken care of. This is done because all TSNs of the same group of fragments must be consecutive.

When there is an available congestion window, chunks are moved from the pendingQueue to the inflightQueue, and the chunks have a TSN assigned at that point (see movePendingDataChunkToInflightQueue()).

So the challenging part is that we currently do not have an easy way to set this. What I would do (not entirely sure if it works) is: when moving chunks from the pendingQueue to the inflightQueue, check a.cumulativeTSNAckPoint against the newly assigned TSN. If the TSN is equal to or less than a.cumulativeTSNAckPoint, I believe we should mark them as abandoned.
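A minimal sketch of the proposal in the last paragraph, assuming hypothetical names (myNextTSN, cumulativeTSNAckPoint, abandoned) that only loosely mirror pion/sctp: the TSN is assigned when a chunk leaves the pending queue, and at that moment it can be compared (in serial number arithmetic) against the cumulative TSN ack point so that already-forwarded fragments are marked abandoned instead of being transmitted.

```go
package sketch

type chunkPayloadData struct {
	tsn       uint32
	abandoned bool
}

type association struct {
	myNextTSN             uint32
	cumulativeTSNAckPoint uint32
	inflightQueue         []*chunkPayloadData
}

// sna32LTE is "less than or equal" in 32-bit serial number arithmetic
// (RFC 1982), which is how TSNs must be compared.
func sna32LTE(a, b uint32) bool {
	if a == b {
		return true
	}
	return (a < b && b-a < 1<<31) || (a > b && a-b > 1<<31)
}

func (a *association) movePendingDataChunkToInflightQueue(c *chunkPayloadData) {
	// The TSN is assigned only when the chunk leaves the pending queue, so
	// that urgent messages can still be reordered while pending.
	c.tsn = a.myNextTSN
	a.myNextTSN++

	// If a FORWARD-TSN has already advanced past this TSN, the receiver will
	// never deliver it; mark it abandoned so it is not (re)transmitted.
	if sna32LTE(c.tsn, a.cumulativeTSNAckPoint) {
		c.abandoned = true
	}

	a.inflightQueue = append(a.inflightQueue, c)
}
```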
There is one thing on my mind that concerns me a bit about the protocol violation error. Say we sent chunks with TSN 1 (B=1), 2, 3, 4, 5 (E=1). The sender forwards the TSN to 3 and stops retransmitting 4 and 5 (which is what we are trying to do), but the chunks with TSN 4 and 5 might still be in flight (reordering delay) and later be received by the remote. So I am not 100% sure the workaround would work perfectly. usrsctp should maybe have just ignored the DATA chunk silently instead of sending the error, or there may be other causes of the violation error.
Oh, that is a really good point. I think that is worth asking @tuexen. I can file an issue on the usrsctp bug tracker and see if he has any ideas!
I am just walking out the door, but I can file a ticket right when I get back! Or feel free to do it if you are bored / want to get into it :)
I won't be available for this issue (as the main assignee) due to other commitments right now, but I will try my best to support your work like this. It is great to have another pair of eyes on sctp too!
Is it possible to get
Then I can try to reproduce the issue.
Fantastic! Thank you @tuexen
The sender is here
The abort comes from usrsctp here. You can reproduce by doing:
I get an ABORT after ~5 seconds.
I am happy to answer any questions about our SCTP implementation! You can easily run it by doing this. I also documented everything as I went in this ticket; this wasn't filed by a user, so everything should be relevant. Thank you so much for the help :)
Could you provide a tracefile from the usrsctp side which contains the packets received and sent? Right now it only contains one direction. One note about your configuration: a single user message needs 42 fragments, each of which makes up one packet. Using a packet loss rate of 10%, this gives a chance of (9/10)^42 that a user message is received completely, which is approximately 1%. Not sure if this is what you intended to test; just making sure we know what we should observe. So of every 100 messages you send, about 1 should be received successfully.
@tuexen can you elaborate a bit on that derivation for a noob? A 50000-byte message = 42 fragments in SCTP, each of which is one packet?
@tuexen here is another one: what Pion got, fromChrome.tar.gz

The 10% packet loss is just for an easy repro; @mturnshek is still hitting this in production though. My goal is just to have everything be as resilient as possible and not make users debug things. I don't have any expectations about how much data is actually transferred or at what speed.

@mturnshek yep, exactly! Each SCTP packet contains a data fragment of at most 1200 bytes, so the message needs to be split across all those packets. Each of those fragments inside the 42 separate packets then has a TSN (SCTP's unique sequence identifier). If any of those individual packets is lost, the whole larger message has to be discarded as well.
@mturnshek @Sean-Der gives the correct explanation. If you are sending a user message of 50000 bytes and only 1200 bytes fit in a packet, then, since 50000/1200 = 41 + 2/3, you need 42 packets. Since the number of retransmissions is 0, a single dropped fragment requires the receiver to drop all fragments of the user message.
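A quick back-of-the-envelope check of those numbers (a sketch, not part of either stack):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const messageSize, fragmentSize = 50000.0, 1200.0
	fragments := math.Ceil(messageSize / fragmentSize) // 42 packets per message
	pLoss := 0.10
	pSuccess := math.Pow(1-pLoss, fragments) // probability all 42 arrive, no retransmissions

	fmt.Printf("fragments per message: %.0f\n", fragments)
	fmt.Printf("P(all fragments arrive): %.3f\n", pSuccess)
}
```

With 10% loss and zero retransmissions this comes out to roughly 1.2%, i.e. about 1 of every 100 messages is delivered intact, matching the estimate above.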
@Sean-Der Can't you provide a single tracefile showing the packets processed by the usrsctp stack within Chrome? I need the correct timing between received packets and sent packets, and I need to understand which packet sequence sent by the peer triggers the ABORT. Either the packet sequence sent by the peer is violating the specification (which means the ABORT is fine and the peer stack needs to be fixed), or the code detecting the violation is wrong (which means the ABORT is wrong and we need to fix usrsctp). Therefore I need the packet sequence sent/received by usrsctp with the correct timing. If you can't get the logging from Chrome, maybe you can try with Firefox. It also uses the usrsctp stack...
Unfortunately enabling the logging in Chromium/Firefox stops me from getting the abort sent back :( Let me keep trying different combinations!
OK, that is strange since it only changes the timing and I don't expect a race condition to be a problem here. How are you taking the traces if not via logging from usrsctp?
I have been capturing them going in/out of the Go SCTP implementation, just adding it to the Read/Write functions. If I put both directions in the same pcap, would that be helpful? I will keep trying with the browsers though, or maybe just write a little C program and get browsers out of the equation. Thanks again for all the help.
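For reference, here is one way such a capture hook could look in Go, using github.com/google/gopacket's pcapgo writer; this is a sketch, not pion's actual code. Link-layer type 248 (LINKTYPE_SCTP) lets Wireshark dissect the records as bare SCTP packets without IP/UDP framing.

```go
package capture

import (
	"os"
	"sync"
	"time"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcapgo"
)

type sctpCapture struct {
	mu sync.Mutex
	w  *pcapgo.Writer
}

func newSCTPCapture(path string) (*sctpCapture, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, err
	}
	w := pcapgo.NewWriter(f)
	// 248 is LINKTYPE_SCTP: each record's payload is a raw SCTP packet.
	if err := w.WriteFileHeader(65536, layers.LinkType(248)); err != nil {
		f.Close()
		return nil, err
	}
	return &sctpCapture{w: w}, nil
}

// record can be called from both the inbound (Read) and outbound (Write)
// hooks, so both directions end up interleaved in one file with real timestamps.
func (c *sctpCapture) record(packet []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	_ = c.w.WritePacket(gopacket.CaptureInfo{
		Timestamp:     time.Now(),
		CaptureLength: len(packet),
		Length:        len(packet),
	}, packet)
}
```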
I see. It would be best to see the perspective of the usrsctp stack, since Go's perspective is different: packets that are sent to the Go implementation get dropped on the way. I can try to look at these files; at least I can see if the Go implementation behaves as expected.
@tuexen Do you think it is possible that the cleaning up of the reassembly queues from the forward TSN can cause this? When getting a FORWARD-TSN the code cleans up some of the control stuff here. Is it possible that when we get another fragment we enter the block for a new control here, and then we abort because we have a

Also ping @lgrahl if you have any bandwidth/debug ideas, I am out of them :( Maybe I can spin up a
Even though it's usrsctp-based, RAWRTC often behaves very differently from browsers, so I can't give a definitive answer, but it's probably worth a try. Keep in mind that Chrome uses an outdated usrsctp version and IIRC they don't even handle abandoned messages, so you'll likely end up with garbled messages sometimes. Anyway, I can give you pointers if needed, but at the moment I don't know what you need. 🙂
Oops, yes, I meant E=1. I want to check with packets on the wire too. I will try to get a pcap and share that with you. Thanks @tuexen for your support, really appreciated.
So I tested out this branch, which disables forward TSN (supposedly the root cause of the issue), and noticed that we still receive an ABORT when I introduce packet loss. The channel errors with the following message (in Pion):

This makes me believe there's another issue which doesn't have to do with forward TSN and may be causing more of the ABORTs. Notably, the ABORT doesn't say "Protocol Violation Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000" or anything like it. Any ideas?
@mturnshek I have read @tuexen's mind and he is about to ask you for a PCAP file in order to investigate. 🙂
@lgrahl did it correctly...
I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets. In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure.

This does not occur when I am logging with Chrome, or when I have chrome://webrtc-internals open. I can reproduce it consistently by introducing 5-10% packet loss (and 50 ms latency) with the initial example Sean provided. This was run on 6b0c8bb, which has forward TSN disabled.

Both of these traces begin in a healthy state, and I end them after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion. There are some interesting lines
You could add to PION a routine like https://github.com/sctplab/usrsctp/blob/master/usrsctplib/user_socket.c#L3385
Will look at them.
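In the same spirit, a hypothetical Go equivalent of that usrsctp helper could dump each packet as text that text2pcap understands (for example `text2pcap -l 248 dump.txt dump.pcap`); the names and format details below are illustrative, not taken from either project.

```go
package main

import (
	"fmt"
	"strings"
)

// dumpPacket renders one raw SCTP packet as offset/hex lines in the od-style
// format text2pcap parses. The direction is emitted only as a comment for
// human readers; text2pcap ignores lines that do not start with a hex offset.
func dumpPacket(packet []byte, inbound bool) string {
	var b strings.Builder
	if inbound {
		b.WriteString("# direction: in\n")
	} else {
		b.WriteString("# direction: out\n")
	}
	for offset := 0; offset < len(packet); offset += 16 {
		end := offset + 16
		if end > len(packet) {
			end = len(packet)
		}
		b.WriteString(fmt.Sprintf("%04x ", offset))
		for _, octet := range packet[offset:end] {
			b.WriteString(fmt.Sprintf(" %02x", octet))
		}
		b.WriteString("\n")
	}
	// A new packet starts when the offset resets to 0000 in a later dump.
	b.WriteString("\n")
	return b.String()
}

func main() {
	fmt.Print(dumpPacket([]byte{0x13, 0x88, 0x13, 0x88, 0xde, 0xad, 0xbe, 0xef}, true))
}
```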
@tuexen I have captured SCTP packets in two pcapng files. What I am doing:
unordered-with-sidssn-pair-bad.pcapng
unordered-without-sidssn-pair.pcapng
Because I send the Forward TSN with the new cumulative TSN pointing to the last fragment, I don't see the Protocol Violation, but I have been seeing the unexplainable stall. I could not find anything wrong in the pcap files. With or without the SID/SSN pair does not make any difference, but without SID/SSN pairs in the Forward TSN we seem to experience a_rwnd=0 more often and more quickly.

In both cases, when the problem occurs, Pion is retransmitting a DATA chunk. Chrome returns a SACK each time it receives the DATA chunk, but the SACK has the previous TSN. (This is, to me, a typical case of the application not reading data from the SCTP reassembly queue...) While this is happening, the JavaScript side seems to be healthy because other events are still coming in. How this happens varies: sometimes it happens right away (~5 sec), sometimes the traffic flows for a long time.

unordered-with-sidssn-pair-bad.zip
Tests need to be fixed. Relates to #104
I'm confused. My understanding was that we are looking at a case where Pion is sending data towards usrsctp. So the SACKs would be received by Pion. You are referring to lines stating that SACKs are being sent. Is that a different issue?
I looked at the tracefile without the sidssn pair, since it is the correct way for unordered data. Observations:
The following packetdrill script should reproduce the issue with the FreeBSD kernel stack:
But it runs fine. Options:
I guess it is 3. Are you able to test with the more up-to-date usrsctp stack? If not, I can try to build a script for a different test tool and test with the userland stack.
@tuexen Thanks for your time and comments
Yeah, good point. When a message is large and cwnd is small, later fragments of the message are still in the pending queue (chunks that have not been sent yet and do not have a TSN assigned yet). When the time comes to abandon an earlier fragment, there are two options in that situation: (1) discard all fragments in the inflight queue as well as the pending queue, or (2) discard only the fragments already in the inflight queue. The current approach (in my branch) is (2) because it was a lot easier, but I think (1) is the way to go (it aligns with what the RFC says) and I'd need more time. I wanted to make sure (2) would solve the protocol violation error first, as a quick fix.
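A sketch of option (1), with hypothetical names (pion/sctp's real queues and fields may differ): when one fragment of a message is abandoned, every other fragment of the same message is abandoned too, whether it already sits in the inflight queue (TSN assigned) or is still waiting in the pending queue (no TSN yet). Fragments are matched here by an assumed per-message identifier.

```go
package sketch

type fragment struct {
	messageID uint64 // hypothetical identifier shared by all fragments of one message
	abandoned bool
}

// abandonWholeMessage marks every fragment of the given message as abandoned,
// across both the inflight and pending queues.
func abandonWholeMessage(msgID uint64, inflight, pending []*fragment) {
	for _, queue := range [][]*fragment{inflight, pending} {
		for _, f := range queue {
			if f.messageID == msgID {
				f.abandoned = true
			}
		}
	}
}
```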
I tested pion against Firefox (72.0.2), and it never happened, so my guess is also 3. I am going to leave this stalling issue at that for now and fix the tests that are broken by the necessary changes, so we can move on. One thing to mention: I feel the use of large messages over a channel with partial reliability may not make much sense. Aside from this particular issue, I think I should come up with an interop test tool between pion/sctp and usrsctp (with various versions).
I will look into the report made by @mturnshek later. Reopening this.
Thanks for the information.
I noticed that, using pion/sctp v1.7.5, unordered, sending 32 KB messages to Chrome over a lossy connection at 4% loss with a 200 ms RTT delay (using comcast... I mean, this), Chrome sends an Abort chunk after a while, but this time (with v1.7.5) the error cause was "No User Data" (code 9). I found a bug in association.go that causes a DATA chunk with 0-length user data to be spilled onto the wire when a SACK and a T3-rtx timeout occur on the same chunk at the same time: a race condition on the chunk.retransmit flag. This bug was not a regression from the previous fix; it was there before. After I fixed it locally, I have been running for a while and I no longer see the Abort at all. I will spend more time with different loss rates, delays, and data sizes before I submit a PR.
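To illustrate the kind of race described above, here is a sketch of a guard that prevents the zero-length DATA chunk; the names (retransmit, acked, userData, inflightQueue) are illustrative and the actual fix in pion/sctp may look different. The idea is that once a SACK acknowledges a chunk and releases its payload, the retransmission pass must skip that chunk even if the T3-rtx path had already flagged it.

```go
package sketch

import "sync"

type chunkPayloadData struct {
	retransmit bool
	acked      bool
	userData   []byte
}

type association struct {
	mu            sync.Mutex
	inflightQueue []*chunkPayloadData
}

// onSACK and gatherRetransmissions both run under a.mu, so the two paths can
// no longer interleave on the same chunk.
func (a *association) onSACK(acked []*chunkPayloadData) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, c := range acked {
		c.acked = true
		c.retransmit = false
		c.userData = nil // payload is no longer needed once acknowledged
	}
}

func (a *association) gatherRetransmissions() []*chunkPayloadData {
	a.mu.Lock()
	defer a.mu.Unlock()
	var out []*chunkPayloadData
	for _, c := range a.inflightQueue {
		// Skip anything already acknowledged or with no payload left: sending
		// it would put a DATA chunk with zero-length user data on the wire,
		// which the peer rejects with "No User Data" (cause code 9).
		if c.retransmit && !c.acked && len(c.userData) > 0 {
			out = append(out, c)
		}
	}
	return out
}
```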
@mturnshek I have merged #116 and tagged v1.7.6. Are you able to try to reproduce the Abort with pion/sctp@v1.7.6? I saw an Abort from Chrome (as mentioned above, with v1.7.5), but the error cause was "No User Data", not "Protocol Violation". It may just be two different manifestations of the same cause (hopefully), but there could be another issue that doesn't seem to happen in my environment.
I will test 1.7.6 today and report here.
@enobufs @yutakasg I'm very late here, but I have finally tested this and experienced no connection breaks, even with 75% packet loss. There are some interesting results I have where

But even when that happened, reliable messages eventually found their way! Truly awesome stuff here. Thanks so much for the attention to detail and quick action in handling this.
@mturnshek That's great news! Thanks for running the test and letting us know!
comcast --device=lo --latency=150 --packet-loss=10% --target-proto=udp
I get an ABORT after ~5 seconds.
This abort is generated here in usrsctp. I am not sure why yet, though; I need to read more about SCTP to fully understand it.