Pion <-> usrsctp under 10% packet loss gets an abort #104
This seems to be triggered by the size of the message we are sending; if I drop it to a smaller size, the crash goes away. The crash seems to be around DATA chunk fragmenting. When we send messages at ~3500 bytes I don't see the abort anymore.
This is the full logic of the assert we are hitting
It seems to be saying
I will work on getting a capture of what we are sending and see what I can find.
Here is a
@enobufs this looks promising! This happens right after a storm of
I think the new forward TSN flushes any caching. Then when the next data comes in it is confused, because it goes to do a lookup and we have a chunk that the
I am figuring out the best way to make sure when
@Sean-Der I think you are on the right track! Indeed, forwarding the TSN to a fragment with (B,E)=(0,0) doesn't make sense; it should forward to the last fragment, the TSN whose E bit is 1.
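As a rough illustration of what "forward to the last fragment" means in code, here is a Go sketch. The names (chunkPayloadData, endingFragment, forwardTSNFor) are hypothetical and not pion/sctp's actual API, and TSN wrap-around (serial number arithmetic) is ignored for brevity.

```go
package sketch

// chunkPayloadData is a stand-in for an outbound DATA chunk.
type chunkPayloadData struct {
	tsn            uint32
	endingFragment bool // the E bit of the DATA chunk
}

// forwardTSNFor returns the TSN a FORWARD-TSN chunk should advance to when the
// given fragment is abandoned: walk forward through the in-flight chunks
// (assumed ordered by TSN) until the fragment whose E bit is set.
func forwardTSNFor(abandoned *chunkPayloadData, inflight []*chunkPayloadData) uint32 {
	newCumulativeTSN := abandoned.tsn
	if abandoned.endingFragment {
		return newCumulativeTSN // already the last fragment (E=1)
	}
	for _, c := range inflight {
		if c.tsn <= abandoned.tsn {
			continue
		}
		newCumulativeTSN = c.tsn
		if c.endingFragment {
			break
		}
	}
	return newCumulativeTSN
}
```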
@enobufs That's good to hear :) Any suggestions on best way to refactor/change the API? This is what I have so far
Ah... one thing I realize may make things complicated: the rest of the fragments that also need to be abandoned may still be in the pendingQueue. When sendPayloadData() is called, the chunks passed in are first written into the pendingQueue. The chunks in the pending queue do not have a TSN assigned yet (so that urgent messages can be reordered - that is why the pendingQueue has ordered and unordered queues).

Pulling a chunk from the pendingQueue is done by peek(), then pop(). When peek() returns a chunk that is a fragment of a message (E bit being 0), the following peek() calls stick with the ordered or unordered queue, whichever the earlier chunk came from, until pop() returns the last fragment (using the pendingQueue.selected flag in the code). So this concern is taken care of. This is done because all TSNs of the same group of fragments must be consecutive.

When there is an available congestion window, chunks are moved from the pendingQueue to the inflightQueue, and the chunks have a TSN assigned at that point (see movePendingDataChunkToInflightQueue()).

So the challenging part is that we currently do not have an easy way to set this. What I would do (not entirely sure if it works) is: when moving chunks from the pendingQueue to the inflightQueue, check a.cumulativeTSNAckPoint against the newly assigned TSN. If the TSN is equal to or less than a.cumulativeTSNAckPoint, I believe we should mark them as abandoned.
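A minimal sketch of the proposal in the last paragraph, assuming hypothetical names (myNextTSN, cumulativeTSNAckPoint, abandoned) that only loosely mirror pion/sctp: the TSN is assigned when a chunk leaves the pending queue, and at that moment it can be compared (in serial number arithmetic) against the cumulative TSN ack point so that already-forwarded fragments are marked abandoned instead of being transmitted.

```go
package sketch

type chunkPayloadData struct {
	tsn       uint32
	abandoned bool
}

type association struct {
	myNextTSN             uint32
	cumulativeTSNAckPoint uint32
	inflightQueue         []*chunkPayloadData
}

// sna32LTE is "less than or equal" in 32-bit serial number arithmetic
// (RFC 1982), which is how TSNs must be compared.
func sna32LTE(a, b uint32) bool {
	if a == b {
		return true
	}
	return (a < b && b-a < 1<<31) || (a > b && a-b > 1<<31)
}

func (a *association) movePendingDataChunkToInflightQueue(c *chunkPayloadData) {
	// The TSN is assigned only when the chunk leaves the pending queue, so
	// that urgent messages can still be reordered while pending.
	c.tsn = a.myNextTSN
	a.myNextTSN++

	// If a FORWARD-TSN has already advanced past this TSN, the receiver will
	// never deliver it; mark it abandoned so it is not (re)transmitted.
	if sna32LTE(c.tsn, a.cumulativeTSNAckPoint) {
		c.abandoned = true
	}

	a.inflightQueue = append(a.inflightQueue, c)
}
```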
There is one thing on my mind that concerns me a bit about the protocol violation error. Say we sent chunks with TSN 1 (B=1), 2, 3, 4, 5 (E=1). The sender forwards the TSN to 3 and stops retransmitting 4 and 5 (which is what we are trying to do), but the chunks with TSN 4 and 5 might still be in flight (reordering delay) and later be received by the remote. So I am not 100% sure the workaround would work perfectly. usrsctp should maybe have just ignored the DATA chunk silently instead of sending the error, or there may be other causes of the violation error.
Oh, that is a really good point. I think that is worth asking @tuexen. I can file an issue on the usrsctp bug tracker and see if he has any ideas!
I am just walking out the door, but I can file a ticket right when I get back! Or feel free to do it if you are bored / want to get into it :)
I won't be available for this issue (as the main assignee) due to other commitments right now, but I will try my best to support your work like this. It is great to have another pair of eyes on sctp too!
Is it possible to get
Then I can try to reproduce the issue.
Fantastic! Thank you @tuexen
The sender is here
The abort comes from usrsctp here. You can reproduce by doing:
I get an ABORT after ~5 seconds.
I am happy to answer any questions about our SCTP implementation! You can easily run it by doing this. I also documented everything as I went in this ticket; this wasn't filed by a user, so everything should be relevant. Thank you so much for the help :)
Could you provide a tracefile from the usrsctp side which contains the packets received and sent? Right now it only contains one direction. One note about your configuration: a single user message needs 42 fragments, each of which makes up one packet. Using a packet loss rate of 10%, this gives a chance of (9/10)^42 that a user message is received completely, which is approximately 1%. Not sure if this is what you intended to test; just making sure we know what we should observe. So of every 100 messages you send, about 1 should be received successfully.
@tuexen can you elaborate a bit on that derivation for a noob? A 50000-byte message = 42 fragments in SCTP, each of which is one packet?
@tuexen here is another one: what Pion got, fromChrome.tar.gz

The 10% packet loss is just for an easy repro; @mturnshek is still hitting this in production though. My goal is just to have everything be as resilient as possible and not make users debug things. I don't have any expectations about how much data is actually transferred or at what speed.

@mturnshek yep, exactly! Each SCTP packet contains a data fragment of at most 1200 bytes, so the message needs to be split across all those packets. Each of those fragments inside the 42 separate packets then has a TSN (SCTP's unique sequence identifier). If any of those individual packets is lost, the whole larger message has to be discarded as well.
@mturnshek @Sean-Der gives the correct explanation. If you are sending a user message of 50000 bytes and only 1200 bytes fit in a packet, then, since 50000/1200 = 41 + 2/3, you need 42 packets. Since the number of retransmissions is 0, a single dropped fragment requires the receiver to drop all fragments of the user message.
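A quick back-of-the-envelope check of those numbers (a sketch, not part of either stack):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const messageSize, fragmentSize = 50000.0, 1200.0
	fragments := math.Ceil(messageSize / fragmentSize) // 42 packets per message
	pLoss := 0.10
	pSuccess := math.Pow(1-pLoss, fragments) // probability all 42 arrive, no retransmissions

	fmt.Printf("fragments per message: %.0f\n", fragments)
	fmt.Printf("P(all fragments arrive): %.3f\n", pSuccess)
}
```

With 10% loss and zero retransmissions this comes out to roughly 1.2%, i.e. about 1 of every 100 messages is delivered intact, matching the estimate above.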
@Sean-Der Can't you provide a single tracefile showing the packets processed by the usrsctp stack within Chrome? I need the correct timing between received packets and sent packets, and I need to understand which packet sequence sent by the peer triggers the ABORT. Either the packet sequence sent by the peer is violating the specification (which means the ABORT is fine and the peer stack needs to be fixed), or the code detecting the violation is wrong (which means the ABORT is wrong and we need to fix usrsctp). Therefore I need the packet sequence sent/received by usrsctp with the correct timing. If you can't get the logging from Chrome, maybe you can try with Firefox. It also uses the usrsctp stack...
Unfortunately enabling the logging in Chromium/Firefox stops me from getting the abort sent back :( Let me keep trying different combinations!
OK, that is strange since it only changes the timing and I don't expect a race condition to be a problem here. How are you taking the traces if not via logging from usrsctp?
I have been capturing them going in/out of the Go SCTP implementation, just adding it to the Read/Write functions. If I put both directions in the same pcap, would that be helpful? I will keep trying with the browsers though, or maybe just write a little C program and get browsers out of the equation. Thanks again for all the help.
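For reference, here is one way such a capture hook could look in Go, using github.com/google/gopacket's pcapgo writer; this is a sketch, not pion's actual code. Link-layer type 248 (LINKTYPE_SCTP) lets Wireshark dissect the records as bare SCTP packets without IP/UDP framing.

```go
package capture

import (
	"os"
	"sync"
	"time"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcapgo"
)

type sctpCapture struct {
	mu sync.Mutex
	w  *pcapgo.Writer
}

func newSCTPCapture(path string) (*sctpCapture, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, err
	}
	w := pcapgo.NewWriter(f)
	// 248 is LINKTYPE_SCTP: each record's payload is a raw SCTP packet.
	if err := w.WriteFileHeader(65536, layers.LinkType(248)); err != nil {
		f.Close()
		return nil, err
	}
	return &sctpCapture{w: w}, nil
}

// record can be called from both the inbound (Read) and outbound (Write)
// hooks, so both directions end up interleaved in one file with real timestamps.
func (c *sctpCapture) record(packet []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	_ = c.w.WritePacket(gopacket.CaptureInfo{
		Timestamp:     time.Now(),
		CaptureLength: len(packet),
		Length:        len(packet),
	}, packet)
}
```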
I see. It would be best to see the perspective of the usrsctp stack, since Go's perspective is different: packets that are sent to the Go implementation get dropped on the way. I can try to look at these files; at least I can see if the Go implementation behaves as expected.
@tuexen Do you think it is possible that the cleaning up of the reassembly queues from the forward TSN can cause this? When getting a FORWARD-TSN the code cleans up some of the control stuff here. Is it possible that when we get another fragment we enter the block for a new control here, and then we abort because we have a

Also ping @lgrahl if you have any bandwidth/debug ideas, I am out of them :( Maybe I can spin up a
Even though it's usrsctp-based, RAWRTC often behaves very differently from browsers, so I can't give a definitive answer, but it's probably worth a try. Keep in mind that Chrome uses an outdated usrsctp version and IIRC they don't even handle abandoned messages, so you'll likely end up with garbled messages sometimes. Anyway, I can give you pointers if needed, but at the moment I don't know what you need. 🙂
Oops, yes, I meant E=1. I want to check with packets on the wire too. I will try to get a pcap and share that with you. Thanks @tuexen for your support, really appreciated.
So I tested out this branch, which disables forward TSN (supposedly the root cause of the issue), and noticed that we still receive an ABORT when I introduce packet loss. The channel errors with the following message (in Pion):

This makes me believe there's another issue which doesn't have to do with forward TSN and may be causing more of the ABORTs. Notably, the ABORT doesn't say "Protocol Violation Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000" or anything like it. Any ideas?
@mturnshek I have read @tuexen's mind and he is about to ask you for a PCAP file in order to investigate. 🙂
@lgrahl did it correctly...
I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets. In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure.

This does not occur when I am logging with Chrome, or when I have chrome://webrtc-internals open. I can reproduce it consistently by introducing 5-10% packet loss (and 50 ms latency) with the initial example Sean provided. This was run on 6b0c8bb, which has forward TSN disabled.

Both of these traces begin in a healthy state, and I end them after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion. There are some interesting lines
You could add to PION a routine like https://github.com/sctplab/usrsctp/blob/master/usrsctplib/user_socket.c#L3385
Will look at them.
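In the same spirit, a hypothetical Go equivalent of that usrsctp helper could dump each packet as text that text2pcap understands (for example `text2pcap -l 248 dump.txt dump.pcap`); the names and format details below are illustrative, not taken from either project.

```go
package main

import (
	"fmt"
	"strings"
)

// dumpPacket renders one raw SCTP packet as offset/hex lines in the od-style
// format text2pcap parses. The direction is emitted only as a comment for
// human readers; text2pcap ignores lines that do not start with a hex offset.
func dumpPacket(packet []byte, inbound bool) string {
	var b strings.Builder
	if inbound {
		b.WriteString("# direction: in\n")
	} else {
		b.WriteString("# direction: out\n")
	}
	for offset := 0; offset < len(packet); offset += 16 {
		end := offset + 16
		if end > len(packet) {
			end = len(packet)
		}
		b.WriteString(fmt.Sprintf("%04x ", offset))
		for _, octet := range packet[offset:end] {
			b.WriteString(fmt.Sprintf(" %02x", octet))
		}
		b.WriteString("\n")
	}
	// A new packet starts when the offset resets to 0000 in a later dump.
	b.WriteString("\n")
	return b.String()
}

func main() {
	fmt.Print(dumpPacket([]byte{0x13, 0x88, 0x13, 0x88, 0xde, 0xad, 0xbe, 0xef}, true))
}
```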
@tuexen I have captured SCTP packets in two pcapng files. What I am doing:
unordered-with-sidssn-pair-bad.pcapng
unordered-without-sidssn-pair.pcapng
Because I send the Forward TSN with the new cumulative TSN pointing to the last fragment, I don't see the Protocol Violation, but I have been seeing the unexplainable stall. I could not find anything wrong in the pcap files. With or without the SID/SSN pair does not make any difference, but without SID/SSN pairs in the Forward TSN we seem to experience a_rwnd=0 more often and more quickly.

In both cases, when the problem occurs, Pion is retransmitting a DATA chunk. Chrome returns a SACK each time it receives the DATA chunk, but the SACK has the previous TSN. (This is, to me, a typical case of the application not reading data from the SCTP reassembly queue...) While this is happening, the JavaScript side seems to be healthy because other events are still coming in. How this happens varies: sometimes it happens right away (~5 sec), sometimes the traffic flows for a long time.

unordered-with-sidssn-pair-bad.zip
Tests need to be fixed. Relates to #104
I'm confused. My understanding was that we are looking at a case where Pion is sending data towards usrsctp. So the SACKs would be received by Pion. You are referring to lines stating that SACKs are being sent. Is that a different issue?
I looked at the tracefile without the sidssn pair, since it is the correct way for unordered data. Observations:
The following packetdrill script should reproduce the issue with the FreeBSD kernel stack:
But it runs fine. Options:
I guess it is 3. Are you able to test with the more up-to-date usrsctp stack? If not, I can try to build a script for a different test tool and test with the userland stack.
@tuexen Thanks for your time and comments
Yeah, good point. When a message is large and cwnd is small, later fragments of the message are still in the pending queue (chunks that have not been sent yet and do not have a TSN assigned yet). When the time comes to abandon an earlier fragment, there are two options in that situation: (1) discard all fragments in the inflight queue as well as the pending queue, or (2) discard only the fragments already in the inflight queue. The current approach (in my branch) is (2) because it was a lot easier, but I think (1) is the way to go (it aligns with what the RFC says) and I'd need more time. I wanted to make sure (2) would solve the protocol violation error first, as a quick fix.
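A sketch of option (1), with hypothetical names (pion/sctp's real queues and fields may differ): when one fragment of a message is abandoned, every other fragment of the same message is abandoned too, whether it already sits in the inflight queue (TSN assigned) or is still waiting in the pending queue (no TSN yet). Fragments are matched here by an assumed per-message identifier.

```go
package sketch

type fragment struct {
	messageID uint64 // hypothetical identifier shared by all fragments of one message
	abandoned bool
}

// abandonWholeMessage marks every fragment of the given message as abandoned,
// across both the inflight and pending queues.
func abandonWholeMessage(msgID uint64, inflight, pending []*fragment) {
	for _, queue := range [][]*fragment{inflight, pending} {
		for _, f := range queue {
			if f.messageID == msgID {
				f.abandoned = true
			}
		}
	}
}
```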
I tested pion against Firefox (72.0.2), and it never happened, so my guess is also 3. I am going to leave this stalling issue at that for now and fix the tests that are broken by the necessary changes, so we can move on. One thing to mention: I feel the use of large messages over a channel with partial reliability may not make much sense. Aside from this particular issue, I think I should come up with an interop test tool between pion/sctp and usrsctp (with various versions).
I will look into the report made by @mturnshek later. Reopening this.
Thanks for the information.
I noticed that, using pion/sctp v1.7.5, unordered, sending 32 KB messages to Chrome over a lossy connection at 4% loss with a 200 ms RTT delay (using comcast... I mean, this), Chrome sends an Abort chunk after a while, but this time (with v1.7.5) the error cause was "No User Data" (code 9). I found a bug in association.go that causes a DATA chunk with 0-length user data to be spilled onto the wire when a SACK and a T3-rtx timeout occur on the same chunk at the same time: a race condition on the chunk.retransmit flag. This bug was not a regression from the previous fix; it was there before. After I fixed it locally, I have been running for a while and I no longer see the Abort at all. I will spend more time with different loss rates, delays, and data sizes before I submit a PR.
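To illustrate the kind of race described above, here is a sketch of a guard that prevents the zero-length DATA chunk; the names (retransmit, acked, userData, inflightQueue) are illustrative and the actual fix in pion/sctp may look different. The idea is that once a SACK acknowledges a chunk and releases its payload, the retransmission pass must skip that chunk even if the T3-rtx path had already flagged it.

```go
package sketch

import "sync"

type chunkPayloadData struct {
	retransmit bool
	acked      bool
	userData   []byte
}

type association struct {
	mu            sync.Mutex
	inflightQueue []*chunkPayloadData
}

// onSACK and gatherRetransmissions both run under a.mu, so the two paths can
// no longer interleave on the same chunk.
func (a *association) onSACK(acked []*chunkPayloadData) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, c := range acked {
		c.acked = true
		c.retransmit = false
		c.userData = nil // payload is no longer needed once acknowledged
	}
}

func (a *association) gatherRetransmissions() []*chunkPayloadData {
	a.mu.Lock()
	defer a.mu.Unlock()
	var out []*chunkPayloadData
	for _, c := range a.inflightQueue {
		// Skip anything already acknowledged or with no payload left: sending
		// it would put a DATA chunk with zero-length user data on the wire,
		// which the peer rejects with "No User Data" (cause code 9).
		if c.retransmit && !c.acked && len(c.userData) > 0 {
			out = append(out, c)
		}
	}
	return out
}
```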
@mturnshek I have merged #116 and tagged v1.7.6. Are you able to try to reproduce the Abort with pion/sctp@v1.7.6? I saw an Abort from Chrome (as mentioned above, with v1.7.5), but the error cause was "No User Data", not "Protocol Violation". It may just be two different manifestations of the same cause (hopefully), but there could be another issue that doesn't seem to happen in my environment.
I will test 1.7.6 today and report here.
@enobufs @yutakasg I'm very late here, but I have finally tested this and experienced no connection breaks, even with 75% packet loss. There are some interesting results I have where

But even when that happened, reliable messages eventually found their way! Truly awesome stuff here. Thanks so much for the attention to detail and quick action in handling this.
@mturnshek That's great news! Thanks for running the test and letting us know!
comcast --device=lo --latency=150 --packet-loss=10% --target-proto=udp
I get an ABORT after ~5 seconds.
This abort is generated here in usrsctp. I am not sure why yet, though; I need to read more about SCTP to fully understand it.