Pion <-> usrsctp under 10% packet loss gets an abort #104

Closed

Sean-Der opened this issue Feb 1, 2020 · 67 comments · Fixed by #116
Labels: bug Something isn't working

@Sean-Der (Member) commented Feb 1, 2020

  • Using main.go
  • demo.js
  • simulating packet loss using comcast --device=lo --latency=150 --packet-loss=10% --target-proto=udp

I get an ABORT after ~5 seconds.

Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000

This abort is generated here in usrsctp. I am not sure why yet though, I need to read more about SCTP to fully understand.

Sean-Der added the bug label Feb 1, 2020
@Sean-Der (Member Author) commented Feb 1, 2020

This seems to be triggered by the size of the message we are sending. If I drop it to a smaller size the crash goes away.

The crash seems to be related to DATA chunk fragmentation. When we send messages of ~3500 bytes I don't see the abort anymore.

@Sean-Der (Member Author) commented Feb 1, 2020

This is the full logic of the abort check we are hitting:

                  if ((tsn == (asoc->cumulative_tsn + 1) && (asoc->idata_supported == 0))) {
                          /* Ok we created this control and now
                           * lets validate that its legal i.e. there
                           * is a B bit set, if not and we have
                           * up to the cum-ack then its invalid.
                           */
                          if ((chk->rec.data.rcv_flags & SCTP_DATA_FIRST_FRAG) == 0) {
                                  sctp_abort_in_reasm(stcb, control, chk,
                                                      abort_flag,
                                                      SCTP_FROM_SCTP_INDATA + SCTP_LOC_7);
                                  return;
                          }
                  }

It seems to be saying:

  • The TSN of this new chunk is exactly 1 greater than the cumulative TSN we have received
  • It doesn't have the B bit set (beginning fragment)
  • This is the first chunk we have ever processed for this message; created_control means the control was just created for it

I will work on getting a capture of stuff we are sending and see what I can find

@Sean-Der (Member Author) commented Feb 1, 2020

Here is a pcap with this message: Protocol Violation: Reass 30000007,CI:ffffffff,TSN=0bf4ed02,SID=0002,FSN=bf4ed02,SSN:0000

pion.tar.gz

@Sean-Der (Member Author) commented Feb 1, 2020

@enobufs this looks promising! This happens right after a storm of Forward TSN packets. Then a DATA packet right after causes the issue. Should we maybe not be sending a Forward TSN in the middle of a fragmented message?

@Sean-Der (Member Author) commented Feb 1, 2020

(screenshots attached: 2020-02-01-014412_761x319_scrot, 2020-02-01-014420_844x326_scrot)

@Sean-Der (Member Author) commented Feb 1, 2020

I think the new Forward TSN flushes any caching in usrsctp.

Then when the next DATA comes in, usrsctp gets confused: it does a lookup and finds a chunk whose B bit is unset.

@Sean-Der (Member Author) commented Feb 1, 2020

I am figuring out the best way to make sure that when abandoned is set, it takes the whole fragmented message into account.

checkPartialReliabilityStatus is called on a single chunk. Maybe we can update it to take a payloadQueue + an index. Then it can handle iterating and making sure the whole fragmented message is marked abandoned.

@enobufs (Member) commented Feb 1, 2020

@Sean-Der I think you are on the right track! Indeed, forwarding the TSN to a fragment with (BE)=(00) doesn't make sense; it should forward to the last fragment, i.e. the TSN whose E bit is 1.
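
As a concrete illustration (my own sketch, not pion/sctp's actual code; the type stand-ins and the get() accessor mirror the names used in the diff below), walking forward from a candidate TSN to the fragment whose E bit is set could look like this:

    // Stand-ins for the pion/sctp types referenced in the diff below.
    type chunkPayloadData struct {
        tsn            uint32
        endingFragment bool // the E bit
    }

    type payloadQueue struct{ chunks map[uint32]*chunkPayloadData }

    func (q *payloadQueue) get(tsn uint32) (*chunkPayloadData, bool) {
        c, ok := q.chunks[tsn]
        return c, ok
    }

    // advanceToFragmentEnd walks forward from tsn until it reaches the chunk
    // with the E bit set, so a Forward TSN never points into the middle of a
    // fragmented message.
    func advanceToFragmentEnd(q *payloadQueue, tsn uint32) uint32 {
        for {
            c, ok := q.get(tsn)
            if !ok {
                return tsn - 1 // ran past what we hold; the previous TSN was the last
            }
            if c.endingFragment {
                return tsn // (BE)=(x1): a legal new cumulative TSN ack point
            }
            tsn++
        }
    }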

@Sean-Der (Member Author) commented Feb 1, 2020

@enobufs That's good to hear :) Any suggestions on the best way to refactor/change the API?

This is what I have so far

diff --git a/association.go b/association.go
index e47195d..a870687 100644
--- a/association.go
+++ b/association.go
@@ -623,7 +623,7 @@ func (a *Association) gatherOutboundFrastRetransmissionPackets(rawPackets [][]by
                        fastRetransSize += dataChunkSize
                        a.stats.incFastRetrans()
                        c.nSent++
-                       a.checkPartialReliabilityStatus(c)
+                       a.checkPartialReliabilityStatus(c, a.inflightQueue)
                        toFastRetrans = append(toFastRetrans, c)
                        a.log.Tracef("[%s] fast-retransmit: tsn=%d sent=%d htna=%d",
                                a.name, c.tsn, c.nSent, a.fastRecoverExitPoint)
@@ -1715,7 +1715,7 @@ func (a *Association) movePendingDataChunkToInflightQueue(c *chunkPayloadData) {
        c.since = time.Now() // use to calculate RTT and also for maxPacketLifeTime
        c.nSent = 1          // being sent for the first time

-       a.checkPartialReliabilityStatus(c)
+       a.checkPartialReliabilityStatus(c, a.inflightQueue)

        a.log.Tracef("[%s] sending tsn=%d ssn=%d sent=%d len=%d",
                a.name, c.tsn, c.streamSequenceNumber, c.nSent, len(c.userData))
@@ -1837,7 +1837,7 @@ func (a *Association) sendPayloadData(chunks []*chunkPayloadData) error {
 }

 // The caller should hold the lock.
-func (a *Association) checkPartialReliabilityStatus(c *chunkPayloadData) {
+func (a *Association) checkPartialReliabilityStatus(c *chunkPayloadData, q chunkQueue) {
        if !a.useForwardTSN {
                return
        }
@@ -1858,6 +1858,28 @@ func (a *Association) checkPartialReliabilityStatus(c *chunkPayloadData) {
                        }
                }
                s.lock.RUnlock()
+
+               // From RFC 3758 Sec 3.6 A3:
+               // When a TSN is "abandoned", if it is part of a fragmented message,
+               // all other TSN's within that fragmented message MUST be abandoned
+               // at the same time.
+               if !c.abandoned || c.beginningFragment {
+                       return
+               }
+
+               currentTSN := c.tsn + 1
+               for {
+                       fragment, ok := q.get(currentTSN)
+                       if !ok {
+                               a.log.Warnf("[%s] attempted to abandon chunk, but missing fragment tsn=%d ", a.name, currentTSN)
+                               return
+                       } else if fragment.beginningFragment {
+                               return
+                       }
+
+                       fragment.abandoned = true
+                       currentTSN++
+               }
        }
 }

@@ -1894,7 +1916,7 @@ func (a *Association) getDataPacketsToRetransmit() []*packet {

                c.nSent++

-               a.checkPartialReliabilityStatus(c)
+               a.checkPartialReliabilityStatus(c, a.inflightQueue)

                a.log.Tracef("[%s] retransmitting tsn=%d ssn=%d sent=%d", a.name, c.tsn, c.streamSequenceNumber, c.nSent)

@enobufs (Member) commented Feb 1, 2020

Ah... one thing I realize may make things complicated: the rest of the fragments that also need to be abandoned may still be in the pendingQueue.

When sendPayloadData() is called, the chunks passed in are first written into the pendingQueue. The chunks in the pending queue do not have a TSN assigned yet (so that urgent messages can be reordered; that is why the pendingQueue has ordered and unordered queues).

Pulling a chunk from the pendingQueue is done by peek(), then pop(). When peek() returns a chunk that is a fragment of a message (E-bit being 0), the next peek() sticks to whichever queue (ordered or unordered) the earlier chunk came from, until the last fragment is popped (using the pendingQueue.selected flag in the code). So this concern is taken care of; it is done because all TSNs of the same group of fragments must be consecutive.
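
As a rough sketch of that "selected" behavior (illustrative names, not the actual pion/sctp code; chunkPayloadData is the pion/sctp chunk type, as sketched earlier in the thread): once a fragmented message has been started from one queue, peek() keeps returning chunks from that same queue until the E-bit chunk is popped, so all fragments of a message receive consecutive TSNs.

    type pendingQueue struct {
        unorderedQueue      []*chunkPayloadData
        orderedQueue        []*chunkPayloadData
        selected            bool // mid-message: stick to the selected queue
        unorderedIsSelected bool
    }

    func (q *pendingQueue) peek() *chunkPayloadData {
        if q.selected {
            // A fragmented message is in progress; stay on its queue.
            if q.unorderedIsSelected {
                if len(q.unorderedQueue) > 0 {
                    return q.unorderedQueue[0]
                }
                return nil
            }
            if len(q.orderedQueue) > 0 {
                return q.orderedQueue[0]
            }
            return nil
        }
        // Not mid-message: unordered (urgent) chunks may jump ahead.
        if len(q.unorderedQueue) > 0 {
            return q.unorderedQueue[0]
        }
        if len(q.orderedQueue) > 0 {
            return q.orderedQueue[0]
        }
        return nil
    }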

When there is an available congestion window, the chunks are moved from the pendingQueue to the inflightQueue, and the chunks have their TSN assigned at that point. (See movePendingDataChunkToInflightQueue().)

So, the challenging part is: we currently do not have an easy way to set the abandoned flag to true for the chunks that are still in the pendingQueue.

What I would do (not entirely sure if it works): when moving chunks from the pendingQueue to the inflightQueue, check a.cumulativeTSNAckPoint against the newly assigned TSN. If the TSN is equal to or less than a.cumulativeTSNAckPoint, I believe we should mark the chunk as abandoned at that time and make sure it won't be sent out.
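
A hedged sketch of that idea (the surrounding function is hypothetical; the tsn and abandoned fields follow the diff above, and sna32LTE is named after the serial-number-arithmetic helper in association.go):

    // Stand-in for pion/sctp's chunk type, as used in the diff above.
    type chunkPayloadData struct {
        tsn       uint32
        abandoned bool
    }

    // sna32LTE: serial number arithmetic (RFC 1982) "i1 <= i2" for 32-bit TSNs.
    func sna32LTE(i1, i2 uint32) bool {
        return i1 == i2 || int32(i2-i1) > 0
    }

    // When a chunk leaves the pendingQueue and gets its TSN assigned, abandon
    // it immediately if a Forward TSN has already advanced past it, so it is
    // never put on the wire.
    func assignTSN(c *chunkPayloadData, myNextTSN *uint32, cumulativeTSNAckPoint uint32) {
        c.tsn = *myNextTSN
        *myNextTSN++
        if sna32LTE(c.tsn, cumulativeTSNAckPoint) {
            c.abandoned = true
        }
    }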

@enobufs (Member) commented Feb 1, 2020

There is one thing in my mind that concerns me a bit about the protocol violation error.

Say, we sent chunks with TSN 1(B=1), 2, 3, 4, 5(E=1). The sender forwards TSN to 3, stops retransmitting 4, 5 (which we are trying to do), but those chunks with 4 and 5 might still be in-flight (reordered delay), and later received by the remote.

So, I am not 100% sure the workaround would work perfectly. Maybe usrsctp should have just ignored the DATA chunk silently instead of sending the error. Or there may be other causes of the violation error...

@Sean-Der (Member Author) commented Feb 1, 2020

Oh, that is a really good point.

I think that is worth asking @tuexen. I can file an issue on the usrsctp bug tracker and see if he has any ideas!

@Sean-Der (Member Author) commented Feb 1, 2020

I am just walking out the door, but I can file a ticket right when I get back! Or feel free to do it if you are bored/want to get into it :)

@enobufs (Member) commented Feb 1, 2020

I won't be available for this issue (as the main assignee) due to other commitments right now but will try my best to support your work like this. It is great to have another eye on sctp also!

@tuexen (Contributor) commented Feb 1, 2020

Is it possible to get

  • a .pcapng file taken at the side sending the ABORT chunk.
  • a description of what is done at the data sender side (message size, ordered / unordered, PR-SCTP policy)

Then I can try to reproduce the issue.

@Sean-Der (Member Author) commented Feb 2, 2020

Fantastic! Thank you @tuexen

The sender is here

  • Unordered
  • 0 Retransmissions
  • Each message is 50000 bytes, broken into 1200 byte fragments.

The abort comes from usrsctp here

You can reproduce by doing

  • Using main.go
  • demo.js
  • simulating packet loss using comcast --device=lo --latency=150 --packet-loss=10% --target-proto=udp

I get an ABORT after ~5 seconds.

Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000

I am happy to answer any questions about our SCTP implementation! You can easily run it by doing this

I also documented everything as I went in this ticket. This wasn't filed by a user, so everything should be relevant! Thank you so much for the help :)

@tuexen (Contributor) commented Feb 2, 2020

Could you provide a tracefile from the usrsctp side which contains the packets received and sent? Right now it only contains one direction.

One note about your configuration: a single user message needs 42 fragments, which make up one packet each. Using a packet loss rate of 10 %, this gives a chance of (9/10)^42 that a user message is received completely. This is approximately 1%. Not sure if this is what you intended to test, just making sure what we should observe. So for 100 messages you are sending, 1 should be received successfully.

@mturnshek:

@tuexen can you elaborate a bit on that derivation for a noob? 50000 bytes message = 42 fragments in SCTP, which are one packet each?

@Sean-Der (Member Author) commented Feb 2, 2020

@tuexen here is another one! Pion got Abort chunk, with following errors: (Protocol Violation: Reass 30000007,CI:ffffffff,TSN=cf800b55,SID=0002,FSN=cf800b55,SSN:0000)

fromChrome.tar.gz
intoChrome.tar.gz


The 10% packet loss is just for an easy repro. @mturnshek is still hitting this in production though. My goal is just to have everything be as resilient as possible/not making users debug things. I don't have any expectations over how much data is actually transferred and at what speed.


@mturnshek yep exactly! Each SCTP packet contains at most a data fragment of 1200 bytes, so you need to split across all those packets.

Once split, each of those 42 fragments travels in a separate packet with its own TSN (SCTP's unique sequence id). If any one of those packets is lost, the whole larger message needs to be discarded as well.

@tuexen (Contributor) commented Feb 2, 2020

@mturnshek @Sean-Der gives the correct explanation. If you are sending a user message of 50000 bytes and only 1200 bytes fit in a packet, then since 50000/1200 = 41 + 2/3, you need 42 packets. Since the number of retransmissions is 0, a single dropped fragment requires the receiver to drop all fragments of the user message.
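
For anyone who wants to check the arithmetic, here is a throwaway snippet (not part of either stack):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        fragments := math.Ceil(50000.0 / 1200.0) // 42 packets per user message
        pComplete := math.Pow(0.9, fragments)    // all 42 must survive 10% loss
        fmt.Printf("fragments=%.0f pComplete=%.4f\n", fragments, pComplete)
        // Prints: fragments=42 pComplete=0.0120 (about 1 message in 100 arrives intact)
    }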

@tuexen (Contributor) commented Feb 2, 2020

@Sean-Der Can you provide a single tracefile showing the packets processed by the usrsctp stack within Chrome? I need the correct timing between received packets and sent packets.
Firefox has this capability and I think Chrome has it too. Something like
https://github.com/nplab/WebRTC-Data-Channel-Playground#trace-with-wireshark

I need to understand which packet sequence is sent by the peer triggering the ABORT. Either the packet sequence sent by the peer is violating the specification (which means the ABORT is fine and the peer stack needs to be fixed) or the code detecting the violation is wrong (which means the ABORT is wrong and we need to fix usrsctp). Therefore I need the packet sequence sent/received by usrsctp with the correct timing.

If you can't get the logging from Chrome, maybe you can try with Firefox. It also uses the usrsctp stack...

@Sean-Der (Member Author) commented Feb 2, 2020

Unfortunately, enabling the logging in Chromium/Firefox stops me from getting the abort sent back :(

Let me keep trying different combinations!

@tuexen (Contributor) commented Feb 2, 2020

OK, that is strange since it only changes the timing and I don't expect a race condition to be a problem here. How are you taking the traces if not via logging from usrsctp?

@Sean-Der (Member Author) commented Feb 2, 2020

I have been capturing them going in/out of the Go SCTP implementation.

Just adding it to the Read/Write functions.

If I put them in the same pcap would it be helpful? I will keep trying with the browsers though, or maybe just write a little C program and get browsers out of the equation.

Thanks again for all the help

@tuexen (Contributor) commented Feb 2, 2020

I see.

It would be best to see the perspective of the usrsctp stack, since Go's perspective is different: packets sent to the Go implementation may be dropped before they arrive. I can try to look at these files. At least I can see if the Go implementation behaves as expected.

@Sean-Der (Member Author) commented Feb 2, 2020

@tuexen Do you think it is possible that the cleaning up of reassembly queues from the forward TSN can cause this?

When getting a FORWARD-TSN the code cleans up some of the control stuff here

Is it possible that when we get another fragment we enter the block for a new control here

Then we abort because we have a !B for the first fragment we have ever seen here

Also pinging @lgrahl: if you have any bandwidth/debug ideas, I am out of them :( Maybe I can spin up a RAWRTC instance and repro/log more easily.

@lgrahl commented Feb 3, 2020

Even though it's usrsctp-based, RAWRTC often behaves very differently from browsers. So I can't provide a definitive answer, but it's probably worth a try? Keep in mind that Chrome uses an outdated usrsctp version and IIRC they don't even handle abandoned messages, so you'll likely end up with garbled messages sometimes. Anyway, I can give you pointers if needed, but at the moment I don't know what you need. 🙂

@yutakasg:

Oops, yes I meant E=1. I want to check with packets on the wire too. I will try to get pcap and share that with you. Thanks @tuexen for your support. Really appreciated.

@mturnshek commented Feb 12, 2020

So I tested out this branch:
https://github.com/pion/sctp/tree/disable-forward-tsn

This disables Forward TSN, which is supposedly the root cause of the issue.

and noticed that we still receive an ABORT when I introduce packet loss. The channel errors with the following message (in Pion):
"channel error [0xc0003729c0] Abort chunk, with following errors: "

This makes me believe there's another issue which doesn't have to do with Forward TSN, and may be causing more of the ABORTs.

Notably, the ABORT doesn't say "Protocol Violation Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000" or anything.

Any ideas?

@lgrahl commented Feb 13, 2020

@mturnshek I have read @tuexen's mind and he is about to ask you for a PCAP file in order to investigate. 🙂

@tuexen (Contributor) commented Feb 13, 2020

@lgrahl did it correctly...

@mturnshek commented Feb 14, 2020

I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets.

In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure, with PION_LOG_TRACE=all enabled.

This does not occur when I am logging with Chrome, or have chrome://webrtc-internals open.

I can reproduce this consistently by introducing 5-10% packet loss (and 50ms latency) with the initial example Sean provided.

This was run on 6b0c8bb which has Forward TSN disabled.

Both of these traces begin in a healthy state, and I end after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion.
1: https://gist.github.com/mturnshek/2e2bf3b9f5ab651d6052bf2aa41048ce
2: https://gist.github.com/mturnshek/9a123759480d4bddec44b68b76084812

There are some interesting lines gap ack: 2 - 3, gap ack: 2 - 4, gap ack: 2 - 5. Maybe those provide a clue?

@tuexen (Contributor) commented Feb 14, 2020

> I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets.
>
> In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure, with PION_LOG_TRACE=all enabled.

You could add to Pion a routine like https://github.com/sctplab/usrsctp/blob/master/usrsctplib/user_socket.c#L3385
From that text output one can generate a .pcapng file using https://github.com/nplab/WebRTC-Data-Channel-Playground#3-extract-sctp-information-from-logfile and https://github.com/nplab/WebRTC-Data-Channel-Playground#4-text2pcap.
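
For example, a minimal sketch of such a dump routine on the Go side (my own illustration, not the usrsctp routine linked above). text2pcap's -u option wraps each packet in a dummy UDP header, and Wireshark can then be told to decode that UDP payload as SCTP:

    package main

    import (
        "fmt"
        "io"
    )

    // dumpPacket writes one raw SCTP packet as a text2pcap-compatible hex dump.
    // Convert the collected output with e.g.: text2pcap -u 5000,5001 dump.txt dump.pcap
    func dumpPacket(w io.Writer, pkt []byte) {
        for off := 0; off < len(pkt); off += 16 {
            end := off + 16
            if end > len(pkt) {
                end = len(pkt)
            }
            fmt.Fprintf(w, "%06x", off) // each line starts with a hex offset
            for _, b := range pkt[off:end] {
                fmt.Fprintf(w, " %02x", b)
            }
            fmt.Fprintln(w)
        }
        fmt.Fprintln(w) // blank line separates packets
    }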

> This does not occur when I am logging with Chrome, or have chrome://webrtc-internals open.
>
> I can reproduce this consistently by introducing 5-10% packet loss (and 50ms latency) with the initial example Sean provided.
>
> This was run on 6b0c8bb which has Forward TSN disabled.
>
> Both of these traces begin in a healthy state, and I end after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion.
> 1: https://gist.github.com/mturnshek/2e2bf3b9f5ab651d6052bf2aa41048ce
> 2: https://gist.github.com/mturnshek/9a123759480d4bddec44b68b76084812

Will look at them.

> There are some interesting lines gap ack: 2 - 3, gap ack: 2 - 4, gap ack: 2 - 5. Maybe those provide a clue?

@enobufs (Member) commented Feb 15, 2020

@tuexen I have captured SCTP packets in two pcapng files.

What I am doing:

  • Using my revision (issue-104-yt) of pion/sctp, which sends a Fwd TSN whose new cumulative TSN always points to the end of a fragmented message (E-bit = 1).
  • Sending many 32KB messages from pion to Chrome.
  • Randomly dropping packets (by skipping writes to the DTLS layer in the code) at an 8% ratio. (The dropped packets won't appear in the pcap.)
  • Max packet retransmits is set to 0.
  • Dumping packets at the pion/sctp layer by writing read/write data to UDP ports 5000/5001 respectively, to capture them with Wireshark.

unordered-with-sidssn-pair-bad.pcapng

  • The Forward TSN chunk always includes a pair of stream identifier (SID) and SSN.
  • Chrome stops invoking the dc.onmessage callback.
  • Chrome keeps returning a_rwnd=0.

unordered-without-sidssn-pair.pcapng

  • The Forward TSN chunk never includes the SID/SSN pair.
  • Chrome stops invoking the dc.onmessage callback.
  • Chrome keeps returning a_rwnd=0.

Because I send the Fwd TSN with the new cumulative TSN pointing to the end of the fragmented message, I don't see the Protocol Violation, but I have been seeing an unexplainable stall. I could not find anything wrong in the pcap files. With or without the SID/SSN pair does not make much difference, but without SID/SSN pairs in the Fwd TSN the a_rwnd=0 condition seems to occur more often and sooner.

In both cases, when the problem occurs, pion is retransmitting a DATA chunk. Chrome returns a SACK each time it receives the DATA chunk, but the SACK has the previous TSN. (To me, this is a typical case of the application no longer reading data from the SCTP reassembly queue...)

While this is happening, the JavaScript side seems to be healthy because other events are still coming in.

How this happens varies. Sometimes it happens right away (~5 sec); sometimes the traffic flows for a long time.

unordered-with-sidssn-pair-bad.zip
unordered-without-sidssn-pair.zip

enobufs added a commit that referenced this issue Feb 15, 2020
Tests need to be fixed
Relates to #104
@tuexen (Contributor) commented Feb 15, 2020

> There are some interesting lines gap ack: 2 - 3, gap ack: 2 - 4, gap ack: 2 - 5. Maybe those provide a clue?

I'm confused. My understanding was that we are looking at a case where Pion is sending data towards usrsctp. So the SACKs would be received by Pion. You are referring to lines stating that SACKs are being sent. Is that a different issue?

@tuexen (Contributor) commented Feb 15, 2020

> (quoting @enobufs' packet-capture report above in full)

I looked at the tracefile without the sidssn pair, since it is the correct way for unordered data. Observations:

  • I don't know what triggers the sending of the FORWARD-TSN chunks, but that is irrelevant for the receiver.
  • I don't understand why you are doing window probing by retransmitting a DATA chunk when the maximum number of retransmissions is zero. But that is also irrelevant for the receiver.
  • The receiver window is NOT recovered when the first and the 22nd DATA chunk of a user message of 28 DATA chunks are dropped. See frame 30761, where the DATA chunks are still buffered, and frame 30763, where it should have been recovered but is not. This is incorrect behaviour of usrsctp.

The following packetdrill script should reproduce the issue with the FreeBSD kernel stack:

+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
// Check the handshake with an empty(!) cookie
+0.0 bind(3, ..., ...) = 0
+0.0 listen(3, 1) = 0
+0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=1500, os=2, is=2, tsn=1, FORWARD_TSN_SUPPORTED[]]
+0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=1864135, os=..., is=..., tsn=1, ...]
+0.1 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.0 > sctp: COOKIE_ACK[flgs=0]
+0.0 accept(3, ..., ...) = 4
+0.0 close(3) = 0
// First packet is lost
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=2, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1862679, gaps=[2:2], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=3, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1861223, gaps=[2:3], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=4, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1859767, gaps=[2:4], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=5, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1858311, gaps=[2:5], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=6, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1856855, gaps=[2:6], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=7, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1855399, gaps=[2:7], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=8, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1853943, gaps=[2:8], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=9, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1852487, gaps=[2:9], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=10, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1851031, gaps=[2:10], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=11, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1849575, gaps=[2:11], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=12, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1848119, gaps=[2:12], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=13, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1846663, gaps=[2:13], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=14, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1845207, gaps=[2:14], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=15, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1843751, gaps=[2:15], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=16, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1842295, gaps=[2:16], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=17, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1840839, gaps=[2:17], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=18, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1839383, gaps=[2:18], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=19, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1837927, gaps=[2:19], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=20, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1836471, gaps=[2:20], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=21, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1835015, gaps=[2:21], dups=[]]
// Next packet is lost
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=23, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1833559, gaps=[2:21,23:23], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=24, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1832103, gaps=[2:21,23:24], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=25, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1830647, gaps=[2:21,23:25], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=26, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1829191, gaps=[2:21,23:26], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=27, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1827735, gaps=[2:21,23:27], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=384, tsn=28, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1827111, gaps=[2:21,23:28], dups=[]]
// Now the FORWARD-TSN chunk comes
+0.0 < sctp: FORWARD_TSN[cum_tsn=28, ids=[]]
+0.0 > sctp: SACK[flgs=0, cum_tsn=28, a_rwnd=1864135, gaps=[], dups=[]]
// Tear down the association
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 recv(4, ..., 1024, 0) = 0
+0.0 close(4) = 0

But it runs fine. Options:

  1. The above script does not capture the root of the problem.
  2. The bug is user stack specific.
  3. The bug has been fixed after Google imported the code.

I guess it is 3. Are you able to test with the more up-to-date usrsctp stack? If not, I can try to build a script for a different test tool and test with the userland stack.

@enobufs (Member) commented Feb 15, 2020

@tuexen Thanks for your time and comments.

> I don't understand why you are doing window probing by retransmitting a DATA chunk when the maximum number of retransmissions is zero. But that is also irrelevant for the receiver.

Yeah, good point. When a message is large and cwnd is small, later fragments of the message are still in the pending queue (chunks that have not been sent yet and do not have a TSN assigned yet). When the time comes to abandon an earlier fragment, there are two options in that situation:

(1) Discard all fragments in the inflight queue as well as the pending queue.
(2) Do not abandon any chunk until the last fragment is sent on wire and moved into the inflight queue.

The current approach (in my branch) is (2) because it was a lot easier. But I think (1) is the way to go (it aligns with what the RFC says) and I'd need more time. I wanted to make sure (2) would solve the protocol violation error first, as a quick fix.
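
A sketch of the option (2) gate (hypothetical helper; it reuses the payloadQueue / chunkPayloadData stand-ins sketched earlier in the thread):

    // canAbandon reports whether every fragment of c's message has left the
    // pendingQueue, i.e. the chunk carrying the E bit already has a TSN and is
    // in the inflight queue. Until then, option (2) defers abandoning.
    func canAbandon(q *payloadQueue, c *chunkPayloadData) bool {
        for tsn := c.tsn; ; tsn++ {
            frag, ok := q.get(tsn)
            if !ok {
                return false // later fragments are still pending, without TSNs
            }
            if frag.endingFragment {
                return true
            }
        }
    }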

> 1. The above script does not capture the root of the problem.
> 2. The bug is user stack specific.
> 3. The bug has been fixed after Google imported the code.

I tested pion against Firefox (72.0.2), and it never happened. So my guess is also 3.
(I need to find a way to tell which revision of usrsctp different versions of Chrome and Firefox use...)

I am going to leave this stalling issue at that for now, and fix tests that are broken by the necessary changes to move on.

One thing to mention, I feel the use of large messages over a channel with partial reliability may not make much sense.

Aside from this particular issue, I think I should come up with an interop test tool between pion/sctp and usrsctp (with various versions).

enobufs added a commit that referenced this issue Feb 17, 2020
Tests need to be fixed
Relates to #104
enobufs added a commit that referenced this issue Feb 17, 2020
Resolves #104 and #106
enobufs added a commit that referenced this issue Feb 18, 2020
Tests need to be fixed
Relates to #104
@enobufs (Member) commented Feb 18, 2020

I will look into the report made by @mturnshek later. Reopening this.

enobufs reopened this Feb 18, 2020
@enobufs (Member) commented Feb 23, 2020

@tuexen (Contributor) commented Feb 23, 2020

Thanks for the information.

@enobufs (Member) commented Feb 27, 2020

I noticed that, using pion/sctp v1.7.5, unordered, sending 32KB messages to Chrome over a lossy connection (4% loss, 200ms RTT delay, using comcast... I mean.. this), Chrome sends an Abort chunk after a while, but this time (with v1.7.5) the error cause was "No User Data" (code: 9).

I found a bug in association.go that causes a DATA chunk with 0-length user data to be spilled out on the wire when a SACK and a T3-rtx timeout occur on the same chunk at the same time - a race condition on the 'chunk.retransmit' flag.

This bug was not a regression from the previous fix; it was there before.

After I fixed it locally, I have been running for a while and I no longer see the Abort at all. I will spend more time with different loss rates, delays, and data sizes before I submit a PR...
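
For illustration, a hypothetical sketch of the kind of guard that avoids this race (not the actual patch that became #116; the queue accessor and flag names are assumptions):

    // chunksToRetransmit gathers chunks flagged by the T3-rtx handler, but
    // re-checks each one so that a chunk acked (and its userData released) by
    // a concurrent SACK is never sent out as an empty DATA chunk.
    func (a *Association) chunksToRetransmit() []*chunkPayloadData {
        var out []*chunkPayloadData
        for _, c := range a.inflightQueue.sorted() { // hypothetical accessor
            if !c.retransmit {
                continue
            }
            if c.acked || c.abandoned || len(c.userData) == 0 {
                c.retransmit = false // the SACK won the race; don't send empty data
                continue
            }
            out = append(out, c)
        }
        return out
    }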

enobufs added a commit that referenced this issue Feb 28, 2020
enobufs added a commit that referenced this issue Feb 28, 2020
@enobufs (Member) commented Feb 28, 2020

@mturnshek I have merged #116 and tagged v1.7.6. Are you able to try to repro the Abort with pion/sctp@v1.7.6? I saw the Abort from Chrome (as mentioned above, with v1.7.5) but the error cause was "No User Data", not "Protocol Violation". It may just be two different manifestations of the same cause (hopefully), but there could be another issue that simply doesn't happen in my environment.

@mturnshek:

I will test v1.7.6 today and report here.

@mturnshek commented Mar 16, 2020

@enobufs @yutakasg I'm very late here, but I have finally tested this and experienced no connection breaks even with 75% packet loss. There are some interesting results where pc.oniceconnectionstatechange = e => log(pc.iceConnectionState) quickly fires "disconnected" and then "connected" above 75% packet loss.

But even when that happened, reliable messages eventually found their way!

Truly awesome stuff here. Thanks so much for the attention to detail and quick action when handling this.

@enobufs (Member) commented Mar 17, 2020

@mturnshek That's great news! Thanks for running the test and letting us know!
