Proper handling of outgoing data channel messages (buffering, large messages) #133

Open
daniel-abramov opened this issue Feb 14, 2023 · 5 comments


daniel-abramov commented Feb 14, 2023

It looks like sending a message on a data channel is a blocking operation (due to the I/O involved). In my non-scientific tests, sending a message on a data channel sometimes took 30ms. That's not a problem in the majority of cases, but when the conference gets large and participants exchange messages over the data channel, it can significantly delay the processing of incoming To-Device messages.
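
For illustration, a minimal sketch of moving data channel sends out of the conference's main loop: a per-participant sender goroutine drains a bounded queue, so the main loop never blocks on the I/O of a send. All type and function names here are made up for the sketch and are not the SFU's actual code:

```go
// Package dcqueue is an illustrative sketch, not the SFU's real code.
package dcqueue

import (
	"log"

	"github.com/pion/webrtc/v3"
)

// DCSender decouples the conference loop from blocking data channel writes.
type DCSender struct {
	dc    *webrtc.DataChannel
	queue chan []byte
}

func NewDCSender(dc *webrtc.DataChannel) *DCSender {
	s := &DCSender{dc: dc, queue: make(chan []byte, 64)}
	go s.run()
	return s
}

// run performs the potentially slow Send calls off the main loop.
func (s *DCSender) run() {
	for msg := range s.queue {
		if err := s.dc.Send(msg); err != nil {
			log.Printf("failed to send data channel message: %v", err)
		}
	}
}

// Enqueue never blocks the caller; it drops the message if the queue is full.
func (s *DCSender) Enqueue(msg []byte) {
	select {
	case s.queue <- msg:
	default:
		log.Printf("outgoing data channel queue is full, dropping message")
	}
}
```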


UPD:

@daniel-abramov daniel-abramov added T-Enhancement New feature or request T-Task labels Feb 14, 2023
@daniel-abramov daniel-abramov self-assigned this Feb 14, 2023

daniel-abramov commented Feb 23, 2023

I tried to measure the amount of time it takes to process each conference message in the main loop, to see which of them takes most of the time. In a conference with 4 participants, the majority of messages take less than 1ms to handle.
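
For context, the measurement was essentially a matter of wrapping each handler call with a timer and logging the slow ones. A minimal sketch of that idea (handler and message names are placeholders, not the actual SFU code):

```go
package main

import (
	"log"
	"time"
)

// timeHandler runs a message handler and logs its name and duration if it
// took longer than 1ms. Illustrative only; the real main loop differs.
func timeHandler(name string, handler func()) {
	start := time.Now()
	handler()
	if elapsed := time.Since(start); elapsed > time.Millisecond {
		log.Printf("%-45s: %v", name, elapsed)
	}
}

func main() {
	// Example: how a single iteration of the conference loop could be timed.
	timeHandler("conference.processDataChannelMessage", func() {
		// ... handle the message ...
	})
}
```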

However, the same type of message may, under certain circumstances, take significantly longer to process.

I logged all occurrences where handling a message took more than 1ms and sorted them in descending order:

onNewParticipant:                            17.938334ms
onNewParticipant:                            14.249709ms
conference.processNewTrackPublishedMessage:  11.006ms
resendMetadataToAllExcept:                   10.72575ms
onNewParticipant:                            9.574333ms
conference.processNewTrackPublishedMessage:  8.04675ms
resendMetadataToAllExcept:                   7.9625ms
conference.processDataChannelMessage:        7.410416ms
onNewParticipant:                            7.355125ms
conference.processNewTrackPublishedMessage:  6.806458ms
conference.processDataChannelMessage:        5.301042ms
resendMetadataToAllExcept:                   4.100334ms
conference.processLeftTheCallMessage:        4.078875ms
removeParticipant:                           4.051708ms
conference.processNewTrackPublishedMessage:  3.850084ms
resendMetadataToAllExcept:                   3.667958ms
onNewParticipant:                            3.462417ms
onHangup:                                    3.393917ms
removeParticipant:                           3.294875ms
onNewParticipant:                            2.74225ms
onCandidates:                                2.324042ms
conference.processDataChannelMessage:        2.282125ms
conference.processNewTrackPublishedMessage:  2.213209ms
getAvailableStreamsFor:                      2.178167ms
onCandidates:                                2.138333ms
conference.processDataChannelMessage:        2.096166ms
resendMetadataToAllExcept:                   2.037042ms
onNewParticipant:                            1.987ms
onNewParticipant:                            1.858708ms
conference.processDataChannelMessage:        1.856125ms
conference.processTrackSubscriptionMessage:  1.651292ms
conference.processDataChannelMessage:        1.609167ms
conference.processNewTrackPublishedMessage:  1.520916ms
conference.processDataChannelMessage:        1.412625ms
onHangup:                                    1.361625ms
conference.processDataChannelMessage:        1.3315ms
conference.processDataChannelMessage:        1.288708ms
removeParticipant:                           1.22225ms
conference.processDataChannelMessage:        1.204084ms
conference.processNewTrackPublishedMessage:  1.196334ms
conference.processDataChannelMessage:        1.151292ms
resendMetadataToAllExcept:                   1.151125ms
onNewParticipant:                            1.135375ms
resendMetadataToAllExcept:                   1.100833ms
conference.processDataChannelMessage:        1.093875ms
conference.processNewTrackPublishedMessage:  1.01275ms

It's interesting that different occurrences of the same event can take significantly different amounts of time to process, which is a bit weird.

Take, for instance, onNewParticipant, which during my test (4 participants joining, leaving and joining again) took quite different amounts of time:

onNewParticipant                                :  1.858708ms
onNewParticipant                                :  967.791µs
onNewParticipant                                :  2.74225ms
onNewParticipant                                :  1.987ms
onNewParticipant                                :  1.135375ms
onNewParticipant                                :  9.574333ms
onNewParticipant                                :  17.938334ms
onNewParticipant                                :  3.462417ms
onNewParticipant                                :  14.249709ms
onNewParticipant                                :  7.355125ms

The discrepancy is rather large.

The same could be said for the metadata update, which consists of generating the metadata for each participant and then sending it to them:

resendMetadataToAllExcept                       :  35.167µs
resendMetadataToAllExcept                       :  126.25µs
resendMetadataToAllExcept                       :  84.333µs
resendMetadataToAllExcept                       :  270.75µs
resendMetadataToAllExcept                       :  251µs
resendMetadataToAllExcept                       :  4.100334ms
resendMetadataToAllExcept                       :  310.834µs
resendMetadataToAllExcept                       :  694.667µs
resendMetadataToAllExcept                       :  187µs
resendMetadataToAllExcept                       :  288µs
resendMetadataToAllExcept                       :  1.151125ms
resendMetadataToAllExcept                       :  2.037042ms
resendMetadataToAllExcept                       :  7.9625ms
resendMetadataToAllExcept                       :  520.5µs
resendMetadataToAllExcept                       :  618.375µs
resendMetadataToAllExcept                       :  503.083µs
resendMetadataToAllExcept                       :  197.25µs
resendMetadataToAllExcept                       :  3.667958ms
resendMetadataToAllExcept                       :  439.542µs
resendMetadataToAllExcept                       :  314.209µs
resendMetadataToAllExcept                       :  287.209µs
resendMetadataToAllExcept                       :  191µs
resendMetadataToAllExcept                       :  127.166µs
resendMetadataToAllExcept                       :  1.100833ms
resendMetadataToAllExcept                       :  10.72575ms
resendMetadataToAllExcept                       :  226.625µs
resendMetadataToAllExcept                       :  316.291µs
resendMetadataToAllExcept                       :  936.042µs
resendMetadataToAllExcept                       :  242.167µs
resendMetadataToAllExcept                       :  125.625µs
resendMetadataToAllExcept                       :  537.167µs
resendMetadataToAllExcept                       :  68.958µs

Mostly in the microsecond range, but it occasionally takes milliseconds.

It's also interesting to see a large discrepancy in handling a newly published track:

conference.processNewTrackPublishedMessage      :  1.01275ms
conference.processNewTrackPublishedMessage      :  131.583µs
conference.processNewTrackPublishedMessage      :  368.709µs
conference.processNewTrackPublishedMessage      :  364.458µs
conference.processNewTrackPublishedMessage      :  6.806458ms
conference.processNewTrackPublishedMessage      :  557.5µs
conference.processNewTrackPublishedMessage      :  918.666µs
conference.processNewTrackPublishedMessage      :  278.167µs
conference.processNewTrackPublishedMessage      :  455.458µs
conference.processNewTrackPublishedMessage      :  1.520916ms
conference.processNewTrackPublishedMessage      :  2.213209ms
conference.processNewTrackPublishedMessage      :  8.04675ms
conference.processNewTrackPublishedMessage      :  296.209µs
conference.processNewTrackPublishedMessage      :  3.850084ms
conference.processNewTrackPublishedMessage      :  513.042µs
conference.processNewTrackPublishedMessage      :  418.125µs
conference.processNewTrackPublishedMessage      :  1.196334ms
conference.processNewTrackPublishedMessage      :  11.006ms
conference.processNewTrackPublishedMessage      :  333.209µs
conference.processNewTrackPublishedMessage      :  452.042µs

Most likely this is caused by the fact that processNewTrackPublishedMessage is an accumulated metric that also includes resendMetadataToAllExcept.

Interestingly enough, sending the data channel message alone did not have that big a footprint (which makes #136 of questionable utility).


One obvious improvement would be to eliminate the need to send the whole stream/track metadata state to the client and to share only the changes. Also, we currently construct a metadata map each time we want to send it; it would be more efficient to maintain the map for each participant separately and update it on change, instead of allocating and releasing the map for every send.
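
A rough sketch of that idea: keep a long-lived metadata map per participant, update it in place when something changes, and send only the entries that actually changed rather than rebuilding and marshalling the full state every time. All names here are hypothetical:

```go
// Package metadiff is an illustrative sketch, not the SFU's real code.
package metadiff

// MetadataCache keeps the metadata last sent to a single participant so that
// only changed entries need to be marshalled and sent on the next update.
type MetadataCache struct {
	entries map[string]string // hypothetical: stream/track ID -> serialized metadata
}

func NewMetadataCache() *MetadataCache {
	return &MetadataCache{entries: make(map[string]string)}
}

// ApplyAndDiff merges the new full state into the cache and returns only the
// entries that changed, ready to be marshalled into a (much smaller) update.
func (c *MetadataCache) ApplyAndDiff(full map[string]string) map[string]string {
	changed := make(map[string]string)
	for id, value := range full {
		if old, ok := c.entries[id]; !ok || old != value {
			c.entries[id] = value
			changed[id] = value
		}
	}
	// Note: removal of entries that disappeared from 'full' is omitted here.
	return changed
}
```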


dbkr commented Feb 23, 2023

This is really interesting, thanks for doing this. This could be another argument for media info going in state events, so that only the short-term metadata needs to go via the SFU. It would be slightly ironic if an SFU designed to forward large numbers of high-res video streams got bottlenecked on JSON marshalling, so moving all of the signaling code out of the media path into separate goroutines definitely sounds like a plan (FWIW, in a past media stack we had a model of two threads: one dedicated to moving media around and another for everything else).

@daniel-abramov

> It would be slightly ironic if an SFU designed to forward large numbers of high res video streams got bottlenecked on JSON marshalling

Yeah, for sure 🙂 Though it seems like it does not really bottleneck there, i.e. I could imagine that even if it takes 100ms to marshal large requests, it should not behave as oddly as we observed last time, so I'm still trying to understand what happens when we go past 20 participants.

Also, we could drastically reduce the time needed to generate and send metadata if we sent only the diffs/changes to the other side and maintained the metadata map for each participant separately, so that we don't regenerate it every time we need to send an update.

Storing it in the room as state events is another possibility, that's right. In this case, the SFU would only share the media-to-MID relation and would not deal with metadata propagation at all.

@daniel-abramov

Context

I used https://github.com/vector-im/ec-cli-rust to put load on the SFU and, after several tests, was able to reproduce the issue where, after reaching the limit of 20 participants, things start to go awry (i.e. new participants can't see the majority of the other participants and the whole conference gets extremely unstable).

The majority of the requests and metadata updates are handled very promptly. Even the big ones, which may take a dozen milliseconds, are still not enough to freeze the session (they would likely become a problem with 100+ participants though), so it's not the blocking operations in the main loop of the conference that hindered us.

Checking and adding logs

I've observed the following odd things in the logs though:

  • Sometimes certain participants get into a vicious cycle: sending an offer, getting the answer, then sending it again, reconnecting, receiving the offer, sending the answer, then getting several OnNegotiationNeeded callbacks on the SFU, and the SDP ping-pong starts again.
  • Once we reach approximately 20 participants, we get to the point where, after adding new tracks for a new participant, we send a new SDP offer to the EC but don't get a response back.

I noticed that we lacked some logging around the data channels (specifically, errors that were returned but never logged), so I added it and repeated the tests. That helped me pinpoint the issue.
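
For illustration, the missing logging boils down to not discarding the error returned by the data channel write; something along these lines (a sketch with a stdlib logger, not the SFU's actual helper):

```go
// Package dclog is an illustrative sketch, not the SFU's real code.
package dclog

import (
	"log"

	"github.com/pion/webrtc/v3"
)

// SendWithLogging wraps a data channel write so that a failed send shows up
// in the logs instead of being silently dropped.
func SendWithLogging(dc *webrtc.DataChannel, payload []byte) {
	if err := dc.Send(payload); err != nil {
		log.Printf("failed to send message over the data channel: %v", err)
	}
}
```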

Final results

Notably, there are 2 things that regularly go wrong once the conference gets larger, and that go badly wrong once we reach 20 participants or so:

  1. At certain times a data channel message cannot be sent by the SFU because the data channel state is not open, which results in messages being dropped. Sometimes such a message is an SDP offer from the SFU to the EC (when a renegotiation is needed). We need to buffer important messages if they are dropped and re-send them once the data channel is ready (unless the message has been superseded by a newer message of the same type in the meantime); a rough sketch follows after this list. One odd thing is that we do get a data channel open callback from Pion, and the channel is supposed to be open from that point on, but in certain cases it turns out not to be. I've found a couple of races related to that in Pion, but both are solved (first, second). I believe this particular issue is a bug in the SFU: we have a logical race where, if the channel gets closed right after someone attempts to send on it, the sender grabs a reference to a channel that is still marked as open, then we get the closed callback from Pion and mark it as closed, and the sender ends up attempting to send on a channel that is already closed. Long story short: we need to introduce smarter buffering as mentioned above.
  2. After reaching the limit of 20 participants or so, we regularly fail to write to the data channel due to an error from Pion saying outbound packet larger than maximum message size: 65535. It turns out that there are 2 open issues about this in Pion since 2020 that have not been addressed (Error: Outbound packet larger than maximum message size 65535 pion/webrtc#1396, Outbound packet larger than maximum message size 65535 pion/webrtc#2712). Essentially, Pion does not support sending data channel messages that are 64 KiB or larger. The only workaround on our side would be to split messages into chunks, which means adding indices to them and an indicator that the message is complete (a rough sketch follows at the end of this comment). That's quite a lot of hassle for a small thing, and I'm not sure that implementing such packetization just to work around a library issue is particularly sensible. We could instead share this information over Matrix. Either way, we must be very careful about that limit: the SDPs that we share over the data channel must be small enough, and so must any other messages (e.g. the mappings between track UUIDs and MIDs that we want to implement).
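
A rough sketch of the buffering idea from point 1: queue outgoing messages while the channel is not (or no longer) open, and flush the queue from the OnOpen callback. The structure and names are illustrative, not the SFU's actual code, and deduplication of superseded messages of the same type is omitted:

```go
// Package dcbuffer is an illustrative sketch, not the SFU's real code.
package dcbuffer

import (
	"log"
	"sync"

	"github.com/pion/webrtc/v3"
)

// BufferedChannel queues outgoing messages until the data channel is open and
// re-queues a message whose send failed.
type BufferedChannel struct {
	mu      sync.Mutex
	dc      *webrtc.DataChannel
	open    bool
	pending [][]byte
}

func NewBufferedChannel(dc *webrtc.DataChannel) *BufferedChannel {
	b := &BufferedChannel{dc: dc}
	dc.OnOpen(func() {
		b.mu.Lock()
		b.open = true
		pending := b.pending
		b.pending = nil
		b.mu.Unlock()
		// Flush everything that was queued while the channel was not open.
		for _, msg := range pending {
			b.Send(msg)
		}
	})
	dc.OnClose(func() {
		b.mu.Lock()
		b.open = false
		b.mu.Unlock()
	})
	return b
}

// Send writes to the channel if it is open, otherwise buffers the message.
func (b *BufferedChannel) Send(msg []byte) {
	b.mu.Lock()
	if !b.open {
		b.pending = append(b.pending, msg)
		b.mu.Unlock()
		return
	}
	b.mu.Unlock()
	if err := b.dc.Send(msg); err != nil {
		log.Printf("data channel send failed, buffering message for retry: %v", err)
		b.mu.Lock()
		b.pending = append(b.pending, msg)
		b.mu.Unlock()
	}
}
```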

As an alternative to chunking the messages ourselves (point 2), we could contribute to Pion and implement the SCTP extension that supports an explicit EOR (End-Of-Record) flag, which would lift the 64 KiB limitation.

FWIW: while researching (2), I checked the relevant sections of the RFC, which recommends not going above 16 KiB per message in order to stay compatible with all browsers (historically, browsers implemented data channels differently and handled larger messages in different ways, so the behavior may be a bit inconsistent). I wonder if this might be related to #106, as handling of data channel messages is one of the areas where Firefox and Chrome behaved differently in the past.
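
To make the chunking workaround from point 2 concrete, splitting could look roughly like this, keeping each chunk around the 16 KiB recommendation mentioned above. The envelope format (message ID, index, total) is entirely made up for this sketch, and JSON/base64 overhead as well as receiver-side reassembly are glossed over:

```go
// Package dcchunk is an illustrative sketch, not the SFU's real code.
package dcchunk

import "encoding/json"

// chunk is a hypothetical envelope for splitting a large payload into pieces
// that stay well below Pion's 64 KiB limit. The receiver collects chunks by
// MessageID and reassembles the payload once it has seen Total of them.
type chunk struct {
	MessageID string `json:"message_id"`
	Index     int    `json:"index"`
	Total     int    `json:"total"`
	Data      []byte `json:"data"`
}

const maxChunkSize = 16 * 1024

// SplitIntoChunks packetizes a payload into individually sendable messages.
func SplitIntoChunks(messageID string, payload []byte) ([][]byte, error) {
	total := (len(payload) + maxChunkSize - 1) / maxChunkSize
	chunks := make([][]byte, 0, total)
	for i := 0; i < total; i++ {
		end := (i + 1) * maxChunkSize
		if end > len(payload) {
			end = len(payload)
		}
		encoded, err := json.Marshal(chunk{
			MessageID: messageID,
			Index:     i,
			Total:     total,
			Data:      payload[i*maxChunkSize : end],
		})
		if err != nil {
			return nil, err
		}
		chunks = append(chunks, encoded)
	}
	return chunks, nil
}
```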

@daniel-abramov daniel-abramov changed the title To improve session performance: move blocking operations out of the main conference loop Properly handle data channel messages (buffering, large messages) Mar 6, 2023
@daniel-abramov daniel-abramov added T-Defect Something isn't working T-Task X-Needs-Investigation and removed T-Enhancement New feature or request labels Mar 6, 2023
@daniel-abramov daniel-abramov changed the title Properly handle data channel messages (buffering, large messages) Properly handle data channel messages (buffering and large messages) Mar 6, 2023
@daniel-abramov daniel-abramov changed the title Properly handle data channel messages (buffering and large messages) Proper handling for outgoing data channel messages (buffering, large messages) Mar 6, 2023
@daniel-abramov daniel-abramov changed the title Proper handling for outgoing data channel messages (buffering, large messages) Proper handling of outgoing data channel messages (buffering, large messages) Mar 6, 2023
@daniel-abramov daniel-abramov removed their assignment Mar 24, 2023