Race Condition on opening a Stream over a DataChannel #2443

justin0mcateer · 2024-03-15T15:29:36Z

Version:
1.2.4
Platform:
Linux master 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 16:31:48 +0000 x86_64 GNU/Linux
Windows 11
MacOS
Google Chrome 122.0.6261.94
Brave Browser 122.1.63.165
Subsystem:
WebRTC Transport

Severity:

High

Description:

DataChannels open but Streams timeout and fail.
The main symptom is that we end up with 'one way' Streams, where the incoming Stream was successfully opened, but the outgoing Stream aborts.

Steps to reproduce the error:

The timing window for the race condition is extremely small ~4ms, so the error only occurs when the time required to open a DataChannel is within a very small window. We typically see it when trying to establish connections in a LAN environment, but then the timing varies slightly based on the operating system and browser. Changing browsers may fix the problem, enabling a VPN will almost always fix the problem (due to added latency). The problem seems to occur with much greater frequency on Chrome on MacOS.

We only began seeing this problem when upgrading from 0.46 to 1.1 (now 1.2). The problem seems to have been introduced in the refactoring done here:
@achingbrain
#2237

What we are seeing is that there is a very small delay between checking for the current state of the DataChannel and pEvent actually registering the event handler to capture the 'open' event. Such that if the timing of the DataChannel open event is just right, pEvent never receives the event and throws on timing out.
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L228-L230

Catching the Timeout error and checking again, proceeding if the DataChannel is in fact open works around the issue.

Debugging log points have been added in this log output which illustrate the issue:

If pEvent caught the 'open' event this log message should be printed, but it is not:
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L231

It seems that due to the way pEvent is implemented, it leaves a tiny window of time between when the pEvent function is called and when 'addEventHandler' is actually called. A custom function that periodically polls, in addition to capturing the event may be required, since the timing is so tight in this case.

The text was updated successfully, but these errors were encountered:

justin0mcateer · 2024-03-15T15:53:35Z

Upon reviewing the pEvent code it more detail and thinking about it a little more. I believe this issue can only occur with events emitted from native code, such as the DataChannel event the WebRTC Transport library is trying to await. Otherwise, the pEvent code is effectively synchronous, so this would not happen with an event emitted from JavaScript code.

justin0mcateer · 2024-05-24T21:36:43Z

Just to follow up on this, we added a small patch to address this, it was happening fairly regularly and we haven't seen this problem any more after applying the patch.

Basically, we shortened the timeout and we just check it again if pEvent times out. This second check covers the case that pEvent misses the open event from the DataChannel.

I think a better solution long-term here might be to 'poll' the state periodically in addition to listening for the event.

+++ b/dist/src/stream.js
@@ -38,7 +38,7 @@ export const FIN_ACK_TIMEOUT = 5000;
  * When sending data messages, if the channel is not in the "open" state, wait
  * this long for the "open" event to fire.
  */
-export const OPEN_TIMEOUT = 5000;
+export const OPEN_TIMEOUT = 500;
 export class WebRTCStream extends AbstractStream {
     /**
      * The data channel used to send and receive data
@@ -177,7 +177,16 @@ export class WebRTCStream extends AbstractStream {
         }
         if (this.channel.readyState !== 'open') {
             this.log('channel state is "%s" and not "open", waiting for "open" event before sending data', this.channel.readyState);
-            await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+            try {
+               await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+               this.log('channel state is now "%s", sending data', this.channel.readyState);
+            } catch (err) {
+               // check if the channel is open after the timeout
+               if (this.channel.readyState !== 'open') {
+                   this.log.error('channel state is still "%s" after waiting for "open" event', this.channel.readyState);
+                   throw err;
+               }
+            }
             this.log('channel state is now "%s", sending data', this.channel.readyState);
         }
         // send message without copying data

justin0mcateer · 2024-05-24T21:58:45Z

Would an using 'Promise.all' across pWaitFor and pEvent be an acceptable solution? If so, we would be happy to submit a PR.

achingbrain · 2024-06-14T14:31:22Z

Sorry for the delay in getting to this, it does look like very strange behaviour.

Looking through the p-event code, it adds the event listeners synchronously as you've found. Any callbacks invoked from native code would be added to the callback queue - they can't interrupt the current event loop tick since js is run-to-completion.

But you're seeing some behaviour that's hard to explain, so maybe something is causing the current tick to end.

I'd like to dig into this further - do you have a way to replicate the problem?

justin0mcateer · 2024-06-14T18:44:35Z

It is a race condition that happens only if the open is not too fast or too slow. Probably in a window of about 4 to 8 milliseconds based on my testing.

We aren't doing anything special, we are using Circuit-Relay and the WebRTCTransport is managing opening the connections.

The only variable that impacts the ocurrence is the latency of the 'open' event. However, because the handling of creating the DataChannel is all internal to libp2p, I don't see an easy way to inject a delay to trigger the race without either modifying the WebRTCTransport implementation or doing it at the network level.

We have some tests that use 'tc' on Linux to inject packet latency and packet loss between two nodes, but they require a fair amount of setup code building virtual interfaces and bridges to allow controlling the network behavior.

achingbrain · 2024-06-15T14:46:14Z

4-8ms is an eternity for synchronous code so there's definitely something weird happening.

I've tried to create a repro from your description here https://github.com/achingbrain/libp2p-webrtc-ping

I'm having problems replicating with Chrome*, can you please take a look at tell me what I need to change or add to be able to replicate the issue?

It shouldn't be necessary to modify libp2p internals to replicate the issue - we should be able to do it with external changes only. Since we might want to modify libp2p to fix the issue, we can't introduce new code there that creates the bug, because to "fix" it we'd just delete the code we just added 😅.

* = For Firefox I came across this which is not great, but we'll see what the devs say.

justin0mcateer · 2024-06-15T18:32:09Z

I agree, code should not need to be added to reproduce the issue. However, since the condition only manifests within a small range of open event latency, ensuring the condition will manifest is not possible without being able to control the latency of the open event.

When you run on two nodes on localhost the latency will be so low that the datachannel is in the open state by the first time the first check happens, so pEvent will never be called.

You'll have to run the listener and the sender on two different machines in the same LAN. What we found was that certain network and browser combinantions would never see this issue, for example, connecting over the Internet or a VPN. Other configurations would see this issue a non-trivial percentage of the time.

We found that the problem seemed to occur more frequently when one of the peers was running Chrome on MacOS. Seems like the open latency was slightly higher with MacOS.

A more deterministic option would be to create two 'veth' interfaces, one for the listener and one for the sender, and use 'tc' to add latency to the network path.

I appreciate your interest in this. It is not an easy one I think. I would be happy to discuss this real-time if it is helpful. You can reach me directly at domain ryzyliant dot com, user justin dot mcateer.

achingbrain · 2024-06-18T10:39:17Z

I've tried running https://github.com/achingbrain/libp2p-webrtc-ping on two MacOS machines on a LAN with both running Chrome but I still can't replicate the issue.

If the problem is that the open event is firing before the listener is added, introducing network latency should decrease the chance of this happening since it would take longer for the data channel to become open and the event to fire so this may be a red herring.

If anything we'd want to reduce latency to expose the issue since the time between the check and the adding of the event listener will be significantly less than 1ms as the code is all synchronous.

That the code is synchronous means that it's not possible to interrupt it, so assuming it all happens in the current tick, there's no way the event can fire before the listener is added, since the callback from the native code that will cause the event to fire will be invoked when the callback queue is processed, which happens after the current tick has ended (e.g. the call stack is empty) and adding the listener will increase the calls in the call stack, since it's done synchronously.

If the native callback is executed in a way that interrupts the JS call stack, the bug is in the browser since that's not how JavaScript works as pointed out.

I need more assistance in replicating this issue as it doesn't seem possible as described.

justin0mcateer · 2024-06-18T11:34:13Z

Thanks again for taking the time to dig into this. We will work on an automated method of reproducing this issue, possibly based on the test repo you put together.

We will try to get to this as quickly as possible, but it may take a couple of weeks.

achingbrain · 2024-06-18T12:40:46Z

Awesome, thanks - I'll check back here in a couple of weeks.

justin0mcateer added the need/triage Needs initial labeling and prioritization label Mar 15, 2024

justin0mcateer mentioned this issue Mar 15, 2024

Possible Race Condition when catching native events sindresorhus/p-event#47

Open

achingbrain removed the need/triage Needs initial labeling and prioritization label Jun 18, 2024

achingbrain added the need/author-input Needs input from the original author label Jun 18, 2024

achingbrain removed the need/author-input Needs input from the original author label Jun 18, 2024

achingbrain added need/analysis Needs further analysis before proceeding need/author-input Needs input from the original author labels Jun 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race Condition on opening a Stream over a DataChannel #2443

Race Condition on opening a Stream over a DataChannel #2443

justin0mcateer commented Mar 15, 2024 •

edited

Loading

justin0mcateer commented Mar 15, 2024

justin0mcateer commented May 24, 2024

justin0mcateer commented May 24, 2024

achingbrain commented Jun 14, 2024

justin0mcateer commented Jun 14, 2024

achingbrain commented Jun 15, 2024

justin0mcateer commented Jun 15, 2024

achingbrain commented Jun 18, 2024

justin0mcateer commented Jun 18, 2024

achingbrain commented Jun 18, 2024

Race Condition on opening a Stream over a DataChannel #2443

Race Condition on opening a Stream over a DataChannel #2443

Comments

justin0mcateer commented Mar 15, 2024 • edited Loading

Severity:

Description:

Steps to reproduce the error:

justin0mcateer commented Mar 15, 2024

justin0mcateer commented May 24, 2024

justin0mcateer commented May 24, 2024

achingbrain commented Jun 14, 2024

justin0mcateer commented Jun 14, 2024

achingbrain commented Jun 15, 2024

justin0mcateer commented Jun 15, 2024

achingbrain commented Jun 18, 2024

justin0mcateer commented Jun 18, 2024

achingbrain commented Jun 18, 2024

justin0mcateer commented Mar 15, 2024 •

edited

Loading