Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition on opening a Stream over a DataChannel #2443

Open
justin0mcateer opened this issue Mar 15, 2024 · 10 comments
Open

Race Condition on opening a Stream over a DataChannel #2443

justin0mcateer opened this issue Mar 15, 2024 · 10 comments
Labels
need/analysis Needs further analysis before proceeding need/author-input Needs input from the original author

Comments

@justin0mcateer
Copy link

justin0mcateer commented Mar 15, 2024

  • Version:
    1.2.4

  • Platform:
    Linux master 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 16:31:48 +0000 x86_64 GNU/Linux
    Windows 11
    MacOS
    Google Chrome 122.0.6261.94
    Brave Browser 122.1.63.165

  • Subsystem:
    WebRTC Transport

Severity:

High

Description:

DataChannels open but Streams timeout and fail.
The main symptom is that we end up with 'one way' Streams, where the incoming Stream was successfully opened, but the outgoing Stream aborts.

Steps to reproduce the error:

The timing window for the race condition is extremely small ~4ms, so the error only occurs when the time required to open a DataChannel is within a very small window. We typically see it when trying to establish connections in a LAN environment, but then the timing varies slightly based on the operating system and browser. Changing browsers may fix the problem, enabling a VPN will almost always fix the problem (due to added latency). The problem seems to occur with much greater frequency on Chrome on MacOS.

We only began seeing this problem when upgrading from 0.46 to 1.1 (now 1.2). The problem seems to have been introduced in the refactoring done here:
@achingbrain
#2237

What we are seeing is that there is a very small delay between checking for the current state of the DataChannel and pEvent actually registering the event handler to capture the 'open' event. Such that if the timing of the DataChannel open event is just right, pEvent never receives the event and throws on timing out.
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L228-L230

Catching the Timeout error and checking again, proceeding if the DataChannel is in fact open works around the issue.

Debugging log points have been added in this log output which illustrate the issue:
image

If pEvent caught the 'open' event this log message should be printed, but it is not:
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L231

It seems that due to the way pEvent is implemented, it leaves a tiny window of time between when the pEvent function is called and when 'addEventHandler' is actually called. A custom function that periodically polls, in addition to capturing the event may be required, since the timing is so tight in this case.

@justin0mcateer
Copy link
Author

Upon reviewing the pEvent code it more detail and thinking about it a little more. I believe this issue can only occur with events emitted from native code, such as the DataChannel event the WebRTC Transport library is trying to await. Otherwise, the pEvent code is effectively synchronous, so this would not happen with an event emitted from JavaScript code.

@justin0mcateer
Copy link
Author

Just to follow up on this, we added a small patch to address this, it was happening fairly regularly and we haven't seen this problem any more after applying the patch.

Basically, we shortened the timeout and we just check it again if pEvent times out. This second check covers the case that pEvent misses the open event from the DataChannel.

I think a better solution long-term here might be to 'poll' the state periodically in addition to listening for the event.

+++ b/dist/src/stream.js
@@ -38,7 +38,7 @@ export const FIN_ACK_TIMEOUT = 5000;
  * When sending data messages, if the channel is not in the "open" state, wait
  * this long for the "open" event to fire.
  */
-export const OPEN_TIMEOUT = 5000;
+export const OPEN_TIMEOUT = 500;
 export class WebRTCStream extends AbstractStream {
     /**
      * The data channel used to send and receive data
@@ -177,7 +177,16 @@ export class WebRTCStream extends AbstractStream {
         }
         if (this.channel.readyState !== 'open') {
             this.log('channel state is "%s" and not "open", waiting for "open" event before sending data', this.channel.readyState);
-            await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+            try {
+               await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+               this.log('channel state is now "%s", sending data', this.channel.readyState);
+            } catch (err) {
+               // check if the channel is open after the timeout
+               if (this.channel.readyState !== 'open') {
+                   this.log.error('channel state is still "%s" after waiting for "open" event', this.channel.readyState);
+                   throw err;
+               }
+            }
             this.log('channel state is now "%s", sending data', this.channel.readyState);
         }
         // send message without copying data

@justin0mcateer
Copy link
Author

Would an using 'Promise.all' across pWaitFor and pEvent be an acceptable solution? If so, we would be happy to submit a PR.

@achingbrain
Copy link
Member

Sorry for the delay in getting to this, it does look like very strange behaviour.

Looking through the p-event code, it adds the event listeners synchronously as you've found. Any callbacks invoked from native code would be added to the callback queue - they can't interrupt the current event loop tick since js is run-to-completion.

But you're seeing some behaviour that's hard to explain, so maybe something is causing the current tick to end.

I'd like to dig into this further - do you have a way to replicate the problem?

@justin0mcateer
Copy link
Author

It is a race condition that happens only if the open is not too fast or too slow. Probably in a window of about 4 to 8 milliseconds based on my testing.

We aren't doing anything special, we are using Circuit-Relay and the WebRTCTransport is managing opening the connections.

The only variable that impacts the ocurrence is the latency of the 'open' event. However, because the handling of creating the DataChannel is all internal to libp2p, I don't see an easy way to inject a delay to trigger the race without either modifying the WebRTCTransport implementation or doing it at the network level.

We have some tests that use 'tc' on Linux to inject packet latency and packet loss between two nodes, but they require a fair amount of setup code building virtual interfaces and bridges to allow controlling the network behavior.

@achingbrain
Copy link
Member

4-8ms is an eternity for synchronous code so there's definitely something weird happening.

I've tried to create a repro from your description here https://github.com/achingbrain/libp2p-webrtc-ping

I'm having problems replicating with Chrome*, can you please take a look at tell me what I need to change or add to be able to replicate the issue?

It shouldn't be necessary to modify libp2p internals to replicate the issue - we should be able to do it with external changes only. Since we might want to modify libp2p to fix the issue, we can't introduce new code there that creates the bug, because to "fix" it we'd just delete the code we just added 😅.


* = For Firefox I came across this which is not great, but we'll see what the devs say.

@justin0mcateer
Copy link
Author

I agree, code should not need to be added to reproduce the issue. However, since the condition only manifests within a small range of open event latency, ensuring the condition will manifest is not possible without being able to control the latency of the open event.

When you run on two nodes on localhost the latency will be so low that the datachannel is in the open state by the first time the first check happens, so pEvent will never be called.

You'll have to run the listener and the sender on two different machines in the same LAN. What we found was that certain network and browser combinantions would never see this issue, for example, connecting over the Internet or a VPN. Other configurations would see this issue a non-trivial percentage of the time.

We found that the problem seemed to occur more frequently when one of the peers was running Chrome on MacOS. Seems like the open latency was slightly higher with MacOS.

A more deterministic option would be to create two 'veth' interfaces, one for the listener and one for the sender, and use 'tc' to add latency to the network path.

I appreciate your interest in this. It is not an easy one I think. I would be happy to discuss this real-time if it is helpful. You can reach me directly at domain ryzyliant dot com, user justin dot mcateer.

@achingbrain achingbrain removed the need/triage Needs initial labeling and prioritization label Jun 18, 2024
@achingbrain
Copy link
Member

I've tried running https://github.com/achingbrain/libp2p-webrtc-ping on two MacOS machines on a LAN with both running Chrome but I still can't replicate the issue.

If the problem is that the open event is firing before the listener is added, introducing network latency should decrease the chance of this happening since it would take longer for the data channel to become open and the event to fire so this may be a red herring.

If anything we'd want to reduce latency to expose the issue since the time between the check and the adding of the event listener will be significantly less than 1ms as the code is all synchronous.

That the code is synchronous means that it's not possible to interrupt it, so assuming it all happens in the current tick, there's no way the event can fire before the listener is added, since the callback from the native code that will cause the event to fire will be invoked when the callback queue is processed, which happens after the current tick has ended (e.g. the call stack is empty) and adding the listener will increase the calls in the call stack, since it's done synchronously.

If the native callback is executed in a way that interrupts the JS call stack, the bug is in the browser since that's not how JavaScript works as pointed out.

I need more assistance in replicating this issue as it doesn't seem possible as described.

@achingbrain achingbrain added the need/author-input Needs input from the original author label Jun 18, 2024
@justin0mcateer
Copy link
Author

Thanks again for taking the time to dig into this. We will work on an automated method of reproducing this issue, possibly based on the test repo you put together.

We will try to get to this as quickly as possible, but it may take a couple of weeks.

@achingbrain achingbrain removed the need/author-input Needs input from the original author label Jun 18, 2024
@achingbrain
Copy link
Member

Awesome, thanks - I'll check back here in a couple of weeks.

@achingbrain achingbrain added need/analysis Needs further analysis before proceeding need/author-input Needs input from the original author labels Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/analysis Needs further analysis before proceeding need/author-input Needs input from the original author
Projects
None yet
Development

No branches or pull requests

2 participants