Skip to content

fix(webrtx): ICE issue + substream shutdown procedure#586

Open
gab8i wants to merge 21 commits into
masterfrom
gab_webrtc_multiple_fixes_v2
Open

fix(webrtx): ICE issue + substream shutdown procedure#586
gab8i wants to merge 21 commits into
masterfrom
gab_webrtc_multiple_fixes_v2

Conversation

@gab8i
Copy link
Copy Markdown
Contributor

@gab8i gab8i commented May 18, 2026

This PR contains a big rework which address comments left by @lexnv on #574.

Essentially:

  • Address ICE issue during negotiation
  • Add tracing/debug
  • Move the FIN/FIN_ACK half-close handshake from Substream::poll_shutdown into SubstreamHandle::poll_next (half_close). Now there are two clear halves that can close separately.
  • Introduce a dedicated State::Reset variant (previously folded into SendClosed), also send RESET_STREAM to the peer when the FIN_ACK timeout expires, instead of silently transitioning to FinAcked.
  • Clean up stale SubstreamHandle entries from SubstreamHandleSetwhen a channel is closed or an outbound write fails, so exhausted handles don't sit forever in the round-robin set.

@gab8i gab8i changed the title [WIP] fix(webrtx): ICE issue + substream shutdown procedure fix(webrtx): ICE issue + substream shutdown procedure May 18, 2026
.protocol_set
.report_substream_open_failure(
context.protocol,
context.substream_id,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a different substream ID from the one at pending_outbound?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is stored within pending_outbound differs from what is stored within the channels field. A channel is initially opened and ends up in pending_outbound with a substream_id, once the WebRTC data channel is created, it is moved into the channels field, where the state OutboundOpening implies that the multistream-select protocol still needs to happen. These are the same channel in two different phases.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we here call report_substream_open_failure twice from the higher-level protocols perspective?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so?

on_open_substream -> inserts pending_outbound item
on_channel_opened -> remove item from pending_outbound and inserts it as ChannelState::OutboundOpening within the channels field

While report_substream_open_failure is getting called only while negotiating the multistream-select protocol and while a channel is getting closed, within neither of the two it seems to be the case of it being called twice over the same substream id

/// Matches go-libp2p's 5 second stream close timeout.
const FIN_ACK_TIMEOUT: Duration = Duration::from_secs(5);
/// Matches go-libp2p and js-libp2p's 10-second stream close timeout.
const FIN_ACK_TIMEOUT: Duration = Duration::from_secs(10);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for aligning this 🙏

Comment thread src/transport/webrtc/substream.rs Outdated
@@ -206,7 +240,7 @@ impl SubstreamHandle {
flag: Some(Flag::FinAck),
}) {
tracing::warn!(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we cannot send the FinAck to the peer because of backpressure.
Then we return Ok(()) here and ignore this error, and this forces the remote peer to wait a full 10s.

Could we instead reserve a permit on the channel to always guarantee we can send out a FinAck?
Similarly, could this fail because the outbound tx has been dropped? Inthat case should we takle the gracefull shutdown to higher levels?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason why this could fail is backpressure because the Substream AsyncWriter could have already written many message and the FIN_ACK could have no space to be sent.

Could we instead reserve a permit on the channel to always guarantee we can send out a FinAck?

Yes, I think this should be done because reading more carefully the code and the comment I understood that it doesn't align to the spec, waiting 10s does not imply a graceful shutdown but instead force the peer to send a ResetStream flag. So the way to go is to remove this outbound_tx and simply add a flag which is used to preempt over other things that needs to be sent over the channel.


self.rtc.direct_api().close_data_channel(channel_id);
self.channels.insert(channel_id, ChannelState::Closing);
self.handles.remove(&channel_id);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to advance state: Arc<Mutex<State>> into Reset here?

Otherwise the poll_shutdown would always return Poll::Pending:

        if matches!(*self.state.lock(), State::FinAcked | State::Reset) {
            Poll::Ready(Ok(()))
        } else {
            Poll::Pending
        }

And this could stall further substream.close().await calls indefinetely

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the easiest fix would be to set the state inside SubstreamHandle::drop

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, tbf I honestly thought that poll_shutdown would have been stop to be called automatically if Substream was drop, but now that I think about it, it is more plausible it is the other way around: the Substream doesn't get drop until the poll_shutdown doesn't complete.

As you said, there could be an additional flag, something like ForceStop which just implies that something higher in the stack is forcing this substream to close.

}
// Let str0m handle input validation internally, similar to how the initial STUN packet is
// handled
self.rtc.handle_input(message).map_err(|error| {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the rtc accepts method was too restrictive including ICE. Now, we rely on str0m to handle input validation internally.
This also means we are redirecting every STUN packet into the connection. Does str0m handle under the hood ICE credential validation?

Copy link
Copy Markdown
Contributor Author

@gab8i gab8i May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accepts should be used to demultipex multiple Rtc instances while we are already doing so by tracking the source. Initially rtc gets created and then yes, handle_input manages ICE + DTLS + Noise + Scpt handshakes

Comment thread src/transport/webrtc/substream.rs Outdated
@@ -171,6 +199,11 @@ impl SubstreamHandle {
// This ensures that if a FIN message contains data, we deliver it before closing.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remote peer might send us a FIN frame containing data.
However, in the meanwhile we have droped the substream.
In this case, the self.inbound_tx.send would return an error and we'd never handle the Fin code path below, leaving again the remote to wait 10s until they reset the substream on timeout

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have we dropped the substream in the meantime?

In a request-response protocol, there can be a case where a peer issues a request and attaches the FIN flag to it. In that case, the message is delivered to inbound_tx first, followed by Event::RecvClosed triggered by the FIN flag, the order is preserved. As noted in the previous comment (#586 (comment)), a follow-up refactor will send the FIN_ACK immediately, the substream remains alive until the shutdown procedure terminates, either by receiving a FIN_ACK or by exceeding the timeout. What I describe here is the most eager case from a peer: attaching a FIN together with the request. If this is well covered also everything else should be (?)

Comment thread src/transport/webrtc/substream.rs Outdated
Flag::StopSending => {
*self.state.lock() = State::SendClosed;
let mut current_state = self.state.lock();
if !matches!(*current_state, State::FinSent | State::FinAcked) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes state::Reset into State::SendClosed:

  • if we enter this path after we've encountered an error (ie half close sets reset) we are nuking the state back to SendClosed
  • Then any poll_shutdown is going to wait indefinetely again

T0: we've encountered an error, our state is Reset
T1: on_message moves back our state to SendClosed
T2: poll_shutdown never returns Ok() and instead blocks the whole execution:

if matches!(*self.state.lock(), State::FinAcked | State::Reset) {
    Poll::Ready(Ok(()))
} else {
    Poll::Pending
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, reading your comment, I would say that you pointed out a nice bug, which should be solved by adding State::Reset to the matches, right?

Comment thread src/transport/webrtc/substream.rs Outdated
@@ -240,23 +278,97 @@ impl SubstreamHandle {
// (matching go-libp2p behavior)
// Close the read side
let _ = self.inbound_tx.try_send(Event::RecvClosed);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this code path couldn't we actually send the RecvClosed twice?

For example, in the path above, we ensure that we send RecvClosed only once if we received Fin:

if self.read_closed.swap(true, Ordering::SeqCst) {

However, here we send unconditionally the RecvClosed regardless if we sent it above or vice-versa

T0: We receive Flag::ResetStream and send Event::RecvClosed
T1: A delayed Fin arrives if self.read_closed.swap(true, Ordering::SeqCst) was previously false. We queue another RecvClosed and also send a FinAck to a stream that was reset at T0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically yes, practically no, but it remains a problem.

I didn't notice this before: the reset_stream_sent flag stops the SubstreamHandle Stream, preventing FIN_ACK from being sent. It's still a problem though, because even if inbound_tx should be dropped at some point, the on_message function could send another RecvClosed on the channel.

To solve this, either the state or the reset_stream_sent flag could be checked at the beginning of on_message.

Comment thread src/transport/webrtc/mod.rs Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the ufrag / pass are provided by the remote peer, it could be possible to craft a payload such that we panic here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how specifically maliciously formed ufrag / pass can cause panics within make_rtc_client but definitely it generally contains unwrap and panics that should not be there, it needs to be changed and handle them 'gracefully'.

Comment thread src/transport/webrtc/mod.rs Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the STUN packet here is malformed, ICE / fingerprint missmatch we are also panicking

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expect doesn't make sense here. I'm sorry I didn't notice it before, it's entirely possible to receive a malformed message, which needs to be handled gracefully.

Base automatically changed from gab_webrtc_multistream_select to master May 20, 2026 09:09
timwu20 and others added 5 commits May 20, 2026 12:28
Move the FIN/FIN_ACK half-close handshake from `Substream::poll_shutdown`
into `SubstreamHandle::poll_next` (`half_close`).

Now there are two clear halves that can close separately.

Both `poll_shutdown` and `Drop` go through the same path, so dropping
a substream still produces a graceful FIN.

Introduce a dedicated `State::Reset` variant (previously folded
into `SendClosed`), also send `RESET_STREAM` to the peer when the
FIN_ACK timeout expires, instead of silently transitioning to `FinAcked`.

Clean up stale `SubstreamHandle` entries from `SubstreamHandleSet`
when a channel is closed or an outbound write fails, so exhausted
handles don't sit forever in the round-robin set.
@gab8i gab8i force-pushed the gab_webrtc_multiple_fixes_v2 branch from 4b5487c to 2be6862 Compare May 20, 2026 10:32
gab8i added 2 commits May 20, 2026 12:34
Refactor the code to remove a channel writing messages to the same
struct, and respect the spec by immediately replying with FIN_ACK
after receiving a FIN instead of occasionally waiting 10 seconds and
letting the remote reset the connection.
@lexnv
Copy link
Copy Markdown
Collaborator

lexnv commented May 21, 2026

The state management between Substream (object received by higher-level protocols) and SubstreamHandle polled on webrtc task is getting a bit too convoluted:

  1. We share an state: Arc<Mutex<State>> between both objects
  • This is altered by SubstreamHandle::on_message to transition the state on FinAck and wake up the shutdown wakers. Similarly, transition into SendClosed on receiving Flag::StopSending and wake up the poll_write waker. And similar upon receiving ResetStream
  • Then the Substream::poll_write rejects any state thats not in State::Open with a brokek pipe error.
  • Substream::poll_shutdown returns pending until the state transitions into FinAck / Reset
  1. We have a write_waker atomic waker registered by the poll_write
  • In combination with state mutex, it is effectively signaling a write error upon Flag::StopSending or Flag::ResetStream (broken pipe)
  1. We have a shutdown_waker atomic waker registered inside Substream::poll_shutdown
  • This is woken up on FinAck / ResetStream
  • Also from fn half_close onm timeout if we didn't receive the FinAck in due time
  1. We also got a substream_shutdown atomic bool with self.substream_shutdown_waker atomic waker
  • substream_shutdown_waker registered by SubstreamHandle::poll_next
  • Substream::poll_shutdown / on drop sets the bool to true and wakes the waker

Instead, could we simplify this by:

  • removing the mutex entirely and keeping the state private to the SubstreamHandle
  • Waker + atomicbool patterns could be replaced by channels

Ideally, the Substream could hold:

  • read_buffer and rx: Receiver<Event> similar to before
  • tx: Option<PollSender<Event>> signal when the poll_write should fail
  • done_signal: Option<oneshot sender<()>> signals the substreamHandle that shutdown was called / or substream dropped
  • shutdown_complete: Option<oneshot receiver<()>> awaited by shutdown to resolve when the fin ack/reset/timeout reached its state

@gab8i
Copy link
Copy Markdown
Contributor Author

gab8i commented May 21, 2026

The latest commits address all the review comments except the last one, which would require refactoring the Substream/SubstreamHandle relationship (#593). One TODO remains in substream.rs (around the ReadingState::Fin transition), addressing it would mean adding or modifying wakers, but since the goal is to partially or fully remove them, it will be handled in a follow-up PR

@gab8i
Copy link
Copy Markdown
Contributor Author

gab8i commented May 21, 2026

The last commit adds a simple if to apparently make things work, but it sweeps a bigger problem under the carpet, one related to the Substream <-> SubstreamHandle structure, which will be addressed in the refactor. The core issue:
once the Substream receives RecvClosed, poll_read returns Err(BrokenPipe), which causes a drop of the Substream, including the write side, which a half-close from the peer should not close.

Copy link
Copy Markdown
Collaborator

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an improvement over what we had before 👍

I didn't look closely at the edge cases between the mutex states/atomic bools and wakers, so we still have some rough edges. However, most of the code would get simplified either way by: #586 (comment) 🙏

gab8i and others added 3 commits May 28, 2026 15:47
…600)

Rework the communication mechanism between `Substream` and its
`SubstreamHandle`. State is shared through `Mutex` and `AtomicWaker`
abstracted behind a small helper, which makes them easy to use and
ensures the relevant tasks are woken whenever the shared state changes.

This also decouples the reading half from the writing half: a graceful
close of either half no longer implies closing the other. An abrupt
RESET_STREAM still tears down both, as required by the spec.
target: LOG_TARGET,
peer = ?self.peer,
destination = ?v.destination,
"UDP send buffer full, dropping datagram (str0m will retransmit)",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it, str0m will retransmit after the hole in the sequence numbers is detected, leading to unordered delivery of packets and broken application data. Instead it's better to pause (backpressure) a producer and send all UDP packets in order they are produced by str0m.

Can be a follow-up PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that what you described is really needed.
The flow of messages producer -> str0m is different from str0m -> udp socket.
The latter benefits from all the guarantees that str0m gives, especially given how str0m is created:

  • ordered: false
  • reliability: Default::default() → Reliability::Reliable

So the channel is reliable: dropping an outgoing UDP datagram cannot lose application data, since SCTP guarantees retransmission.

The producer backpressure implemented by #575 was related to the first flow, messages that have not yet entered str0m, where it is up to us to control the order of things.

If we shift the focus from correctness to efficiency, then yes, it would be better to have a queue of pending UDP packets, because relying on str0m to recover here could be pretty slow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading now #586 (comment) I think I got confused about how the 'ordered' flag works within str0m

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropping an outgoing UDP datagram cannot lose application data

It doesn't "lose" it in the precise sense, but because the packets arrive not in order now, and we do not reorder them on the application protocol level, the actual stream data will be garbage.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't know why libp2p spec is using unordered delivery. This is not what we get with TCP/WSS connections.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if I'm not mistaken, to follow golibp2p behavior, ordered should be set to true (which is also the default one) so that the peer is expected to hold upon retransmission of lost packet!

This would imply that the current implementation is expected to work fine! Beside a possible optimization by manually re-transmitting the packet instead of waiting for str0m detecting it!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the last message was sent without reading your replies, gh didn't show them up to now.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let's set ordered: true. The spec says the implementations MAY expose unordered, so we are free to use ordered as well:
https://github.com/libp2p/specs/blob/master/webrtc/README.md#ordering

Comment thread src/transport/webrtc/connection.rs Outdated
Comment thread src/transport/webrtc/connection.rs Outdated
"str0m rejected timeout input, closing connection",
);
return self.on_connection_closed().await;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the str0m docs, I don't think we need to pass Input::Timeout before trying to receive the incoming UDP packet. I.e., the entire special case can be removed.

Given the select! below is biased, I would put the handle_input(Input::Timeout(...)) the second after the recv() arm, and this should be enough.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are totally right, doc:

... poll_output, the function will only produce more output again
when one of two things happen:
- The polled timeout is reached.
- New network input. 

What does this mean is that we can safely remove this special case.

I just don't follow why we should move handle_input(Input::Timeout(...)) in second position just after the recv()?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't follow why we should move handle_input(Input::Timeout(...)) in second position just after the recv()?

I would prioritize pumping the network traffic over processing the commands. This can be triggered under load when multiple futures resolve at the same time, and in this case we want to free/process the network buffers first before processing the user commands.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current approach won't break anything, but we can try to process user commands only to discover we are backpressured by networking, then process networking, and only then process commands. With prioritizing networking we spare one poll cycle in such situation.

Comment thread src/transport/webrtc/substream.rs Outdated
source: SocketAddr,
destination: SocketAddr,
) -> (Rtc, ChannelId) {
) -> crate::Result<(Rtc, ChannelId)> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This fixes the git issue for the make_rtc_client panics!

Comment thread src/transport/webrtc/mod.rs
Copy link
Copy Markdown
Collaborator

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! One question, though: why can't we do without locks and implement communication betwee Substrem <-> SubstreamHandle purely over channels?

Comment on lines +206 to +207
// In practice this should never happen because SCTP guarantees the order
// of messages, thus no other message is expected after a Reset.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SCTP is configured to unordered as per libp2p WebRTC spec:


so there is no such guarantee.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is a mistake and should be moved to ordered: true

Copy link
Copy Markdown
Collaborator

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had another look! Thanks for switching to AtomicU8s 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants