Extra question about horizontally scaled Stunner #132

Closed
dbason opened this issue Mar 22, 2024 · 3 comments
Labels
type: question Further information is requested

Comments

@dbason

dbason commented Mar 22, 2024

I've been trying to understand a bit more how ICE works, particularly in the case of multiple Stunner pods in the symmetric ICE model.

In #31 (comment) the following was mentioned:

I'm not sure we're using the word "looping back" in the same sense. We use this term to mean that in the headless model, when there is no media server at all, both clients open a relay candidate on STUNner and then these magically get looped back to one another during the ICE conversation, connecting the two clients. This works because both relay candidates contain pod IPs, so this is a fairly standard TURN server use case; it is just that the relay candidates are opened "inside" the cluster.

The communication that seems to be established is client -> stunner_a <-> stunner_b <- media server. Is there any documentation or details about how the stunner_a <-> stunner_b communication is established by ICE? Or is this something that Stunner does on top of ICE?

@rg0now rg0now added the type: question Further information is requested label Mar 22, 2024
@rg0now
Member

rg0now commented Mar 22, 2024

:-D This is one of those tricky parts in ICE and TURN that just magically works out of the box, without us having to configure or add anything. I was also puzzled when I first discovered this...:-)

Here is the process flow:

  • the client obtains a TURN allocation from stunner_a; suppose that the transport relay address (the UDP connection opened by STUNner inside the Kubernetes private pod-pod network, which it will use to forward the TURN payload received from the client) is IP_a:port_a, where IP_a is the IP of the stunner_a pod;
  • the client generates an ICE candidate of type relay with the relay address IP_a:port_a and sends it along to the media server via the signaling connection;
  • the media server opens a TURN allocation on stunner_b, suppose that the transport relay address is IP_b:port_b, where IP_b is now the IP of the stunner_b pod;
  • the media server sends the ICE relay candidate, including the relay address IP_b:port_b in it, to the client via the signaling connection;
  • the client starts ICE negotiation and, at a certain (very late) point during the ICE conversation, it selects the ICE candidate pair corresponding to the two ICE relay candidates (its own as the local candidate, plus the one received from the media server as the remote candidate);
  • the client creates a permission on the TURN allocation at stunner_a to send to IP_b (the IP corresponding to the remote ICE candidate in the selected ICE candidate pair);
  • the client tries to connect via its TURN allocation through the relay address IP_a:port_a to IP_b:port_b (the relay address in the remote ICE candidate), which just happens to be the transport relay address created by the media server on stunner_b (this is the only "magical" part, although there is really no magic at all: it is simply how ICE negotiation works over relay candidates);
  • the connection attempt will most probably fail, since the media server still hasn't created a TURN permission for receiving traffic from IP_a over its TURN allocation;
  • the media server also enters the ICE negotiation phase (in fact, with trickle ICE the negotiation starts immediately and the connection attempt is made right after the relay candidates become available);
  • the media server finally selects the relay-relay ICE candidate pair (note that this is always the least preferred ICE candidate pair, since TURN is considered "expensive" in the ICE RFC; that is only because TURN servers often add significant delay when hosted far from the media server, which is not the case for STUNner, whose extra latency/jitter is typically in the 10-100 microsecond range because it runs in the same cluster as the server);
  • the media server creates a TURN permission to IP_a, since for the server the remote ICE candidate is now the relay candidate received from the client (this is why you must add the service covering all stunnerd pods to the backend of the UDPRoute for symmetric ICE, otherwise STUNner will not grant the permission);
  • the media server tries to connect via the relay address IP_b:port_b to IP_a:port_a (the relay address of the client);
  • the connection request is now successful, since both parties have permission to send to the other, and both the sender and the receiver IP:port in the connection attempts (this is usually a STUN Binding request) are exactly the ones specified in the relay candidates;
  • ICE enters into the connected state and the media can start to flow between the client and the server over the negotiated connection.
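
The permission dance in the steps above can be sketched as a toy model. This is only an illustration of the logic, not STUNner's code or a real TURN implementation: the `Allocation` class, the `connectivity_check` helper, and the pod IPs and ports are all made up for the example, and the "allowed backends" set is an assumed stand-in for what the UDPRoute backend lets a client reach.

```python
from dataclasses import dataclass, field

# Hypothetical pod IPs and relay ports, chosen only for illustration.
IP_A, PORT_A = "10.0.0.11", 50001  # relay address opened on stunner_a
IP_B, PORT_B = "10.0.0.12", 50002  # relay address opened on stunner_b


@dataclass
class Allocation:
    """Toy TURN allocation: a relay address plus the peer IPs it has permissions for."""
    relay_ip: str
    relay_port: int
    allowed_backends: set  # assumption: peer IPs the route backend lets us reach
    permissions: set = field(default_factory=set)

    def create_permission(self, peer_ip: str) -> bool:
        # The server refuses permissions for peers outside the configured backends.
        if peer_ip not in self.allowed_backends:
            return False
        self.permissions.add(peer_ip)
        return True

    def has_permission(self, peer_ip: str) -> bool:
        # TURN permissions are keyed on the peer IP only, not on the port.
        return peer_ip in self.permissions


def connectivity_check(src: Allocation, dst: Allocation) -> bool:
    """A check sent from src's relay address to dst's relay address.

    It gets through only if src holds a permission for dst's IP (to send)
    and dst holds a permission for src's IP (to receive).
    """
    return src.has_permission(dst.relay_ip) and dst.has_permission(src.relay_ip)


# Assumption: the route backend covers all stunnerd pods.
backends = {IP_A, IP_B}
client_alloc = Allocation(IP_A, PORT_A, backends)   # client's allocation on stunner_a
server_alloc = Allocation(IP_B, PORT_B, backends)   # media server's allocation on stunner_b

# The client goes first: it has a permission, the media server does not yet.
client_alloc.create_permission(IP_B)
first_attempt = connectivity_check(client_alloc, server_alloc)   # False

# The media server installs its permission and retries from its side.
server_alloc.create_permission(IP_A)
second_attempt = connectivity_check(server_alloc, client_alloc)  # True

# If the backend does not cover the stunnerd pods, the permission is refused.
restricted = Allocation(IP_B, PORT_B, allowed_backends=set())
permission_granted = restricted.create_permission(IP_A)          # False
```

The first check fails and the second succeeds purely because of the order in which the two permissions are installed, which is the behavior described in the list.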

The traffic flow is something like the below:

client <--TURN--> stunner_a(IP_a:port_a) <--UDP--> (IP_b:port_b)stunner_b <--TURN--> media server

Note that in symmetric ICE mode both parties have to go through STUNner, which adds some extra processing compared to asymmetric ICE, where only the client uses STUNner and the media server connects directly to the relay address via UDP, as it is in the same IP subnet as the stunnerd pods:

client <--TURN--> stunner_a(IP_a:port_a) <--UDP--> media server

Note that even with symmetric ICE the ICE negotiation usually succeeds over the asymmetric path and never reaches the relay-relay case. This is because, as mentioned above, ICE tries the candidate pairs in order of priority and the relay-relay pair is always the last one (minus TCP, I never know how TCP host/srflx candidates are prioritized versus UDP relay candidates). Before the relay-relay pair, however, ICE will try to pair the client's relay candidate with the media server's host candidate (which contains the media server's own pod IP), and this check will succeed for the reasons described above, so the relay-relay pair is never used.
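
To make the ordering concrete, here is a small sketch that ranks candidate pairs with the priority formulas from the ICE RFC (RFC 8445) and its recommended type preferences (host 126, prflx 110, srflx 100, relay 0). The candidate sets, the equal local preference for every candidate, and the assumption that the client is the controlling agent are illustrative choices, not something STUNner prescribes; TCP candidates are left out.

```python
# Recommended type preferences from RFC 8445.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    # priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


def pair_priority(controlling: int, controlled: int) -> int:
    # pair priority = 2^32 * MIN(G, D) + 2 * MAX(G, D) + (G > D ? 1 : 0)
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


# Assumed candidate sets: (client type, media server type), client is controlling.
client_types = ["host", "srflx", "relay"]
server_types = ["host", "relay"]
pairs = [(c, s) for c in client_types for s in server_types]

# Highest-priority pair first, which is the order ICE checks them in.
ranked = sorted(
    pairs,
    key=lambda p: pair_priority(candidate_priority(p[0]), candidate_priority(p[1])),
    reverse=True,
)

for client_type, server_type in ranked:
    print(f"client {client_type:5} <-> server {server_type}")
```

Under these assumptions the relay-relay pair comes out last, and the client-relay/server-host pair (the asymmetric path) is ranked ahead of it, so a successful check on that pair means the relay-relay pair is never needed.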

As you see, the symmetric ICE mode is quite complex and the fact that it usually falls back to asymmetric ICE confuses the hell out of people. That's why we advise against using it unless it's absolutely necessary.

Hope this clarifies the symmetric ICE case.

@dbason
Author

dbason commented Mar 24, 2024

That's a great explanation; thank you so much!

@rg0now
Member

rg0now commented Mar 25, 2024

Closing this for now, feel free to reopen if something comes up.

@rg0now rg0now closed this as completed Mar 25, 2024