Operational query on running stunner in headless mode #31
Oh sorry, they can use an arbitrary stunner and talk to each other.
Well... I guess the confusion is still here. Let's say I have stunner-A and stunner-B, a browser as a video player, and an application in k8s which streams video frames. Now, when the browser connects to stunner-A and the application connects to stunner-B, will this work out? The browser would have a relay address, which is the pod IP + random port of stunner-A, and the application most likely uses its pod IP + random port as a host candidate. So how does stunner-A forward to the application on its host candidate?
IIUC, if the application talks to stunner-B then, by default, the application would not be able to whitelist the peer address in stunner-A, so the browser should not be able to reach the application via stunner-A. Is that correct?
I'll try to answer your questions below, let me know if I am being too fast (or too slow).
The idea here is that the user, located outside the cluster behind potentially several layers of NAT, will connect via STUNner-A (or STUNner-B) by creating a transport relay connection on STUNner. The user-facing side of the transport relay connection runs TURN, so that it is immune to NATs, while the cluster-facing side uses plain UDP. Now the client will ask STUNner to return the IP address and port corresponding to the cluster-facing side of the transport relay connection; this is the so-called "relay candidate". Note that the IP address here is STUNner-A's pod IP, while the port is a semi-random port (STUNner allows you to customize the upstream port range). The application on the other hand, being located inside the cluster, will not need STUNner at all: it will create a humble host candidate over its pod IP. (Note that this is contingent on configuring your application with no external STUN or TURN servers at all; otherwise the client and the application may establish a media connection circumventing STUNner, which is most probably not what you want. Curiously, TURN magic also allows us to successfully connect the client and the application in the case when you mistakenly set STUNner as a TURN server for the application, but this is more a funny side effect than a useful feature here.)
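To make the asymmetry concrete, here is a minimal pion (Go) sketch of the two configurations; the TURN URL and the static credentials are placeholders, substitute the values from your own STUNner Gateway:

```go
package main

import (
	"log"

	"github.com/pion/webrtc/v3"
)

func main() {
	// Client side (outside the cluster): uses STUNner as its TURN server.
	// URL, username and credential are placeholders for your Gateway config.
	clientPC, err := webrtc.NewPeerConnection(webrtc.Configuration{
		ICEServers: []webrtc.ICEServer{{
			URLs:       []string{"turn:<stunner-public-ip>:3478?transport=udp"},
			Username:   "my-user",
			Credential: "my-pass",
		}},
		// Restrict the client to relay candidates so media cannot bypass STUNner.
		ICETransportPolicy: webrtc.ICETransportPolicyRelay,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer clientPC.Close()

	// Application side (inside the cluster): no STUN/TURN servers at all,
	// so it generates only host candidates over its pod IP.
	appPC, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}
	defer appPC.Close()
}
```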
Exactly! Now the client passes the relay candidate over to the application, the application returns its own host candidate to the client over the signaling channel, and then they try to connect via this candidate pair. The client sends a packet to the application's host candidate via STUNner-A, which conveniently receives it over the external TURN connection and forwards it over its pod IP:port (the relay candidate) to the application's pod IP:port (the host candidate). Now, since pod-to-pod communication in Kubernetes must never involve a NAT, this connection setup attempt will be successful: the application finds that it receives the packet from the IP:port it was told in the relay candidate, and the same applies the other way around. ICE goes to the connected state and voilà, your WebRTC media is ingested into the cluster even though it commonly traverses 2-3 NATs while traveling from the client to the application via the Kubernetes container network. Does this answer your question?
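For completeness, a rough pion (Go) sketch of the candidate exchange; `sendToRemote` is a stand-in for whatever signaling channel you use (WebSocket, HTTP, etc.), it is not part of pion or STUNner:

```go
package main

import (
	"log"

	"github.com/pion/webrtc/v3"
)

// sendToRemote is a placeholder for your own signaling channel.
func sendToRemote(c webrtc.ICECandidateInit) {
	log.Printf("would signal candidate: %s", c.Candidate)
}

// onRemoteCandidate feeds a candidate received over signaling into the peer
// connection; ICE then runs connectivity checks over the candidate pairs
// (the client's relay candidate paired with the application's host candidate).
func onRemoteCandidate(pc *webrtc.PeerConnection, init webrtc.ICECandidateInit) {
	if err := pc.AddICECandidate(init); err != nil {
		log.Println("AddICECandidate:", err)
	}
}

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	// Trickle ICE: hand each locally gathered candidate to the peer.
	pc.OnICECandidate(func(c *webrtc.ICECandidate) {
		if c == nil {
			return // gathering finished
		}
		sendToRemote(c.ToJSON())
	})
}
```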
@rg0now Excellent explanation! Huge Thanks!
I noticed this too, but I was wondering: in the one-to-one call example, when we loop back the connection to …
However, there is a further question regarding the ICE procedure. By rights, the application inside the cluster should NOT need to configure stunner as its ICE server. But if both sides do use stunner, the flow would be:

Browser --> stunner-A (3478) [src: relay-A] --> [dst: relay-B] stunner-B --> the application

Is this correct? So it really does not matter which one connects to which stunner as the ICE server? Thanks in advance!
I'm not sure we're using the word "looping back" in the same sense. We use this term to mean that in the headless model, when there is no media server at all, both clients open a relay candidate on STUNner and then these magically get looped back to one another during the ICE conversation, connecting the two clients. This works because both relay candidates contain pod IPs, so this is a fairly standard TURN server use case; it is just that the relay candidates are opened "inside" the cluster. The one2one call demo uses the "media-plane setup" on the other hand, in which case the flow of packets is client_a -> stunner_a -> kurento -> stunner_b -> client_b and vice versa. In some sense this is also "looping back" media to STUNner; I guess this is what you meant.
Almost correct: I would make this more precise in the following way: "a client external to the cluster MUST use STUNner to open a transport relay connection, and the application located inside the cluster MAY use STUNner to open a transport relay candidate" (where MUST and MAY mean what they usually do in IETF parlance). Observe the difference: the case when the client generates only a host candidate (or a server-reflexive candidate using a public STUN server) but does not use STUNner, while the application opens a relay candidate via STUNner (and potentially further host candidates), will not work. This is because in this case the client's ICE candidates all contain public IPs while the application's candidate (obtained from STUNner) contains a private IP (a pod IP), and the former has no way to route packets to the latter. (The other way around should work, but I don't know ICE well enough to say whether or not it handles half-duplex connections; plus there is a NAT in the middle, so I guess this should not work either.) The other case (both the client and the application use STUNner) should work though (not that I have ever tested it). Feel free to try, and let me know what you obtain.
@rg0now Awesome! I believe I've learnt all I need. I am deploying STUNner into our production environment for our new product, where we also use … Cheers!
To that end, I also believe once we have a large number of … In my imagination, for instance, I can have a media server group of 5 servers, each of them acting as an SFU. The current streamers could then just send the frames to the media servers, and our clients would fetch the corresponding media tracks from those media servers. This is how I imagine scaling our service. Any suggestions or corrections would be greatly appreciated!
Can you give more detail? We are extremely interested in production deployments of STUNner: we want to learn what works and what doesn't in your setup. Let us know if we could maybe organize a call or something. That being said, STUNner does not really constrain the size of the backend pool, as long as all backends belong to the same Kubernetes service. (Well, this is not entirely true: the operator actually injects all the pod IPs as endpoint IP addresses into the STUNner config, so when you have thousands of pods in a single backend service this may take a while; but if you have thousands of media servers in your cluster then I guess this is the least of your worries.) So if it is an option for you to put all your pion streamers into a single Kubernetes service, then all should be fine. Quick note: STUNner in fact does not load-balance across the backend pool (yet): by the time STUNner receives a TURN allocation request, the client already knows which media server/application/backend it wants to talk to (the one corresponding to the ICE candidate received from the application), so STUNner merely checks whether that pod IP belongs to the Kubernetes service you specified as the UDPRoute backend; if it does, it lets the request through, otherwise it blocks it. Does this make any sense?
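For reference, a minimal sketch of such a UDPRoute; the Gateway, Service, and namespace names are placeholders, adapt them to your own setup:

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: UDPRoute
metadata:
  name: media-plane
  namespace: stunner
spec:
  parentRefs:
    - name: udp-gateway
  rules:
    - backendRefs:
        # all pods backing this service (e.g. your pion streamers) become
        # valid peers for TURN allocations made through the gateway
        - name: media-server-svc
          namespace: default
```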
Totally! Sure, I will try to put up more details of our setup. Besides, I believe we could probably find a time by mid-September to have a conversation via concall. Stay in touch! The deployment is as follows: our application in the cloud (k8s) is more or less like online gaming, where users are assigned to a dedicated server. The server renders user input into h264 frames and streams those frames to our …
Thanks @imcom, this was super insightful. Two minor notes:
That being said, I don't see a reason why the current STUNner version would not work with 10k endpoints, provided that the dataplane is scaled out to 10-20 pods and you give sufficient CPU to the operator. And if not, then your idea to inject media servers in between can be a good solution. Backend scalability has not been in focus yet, but I guess your input should be a good motivation for us to work on this: we know how to handle it, we just haven't had the time/resources to work on it yet.
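(For what it's worth, scaling the dataplane out is just a matter of raising the replica count of the stunner Deployment; the deployment name and namespace below are assumptions that depend on how you installed STUNner:)

```
kubectl -n stunner scale deployment stunner --replicas=10
```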
Lovely! We have not had a chance to do the benchmark on our side; would you mind elaborating on the tooling or methodology you used for it? Anyhow, your benchmark result looks sufficient and reasonable for our use case. Glad to hear that!
Just wanted to drop a line here that we have documented the ideas that came up during our discussion:
I'm closing this issue for now, feel free to reopen if there's any related issue you'd like to discuss.
Currently I am a bit confused with the scaling operation of stunner in the one-to-one call scenario. This is the setup: initially I have one stunner with a LoadBalancer service (public-facing IP) and a cluster service IP for the WebRTC client within k8s. This works fine as long as there is only one stunner pod. But once I scale the stunner pods to 3 instances, I would assume WebRTC would not establish, because there is no control over the LB and the cluster service as to which stunner the BIND requests actually land on. Correct?

So in this case, what should be done to scale? A naive way I could think of is to assign a new LB public address for each stunner and use a headless service within k8s. But this adds extra complexity: how should I ensure that both clients can use the same stunner?
Thanks in advance