
Media plane: Asymmetric ICE connection issues: no allocation found #143

Closed
dsgli opened this issue May 14, 2024 · 6 comments

dsgli commented May 14, 2024

Hi,

My setup is as follows: I have an on-premises k8s cluster, and I'm in the middle of integrating WebRTC streaming. The idea is that a client on their PC establishes a WebRTC connection to my streaming server running inside my Kubernetes cluster. The streaming server handles the SDP offer/answer exchange and, once the connection is made, starts streaming data.
The streaming server works as intended in standalone mode, both on Docker (certs handled by an nginx reverse proxy in front of it) and on bare metal (certs served directly).

Since my k8s is on premises (real kubeadm, not minikube or kind), my desired config looks as follows:

Client PC with browser -> nginx reverse proxy -> k8s cluster with stunner gateway(s) -> my simple server handling sdp exchange and streaming.

Based on the STUNner documentation (https://docs.l7mp.io/en/latest/DEPLOYMENT/#asymmetric-ice-mode), I concluded that I need the Media Plane setup with asymmetric ICE.

However, nothing works. (I'm not sure whether, in asymmetric ICE, the client browser should use iceTransportPolicy "relay" or "all"; I tried both, since the docs text mentions it but the config picture skips it, and neither works.)
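
For reference, the browser-side peer connection config I'm testing looks roughly like this (a sketch only; the hostname and the user-1/pass-1 credentials are the placeholder values from my tests):

// Rough sketch (TypeScript) of my browser-side config; the TURN URL and the
// user-1/pass-1 credentials are the test values mentioned below, not secrets.
const pc = new RTCPeerConnection({
  iceServers: [
    {
      urls: "turn:mycluster.mydnsentry.rnd:3478?transport=udp",
      username: "user-1",
      credential: "pass-1",
    },
  ],
  // Tried both "relay" and "all" here; neither worked.
  iceTransportPolicy: "relay",
});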

What I already checked:

  • my STUNner gateways have external IPs assigned by MetalLB, and my nginx reverse proxy redirects to those gateways
  • a test with iperf works as intended
  • no STUN config is delivered to the browser client
  • only a TURN config is delivered to the browser client
  • no RTCPeerConnection ICE config at all is delivered to the server
  • my UDP gateway shows the following logs:

10:59:41.847770 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:24478
10:59:41.876352 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:10726
10:59:41.928294 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:26005
10:59:42.200412 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:26005
10:59:42.200461 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:24478
10:59:42.200528 turn.go:239: turn INFO: permission denied for client 10.64.4.2:24478 to peer 10.10.246.228 (suppressed 6 log events)
10:59:42.200586 server.go:202: turn ERROR: Failed to handle datagram: failed to handle Send-indication from 10.64.4.2:30068: no allocation found 10.64.4.2:30068:[::]:3478
10:59:42.220967 server.go:202: turn ERROR: Failed to handle datagram: failed to handle Send-indication from 10.64.4.2:24478: unable to handle send-indication, no permission added: 89.25.216.14:61795

  • I also checked https://docs.l7mp.io/en/latest/examples/direct-one2one-call/; its troubleshooting section says: "No ICE candidate appears: Most probably this occurs because the browser's ICE configuration does not match the running STUNner config. Check that the ICE configuration returned by the application server in the registerResponse message matches the output of stunnerctl config. Examine the stunner pods' logs (kubectl logs...): permission-denied messages typically indicate that STUN/TURN authentication was unsuccessful." But even when I went with the default user-1/pass-1 credentials, still no luck.
  • Trickle ICE (https://webrtc.github.io/samples/src/content/peerconnection/trickle-ice/) shows:

The server turn:mycluster.mydnsentry.rnd:3478?transport=udp returned an error with code=701:
TURN host lookup received error.

I also tried symmetric ICE, but no luck.

I'm not sure how to debug further; I'd be glad if you could point me in some direction.

rg0now added the priority: high, status: confirmed, and type: question labels May 14, 2024

dsgli commented May 14, 2024

I noticed that when accessing from Chrome I get different/additional logs (the logs in the previous message were from Firefox):

12:25:28.289821 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:19946
12:25:32.735157 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:20471
12:25:32.735378 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:19757
12:25:32.928150 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:20471
12:25:32.931463 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:19757
12:25:32.931498 turn.go:239: turn INFO: permission denied for client 10.64.4.2:19757 to peer 89.25.216.14 (suppressed 1 log events)
12:25:32.931560 server.go:202: turn ERROR: Failed to handle datagram: failed to handle CreatePermission-request from 10.64.4.2:15570: no allocation found 10.64.4.2:15570:[::]:3478
12:25:32.931737 server.go:202: turn ERROR: Failed to handle datagram: failed to handle Send-indication from 10.64.4.2:19757: unable to handle send-indication, no permission added: 10.10.246.228:42681
12:25:33.697695 handlers.go:25: stunner-auth INFO: static auth request: username="user-1" realm="stunner.l7mp.io" srcAddr=10.64.4.2:19757

Not sure if it's important, but I wanted to add it just in case it's meaningful.


rg0now commented May 14, 2024

Thanks for the clear problem description.

We've already seen this. The problem seems to be caused by STUNner running behind an nginx UDP proxy: for some reason, every UDP packet we receive seems to come from a different UDP source port. Since TURN identifies allocations by the IP 5-tuple (source/destination IP address, source/destination UDP port, and IP protocol), this breaks the TURN state machine. In other words, for each TURN message (like a CreatePermission or a Send-indication) that assumes a prior TURN message (like an Allocate) was received from the same 5-tuple, we get a "no allocation found" error, like you see in your logs:

10:59:42.200586 server.go:202: turn ERROR: Failed to handle datagram: failed to handle Send-indication from 10.64.4.2:30068: no allocation found 10.64.4.2:30068:[::]:3478

Now, since this is the first time STUNner sees a TURN message from UDP source port 30068, it tries to look up the corresponding session state, fails to find anything, and signals an error.
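
To illustrate (a simplified TypeScript sketch of the idea, not the actual pion/turn code): allocations are stored under the full 5-tuple, so a packet arriving from a new source port simply misses the lookup.

// Simplified sketch of 5-tuple-keyed allocation lookup; an illustration only,
// not the real pion/turn implementation.
type FiveTuple = {
  proto: "udp" | "tcp";
  srcAddr: string;
  srcPort: number;
  dstAddr: string;
  dstPort: number;
};

const key = (t: FiveTuple) =>
  `${t.proto}:${t.srcAddr}:${t.srcPort}->${t.dstAddr}:${t.dstPort}`;

const allocations = new Map<string, object>();

// The allocation was created when the Allocate request arrived from port 24478...
allocations.set(
  key({ proto: "udp", srcAddr: "10.64.4.2", srcPort: 24478, dstAddr: "::", dstPort: 3478 }),
  {},
);

// ...but nginx forwards the next Send-indication from port 30068, so the lookup
// misses and the server reports "no allocation found".
const next: FiveTuple = {
  proto: "udp", srcAddr: "10.64.4.2", srcPort: 30068, dstAddr: "::", dstPort: 3478,
};
console.log(allocations.get(key(next)) ?? "no allocation found");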

We never really debugged what causes this, nginx or the kube-proxy or some weird interaction between the two, but it seems that STUNner behind nginx does not work.

There are various possible workarounds:

  • Find the magic config that enables UDP connection tracking in nginx. I'm no nginx expert, so I don't know whether such a thing exists.
  • Remove nginx from the loop and expose STUNner directly. This will also reduce your latency/jitter, but if you keep nginx in the loop to integrate with cert-manager, then TURN/UDP/DTLS will not work (unless you use a signed certificate with STUNner). We have cert-manager integration planned on the drawing board, but that will definitely come after v1. Plain TURN/UDP should work fine, though; note that the TURN payload will be encrypted anyway.
  • Use TURN/TCP/TLS: nginx always creates conntrack state for TCP, so this option is known to work (see the sketch below). Of course, TCP is suboptimal for real-time traffic, so this may have a massive negative impact on latency/jitter.
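
For the TURN/TCP/TLS route, the client-side ICE config would look something like this (a sketch only; the hostname, port, and credentials are placeholders to be adjusted to your Gateway):

// Sketch of a client ICE config using TURN over TLS/TCP; the hostname, port and
// credentials below are placeholders, not values confirmed for this setup.
const pc = new RTCPeerConnection({
  iceServers: [
    {
      urls: "turns:mycluster.mydnsentry.rnd:443?transport=tcp",
      username: "user-1",
      credential: "pass-1",
    },
  ],
  iceTransportPolicy: "relay",
});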


dsgli commented May 16, 2024

I think some nginx magic should work; I'll try that and post my results regardless of success/failure. For now, just to test, I dropped nginx from the loop (for WebRTC traffic only), so my setup looks as follows:
(attached diagram: webrtc_issues)
Note: exposing port 3478 for UDP and 3479 for TCP as the 'public' access points is not a typo; in iptables I redirect them to the proper MetalLB external-IP:3478 ports.

However, I ended up worse off than before (with the nginx proxy for TCP/UDP STUNner traffic, ICE gathering was fine): now my ICE gathering is stuck, so I'm debugging it at the moment. I've also checked the logs of the STUNner TCP/UDP gateways, but absolutely no logs are produced (before, when I used the nginx proxy, I had some logs), so I guess my error must be somewhere earlier in the path.

Are the STUNner UDP/TCP gateways the "first" point of contact from STUNner's perspective?


rg0now commented May 16, 2024

You should lecture on GitHub issue writing skills at a university...

Anyway, if there are no logs in stunnerd, that means the traffic does not even reach the TURN server. I guess the problem is somewhere in the MetalLB part.

As to whether STUNner is a "first point of contact", I'd say it depends. There can be any number of L3/L4 middleboxes (firewalls, NATs, tunnel endpoints) between the client and the stunnerd pods; TURN is designed to survive that. However, with UDP we often see MTU issues causing weird errors, so you'd better avoid extensive tunneling. Try testing with TCP: if TURN/TCP works, then your setup is fine and the issue is missing conntrack state or an MTU problem somewhere on the client-to-stunnerd path.


rg0now commented Jun 10, 2024

Closing this in preparation for v1, feel free to reopen if you have any new input.

rg0now closed this as completed Jun 10, 2024

dsgli commented Jun 10, 2024

Sure, I'll add my insights when I get back to it. I've had other priorities, but for now my traces point to necessary changes within kube-proxy, as all the iptables stuff seems to be in order.
