Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.x] videoroom: remote publisher doesn't release RTP/RTCP ports #3345

Closed
boddob opened this issue Mar 19, 2024 · 10 comments
Closed

[1.x] videoroom: remote publisher doesn't release RTP/RTCP ports #3345

boddob opened this issue Mar 19, 2024 · 10 comments
Labels

Comments

@boddob
Copy link

boddob commented Mar 19, 2024

What version of Janus is this happening on?
1202 (commit: 3773de6)

Have you tested a more recent version of Janus too?
This is reproducible on the latest commit, we use 1200 (commit: e3a2aea) for production on which we first saw it.

Was this working before?
This issue has been there since we've started using it.

Is there a gdb or libasan trace of the issue?
Don't have a trace, but have logs with refcount debug enabled, which seems to point to a refcount issue. Shared the gist below.
https://gist.github.com/boddob/e5f82c19e664840083fc8f8f0f0979b7

The remote publisher refcount object to look at is "0xffff50001680". A cat of the gist content shows how the refcount seems to have some issue.

cat janus_refcount_gist.txt | grep 0xffff50001680
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7202:init] 0xffff50001680 (1)
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff50001680 (2)
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff50001680 (3)
[plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7313:increase] 0xffff50001680 (4)
[plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:12835:increase] 0xffff50001680 (5)
[plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff50001680 (2117)
[plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff50001680 (2116)
[plugins/janus_videoroom.c:janus_videoroom_publisher_dereference:2387:decrease] 0xffff50001680 (2115)
[plugins/janus_videoroom.c:janus_videoroom_publisher_destroy:2429:decrease] 0xffff50001680 (2114)
[plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:13131:decrease] 0xffff50001680 (2113)

Additional context

Background:
We use the remote publishing/SFU cascading feature to support video rooms on multiple Janus servers. We have a separate orchestrator service on each server, that does the job of adding and removing remote publishers. The design was inspired from the SFU cascading blog post. These are EC2 servers running on aarch64 CPU(s).

Problem:
Over time, the servers run out of free RTP/RTCP ports, and the request "add_remote_publisher" fails with the error code: 499, "Could not open UDP socket for RTP stream". This happens even though we ensure "remove_remote_publisher" is called for the same publisher once we're done.

Observations:

When testing this in an isolated environment, it was observed that the refcount for the remote publisher (initalized in line no. 7202 of the current src/plugin/janus_videoroom.c), doesn't behave as expected. At the point we try to remove the remote publisher, the refcount is a very high number, and the function "janus_videoroom_publisher_free" is never called. Interestingly, the refcount value seems to grow with time, i.e for how long the "remote publisher" was active and receiving data from another instance. I have shared the logs of this experiment, with refcounting enabled.

Our orchestrator that talks to Janus has multiple threads, and each one creates a different session, and each thread took up on requests/events. So, the session that adds the remote publisher could be different from the session that removes the remote publisher. Our initial suspicion was that, but the issue persists even if the orchestrator uses a single session/thread.

@boddob boddob added the multistream Related to Janus 1.x label Mar 19, 2024
@atoppi
Copy link
Member

atoppi commented Mar 19, 2024

Hi @boddob thanks for reporting.
Could you try the attached patch? It's on top of current master but it's just the removal of one line, it should not be difficult to adapt it to your current commit.

remove_extra_ref_vr_remote_pub.txt

@atoppi atoppi added the WIP label Mar 19, 2024
@boddob
Copy link
Author

boddob commented Mar 19, 2024

@atoppi thanks for sharing this. I'll give it a try and get back!

@boddob
Copy link
Author

boddob commented Mar 19, 2024

@atoppi I gave this a try and it works as expected now. I'm seeing that the sockets are released now. Thanks! Is there something more to test here to validate the change?

Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7202:init] 0xffff6000f840 (1)
Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff6000f840 (2)
Mar 19 10:46:48.477174 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7221:increase] 0xffff6000f840 (3)
Mar 19 10:46:48.480002 : [plugins/janus_videoroom.c:janus_videoroom_process_synchronous_request:7313:increase] 0xffff6000f840 (4)
Mar 19 10:46:48.480002 : [plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:12835:increase] 0xffff6000f840 (5)
Mar 19 10:46:48.982088 : [plugins/janus_videoroom.c:janus_videoroom_handler:9473:increase] 0xffff6000f840 (6)
Mar 19 10:46:49.004447 : [plugins/janus_videoroom.c:janus_videoroom_handler:9809:decrease] 0xffff6000f840 (5)
Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff6000f840 (4)
Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_stream_destroy:2344:decrease] 0xffff6000f840 (3)
Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_dereference:2387:decrease] 0xffff6000f840 (2)
Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_publisher_destroy:2429:decrease] 0xffff6000f840 (1)
Mar 19 10:48:52.771590 : [plugins/janus_videoroom.c:janus_videoroom_remote_publisher_thread:13130:decrease] 0xffff6000f840 (0)

@atoppi
Copy link
Member

atoppi commented Mar 19, 2024

I guess due to the nature of the fix a quick test is enough but I'll let @lminiero reply in case he feels more testing is needed.

@atoppi atoppi removed the WIP label Mar 19, 2024
@lminiero
Copy link
Member

If you can test it a few more days it will help, just to be sure. But I agree this is 99.99999% likely the cause of the issue.

@boddob
Copy link
Author

boddob commented Mar 20, 2024

@atoppi @lminiero thanks, I'll test for some more time and get back.

@atoppi
Copy link
Member

atoppi commented Mar 25, 2024

@boddob the patch has been merged, thanks for testing.

@boddob
Copy link
Author

boddob commented Mar 25, 2024

@atoppi sorry for not getting back in time. I've been running this under heavy loads, and I've not faced any issues.

@zevarito
Copy link
Contributor

zevarito commented Apr 4, 2024

I've deployed 2 servers with this patch, one about a week ago, and seems running fine, metrics doesn't show up improvements or regressions compared with the rest of the servers so far. Yesterday I've deployed a second one, that unfortunately crashed about 10hs later. Core dump core.hloop 2.1 was somehow truncated and I am not able to debug and post results. Just wanted to share this in case someone else is also facing a similar issue, for now I would prefer to no update the reset of the servers until have a better insight about the issue.

@boddob
Copy link
Author

boddob commented May 3, 2024

@zevarito I tested this patch for a week with the latest janus-gateway code at the time and hadn't seen any crash. For production, we have this running for over a month, but patched on an older janus-gateway version (1200, commit : e3a2aea). Will update here if I see any issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants