Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.x] Videoroom use-after-free when unsubscribing and closing a room #3242

Closed
thatsmydoing opened this issue Jul 13, 2023 · 29 comments
Closed
Labels
bug multistream Related to Janus 1.x

Comments

@thatsmydoing
Copy link
Contributor

thatsmydoing commented Jul 13, 2023

What version of Janus is this happening on?
6d2dd1b44d16d51c086f58f5ffc59c87a5e9b0c7

Have you tested a more recent version of Janus too?
Yes. fe993140577ff91ec626eb7b5f03a2898809674f

Was this working before?
I'm not sure, it also crashes on 1.0.4. In practice, it doesn't crash that often. Maybe once a day in our usage.

Is there a gdb or libasan trace of the issue?
Reproduced locally against fe993140577ff91ec626eb7b5f03a2898809674f, https://gist.github.com/thatsmydoing/c8e82a531f3431cb8f066e4b54cb6dec

Additional context
I can reproduce it pretty reliably with https://github.com/thatsmydoing/janus-gateway-2775/

@thatsmydoing thatsmydoing added the multistream Related to Janus 1.x label Jul 13, 2023
@lminiero
Copy link
Member

Ah, it looks like a cleanup code stepping on its own feet: we're looping on stream->publisher_streams, which is a GList, but while iterating there (list = list->next), we call janus_videoroom_subscriber_stream_remove which updates that same list. As such, the list pointer we're using may not be a valid pointer anymore, because destroyed in the process, hence the crash. I'll try to remember why exactly the code was written like that, and to come up with a fix.

@lminiero lminiero added the bug label Jul 13, 2023
@lminiero
Copy link
Member

If only I could replicate the problem, though... I'm at attempt 900 with your tool and still nothing 🤣

@thatsmydoing
Copy link
Contributor Author

Oh no. On my machine I get the error on the 1st attempt. Let me check again.

@lminiero
Copy link
Member

I stopped the tool at about 2000 attempts, as I don't think it's going to die on me at this point 🙂

@thatsmydoing
Copy link
Contributor Author

I'm not sure what to say, I tried in a different environment and it's still consistently crashing on the first attempt. If you've cloned the repository before, I've pushed some changes which you may not have pulled? It also only reliably crashes with asan, otherwise it just keeps going.

Otherwise, it might be the library versions being used? I'll see if I can make docker reproduction case.

@lminiero
Copy link
Member

lminiero commented Jul 13, 2023

No, there's no further commits if I do a git pull, and my local Janus instance is indeed always built with libasan, since I use that to track issues I stumble upon while developing.

@thatsmydoing
Copy link
Contributor Author

I think this depends on the environment. I can reliably reproduce on nix and Archlinux but not on Debian 12. I'll see if using the same lib versions as Debian avoid the issue.

@thatsmydoing
Copy link
Contributor Author

I've yet to narrow it down to the exact library, but downgrading libraries does make it non-reproducible.

Bad (Reproduces Crash)
gcc-12.3.0
glib-2.76.3
glibc-2.37-8
jansson-2.14
libconfig-1.7.3
libnice-0.1.21
libsrtp-2.5.0
libwebsockets-4.1.6
openssl-3.0.9
zlib-1.2.13
Good (No Crash)
gcc-11.3.0
glib-2.74.3
glibc-2.35-224
jansson-2.14
libconfig-1.7.3
libnice-0.1.18
libsrtp-2.4.2
libwebsockets-4.1.6
openssl-3.0.9
zlib-1.2.13

My guess is it might be glib as most of the lib versions under "Bad" match that of Debian 12. I'll dig a bit more.

@thatsmydoing
Copy link
Contributor Author

I can confirm that downgrading glib to 2.74 seems to fix the issue for us (at least with the provided test case). So I guess there's something in glib 2.76 that breaks things?

@lminiero
Copy link
Member

I'm on 2.74 too (2.74.7 on Fedora). That said, my guess is that part I mentioned is where things break, so I'll come back to you with a quick patch you can try to see if it improves things.

@lminiero
Copy link
Member

@thatsmydoing can you check if this tiny patch helps?

diff --git a/src/plugins/janus_videoroom.c b/src/plugins/janus_videoroom.c
index a7a7c2b5..9bf3a053 100644
--- a/src/plugins/janus_videoroom.c
+++ b/src/plugins/janus_videoroom.c
@@ -10432,10 +10432,10 @@ static void *janus_videoroom_handler(void *data) {
 										list = list->next;
 										continue;
 									}
-									janus_videoroom_subscriber_stream_remove(stream, ps, TRUE);
 									if(stream->type != JANUS_VIDEOROOM_MEDIA_DATA)
 										changes++;
 									list = list->next;
+									janus_videoroom_subscriber_stream_remove(stream, ps, TRUE);
 								}
 								temp = temp->next;
 							}

All it does is moving the janus_videoroom_subscriber_stream_remove to the end of the inner loop, so that updating the pointer while traversing is not impacted in case that pointer is the one that's destroyed. I have not tested this myself.

@thatsmydoing
Copy link
Contributor Author

That might be it. With the patch, even on glib 2.76, I can't reproduce it with my test case anymore.

@lminiero
Copy link
Member

That's good news! You may want to ensure you don't see any other regression though. If not, I can push the fix upstream.

@thatsmydoing
Copy link
Contributor Author

I'll be trying this out on production today. I'll let you know around next week if we see any more crashes.

@thatsmydoing
Copy link
Contributor Author

With the patch and glib 2.76, we unfortunately had a crash elsewhere this time

janus_videoroom_publisher_stream *ps = list->data;
from
janus_videoroom_hangup_media_internal(session);

@lminiero
Copy link
Member

Please share dumps as you did last time.

@thatsmydoing
Copy link
Contributor Author

I couldn't reproduce it locally, so I only have a gdb backtrace from coredumps.

The first logs are running on thatsmydoing@155c456, this is branched off 6d2dd1b and includes your patch as well as an unrelated patch that shouldn't affect this.

Logs
[7825799502076340] Updating existing session
[7540571102874557] Updating existing session
[3620882643052495] WebRTC resources freed; 0x7f3ab82b2820 0x7f3ac805a410
[3620882643052495] Handle and related resources freed; 0x7f3ab82b2820 0x7f3ac805a410
[7825799502076340] Negotiation update, checking what changed...
[7540571102874557] Negotiation update, checking what changed...
[4877804757315329] Updating existing session
[WARN] [8270445183568114] Didn't receive video (substream #2) for more than 1 second(s)...
[4877804757315329] Negotiation update, checking what changed...
[WARN] [4287692750950437] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [3252118253804503] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/20 08:41:09:2406] N: __lws_lc_tag:  ++ [wsisrv|3613|adopted] (39)
[libwebsockets][NOTICE] [2023/07/20 08:41:09:2407] N: __lws_lc_untag:  -- [wsisrv|3613|adopted] (38) 105μs
[WARN] No packet received on substream 1 for a while, falling back to 0
[WARN] [8270445183568114] Didn't receive video (substream #2) for more than 1 second(s)...
[WARN] [8270445183568114] Didn't receive audio for more than 1 second(s)...
[WARN] No packet received on substream 1 for a while, falling back to 0
Detaching handle from JANUS VideoRoom plugin; 0x7f3ab8098760 0x7f3ab81293e0 0x7f3ab8098760 0x7f3ab831cdf0
[janus.plugin.videoroom-0x7f3ab81293e0] No WebRTC media anymore; 0x7f3ab8098760 0x7f3ab831cdf0
[6967737102373463] Updating existing session
[8270445183568114] WebRTC resources freed; 0x7f3ab8098760 0x7f3ab80a99e0
Detaching handle from JANUS VideoRoom plugin; 0x7f3ab81ca960 0x7f3ab8369a40 0x7f3ab81ca960 0x7f3ab81ef150
[janus.plugin.videoroom-0x7f3ab8003630] No WebRTC media anymore; 0x7f3ab82131e0 0x7f3ab8305800
[janus.plugin.videoroom-0x7f3ab8369a40] No WebRTC media anymore; 0x7f3ab81ca960 0x7f3ab81ef150
Detaching handle from JANUS VideoRoom plugin; 0x7f3ab82131e0 0x7f3ab8003630 0x7f3ab82131e0 0x7f3ab8305800
[8270445183568114] Handle and related resources freed; 0x7f3ab8098760 0x7f3ab80a99e0
GDB Backtrace
#0  janus_videoroom_hangup_media_internal (session_data=session_data@entry=0x7f3ab8305800) at plugins/janus_videoroom.c:8620
#1  0x00007f3acd0cf9a0 in janus_videoroom_hangup_media (handle=) at plugins/janus_videoroom.c:8483
#2  0x0000000000448790 in janus_ice_outgoing_traffic_handle ()
#3  0x000000000044b701 in janus_ice_outgoing_traffic_dispatch ()
#4  0x00007f3ad03d2d7d in g_main_context_dispatch () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#5  0x00007f3ad03d3028 in g_main_context_iterate.constprop () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#6  0x00007f3ad03d32df in g_main_loop_run () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#7  0x000000000043d81c in janus_ice_handle_thread ()
#8  0x00007f3ad03ff64d in g_thread_proxy () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#9  0x00007f3acfbb6dd4 in start_thread () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#10 0x00007f3acfc389b0 in clone3 () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6

While debugging, I saw a similar bug to the one fixed by your patch and fixed it. Unfortunately, it was unrelated to the crash and didn't fix it. Here are the logs running against thatsmydoing@2d1b88a.

Logs
[3530032410032467] Updating existing session
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [121364506572113] Didn't receive video (substream #2) for more than 1 second(s)...
[3530032410032467] Negotiation update, checking what changed...
[2740559208127164] Negotiation update, checking what changed...
[WARN] [4890055471783842] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [121364506572113] Didn't receive video for more than 1 second(s)...
[WARN] [5426595611094146] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:01:48:8014] N: __lws_lc_tag:  ++ [wsisrv|29cc|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:01:48:8015] N: __lws_lc_untag:  -- [wsisrv|29cc|adopted] (37) 91μs
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:01:58:8017] N: __lws_lc_tag:  ++ [wsisrv|29cd|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:01:58:8018] N: __lws_lc_untag:  -- [wsisrv|29cd|adopted] (37) 68μs
[WARN] [6318269714873560] Didn't receive video for more than 1 second(s)...
[WARN] [6318269714873560] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [6318269714873560] Unknown SSRC, dropping packet (SSRC 966292906)...
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:02:08:8017] N: __lws_lc_tag:  ++ [wsisrv|29ce|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:08:8017] N: __lws_lc_untag:  -- [wsisrv|29ce|adopted] (37) 92μs
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[libwebsockets][NOTICE] [2023/07/23 11:02:18:8015] N: __lws_lc_tag:  ++ [wsisrv|29cf|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:18:8016] N: __lws_lc_untag:  -- [wsisrv|29cf|adopted] (37) 83μs
[WARN] [121364506572113] Didn't receive video (substream #1) for more than 1 second(s)...
[WARN] [6318269714873560] Didn't receive video for more than 1 second(s)...
[WARN] [6318269714873560] Didn't receive video (substream #1) for more than 1 second(s)...
[libwebsockets][NOTICE] [2023/07/23 11:02:28:8014] N: __lws_lc_tag:  ++ [wsisrv|29d0|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:28:8015] N: __lws_lc_untag:  -- [wsisrv|29d0|adopted] (37) 76μs
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[libwebsockets][NOTICE] [2023/07/23 11:02:38:8013] N: __lws_lc_tag:  ++ [wsisrv|29d1|adopted] (38)
[libwebsockets][NOTICE] [2023/07/23 11:02:38:8014] N: __lws_lc_untag:  -- [wsisrv|29d1|adopted] (37) 80μs
[WARN] [121364506572113] Unknown SSRC, dropping packet (SSRC 3329985358)...
[janus.plugin.videoroom-0x7f7590335020] No WebRTC media anymore; 0x7f7590052320 0x7f75901da380
[3658334612854888] Updating existing session
[4278699205407511] WebRTC resources freed; 0x7f7590052320 0x7f759040dd90
Detaching handle from JANUS VideoRoom plugin; 0x7f7590052320 0x7f7590335020 0x7f7590052320 0x7f75901da380
[4278699205407511] Handle and related resources freed; 0x7f7590052320 0x7f759040dd90
Detaching handle from JANUS VideoRoom plugin; 0x7f7590372010 0x7f7590002330 0x7f7590372010 0x7f759006a530
Detaching handle from JANUS VideoRoom plugin; 0x7f759002f890 0x7f759002dcb0 0x7f759002f890 0x7f759003fff0
[janus.plugin.videoroom-0x7f7590002330] No WebRTC media anymore; 0x7f7590372010 0x7f759006a530
[janus.plugin.videoroom-0x7f759002dcb0] No WebRTC media anymore; 0x7f759002f890 0x7f759003fff0
GDB Backtrace
#0  0x00007f75a4e9c538 in janus_videoroom_hangup_media_internal (session_data=session_data@entry=0x7f759003fff0) at plugins/janus_videoroom.c:8621
#1  0x00007f75a4e9d9a0 in janus_videoroom_hangup_media (handle=) at plugins/janus_videoroom.c:8483
#2  0x0000000000448790 in janus_ice_outgoing_traffic_handle ()
#3  0x000000000044b701 in janus_ice_outgoing_traffic_dispatch ()
#4  0x00007f75a81a0d7d in g_main_context_dispatch () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#5  0x00007f75a81a1028 in g_main_context_iterate.constprop () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#6  0x00007f75a81a12df in g_main_loop_run () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#7  0x000000000043d81c in janus_ice_handle_thread ()
#8  0x00007f75a81cd64d in g_thread_proxy () from /nix/store/fr53rm643lg0rp4qmgzzpwsz5xxdqwk0-glib-2.76.3/lib/libglib-2.0.so.0
#9  0x00007f75a7984dd4 in start_thread () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6
#10 0x00007f75a7a069b0 in clone3 () from /nix/store/3n58xw4373jp0ljirf06d8077j15pc4j-glibc-2.37-8/lib/libc.so.6

@lminiero
Copy link
Member

A libasan trace would be better.

@thatsmydoing
Copy link
Contributor Author

Here's one https://gist.github.com/thatsmydoing/05750ff32b8e5b0f3d3fdbd6f813d1ad running against thatsmydoing@2d1b88a

I do have another one that has a slightly different error and line https://gist.github.com/thatsmydoing/8a577c6c26005cde8ef70b6e0e181c09

@lminiero
Copy link
Member

Looks like a race condition, that I think we correctly identified now. Unfortunately it's a also a bit tricky to address and solve. I'll let you know when there's a patch you can test.

@lminiero
Copy link
Member

lminiero commented Aug 1, 2023

@thatsmydoing please test the PR above. It already includes the other small change I pasted a few weeks ago, so you' won't need to reapply that yourself again.

@lminiero
Copy link
Member

lminiero commented Aug 4, 2023

@thatsmydoing ping 🙂

@thatsmydoing
Copy link
Contributor Author

Ah sorry, I have it running now and it seems good so far. I'll update you next week after having let it run for a bit more.

I think the PR also needs

Patch
diff --git a/src/plugins/janus_videoroom.c b/src/plugins/janus_videoroom.c
index cdf42837..fb0ba279 100644
--- a/src/plugins/janus_videoroom.c
+++ b/src/plugins/janus_videoroom.c
@@ -8633,8 +8633,8 @@ static void janus_videoroom_hangup_media_internal(gpointer session_data) {
                                        }
                                        list = list->next;
                                }
+                               temp = temp->next;
                                janus_videoroom_subscriber_stream_remove(s, NULL, TRUE);
-                               temp = temp->next;
                        }
                        /* Free streams */
                        g_list_free(subscriber->streams);

@lminiero
Copy link
Member

lminiero commented Aug 4, 2023

No it's there already.

@lminiero
Copy link
Member

lminiero commented Aug 4, 2023

Ah you mean a different patch than mine I guess?

@thatsmydoing
Copy link
Contributor Author

Yeah, it's a similar fix to the first patch, just in a different area.

@lminiero
Copy link
Member

lminiero commented Aug 4, 2023

Found another one too, pushed.

@thatsmydoing
Copy link
Contributor Author

I've not seen any crashes since I used the PR patchset so I think it's good.

@lminiero
Copy link
Member

lminiero commented Aug 9, 2023

Ack, thanks for the feedback! I'll merge then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug multistream Related to Janus 1.x
Projects
None yet
Development

No branches or pull requests

2 participants