eventfd spinning #154

Closed
ploxiln opened this issue Feb 13, 2015 · 62 comments

@ploxiln
Contributor

ploxiln commented Feb 13, 2015

I figured out what was sometimes making threads "spin", consuming as much CPU as they can: write()s to an eventfd.

First I identified a thread that was "going crazy", and then counted syscalls with strace over a few seconds:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.84    0.712945           6    110297           write
 19.47    0.231949           6     36765           clock_gettime
 10.60    0.126276           7     18383           poll
  9.95    0.118586           6     18382           read
  0.14    0.001683          12       135        32 futex
------ ----------- ----------- --------- --------- ----------------
100.00    1.191439                183962        32 total

That seems like more write() calls than makes sense for sending RTP/SRTP packets. Let's see what it's writing...

clock_gettime(CLOCK_MONOTONIC, {823252, 946459022}) = 0
poll([{fd=15, events=POLLIN}, {fd=26, events=POLLIN}, {fd=28, events=POLLIN}, {fd=32, events=POLLIN}, {fd=34, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=31, events=POLLIN}, {fd=33, events=POLLIN}], 9, 2553) = 4 ([{fd=15, revents=POLLIN}, {fd=27, revents=POLLIN}, {fd=31, revents=POLLIN}, {fd=33, revents=POLLIN}])
read(15, "\6\0\0\0\0\0\0\0", 16)        = 8
clock_gettime(CLOCK_MONOTONIC, {823252, 946602023}) = 0
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
clock_gettime(CLOCK_MONOTONIC, {823252, 946880949}) = 0
poll([{fd=15, events=POLLIN}, {fd=26, events=POLLIN}, {fd=28, events=POLLIN}, {fd=32, events=POLLIN}, {fd=34, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=31, events=POLLIN}, {fd=33, events=POLLIN}], 9, 2553) = 4 ([{fd=15, revents=POLLIN}, {fd=27, revents=POLLIN}, {fd=31, revents=POLLIN}, {fd=33, revents=POLLIN}])
read(15, "\6\0\0\0\0\0\0\0", 16)        = 8
clock_gettime(CLOCK_MONOTONIC, {823252, 947022532}) = 0
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
clock_gettime(CLOCK_MONOTONIC, {823252, 947316721}) = 0

So it reads 8 bytes from fd 15, then writes 8 bytes to fd 15 six times, and the data is 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 each time (and it does this rapidly and continuously).

root@janus02:~# ls -l /proc/615/fd
total 0
lrwx------ 1 root root 64 Feb 13 17:38 0 -> /dev/null
l-wx------ 1 root root 64 Feb 13 17:38 1 -> /opt/log/janus/janus.log
lrwx------ 1 root root 64 Feb 13 17:38 10 -> socket:[5925975]
lrwx------ 1 root root 64 Feb 13 17:38 11 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Feb 13 17:38 12 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Feb 13 17:38 13 -> socket:[5924500]
lrwx------ 1 root root 64 Feb 13 17:38 14 -> socket:[5924501]
lrwx------ 1 root root 64 Feb 13 17:38 15 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Feb 13 17:38 16 -> socket:[5925728]
...

fd 15 is from eventfd()
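
For context on why those numbers line up: an eventfd is just a 64-bit counter in the kernel. Each write() adds its 8-byte value to the counter, and a read() returns the accumulated total and resets it, which is why six writes of 1 get drained by a single read of 6 on the next poll() iteration. A minimal standalone sketch of that behaviour (not Janus or glib code, just the raw syscalls):

#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, 0);            /* counter starts at 0 */
    if (efd < 0)
        return 1;

    uint64_t one = 1, total = 0;
    for (int i = 0; i < 6; i++)         /* six wakeups, like in the strace above */
        write(efd, &one, sizeof(one));

    read(efd, &total, sizeof(total));   /* drains the whole counter: prints 6 */
    printf("%llu\n", (unsigned long long)total);

    close(efd);
    return 0;
}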

Maybe it's some libnice or glib thing... I'll try to get a backtrace of the relevant thread.

@lminiero
Member

Thanks for that very detailed analysis! Looking forward to the backtrace.

@ploxiln
Contributor Author

ploxiln commented Feb 13, 2015

Here's the relevant backtrace, roughly:

  • write()
  • g_wakeup_signal()
  • g_main_context_add_poll_unlocked()
  • unblock_source()
  • g_main_dispatch()
  • g_main_context_dispatch()
  • g_main_context_iterate()
  • g_main_loop_run()
  • janus_ice_thread()
warning: process 4043 is a cloned process
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fe35ceac35d in write () from /lib/x86_64-linux-gnu/libpthread.so.0

Thread 1 (process 4043):
#0  0x00007fe35ceac35d in write () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007fe35e769c9f in g_wakeup_signal (wakeup=0x7fe34c003ea0) at /build/buildd/glib2.0-2.40.2/./glib/gwakeup.c:236
        one = 1
        res = <optimized out>
#2  0x00007fe35e72594e in g_main_context_add_poll_unlocked (context=<optimized out>, priority=<optimized out>, fd=<optimized out>) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:4155
        prevrec = <optimized out>
        nextrec = <optimized out>
        newrec = <optimized out>
#3  0x00007fe35e725a00 in unblock_source (source=source@entry=0x7fe34c004720) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:2995
        tmp_list = 0x7fe324001270
        __FUNCTION__ = "unblock_source"
#4  0x00007fe35e728e70 in g_main_dispatch (context=0x7fe34c009790) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3079
        dispatch = <optimized out>
        prev_source = <optimized out>
        was_in_call = 0
        user_data = 0x7fe30c07fe40
        callback = 0x7fe35ec4bfc0 <component_io_cb>
        cb_funcs = <optimized out>
        cb_data = 0x7fe34c012cf0
        need_destroy = <optimized out>
        source = 0x7fe34c004720
        current = 0x7fe34c013d60
        i = 2
#5  g_main_context_dispatch (context=context@entry=0x7fe34c009790) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3663
No locals.
#6  0x00007fe35e729048 in g_main_context_iterate (context=0x7fe34c009790, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3734
        max_priority = 2147483647
        timeout = 3846
        some_ready = 1
        nfds = <optimized out>
        allocated_nfds = 9
        fds = 0x1fb2240
#7  0x00007fe35e72930a in g_main_loop_run (loop=0x7fe34c003d30) at /build/buildd/glib2.0-2.40.2/./glib/gmain.c:3928
        __FUNCTION__ = "g_main_loop_run"
#8  0x00000000004148c9 in janus_ice_thread (data=0x1f9fc20) at ice.c:1237
        handle = 0x1f9fc20
        __FUNCTION__ = "janus_ice_thread"
        loop = <optimized out>
#9  0x00007fe35e74df05 in g_thread_proxy (data=0x7fe328009540) at /build/buildd/glib2.0-2.40.2/./glib/gthread.c:764
        thread = 0x7fe328009540
#10 0x00007fe35cea5182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#11 0x00007fe35cbd200d in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

@ploxiln
Contributor Author

ploxiln commented Feb 13, 2015

It seems to start when I start a capture, and not stop when I stop a capture... I suspect my plugin / API access is a significant contributing factor.

@ploxiln
Contributor Author

ploxiln commented Feb 13, 2015

It actually seems to be the viewer that triggers it for us; it sometimes goes away immediately, and sometimes after a few minutes (which might be related to our session handling).

@ploxiln
Contributor Author

ploxiln commented Feb 13, 2015

Well, I guess it's good news that this is easily reproducible in my VM, and with the demo pages: viewers spin threads. It might be our custom plugin though; echotest doesn't cause it.

@lminiero
Member

Do the streaming and videoroom plugins cause this too? Those are the plugins that have a viewer-specific scenario, so this might help narrow the scope.

I'm wondering if the cause is the sendonly nature of those instances: in that case, libnice is only used to send media, and incoming data is only triggered by RTCP from time to time. This might explain some of the poll-related bits I saw in your initial check, and the one in the backtrace, but wouldn't explain the spikes. A poll can indeed cause the CPU to spin, but usually only when fds are invalid. Not sure what we can do to fix this: an alternative might be to do the poll stuff manually and not rely on the recv callback, but this might make it less efficient than what's in libnice.

@ploxiln
Contributor Author

ploxiln commented Feb 14, 2015

Yup, I can reproduce the same symptom in videoroom aka Video MCU. Same strace results etc.

@ploxiln
Contributor Author

ploxiln commented Feb 14, 2015

btw, useful commands:

# show top threads:
top -H

# count system calls for a bit (stop with ctrl-c after a bit)
strace -c -p  $JANUS_THREAD_PID

# log system calls with arguments
strace -p  $JANUS_THREAD_PID

# show current callstack (thanks @jayridge)
gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" /opt/janus/bin/janus  $JANUS_THREAD_PID
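
To find $JANUS_THREAD_PID in the first place, each thread has its own id exposed by the kernel (the SPID/LWP column); something like the following works, assuming the process is simply named janus:

# list per-thread ids and CPU usage for the janus process
ps -T -p $(pidof janus) -o spid,pcpu,comm

# or watch the same thing live
top -H -p $(pidof janus)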

@lminiero
Member

Thanks, this is very helpful, I'll look into this as well, although I can't recall being able to reproduce this issue locally.

@ploxiln
Contributor Author

ploxiln commented Feb 16, 2015

I've run more experiments:

I've reproduced this with both just Firefox and just Chrome, with the latest janus-gateway built and linked against libnice-0.1.10, and against libnice-0.1.7.

However, I was not able to easily reproduce it with Chrome or Firefox with the latest janus-gateway built and linked against libnice-0.1.4 from the Ubuntu 14.04 main repo. (I'll be running with libnice-0.1.4 for a while, seeing if that works well enough for us. Also, I'll be on vacation in two days, so I won't be responsive.)
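
For anyone repeating this, one way to point a janus build at a specific libnice is to install that libnice into its own prefix and export the pkg-config and loader paths before rebuilding janus as usual (the paths here are illustrative):

# build and install the libnice under test into a private prefix
cd libnice-0.1.10
./configure --prefix=$HOME/nice-test && make && make install

# make the private copy visible to the janus build and at runtime,
# then rebuild janus as usual
export PKG_CONFIG_PATH=$HOME/nice-test/lib/pkgconfig:$PKG_CONFIG_PATH
export LD_LIBRARY_PATH=$HOME/nice-test/lib:$LD_LIBRARY_PATH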

Also, there appears to be a memory leak, while those threads spin, though I haven't looked into it:
(graphs attached: cpu-week, memory-week)

@lminiero
Member

So you mean libnice 0.1.4 should actually be much more robust in that respect than its more recent versions? I'll also have to investigate where the leak comes from; we're doing some stress tests with Valgrind on a different aspect, so I'll make sure I check this too.

@ploxiln
Contributor Author

ploxiln commented Mar 2, 2015

libnice-0.1.4 does not have this eventfd spinning, so that's what we've switched back to. However, janus does seem to be a bit less stable with libnice-0.1.4: we've seen segfaults a couple of times a day, and there does seem to be a (slower) memory leak.

I'd like to figure out this eventfd spinning, so we can switch back to libnice-0.1.10, and then track down whatever leaks or crashes remain.

@lminiero
Member

lminiero commented Mar 3, 2015

Are you sure the segfaults are caused by the libnice version and not something else? Anyway, I agree that trying to figure out what's wrong in the up-to-date library might help. Maybe it would be a good idea to report it on the libnice mailing list, especially if it turns out it really is a bug there and not in Janus.

@ploxiln
Contributor Author

ploxiln commented Mar 3, 2015

I really don't know what caused the segfaults; they were probably luck or correlation. In fact, they haven't been happening for about a week now; it was two weeks ago when I had just started using libnice-0.1.4 (I wonder if I had failed to re-compile janus against the older headers).

@arjenveenhuizen

Just ran into this issue again. Any version of libnice > 0.1.4 still causes the spinning problem with the current git HEAD of janus (Ubuntu 14.04 x64).

[EDIT 1]
By the way, when compiling janus against libnice 0.1.13 (latest release tarball), the make output shows a couple of warnings. See: http://pastebin.com/BZSDP3FK

[EDIT 2]
I really think it would be a good idea to solve this problem, since 0.1.4 is already 3 years old and in the meantime hundreds of commits have been pushed to GitHub.

@roysjosh

roysjosh commented Sep 1, 2015

I just did a git bisect on libnice and narrowed this issue down to the following 4 sequential commits:

243c47ecc9d694ecfe230880081634936770a959
agent: Add nice_agent_recv() allowing blocking receives on sockets

c56727025dd1ffa2e0513bf6bfc5218b58e2b483
agent: Move GSource handling into Component

12ee430ef3dbe61e52fe1ace5196f1931cf1e2c4
agent: Fix format specifiers in debug messages

2b6370a87c34236e3996fb182751267e05ee11ac
agent: Support invoking I/O callbacks in non-default contexts

For my sake and anyone else who bisects libnice, this is the little command set I used:
touch gtk-doc.make && sed -i -e 's/^gtkdocize/#gtkdocize/' autogen.sh && sh autogen.sh --enable-compile-warnings=no && make -j2 && make install DESTDIR=$HOME/test && read -p 'Press ENTER to continue' && make uninstall DESTDIR=$HOME/test && git clean -xdf && git reset --hard HEAD
git bisect {good,bad}

Also, this only seems to bite me when I have a [nat] stun_server set? I didn't test that during the bisect but noticed it when attempting to reproduce the issue on faster hardware.

@lminiero
Member

lminiero commented Sep 1, 2015

No idea what a git bisect is so you have me at a disadvantage here 😄
How did you figure out that those were the changes that caused this and not something else? Either way, these are details the libnice developers could definitely benefit from, so it would be worthwhile telling them about this on their ML.

About stun_server, this might explain why I'm not getting this in my local tests: in fact, while I have the 0.1.13 libnice installed on my Fedora, I never encountered the issue, but I don't make use of STUN servers within Janus, only on the client side. Unfortunately, STUN is pretty much required when you deploy Janus on Amazon or other servers that are natted, so disabling STUN is not an option in those cases. Anyway, if what you found is true, that might be also useful to the libnice developers to narrow down the issue and the causes, and take care of them.

That said, since waiting for a fix in libnice might take a while, and having that fix spread fast enough through the distro repos even longer, I might start looking into some alternatives (e.g., libre), but that might take just as long, as it wouldn't be easy to study another library and integrate it into how Janus is currently conceived.

@roysjosh

roysjosh commented Sep 1, 2015

libnice issue filed: https://phabricator.freedesktop.org/T3328

git bisect is pretty cool. If a regression makes it into a codebase and you know a good commit and a bad commit, git can automate the process of doing a binary search to locate the commit that introduced the undesired behavior.
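
A typical session, for reference (the good/bad revisions here are illustrative, not the exact ones used above):

git bisect start
git bisect bad HEAD        # a build that shows the eventfd spinning
git bisect good 0.1.4      # the last release known not to spin
# git now checks out a commit in the middle of the range; build it,
# test it, and report the result so the range is halved again:
git bisect good            # or: git bisect bad
# ...repeat until git prints "<sha> is the first bad commit"
git bisect reset           # return to the original checkout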

@lminiero
Member

lminiero commented Sep 1, 2015

Thanks for filing that issue, I'll keep that monitored.

@roysjosh

roysjosh commented Oct 5, 2015

I've found that in janus.c, during the bundle and rtcp-mux consolidation and the closing of components, nice_agent_attach_recv sets a NULL I/O callback. This is directly responsible for three sockets failing all the tests in component_io_cb in libnice's agent/agent.c, specifically the has_io_callback tests, and thus being repeatedly reported as readable. This was shown by printing the FD that component_io_cb was operating on and observing that the three FDs (in my case) returned as readable alongside the eventfd are indeed seen by this function and ignored.

I've reduced the number to 1 by replacing the bundle nice_agent_attach_recv calls with nice_agent_remove_stream but I'm not sure that is entirely correct. I'm also not sure what to do about the rtcp-mux calls. What do you all think?
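
To make the two calls being compared concrete, here's a rough sketch (this is not the actual janus.c code; agent, stream_id, component_id and ctx stand for the handle and ids Janus already tracks for the handle being torn down):

#include <agent.h>   /* libnice */

/* Detach only: libnice keeps polling the component's sockets, but
 * component_io_cb no longer has a callback to hand the data to
 * (has_io_callback is FALSE), so the fds keep waking the loop up. */
static void detach_only(NiceAgent *agent, guint stream_id, guint component_id,
                        GMainContext *ctx)
{
    nice_agent_attach_recv(agent, stream_id, component_id, ctx, NULL, NULL);
}

/* Remove the whole stream instead: its sockets are dropped from the
 * agent, so there is nothing left to report as readable. */
static void drop_stream(NiceAgent *agent, guint stream_id)
{
    nice_agent_remove_stream(agent, stream_id);
}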

@roysjosh

roysjosh commented Oct 5, 2015

Also, starting janus with the new --bundle and --rtcp-mux options "fixes" the issue.

@lminiero
Member

lminiero commented Oct 5, 2015

That's an interesting discovery, thanks for sharing!

We pass a NULL because as per the documentation that's what we should do to detach any existing callback:

Calling nice_agent_attach_recv() with a NULL func will detach any existing callback and cause reception to be paused for the given stream/component pair. You must iterate the previously specified GMainContext sufficiently to ensure all pending I/O callbacks have been received before calling this function to unset func, otherwise data loss of received packets may occur.

The second part shouldn't matter as we don't care about data loss in that case.

The rtcp-mux stuff means you can get rid of the 2nd component of each stream: I guess that's because you still have a pending FD. You got rid of the unused streams (e.g., video and data, bundled on the "audio" stream), but the 2nd component of the remaining stream is still there. Can you check if removing that too fixes it for you?

@lminiero
Member

lminiero commented Oct 5, 2015

Now that I think about it, we're indeed detaching callbacks, but not removing the streams as you pointed out, which means libnice is still listening for them. That said, no data should be sent there anymore, so I don't know why that should cause the threads to spin: maybe a poll on an invalid file descriptor?

@lminiero
Member

lminiero commented Oct 5, 2015

Anyway, that's something we should put in janus_ice_stream_free and janus_ice_component_free, which are called after detaching the callbacks, rather than using those methods in place of nice_agent_attach_recv: that would guarantee that the streams and components are properly removed in any case. I'll look into it tomorrow.

@lminiero
Member

lminiero commented Oct 6, 2015

Just checked and, while there is nice_agent_remove_stream for removing streams, there doesn't seem to be any method for removing components from an existing stream... or am I missing something obvious?

lminiero added a commit that referenced this issue Oct 6, 2015
…nice_agent_remove_stream when enforcing bundle/rtcp-mux (see #154)
@roysjosh

That appeared to work with the side effect of sending a UDP packet to 1.2.3.4:1234 every 25 seconds.

@lminiero
Member

The 25-second thing is probably the keepalive we mentioned. I guess we can change the 1.2.3.4 to something like localhost and to some port that nobody uses: any suggestions?

@lminiero
Member

Changed the IP to 127.0.0.1 for now. The port is still 1234, but that shouldn't matter unless one has UDP-based services listening on that port. Not sure what other port we might want to use here: maybe one of the ports typically used for one of the Janus webservers, as that might indicate a port that is not used by anybody else. Whatever we choose, it just needs to be a port nobody uses, as we only need to "dump" the keepalives somewhere.

Another alternative might be creating a UDP server ourselves within Janus that just acts as a "blackhole" to receive those keepalives and ignore them.

@arjenveenhuizen

Great that we are getting close to a solution. I think it would be a good idea to make this port configurable, so that people at least have the option to pick another port in case they need to use that specific one. Furthermore, I think it would be a good idea to pick a very high port number (e.g. 65520), since that is probably less likely to be used than 1234 or 12345.

@lminiero
Member

I'm working on a solution that binds to a random port. I'd rather not make this configurable as it's such an obscure workaround to such an obscure issue that people would not get the point of it at all.
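
For reference, binding to port 0 lets the kernel pick a free port, which is presumably what "random port" means here; a rough sketch of such a loopback blackhole socket (not the actual #362 code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP "blackhole" on 127.0.0.1: the kernel picks a free port,
 * which is reported back so the keepalives can be aimed at it. The
 * caller just reads and discards whatever lands on the fd. */
static int create_blackhole(uint16_t *port_out)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* 0 = let the kernel choose */

    socklen_t len = sizeof(addr);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}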

@ploxiln
Contributor Author

ploxiln commented Oct 26, 2015

I vote for not making it configurable. I'd actually pick a never-used port in the lower 1024, though. If you look in /etc/services you can see the many reservations in the "privileged" range below 1024. Something like 605 is not listed, 79 is "finger" which is TCP-only, and 57 is "mtp" which is TCP-only and "deprecated".

@lminiero
Member

Ports below 1024 would require Janus to be launched as root, which is not always the case.

@ploxiln
Contributor Author

ploxiln commented Oct 26, 2015

I meant just sending to a UDP port below 1024 and not receiving. But now I see what you mean: open a random high port to receive, and send to it... yeah, that's not a bad idea.

@lminiero
Member

Yes, as otherwise we risk sending stuff to a service that may indeed be using that port, with who knows what consequences.

@arjenveenhuizen

I like that approach. Good workaround!

@lminiero
Member

Just made a commit in #362 that implements the UDP server blackhole I talked about. Feedback welcome! If it fixes it for you guys, I'll merge the PR.

@ploxiln
Contributor Author

ploxiln commented Nov 2, 2015

I'm testing #362 starting tomorrow, results soon :)

@ploxiln
Contributor Author

ploxiln commented Nov 2, 2015

A first, not-really-related observation is that the nat_1_1_mapping setting is not having any effect: the STUN-server-derived public IP is being used in the ICE candidates produced by janus.

@ploxiln
Contributor Author

ploxiln commented Nov 2, 2015

#362 does fix the "eventfd spinning" threads for viewers for us, without forcing --rtcp-mux or --bundle.

@lminiero
Member

lminiero commented Nov 3, 2015

@ploxiln thanks for the feedback! Before I merge this, have you checked why nat_1_1_mapping stopped working, though? Was this a regression my patch introduced? I can't look into this right now as we're in Japan streaming the IETF meeting; I'll try to do that this evening (it's morning here now).

@ploxiln
Contributor Author

ploxiln commented Nov 3, 2015

Yeah, I figured out the nat_1_1_mapping thing and have a small patch I'll submit; we can figure out in that PR whether the current functionality is intended.

@lminiero
Member

lminiero commented Nov 3, 2015

Great, thanks! Can you submit it on #362 itself, so that we can merge them both together?

@ploxiln
Contributor Author

ploxiln commented Nov 3, 2015

It's really not related; I put it up as #368.

@lminiero
Member

lminiero commented Nov 3, 2015

Ah ok, got it. So I guess it's ok to merge both #362 and #368, or am I missing something?

@lminiero
Member

lminiero commented Nov 3, 2015

After I fix your note on the blackhole fd, that is... will do that later.

@ploxiln
Contributor Author

ploxiln commented Nov 3, 2015

Yup, all OK from me, thanks.

@BellesoftConsulting

BellesoftConsulting commented Nov 16, 2016

Hi @lminiero, I am seeing this issue, or a similar one (Janus ICE threads consuming 100% of the CPU on Debian Jessie). I am using libnice-0.1.13 and the latest Janus master (commit df5b546). I wonder if you have any suggestions.

(screenshots attached: 2016-11-16, 3:09 PM and 3:28 PM)

@BellesoftConsulting

BellesoftConsulting commented Nov 17, 2016

I was able to resolve this by not using TURN in janus.cfg; I used STUN instead, since the server is public and has no NAT.

@lminiero
Member

Probably the issue was the same: libnice not closing or getting rid of an unused/broken file descriptor that would poll forever.

@BellesoftConsulting

Does Janus need to use those iceloops for anything if neither TURN nor STUN is configured in janus.cfg?

@lminiero
Member

iceloops are the threads assigned to libnice. Even without STUN/TURN, libnice still has to handle connectivity checks and poll sockets to receive packets...
