eventfd spinning #154
Thanks for that very detailed analysis! Looking forward to the backtrace.
Here's the relevant backtrace, roughly:
Seems to start when I start a capture, and not stop when I stop a capture... I suspect my plugin / API access is a significant contributing factor.
Seems to actually be the viewer that triggers it for us, and it does sometimes go away immediately, and sometimes after a few minutes (which might be related to our session handling).
Well, I guess it's good news that this is easily reproducible in my VM, and with the demo pages: viewers spin threads. It might be our custom plugin though; echotest doesn't cause it.
Do the streaming and videoroom plugins cause this too? Those are the plugins that have a viewer-specific scenario, so this might help narrow the scope. I'm wondering if the cause is the sendonly nature of those instances: in that case, libnice is only used to send media, and incoming data only arrives occasionally, as RTCP. This might explain some of the poll-related bits I saw in your initial check, and the one in the backtrace, but wouldn't explain the spikes. A poll can indeed cause the CPU to spin, but usually only when fds are invalid. Not sure what we can do to fix this: an alternative might be to do the polling manually instead of relying on the recv callback, but that might be less efficient than what's in libnice.
Yup, I can reproduce the same symptom in videoroom aka Video MCU. Same strace results etc.
btw, useful commands:

```shell
# show top threads
top -H

# count system calls for a bit (stop with ctrl-c)
strace -c -p $JANUS_THREAD_PID

# log system calls with arguments
strace -p $JANUS_THREAD_PID

# show current call stack (thanks @jayridge)
gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" /opt/janus/bin/janus $JANUS_THREAD_PID
```
Thanks, this is very helpful, I'll look into this as well, although I can't recall ever being able to reproduce this issue locally.
libnice-0.1.4 does not have this eventfd spinning, so that's what we've switched back to. However, Janus does seem a bit less stable with libnice-0.1.4: we've seen segfaults a couple of times a day, and there does seem to be a (slower) memory leak. I'd like to figure out this eventfd spinning so we can switch back to libnice-0.1.10, and then track down whatever leaks or crashes remain.
So you mean libnice 0.1.4 should actually be much more robust in that respect than its more recent versions? I'll also have to investigate where the leak comes from; we're doing some stress tests with Valgrind on a different aspect, so I'll make sure I check this too.
Are you sure the segfaults are caused by the libnice version and not something else? Anyway, I agree that trying to figure out what's wrong in the up-to-date library might help. Maybe it would be a good idea to mention it on the libnice mailing list, especially if it turns out it really is a bug there and not in Janus.
I really don't know what the segfaults were caused by; it was probably luck or correlation, and in fact they haven't been happening for about a week now. It was two weeks ago when I had just started using libnice-0.1.4 (I wonder if I had failed to re-compile janus against the older headers).
Just ran into this issue again. Any version of libnice > 0.1.4 still causes the spinning problem in current git HEAD of janus (ubuntu 14.04 x64).
I just did a bisect; these are the commits it pointed at:
- 243c47ecc9d694ecfe230880081634936770a959
- c56727025dd1ffa2e0513bf6bfc5218b58e2b483
- 12ee430ef3dbe61e52fe1ace5196f1931cf1e2c4
- 2b6370a87c34236e3996fb182751267e05ee11ac

For my sake and anyone else who bisects libnice, this is the little command set I used:
Also, this only seems to bite me when I have a [nat] stun_server set? I didn't test that during the bisect, but noticed it when attempting to reproduce the issue on faster hardware.
No idea what a … About … That said, considering that waiting for a fix in libnice might take a while, and having that fix spread fast enough in the distro repos even longer, I might start looking into some alternatives (e.g., libre), but this might take just as long, as it wouldn't be easy to study another library and integrate it into how Janus is currently conceived.
libnice issue filed: https://phabricator.freedesktop.org/T3328
Thanks for filing that issue, I'll keep that monitored.
I've found that in janus.c, during the bundle and rtcp-mux consolidation and close of components, nice_agent_attach_recv sets a NULL IO callback. This is directly responsible for three sockets failing all the tests in libnice's agent/agent.c component_io_cb (specifically the has_io_callback tests), and thus being repeatedly reported as readable. This was shown by printing the FD that component_io_cb was operating on and observing that the three (for me) FDs returned as readable alongside the eventfd are, indeed, seen by this function and ignored. I've reduced the number to 1 by replacing the bundle nice_agent_attach_recv calls with nice_agent_remove_stream, but I'm not sure that is entirely correct. I'm also not sure what to do about the rtcp-mux calls. What do you all think?
Also, starting janus with the new --bundle and --rtcp-mux options "fixes" the issue.
That's an interesting discovery, thanks for sharing! We pass a NULL because as per the documentation that's what we should do to detach any existing callback:
The second part shouldn't matter, as we don't care about data loss in that case. The rtcp-mux stuff means you can get rid of the 2nd component of each stream: I guess that's because you still have a pending FD; you got rid of the unused streams (e.g., video and data, bundled on the "audio" stream), but the 2nd component of the remaining stream is still there. Can you check if removing that too fixes it for you?
Now that I think about it, we're indeed detaching callbacks, but not removing the streams as you pointed out, which means libnice is still listening on them. That said, no data should be sent there anymore, so I don't know why that should cause the threads to spin: maybe a poll on an invalid file descriptor?
Anyway, that's something we should put in …
Just checked and, while there is …
…nice_agent_remove_stream when enforcing bundle/rtcp-mux (see #154)
That appeared to work, with the side effect of sending a UDP packet to 1.2.3.4:1234 every 25 seconds.
The 25 seconds thing is probably the keepalive we mentioned. I guess we can change the 1.2.3.4 to something like localhost, and to some port that nobody uses: any suggestion?
Changed the IP to 127.0.0.1 for now. The port is still 1234, but that shouldn't matter unless one has UDP-based services listening on that port. Not sure what other port we might want to use here: maybe one of the ports typically used for one of the Janus webservers, as that might indicate a port that is not used by anybody else. Whatever we choose, it's enough that it's a port nobody uses, as we just need to "dump" the keepalives somewhere. Another alternative might be creating a UDP server ourselves within Janus that just acts as a "blackhole" to receive those keepalives and ignore them.
Great that we are getting close to a solution. I think it would be a good idea to make this port configurable, so that people at least have the option to pick another port in case they need to use that specific one. Furthermore, I think it would be a good idea to pick a very high port number (e.g., 65520), since that is probably less likely to be in use than 1234 or 12345.
I'm working on a solution that binds to a random port. I'd rather not make this configurable, as it's such an obscure workaround to such an obscure issue that people would not get the point of it at all.
I vote for not making it configurable. I'd actually pick a never-used port in the lower 1024, though. If you look in …
Ports below 1024 would require Janus to be launched as root, which is not always the case.
I mean just send to a UDP port below 1024 and not receive. But now I see what you mean: open a random high port to receive and send to it... yeah, that's not a bad idea.
Yes, as otherwise we risk sending stuff to a service that may indeed be using that port, with who knows what consequences.
I like that approach. Good workaround!
Just made a commit in #362 that implements the UDP server blackhole I talked about. Feedback welcome! In case it fixes it for you guys I'll merge the PR.
I'm testing #362 starting tomorrow, results soon :)
A first, not-really-related observation is that the nat_1_1_mapping setting is not having any effect: the STUN-server-derived public IP is being used in ICE candidates produced by janus.
#362 does fix the "eventfd spinning" threads-for-viewers thing for us, without forcing --rtcp-mux or --bundle.
@ploxiln thanks for the feedback! Before I merge this, have you checked why nat_1_1_mapping stopped working, though? Was this a regression my patch introduced? I can't look into this right now as we're in Japan streaming the IETF meeting; I'll try to do that this evening (morning here now).
Yeah, I figured out the nat_1_1_mapping thing, and have a small patch I'll submit, and we can figure out in that PR whether the current functionality is intended.
Great, thanks! Can you submit it on #362 itself, so that we can merge them both together?
It's really not related; I put it as #368.
Yup, all OK from me, thanks.
I was able to resolve this by not using TURN in janus.cfg; I used STUN instead, since the server is public and has no NAT.
Probably the issue was the same: libnice not closing or getting rid of an …
Does Janus need those iceloops for anything if neither TURN nor STUN is configured in janus.cfg?
Iceloops are the threads assigned to libnice. Even without STUN/TURN, libnice still has to handle connectivity checks and poll sockets to receive packets...
I figured out what was sometimes making threads "spin", consuming as much cpu as they can: write()s to eventfd.
First I identified a thread that was "going crazy", and then counted syscalls with strace over a few seconds:
That seems like more write()s than make sense for sending RTP/SRTP packets. Let's see what it's writing...
So it reads 8 bytes from fd 15, and then writes 8 bytes to fd 15, 6 times, and the data is
0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
each time (and it does this rapidly and continuously). fd 15 is from eventfd().
Maybe it's some libnice or glib thing... I'll try to get a backtrace of the relevant thread.