100% CPU usage after some time, cannot query bandwidth usage #30

Closed
hsgg opened this issue Apr 14, 2019 · 14 comments

@hsgg

hsgg commented Apr 14, 2019

nlbwmon seems to go into an infinite loop, using ~50% CPU and refusing to service the frontend. This happens after a few days, or sometimes a few weeks. I'm on MIPS, R6220. I see the same sort of strace output as mentioned in a forum thread, with lots of:

recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=164, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x3c\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\xc0\xa8\x01\xf6\x08\x00\x02\x00\xc0\xa8\x01\x01\x24\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 164
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)

There are many more of the EAGAIN messages than the ones with more information. The command nlbw -c list just hangs:

root@OpenWrt:~# nlbw -c list
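
To put a number on how many reads come back empty compared with how many return data, a captured trace can be summarized with a small shell helper. This is only a sketch: the function name `summarize_recvmsg` is made up for illustration, and it assumes the strace line format shown above.

```shell
#!/bin/sh
# Summarize recvmsg() results in an strace log read from stdin.
# Hypothetical helper; assumes the strace line format shown in this issue.
summarize_recvmsg() {
    awk '
        # Calls that returned without data
        /recvmsg\(/ && /EAGAIN/       { eagain++; next }
        # Calls that returned a byte count, e.g. ") = 164"
        /recvmsg\(/ && /\) = [0-9]+$/ { ok++ }
        END { printf "eagain=%d ok=%d\n", eagain, ok }
    '
}
```

With a saved trace, usage would look like `summarize_recvmsg < strace_nlbwmon.log`. A very large `eagain` count next to a small `ok` count would match the busy-polling pattern described here.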
@jow-
Owner

jow- commented Jun 6, 2019

This should be fixed with 4574e6e now - please reopen if it still occurs.

@jow- jow- closed this as completed Jun 6, 2019
@hsgg
Author

hsgg commented Jun 6, 2019

Thank you! I updated my router, and will report back in case it occurs again.

@PieterBreugelmans

PieterBreugelmans commented Jan 14, 2020

This week I upgraded my R7800 to the latest master build from hnyman, and I'm facing the same problem reported in this issue.

OpenWrt SNAPSHOT r11954-7936cb94a9 / LuCI Master git-20.010.59544-ff12aa3

root@jupiter:~# top -b
Mem: 146220K used, 330680K free, 2128K shrd, 6352K buff, 22876K cached
CPU: 0% usr 28% sys 28% nic 42% idle 0% io 0% irq 0% sirq
Load average: 1.01 1.02 1.00 2/85 6912
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
2561 1 root RN 1124 0% 52% /usr/sbin/nlbwmon -o /var/lib/nlbwmon -i 24h -r 30s -p /usr/share/nlbwmon/protocols -G 10 -I 1 -L 10000 -Z -s 192.168.0.0/16 -s 172.16.0.0/12 -s 10.0.0.0/8 -s 192.168.1.1/24 -s fd56:53f0:3524:10::1/60
6912 5816 root R 1092 0% 5% top -b
2927 1 root SN 4372 1% 0% /usr/sbin/collectd -C /tmp/collectd.conf -f

root@jupiter:/sys/devices/system/cpu/cpufreq/policy0# opkg list-installed |grep nlbwmon
luci-app-nlbwmon - git-20.010.59544-ff12aa3-1
nlbwmon - 2019-06-06-4574e6e8-2

...
11:34:37.904833 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:34:37.905715 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:34:37.906228 recvmsg(7, {msg_namelen=12}, 0) = -1 EAGAIN (Resource temporarily unavailable)
...

Attaching the strace log for a few seconds as a reference.
strace_nlbwmon.log

@jow-, I saw that you committed 4574e6e on the same date (June 6) as the version of nlbwmon I'm supposedly running on my R7800. Does that suggest the issue is not fully resolved by this commit?

Edit: I noticed this after 2 days (uptime: 2d 15h 33m 0s)

@jow-
Owner

jow- commented Jan 14, 2020

Thanks for the log, I'll investigate later. Unfortunately there is no reliable reproducer for this issue so it might take a while.

@digitalcircuit
Contributor

I've run into this issue as well on a ZyXEL NBG6817 with OpenWrt 19.07.0 r10860-a3ffeb413b and nlbwmon 2019-06-06-4574e6e8-1. Unfortunately, I have yet to find a reliable reproducer, either. Most recently it happened after 8d 2h of uptime on a network with WAN, LAN, and two additional VLANs.

If there's any investigation that'd help, I'd be happy to try.

Meanwhile, for those who run into this issue searching for a solution, the forum thread linked above also has a workaround:

Since nlbwmon is frequently going wild and consuming all available processing power from at least one core, I've put together some workarounds to be placed in the /etc/rc.local file before the exit 0 line.

[…]

  1. Automatic restart of nlbwmon process:
(while true; do
  NLB_T=45   # CPU% threshold for a runaway nlbwmon
  sleep 15m
  NLB_A=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1)
  # Default to 0 so the comparison does not error out when nlbwmon is not running
  if [ "${NLB_A:-0}" -ge "$NLB_T" ]; then
    sleep 5m   # re-check before restarting, to ignore short spikes
    NLB_B=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1)
    if [ "${NLB_B:-0}" -ge "$NLB_T" ]; then
      /etc/init.d/nlbwmon restart
      logger "WARNING: nlbwmon restarted due to process runaway, proc% was $NLB_A then $NLB_B"
    fi
  fi
done) &
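
The CPU reading that the loop above depends on can be pulled out into two small functions, which makes the parsing easier to check on its own. This is a sketch, not part of the original workaround: the names `nlbwmon_cpu` and `is_runaway` are hypothetical, and it assumes the BusyBox `top` column layout shown earlier in this thread, where %CPU is the 7th field and the command is the 8th.

```shell
#!/bin/sh
# Print the integer CPU% of the first nlbwmon process found in
# `top -b -n 1` output read from stdin; print 0 if none is found.
# Assumes the BusyBox column layout: PID PPID USER STAT VSZ %VSZ %CPU COMMAND
nlbwmon_cpu() {
    awk '$8 ~ /nlbwmon$/ { print int($7); found = 1; exit }
         END { if (!found) print 0 }'
}

# Succeed when the reading ($1) is at or above the threshold ($2).
is_runaway() {
    [ "$1" -ge "$2" ]
}
```

On a router this would be used as `cpu=$(top -b -n 1 | nlbwmon_cpu); is_runaway "$cpu" 45 && /etc/init.d/nlbwmon restart`. Matching on the command field avoids picking up unrelated lines that merely mention nlbwmon.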

@greekstreet

I encountered the same issue. Found this line in syslog. Hope this will help.
Sat Feb 15 06:54:23 2020 daemon.err nlbwmon[9132]: Unable to dump conntrack: No buffer space available

I restarted nlbwmon service and then CPU% became normal.

@davidl500

I'm encountering the same, including the conntrack error; e.g.
Wed Apr 8 22:47:41 2020 daemon.err nlbwmon[1646]: Unable to dump conntrack: No buffer space available

Also on MIPS, but a Lantiq VRX268 (BT Homehub 5A), so xRX200 rev 1.2, running OpenWrt 19.07.2.

As above, restarting nlbwmon service restored CPU and nlbw functionality.

@jow-
Owner

jow- commented Apr 10, 2020

Thanks, that helped a lot to track down a potential issue. I hope that 33c77cb finally addressed the 100% CPU load issue.

@pdfruth

pdfruth commented May 3, 2020

My experience so far...
I was seeing 100% CPU by nlbwmon pretty regularly.
Now running the latest commit, I have 15 days of uptime without seeing the problem.

@facboy

facboy commented Jul 23, 2020

I'm getting these randomly:

Thu Jul 23 11:47:06 2020 daemon.err nlbwmon[3017]: Netlink receive failure: Out of memory
Thu Jul 23 11:47:06 2020 daemon.err nlbwmon[3017]: Unable to dump conntrack: No buffer space available

@jow-
Owner

jow- commented Jul 23, 2020

@facboy - if you see this error, you need to increase option netlink_buffer_size in /etc/config/nlbwmon. The default is 524288 which appears to be too low for your use case.

May I ask how many conntrack entries you have? wc -l /proc/net/nf_conntrack
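
For anyone unsure how to raise the value, the sketch below prints the commands that would double it. Several things here are assumptions rather than facts from this thread: the helper name `buffer_size_cmds` is hypothetical, the section path `nlbwmon.@nlbwmon[0]` is a guess at the layout of /etc/config/nlbwmon (check it with `uci show nlbwmon`), and doubling is an arbitrary starting point, not a recommended value.

```shell
#!/bin/sh
# Print the uci commands that would double netlink_buffer_size.
# $1: current size in bytes (524288 is the default quoted above).
# ASSUMPTION: the option lives in the first "nlbwmon" section of
# /etc/config/nlbwmon; verify with `uci show nlbwmon` before applying.
buffer_size_cmds() {
    new_size=$(( $1 * 2 ))
    echo "uci set nlbwmon.@nlbwmon[0].netlink_buffer_size=$new_size"
    echo "uci commit nlbwmon"
    echo "/etc/init.d/nlbwmon restart"
}
```

Running `buffer_size_cmds 524288` only prints the commands, so they can be reviewed before being executed on the router.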

@facboy

facboy commented Jul 23, 2020

$ wc -l /proc/net/nf_conntrack
305 /proc/net/nf_conntrack

@jow-
Owner

jow- commented Jul 23, 2020

That is not many, but maybe you have a very high fluctuation of connections, that is, conntrack entries appearing and vanishing frequently, which causes a lot of events to be reported.

@martin8801

I've started getting the messages above as well.

wc -l /proc/net/nf_conntrack
395 /proc/net/nf_conntrack
