CPU utilization is higher and shows weird shapes after 19.7 upgrade #3587

Closed
immanuelfodor opened this issue Jul 19, 2019 · 45 comments
Labels: cleanup Low impact changes
Milestone: 20.1

Comments

@immanuelfodor

Important notices
Before you add a new report, we ask you kindly to acknowledge the following:

[x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[x] I have searched the existing issues and I'm convinced that mine is new.

Describe the bug
CPU utilization is noticeably higher after the 19.1 -> 19.7 upgrade than it was before, and the CPU graph has shown weird triangles since then.

To Reproduce
Upgrade from 19.1 to 19.7, look through the admin page to explore new features, leave the VM running for a day.

Expected behavior
CPU utilization is constant and around the levels of 19.1.

Screenshots
Grafana:
image

Relevant log files
Running top sometimes shows extremely high python3.7 activity; for example, I once saw the output below, but it usually floats around 33-44%.

last pid: 70082;  load averages:  0.59,  0.55,  0.51                                                                                                                                                                 up 1+03:17:13  15:46:12
50 processes:  2 running, 48 sleeping
CPU: 49.6% user,  0.0% nice,  0.6% system,  0.0% interrupt, 49.8% idle
Mem: 33M Active, 628M Inact, 438M Wired, 277M Buf, 1859M Free
Swap: 

  PID USERNAME       THR PRI NICE   SIZE    RES   SWAP STATE   C   TIME    WCPU COMMAND
71938 root             1  84    0 26548K 22272K     0K CPU0    0  22:20  99.08% python3.7
15593 zabbix           1  20    0 18564K 10752K     0K select  1   0:25   0.11% zabbix_agentd
30415 root             1  21    0  1034M  3844K     0K CPU1    1   0:00   0.04% top
46820 root             1  20    0  1038M  7668K     0K select  0   0:00   0.02% sshd
37559 root             5  30    0  6912K  2356K     0K uwait   1   0:14   0.01% dpinger
94458 zabbix           1  20    0 18564K 10284K     0K nanslp  1   0:12   0.01% zabbix_agentd
84794 zabbix           1  20    0 18564K 10716K     0K nanslp  1   0:10   0.01% zabbix_agentd
33533 root             1  20    0  1039M  6344K     0K select  1   0:08   0.01% ntpd
65488 dhcpd            1  20    0 16588K  9260K     0K select  1   0:05   0.01% dhcpd
60544 root             1  20    0  1033M  2976K     0K bpf     0   0:07   0.00% filterlog
29313 root             1  20    0 17148K 11932K     0K kqread  1   0:08   0.00% lighttpd
92095 zabbix           1   4    0 18564K 10756K     0K select  1   0:25   0.00% zabbix_agentd
28141 zabbix           1  20    0 18564K 10760K     0K accept  1   0:25   0.00% zabbix_agentd
35656 root             1  52    0  1034M  3160K     0K wait    1   0:23   0.00% sh
80551 root             1  20    0 46964K 34700K     0K accept  1   0:15   0.00% python3.7
 9676 root             2  20    0 19876K 10916K     0K kqread  0   0:11   0.00% syslog-ng
...

Environment
OPNsense 19.7 (amd64/OpenSSL)
VM with 2 vcores and 3GB RAM running on Proxmox: pve-manager/5.4-11/6df3d8d0 (kernel: 4.15.18-18-pve)
Network: Intel® I219-LM and I211

@immanuelfodor
Author

I must say that OPNsense is working fine; there are no noticeable side effects of the CPU utilization change, it just looks weird on the graph. If there is a good explanation for why this is going to be the normal operation from now on, I'm all ears :)

@AdSchellevis
Member

We refactored flowd_aggregate.py to parse the flowd binary files (/var/log/flowd.log) natively and to use Python 3 instead of Python 2. I expect your busy top process is this one.

Generally it can process more flowd records in the same amount of time than the old version could, but it will probably keep one core busy while doing so.

Is the machine also busy when there isn't a lot of traffic?
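
Roughly, the new reader is a single-threaded loop over the binary log, which is why it tends to saturate exactly one core. A simplified sketch of that pattern follows; the 2-byte length-prefixed record layout is made up for illustration and is not the real flowd on-disk format.

import struct
from collections import defaultdict

def aggregate(path="/var/log/flowd.log"):
    # Walk variable-length records from the binary log and keep simple counters.
    totals = defaultdict(int)
    with open(path, "rb") as fh:
        while True:
            header = fh.read(2)
            if len(header) < 2:
                break                        # end of file or truncated record
            (length,) = struct.unpack("!H", header)
            payload = fh.read(length)
            if len(payload) < length:
                break
            # a real parser would decode the payload into a flow record here
            totals["records"] += 1
            totals["bytes"] += length
    return totals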

@mstrdraco

I'm having high CPU usage as well. I started a post on the forum:
https://forum.opnsense.org/index.php?topic=13507.0 I've attached a few screenshots there; it seems like a few people are having similar issues, and flowd_aggregate.py does seem to be the culprit.

@mr44er

mr44er commented Jul 19, 2019

Same effect here, but I noticed it through my power meter, which normally shows 120 W at idle (20 minutes after boot, once everything has settled); now I see 170-180 W at idle.

OPNsense runs under bhyve; powerdxx is running on the host.

From host:

load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 203%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 204%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 203%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 207%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 200%, current freq 3500 MHz ( 0), wanted freq 7000 MHz

That's full load on 2 cores, which shouldn't happen on an idle system.

@AdSchellevis
Member

flowd_aggregate.py can only use one core....

@mr44er

mr44er commented Jul 19, 2019

Yes, I only wanted to say that this load is much higher than before; normally the CPU frequency (and with it the power draw) would scale down -> dev.hwpstate.0.freq_settings: 3500/-1 3300/-1 3100/-1 2800/-1 2200/-1 1600/-1

@AdSchellevis
Member

The question is how many flow records the machine is producing (traffic). If the machine is busy without network load, we might have an issue finding the tail of the flowd.log file, in which case it might help to rotate more often or wait longer between cycles.

It could be that the script itself can utilise a core better, which would lead to higher load (on one core).
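
To make the two options concrete, the loop being discussed boils down to something like this simplified sketch; the 15-second interval and 10MB limit are the values mentioned in this thread, while the rotation helper and file name suffix are hypothetical.

import os
import time

CYCLE_INTERVAL = 15          # seconds between polls (the fixed interval used so far)
MAX_FILE_SIZE_MB = 10        # rotate flowd.log once it grows past this size

def rotate_log(path):
    # hypothetical rotation helper: move the current file aside so the next
    # cycle only has to scan a small, fresh file
    os.rename(path, path + ".000001")

def poll_loop(aggregate_once, path="/var/log/flowd.log"):
    while True:
        aggregate_once()     # parse any new flow records and update the statistics
        if os.path.exists(path) and os.path.getsize(path) > MAX_FILE_SIZE_MB * 1024 * 1024:
            rotate_log(path)
        time.sleep(CYCLE_INTERVAL)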

@pyrodex

pyrodex commented Jul 19, 2019

The question is how many flow records the machine is producing (traffic). If the machine is busy without network load, we might have an issue finding the tail of the flowd.log file, in which case it might help to rotate more often or wait longer between cycles.

It could be that the script itself can utilise a core better, which would lead to higher load (on one core).

The flowd.log is not showing any data: I can see it growing, but when I try to read it via clog or other methods it produces no output. Is this expected behavior?

@mr44er

mr44er commented Jul 19, 2019

if the machine is busy without network load, we might have an issue finding the tail of the flowd.log file.

No traffic today, machine is out of production.

I stopped it for now.

Host (120W now again) :

load 15%, current freq 3500 MHz ( 0), wanted freq 6569 MHz
load 30%, current freq 3500 MHz ( 0), wanted freq 6569 MHz
load 40%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 24%, current freq 3500 MHz ( 0), wanted freq 6781 MHz
load 21%, current freq 3500 MHz ( 0), wanted freq 6569 MHz
load 16%, current freq 3500 MHz ( 0), wanted freq 6363 MHz
load 31%, current freq 3500 MHz ( 0), wanted freq 6363 MHz
load 28%, current freq 3500 MHz ( 0), wanted freq 6363 MHz
load 12%, current freq 3500 MHz ( 0), wanted freq 6164 MHz
load 19%, current freq 3500 MHz ( 0), wanted freq 5971 MHz
load 40%, current freq 3500 MHz ( 0), wanted freq 6369 MHz
load 40%, current freq 3500 MHz ( 0), wanted freq 6793 MHz
load 13%, current freq 3500 MHz ( 0), wanted freq 6580 MHz
load 14%, current freq 3500 MHz ( 0), wanted freq 6374 MHz
load 20%, current freq 1600 MHz ( 5), wanted freq 6174 MHz
changing clock speed from 1600 MHz to 3500 MHz
load 26%, current freq 1600 MHz ( 5), wanted freq 6174 MHz
changing clock speed from 1600 MHz to 3500 MHz
load 12%, current freq 3500 MHz ( 0), wanted freq 5981 MHz
load 22%, current freq 3500 MHz ( 0), wanted freq 5794 MHz
load 41%, current freq 3500 MHz ( 0), wanted freq 6334 MHz
load 36%, current freq 3500 MHz ( 0), wanted freq 6334 MHz
load 22%, current freq 3500 MHz ( 0), wanted freq 6136 MHz
load 26%, current freq 3500 MHz ( 0), wanted freq 6136 MHz
load 46%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 30%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 25%, current freq 3500 MHz ( 0), wanted freq 7000 MHz

@AdSchellevis
Member

To read the binary file:

flowd-reader /var/log/flowd.log

AdSchellevis added a commit that referenced this issue Jul 19, 2019
…ived timestamp to flowparser, so we can skip a bit of processing when the data isn't relevant. for #3587
@mr44er

mr44er commented Jul 19, 2019

flowd-reader /var/log/flowd.log
open(/var/log/flowd.log): No such file or directory

Whoops?

@AdSchellevis
Member

fabaef0 reduces the load a bit while seeking the end of the file.

opnsense-patch fabaef0a
service flowd_aggregate restart

@mr44er ls -aslh /var/log/flowd.log*

@mr44er

mr44er commented Jul 19, 2019

ls -aslh /var/log/flowd.log*
ls: No match.
opnsense-patch fabaef0
...
All patches have been applied successfully. Have a nice day.

powerd -v
powerd: unable to determine AC line status
load 41%, current freq 1600 MHz ( 5), wanted freq 1749 MHz
changing clock speed from 1600 MHz to 2200 MHz
load 22%, current freq 1600 MHz ( 5), wanted freq 1694 MHz
changing clock speed from 1600 MHz to 2200 MHz
load 22%, current freq 2200 MHz ( 4), wanted freq 1641 MHz
load 19%, current freq 2200 MHz ( 4), wanted freq 1600 MHz
changing clock speed from 2200 MHz to 1600 MHz
load 37%, current freq 1600 MHz ( 5), wanted freq 1600 MHz
load 29%, current freq 1600 MHz ( 5), wanted freq 1600 MHz
load 44%, current freq 1600 MHz ( 5), wanted freq 1877 MHz
changing clock speed from 1600 MHz to 2200 MHz
load 27%, current freq 1600 MHz ( 5), wanted freq 1877 MHz
changing clock speed from 1600 MHz to 2200 MHz
load 22%, current freq 2200 MHz ( 4), wanted freq 1818 MHz
load 25%, current freq 2200 MHz ( 4), wanted freq 1818 MHz
load 29%, current freq 2200 MHz ( 4), wanted freq 1818 MHz
load 27%, current freq 2200 MHz ( 4), wanted freq 1818 MHz
load 36%, current freq 2200 MHz ( 4), wanted freq 1818 MHz
load 14%, current freq 2200 MHz ( 4), wanted freq 1761 MHz
load 106%, current freq 2200 MHz ( 4), wanted freq 7000 MHz
changing clock speed from 2200 MHz to 3500 MHz
load 4%, current freq 3500 MHz ( 0), wanted freq 6781 MHz
load 8%, current freq 3500 MHz ( 0), wanted freq 6569 MHz
load 7%, current freq 3500 MHz ( 0), wanted freq 6363 MHz
load 48%, current freq 3500 MHz ( 0), wanted freq 7000 MHz
load 11%, current freq 3500 MHz ( 0), wanted freq 6781 MHz
load 34%, current freq 3500 MHz ( 0), wanted freq 6781 MHz
load 16%, current freq 3500 MHz ( 0), wanted freq 6569 MHz
load 32%, current freq 3500 MHz ( 0), wanted freq 6569 MHz
load 24%, current freq 3500 MHz ( 0), wanted freq 6363 MHz
load 40%, current freq 3500 MHz ( 0), wanted freq 6787 MHz
load 32%, current freq 3500 MHz ( 0), wanted freq 6787 MHz

Looks good so far.

flowd-reader /var/log/flowd.log

Shows data now... do you need something from it? I won't just copy & paste it because of the IPs. ;)

Edit: the CPU graph on the dashboard is now flat again and everything is snappier so far. :)

graph

@mstrdraco

Seems the patch made mine way worse..
IMG_20190719_121630

AdSchellevis added a commit that referenced this issue Jul 19, 2019
previously we always waited 15 seconds between cycles, now we calculate the time to wait based on the time spent in the previous cycle.
@AdSchellevis
Member

The dashboard consumes cycles as well; worse is practically not possible (the script is doing less or the same amount of work).

If the script is reading the file, it can use one core at 100%. The old version sometimes had difficulties keeping up, likely because there was latency between the flowd library and the Python process.

Previously we always waited 15 seconds between polls; with 6b1f3e6 we wait at most 60 seconds (minus the time spent in the previous cycle).

On top of the other one:

opnsense-patch 6b1f3e60
service flowd_aggregate restart

We might consider lowering the maximum file size (10MB now), since on every poll the script has to read the file to find where it left off last time (which is the same as before, by the way).
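
In pseudo-Python, the new wait logic described above looks roughly like this; it is a sketch of the behaviour, not the literal code from the commit.

import time

MAX_INTERVAL = 60  # seconds; the maximum wait mentioned above

def run(aggregate_once):
    while True:
        started = time.time()
        aggregate_once()                     # read new flowd records, update statistics
        elapsed = time.time() - started
        # sleep only for whatever is left of the budget, so a long cycle is not
        # followed by another full pause
        time.sleep(max(0.0, MAX_INTERVAL - elapsed))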

@mstrdraco

I've added the second patch as well. My CPU is currently getting an extreme workout: it's flapping from 100% to 26% and running about 15 degrees warmer than it did on 19.1. On 19.1, my CPU load would flap between 10% and at most 50%. My CPU is an Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4 cores) with 12 GB of RAM.

@immanuelfodor
Author

Wow, this thread has skyrocketed since I was away 😀
To your first question @AdSchellevis : I only have minimal traffic from a few clients, nothing serious.
I'm eager to help by applying the two patches, but if I do so, will the next 19.7.1 update go smoothly, or am I about to create a headache for myself? I've never used the opnsense-patch utility before.

@immanuelfodor
Author

immanuelfodor commented Jul 19, 2019

It doesn't sound good, @mstrdraco, I'll do a vzdump backup of the VM before applying the patches then, just to make sure.

@mstrdraco

from System/Diag

image

@mstrdraco

mstrdraco commented Jul 19, 2019

This is from the top command, so maybe the GUI isn't displaying it properly? Not sure.
image

@immanuelfodor
Author

I've just seen the manpage: it can revert the patches if they are applied again in reverse order.
Both patches are up and running, waiting for graphs to be made.

opnsense-patch fabaef0a
opnsense-patch 6b1f3e60
service flowd_aggregate restart

@mstrdraco

OK seems to be settling down now, maybe it just took it a while to clean itself up after the patches. I'm topping at around 50% and hopping between 20-50%.

@immanuelfodor
Author

immanuelfodor commented Jul 19, 2019

I also rebooted after applying the patches. It seems to be okay for now, but my fear is that these changes just postpone the high CPU usage. I feel the newly emerging triangle will just have a flatter back, but it will still be there by the morning. Please prove me wrong about this feeling 😀

@immanuelfodor
Author

This is what I mean, sorry for the mobile edit:
Screenshot_20190719-193318__01
The last small section is what appears to be a bit flatter since the patches. We'll see it tomorrow.

@mstrdraco

@immanuelfodor what app are you using to get that graph? I'm curious to see it on something other than the splash-screen widget.

@immanuelfodor
Author

OPNsense is a VM on a Proxmox node, and all the hypervisor metrics flow into InfluxDB and are visualized in Grafana. Here are some guides on how to set it up:

Once you have InfluxDB and Grafana running, you can monitor almost everything in your infrastructure with open-source exporter plugins and custom scripts: internet speed, hardware status, etc. Here is an example Proxmox metrics dashboard for inspiration: https://imgur.com/gallery/V3aKU

@AndyX90
Contributor

AndyX90 commented Jul 19, 2019

Here are my observations since the 19.7 Upgrade. This is version 19.7 without any patches.
Graphs are from one of my two firewalls running in HA mode (last 24 hours).
The upgrade was performed today at about 08:00 am.
result1
EDIT: Hardware - DEC-4630
Intel® Xeon™ E3-1225V5 3.3Ghz Quad Core
8GB RAM

@immanuelfodor
Author

Well, the patches seem to help keep the CPU utilization lower for a longer period of time, but the overall effect is the same. The patches were applied around 7/19 19:00.

image

@AdSchellevis
Member

The behaviour will likely remain a bit like that; looking at the stats, I expect you collect about 10MB of flow log data (/var/log/flowd.log) every 8 to 9 hours.

Since we always need to read at least one file (usually flowd.log itself), processing becomes more time consuming on every iteration (as it was before as well, by the way, but I suspect there was latency between the old reader library and the actual process).
Since flowd doesn't use fixed-size records, we can't read the file in reverse; see the sketch after the list below.

We could add some options that waste fewer CPU cycles:

  • An option to set the maximum flowd log file size (now fixed at 10MB), which forces more rotations but limits the amount of data to read (to find our last known timestamp)
  • A configurable time interval; in the latest patch it is set to a maximum of 1 minute. Increasing it adds lag to the statistics but is more economical. Flow records are usually sent out by the kernel with a little lag anyway.
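
To illustrate the point about the last known timestamp (a simplified sketch; the record iteration and the recv_time field name are placeholders, not the actual parser's attributes): the reader walks the file forward and drops everything at or before the last timestamp it already aggregated.

def new_records(records, last_seen_ts):
    # Because flowd records are variable length, we cannot seek backwards from the
    # end of the file; each poll scans forward and skips what was already processed.
    for rec in records:                      # records: an iterable of parsed flow dicts
        if rec["recv_time"] <= last_seen_ts:
            continue                         # already aggregated in an earlier cycle
        yield rec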

@payback007

I'm also observing higher CPU usage; after turning off netflow it looks a little better, but it's still higher than before. By the way, I have no /var/log/flowd.log available. For me it doesn't seem periodic; it seems to be random.

I summarized the CPU usage at different OPNsense versions / actions:

opnsense_CPU_usage

@AdSchellevis
Member

AdSchellevis commented Jul 22, 2019

Oops, it seems that I missed 9287b55 in an earlier commit. The full set of changes should look like this:

pkg install -f opnsense
opnsense-patch fabaef0a 6b1f3e60 9287b55
service flowd_aggregate restart

pkg -f install opnsense reinstalls the base package

EDIT: fixed pkg install typo

@mstrdraco

I've added 9287b55 to mine, so I now have all 3 patches. I'll let you know how it goes.

@AndyX90
Contributor

AndyX90 commented Jul 22, 2019

New observations from my side:
pic
Same time range on the 3 graphs.
As you can see, after applying the 3 patches those spikes are gone.
But it is noticeable that the kernel and system times on 19.7 are much higher than they were on 19.1.

@AdSchellevis
Member

@AndyX90 the current state will likely go into 19.7.1, you could try to fiddle a bit with MAX_FILE_SIZE_MB and see if the load changes a lot (e.g. change it to 5 and restart the process).

I'm not sure yet if we should change the current defaults. As stated earlier, load is dependent on two variables: the time between polls and the size of the flowd log file. Since the old flowd library isn't compatible with Python 3, there's no going back. Python 3 itself isn't always faster or more economical than version 2 either; that's probably the price of progress.

@immanuelfodor
Author

I've just added the 3rd patch, waiting for the morning's stats.

@gk998

gk998 commented Jul 22, 2019

Same issue here (after applying these 3 patches). Netdata clearly visualizes that the spikes come from Python (the flow aggregator). Load average here is 0.89 on an apu2 (it was way lower with 19.1). The second graph shows smoothed pictures; when running top/htop it is visible that the flow aggregator consumes one core at 100% for some seconds at each interval.
image
image

@immanuelfodor
Author

For me, the effect of the 3rd patch is the same as before: the CPU usage got even lower and flatter again. The patch was applied at the blue arrow, without a full system restart; just a flowd service restart.

Screenshot_20190723-050552

@fichtner fichtner added the support Community support label Jul 23, 2019
fichtner pushed a commit that referenced this issue Jul 23, 2019
…ived timestamp to flowparser, so we can skip a bit of processing when the data isn't relevant. for #3587

(cherry picked from commit fabaef0)
(cherry picked from commit 6b1f3e6)
(cherry picked from commit 9287b55)
@AndyX90
Contributor

AndyX90 commented Jul 23, 2019

Thanks Ad! Those commits address the CPU spikes caused by the netflow parser and solve them.
But after that I disabled netflow on a testing machine and the CPU usage didn't change significantly (especially the red and light blue parts in my diagrams).
I think there is still another problem with kernel/CPU times.
result1
Hardware - DEC-4630
Intel® Xeon™ E3-1225V5 3.3Ghz Quad Core
8GB RAM

@AdSchellevis
Member

Load between versions is always difficult to compare, there's a lot of new software that could be more or less efficient.

At the moment I don't think there's a lot we can do here.

@mr44er

mr44er commented Jul 23, 2019

I recently added the 3 patches again and tested on three machines; they all show the same spikes everyone else has.
top shows that python3 creates the load.

I then decided to do without netgraph for now, because 50 W more running 24/7 is ~€130/year in Germany. ;)

Additionally I noticed that Suricata crashes more often and floods the log with this:
suricata: [100245] <Error> -- [ERRCODE: SC_ERR_NETMAP_READ(264)] - Error reading data from iface 'pppoe0': (55u) No buffer space available

I can't say for sure if this started with 19.7 or has anything to do with the patches or #3583. I don't know why or when this happens, but when it does, the suricata process can't be killed or restarted, and in this state it creates 100% load on every core. Only a hard reset of the machine helps.
If useful, I could open another ticket for it?

@fichtner fichtner added cleanup Low impact changes and removed support Community support labels Jul 23, 2019
@fichtner fichtner added this to the 20.1 milestone Jul 23, 2019
@fichtner
Member

Please keep this on topic for performance issues with flowd_aggregate. I can understand that change is scary, but we operate under the assumption that we have to move away from Python 2 and rewrite in Python 3 because there is no alternative, and I don't think this point has been stressed enough. The alternative to behavioural changes (albeit non-functional ones) is removing the feature completely. Just for perspective.

The three patches will be in 19.7.1. If the situation improves with 19.7.1 we would like to close this ticket and focus on functional work.

Cheers,
Franco

@JasMan78

pkg -f install opnsense reinstalls the base package

pkg: illegal option -- f

What did I do wrong?

@gk998

gk998 commented Jul 23, 2019

I ran truss(1) while the spike happened. Before it reads the file, I get "ERR#9 'Bad file descriptor'" 54104 times. During this, the process doesn't consume 100%, but between 10% and 20% on my apu2.
The output before this happens is:

 94422: stat("/dev/fd",{ mode=dr-xr-xr-x ,inode=94,size=512,blksize=4096 }) = 0  (0x0)
 94422: getrlimit(RLIMIT_NOFILE,{ cur=54126,max=54126 }) = 0 (0x0)
 94422: close(3)                                  = 0 (0x0)
 94422: close(4)                                  = 0 (0x0)
 94422: close(5)                                  = 0 (0x0)
 94422: close(6)                                  = 0 (0x0)
 94422: close(7)                                  = 0 (0x0)
 94422: close(8)                                  = 0 (0x0)
 94422: close(9)                                  = 0 (0x0)
 94422: close(10)                                 = 0 (0x0)
 94422: close(11)                                 = 0 (0x0)
 94422: close(12)                                 = 0 (0x0)
 94422: close(13)                                 = 0 (0x0)
 94422: close(14)                                 = 0 (0x0)
 94422: close(15)                                 = 0 (0x0)
 94422: close(16)                                 = 0 (0x0)
 94422: close(17)                                 = 0 (0x0)
 94422: close(18)                                 = 0 (0x0)
 94422: close(19)                                 = 0 (0x0)
 94422: close(20)                                 = 0 (0x0)
 94422: close(21)                                 ERR#9 'Bad file descriptor'

and after the final "Bad file descriptor":

 94422: close(54125)                              ERR#9 'Bad file descriptor'
 94422: execve("/sbin/ifconfig",0x3c798dcad08,0x739fe010a248) = 0 (0x0)
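
That close() storm up to the RLIMIT_NOFILE value (54126 here) looks like a child process closing every possible descriptor before exec; Python's subprocess module can fall back to exactly that when it cannot enumerate open descriptors via /dev/fd. This is only my guess at the trigger, not a confirmed cause. A minimal sketch that produces the same syscall shape (the bare ifconfig invocation is an assumption based on the trace):

import subprocess

def run_ifconfig():
    # close_fds=True (the Python 3 default) makes the child close descriptors
    # 3..RLIMIT_NOFILE-1 before execve("/sbin/ifconfig"), which shows up in
    # truss as tens of thousands of close() calls ending in EBADF.
    result = subprocess.run(["/sbin/ifconfig"], capture_output=True, text=True,
                            close_fds=True)
    return result.stdout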

@AdSchellevis
Member

@JasMan78 sorry, my mistake, it should have been:

pkg install -f opnsense

@AdSchellevis
Member

Patches shipped in 19.7.1 as promised by @fichtner (#3587 (comment)).
