CPU utilization is higher and shows weird shapes after 19.7 upgrade #3587
Comments
|
I must say that OPNsense is working fine, there are no noticeable side effects of the CPU utilization changes, it just looks weird on the graph. If there is a good explanation why this is going to be the normal operation from now on, I'm all ears :) |
|
We refactored the flow aggregation code. Generally it can process more flowd records in the same time than the old version, but it will probably keep one core busy while doing so. Is the machine also busy when there isn't a lot of traffic? |
|
I'm having high cpu usage as well.. I started a post on the forum |
|
Same effect here, but I noticed it through my power meter, which normally shows 120W in idle (20 mins after boot, once everything has settled); now I have 170-180W in idle. OPNsense runs under bhyve, and powerdxx is running on the host. From the host:
That's full load on 2 cores, which shouldn't happen on an idle system. |
|
|
|
Yes, I only wanted to say that this load is much higher than before; normally the CPU frequency (and with it the power draw) would go down -> dev.hwpstate.0.freq_settings: 3500/-1 3300/-1 3100/-1 2800/-1 2200/-1 1600/-1 |
|
The question is how many flow records the machine is producing (traffic). If the machine is busy without network load, we might have an issue finding the tail of the flowd.log file, in which case it might help to rotate more often or wait longer in between cycles. It could also be that the script itself can utilise a core better, which would lead to higher load (on one core). |
The flowd.log is not showing any data; I see it growing, but when trying to read it via clog or other methods it produces no output. Is this expected behaviour?
No traffic today, machine is out of production. I stopped it for now. Host (120W now again) :
|
to read the binary file. |
…ived timestamp to flowparser, so we can skip a bit of processing when the data isn't relevant. for #3587
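The commit above passes the received timestamp down to the flow parser so that stale data can be skipped. A minimal sketch of that idea (the record layout and the "recv" key are assumptions for illustration, not the actual flowd format):

```python
# Hypothetical illustration of skipping records that aren't relevant:
# keep only records whose received timestamp is newer than the last one
# already processed. The dict layout and "recv" key are assumptions.
def filter_new_records(records, last_seen):
    """Return records received strictly after last_seen (epoch seconds)."""
    return [r for r in records if r["recv"] > last_seen]
```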
Whoops? |
|
ls -aslh /var/log/flowd.log*
Looks good so far.
Shows data now... do you need anything from it? I won't just copy & paste it because of the IPs. ;) Edit: the CPU graph on the dashboard is flat again and everything is snappier so far. :) |
previously we always waited 15 seconds between cycles, now we calculate the time to wait from the time spent on the previous cycle.
|
the dashboard consumes cycles as well; worse is practically not possible (it's doing less or the same amount of work). If the script reads the file it can use one core at 100%. The old version sometimes had difficulties keeping up, likely because there was latency between the flowd library and the Python process. Previously we always waited 15 seconds between polls; with 6b1f3e6 we wait at most 60 seconds (minus the time spent last time). On top of the other one: we might consider lowering the max file size (10MB now), since every poll it has to read the file to know where it ended last time (which is the same as previously, by the way). |
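The adaptive wait described in that comment can be sketched as follows. This is a hedged illustration, not the actual flowd_aggregate code; the function names are assumptions:

```python
import time

MAX_INTERVAL = 60  # upper bound between polls, as described above


def remaining_wait(elapsed, max_interval=MAX_INTERVAL):
    """Seconds left to sleep after a cycle that took `elapsed` seconds.
    Never negative, so an overlong cycle is followed immediately by the
    next one instead of a fixed 15 second pause."""
    return max(0.0, max_interval - elapsed)


def poll_loop(process_once, cycles, max_interval=MAX_INTERVAL):
    """Run `process_once` repeatedly, spacing cycle starts roughly
    max_interval seconds apart."""
    for _ in range(cycles):
        started = time.monotonic()
        process_once()
        time.sleep(remaining_wait(time.monotonic() - started, max_interval))
```

With this scheme a cycle that takes 10 seconds is followed by a 50 second sleep, and a cycle that overruns the interval is followed by no sleep at all, which matches the "max 60 seconds minus the time spent last time" behaviour.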
|
I've added the second patch as well. My CPU is currently getting an extreme workout: it's flapping from 100% to 26% and running about 15 degrees warmer than it did on 19.1. On 19.1 my CPU load would flap between 10% and at most 50%. My CPU is an Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (4 cores) with 12 GB of RAM. |
|
Wow, this thread has skyrocketed since I was away 😀 |
|
It doesn't sound good, @mstrdraco, I'll do a vzdump backup of the VM before applying the patches then, just to make sure. |
|
I've just seen the manpage: it can revert the patches if they are applied again in reverse order. opnsense-patch fabaef0a
opnsense-patch 6b1f3e60
service flowd_aggregate restart |
|
OK, it seems to be settling down now; maybe it just took a while to clean itself up after the patches. I'm topping out at around 50% and hopping between 20-50%. |
|
I also rebooted after applying the patches. It seems to be okay for now, but my fear is that these changes merely postpone the high CPU usage. I feel the newly emerging triangle will just have a flatter back, but it will be there eventually in the morning. Please prove me wrong about this feeling 😀 |
|
@immanuelfodor what app are you using to get that graph? I'm curious to see it on something other than the splash-screen widget. |
|
OPNsense is a VM on a Proxmox node, and all the hypervisor metrics flow into InfluxDB and are visualized in Grafana. Here are guides on how to set it up:
Once you have InfluxDB and Grafana running, you can monitor almost everything in your infrastructure with open-source exporter plugins and custom scripts: internet speed, hardware status, etc. Here is an example Proxmox metrics dashboard for inspiration: https://imgur.com/gallery/V3aKU |
|
The behaviour will likely remain a bit like that. Looking at the stats, I expect you collect about 10MB of log traffic (/var/log/flowd.log) every 8 to 9 hours. Since we always need to read at least one file (usually flowd.log itself), processing becomes more time consuming on every iteration (as it was before as well, by the way, though I suspect there was latency in between the old reader lib and the actual process). We could opt to add some options which waste fewer CPU cycles:
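The "know where it ended last time" bookkeeping can be sketched with a simple byte-offset scheme. This is only an illustration of the shape of the problem; the real flowd_aggregate parser works on binary flowd records and its internals are not shown here:

```python
import os


def read_new_bytes(path, last_offset):
    """Read everything appended since last_offset and return
    (new_data, new_offset). When the file shrank (it was rotated or
    truncated), fall back to reading from the start. Note that each
    poll still has to scan forward through the file, which is why a
    bigger log means a more expensive iteration."""
    size = os.path.getsize(path)
    if size < last_offset:
        last_offset = 0
    with open(path, "rb") as fh:
        fh.seek(last_offset)
        data = fh.read()
    return data, last_offset + len(data)
```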
|
|
oops, it seems that I missed 9287b55 in an earlier commit. The full set of changes should look like this:
EDIT: fixed pkg install typo |
|
I've added 9287b55 to mine, so I now have all 3 patches. I'll let you know how it goes. |
|
@AndyX90 the current state will likely go into 19.7.1; you could try to fiddle a bit with MAX_FILE_SIZE_MB and see if the load changes a lot (e.g. change it to 5 and restart the process).
I'm not sure yet if we should change the current defaults. As stated earlier, load is dependent on two variables: the time between polls and the size of the flowd log file. Since the old flowd library isn't compatible with Python 3, there's no going back. Python 3 itself isn't always faster or more economical than version 2 either; probably the price of progress. |
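To illustrate the MAX_FILE_SIZE_MB knob mentioned above, here is a hedged sketch of size-based rotation; the helper name and the rotated-file suffix are assumptions, not the actual OPNsense implementation:

```python
import os
import shutil


def maybe_rotate(path, max_size_mb=10):
    """Move the log aside and start a fresh one once it exceeds the
    size cap. A smaller cap (e.g. 5 instead of the default 10) means
    each poll has less data to scan through, at the cost of rotating
    more often."""
    if os.path.getsize(path) >= max_size_mb * 1024 * 1024:
        shutil.move(path, path + ".000001")  # suffix is an assumption
        open(path, "wb").close()  # start a fresh, empty log
        return True
    return False
```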
|
I've just added the 3rd patch, waiting for the morning's stats. |
|
Load between versions is always difficult to compare, there's a lot of new software that could be more or less efficient. At the moment I don't think there's a lot we can do here. |
|
I recently added the 3 patches again and tested on three machines, which all show the same spikes everyone has. I then decided to do without netgraph for now, because 50W more at 24/7 is ~130€/year in Germany. ;) Additionally I noticed that Suricata crashes more often and floods the log with this: I can't say for sure whether this started with 19.7 or has anything to do with the patches or #3583. I don't know why and/or when this happens, but when it does, the Suricata process can't be killed or restarted, and in this state it creates 100% load on every core. Only a hard reset of the machine helps. |
|
Please keep this on topic for performance issues with flowd_aggregate. I can understand that change is scary, but we operate under the assumption that we have to move away from Python 2 and have a rewrite in Python 3 because there is no alternative and I don't think this point has been stressed enough. The alternative to behavioural changes (albeit non-functional) is removing the feature completely. Just for perspective. The three patches will be in 19.7.1. If the situation improves with 19.7.1 we would like to close this ticket and focus on functional work. Cheers, |
pkg: illegal option -- f. What did I do wrong? |
|
I did a truss(1) trace while the spike happened. Before it reads the file, I get 54104 occurrences of "ERR#9 'Bad file descriptor'". During this, the process doesn't consume 100%, but between 10% and 20% on my APU2. And after the final "Bad file descriptor": |
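The pile of EBADF errors suggests the reader may be holding a handle to a file that no longer exists under that name after rotation. One hypothetical way a tailer can detect this condition is to compare inode numbers; this is a sketch of the general technique, not what flowparser actually does:

```python
import os


def handle_is_stale(fh, path):
    """Return True when the open handle fh no longer refers to the
    file currently at `path`, i.e. the file was rotated away or
    deleted, so the tailer should reopen it."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return True
    return os.fstat(fh.fileno()).st_ino != on_disk.st_ino
```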
|
@JasMan78 sorry my mistake, it should have been : |
|
patches shipped in 19.7.1 as promised by @fichtner (#3587 (comment)) |

Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
[x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
[x] I have searched the existing issues and I'm convinced that mine is new.
Describe the bug
CPU utilization is noticeably higher after 19.1->19.7 upgrade than before, and the CPU graph shows weird triangles since then.
To Reproduce
Upgrade from 19.1 to 19.7, look through the admin page to explore new features, leave the VM running for a day.
Expected behavior
CPU utilization is constant and around the levels of 19.1.
Screenshots

Grafana:
Relevant log files
Running top shows extremely high python3.7 activity sometimes; for example, once I saw this below, but it's usually floating around 33-44%.
Environment
OPNsense 19.7 (amd64/OpenSSL)
VM with 2 vcores and 3GB RAM running on Proxmox: pve-manager/5.4-11/6df3d8d0 (kernel: 4.15.18-18-pve)
Network: Intel® I219-LM and I211