feature: Change compression algorithm for memory filesystem Netflow backups #5278

Closed · 2 tasks done
g-a-c opened this issue Oct 14, 2021 · 11 comments
Labels: help wanted (Contributor missing / timeout)

Comments

@g-a-c

g-a-c commented Oct 14, 2021


Is your feature request related to a problem? Please describe.

I run OPNsense with a memory filesystem on /var (to avoid constant SSD writes), which collects Netflow data for use with Insight. Part of this is the backup system, which on shutdown (or periodically) backs that data up to real storage to make it persistent.

My Netflow data now runs into the gigabytes (currently approximately 1.6GB, but it has been larger), and rebooting the firewall appliance takes an increasingly long time, which I've narrowed down to the backup step. The script that backs up Netflow data uses gzip compression with the default settings, and on my appliance it currently takes over 90 seconds with this 1.6GB of source data. The script can take so long that the UI refreshes and the box gives the impression it has rebooted, when the backup script is actually still running in the background and the appliance has yet to go down.
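
For reference, that timing can be reproduced by running the backup hook by hand (a sketch only; it assumes the stock hook path, which also appears in the diff further down, and note the hook stops and restarts flowd_aggregate while it runs):

/usr/bin/time -h /usr/local/etc/rc.syshook.d/backup/20-netflow   # runs a backup and prints the elapsed time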

I believe this is a similar problem to #2876 which was reported several years ago and resulted in a way to just turn off the backups completely.

Describe the solution you'd like

I compared several of the compression algorithms available in the OPNsense version of bsdtar, including gzip (default level), gzip -1, bzip2, lz4 and zstd, to see if there was a better trade-off between compression time and archive size.

| Algorithm | Real time (compression) | Real time (decompression) | Size |
| --- | --- | --- | --- |
| bzip2 (default) | 306s | | 225M |
| gzip (default) | 96s | 7s | 278M |
| gzip -1 | 29s | 8s | 320M |
| zstd -1 | 14s | 9s | 296M |
| lz4 (default) | 11s | | 416M |

It wasn't exactly a scientific test (more of a for-loop running different combinations on my Netflow data, compressing to the internal SSD and decompressing to /dev/null), but it looks like using zstd compression at level 1 achieves almost as small a Netflow archive as the current gzip default level, but in a fraction of the time.
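
For illustration, a rough sketch of that kind of loop (the output path, source directory and option strings below are placeholders, not the exact commands used to produce the table above):

#!/bin/sh
# Rough benchmark sketch (illustrative): time compression and decompression of
# the Netflow directory with a few bsdtar filters.
SRC="./var/netflow"            # source data, relative to /
OUT="/root/netflow-bench.tar"  # written to the internal SSD

for OPTS in "--gzip" \
            "--gzip --options gzip:compression-level=1" \
            "--zstd --options zstd:compression-level=1" \
            "--lz4" \
            "--bzip2"; do
    echo "== ${OPTS} =="
    time tar -C / ${OPTS} -cf "${OUT}" "${SRC}"   # compression time
    ls -lh "${OUT}"                               # archive size
    time tar -xOf "${OUT}" > /dev/null            # decompression time, extracting to stdout
    rm -f "${OUT}"
done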

I discounted lz4 because its archives are considerably larger than the current default, bzip2 because it is considerably slower, and gzip -1 because it isn't better than zstd on either time or size. Switching to zstd appears to have considerable upsides going into a reboot and a negligible downside coming out of it afterwards. For anyone who has periodic backups enabled every hour, this should also make that periodic job run much quicker, which means less missing data, since the backup script stops the flowd_aggregate service while it is running.

Since the compression algorithm is detected automatically by the tar extract step, an existing gzip archive could still be restored by the same script that now generates zstd archives, because the decompression algorithm doesn't have to be specified inside the script.
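
A quick way to see this (illustrative paths only; the archives here are throwaway test files):

# Illustrative only: bsdtar detects the compression filter when reading, so the
# same extract invocation restores either a gzip or a zstd archive.
tar -C / -czf /tmp/netflow-old.tgz ./var/netflow
tar -C / --zstd --options zstd:compression-level=1 -cf /tmp/netflow-new.tar.zstd ./var/netflow

mkdir -p /tmp/restore-test
tar -C /tmp/restore-test -xpf /tmp/netflow-old.tgz        # no -z needed; format detected on read
tar -C /tmp/restore-test -xpf /tmp/netflow-new.tar.zstd   # same command handles zstd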

--- /tmp/20-netflow	2021-10-14 09:34:21.895198000 +0100
+++ /usr/local/etc/rc.syshook.d/backup/20-netflow	2021-10-14 09:37:15.811900000 +0100
@@ -1,9 +1,11 @@
 #!/bin/sh

 BACKUPDIR="/var/netflow"
-BACKUPFILE="/conf/netflow.tgz"
+BACKUPFILE="/conf/netflow.tar.zstd"
 BACKUPOFF="<netflowbackup>-1</netflowbackup>"

+TAROPTIONS="--zstd --options zstd:compression-level=1"
+
 # find out if the service is running and stop/start it only if needed
 RCSCRIPT="/usr/local/etc/rc.d/flowd_aggregate"
 if ! ${RCSCRIPT} status > /dev/null; then
@@ -15,13 +17,13 @@
 elif [ "${1}" = "restore" ]; then
 	if [ -f "${BACKUPFILE}" ]; then
 		${RCSCRIPT} stop
-		tar -C / -xzpf "${BACKUPFILE}"
+		tar -C / -xpf "${BACKUPFILE}"
 		${RCSCRIPT} start
 	fi
 else
 	if [ -d "${BACKUPDIR}" ]; then
 		${RCSCRIPT} stop
-		tar -C / -czf "${BACKUPFILE}" ."${BACKUPDIR}"
+		tar -C / ${TAROPTIONS} -cf "${BACKUPFILE}" ."${BACKUPDIR}"
 		${RCSCRIPT} start
 	fi
 fi

Describe alternatives you considered


Additional context

These figures were obtained from an Atom C2558 firewall appliance with 16GB of DDR3 RAM and an internal mSATA SSD of some sort (I don't actually know the model number). I don't have a way to check whether these figures are reproducible on other similar low-powered appliances such as the APU series.

g-a-c changed the title from "Change compression algorithm for memory filesystem Netflow backups" to "feature: Change compression algorithm for memory filesystem Netflow backups" on Oct 14, 2021
@OPNsense-bot

This issue has been automatically timed out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

OPNsense-bot added the help wanted (Contributor missing / timeout) label on Apr 12, 2022
@TheHellSite

Has this ever been changed?

I never noticed this issue until recently.
For me the Netflow backup already takes more than two minutes, which is kind of annoying. (Ryzen 3 4350G + SSD)

Another improvement could be the ability to configure log rotation for all of the reporting data (netflow, rrd, ...), so that only X days of that data are kept.

@AdSchellevis
Member

When using a memory disk, these files should not be in there anymore; the backup itself can safely be disabled in "System: Settings: Miscellaneous".

@TheHellSite

What do you mean by memory disk?
Do you mean that the backup is only needed if I configured the logs to be saved in RAM?

@AdSchellevis
Member

You can also use the backup/restore on disk, but usually it's the same disk so it doesn't hold much value (and may cause the machine to take a long time shutting down or starting up). We changed the defaults a long time ago in f1ea003.

@TheHellSite

Okay, so the compression algorithm is still gzip?

And disabling all periodic backups or the Netflow backup at "System: Settings: Miscellaneous" will lead to loss of the aggregated data at each reboot?
"This will periodically backup the NetFlow data aggregation so it can be restored automatically on the next boot."

@AdSchellevis
Member

Okay, so the compression algorithm is still gzip?

yes

And disabling all periodic backups or the Netflow backup at "System: Settings: Miscellaneous" will lead to loss of the aggregated data at each reboot?

no

@TheHellSite

Ah, then I misunderstood the periodic backup function, or maybe the description label is misleading.

Sorry for bothering, but I would really like to know exactly how this works; you can also just point me to some documentation if it exists.
So why was that periodic backup task there in the first place?
Is it just to speed up the backup of the entire configuration?

@AdSchellevis
Member

To be honest, the documentation on this subject is a bit light, but you certainly don't need the archives when content is stored on disk. Only for DHCP does it more or less make sense, if you're reinstalling the device and want to keep your leases (reinstalling with config import can use these backups).

So, long story short, just stick to our new defaults and you should be good.

@g-a-c
Author

g-a-c commented May 11, 2023

So why was that periodic backup task there in the first place?

As the opener, I'm 99% sure that when I opened this, the option was to place all of /var on a RAM disk, which included the /var/netflow directory that can get quite large. And because it contained data I wanted to keep, I enabled the backups and hit the issue I described in the first post (backups taking too long).

As of right now on my 23.1.7 box, it seems the option has been reduced to /var/log on a RAM disk, which means the other directories are no longer "temporary" but stored on the disk, so they don't need to be backed up because they will no longer be lost on reboot. Ad Schellevis has pointed out one use case where the backups still make sense, but for the most part they probably don't.

@AdSchellevis
Member

@g-a-c yes, that's indeed the case. Full memory disk (sd/cf) installs are less common nowadays, and we changed the behavior to improve stability as well.
