Feature Request: Include a list of file extensions that are known to be compressed to be excluded from compression #2190

Closed
ptandler opened this issue Jul 13, 2022 · 5 comments
@ptandler

I'd assume that in most cases there is little to gain from compressing an already compressed file a second time when storing it in the repository.

So, to ease setup, I think it would be helpful to include a list of file extensions that are known to be compressed, in such a way that they can easily be excluded from compression by a policy.

Actually, I'm not sure what would be the best way to do this. Ideas:

  • Store this list internally and just add another boolean flag to the policy. Hm, probably not the preferred way, as it is not obvious what will be excluded and the list cannot be tweaked to custom needs.
  • Kopia UI could include a button in the policy editor that adds these extensions to a policy, but I have no idea how this could be done for the CLI version.

Anyway, this is the list I'm currently using:

jpg
jpeg
png
svgz
mp3
m4a
ogg
avi
mov
mp4
m4v
mpg
mpeg
ogv
7z
zip
gz
xz
zst
zstd
bz2
tgz
rar
jar
war
webm
webp
docx
pptx
xlsx
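
For anyone who wants to apply a list like this from the CLI, here is a minimal sketch that only builds the command line and does not run anything. It assumes that `kopia policy set` accepts a repeatable `--add-never-compress` flag and that extensions are written with a leading dot; both should be checked against `kopia policy set --help` for your version. `never_compress_command` and `KNOWN_COMPRESSED` are hypothetical names used for illustration.

```python
# Sketch only: builds a `kopia policy set` command line from the list above.
# Assumptions (verify with `kopia policy set --help`):
#   * the flag is spelled --add-never-compress and may be repeated
#   * extensions are expected with a leading dot (toggle with leading_dot)

KNOWN_COMPRESSED = [
    "jpg", "jpeg", "png", "svgz", "mp3", "m4a", "ogg", "avi", "mov", "mp4",
    "m4v", "mpg", "mpeg", "ogv", "7z", "zip", "gz", "xz", "zst", "zstd",
    "bz2", "tgz", "rar", "jar", "war", "webm", "webp", "docx", "pptx", "xlsx",
]


def never_compress_command(target, extensions, leading_dot=True):
    """Return an argv list for the given policy target; nothing is executed."""
    argv = ["kopia", "policy", "set", target]
    prefix = "." if leading_dot else ""
    for ext in extensions:
        # Normalise case and drop any dot the caller already supplied.
        ext = ext.lower().lstrip(".")
        argv.append("--add-never-compress=" + prefix + ext)
    return argv


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run by hand.
    print(" ".join(never_compress_command("--global", KNOWN_COMPRESSED)))
```

Printing the command instead of executing it keeps the list visible and editable, which is the main concern with a hidden built-in list.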
@Spellshaper

In my experience, Kopia performance does not really improve when skipping already compressed data unless you choose a deliberately CPU-intensive compression setting. In that case I have observed that at least png, jpg and (non-encrypted) zip can benefit from a second round of compression.
I would also count docx, pptx and xlsx as zip, because that's what they are: structured XML documents inside a zip archive with a different extension.
Choosing any flavor of S2 compression should already skip compression for blocks that do not benefit from it, if I'm reading the S2 docs right.
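
To illustrate the general idea of skipping blocks that do not benefit, here is a rough sketch of a "compress, then keep the result only if it is smaller" check. It uses zlib from the Python standard library as a stand-in and is not how S2 or Kopia implement this internally; `maybe_compress` and the 5% threshold are made up for the example.

```python
import os
import zlib


def maybe_compress(block, min_saving=0.05, level=1):
    """Return (stored_bytes, was_compressed).

    The block is compressed with a cheap setting and the result is kept only
    if it is at least `min_saving` (a fraction) smaller than the input.
    """
    candidate = zlib.compress(block, level)
    if len(candidate) <= len(block) * (1 - min_saving):
        return candidate, True
    # Not worth it: store the original bytes unchanged.
    return block, False


if __name__ == "__main__":
    text = b"structured xml documents inside a zip archive\n" * 2000
    noise = os.urandom(64 * 1024)  # stand-in for already compressed data
    for name, data in (("text", text), ("noise", noise)):
        stored, compressed = maybe_compress(data)
        print(name, len(data), "->", len(stored), "compressed:", compressed)
```

With a check like this, an extension list matters less for repository size, but the CPU time for the compression attempt is still spent.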

In conclusion, I am not sure a pre-included list like this would be of benefit on average, whether applied by default or not.

@ptandler
Author

These are interesting findings I wouldn't have expected!

My main reason for excluding already compressed files is to save CPU - especially on some older notebooks I still have around running with Linux - and backup size is not the main focus.

But since you mention compression (a bit off-topic, but still related): it's also quite challenging to select an appropriate compression method. You can select from 18 different settings in kopia (2 of them are marked as "not recommended", which is helpful). With respect to UX, this is really overwhelming. If you're a computer scientist or admin, you have quite likely heard of the different algorithms, but I expect a "normal" user would simply have no idea what to choose.

I guess which option is best really depends on your data and your hardware:

<dream-mode>
It would be cool to have a benchmark or setup wizard included that runs all compression options (including which file types to exclude) against your snapshot data and provides a list with the resulting CPU usage and backup size ...

Well, maybe not test all options, but just some, depending on a setting with a slider ranging from "max speed" to "max compression" ...
</dream-mode>
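
A very rough sketch of what such a comparison could look like, using only the compressors that ship with Python (zlib, bz2, lzma). These are not Kopia's algorithms (zstd and S2 are not in the standard library), so the numbers would only show the shape of the trade-off, not Kopia's actual behaviour. `CODECS` and `benchmark` are hypothetical names.

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs from the standard library; not the options Kopia offers.
CODECS = {
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2-9": lambda d: bz2.compress(d, 9),
    "lzma": lzma.compress,
}


def benchmark(data, codecs=CODECS):
    """Return (name, compressed_size, cpu_seconds) rows, smallest output first."""
    rows = []
    for name, compress in codecs.items():
        start = time.process_time()  # CPU time of this process, not wall time
        out = compress(data)
        cpu_seconds = time.process_time() - start
        rows.append((name, len(out), cpu_seconds))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    sample = b"some representative snapshot data\n" * 50000
    for name, size, cpu_seconds in benchmark(sample):
        print(f"{name:8s} {size:10d} bytes {cpu_seconds:8.3f} s CPU")
```

Running it once on a sample of typical files and once on a sample of already compressed files would give a first impression of how much CPU an exclusion list could save.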

I have never heard of S2 (before I saw it in kopia) and in my experience zstd seems to be better both in terms of speed and compression than zip and gzip. How does S2 compare to zstd?

@Spellshaper

Spellshaper commented Jul 15, 2022

I see.
On very old systems CPU can be a bottleneck.

As a quick test, I ran kopia.exe benchmark compression --data-file=kopia.exe on a Windows Server 2016 KVM VM with 6 cores allocated, running on a Ryzen 5 2600.
[image of the benchmark results]
AFAIK, how many cores are used per file depends on the compression mechanism; most, however, use one.
Compression is only one of several CPU-intensive steps; there is also hashing and encryption. You might want to use the built-in benchmark for those as well (kopia.exe benchmark crypto), since old hardware might not have the necessary extensions for hardware acceleration of AES and SHA. It will tell you which combination is the fastest on your platform.
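
On Linux, a quick way to see whether the CPU advertises AES and SHA extensions is to look at the feature flags in /proc/cpuinfo. The sketch below assumes the usual flag names (`aes` and `sha_ni` on x86, `aes`, `sha1` and `sha2` on ARM); it only reports what the kernel lists and does not replace the crypto benchmark. The function names are made up for the example.

```python
def cpu_features(cpuinfo_text):
    """Collect feature flags from /proc/cpuinfo-style text.

    x86 kernels list them under "flags", ARM kernels under "Features".
    """
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            features.update(value.split())
    return features


def has_crypto_acceleration(cpuinfo_text):
    """Report whether AES and SHA instruction extensions are advertised."""
    features = cpu_features(cpuinfo_text)
    return {
        "aes": "aes" in features,
        # Assumed names: sha_ni on x86, sha1/sha2 on ARM.
        "sha": bool(features & {"sha_ni", "sha1", "sha2"}),
    }


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(has_crypto_acceleration(handle.read()))
    except OSError:
        print("no readable /proc/cpuinfo on this platform")
```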

@ptandler
Author

Oh, there is a benchmark command already! Oops, I overlooked it. Wow! Kopia is really amazing! Thanks a lot!

I'll try benchmarking once with excluded files, once without ... when I've got some spare time.

@github-actions github-actions bot added the stale label Jul 6, 2023
@github-actions

Closed due to inactivity. Re-open and remove "stale" label if it should remain open for an additional period of time

@github-actions github-actions bot closed this as not planned Jul 16, 2023