Please add gzip/bzip compression for delta files in rdiff #8

Open
pavel-odintsov opened this issue Jul 31, 2014 · 12 comments

Comments

@pavel-odintsov

Hello!

I tried to use the --gzip/--bzip flags for rdiff but got an error:

rdiff: ERROR: (rdiff_options) sorry, compression is not really implemented yet

For my data (VPS disks), compression shrinks the delta files dramatically:

source size: 4.6 GB, delta size: 2093.0 MB, compressed size: 223.0
source size: 14.8 GB, delta size: 2205.0 MB, compressed size: 998.7 MB

Thank you!

@pavel-odintsov
Author

I tried to compress the signatures, but it's really useless:

du -sh /root/rdiff_signatures_25_june/
20M /root/rdiff_signatures_25_june/

tar -cpzf /root/rdiff_signatures_25_june.tar.gz /root/rdiff_signatures_25_june/
ls -alh /root/rdiff_signatures_25_june.tar.gz
-rw-r--r-- 1 root root 19M Aug  1 00:41 /root/rdiff_signatures_25_june.tar.gz

tar -cpjf /root/rdiff_signatures_25_june.tar.bz2 /root/rdiff_signatures_25_june/
ls -alh /root/rdiff_signatures_25_june.tar.bz2
-rw-r--r-- 1 root root 19M Aug  1 00:41 /root/rdiff_signatures_25_june.tar.bz2

But compression for deltas is really useful, please add it :)
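
The 20M to 19M result above is what one would expect from files that are mostly hash values. A minimal sketch of the effect, assuming a made-up signature layout (truncated BLAKE2 digests, one per block, concatenated); `fake_signature`, the block length and the digest length are illustrative and not the real rdiff signature format:

```python
import hashlib
import zlib


def fake_signature(data: bytes, block_len: int = 2048, strong_len: int = 8) -> bytes:
    """Concatenate one truncated hash per block (a stand-in, NOT the rdiff format)."""
    out = bytearray()
    for i in range(0, len(data), block_len):
        block = data[i:i + block_len]
        out += hashlib.blake2b(block, digest_size=strong_len).digest()
    return bytes(out)


# 512 distinct blocks of 2048 bytes each; every block is highly repetitive.
data = b"".join(i.to_bytes(4, "big") * 512 for i in range(512))
sig = fake_signature(data)
packed = zlib.compress(sig, 9)

print(f"data:      {len(data)} bytes -> {len(zlib.compress(data, 9))} bytes deflated")
print(f"signature: {len(sig)} bytes -> {len(packed)} bytes deflated")
```

The input itself deflates very well, but the per-block hashes look like random bytes to the compressor, so the signature stays at roughly its original size.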

@sourcefrog
Contributor

You can just pipe it into gzip.

@pavel-odintsov
Author

Hello!

Thank you for the answer!

Yes, I use rdiff delta this way:

rdiff delta signature.dat data.dat - | pigz > signature.gz 

But out-of-the-box support for compressed deltas would be a nice feature.

@dbaarda
Member

dbaarda commented Oct 10, 2014

Note that rsync uses a modified zlib for delta compression: it feeds the matching data that is not included in the delta through the compressor to "prime" the compression data tables, and then throws away the compressed output for that matching data. In general this gives slightly better compression than just gzipping the resulting delta. For an example of how this can be done with an unmodified zlib, you can look at pysync http://minkirri.apana.org.au/~abo/projects/pysync.
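
As an illustration of the priming idea with a stock zlib, here is a minimal sketch using Python's `zlib` module. It is not rsync's or pysync's actual code: the class names `ContextCompressor` and `ContextDecompressor` and the sample data are made up for this example. It assumes both ends run the same zlib at the same level, so that the receiver can regenerate the discarded bytes by compressing the matching data itself.

```python
import zlib


def _feed(comp, data: bytes) -> bytes:
    # Z_SYNC_FLUSH ends the current deflate block on a byte boundary, so the
    # output for each chunk can be sent or thrown away independently.
    return comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)


class ContextCompressor:
    """Sender: compresses misses; hits only prime the compressor state."""

    def __init__(self, level: int = 6) -> None:
        self._comp = zlib.compressobj(level)

    def hit(self, data: bytes) -> None:
        _feed(self._comp, data)  # output discarded, history window kept

    def miss(self, data: bytes) -> bytes:
        return _feed(self._comp, data)


class ContextDecompressor:
    """Receiver: mirrors the sender's compressor to rebuild discarded output."""

    def __init__(self, level: int = 6) -> None:
        self._mirror = zlib.compressobj(level)
        self._decomp = zlib.decompressobj()

    def hit(self, data: bytes) -> None:
        # Recreate the bytes the sender threw away and feed them to inflate.
        self._decomp.decompress(_feed(self._mirror, data))

    def miss(self, packed: bytes) -> bytes:
        data = self._decomp.decompress(packed)
        _feed(self._mirror, data)  # keep the mirror in step with the sender
        return data


# Hypothetical sample: the "hit" is data both sides already have.
hit_data = b"".join(
    b"user%04d:x:%d:%d::/home/user%04d:/bin/bash\n" % (i, i, i, i) for i in range(200)
)
miss_data = b"user9999:x:9999:9999::/home/user9999:/bin/bash\n"

sender = ContextCompressor()
sender.hit(hit_data)
wire = sender.miss(miss_data)

receiver = ContextDecompressor()
receiver.hit(hit_data)
restored = receiver.miss(wire)

print(len(wire), "bytes with context,", len(zlib.compress(miss_data)), "bytes without")
```

The trade-off in this sketch is extra CPU on both sides (hits are compressed even though nothing is sent for them), plus the requirement that both compressors produce byte-identical output.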

@dbaarda
Member

dbaarda commented Oct 17, 2017

I'm considering tackling this next. Either that or Rabin-Karp rollsums... whichever people prefer.

Note that signature files, being collections of hash values, probably don't compress at all well unless they have long runs of identical blocks. I'm planning to only add compression to the deltas, with optional "context compression" support (which compresses hits as well as misses to prime the compressor with context from matching blocks).

@yxj1992

yxj1992 commented Nov 28, 2017

I have set cmake -D ENABLE_COMPRESSION=ON .,
but it doesn't work; I still get 'ERROR: (rdiff_options) sorry, compression is not really implemented yet'.
Can anyone tell me why?

@dbaarda
Member

dbaarda commented Feb 11, 2018

yxj1992: because that feature hasn't been implemented yet.

@dbaarda
Member

dbaarda commented Aug 23, 2019

FTR, I found a good comparison between gzip, bz2, and xz here:

https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/

Looking at this, xz is the clear winner on compression ratio, but its gain over gzip is a case of diminishing returns compared with the gain of gzip over no compression at all. The clear winner on speed (i.e. CPU) is gzip, particularly for decompression. bz2 beats gzip on compression but is not as good as xz, and it pays a nasty speed/CPU price.

For librsync's application, I feel gzip is the winner, and there's not much point in implementing support for xz or particularly bzip2.
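
For anyone who wants to check the ratio/speed trade-off on their own delta files, a rough sketch using only Python's standard library (`zlib` for gzip-style deflate, `bz2`, and `lzma` for xz). The `compare` helper and the synthetic sample are assumptions for illustration; real deltas will behave differently, and the timings depend on the machine.

```python
import bz2
import lzma
import time
import zlib


def compare(data: bytes) -> dict:
    """Return {codec name: (compressed size, seconds)} at default settings."""
    codecs = {
        "gzip (zlib)": zlib.compress,
        "bz2": bz2.compress,
        "xz (lzma)": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        results[name] = (len(packed), time.perf_counter() - start)
    return results


# Synthetic stand-in for a delta file; replace with open(path, "rb").read().
sample = b"".join(b"block %06d: some repetitive disk content\n" % i for i in range(5000))

results = compare(sample)
for name, (size, seconds) in results.items():
    print(f"{name:12s} {size:8d} bytes  ratio {size / len(sample):.3f}  {seconds * 1000:.1f} ms")
```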

@sourcefrog
Contributor

sourcefrog commented Aug 23, 2019 via email

@dbaarda
Member

dbaarda commented Jun 4, 2020

I looked at zstd. I agree it looks like the best/only solution needed for compression. There is a Debian libzstd-dev package we can just build/link against.

On how to best implement this, I think this is blocked on implementing a hit/miss callback api for deltas as described in #197. That API would make it much easier to implement different kinds of delta output formats including things like compression and whole-file-checksums, as different callback implementations.
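
To make the idea concrete, here is a purely hypothetical sketch of what a hit/miss callback interface could look like, written in Python for brevity. None of these names (`ListSink`, `CompressingSink`, `emit_delta`, `apply_delta`) come from librsync or from #197, and the matcher is a toy that only finds blocks at aligned offsets, with no rolling checksum. The point is only that compression becomes one more callback implementation wrapped around another.

```python
import zlib


class ListSink:
    """Callback implementation that records delta commands in memory."""

    def __init__(self):
        self.commands = []

    def hit(self, pos, length):
        self.commands.append(("copy", pos, length))

    def miss(self, data):
        self.commands.append(("literal", data))


class CompressingSink:
    """Wraps another sink and deflates literal data before passing it on."""

    def __init__(self, inner, level=6):
        self.inner = inner
        self._comp = zlib.compressobj(level)

    def hit(self, pos, length):
        self.inner.hit(pos, length)

    def miss(self, data):
        # Sync flush so each literal can be inflated as soon as it arrives.
        packed = self._comp.compress(data) + self._comp.flush(zlib.Z_SYNC_FLUSH)
        self.inner.miss(packed)


def emit_delta(basis, new, sink, block_len=4):
    """Toy matcher: looks up aligned blocks of `new` in an index of `basis`."""
    index = {}
    for i in range(0, len(basis) - block_len + 1, block_len):
        index.setdefault(basis[i:i + block_len], i)
    for i in range(0, len(new), block_len):
        block = new[i:i + block_len]
        pos = index.get(block)
        if pos is not None:
            sink.hit(pos, len(block))
        else:
            sink.miss(block)


def apply_delta(basis, commands, decode=lambda literal: literal):
    """Rebuild the new file; `decode` undoes whatever the sink did to literals."""
    out = bytearray()
    for command in commands:
        if command[0] == "copy":
            _, pos, length = command
            out += basis[pos:pos + length]
        else:
            out += decode(command[1])
    return bytes(out)


basis = b"aaaabbbbccccdddd"
new = b"bbbbXXXXddddaaaaYY"

plain = ListSink()
emit_delta(basis, new, plain)

packed = ListSink()
emit_delta(basis, new, CompressingSink(packed))

print(plain.commands)
print(apply_delta(basis, packed.commands, decode=zlib.decompressobj().decompress))
```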

@dbaarda
Member

dbaarda commented Aug 3, 2021

In #209 I've started experimenting with adding a pure callback API with simple hooks for adding pre and post processors for the input and output data. This API will make it much easier to add compression, so I'm not going to tackle this until after that API is implemented.

@dbaarda
Member

dbaarda commented Aug 26, 2021

Looking again at this, lz4 is probably also worth supporting because it's much smaller and faster. Its compression ratio is not as good, but it's still pretty good. It's also a much smaller library to link and (I suspect) has a smaller memory footprint, so it's probably better for embedded uses.

Related to this, lz4 uses xxHash for its checksums. It's not considered cryptographic, but it does seem to be very good and very fast, particularly compared to blake2, which we are using.

Looking at a modern rsync, it seems to support the following:

$ rsync --version
rsync version 3.2.3 protocol version 31
Copyright (C) 1996-2020 by Andrew Tridgell, Wayne Davison, and others.
Web site: https://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, hardlink-specials, symlinks, IPv6, atimes,
batchfiles, inplace, append, ACLs, xattrs, optional protect-args, iconv,
symtimes, prealloc, stop-at, no crtimes
Optimizations:
SIMD, asm, openssl-crypto
Checksum list:
xxh128 xxh3 xxh64 (xxhash) md5 md4 none
Compress list:
zstd lz4 zlibx zlib none

Note that it doesn't use blake2, and hence doesn't have a decent cryptographic hash. However, rsync does have a whole-file checksum and random seeds, making it harder to attack.

There is also blake3 coming out which is faster than blake2, but doesn't seem to have hit v1.0 yet.

I think it's worth adding support for xxh128, zstd, and lz4, with compiler options for enabling/disabling any hash or compression options for people who want the smallest binary possible.
