
Research/quantify performance envelopes of multiple CDC algorithms #227

Open · 10 of 24 tasks
ribasushi opened this issue Dec 4, 2019 · 12 comments

ribasushi commented Dec 4, 2019

  • oI 95% Assemble corpuses of data from various prior performance research initiatives ( both within and outside of PL )
    • 💯 Enumerate/obtain test datasets
    • 90% Document rationales for the test datasets
    • 95% Publish all of the above as plain HTTP + IPFS pinned download
  • oI 85% Document prior art, motivation and precise scope and types of sought metrics
    • 💯 Solicit/assemble feedback from various stakeholders
    • 💯 Collect/determine relevance of existing academic research into chunking ( 14 distinct papers selected for evaluation )
    • 💯 Convert the pre-PL chunk-tester to proper multi-streaming, to dramatically lower the cost of experiments ( aiming at about 500 megabyte/s stream processing; with the correct implementation and hardware, about 3.5GiB/s standard ingestion 🎉 )
    • 80% Generate a few preliminary datapoints to aid in understanding the goal/scope
    • 90% In depth study/evaluation/application of findings from above works
    • 💯 Understand and reuse existing go-ipfs implementations of CDCs ( Rabin + Buzzhash ) in a simpler go-ipfs independent utility, allowing rapid retries of different parameters
    • 💯 Same as above but pertaining to linking strategies ( trickle-dag etc ), as ignoring the link-layer of streams skews the results disproportionately
    • 98% ( subsumes a large portion of points below v0.1 ETA: DEMO AT TEAM-WEEK ) Fully implement a standalone CLI utility re-implementing/converging with go-ipfs on all above algorithms. The distinguishing feature of said tool is the exposure of each chunker/linker as an atomic, composable primitive. The UX is similar to that of ffmpeg whereby an input stream is processed via multiple "filters", with the result being a stream of blocks with a statistic on their counts/sizes plus a valid IPFS CID. Current remaining tasks:
      • 💯 Profile/optimize baseline stream ingestion, ensure there is no penalty from applying a "null-filter", which allows one to benchmark a particular hardware setup's theoretical maximum throughput
      • 💯 Finalize the "stackable chunkers" UI/UX, allowing effortless demonstration of impact of such chunker chains on the
      • 💯 Adjust statistics compilation/output for the above ( it currently looks like this, ignoring various "filter-levels" )
      • 💯 Make final pass on memory allocation profile and fixup obvious low hanging fruit before v0.1
      • 80% README / godoc / stuffz
    • 80% Rewrite previously utilized plotly.js-based visualiser to aid with the above point
  • oI Open document to a short discussion soliciting feedback from workgroups
  • oII Perform a number of "brute force" tests aiming at reproducible results ( utilizing https://github.com/ipfs/testground; for the purposes of what we are trying to quantify, iptb will be sufficient )
  • oII ( half-covered by initial writeup ) Convert raw results into multi-dimensional scatter-plot visualizations ( plotly.js )
  • oIII Combine all available results into a "compromise chunking settings" RFC document
  • oIV Publish the results for discussion and decision of the level of incorporation into IPFS implementations ( default parameters, use of selected algorithm by default, etc )
@DonaldTsang

My ideas for IPFS challenge:

aidanhs commented Dec 23, 2019

Hi! I've been reading up on IPFS recently and stumbled across a mention of chunking, which got me interested because they're sort of a side hobby (manifesting in a repo I occasionally maintain to compare rolling checksum implementations - https://github.com/aidanhs/rollsum-tests). Searches led me to ipfs-inactive/archives#137, then ipfs/go-ipfs-chunker#18 and now here!

I don't know how interesting/useful my input may be, so I'll just make a series of comments - happy to elaborate on any, or feel free to ignore 😄

1 - speed of Rabin

There's a lot of discussion of the current rabin implementation being slow. I've not seen quantification of this, other than ipfs-inactive/archives#137. Unfortunately the discussion is a bit muddled but there's mention of a 16 hour run for 61GB (modarchive). This number isn't credible to me as being caused by Rabin - the benchmarking I've done (results available on my repo) indicates that the IPFS Rabin with a 256KB average chunk (IPFSRA256) takes ~8s for a ~1.5GB file.

Don't get me wrong, there's improvement to be had! With my SSD and a clean cache I can tar a 9GB directory of ~90k files and send to /dev/null in 10s (or /dev/zero in 1min10s) - it'd be nice to have something on a par. But I'd be interested to see measurements indicating it's a limiting factor.

2 - purpose of CDC

Being a newcomer to IPFS I'm probably missing context (having just read a bunch of issues), but there seems to be quite a high expectation of chunking on deduplication and the like.

I personally would expect it to deduplicate A) files that are identical and B) 'items' within other files (.tar, .a, disk images, maybe object sections within executables)...and fixed size chunking already does A! So I would expect essentially no space saving for everything in the comment above, apart from the Linux repos.

A different way to look at it is that CDC "avoids the pathologically poor cases of fixed size chunking and handles archives well", rather than making things an order of magnitude better in the general case. I've seen a number of issues about format-specific chunking which sounds awesome, but is orthogonal to (and possibly partially in conflict with) CDC. To be clear, I'm a huge fan of CDC (you can probably tell), but I do think it's important to set expectations.

3 - chunking configuration and algorithm choice

Based on the above, we want chunks on average a fair amount smaller than the typical file size...otherwise, in case B the chunks will span multiple embedded items and be dependent on their ordering in the container!

IPFS currently uses 512KB - this is pretty large for CDC (attic uses 64KB, anything using bupsplit (perkeep, bup) uses 8KB). Based on the above, it would then be best suited for files approx >2MB (small files would still work, just wouldn't benefit). I realise there may be other influencing factors, e.g. IPFS performance with small files.

For the algorithm itself, you'd like something fast! But there's more than that - it'd also be nice to have something that has a small standard deviation of chunk sizes. If there's a large standard deviation you're more regularly hitting the min or max limits and losing the benefits of CDC. The current IPFS buzhash relies heavily on the chunk limits - it seems to actually be configured for 128KB splits but the aggressive limits raise the average to 256KB. My current results don't show this, but it's basically on a par with Gear from rsroll (RSROGE256) for standard deviation. bupsplit modified to target 256KB chunks is a fair amount better, but I've not yet done the analysis to figure out what that 'better' means.

4 - final misc notes

  • bupsplit with 8KB would be my default 'easy' recommendation - there are multiple language implementations to verify against, it's being used by multiple people and it has a decent heritage (from rsync)
    • the best bupsplit implementations take ~2.2s per GB, rather than ~1.2s for buzhash - this feels pretty marginal to me, and given how well you've optimised buzhash I feel you can equal the best bupsplit implementations
  • FastCDC is mentioned in (WIP) records + merkledag specs #7. I've not yet had a chance to investigate it but I've seen it described as an improvement on Gear and would probably be what I'd go for if I really needed to get that edge over bupsplit
  • I strongly recommend going with something that has an existing competing implementation so you can feed PB of data in and see if they diverge. When I was implementing Rabin a long time ago, every single implementation out there bar one was incorrect (matching the experience of Rabin Fingerprinting notes#1 (comment))

(cc ipfs/go-ipfs-chunker#13 which mentions bup)

@DonaldTsang

@aidanhs if it is possible, please read the lists in gilbertchen/benchmarking#13 and gilbertchen/benchmarking#14 for a consensus on chunking; I would like to know which chunk size is the most common.

ribasushi (Contributor, Author) commented Mar 23, 2020

For those following along: this weekend I finally managed to arrange incoming streams in a way that allows for building trees in parallel, while still being able to spit out a DAG 100% converging with go-ipfs. This, combined with ipfs/kubo#7011, will finally allow for the next stage of this endeavor: namely, quantifying how much pressure varying chunk sizes can exert on the bitswap subsystem, and how this translates into end-user experience.

The tool itself should show up in its final version very early in April 🤞

Some very rough / unscientific numbers taken from a 4-core i7-4770HQ CPU @ 2.20GHz MacBook Pro (Retina, 15-inch, Mid 2015) ingesting the 5GiB test datastream

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time ../go-ipfs/cmd/ipfs/ipfs add -nq --upgrade-cidv0-in-output 
bafybeicby6ax3dbkxolfypowaizkhmbkrd4xzpkmmoez2wsf6jidso7fhi
       21.68 real        19.94 user         1.36 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--upgrade-cidv0-in-output" --no-stats --single-threaded
{ "substream":      0, "size":5368709120, "CID":"bafybeicby6ax3dbkxolfypowaizkhmbkrd4xzpkmmoez2wsf6jidso7fhi" },
       19.51 real        16.54 user         1.17 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--upgrade-cidv0-in-output" --no-stats
{ "substream":      0, "size":5368709120, "CID":"bafybeicby6ax3dbkxolfypowaizkhmbkrd4xzpkmmoez2wsf6jidso7fhi" },
        6.21 real        29.79 user         1.73 sys
admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time ../go-ipfs/cmd/ipfs/ipfs add -nq --cid-version=1 --trickle
bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe
       19.50 real        17.37 user         1.23 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats --single-threaded
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
       19.47 real        16.47 user         1.16 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats 
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
        6.14 real        29.20 user         1.68 sys

@ribasushi (Contributor, Author)

If one moves to proper OS/hardware ( Linux / AMD Ryzen 7 3700X 8-core )

root@rescue ~/DAGger # zstd -qdck test/data/repeat_5GiB.zst | time -p bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
real 2.24
user 6.58
sys 1.40

with zstandard out of the way ( it tops out at about 3GiB/s )

root@rescue ~/DAGger # cat repeat_5GiB.decompressed | time -p bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
real 2.03
user 6.40
sys 1.33

And with the pipe out of the way ( pipe buffers max out at 16MiB on Linux )

root@rescue ~/DAGger # time -p bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats < repeat_5GiB.decompressed 
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
real 1.36
user 6.06
sys 0.65

Not too bad 😈

dbaarda commented May 29, 2020

  • bupsplit with 8KB would be my default 'easy' recommendation - there are multiple language implementations to verify against, it's being used by multiple people and it has a decent heritage (from rsync)

As one of the maintainers of librsync who wrote one of the early "bupsplit" implementations many people have copied, I've recently introduced a RabinKarp polyhash as the new default instead of the "bupsplit" rollsum, because my analysis showed "bupsplit" had terrible hash distribution and collisions, particularly for small block/window sizes (<16K) and ASCII data, and particularly in the low "s1" end of the digest. Using RabinKarp makes calculating signatures about 10% more expensive, but makes calculating deltas about 10% faster because of significantly fewer rollsum hash collisions, and calculating deltas is the expensive operation (for 1GB files it adds about 1 sec to calculating sigs and saves more than 10 secs calculating deltas).

See https://github.com/dbaarda/librsync-tests/blob/master/RESULTS.rst for details.

The poor distribution probably doesn't matter quite as much for chunking, but with the tiny windows used for chunking (32 bytes?) and checking the LSB bits of the digest, you may be hitting the degenerate worst-case behavior for bupsplit. This would probably manifest as a chunk-size distribution that varies considerably with different file types (ASCII vs random) and doesn't reflect the target average block size.

I also found cyclic-poly AKA buzhash had terrible collision rates for ASCII data... again, probably not something that matters for chunking.

See the following for details. https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst

dbaarda commented Feb 17, 2021

FTR I've done some more testing/analysis of chunker algorithms after reading the FastCDC paper and got some surprising results here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Also, for a definitive summary of everything related to IPFS deduplication, read this:

https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda

dbaarda commented May 29, 2021

Another update: I've added "Regression Chunking" from a Microsoft paper to my chunking analysis; it has an interesting technique for reducing the effects of truncation at the max_len.

I also added tests/analysis of Gear and small sliding windows used for chunking to my rollsum analysis.

@safinaskar

I wrote a deduplicator which is 4x faster when storing and 2x faster when extracting than other solutions. But my deduplicator doesn't do any CDC; it simply splits data into fixed-size chunks. Such great speed was achieved thanks to blake3 and rayon ( https://crates.io/crates/rayon ). I also benchmarked various deduplicators; see this thread: borgbackup/borg#7674, especially the last comment with the newest benchmark and my newest deduplicator: borgbackup/borg#7674 (comment). Yes, I know that you are interested in CDC, but I still think you can take some ideas from this.

dbaarda commented Jul 28, 2023

Of course a fixed-size chunker is going to be fast, and on VM images it will probably even give you OK de-duplication. This is because VM images are filesystems that are arranged into fixed-size blocks containing files that are mostly binaries, libraries, and compressed data. These kinds of files are either identical or completely different, so they have either all or none of the same blocks on disk, and a fixed-size chunker can easily find them.

However, a fixed-size chunker fails very badly on many other simple cases. Take a big file, add one byte at the start, and a fixed-size chunker will get 0% de-duplication. Or tar all the files on those VM images so they are no longer nicely aligned to fixed-size disk blocks, and you get nearly 0% de-duplication, etc.

@safinaskar

I added a comparison between CDC-based tools: borgbackup/borg#7674 (comment).

Note that casync, desync and borg all use buzhash, so they use exactly the same algorithm (buzhash + same chunk size + zstd level 3). And yet they deliver very different performance. This means that some of these tools greatly under-perform. Make sure not to repeat these problems in IPFS.

Also note that borg is single-threaded and yet it beats the parallel desync. This means that desync does something horribly wrong. Make sure not to repeat this in IPFS!

Possibly the whole thread contains some important info.

@safinaskar

Okay, so here is the list of GitHub issues I ~~spammed~~ wrote in the last few days on this topic (i.e. fast fixed-size and CDC-based deduplication). I hope they provide great insight to everyone interested in fast deduplicated storage.
borgbackup/borg#7674
systemd/casync#259
folbricht/desync#243
#227
dpc/rdedup#222
opencontainers/umoci#256
