Better File Chunking

Within the IPFS stack/ecosystem, just as within computing as a whole, an uncompressed stream of untagged octets is a fundamental unit of exchange. As a general-purpose data storage system IPFS needs to handle an unbounded variety of content represented by such streams. Handling the maximum amount of this variety efficiently ( ideally by default ) would likely have an outsized impact on the future adoption of IPFS as a long-term data interchange medium/format.

The current stream-chunking options provided by the "official" IPFS content-adders are not particularly good. If left unchecked, the "evolutionary pressure" will inevitably lead to a proliferation of chunking algorithms within and beyond ProtocolLabs-controlled projects. This could be undesirable, as any iteration would result in radically different IPFS addresses, hampering user experience, e.g.:

Preventing users from recognizing identical content via simple "eyeballing a list of hashes"
Increased storage requirements due to failing de-duplication of identical content
Increased retrieval times due to high counts of not-already-present blocks
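
As a rough illustration of the address problem, the sketch below chunks the same bytes at two different fixed sizes and derives a root digest from the chunk digests. The `toy_root` helper is hypothetical and deliberately simplistic: it is not how IPFS builds a UnixFS DAG or a CID, it only shows that identical content yields unrelated roots once the chunking parameters differ.

```python
import hashlib


def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root(chunks: list[bytes]) -> str:
    """Hash the concatenated chunk digests.

    Assumption: a stand-in for a real Merkle-DAG root, NOT an actual CID.
    """
    leaf_digests = b"".join(hashlib.sha256(c).digest() for c in chunks)
    return hashlib.sha256(leaf_digests).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample content

root_256k = toy_root(fixed_chunks(data, 256 * 1024))
root_1m = toy_root(fixed_chunks(data, 1024 * 1024))

# Same content, different chunk size, different root.
print(root_256k == root_1m)
```

Two users adding the same file with different chunker settings would therefore not be able to match their results by comparing hashes, which is the "eyeballing" problem listed above.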

This session is mostly a "conversation starter" to properly map the available problem ( and solution ) spaces. I would strongly prefer to keep the deep dive light on specific technical details and instead focus on wider architectural / UX / game-theory-ish problems.

Goal

It would be great if towards the end of this session we have answers or maybe even consensus 🤞 regarding the following deceptively simple checklist:

Identify about 10 distinct realistic user-stories where IPFS could or already does play a central role.
Clearly identify pros and cons of chunking algorithm proliferation keeping the above list in mind ( spoiler alert: the author is strongly for reducing proliferation, but also recognizes that there are rather few effective tools for doing so )
Regardless of above outcome identify if there exists a low ( single-digit ) number of "general-purpose types of content" that could benefit from a single shared chunking strategy.
Identify whether the above number of algorithms could be reduced to 1 ( one ) which could then viably replace the current defaults.

List of various ( often conflicting ) prior-art discussions within Protocol Labs

Latest push for chunking standardization as part of UnixFSv2
Parameterizable-chunking proposal ( preliminary implementation available as part of ipfs-pack )
Musings on practical limitations of current "officially available" chunkers
Notes from ReproducibleBuildsSummit 2019
Collection of links on content-dependent-chunkers ( warnocked )
Chunking in the context of maximum block size
Inlining of small objects ( this is not directly related to chunking, but is relevant to smallest-chunk-size-limitations )

Several non-specialized chunking algorithm specs/implementations

FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
Ddelta/Gearhash ( pages 9 and 10 contain direct comparisons with Rabin-chunking )
bup "hashsplit" implementation ( this is especially notable for being side-compatible with vanilla git )
pigz implementation of --rsyncable

Team

@mib-kd743naq
@steven004

Presentation

🎤 Slides

Notes

Identify about 10 distinct realistic user-stories where IPFS could or already does play a central role

Usecases/problems

Ultra-high-quality video files are large, and the current default chunk size of 256k is rather small for this purpose.

Max quality 8k requires about 360 Mbps, which translates to needing about 180 blocks per second without accounting for buffering. Considering applications like VR, we are looking at an order of magnitude increase in packets-per-second requirements.
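
The block-rate estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes decimal megabits (1 Mbps = 1,000,000 bits per second) with no protocol overhead or buffering.

```python
def blocks_per_second(bitrate_mbps: float, block_size_bytes: int) -> float:
    """Blocks needed per second to sustain a stream at the given bitrate."""
    bytes_per_second = bitrate_mbps * 1_000_000 / 8
    return bytes_per_second / block_size_bytes


# 256k read as 256 KiB (262,144 bytes): roughly 172 blocks per second.
rate_kib = blocks_per_second(360, 256 * 1024)

# 256k read as 256,000 bytes: roughly 176 blocks per second.
rate_kb = blocks_per_second(360, 256_000)

# An order of magnitude more bandwidth, as in the VR scenario.
rate_vr = blocks_per_second(3600, 256 * 1024)

print(round(rate_kib), round(rate_kb), round(rate_vr))
```

Either reading of "256k" lands close to the rounded figure of about 180 blocks per second quoted above.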

Video streams benefit from block boundaries coinciding with frame boundaries, ideally having an I-frame ( full frame ) at the start of every block

In case of source code repositories on IPFS it would be desirable to have one block per file to reduce the access latency.

In case of source code repositories on IPFS it is desirable to have semi-deterministic chunking to facilitate file portion deduplication, and thus reduce amount of storage required.
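
To make the deduplication argument concrete, here is a minimal content-defined chunker in the spirit of Gearhash, compared against fixed-size chunking after a one-byte insertion at the start of a file. It is a sketch under stated assumptions, not the algorithm from any of the linked papers or implementations: the gear table is derived from SHA-256 purely for reproducibility, and `avg_bits`, `min_size` and `max_size` are illustrative parameters.

```python
import hashlib

# Hypothetical gear table: one pseudo-random 64-bit value per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]
MASK64 = (1 << 64) - 1


def gear_chunks(data: bytes, avg_bits: int = 10,
                min_size: int = 256, max_size: int = 8192) -> list[bytes]:
    """Cut where the top `avg_bits` bits of the rolling gear hash are zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out, so h depends on the last 64 bytes only.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_cutpoint = length >= min_size and (h >> (64 - avg_bits)) == 0
        if at_cutpoint or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Number of distinct chunks present in both chunk lists."""
    return len(set(a) & set(b))


# 64 KiB of reproducible pseudo-random content, then a one-byte insertion.
original = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(2048))
edited = b"X" + original

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
cdc_shared = shared(gear_chunks(original), gear_chunks(edited))
print(fixed_shared, cdc_shared)
```

With fixed-size chunking the insertion shifts every boundary, so no block survives; with content-defined boundaries the chunker is expected to resynchronize after the first cutpoint, leaving most blocks reusable.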

Clearly identify pros and cons of chunking algorithm proliferation keeping the above list in mind

pro: content-aware chunking maximally reduces storage, transport and computation resources
pro: content-aware chunking with built-in assumptions for multiple versions with very small deltas across the same file would further magnify the effect above ( storage / transport / computation )
con: difficult to use as a "hash function" outside of the context of IPFS
con: efficient seeking to an arbitrary offset becomes more difficult with non-static chunking
con: content-aware chunking is more likely to result in too many small blocks, due to there being too many "interesting" cutpoints within a given set of structured data
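
The seeking concern can be sketched as follows. With static chunking, the block holding a byte offset is a single division; with variable-size chunks a reader needs the chunk sizes (or cumulative offsets) and a search over them. The helper names and the example sizes are illustrative assumptions, not part of any IPFS API.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_fixed(offset: int, chunk_size: int) -> tuple[int, int]:
    """Static chunking: chunk index and in-chunk offset by plain division."""
    return divmod(offset, chunk_size)


def build_offset_index(chunk_sizes: list[int]) -> list[int]:
    """Starting byte offset of every chunk (prefix sums of the sizes)."""
    return [0] + list(accumulate(chunk_sizes))[:-1]


def locate_variable(offset: int, starts: list[int], total: int) -> tuple[int, int]:
    """Variable chunking: binary search over the chunk start offsets."""
    if not 0 <= offset < total:
        raise ValueError("offset out of range")
    index = bisect_right(starts, offset) - 1
    return index, offset - starts[index]


sizes = [700, 1300, 512, 2048]      # hypothetical content-defined chunk sizes
starts = build_offset_index(sizes)  # [0, 700, 2000, 2512]

print(locate_fixed(2100, 1024))                  # (2, 52)
print(locate_variable(2100, starts, sum(sizes)))  # (2, 100)
```

The lookup itself stays logarithmic, but the offset index has to be stored or reconstructed from the intermediate nodes, which is the extra cost the "con" above refers to.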

Regardless of above outcome identify if there exists a low (single-digit) number of "general-purpose types of content" that could benefit from a single shared chunking strategy

NO

Identify whether the above number of algorithms could be reduced to 1 which could then viably replace the current defaults

No agreement was reached within the large sample size of 2
