Better File Chunking
Within the IPFS stack/ecosystem, just as within computing as a whole, an uncompressed stream of untagged octets is a fundamental unit of exchange. As a general-purpose data storage system, IPFS needs to handle an unbounded variety of content represented by such streams. Handling the maximum amount of this variety efficiently ( ideally by default ) would likely have an outsized impact on the future adoption of IPFS as a long-term data interchange medium/format.
The current stream-chunking options provided by the "official" IPFS content-adders are not particularly good. If left unchecked, this "evolutionary pressure" will inevitably lead to a proliferation of chunking algorithms within and beyond Protocol Labs-controlled projects. This could be undesirable, as each variation would result in radically different IPFS addresses, hampering user experience, e.g.:
- Preventing users from recognizing identical content via simple "eyeballing of a list of hashes"
- Increased storage requirements due to failed de-duplication of identical content
- Increased retrieval times due to high counts of not-already-present blocks
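To make the address divergence concrete, here is a toy sketch ( not the real UnixFS/DAG-PB layout, just an illustrative stand-in ) showing that re-chunking identical bytes with a different chunk size yields a different root hash:

```python
import hashlib

def toy_root(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a DAG root: hash each fixed-size chunk, then hash
    the concatenation of the chunk hashes. Not the real UnixFS layout."""
    leaves = [hashlib.sha256(data[i:i + chunk_size]).digest()
              for i in range(0, len(data), chunk_size)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()

data = bytes(range(256)) * 4096          # 1 MiB of sample content

root_256k = toy_root(data, 256 * 1024)   # current default chunk size
root_1m   = toy_root(data, 1024 * 1024)  # a hypothetical alternative

print(root_256k == root_1m)  # False: same bytes, two different "addresses"
```

The same mismatch defeats de-duplication: a node holding the 256k-chunked blocks cannot serve any block of the 1M-chunked version of the very same bytes.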
This session is mostly a "conversation starter" to properly map the available problem ( and solution ) spaces. I would strongly prefer to keep the deep dive light on specific technical details and instead focus on wider architectural / UX / game-theory-ish problems.
It would be great if towards the end of this session we have answers, or at least better-framed questions, for the following:
- Identify about 10 distinct realistic user-stories where IPFS could or already does play a central role.
- Clearly identify pros and cons of chunking algorithm proliferation, keeping the above list in mind ( spoiler alert: the author is strongly for reducing proliferation, but also recognizes that there are rather few effective tools for doing so ).
- Regardless of the above outcome, identify whether there exists a low ( single-digit ) number of "general-purpose types of content" that could benefit from a single shared chunking strategy.
- Identify whether the above number of algorithms could be reduced to 1 ( one ), which could then viably replace the current defaults.
List of various ( often conflicting ) prior-art discussions within Protocol Labs
- Inlining of small objects ( this is not directly related to chunking, but is relevant to smallest-chunk-size limitations )
- Several non-specialized chunking algorithm specs/implementations
- Ddelta/Gearhash ( pages 9 and 10 contain direct comparisons with Rabin-chunking )
- bup "hashsplit" implementation ( this is especially notable for being side-compatible with vanilla git )
Identify about 10 distinct realistic user-stories where IPFS could or already does play a central role
- Ultra-high-quality video files are large, and the current default chunk size of 256 KiB is rather small for this purpose.
- Max-quality 8K video requires about 360 Mbps, which translates to needing about 180 blocks per second before accounting for buffering. Considering applications like VR, we are looking at an order-of-magnitude increase in packets-per-second requirements.
- Video streams benefit from block boundaries coinciding with frame boundaries, ideally having an I-frame ( full frame ) at the start of every block.
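The arithmetic behind the figures above can be checked in a few lines ( the bitrate and chunk size are the ones quoted in the text ):

```python
bitrate_bps = 360_000_000   # ~360 Mbps quoted for max-quality 8K video
chunk_bytes = 256 * 1024    # current default chunk size of 256 KiB

blocks_per_second = bitrate_bps / 8 / chunk_bytes
print(blocks_per_second)        # ~172, i.e. the "about 180" figure

# VR pushing an order of magnitude more data:
print(blocks_per_second * 10)   # well over a thousand blocks per second
```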
- In the case of source code repositories on IPFS, it would be desirable to have one block per file to reduce access latency.
- In the case of source code repositories on IPFS, it is desirable to have semi-deterministic chunking to facilitate deduplication of file portions, and thus reduce the amount of storage required.
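The deduplication point is motivated by the classic boundary-shift problem of fixed-size chunking: a single inserted byte changes every block after it. A minimal illustration ( the data and sizes are arbitrary, chosen only to show the effect ):

```python
import hashlib
import random

random.seed(1)
v1 = bytes(random.randrange(256) for _ in range(4096))   # stand-in file contents
v2 = v1[:100] + b"X" + v1[100:]                          # one inserted byte

def fixed_chunk_hashes(data: bytes, size: int = 64):
    """Hash of every naive fixed-size chunk."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

shared = set(fixed_chunk_hashes(v1)) & set(fixed_chunk_hashes(v2))
print(f"{len(shared)} of {len(fixed_chunk_hashes(v1))} chunk hashes survive")
```

Only the chunk preceding the edit survives; everything after it is shifted and re-hashed, which is exactly what content-defined ( semi-deterministic ) chunking avoids.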
Clearly identify pros and cons of chunking algorithm proliferation, keeping the above list in mind
- pro: content-aware chunking maximally reduces storage, transport and computation resources
- pro: content-aware chunking with built-in assumptions for multiple versions with very small deltas across the same file would further magnify the effect above ( storage / transport / computation )
- con: difficult to use as a "hash function" outside of the context of IPFS
- con: efficient seeking to an arbitrary offset becomes more difficult with non-static chunking
- con: content-aware chunking is more likely to result in too many small blocks, due to there being too many "interesting" cutpoints within a given set of structured data
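For reference, the Gear-based rolling hash from the prior-art list ( Gearhash, as used by Ddelta ) is compact enough to sketch; the table seed, mask, and size limits below are illustrative assumptions, not the parameters of any shipping implementation:

```python
import random

random.seed(42)                                      # fixed seed: illustrative only
GEAR = [random.getrandbits(64) for _ in range(256)]  # random value per byte value
MASK = (1 << 13) - 1       # cut when low 13 bits are zero: ~8 KiB average chunk
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

def gear_chunks(data: bytes):
    """Yield (offset, length) of content-defined chunks, Gear-style:
    hash = (hash << 1) + GEAR[byte], so only the last 64 bytes matter."""
    h, start = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, length
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data) - start

# A small insertion early in the stream only disturbs the chunk it lands in;
# later cutpoints re-synchronize because the hash has a short effective window.
v1 = bytes(random.randrange(256) for _ in range(200_000))
v2 = v1[:500] + b"hello" + v1[500:]

c1 = {v1[o:o + n] for o, n in gear_chunks(v1)}
c2 = {v2[o:o + n] for o, n in gear_chunks(v2)}
print(f"{len(c1 & c2)} of {len(c1)} chunks survive the edit")
```

Note how the min/max bounds directly counter the "too many small blocks" con above, at the cost of occasionally forcing a cut at an "uninteresting" offset.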
Regardless of the above outcome, identify whether there exists a low (single-digit) number of "general-purpose types of content" that could benefit from a single shared chunking strategy
Identify whether the above number of algorithms could be reduced to 1 which could then viably replace the current defaults
No agreement was reached within the large sample size of 2