Better File Chunking

Within the IPFS stack/ecosystem, just as within computing as a whole, an uncompressed stream of untagged octets is a fundamental unit of exchange. As a general-purpose data storage system, IPFS needs to handle an unbounded variety of content represented by such streams. Handling as much of this variety as possible efficiently ( ideally by default ) would likely have an outsized impact on the future adoption of IPFS as a long-term data interchange medium/format.

The current stream-chunking options provided by the "official" IPFS content-adders are not particularly good. If left unchecked, the "evolutionary pressure" will inevitably lead to a proliferation of chunking algorithms within and beyond Protocol Labs-controlled projects. This could be undesirable, as every new chunker produces radically different IPFS addresses for the same content ( see the sketch after the list below ), hampering user experience, e.g.:

  • Preventing users from recognizing identical content via simple "eyeballing a list of hashes"

  • Increased storage requirements due to failed de-duplication of identical content

  • Increased retrieval times due to high counts of not-already-present blocks
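
To make the de-duplication failure concrete, here is a minimal, hypothetical Go sketch: it splits the same payload with two fixed chunk sizes and hashes each chunk. This illustrates the principle only — it is not the actual UnixFS DAG construction, and the payload and 4-byte digest truncation are arbitrary choices for readability.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkAndHash splits data into fixed-size chunks and returns a short
// SHA-256 digest per chunk -- a stand-in for the per-block hashing an
// IPFS adder performs before linking blocks into a DAG.
func chunkAndHash(data []byte, size int) []string {
	var digests []string
	for off := 0; off < len(data); off += size {
		end := off + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[off:end])
		digests = append(digests, fmt.Sprintf("%x", sum[:4]))
	}
	return digests
}

func main() {
	// 48 bytes, so both chunk sizes divide the payload evenly.
	payload := []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV")[:48]
	fmt.Println("8-byte chunks: ", chunkAndHash(payload, 8))
	fmt.Println("16-byte chunks:", chunkAndHash(payload, 16))
	// No digest appears in both lists, so a node holding one version
	// can neither recognize nor serve blocks of the other.
}
```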

This session is mostly a "conversation starter" to properly map the available problem ( and solution ) spaces. I would strongly prefer to keep the deep dive light on specific technical details and instead focus on wider architectural / UX / game-theory-ish problems.

Goal

It would be great if, by the end of this session, we had answers or maybe even consensus 🤞 regarding the following deceptively simple checklist:

  • Identify about 10 distinct realistic user-stories where IPFS could or already does play a central role.

  • Clearly identify pros and cons of chunking algorithm proliferation, keeping the above list in mind ( spoiler alert: the author is strongly for reducing proliferation, but also recognizes that there are rather few effective tools for doing so )

  • Regardless of the above outcome, identify whether there exists a low ( single-digit ) number of "general-purpose types of content" that could benefit from a single shared chunking strategy.

  • Identify whether the above number of algorithms could be reduced to 1 ( one ) which could then viably replace the current defaults.

List of various ( often conflicting ) prior-art discussions within Protocol Labs

Several non-specialized chunking algorithm specs/implementations

Team

Presentation

🎤 Slides

Notes

Identify about 10 distinct realistic user-stories where IPFS could or already does play a central role

Use cases / problems

Ultra-high-quality video files are large, and the current default chunk size of 256 KiB is rather small for this purpose.

Maximum-quality 8K video requires about 360 Mbps, which at 256 KiB per block translates to roughly 180 blocks per second before any buffering. Considering applications like VR, we are looking at an order-of-magnitude increase in packets-per-second requirements.
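
A quick back-of-the-envelope check of those figures ( the bitrate and chunk size are the ones quoted above; the 10x VR multiplier is an assumed round number ):

```go
package main

import "fmt"

func main() {
	const bitrateMbps = 360.0       // quoted max-quality 8K bitrate
	const blockBytes = 256 * 1024.0 // current default chunk size

	bytesPerSec := bitrateMbps * 1e6 / 8
	blocksPerSec := bytesPerSec / blockBytes

	fmt.Printf("%.0f blocks/s\n", blocksPerSec)                // ~172, i.e. "about 180"
	fmt.Printf("%.0f blocks/s assuming 10x for VR\n", blocksPerSec*10)
}
```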

Video streams benefit from block boundaries coinciding with frame boundaries, ideally having an I-frame ( full frame ) at the start of every block.
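
As a sketch of what such frame-aware chunking could look like, the following hypothetical Go function cuts a raw H.264 Annex-B byte stream at each IDR ( full-frame ) NAL unit, so every emitted block starts with a decodable picture. It deliberately ignores container formats ( MP4, MKV ), which would need demuxing first.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitAtIDR cuts an H.264 Annex-B elementary stream at every IDR
// NAL unit (type 5), so each chunk starts with a full frame.
// Minimal sketch only: real streams arrive inside containers that
// need demuxing before this kind of scan is possible.
func splitAtIDR(stream []byte) [][]byte {
	startCode := []byte{0x00, 0x00, 0x01}
	var chunks [][]byte
	last := 0
	for i := 0; i+3 < len(stream); i++ {
		if bytes.Equal(stream[i:i+3], startCode) && stream[i+3]&0x1F == 5 {
			cut := i
			if cut > 0 && stream[cut-1] == 0x00 {
				cut-- // 4-byte start code: keep its leading zero with the IDR
			}
			if cut > last {
				chunks = append(chunks, stream[last:cut])
				last = cut
			}
		}
	}
	return append(chunks, stream[last:])
}

func main() {
	// Two fake IDR NAL units back to back; 0x65 & 0x1F == 5 (IDR slice).
	s := []byte{0, 0, 1, 0x65, 0xAA, 0, 0, 1, 0x65, 0xBB}
	fmt.Println(len(splitAtIDR(s))) // 2 chunks, each starting at a full frame
}
```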

In the case of source-code repositories on IPFS, it would be desirable to have one block per file to reduce access latency.

In the case of source-code repositories on IPFS, it is also desirable to have semi-deterministic chunking to facilitate deduplication of file portions across versions, and thus reduce the amount of storage required.
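
"Semi-deterministic" chunking is usually achieved with content-defined chunking: cut points are chosen by a rolling hash over the data itself, so an insertion early in a file only disturbs chunks up to the next boundary. Below is a minimal Gear-style sketch in Go, with assumed min/max bounds and a mask giving ~64 KiB average chunks; production chunkers ( Rabin, buzhash, FastCDC ) refine this same idea.

```go
package main

import (
	"fmt"
	"math/rand"
)

// gear is a random byte->uint64 table; a fixed seed keeps chunk
// boundaries deterministic for identical content across runs.
var gear = func() [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

// cdc is a minimal Gear-style content-defined chunker: a rolling hash
// over the input, cutting wherever the hash's low bits are all zero.
// mask sets the average chunk size; min/max bound pathological inputs.
func cdc(data []byte, min, max int, mask uint64) [][]byte {
	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = (h << 1) + gear[b]
		if size := i - start + 1; size >= min && (h&mask == 0 || size >= max) {
			chunks = append(chunks, data[start:i+1])
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(42)).Read(data)
	chunks := cdc(data, 16<<10, 256<<10, (1<<16)-1) // ~64 KiB average
	fmt.Println("chunks:", len(chunks))
	// Inserting one byte near the front only re-chunks the region up to
	// the next boundary; all later chunks (and their hashes) survive.
}
```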

Clearly identify pros and cons of chunking algorithm proliferation keeping the above list in mind

  • pro: content-aware chunking maximally reduces storage, transport and computation resources
  • pro: content-aware chunking with built-in assumptions for multiple versions with very small deltas across the same file would further magnify the effect above ( storage / transport / computation )
  • con: difficult to use as a "hash function" outside of the context of IPFS
  • con: efficient seeking to an arbitrary offset becomes more difficult with non-static chunking ( see the sketch after this list )
  • con: content-aware chunking is more likely to result in too many small blocks, due to there being too many "interesting" cutpoints within a given set of structured data
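
To illustrate the seeking con: with fixed-size chunks, mapping a byte offset to a block index is pure arithmetic, while variable-size chunks require an index of cumulative chunk ends ( UnixFS stores per-link block sizes in the DAG, which plays this role ). A hypothetical Go sketch of the trade-off:

```go
package main

import (
	"fmt"
	"sort"
)

// fixedBlockFor maps an offset to a block index in O(1): pure arithmetic.
func fixedBlockFor(offset, chunkSize int64) int64 {
	return offset / chunkSize
}

// variableBlockFor needs an index of chunk boundaries; ends[i] is the
// offset one past the end of chunk i, searched here in O(log n).
func variableBlockFor(offset int64, ends []int64) int {
	return sort.Search(len(ends), func(i int) bool { return ends[i] > offset })
}

func main() {
	fmt.Println(fixedBlockFor(5<<20, 256<<10))                 // block 20
	fmt.Println(variableBlockFor(300, []int64{100, 250, 400})) // block 2
}
```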

Regardless of above outcome identify if there exists a low (single-digit) number of "general-purpose types of content" that could benefit from a single shared chunking strategy

NO

Identify whether the above number of algorithms could be reduced to 1 which could then viably replace the current defaults

No agreement was reached within the large sample size of 2