ipfs files concat [ <local paths> | <cids> ] #9177

Open
lidel opened this issue Aug 9, 2022 · 7 comments
Labels
  • kind/enhancement: A net-new feature or improvement to an existing feature
  • need/analysis: Needs further analysis before proceeding
  • need/community-input: Needs input from the wider community
  • need/triage: Needs initial labeling and prioritization
  • topic/commands: Topic commands
  • topic/files: Topic files
  • topic/UnixFS: Topic UnixFS

Comments

@lidel
Member

lidel commented Aug 9, 2022

Documenting discussion with @ikreymer, @RangerMauve and @ribasushi

We are missing a high-level API for concatenating existing UnixFS files into bigger ones.
Having it would improve deduplication in scenarios where bigger archives in formats like WARC (https://webrecorder.net) consist largely of smaller files that are already on IPFS, because the existing CIDs/DAGs could be reused.

Use cases

  • building big files from preexisting DAGs (e.g. WARC from https://webrecorder.net assembled from standalone files)
  • (TBD) if we include dirs, then also
    • a friendlier replacement for the deprecated ipfs object patch append-data
    • merging multiple directories without mutating them in MFS

Proposed design

Add a concat command to ipfs files that accepts two or more UnixFS-compatible DAGs and returns a CID that is the logical concatenation of all of them.

$ ipfs files concat [ /local/mfs/paths | /ipfs/cids ] 
bafy....
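
A minimal sketch of the idea in Python (not kubo code): a UnixFS file root records the content size of each linked subtree, so a concatenated file can in principle be a new root that links to the existing roots instead of copying their data. `FileNode`, `concat`, `blocksizes`, and the placeholder CID strings are hypothetical names used only for illustration; real CIDs are derived from the encoded blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileNode:
    """Toy stand-in for a UnixFS file DAG root (hypothetical, not a kubo type)."""
    cid: str           # placeholder string, not a real CID
    size: int          # total bytes of file content under this node
    links: tuple = ()  # child FileNodes, in byte order


def concat(parts: List[FileNode], new_cid: str = "bafy-concat-placeholder") -> FileNode:
    """Build a new root that links to the existing roots without copying them."""
    if len(parts) < 2:
        raise ValueError("concat needs two or more files")
    return FileNode(cid=new_cid, size=sum(p.size for p in parts), links=tuple(parts))


def blocksizes(node: FileNode) -> List[int]:
    """Per-link content sizes, which let a reader seek into the combined file."""
    return [child.size for child in node.links]


a = FileNode("bafy-a-placeholder", 10)
b = FileNode("bafy-b-placeholder", 25)
merged = concat([a, b])
print(merged.size, blocksizes(merged))  # 35 [10, 25]
```

Because the new root only references `a` and `b`, any blocks already stored for them are shared with the concatenated file.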

FAQ / Open questions

We need to agree on how to handle edge cases. Below are my initial ideas;
feedback on ergonomics and potential implementation caveats is appreciated.

  • What happens when passed DAGs are all files?

    • Concatenate them in order and produce a new UnixFS file that reuses the original DAGs (maximizing deduplication)
  • Should this support directories? It opens additional questions:

    • What happens when passed DAGs are all directories?
      • Create a new directory which has all children from original directories (in-order?)
    • What happens when the first DAG is a directory and all remaining ones are files?
      • Create a new directory that contains the first directory's children plus the remaining files
    • What happens when the first DAG is a file but at least one of the remaining ones is a directory?
      • (A) Return "Error: concatenating directories is possible only when the first DAG is a UnixFS directory"
      • (B) Concatenate everything into a single UnixFS directory (children from directories + standalone files)
    • What happens if the same CID is in two directories under the same name?
      • Should it be duplicated or deduplicated?
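
To make the directory questions above concrete, here is a small Python sketch of one possible policy. It is an assumption-laden illustration, not a proposal for kubo's behaviour: directories are modelled as plain dicts from entry name to a placeholder CID string, identical name+CID pairs are deduplicated, a name that maps to two different CIDs raises an error (a case the questions above do not cover), and file-first mixes are rejected along the lines of option (A). `merge_directories` and `classify_inputs` are hypothetical names.

```python
from typing import Dict, List, Union

Directory = Dict[str, str]  # entry name -> placeholder CID (assumption)
File = str                  # placeholder CID of a file (assumption)
Node = Union[Directory, File]


def merge_directories(dirs: List[Directory]) -> Directory:
    """Merge entries in order; identical name+CID pairs collapse, conflicts raise."""
    merged: Directory = {}
    for d in dirs:
        for name, cid in d.items():
            if name in merged and merged[name] != cid:
                raise ValueError(f"name conflict for {name!r}")
            merged[name] = cid
    return merged


def classify_inputs(nodes: List[Node]) -> str:
    """Classify the inputs; rejects mixes whose first input is a file, as in option (A)."""
    kinds = ["dir" if isinstance(n, dict) else "file" for n in nodes]
    if all(k == "file" for k in kinds):
        return "files"
    if all(k == "dir" for k in kinds):
        return "dirs"
    if kinds[0] == "dir":
        return "dir-first"
    raise ValueError("first input must be a directory when any input is a directory")


print(merge_directories([{"a": "cid1"}, {"a": "cid1", "b": "cid2"}]))
print(classify_inputs(["cid1", "cid2"]))
```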
@lidel added the kind/enhancement, topic/commands, topic/files, need/triage, need/community-input, need/analysis, and topic/UnixFS labels on Aug 9, 2022
@ribasushi
Contributor

My take is: hard-error on directories, support only files and pipes. Just like /bin/cat

@RangerMauve

I put together a test repo using js-unixfs to show how concat could work under the hood by building up nodes from several sub-nodes.

https://github.com/RangerMauve/js-ipfs-stitch-test/

Agreed that directories should be an error. I don't think we can cat a UnixFS tree with directories in it, so concatenating a directory in there seems like a separate use case.

@ikreymer

ikreymer commented Aug 9, 2022

Another high-level API that would be super useful, and that becomes easy to support once the core ipfs files concat functionality exists, is a way to start with a single file and a list of split points (offsets) to split it on.

It could be a subcommand: ipfs files concat add <local path> <split points>, where <split points> contains either a JSON array or one offset per line. The command would read <local path>, add each range between consecutive offsets as a separate file, and then concat the results. E.g. given a 35M file and offsets [0, 10M, 25M], the command would add 0-10M of the file, then 10M-25M, then 25M-35M. Maybe it could also support other add options, like being able to choose a trickle DAG?

Maybe there are two subcommands:
ipfs files concat add <local path> <split points>
and
ipfs files concat merge [ <local paths> | <cids> ], for when the split parts already exist as individual files or have already been added as CIDs.

This just adds a common first step that would often be needed before using ipfs files concat
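
A rough Python sketch of the offset handling described above, under a few assumptions that the comment does not state: offsets are byte positions, the list starts at 0 and is strictly increasing, ranges are half-open, and "M" means 10**6 bytes. `split_ranges` is a hypothetical helper, not part of any ipfs CLI or library.

```python
from typing import List, Tuple


def split_ranges(file_size: int, offsets: List[int]) -> List[Tuple[int, int]]:
    """Turn sorted start offsets into half-open (start, end) byte ranges."""
    if not offsets or offsets[0] != 0:
        raise ValueError("offsets must start at 0")
    if any(b <= a for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be strictly increasing")
    if offsets[-1] >= file_size:
        raise ValueError("last offset must be inside the file")
    # Each range ends where the next one starts; the last ends at file_size.
    ends = offsets[1:] + [file_size]
    return list(zip(offsets, ends))


M = 1_000_000  # assumption: "M" is treated as 10**6 bytes
ranges = split_ranges(35 * M, [0, 10 * M, 25 * M])
print(ranges)
```

Each resulting range would be added as its own file, and the resulting parts then concatenated in order.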

@ribasushi
Contributor

@ikreymer too complex. You'd simply:

ipfs files concat yourfile:0:20 yourfile:21:40 yourfile:41:
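
For illustration only, arguments in this shape could be parsed roughly as follows in Python. The `path:start:end` syntax is just the one suggested in this comment, not an existing ipfs flag or format, and whether `end` is inclusive is left open here, so the sketch keeps the raw numbers. `RangeSpec` and `parse_range_spec` are hypothetical names.

```python
from typing import NamedTuple, Optional


class RangeSpec(NamedTuple):
    path: str
    start: int
    end: Optional[int]  # None means "to end of file"


def parse_range_spec(arg: str) -> RangeSpec:
    """Parse 'path:start:end' or 'path:start:' from the right, so paths may contain ':'."""
    try:
        path, start, end = arg.rsplit(":", 2)
    except ValueError:
        raise ValueError(f"expected path:start:[end], got {arg!r}") from None
    if not path:
        raise ValueError("empty path")
    return RangeSpec(path, int(start), int(end) if end else None)


args = "yourfile:0:20 yourfile:21:40 yourfile:41:".split()
specs = [parse_range_spec(a) for a in args]
for spec in specs:
    print(spec)
```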

@ikreymer

ikreymer commented Aug 9, 2022

> @ikreymer too complex. You'd simply:
>
> ipfs files concat yourfile:0:20 yourfile:21:40 yourfile:41:

Yeah, I guess I could live with that. I was just thinking that a separate split-points file makes for an easier user API, especially if this is to be supported in libraries as well as the CLI, and when dealing with maybe hundreds of split points.

@ikreymer

I've implemented a small library in JS that includes concat as well as some related utilities that are useful for the web archiving use case:
https://github.com/webrecorder/ipfs-composite-files

@anjor

anjor commented Mar 8, 2023

Wrote something in Go: https://github.com/anjor/unixfs-cat/blob/main/unixfs_cat.go

Happy to work more on it if it's useful/along the lines of the thinking here.
