This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

Spec Proposal #2

Closed
wants to merge 3 commits into from

Contributor

@kevina kevina commented Oct 28, 2017

Here is a draft spec so we have something to discuss.

I included justifications for many of my choices in the spec itself, but right now they are just my opinion and represent no authority on the matter.

I do not intend to do any forced updates on this draft so we have a clean record of all decisions made.

draft.md Outdated

- CBOR Map
- Key: CBOR Byte or Text String: File Name
- Value: CBOR Array of:

I am really concerned about attributes being represented as an array:

  • it hampers extensibility - we are certainly forgetting something today, which means almost certainly yet another optional map in the future, and then which one is which?
  • It makes parsing difficult when represented as JSON

Why not just a map? With all keys being pre-agreed upon multicodec-style: i.e. it must exist in one of the centralized spec tables in order to be recognized by anyone

We can still declare some of the keys as mandatory, and it is at the discretion of gateways/nodes/etc to decide what to do with "obviously malformed" blocks. We already have this with protobuf/unixfs: if one uploads a link-block with only "type 2" fields, and no "data" - everything rejects it.

CBOR defines sorting logic for canonicalizing maps, but not for arrays, and a canonical representation for unixfs directory should be a must.
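
The canonical map ordering referred to here can be sketched as follows. This is a minimal illustration, assuming the RFC 7049 rule (shorter keys sort first, equal-length keys sort bytewise) and assuming all keys are text strings; `canonical_key_order` is a hypothetical helper name, not part of any draft.

```python
def canonical_key_order(names):
    """Order text keys the way RFC 7049 canonical CBOR orders map keys.

    Assumes every key is a text string, so comparing (length, bytes) of the
    UTF-8 payload matches comparing the full encoded keys.
    """
    def sort_key(name):
        raw = name.encode("utf-8")
        return (len(raw), raw)

    return sorted(names, key=sort_key)


entries = ["b", "aa", "a"]
print(canonical_key_order(entries))
```

Two writers that start from the same set of directory entries would then emit the keys in the same order, which is what makes the block (and its hash) reproducible.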

Contributor Author

Why not just a map?

I covered the reasons below in the notes section.

@ehmry I don't see how your comment applies.

Contributor Author

I am not tied to this idea. The amount of space saving is something that can be calculated once we determine what the keys will be if we use a map.

@kevina I assumed you meant an array of [key, value] tuples. CBOR is supposed to be schema-less and ordered values seem like schema.

Contributor Author

Even if we go with a map for directory entries certain keys will be required in order for the directory entry to be well defined so that is also a schema in a way.

Well, assigning integer keys to the spec attributes would order them in the same way and just be one byte of overhead for each attribute in a map.
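
To put rough numbers on the array-versus-map trade-off, here is a minimal sketch that hand-encodes one directory entry three ways. The encoder covers only the CBOR subset needed here, the 36-byte placeholder stands in for a raw CID (its length is an assumption), and the key names and numbers are illustrative rather than taken from the draft.

```python
def head(major: int, n: int) -> bytes:
    """CBOR initial byte plus argument for a major type and an unsigned value."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, width in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * width):
            return bytes([(major << 5) | extra]) + n.to_bytes(width, "big")
    raise ValueError("value too large for CBOR")


def encode(value) -> bytes:
    """Encode a small subset of CBOR (no tags, no canonical key sorting)."""
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return head(0, value)
    if isinstance(value, bytes):
        return head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return head(3, len(raw)) + raw
    if isinstance(value, list):
        return head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return head(5, len(value)) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


cid = bytes(36)  # placeholder for a raw CID; real links would also carry a tag
as_array = encode(["f", cid, 1234])
as_int_map = encode({1: "f", 2: cid, 3: 1234})
as_text_map = encode({"type": "f", "data": cid, "size": 1234})
print(len(as_array), len(as_int_map), len(as_text_map))
```

Under these assumptions the integer-keyed map costs one byte per attribute over the array, while four-letter text keys cost five bytes per attribute.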

draft.md Outdated
* The key type can either be a byte or text string as POSIX makes no
requirements that file names be utf-8 and it is important that any
file name can be faithfully represented, if the string is utf-8
then the type will be Text.

This makes an implicit decision regarding the question I posed at the end of ipfs/kubo#4292 (comment): we shift the onus of "check that the name is safe to use/display" to the consumers. Are we ready to do that?

Contributor Author

@kevina kevina Oct 28, 2017

I think it is important to represent any valid file in a POSIX system. I am against restricting the range in the spec. Yes, I think it is the consumer's job to make sure that the filename is safe to display.
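
The key-type rule in the quoted draft text (Text if the name is valid UTF-8, Byte String otherwise) could look like the sketch below. `name_key` is a hypothetical helper; a Python `str` stands in for a CBOR text string and `bytes` for a CBOR byte string.

```python
def name_key(raw: bytes):
    """Return the map key for a file name as read from a POSIX directory.

    Valid UTF-8 names become text (str); anything else is kept as raw bytes
    so that the original name can be reproduced exactly.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


print(repr(name_key(b"caf\xc3\xa9")))  # UTF-8 encoded name
print(repr(name_key(b"caf\xe9")))      # Latin-1 encoded name, not valid UTF-8
```

Whether the resulting text is safe to display is a separate question, which this sketch deliberately leaves to the consumer.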

draft.md Outdated
### Notes

* Rather than have a special attribute for an executable bit it is more compact if we just make this a different type
* It is very useful to be able to determine if a link is a directory or an ordinary file so I made it as separate type, also there can be multiple ways to define a file size for a directory so it is best to just leave it out as it is of limited usefulness

there can be multiple ways to define a file size for a directory

Actually there are only 2 ways - either you only count the logical bytes ( what the Windows properties UI does ), or you take into account the allocation overhead of the filesystem - the blocks taken by both the directories and the files themselves rounded up ( what the unixish du does ).

Given that in the context of IPFS the DAG is completely decoupled from the storage ( it may be files, it may be badger, etc ), the only sensible way to define a file size for a directory is to count the logical bytes, which I've done in my prototypes.

I would be sad if I can't express these cumulative values as part of every link within an FS tree.
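
A minimal sketch of the "logical bytes" definition, assuming a toy in-memory tree where a file is represented by its byte count and a directory by a dict of children (this representation is purely illustrative):

```python
def logical_size(node) -> int:
    """Cumulative logical size: sum of file payload bytes below this node.

    A file is an int (its size in bytes); a directory is a dict mapping
    names to child nodes. Directory and block overhead is not counted.
    """
    if isinstance(node, int):
        return node
    return sum(logical_size(child) for child in node.values())


tree = {"a.txt": 10, "sub": {"b.bin": 5, "empty": {}}}
print(logical_size(tree))
```

Because the value depends only on file contents, it is the same regardless of chunking or of the storage backend holding the blocks.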

Contributor Author

Unix has its own (basically useless) way of defining the size of a directory.

I am not totally against including the cumulative size of a directory if we can agree on how to define it.

@ehmry ehmry left a comment

I didn't know there was a review button.

draft.md Outdated
* `l`, `symlink`: symbolic link. The second field is the contents of the link
* `o`, `other`: link to other ipld object, links followed for GC and related operations
* `u`, `unknown`: link to unknown objects, links not followed

If an integer enumeration was used rather than ascii characters, the canonical CBOR representation would be packed to one byte rather than two. Given that the CID will be in raw representation, I don't think clarity would suffer by an enumeration.

Contributor Author

I am not against this.

draft.md Outdated

### Notes

* Rather than have a special attribute for an executable bit it is more compact if we just make this a different type

@ehmry ehmry Oct 28, 2017

If file types are enumerated then the high bit in a one-byte packed CBOR integer (0b10000) could be an informative bit that would make regular files (type 0) into executable files (type 16).
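
The flag-bit idea can be sketched as below. Only "regular file = 0" and the `0b10000` bit come from the comment; the other type numbers and all helper names are hypothetical.

```python
EXEC_FLAG = 0b10000  # the informative bit proposed above (decimal 16)

# Hypothetical base type numbers; only REGULAR = 0 is taken from the thread.
REGULAR, DIRECTORY, SYMLINK = 0, 1, 2


def mark_executable(type_code: int) -> int:
    return type_code | EXEC_FLAG


def is_executable(type_code: int) -> bool:
    return bool(type_code & EXEC_FLAG)


def base_type(type_code: int) -> int:
    return type_code & ~EXEC_FLAG


def encode_small_uint(n: int) -> bytes:
    """CBOR encodes unsigned integers 0..23 directly in the initial byte."""
    if not 0 <= n < 24:
        raise ValueError("only 0..23 fit in a single CBOR byte")
    return bytes([n])


exe = mark_executable(REGULAR)
print(exe, is_executable(exe), base_type(exe), encode_small_uint(exe))
```

One consequence worth noting: since single-byte CBOR integers stop at 23, this layout leaves room for only eight base types (0 to 7) before a flagged type needs a second byte.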

Contributor Author

That could work.

Contributor Author

kevina commented Nov 3, 2017

Everyone. I just rewrote the draft now that I have a better idea what people are looking for.

The value of the map is another CBOR map with the following standard fields:

- `type`
- `exe`: CBOR boolean: executable bit

I think instead of just having executable here, we should do a full rwxrwxrwx unix permissions set (a uint32)

Contributor Author

@kevina kevina Nov 3, 2017

The problem is the full unix set of permissions does not have a lot of meaning on other operating systems. Even within unix systems it has limited meaning when stored in an archive.

Others may have stronger opinions on this than me. In particular see #1 (comment) by @ehmry.

I think instead of just having executable here, we should do a full rwxrwxrwx unix permissions set (a uint32)

@whyrusleeping the full st_mode ( entry type + permissions ) is in fact only 16 bits ( within a typically-32-bit-aligned struct member ). If we reuse it as-is we gain some extra bit of interop with everything that understands st_mode ( git took this path: https://stackoverflow.com/questions/737673/how-to-read-the-mode-field-of-git-ls-trees-output/8347325#8347325 ).

Of course this makes direct-query of type a bit harder, but then again every libc provides S_IF... macros
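
For reference, querying the type and permissions out of a combined `st_mode` value is short with the standard helpers. The sketch below uses Python's `stat` module, which mirrors the libc `S_IS...` macros; the sample modes are the ones git commonly writes, and `describe_mode` is a hypothetical helper.

```python
import stat


def describe_mode(mode: int) -> dict:
    """Split a combined st_mode value into entry type and permission info."""
    return {
        "is_regular": stat.S_ISREG(mode),
        "is_dir": stat.S_ISDIR(mode),
        "is_symlink": stat.S_ISLNK(mode),
        "permissions": stat.S_IMODE(mode),
        "user_executable": bool(mode & stat.S_IXUSR),
    }


GIT_EXECUTABLE = 0o100755  # regular file, rwxr-xr-x
info = describe_mode(GIT_EXECUTABLE)
print(info["is_regular"], oct(info["permissions"]), info["user_executable"])
```

So the single "executable" question is still one mask away even if the full mode is stored.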

I think there is no harm in having this additional data stored. Systems that have no use for those bits will skip them; systems that do will use them and allow for preservation.

It is very similar with uid and gid. They have no meaning on some systems, and they may have no meaning on a different machine with the same system (different uid/gid mappings), but they are crucial if I wanted to, for example in the future, use IPFS for /home storage in a managed multi-user, multi-workstation environment.

I think that we should support the full range, but maybe not change the default behaviour?
or maybe we only record user executable by default.

Thinking about it a bit more, the 'readable' flag really doesn't make a lot of sense in this context. I can read anything that's in ipfs...

Contributor Author

Storing the full st_mode by default just feels wrong to me, and I will go as far as to say it could create additional complications down the road (what those complications are I am not sure).

One possible complication is how to handle the writable bit in st_mode: should it always be set by default or not? If it is not set, should the st_mode be honored when extracting files? If it is, that could be annoying, as all files will end up read-only. Or maybe only the executable bit should be honored by default, in which case it just seems better to store that single bit.

For this version of the standard I feel rather strongly we should stick to just the executable bit, as it was stated in the requirements (#1), or nothing at all (as @lgierth suggested, we don't add additional metadata). The full st_mode can be included in a later version of the standard.

- `data`: normally a CBOR link, but can be other types depending on the value of the `type` field
- `size`: cumulative size of `data`
- `fsize`: (file size) cumulative size of the payload of `data`
- `fname`: CBOR byte string: original filename if it differs from the key

This seems unnecessary (though I admit I've missed a lot of the conversation from over in the other issue). Why would this differ from the map key?

Contributor Author

@kevina kevina Nov 3, 2017

Yes the discussion on #3 is rather long. It may differ because unix filenames are not required to be UTF-8.

@whyrusleeping two things are at play / in conflict:

  • Desire for full POSIX compatibility mandates names to allow any sequence of bytes excluding 0x00 and 0x2f
  • The current proto-IPLD spec mandates keys ( i.e. names ) to be unicode: the mandate flows from the spec declaring a strict superset of RFC 7049
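
The POSIX side of this conflict is easy to state in code. A minimal sketch, assuming the only constraints are the two excluded bytes named above plus non-emptiness (`is_valid_posix_name` is a hypothetical helper):

```python
def is_valid_posix_name(raw: bytes) -> bool:
    """True if raw could be a single POSIX path component.

    Any byte sequence is allowed except NUL (0x00) and slash (0x2f);
    an empty name is rejected as well.
    """
    return len(raw) > 0 and b"\x00" not in raw and b"/" not in raw


print(is_valid_posix_name(b"caf\xe9"))  # not UTF-8, still a valid name
print(is_valid_posix_name(b"a/b"))
```

The first call shows the conflict directly: the name is acceptable to POSIX but cannot be used as-is where keys must be Unicode text.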

Hrm, I don't have a lot of strong opinions here. I will defer to @lgierth @diasdavid and @Stebalien

Member

Is this field going to include which character set should be used to interpret it somehow?

Contributor Author

No. Just the raw byte string if it differs from the key (which is the filename) that is in utf-8. For display the key should be used.

* _omitted_: regular file
* `dir`: directory entry
* `special`: special file type (fifo, device, etc).
The `data` field is a CBOR Map with at least one field to describe the type.

I don't think we should overload 'data', and we should especially avoid making it have different types based on the value of a key in the parent level. That sort of parsing is hard to do efficiently.

Contributor Author

@kevina kevina Nov 4, 2017

@whyrusleeping

I like the simplicity of having the contents of the file entry always be in the same field: for most types it is an IPLD link, for symbolic links it is the target, and for special file types it is a CBOR map with the details of the special file. My thinking was the type would just be an interface, which is then cast to the correct type once it is known.

I can instead have the following fields:

  • link: CBOR link when applicable
  • target: symbolic link target
  • data: a CBOR map that contains additional data for the directory entry that is not a link or target

I would rather not provide special fields to describe the content of all the different types of special files.

Thoughts?

We're going to have to inspect fields to determine what to do with things anyway. Overloading things doesn't really save us much in my opinion.

Contributor Author

@whyrusleeping I am having a hard time interpreting that comment. Are you okay with my proposal (link, target, data fields), or are you saying we should create special fields for each and every special file type?

Contributor Author

More to the point, I don't want to enumerate the required fields in the first version of the spec. I like the data field abstraction because we can defer that part for later. It also allows us to define a standard set of tags so that an implementation can error out if it finds something it doesn't understand.


- `data`: link
- `size`: cumulative size of `data`
- `fsize`: (file size) cumulative size of the payload of `data`

@kevina this seems backwards. The size of the protocol-wrapped objects is currently optional for all intents and purposes. On the other hand the actual fsize is mandatory if what you are expressing is a "top node of a file DAG".

Contributor Author

To be clear, fsize is the logical bytes, while size is the size that includes the overhead of interior nodes, so fsize <= size.

@kevina precisely. The logical bytes ( fsize ) are interesting because they are needed to calculate buffers / HTTP header / etc

The node overhead is only useful for estimating how much local storage would be necessary to grab the entire DAG locally, but given the overhead is never more than ~10 bytes per node, it becomes ( from my PoV at least ) needless cruft.

I've done a lot of tests over the last year with DAGs specifically excluding mention of what you refer to as size. Everything works fine. I strongly believe the overhead-including-size should not be a mandatory part of any future spec ( and thus should be the last item in the CBOR array, not an intermediate one )

Contributor Author

I strongly believe the overhead-including-size should not be a mandatory part of any future spec

Note: This is not a CBOR array but a map; the structure is {type: "file", data: [{/*link*/}, {}, {} ...]}.

I don't have a strong opinion on which sizes to include. @whyrusleeping thoughts?

Member

I think size and fsize are both useful - download progress would depend on size, for example, as you still need to fetch the wrapper bytes.

@@ -0,0 +1,101 @@
# Draft IPLD Unixfs Spec

Member

Let's stop calling this IPLD Unixfs, the current Unixfs is already IPLD. This proposal is to:

  • Move away from dag-pb to dag-cbor
  • Leverage the flexibility of dag-cbor to add more Metadata
  • Improve Unixfs and remove some of the limitations we found while using unixfs with dag-pb.

One of the design goals of the new Unixfs is that it should be 100% interoperable with the old (a directory of Unixfs2 should be able to have a file of Unixfs1)

Contributor Author

The name "IPLD Unixfs" was @whyrusleeping's idea and I just went along with it. We could call it "Unixfs V2", although I am not sure how much we want to stick with the unix filesystem structure as a model (I personally think we should move away from it and focus on the components that are important to a generic archive structure).

I'm fine not calling it ipld unixfs. @diasdavid is right

Member

@achingbrain achingbrain Jun 28, 2018

a directory of Unixfs2 should be able to have a file of Unixfs1

It's probably worth mentioning this in the spec.

Also, what about Unixfs1 directories?

@daviddias daviddias mentioned this pull request Nov 22, 2017
@daviddias
Member

Anyone following up on this spec?

ehmry commented Dec 14, 2017

I am working on a utility for manipulating Unixfs structures independent of IPFS, but as a client of the IPFS API. But it's not actually an IPFS client right now, and it doesn't conform to this spec yet. https://github.com/ehmry/nim-ipld/

Contributor Author

kevina commented Dec 14, 2017

@diasdavid it was stalled, partly because I still don't have a clear picture of what others are looking for and partly because I was waiting for feedback. Some more recent issues that have come to my mind:

  1. Do we have a standard structure for IPLD links in CBOR format? Recently, in ipfs/kubo#4444 ("Use ResolveUnixfsOnce as default resolver"), it was brought up that there should be a way to traverse the dag structure of raw IPLD nodes. However we structure the new Unixfs, it should still be possible to traverse the dag without any knowledge of the Unixfs-specific parts.

  2. As an extension of (1), should interior nodes for files just use an array of links? I don't see any point in naming the links, but if we want to access the raw dag we need a way to access this. The current protobuf always uses an array for links; this is a problem for directories as it allows separate links with regard to the name of the link.

  3. How should sharding be handled? I don't think we should have a separate type for sharded directories; instead, if a directory becomes too large, we should create a special link type that represents part of a sharded directory. (I hope this makes sense; if it doesn't, let me know.)

  4. I think we should move away from using the unix filesystem as a model for what attributes to include. Many of the attributes don't make sense in other operating systems or even an archive (for example numeric UID are basically useless unless extracted on the same system, the usernames and group names can sometimes be useful but often don't have a lot of value) I plan to make a separate issue to discuss the tags to include.

@whyrusleeping, others: thoughts?

@whyrusleeping

  1. Yeah, we do. It's a tagged type in cbor. You can always traverse a cbor object. This shouldn't be too much of an issue.
  2. We can probably have intermediate nodes be just an array, but we do need a small bit of context around subtree sizes for seeking.
  3. For sharding I figure we can reuse this hamt I'm working on using cbor: https://github.com/ipfs/go-hamt-ipld
  4. But the unix filesystem was really the point here. I think we should be able to support many different things (even if they don't make sense in every context). But we definitely shouldn't require everything be included.

Contributor Author

kevina commented Dec 14, 2017

@whyrusleeping

For (1) and (2) are you saying we should just not worry about how to traverse a raw cbor dag and something that needs to can just traverse the cbor structure looking for the special link type?

For (3) I honestly don't understand the hamt structure, but I think we should avoid the issue of having to flip-flop between a sharded and non-sharded directory. If a directory becomes too big, we split it and create a special entry to point to the sharded parts; if it shrinks, we inline (so to say) the directories back into the root. (I hope this is making sense. If not, I can spell it out with examples, because I am struggling to explain what seems like a simple concept for lack of the right terminology.)

[Edited to remove my comments on (4) as it was not well thought out.]

ghost commented Dec 15, 2017

I think I'd like to narrow down the scope of this. Metadata is a pretty
contentious issue, in that most of us seem to have different strong opinions
on which kinds of metadata should be included, and how.

I propose for now we look only into migrating unixfs from dag-pb to dag-cbor,
and make sure the data structure leaves room for future metadata additions.
For now, we should consider any metadata beyond what we already have in the
current incarnation of unixfs, as out of scope. What should be in scope though,
is the data structure we end up with being extensible in the future.

Moving unixfs to dag-cbor is a pressing issue, and additional metadata is not.
It sure is a legitimate issue though! I'd just like to move forward with the
migration to dag-cbor ASAP.

We should concentrate on making resolution and traversal work nicely.

Contributor Author

kevina commented Dec 15, 2017

@lgierth I am trying to minimize the fields included in this standard. I am not sure sticking with what we currently have is the best action, as it does not meet the requirements stated in #1, in particular the executable bit and being able to distinguish directories from files.

My previous comments (1) (2) (3) concern the data structure not the metadata.

A Unixfs is either a file or a directory.
The top level IPLD object is a CBOR map with at least two fields: `type` and `data`
and maybe a few other such as a version string or a set of flags.
The `type` field is either `file` or `dir`.

@ehmry ehmry Dec 17, 2017

I say better to define a CBOR tag for files and a tag for directories, and define a file as a tagged array and a dir as a tagged map. That makes it clear from the first atomic in the CBOR that you are parsing UnixFS.

Contributor Author

@ehmry we can do that, but do we then need to register the tags?

Not really, I was thinking just picking two random uint64 tags.

ghost commented Dec 17, 2017

I am trying to minimize the fields included in this standard. I am not sure sticking with what we currently have is the best action, as it does not meet the requirements stated in #1, in particular the executable bit and being able to distinguish directories from files.

What I'm trying to say is: let's narrow the requirements for the first stage. We can add metadata at the second stage. A CBOR-based unixfs is much more approachable if we don't mix it with extensions of the contained metadata right away. Trying to do everything at once will make this harder and longer to pull off.

Moving to #1

ehmry commented Dec 17, 2017

An easy thing to do would be to define three basic integer keys for metadata: type:1, CID:2, and size:3. A file DAG is basically a list of maps with keys CID and size, leaving type as reserved for future use. A directory DAG is a map of text keys to maps with keys type, CID, and size. Type is a field signifying a directory or a file, CID a raw block or CBOR dag, and size a logical file size or number of subdirectories. Additional metadata keys and file types can be added in later specs.
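
As an in-memory picture of that layout, here is a minimal sketch using Python dicts and lists in place of CBOR maps and arrays. The key numbers 1, 2 and 3 come from the comment; the type values, the byte-string stand-ins for CIDs, and the helper name are hypothetical.

```python
TYPE, CID, SIZE = 1, 2, 3      # integer keys as proposed in the comment
TYPE_FILE, TYPE_DIR = 0, 1     # hypothetical values for the type field


def file_logical_size(chunks) -> int:
    """Logical size of a file DAG: the sum of its chunk sizes."""
    return sum(chunk[SIZE] for chunk in chunks)


# A file DAG: a list of maps holding only CID and size.
file_dag = [
    {CID: b"chunk-0", SIZE: 262144},
    {CID: b"chunk-1", SIZE: 1000},
]

# A directory DAG: text keys mapped to {type, CID, size} maps.
directory = {
    "notes.txt": {TYPE: TYPE_FILE, CID: b"file-root", SIZE: file_logical_size(file_dag)},
    "photos": {TYPE: TYPE_DIR, CID: b"dir-root", SIZE: 2},  # size = subdirectory count
}

print(directory["notes.txt"][SIZE])
```

A reader that does not know a later metadata key can simply skip it, which is the extensibility argument for maps made earlier in the thread.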

@ipld ipld deleted a comment from kevina Jan 19, 2018
@ipld ipld deleted a comment from Kubuxu Jan 19, 2018
@ipld ipld deleted a comment from kevina Jan 19, 2018
@kevina kevina mentioned this pull request May 2, 2018

If an IPLD file is a leaf its CID type is `raw` (0x55) and has no structure.
Otherwise its CID type is `dag-cbor` (0x71).
The `type` field is set to `file` and the `data` field is an CBOR array.

Contributor

With the current dag-cbor and bitswap implementation it's quite limiting to require that this be an array of links. There's a block size limit of 2megs across the board. An array of links when serialized to CBOR can't be more than ~2,500 links before the node itself is larger than 2megs.

Unless we bake in a way to shard this we'll be limited to files that are smaller than ~5GB.
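
The ~5GB figure follows directly from the two numbers quoted above. A small worked calculation, taking the ~2,500 links per node and the 2 MiB block limit from the comment as given (neither is verified here):

```python
MAX_LINKS_PER_NODE = 2_500           # figure quoted in the comment above
MAX_BLOCK_BYTES = 2 * 1024 * 1024    # the 2 MiB block limit


def max_flat_file_bytes(links: int, chunk_bytes: int) -> int:
    """Largest file addressable by one flat node of links to equal chunks."""
    return links * chunk_bytes


at_max_block = max_flat_file_bytes(MAX_LINKS_PER_NODE, MAX_BLOCK_BYTES)
at_4k_chunks = max_flat_file_bytes(MAX_LINKS_PER_NODE, 4096)
print(at_max_block, at_4k_chunks)
```

With full-size blocks the bound is about 5.2 GB; with a chunker that emits 4 KiB blocks (an illustrative size) the same node covers only about 10 MB, which is the concern raised about smarter chunkers.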

Contributor

Also, I'm a lot less concerned about files under 5GB than I am concerned with not being able to develop a smart chunker for fear that if I don't use the max block size the node will be too large.

I'm starting to think about developing a chunker for javascript bundles that uses the sourcemap from the bundler to chunk it into blocks built from each file. This should greatly reduce the new blocks that need to be pushed when new bundles are created, but the number of chunks will easily be greater than 2,500.

The plan is to use a CHAMP instead of an array of links, like we do today for sharded directories. I have an implementation of a CHAMP (HAMT) in ipld here: https://github.com/ipfs/go-hamt-ipld

Member

The JS implementation for unixfs sharding can be found at https://github.com/ipfs/js-ipfs-unixfs-engine/tree/master/src/hamt. It was built by @pgte a while ago.

Contributor

Maybe I'm missing something, but I've only seen hash map trees used for named key values, not for an ordered array. If we're using this as a replacement for a CBOR Array, are there key semantics that need to be defined here in the spec for that? I'm only seeing the current JS hamt implementation used for sharded directories, not sharded file parts.

Contributor

@mikeal mikeal Jun 28, 2018

the short answer is "don't include that many file parts in a single node"

This isn't sufficient for at least 3 use cases I can think of.

  • Bundles: Tarballs, webpack and browserify output, etc. The ideal way to chunk these files is to chunk around the boundaries of the originating files so that changes to the bundle translate into a relatively small number of part changes. Tarballs that pack many small files, and pretty much any modern front-end bundle, are created from more than 2,500 original files, so this puts them easily above the limit.
  • Compressed Files: Gzipped content, but especially streaming media files: ogg, mpeg, etc. You want the chunker to create blocks that respect the compression windows. This results in faster performance throughout the stack but is especially important when seeking within that content, as the codec will always ask for the whole window that the seek point falls in, and if it's in the middle of the block this translates into a delayed seek while the content buffers. These compression windows are configurable and sometimes the windows are very small, so a relatively small video file could be more than 2,500K parts.
  • Files Larger than 5GB: As discussed here, there's currently a 2MB bitswap block limit. There's talk of supporting larger blocks, but the larger the block the less efficient the transport will be.

It's fine if we just want to say that these use cases are out of scope for unixfsv2. Pushing them out of scope just means that we have some time to see how people solve these issues outside of unixfsv2 and it may help us get things shipped faster. But we're probably signing up for a unixfsv3 at some point in the future if we don't have any other way to work around this limitation.

@mikeal look at how the importers work and structure the graph of file parts. The answer is still 'don't include that many file parts in a single node'. Ipfs chunks and structures things into a recursive tree, not just a single level with a flat array of links.
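
How quickly a recursive tree removes the per-node limit can be sketched as below. The fanout of 2,500 is only the figure used earlier in this thread, not a value taken from any importer, and `levels_needed` is a hypothetical helper.

```python
def levels_needed(n_chunks: int, fanout: int) -> int:
    """Number of link levels needed so that fanout**levels >= n_chunks."""
    if n_chunks < 1 or fanout < 2:
        raise ValueError("need at least one chunk and a fanout of two or more")
    levels, capacity = 1, fanout
    while capacity < n_chunks:
        capacity *= fanout
        levels += 1
    return levels


print(levels_needed(2_500, 2_500))
print(levels_needed(2_501, 2_500))
print(levels_needed(1_000_000, 2_500))
```

With that fanout, two levels already address 6,250,000 chunks, so a file with a million small parts needs only one extra layer of interior nodes.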


Someone in Berlin mentioned that it was effectively an "array of arrays."

One thing to consider with this design: range requests don't work without loading every part of the file from the beginning to the start of the range. There's no information about the size of the individual parts, so the only way to know how to seek is to load them all in serial. That's really not ideal, especially for media use cases, because it makes seeking quite slow.


The current design has range information: seeking is efficient and only has to load the nodes required for that graph traversal. I assume we would do exactly the same in V2.
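
A minimal sketch of why size information makes seeking cheap: if each link records the byte size of the subtree it points to, a range read can skip every subtree that lies outside the range without loading it. The node shape `{"links": [(size, child), ...]}` and the `read_range` helper are assumptions made for illustration, not the existing unixfs encoding.

```python
def read_range(node, start, end):
    """Return the leaf ids overlapping the byte range [start, end).

    node is {"links": [(size, child), ...]}, where child is either another
    node (dict) or a leaf id (str). Subtrees outside the range are skipped.
    """
    hits = []
    offset = 0
    for size, child in node["links"]:
        child_start, child_end = offset, offset + size
        offset = child_end
        if child_end <= start or child_start >= end:
            continue  # entirely outside the range: never loaded
        if isinstance(child, dict):
            # Recurse with the range translated into the child's coordinates.
            hits += read_range(child, start - child_start, end - child_start)
        else:
            hits.append(child)
    return hits


left = {"links": [(10, "a"), (10, "b")]}   # bytes 0..20
right = {"links": [(10, "c"), (5, "d")]}   # bytes 20..35
root = {"links": [(20, left), (15, right)]}

print(read_range(root, 22, 31))  # ['c', 'd']
print(read_range(root, 0, 10))   # ['a']
```

In the first call the `left` subtree is skipped entirely, which is the property that keeps seeks from loading every part in serial.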


I'd like to see our fix to this involve moving from `data: [chunks]` to `data: {parts: [chunks], partLengths: []}`, because if we make the `data` attribute an object we can add other relevant information, like the type of chunking algorithm used, which would allow us to implement more efficient syncing between clients.
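
A minimal sketch of how a reader could use that shape: with `partLengths` next to `parts`, a byte offset can be mapped to a part with a binary search over cumulative lengths. The `locate` helper and the `chunker` key are hypothetical names used only for illustration; the part ids are placeholders, not real CIDs.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(data, offset):
    """Return (part_index, offset_within_part) for a byte offset into the file."""
    ends = list(accumulate(data["partLengths"]))  # cumulative end offset of each part
    if offset < 0 or not ends or offset >= ends[-1]:
        raise IndexError("offset is outside the file")
    index = bisect_right(ends, offset)            # first part whose end is past offset
    part_start = ends[index - 1] if index else 0
    return index, offset - part_start


data = {
    "parts": ["part-0", "part-1", "part-2"],  # placeholder links
    "partLengths": [4, 6, 5],
    "chunker": "fixed-size",                  # hypothetical extra metadata key
}

index, inner = locate(data, 9)
print(data["parts"][index], inner)  # part-1 5
```

Only the part that contains the requested offset needs to be fetched, and extra keys such as the chunking algorithm can sit alongside without changing how `parts` is read.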

The `type` field is set to `file` and the `data` field is an CBOR array.
Each element of the array is CBOR map with the following fields:

- `data`: link

Any reason why this field isn't called `link`, given that it's always a link?


An IPLD `dir` represents a directory.
Its CID type is `dag-cbor` (0x71).
The `type` field set to `dir` and the data field is an CBOR map.

Small typo, 'an CBOR map'


If an IPLD file is a leaf its CID type is `raw` (0x55) and has no structure.
Otherwise its CID type is `dag-cbor` (0x71).
The `type` field is set to `file` and the `data` field is an CBOR array.

Small typo, 'an CBOR array'

When extracting, implementations SHOULD use the IPLD name and not `fname` unless a special flag is given.

* To save space fields of a directory may be assigned integer values.
Integers have the added benefit of conveying additional meaning based on there values;

Small typo: 'there' vs 'their'.

@mikeal

mikeal commented Jun 30, 2018

I had some spare cycles today while I'm waiting for a graph to build so I wrote a quick implementation of the draft spec in JavaScript.

https://github.com/mikeal/js-unixfsv2-draft

Stripping this field MUST NOT change the meaning of the directory entry.
These attributes SHOULD be passed along but do not have to be understood.

Possible entries:

Would "Extended Attributes" be a good place to optionally store explicit media type for problematic data types, as noted in #11 ?

@mikeal

mikeal commented Aug 8, 2018

At DWeb we discussed the fact that this PR is impossible to follow at this point. @Stebalien suggested something the Rust community does, which is to close the PR and immediately re-open with a list of the unresolved issues in the body of the PR.

@mikeal

mikeal commented Sep 21, 2018

I'm going to try and move things along and finalize a draft spec by the end of Q4.

This thread is a bit too large to continue. There are many conversations about many changes without a clear owner. My plan is to:

  1. Create a new PR with a skeleton of the current proposal.
  2. Create several PRs against that branch for all the details that have an active discussion in this thread, along with a summary of the current discussion.

@dokterbob

Hey @mikeal, all!

We're nearly a year later now, when's the spec moving forward? We want decent file support in IPFS!

@warpfork

warpfork commented May 5, 2019

Our priorities have been pulled in a lot of directions.

The good news is that the IPLD folks (including me) are getting ready to take a look at this again starting next week, with all the new learnings and tools we have from working on IPLD schemas and IPLD selectors over the last couple of months. That will hopefully make a big difference in how tractable the overall thing is, and in how quickly, tersely, and confidently we can settle a new language-agnostic spec.

I think the other already-merged PRs that referenced this one recently have also already carried the ball a bit further, and I think @mikeal has some other PoC code as well. This issue will probably be closed soon, per the reasoning two comments back, but we're planning a housecleaning and close-a-thon for lots of overgrown issues tomorrow, so I'll leave this to get one more review and be done in that batch.

@mikeal mikeal closed this May 6, 2019
@mikeal

mikeal commented May 6, 2019

@dokterbob We’ve got a working implementation in JavaScript https://github.com/ipld/js-unixfsv2 and expect one in Go fairly soon. We’re also working on standardizing schemas/collections/hamt, which is a prerequisite to shipping this new version (although we can start doing the integration work before this is finished).

Another thing to note about these new implementations is that they are being designed to be used outside of IPFS as well, so if you want to start using them before they are fully integrated, you can.
