This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

Requirements #1

Closed
kevina opened this issue Sep 17, 2017 · 42 comments

@kevina
Contributor

kevina commented Sep 17, 2017

This issue is to gather the requirements and/or desired features for the next generation unixfs directory.

@whyrusleeping @Stebalien @magik6k

ref: ipfs/kubo#4229

@whyrusleeping

also cc @diasdavid @dignifiedquire @Kubuxu

@daviddias
Member

also cc @pgte

@Stebalien

Need:

  • Executable bit (package manager)
  • Extended attributes (package manager)
  • Inline file type (performance)

Maybe:

  • Inline small files (performance).
  • Inline small directories (performance).

Question: Do we want to duplicate the file type? That is, store it in the file and in the directory or just in the directory.

@Kubuxu

Kubuxu commented Sep 18, 2017

I would say it would be nice to preserve all the information we can pull from the filesystem. My guess is that basing it on the tar format (as in, what it stores) would be a good start.

Storing extended attributes would be very nice too. It would allow storing, for example, ACLs. They could be a special case of generic metadata.

@Kubuxu Kubuxu changed the title Requiments Requirements Sep 18, 2017
@whyrusleeping

whyrusleeping commented Sep 18, 2017 via email

@Stebalien

@whyrusleeping actually, we can get the best of both worlds by storing this metadata in the directories but not the files. If we really need to attach the metadata to the files themselves, we could add a special metadata node that adds metadata to a file (although this would only be used when linking to files directly which probably won't be that common).

@Kubuxu

Kubuxu commented Sep 19, 2017

I think it might also be useful to start producing directory-wrapped files by default. It would be an even more breaking change, but it would clear up more confusion and preserve more metadata.

@whyrusleeping

Also needed is support for X-Attrs

@kevina
Contributor Author

kevina commented Oct 9, 2017

@whyrusleeping and anyone, I can create a draft proposal to get things moving. However, I am a little confused about the format the spec should be in. Since there is ipld in the name, I take it it should be in IPLD, which is basically JSON but with the use of integers. An extremely primitive spec might be:

[{"afile": {"/": "Qm...", "mtime": "...", "size": 1234},
  "adir": {"/": "Qm...", "isdir": true},
  "aexec": {"/": "Qm...", "isexec": true, "mtime": "...", "size": 768783}}]

but the IPLD spec does not spell out how this will be serialized in binary formats, as far as I can tell. The translation to CBOR is pretty straightforward, but I don't like the idea of including string hash keys in what should be a compact binary format; I would rather assign integers to properties such as "/", "mtime", "size", "isdir", and "isexec". If we want to serialize to a protobuf, we have to assign integer values to the properties anyway or accept a very inefficient protobuf encoding.

Notes on the design choices in the spec above: (1) I believe very strongly that some sort of timestamp should be included by default. (2) There are multiple ways to define the "size" of a directory. The current unixfs defines it as the size of the directory plus its contents, while real Unix filesystems define it as the size of the directory entry without its contents. I think it is better just to leave it out.

@kevina
Contributor Author

kevina commented Oct 9, 2017

As for what to include, the tar spec might be a good start: https://en.wikipedia.org/wiki/Tar_%28computing%29

There are some fields, though, such as the owner and user ID, that are probably not generally useful and should not be included by default. There is also the question of how Unix-like we want this to be; for example, the "file mode" is very Unix specific.

@Stebalien

@kevina, a few notes:

  • IPLD is really just a meta format that supports many different serializations. It's a way of mapping unstructured binary to structured, linked data. Usually, we prefer using DagCBOR if at all possible.
  • The {"/": "Qm..."} is just how we encode links in JSON (because, unlike e.g. CBOR, JSON doesn't support "tags" for custom types). You can't add other fields to links. If you find any documents saying otherwise, please tell me so I can fix them. Warning: the spec is a bit out of date (like all of our specs...).
  • I doubt the overhead of string names will be all that bad, even for tiny files. First, we can omit key/value pairs with default values. Second, CIDs themselves are generally ~36 bytes (in CBOR, including tags etc., links cost 41 bytes IIRC) so links will likely dwarf the cost of string keys. There have been grumblings (i.e., I've been grumbling) about adding compression support but that's still in the "it would be nice" stage.
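
The byte counts in the last point can be checked with a rough sketch. The Python below is a hand-rolled encoder for a small subset of CBOR, not a DagCBOR library. The field names (`link`, `mtime`, `size`, `isexec`), the integer key assignments, and the link layout (tag 42 around a byte string with a one-byte prefix, holding a 36-byte placeholder CID) are all assumptions made for illustration.

```python
def head(major: int, n: int) -> bytes:
    """Encode a CBOR head: major type (0-7) plus a non-negative argument."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, width in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < (1 << (8 * width)):
            return bytes([(major << 5) | extra]) + n.to_bytes(width, "big")
    raise ValueError("argument too large for CBOR")


class Link:
    """Stand-in for a CID; 36 zero bytes model a CIDv1 with a sha2-256 hash."""

    def __init__(self, cid: bytes = bytes(36)):
        self.cid = cid


def encode(value) -> bytes:
    """Encode the small subset of CBOR needed for this comparison."""
    if isinstance(value, bool):  # must come before int: bool is an int subclass
        return b"\xf5" if value else b"\xf4"
    if isinstance(value, int):
        if value < 0:
            raise ValueError("negative integers are not handled in this sketch")
        return head(0, value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return head(3, len(raw)) + raw
    if isinstance(value, Link):
        # Assumed link layout: tag 42 around a byte string with a 0x00 prefix.
        body = b"\x00" + value.cid
        return head(6, 42) + head(2, len(body)) + body
    if isinstance(value, dict):
        return head(5, len(value)) + b"".join(
            encode(k) + encode(v) for k, v in value.items()
        )
    raise TypeError(f"unsupported type: {type(value).__name__}")


# The same entry with string keys and with hypothetical integer keys.
# "isdir" is false, the default, so it is left out entirely.
string_keyed = {"link": Link(), "mtime": 1507507200, "size": 1234, "isexec": True}
int_keyed = {0: Link(), 1: 1507507200, 2: 1234, 3: True}

link_size = len(encode(Link()))
string_size = len(encode(string_keyed))
int_size = len(encode(int_keyed))
print(link_size, string_size, int_size)
```

Under these assumptions the link alone takes 41 bytes, and the whole entry takes 74 bytes with string keys versus 55 with integer keys, so the keys account for 19 bytes of difference per entry.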

We may be able to use xattrs for this but it would be nice to support linking to "related" files in IPFS. This way, we can write importers that can, e.g., parse a markdown/HTML file, find all linked IPFS content, and link to it from the metadata (so it gets pinned along with the file).

@Stebalien

  • Support for CARs.

CARs are (to be specced) content addressable archives. That is, archives of IPLD DAGs. The primary motivation is to be able to dump an IPLD DAG onto a hard drive, ship it somewhere, and then import it back into IPFS.

However, I believe we can use them to bridge IPLD and IPFS. That is, when I call ipfs files cp /ipld/QmId... /my-car, IPFS would link /ipld/QmId into my mfs filesystem (with a filename my-car). If I were to cat this file, IPFS would generate a CAR of the IPLD DAG on the fly.

The power here is not the ability to generate CARs on the fly (although that's really convenient), it's the ability to map structured linked data into unixfs without losing its structure. For now, most tools will just see it as a byte stream (a CAR). However, we can give it some extra metadata marking it as an IPLD DAG so tools that understand IPLD can operate on it that way.

This may also be useful for access control (ACLs on files lead to ACLs on structured data) but I haven't put much thought into that yet.

@Kubuxu

Kubuxu commented Oct 20, 2017

What we have to take into account is not only format for unixfs itself but we also have to create arbitrary sized, seekable bytearrays on IPLD.

@Stebalien

What we have to take into account is not only format for unixfs itself but we also have to create arbitrary sized, seekable bytearrays on IPLD.

IPLD doesn't have built-in sharding by design (so it can't really have arbitrary sized byte arrays). We decided to punt on that and build a sharded DAG system on top of it later (leave IPLD objects as "atoms"). However, this is a good opportunity to tackle that. The idea was to abstract the sharding logic from IPFS into a middle layer between IPFS and IPLD.
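
As a rough illustration of what such a middle layer has to do, the Python sketch below splits a byte string into fixed-size chunks ("atoms"), stores each under its hash, and serves seekable reads from an index of cumulative offsets. The names `put`, `build`, and `read`, the dict used as a block store, and the flat single-level index are all hypothetical; a real layer would presumably nest indexes into a tree so the index itself stays small.

```python
import bisect
import hashlib
from itertools import accumulate

CHUNK = 4  # tiny chunk size so the example is easy to follow


def put(store: dict, data: bytes) -> str:
    """Store a block under the hex digest of its content and return the key."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def build(store: dict, data: bytes) -> dict:
    """Split data into chunks; return an index of chunk keys and end offsets."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    keys = [put(store, c) for c in chunks]
    ends = list(accumulate(len(c) for c in chunks))
    return {"keys": keys, "ends": ends}


def read(store: dict, index: dict, offset: int, length: int) -> bytes:
    """Read up to `length` bytes at `offset`, fetching only the chunks needed."""
    total = index["ends"][-1] if index["ends"] else 0
    end = min(offset + length, total)
    out = bytearray()
    # First chunk whose end offset lies beyond the requested position.
    i = bisect.bisect_right(index["ends"], offset)
    while offset < end:
        chunk_start = index["ends"][i - 1] if i else 0
        chunk = store[index["keys"][i]]
        piece = chunk[offset - chunk_start:end - chunk_start]
        out += piece
        offset += len(piece)
        i += 1
    return bytes(out)


store = {}
data = b"hello sharded world"
index = build(store, data)
print(read(store, index, 6, 7))
```

Nothing in this sketch is specific to files, which is the point of pulling the sharding logic out of unixfs: any application with large values could reuse the same index-plus-chunks layout.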

@kevina
Contributor Author

kevina commented Oct 20, 2017

@whyrusleeping @Stebalien and others, can I have a concrete example of what you envision the IPLD unixfs looking like? I have not been following the IPLD development closely, and right now the spec seems more like a collection of notes than a formal spec.

@Stebalien

Writing up an IPLD spec is one of my tasks this quarter (right behind putting out fires on the gateways).

Concrete Problems

There are two concrete issues with the current system: poor abstractions and the DagProtobuf IPLD format.

Abstractions

We've implemented sharding directly in unixfs. This means that other applications can't take advantage of this work to shard up their own structured data.

DagProtobuf

This IPLD format is cumbersome, rigid, and not self describing.

  • Cumbersome: It's just hard to work with. Having to pull out a binary blob from the data field, decode it, modify it, add any necessary links, and then put everything back together is a pain. Being able to mix data and links (and have structured data) is really convenient. For example, we can't just add extra metadata to files linked to from directories, we'd have to add the metadata to the Data section and reference the appropriate link in the Links section.
  • Rigid: It's hard to just add new fields as we need them. For example, let's say that I want some directory entries to link to some metadata. Currently, I'd have to create a new directory type because I can't add links to the links section without them being interpreted as additional directory entries.
  • Unixfs objects aren't self-describing. We put all the interesting data in a protobuf and then stick it in the Data field. This makes a lot of the interesting data inaccessible to, e.g., IPLD selectors (or any code that operates over unixfs at the IPLD level).
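
The self-describing point can be shown with a toy example. In the Python sketch below, JSON stands in for the inner protobuf, and the node shapes and field names (`Data`, `Links`, `Type`, `filesize`) are simplified assumptions, not the real wire format. A generic path resolver, of the kind a tool working purely at the IPLD level might use, cannot see into the opaque blob but can read the flat node directly.

```python
import json


def resolve(node, path):
    """Follow a list of keys through nested maps, as a generic tool would."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Wrapped style: the interesting fields are serialized into an opaque blob.
wrapped = {
    "Data": json.dumps({"Type": "file", "filesize": 1234}).encode("utf-8"),
    "Links": [],
}

# Self-describing style: the same fields sit directly in the node.
plain = {"type": "file", "filesize": 1234, "links": []}

print(resolve(wrapped, ["Data", "filesize"]))  # the blob is opaque to the resolver
print(resolve(plain, ["filesize"]))

# Reaching the wrapped field needs format-specific knowledge of the blob.
inner = json.loads(wrapped["Data"])
print(resolve(inner, ["filesize"]))
```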

Goals

  • Self describing: systems that understand IPLD but not, e.g., IPFS should be able to execute queries over IPFS DAGs.
  • Extensible: we should be able to add features without breaking old ipfs daemons. Honestly, I'd like to get to a point where we never introduce a feature such that an old daemon can't read a file imported by a new daemon; this isn't going to happen but we should at least keep it in mind.
  • A pleasure to work with: users should be able to write their own file importers without cursing us.
  • Reasonably efficient (both in space and time).
  • unixfs <-> *nix filesystems: we should, generally, be able to copy files from linux to unixfs and back without losing information. The caveat is that we may lose some information about permissions but we can't always help that (we'll have to think carefully about this).
  • Boot to IPFS: many of us would like to be able to eventually use IPFS as our root filesystem (long term goal, will likely require lots of caching). We should avoid any decisions that will make this unworkable.

Personally, I'd also like unixfs to interoperate with structured data better than other filesystems. That's why I want the CAR support.

@mib-kd743naq

unixfs <-> *nix filesystems: we should, generally, be able to copy files from linux to unixfs and back without losing information. The caveat is that we may lose some information about permissions but we can't always help that (we'll have to think carefully about this).

This is exactly the "mission statement" of the various tar formats: we will try to preserve as much as we can, uid/gid included, but we might not be able to actually recreate it on the client side.

See also my thoughts in ipfs/kubo#4292 (comment)

@mib-kd743naq

/cc @mguentner as once upon a time he expressed the same frustrations as mine: ipfs/notes#60 (comment)

@ehmry

ehmry commented Oct 24, 2017

This is probably a minority opinion but I think that storing permissions, timestamps, and extended attributes is archaic and backwards. I say make a simple standard for simple file-system DAGs, and then a superset unixfs standard for storing cruft metadata. Unix is only as relevant as computer science is lazy and unimaginative. UIDs, GIDs, and permissions are features of administrated file-systems, not distributed ones. The executable bit is just arbitrary metadata that can be replaced by conventions in where executable data is stored. If the exec bit is present, what would a proper #! line look like? The more Unix stuff that is explicitly supported the more there will be odd corner cases to fumble through.

@mib-kd743naq

and then a superset unixfs standard for storing cruft metadata.

My understanding is that this is exactly what this issue is trying to define. IPLD on its own is well defined and already has apps running on it...

@ehmry

ehmry commented Oct 24, 2017

I'm just trying to argue for a simple file and directory standard that works across different operating systems that addresses things like small file packing but defers the Unix stuff.

@kevina
Contributor Author

kevina commented Oct 24, 2017

@ehmry I agree that storing permissions and extended attributes is not useful for other operating systems but storing the last modified time is useful. The zip file format also stores timestamps: https://en.wikipedia.org/wiki/Zip_(file_format).

@whyrusleeping

I really disagree that last modified time is useful in this context. @kevina what is the use case for that?

@kevina
Contributor Author

kevina commented Oct 24, 2017

@whyrusleeping I don't have a "use case"; I just consider it a good best practice. Many, many, many times when trying to figure out what a file is several years later, the timestamp can give important clues. I would really like to see this information preserved by default and not as part of some extended metadata package that can easily be discarded and lost forever.

Almost every filesystem and every archive format has some sort of timestamp on files and I consider it basic information. I was quite surprised that the current unixfs format does not have it.

@Stebalien

I say make a simple standard for simple file-system DAGs, and then a superset unixfs standard for storing cruft metadata.

Personally, I'd like to specify a minimum format and a way of storing metadata. Then, we can say that specific metadata fields mean specific things. Minimal implementations can ignore metadata they don't care about. This also means that, for the purposes of archiving, we can record and extract anything.

I really disagree that last modified time is useful in this context.

I kind of agree as well. Personally, I see IPFS data as kind of "timeless". However, I have also found modification times to be useful... Honestly, I don't know the right solution. I'd say at least support them, even if "support" just means "stick them in some extended attribute field if requested" (e.g., by a tar-like program).

@Stebalien

Other idea: Pinning. This came up elsewhere but we've discussed replacing the current pinning system with mfs. That is, you wouldn't "pin" files, you'd link them into your own personal filesystem. This way, all pinned files are named and applications can manage their own named pin sets by managing their own application-specific sub-directories (each app would get a data folder).

However, we'd probably want a way to specify what should be pinned. There are many cases where one would want to pin certain parts of one's filesystem but not all of it (it can be useful to link to data without necessarily persisting it). Even beyond pinning, it would be useful to be able to include hints about what to download first/what data is "important" in IPFS. Therefore, I think it may be useful to allow one to attach IPLD selectors to directory entries describing the relative importance of the data within it (where pin-selectors attached to higher level directories can override selectors attached to lower-level directories).

For now, we probably don't have to tackle this (we don't even have IPLD selectors yet). However, we may want to keep it in mind. This is yet another piece of metadata we'll likely want to be able to attach to files.

@kevina
Contributor Author

kevina commented Oct 25, 2017

I'd say at least support them, even if "support" just means "stick them in some extended attribute field if requested"

And this is exactly what I don't want to happen because it can easily be disregarded. If I understand how we want this to be implemented, extended attributes may be in a separate IPFS block from the core data (to, say, aid in deduplication). And even if they are not, they would be lumped with all the other extended attributes that could easily be stripped. The motivation for stripping this could be to save space, or it could be privacy concerns if the metadata includes other information such as the Unix username.

For an example of how metadata can get lost, look at pictures shared on the net. JPEG has all sorts of metadata as part of the format, but this information is often intentionally stripped due to privacy concerns. And even when it is not, all the metadata is lost anyway when we convert it to various other formats or when it is shared by taking a screenshot.

For archival purposes, when something was created is the single most useful piece of context. For example, if someone wrote a note to themselves (but didn't include the date in the text) and then looks back on it several years later, the date can help determine exactly what that note was referring to. Since the creation date is not really stored by most filesystems, we have to go with the next best thing, the last modified date. Even with that, the combined modification times of all the files in a directory could be useful in determining when the files were created. For example, if the backup copy of a file was modified 4 years before the current copy and all the other files in the directory were modified 3-4 years ago, that gives a pretty good idea that most of the files were likely created 3-4 years ago.

@Stebalien

And this is exactly what I don't want to happen because it can easily be disregarded.

That's exactly why I want to do it this way. I'd like to keep the core of IPFS simple (ish). Now, I wouldn't just stick it in an arbitrary metadata field, we'd have an agreed upon metadata field for storing modification dates (and possibly others for, e.g., linking to past versions of the file). However, I'd like to support simple clients that simply treat all metadata as "extra stuff" to be preserved and copied but not necessarily understood. That way, the protocol is extensible but simple at its core.
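
A minimal Python sketch of that split, assuming a hypothetical entry layout with a `link` field and a free-form `meta` map. The names `rename`, `understood`, and `KNOWN_FIELDS`, along with the `previous-version` key, are invented for illustration. The simple client moves an entry without looking inside its metadata, while a client that knows about `mtime` reads only that field.

```python
KNOWN_FIELDS = {"mtime"}  # the only metadata this hypothetical client interprets


def rename(directory: dict, old: str, new: str) -> dict:
    """Move an entry to a new name, carrying its metadata along unexamined."""
    updated = dict(directory)
    updated[new] = updated.pop(old)
    return updated


def understood(entry: dict) -> dict:
    """Return only the metadata fields this client knows how to interpret."""
    return {k: v for k, v in entry.get("meta", {}).items() if k in KNOWN_FIELDS}


directory = {
    "notes.txt": {
        "link": "Qm...",
        "meta": {"mtime": "2017-10-25T00:00:00Z", "previous-version": "Qm..."},
    }
}

renamed = rename(directory, "notes.txt", "old-notes.txt")
print(renamed["old-notes.txt"]["meta"])     # unknown fields survive the rename
print(understood(renamed["old-notes.txt"]))  # only mtime is interpreted
```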

For an example of how metadata can get lost, look at pictures shared on the net. JPEG has all sorts of metadata as part of the format, but this information is often intentionally stripped due to privacy concerns. And even when it is not, all the metadata is lost anyway when we convert it to various other formats or when it is shared by taking a screenshot.

That's another reason I'd like extensible metadata fields.

If I understand how we want this to be implemented, extended attributes may be in a separate IPFS block from the core data (to, say, aid in deduplication).

Honestly, I'm not sure what's the best way to do this.

My current thinking is that metadata would usually be inlined into the directory except when linking to a file directly (in which case, it would be broken out into a separate block). Usually, I'd discourage linking to individual files directly. However, this does introduce some complexity.

@mib-kd743naq

I am a visual kind of person, so I put together this graph to make sure I understood what both @kevina and @Stebalien are proposing.

This assumes 3 directories in each case, each containing a single entry, pointing to a file that is small enough to fit in a single block.

What did I get right, and what did I get wrong?

@whyrusleeping

@mib-kd743naq could you also post your dot source?

@kevina
Contributor Author

kevina commented Oct 26, 2017

@mib-kd743naq that is not really what I am proposing. I am proposing that the mtime be just another field in the directory entry, just like the file name, the hash, the execution bit, etc.

@ehmry

ehmry commented Oct 28, 2017

Timestamps are probably most efficiently stored as data derived from a journal. If a major file-system is backed by IPLD then append a timestamp and the top-level CID to a journal every few minutes. If you want a revision history then work your way backwards through the journal. If that is too slow then you should be using a native unix file-system as scratch-space anyway.
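
A small Python sketch of the journal idea, assuming an in-memory list in place of whatever append-only storage would really be used. The `Journal` class and its method names are hypothetical, and the root CIDs are placeholder strings. Entries are appended in time order, and a binary search finds the root that was current at a given moment.

```python
import bisect


class Journal:
    """Append-only record of (timestamp, top-level CID) pairs."""

    def __init__(self):
        self._times = []
        self._roots = []

    def append(self, timestamp: int, root_cid: str) -> None:
        """Record the current root; unchanged roots are not recorded again."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("journal entries must be appended in time order")
        if self._roots and self._roots[-1] == root_cid:
            return
        self._times.append(timestamp)
        self._roots.append(root_cid)

    def root_at(self, timestamp: int):
        """Return the root that was current at `timestamp`, or None if too early."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._roots[i - 1] if i else None

    def history(self):
        """Return (timestamp, root) pairs, newest first."""
        return list(zip(reversed(self._times), reversed(self._roots)))


journal = Journal()
journal.append(100, "root-a")
journal.append(400, "root-a")  # nothing changed, so no new entry
journal.append(700, "root-b")
print(journal.root_at(650))
print(journal.history())
```

A per-file modification time would then have to be derived by comparing successive roots, which is the slow path the comment above refers to.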

@kevina
Contributor Author

kevina commented Oct 28, 2017

Everyone, there is now an initial spec to comment on in #2.

@Stebalien

Another thing it would be nice to support: hints about where to find files (e.g., a URL, a bittorrent tracker, etc.). This would probably just go in the metadata section but we should keep it in mind.

@kevina
Contributor Author

kevina commented Nov 5, 2017

Another thing it would be nice to support: hints about where to find files (e.g., a URL, a bittorrent tracker, etc.). This would probably just go in the metadata section but we should keep it in mind.

I am a little confused by this. Doesn't this go against what IPFS is about? That is, given a CID, it should be IPFS's job to find it, wherever it is.

@Stebalien

I am a little confused by this. Doesn't this go against what IPFS is about? That is, given a CID, it should be IPFS's job to find it, wherever it is.

At the end of the day, it is. However, IMO, hints are fair game. Giving IPFS hints at where to find it could make IPFS much faster.

@ghost

ghost commented Dec 17, 2017

Moving these comments here from #2:

I think I'd like to narrow down the scope of this. Metadata is a pretty contentious issue, in that most of us seem to have different strong opinions on which kinds of metadata should be included, and how.

I propose for now we look only into migrating unixfs from protobufs-in-dag-pb to dag-cbor, and make sure the data structure leaves room for future metadata additions. For now, we should consider any metadata beyond what we already have in the current incarnation of unixfs as out of scope. What should be in scope, though, is the data structure we end up with being extensible in the future.

Moving unixfs to dag-cbor is a pressing issue, and additional metadata is not. It sure is a legitimate issue though! I'd just like to move forward with the migration to dag-cbor away from protobufs ASAP.

We should concentrate on making resolution and traversal work nicely.


I am trying to minimize the fields included in this standard. I am not sure sticking with what we currently have is the best action, as it does not meet the requirements stated in #1, in particular the executable bit and being able to distinguish directories from files.

What I'm trying to say is: let's narrow the requirements for the first stage. We can add metadata at the second stage. A CBOR-based unixfs is much more approachable if we don't mix it with extensions of the contained metadata right away. Trying to do everything at once will make this harder and longer to pull off.

@daviddias
Member

Moving unixfs to dag-cbor is a pressing issue

I agree with many of the points above, however I do not understand why just creating dag-cbor nodes instead of dag-pb is a pressing issue by itself if they are going to have the same format (Links + Data). What am I missing?

@ghost

ghost commented Dec 17, 2017

I agree with many of the points above, however I do not understand why just creating dag-cbor nodes instead of dag-pb is a pressing issue by itself if they are going to have the same format (Links + Data). What am I missing?

We could just start creating dag-cbor nodes instead of dag-pb nodes, but my understanding is that we want to get rid of protobuf structures (unixfs would still be protobufs in a dag-cbor node), not carry them over.

"Pressing" might have been too strong a word. There are just UX issues with unixfs still being based on a double protobuf structure that I think are significant. We can solve them sooner if we break up this endeavour into smaller, more digestible pieces.

Unixfs is the data structure that almost every user is exposed to right from the start, and it's confusing that there's an additional layer, instead of it just being straight-forward IPLD, like the filesystem examples in the IPLD spec. We're just still in the middle of the IPLD migration, and we have these two separate UX strains: ipfs cat|add|files|object on the one hand, ipfs dag on the other hand.

This is basically the UX problem that motivated the unixfs/IPLD session during the labweek unconf in October (but I think unfortunately we didn't take good notes there). This problem makes it really complicated to explain how IPFS represents data. Most people eventually get it, but it's still a complication we should solve sooner rather than later.

tl;dr I'm thinking moving forward with the general IPLD migration should take priority over extending unixfs features, for now.

@kevina
Contributor Author

kevina commented Jan 5, 2018

@Stebalien how important is the executable bit?

To help move things along, I am very tempted to go with @lgierth's suggestion and not add new attributes.

@rvagg
Member

rvagg commented Dec 6, 2022

closing for archival

@rvagg rvagg closed this as completed Dec 6, 2022

8 participants