Adding additional file metadata to UnixFSv1 #217

Closed
mikeal opened this issue Aug 8, 2019 · 16 comments

Comments

@mikeal

mikeal commented Aug 8, 2019

Current UnixFSv1 importers do not encode most of the standard file metadata from most file systems.

This has been a particular challenge for package managers since they already rely on some of this metadata.

The goal of this issue is to surface all the necessary discussion points in order to drive a new PR against the unixfs spec.

Potential metadata

  • Permissions
    • Executable bit
    • Ownership (user and group)
  • Filename in file object
  • mtime
  • ctime
  • atime
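For reference, most of the fields listed above are what a POSIX `stat` call already returns. A minimal Python sketch of collecting them for one file follows; the `FileMetadata` name and its layout are hypothetical and not part of any UnixFS spec.

```python
import os
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """Hypothetical container for the candidate fields; not a UnixFS type."""
    name: str
    mode: int         # permission bits only, e.g. 0o755
    executable: bool  # owner-executable bit
    uid: int
    gid: int
    mtime: int        # whole seconds since the Unix epoch
    ctime: int        # inode change time on POSIX, not creation time
    atime: int


def read_metadata(path: str) -> FileMetadata:
    """Collect the candidate metadata for one file from os.stat."""
    st = os.stat(path)
    return FileMetadata(
        name=os.path.basename(path),
        mode=stat.S_IMODE(st.st_mode),
        executable=bool(st.st_mode & stat.S_IXUSR),
        uid=st.st_uid,
        gid=st.st_gid,
        mtime=int(st.st_mtime),
        ctime=int(st.st_ctime),
        atime=int(st.st_atime),
    )
```

Truncating the timestamps to whole seconds here is a simplification for the sketch, not a recommendation.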

Additional considerations

For time stamps (mtime, ctime, atime) we need to decide if we’re going to use high precision times or not. Most systems expect a 32-bit integer (low precision) while other use cases may need a 64-bit integer (high precision).
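To make the precision question concrete, here is a rough Python sketch of what each choice can represent. It assumes the 32-bit option means unsigned whole seconds since the Unix epoch and the 64-bit option means nanoseconds since the epoch; the issue does not fix either encoding, so both are illustrative assumptions.

```python
from datetime import datetime, timezone

UINT32_MAX = 2**32 - 1


def to_uint32_seconds(ns_since_epoch: int) -> int:
    """Truncate a nanosecond timestamp to whole seconds, checking uint32 range."""
    seconds = ns_since_epoch // 1_000_000_000
    if not 0 <= seconds <= UINT32_MAX:
        raise OverflowError("timestamp does not fit in an unsigned 32-bit integer")
    return seconds


# Last instant representable as unsigned 32-bit seconds.
last_uint32 = datetime.fromtimestamp(UINT32_MAX, tz=timezone.utc)

# Two modifications 1 ms apart collapse to the same low-precision value.
a = 1_565_222_400_000_000_000  # a nanosecond timestamp in August 2019
b = a + 1_000_000
same_second = to_uint32_seconds(a) == to_uint32_seconds(b)
```

Under these assumptions the 32-bit form runs out in the year 2106, cannot represent times before 1970, and cannot tell apart two writes inside the same second.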

Do we want to store additional metadata of the directory? How do we handle updating this when someone updates only a single file in the directory?

Where do we store this metadata?

In terms of the data format, should these properties be added to the File message or the Data message?

History

The history of this feature as well as meeting notes where this feature was prioritized are available here.

@lidel
Member

lidel commented Aug 12, 2019

Could this also include support for opt-in setting of content type?

The spec @ 12a3d57 already has a field for this (but it does not seem to be wired to anything):

message Metadata {
	optional string MimeType = 1;
}

This would enable people to solve false-positives in content-type sniffing before v2 lands (ipld/legacy-unixfs-v2#11)

@mikeal
Author

mikeal commented Aug 12, 2019

Can someone more familiar with the original spec and implementations explain how the Metadata message is currently used? It seems obvious that we should leverage it and also start using the MimeType field but without knowing a bit more about the history and current usage I can’t tell if we’re likely to break anything.

@warpfork
Member

I'm not sure what to make of the MimeType idea. Unixy filesystems don't have a concept of MimeType; that's much higher level.

It certainly seems like a sizable embiggening of scope from the bullet points at the top.

@mib-kd743naq

Can someone more familiar with the original spec and implementations explain how the Metadata message is currently used?

As far as I know it was never implemented in either go-ipfs or js-ipfs. The way it was supposed to work is somewhat described here. I would again strongly advise steering clear of this construct: wrapper-blocks carrying metadata are not... great.

@ivan386

ivan386 commented Aug 12, 2019

@mib-kd743naq right now a metadata block can be inlined in a directory block by using an identity hash, and a file block can be inlined in a metadata block in the same way.
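For readers unfamiliar with the trick: a CID built with the identity multihash (code 0x00) carries the block bytes in place of a digest, so the "linked" block travels inside its parent. A minimal Python sketch of building such a CIDv1 by hand is below; it assumes the dag-pb codec (0x70) and small payloads, uses hypothetical helper names, and does not rely on any IPFS library.

```python
def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


IDENTITY = 0x00  # multihash code: the "digest" is the data itself
DAG_PB = 0x70    # multicodec code for dag-pb


def inline_cid(block: bytes) -> bytes:
    """Binary CIDv1 whose multihash embeds `block` verbatim."""
    multihash = varint(IDENTITY) + varint(len(block)) + block
    return varint(1) + varint(DAG_PB) + multihash


def extract_inline_block(cid: bytes) -> bytes:
    """Recover the block from an inline CID (payloads under 128 bytes only)."""
    version, codec, hash_code, length = cid[0], cid[1], cid[2], cid[3]
    if (version, codec, hash_code) != (1, DAG_PB, IDENTITY):
        raise ValueError("not an identity-hashed dag-pb CIDv1")
    return cid[4:4 + length]
```

The parser only handles single-byte length prefixes, which keeps the sketch short; a real decoder would read each field as a varint.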

@mikeal
Author

mikeal commented Aug 12, 2019

I’d like to surface these tradeoffs so that the folks with use cases driving the need for it can comment appropriately.

Using the Metadata message will:

  • Increase the graph depth by 1 for every file, and also by 1 for any directory we add metadata to.
  • Increase the amount of de-duplication we can do of the main file metadata object (not the actual file data, since that’s de-duplicated either way).

Adding the metadata to the file/dir object itself will:

  • Duplicate this file/dir object for two files that are effectively the same but have different metadata. Again, the actual file data is still de-duplicated either way.
  • Avoid increasing the graph depth.
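The two options can be compared with a toy content-addressed store. The sketch below is an illustration only: it uses SHA-256 over JSON as a stand-in for real CIDs and dag-pb, and the `put`, `file_node`, `wrapper_layout` and `inline_layout` helpers are hypothetical.

```python
import hashlib
import json
from typing import Optional


def put(store: dict, node: dict) -> str:
    """Store a node under the hash of its canonical JSON; return that hash."""
    raw = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(raw).hexdigest()
    store[key] = node
    return key


def file_node(chunks: list, meta: Optional[dict] = None) -> dict:
    """File object linking to data chunks, optionally carrying metadata."""
    node = {"type": "file", "links": chunks}
    if meta is not None:
        node["meta"] = meta
    return node


def wrapper_layout(store: dict, chunks: list, meta: dict) -> str:
    """Metadata block -> file object -> chunks (one extra level)."""
    file_key = put(store, file_node(chunks))
    return put(store, {"type": "metadata", "meta": meta, "links": [file_key]})


def inline_layout(store: dict, chunks: list, meta: dict) -> str:
    """File object carries its own metadata -> chunks."""
    return put(store, file_node(chunks, meta))


# The same file content added twice with different mtimes.
wrapped, inlined = {}, {}
chunk = hashlib.sha256(b"same bytes").hexdigest()  # stand-in for a shared data chunk
for mtime in (1000, 2000):
    wrapper_layout(wrapped, [chunk], {"mtime": mtime})
    inline_layout(inlined, [chunk], {"mtime": mtime})
```

In this toy model the wrapper layout ends up with three blocks (one shared file object plus two metadata blocks) at one extra level of depth, while the inline layout ends up with two distinct file objects and no extra hop. The data chunk is shared in both cases.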

I’d like people closer to the use cases to weigh in on which of these they find most compelling. @andrew @alanshaw @achingbrain

@mib-kd743naq

@mikeal you are missing option 3 though: "metadata is part of the 'directory' entry"

@mikeal
Author

mikeal commented Aug 12, 2019

@mib-kd743naq I updated my comment to be “file/dir” in the case of directory metadata. If there is another option you’re suggesting we explore where the metadata for every file in a directory is added to the directory entry we’ll need to discuss that a bit more before I add it because that sounds quite problematic when we start dealing with sharded directories :(

@ianopolous
Member

We have a mime type field in file metadata in Peergos, so can relate our experiences. There are two things useful to be aware of.

  1. a file can have multiple mime types depending on the context
  2. some mime types can't be deduced until the entire file has been read

@mib-kd743naq

@mikeal words are hard... instead next week during chaos camp I will attempt to build a PoC similar to my last large scale stress test, but this time for various types of metadata embedded in backwards-compatible-ish variants of dag-pb.

Then a concrete discussion based on actual blocks can be had.

@achingbrain
Member

I’d like people closer to the use cases to weigh in on which of these they find most compelling

Expanding the fields in the UnixFS data type seems like the most sensible path as adding extra nodes for each and every file will become expensive for very large file systems (package manager datasets, for example).

Two files with different metadata will have different root nodes, but I think this is fine as the file data is still de-duped across the two and fundamentally the metadata has to be stored somewhere. If we can do that without causing another network/disk/blockstore trip then great.

@Stebalien
Member

@warpfork the mime-type is there because users sometimes want to explicitly specify the MIME type. Unfortunately, this can be very important when using ipfs with a gateway. The alternative of just encoding this in a separate file and having the gateway interpret it was also discussed.


Some history: the metadata block was supposed to be used as follows:

{
  Data: {
    Type: Metadata,
    // stuff...
  },
  Links: [{Cid: ActualFile}]
}

Unfortunately, doing it this way would be a slightly breaking change (for users of this feature). Inlining metadata directly into files would not.


As @mib-kd743naq points out, we could also inline into directories. This also gives us fast LS (which is currently a bit annoying). We could even add file types to directories (the repeated information shouldn't be an issue).

The primary problem with this is that resolving to a CID and then copying wouldn't carry the metadata.

On the other hand, this isn't unreasonable. Names are already a part of the directory. Making metadata a part of the directory isn't all that odd. I'd expect most tools to reference files relative to directories anyways.
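A small Python sketch of that tradeoff, with metadata stored on the directory entry next to the name. The `Entry` shape and the `ls` and `resolve` helpers are hypothetical and only illustrate the idea; they do not reflect the dag-pb link format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entry:
    cid: str    # points at the file; the file block itself holds no metadata
    mode: int
    mtime: int


Directory = Dict[str, Entry]


def ls(directory: Directory) -> List[Tuple[str, int, int]]:
    """List names with metadata without fetching any file block."""
    return [(name, e.mode, e.mtime) for name, e in sorted(directory.items())]


def resolve(directory: Directory, name: str) -> str:
    """Resolve a path segment to a bare CID; the entry's metadata stays behind."""
    return directory[name].cid


src: Directory = {
    "build.sh": Entry(cid="cid-of-build-sh", mode=0o755, mtime=1565222400),
}
copied_cid = resolve(src, "build.sh")
```

Listing needs only the directory block, but anything that copies `copied_cid` into another directory has to supply `mode` and `mtime` again, because the CID alone no longer identifies them.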


A hacky alternative is to:

  1. Embed the metadata into an intermediate block with the type "file".
  2. Don't stick the actual data into this block. Instead, force it out into a second block.

This matches the original design without breaking anything.

@warpfork
Member

warpfork commented Aug 30, 2019

+1 towards the idea that if MIME type is getting well-known support, it should be something the gateway knows about, rather than a feature of the filesystem. This would be a much closer set of relationships to how the rest of the world works already (e.g. doing sysadmin today with nginx or something, I would generally configure MIME types at the web server level, and not in filesystem metadata) -- and thus seems much less likely to go awry.

Carefully avoiding baking in the idea of a single "mimetype string" field into our filesystem metadata also leaves much more room for issues to evolve around the things Ian mentioned:

  1. a file can have multiple mime types depending on the context
  2. some mime types can't be deduced until the entire file has been read

@mikeal
Author

mikeal commented Aug 30, 2019

PR is up now at #220

Note that I used uint32 for all the time data. In UnixFSv2 we’re considering properties for 64-bit high-precision times, but since uint32 is what most people expect, I figured that was appropriate when adding these to UnixFSv1.

@warpfork
Member

warpfork commented Sep 2, 2019

(I just commented this on the PR, but posting again here for discoverability for anyone who didn't follow the jump to the PR...)

I'd like to just mention a couple links to prior art that's not merely prior art, but also particularly easy to read and review for inspirations:

Both of these (as well as the specs of tar, which I'm assuming everyone's at least given a cursory glance at already) are highly worth a quick skim just to see what other people have covered when trying to map this terrain.

There are large (large) bodies of thought on this out there already, and while we may or may not choose to do some things differently, we should make sure we're doing that on purpose. We'll be doing ourselves a sizable disservice if we add new features that unintentionally strike too far outside the norm by sheer accident of not having checked where the norm is.

@achingbrain
Member

I think this can be closed now - we've added mtime and mode to UnixFSv1, additional fields and arbitrary metadata will probably wait for UnixFSv2.
