This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Supporting existing mime types. #65

Closed
mikeal opened this issue Jun 23, 2018 · 15 comments

Comments

@mikeal
Contributor

mikeal commented Jun 23, 2018

I wanted to start a conversation about the best way to support existing mime types.

Specifically, I want to talk about data that doesn't have links but is often linked to, like images and video. It would be great not to re-invent the entire mime/content-type system for data without links.

Something along the lines of mime[audio/aac].

We may also want to consider the same approach for describing compression of the format, e.g. mime[audio/aac][gzip].
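
As a rough illustration of how a name in this bracket style could be taken apart, here is a minimal sketch. The syntax is only the one floated in this comment, and `parseMimeCodec` is a hypothetical helper, not part of any multicodec specification.

```javascript
// Hypothetical parser for names such as "mime[audio/aac][gzip]".
// Returns null when the name does not follow the bracket notation.
function parseMimeCodec(name) {
  const match = /^mime((?:\[[^\[\]]+\])+)$/.exec(name);
  if (!match) return null;
  // Turn "[audio/aac][gzip]" into ["audio/aac", "gzip"].
  const parts = match[1].slice(1, -1).split("][");
  const [mediaType, ...encodings] = parts;
  return { mediaType, encodings };
}
```

The first bracket is read as the media type and any further brackets as encodings applied on top of it, which is one possible reading of the proposal.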

I looked around for a previous discussion around this but couldn't find anything. If there's another thread please point me at it :)

@vmx
Member

vmx commented Jun 25, 2018

Do you mean leveraging existing mime-types to describe the blocks (e.g. if you store a JPEG), so that the resolvers can correctly deal with the data?

@mikeal
Contributor Author

mikeal commented Jun 25, 2018 via email

@vmx
Member

vmx commented Jun 25, 2018

I looked into it some time ago (I can't remember why, I guess there was some other issue triggering it). I was wondering if there was an easy way to get unique identifiers (as in "hex value") from the IANA Media Types. The only idea I had was scraping the Templates, getting a date from them, and then assigning an increasing value ourselves.

@mitra42

mitra42 commented Jun 25, 2018

Why reinvent the wheel when there is an existing, extensible process for assigning them? It has top-level types (image, video, etc.), subtypes beneath them, and, when needed, parameters that allow even more detail. It's not perfect for all situations, but it's unlikely that any replacement would be perfect either, and it has the huge advantage that it integrates with other things: for example, you can check which application your system wants to open the file in.

If you invented your own system everyone would just have to carry around a big conversion table in their apps and figure out how to continuously update it to match a new hex type to the table.

@mikeal
Contributor Author

mikeal commented Jun 25, 2018 via email

@Stebalien
Contributor

Note: The "codec" on the CID just tells you how to interpret the binary data as a structured IPLD object. It should not be used as a MIME type.

@vmx
Member

vmx commented Jun 27, 2018

@Stebalien Isn't there a huge overlap between MIME types and codecs? For me it makes sense to have a codec that tells me to interpret something as image/png.

@mikeal
Contributor Author

mikeal commented Jun 27, 2018

If you look at the existing list of multicodecs, many of them already have registered mime types, so there's certainly overlap.

If the only codecs were for dag nodes and all edge nodes were raw, then I could see the separation, but that just isn't the case right now: there are many registered codecs for edge nodes in formats that don't support links.

Here's a question that might shed some light on how to interpret this. If I'm building a fileserver on top of unixfs-v2 and the file name has an extension of .json but the CID has a codec of bson, what do I set the content-type header to?

To me, someone clearly encoded the node into bson and just set the wrong file extension, so I would trust the CID's codec for interpretation.
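
A minimal sketch of that "trust the codec first" rule follows. The lookup tables are illustrative assumptions, not an official multicodec-to-MIME registry, and `contentTypeFor` is a hypothetical helper.

```javascript
// Assumed mapping from codec names to media types, for illustration only.
const CODEC_MEDIA_TYPES = new Map([
  ["json", "application/json"],
  ["cbor", "application/cbor"],
  ["bson", "application/bson"], // commonly used name; treated as an assumption here
]);

// Assumed fallback mapping from file extensions to media types.
const EXTENSION_MEDIA_TYPES = new Map([
  [".json", "application/json"],
  [".png", "image/png"],
]);

// Prefer the codec recorded in the CID, then the file extension,
// then a generic binary type.
function contentTypeFor(codecName, fileName) {
  if (CODEC_MEDIA_TYPES.has(codecName)) return CODEC_MEDIA_TYPES.get(codecName);
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot).toLowerCase();
  return EXTENSION_MEDIA_TYPES.get(ext) || "application/octet-stream";
}
```

With these tables, a file named `data.json` whose CID carries the `bson` codec would be served as `application/bson`, matching the reasoning in this comment.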

I'll also note that projects like the IPLD graph viewer get much more interesting if we can signal mime-types for any edge node in a graph. It means that even the most abstract graphs people create that include images and other content can be interpreted and viewed much more easily.

@Stebalien
Contributor

> Isn't there a huge overlap between MIME types and codecs?

...

> If you look at the existing list of multicodecs many of them already have registered mime types, so there's certainly overlap.

...

> If the only codecs were for dag nodes and all edge nodes were raw then I could see the separation, but that just isn't the case right now, there are many registered codecs for edge nodes in formats that don't support links.

Yes. However, those aren't all IPLD formats.


So, the issue here is twofold:

  1. We don't want to tie binary representation to interpretation.
  2. We don't want to have to create a new IPLD format every time someone implements a new filetype. With normal MIME types, I can talk about some data without actually understanding the MIME type. With IPLD formats, I can't even talk about the data. If you tell me to pin some IPLD DAG that has nodes in a format I don't understand, I literally can't pin it because I have no idea how to find/follow the internal links.

Really, we want a type system in addition to IPLD formats. However, IPLD formats are not a type system. The important difference is that, as long as a tool understands all the relevant IPLD formats, it can traverse/transform arbitrary IPLD DAGs even if it doesn't understand the types.
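
To make the pinning problem concrete, here is a minimal sketch of a traversal that depends on knowing every format it meets. The block shape (`{ codec, links }`), the in-memory `store`, and the `linkExtractors` registry are all simplifying assumptions; real IPLD formats decode binary blocks.

```javascript
// Hypothetical registry: codec name -> function returning the ids a block links to.
const linkExtractors = new Map([
  ["raw", () => []],                         // raw bytes never contain links
  ["dag-json", (block) => block.links || []], // assumed block shape: { codec, links }
]);

// Walk the DAG from rootId and return the set of every reachable block id.
// `store` is a Map of id -> block.
function pin(store, rootId) {
  const pinned = new Set();
  const queue = [rootId];
  while (queue.length > 0) {
    const id = queue.pop();
    if (pinned.has(id)) continue;
    const block = store.get(id);
    if (!block) throw new Error(`missing block: ${id}`);
    const extract = linkExtractors.get(block.codec);
    // Without an extractor there is no way to find the block's children.
    if (!extract) throw new Error(`cannot follow links in unknown format: ${block.codec}`);
    pinned.add(id);
    queue.push(...extract(block));
  }
  return pinned;
}
```

The traversal never needs to know what the data means, only how to find links, which is the distinction between formats and types drawn above.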

Our current plan is to:

  1. Extract type information from existing IPLD datastructures.
  2. Allow users to explicitly specify types in future IPLD datastructures (requires support from the format).

Aside: Yes, I know we have a GitRaw codec. IMO, we shouldn't. I don't know how that snuck in but that shouldn't be there. However, it is slightly useful because raw git objects are a bit special (they use the broken SHA1 hash and may be arbitrarily large).

@vmx
Member

vmx commented Jun 28, 2018

> Aside: Yes, I know we have a GitRaw codec. IMO, we shouldn't. I don't know how that snuck in but that shouldn't be there. However, it is slightly useful because raw git objects are a bit special (they use the broken SHA1 hash and may be arbitrarily large).

Do you mean there should be a codec for each Git Object type (commit, tag, tree)? If yes, why not change it while we can?

@Stebalien
Contributor

> Do you mean there should be a codec for each Git Object type (commit, tag, tree)? If yes, why not change it while we can?

No, no, I'm just confused. I saw GitRaw and assumed that only applied to blobs. Turns out GitRaw just means "non-blob git object" and blobs are stored using the Raw codec. This is correct and as it should be.


Now, really, there probably is a large overlap. Most files have some logical internal structure that could be decoded as a structured IPLD object. However, we have to be careful about adding new formats too eagerly as, again, we need to add support for those formats to every implementation.

@jchris

jchris commented Sep 9, 2018

One cool aspect of mime in the browser world is content negotiation. I don’t see mime in the multiformats project, but I’m still finding my way around. At first glance IPLD seems like a reasonable place to empower user agents to pick content types, especially since different files with the same content might only be linked at the application layer otherwise. Sniffing is fine for serving files when only one format is available, but that path doesn’t lead toward robust content negotiation.

@mikeal
Contributor Author

mikeal commented Sep 10, 2018

At the CID/Block level we can't really negotiate the content because we can't change out the underlying data. If I have a dag-json node you could interpret the same data as json, dag-json (JSON with links), and raw (binary). But you couldn't ask for a different content type because it would end up being a different hash.
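
The point about hashes can be sketched with a toy identifier. `toyCid` below is a deliberately simplified stand-in (codec name plus SHA-256 hex digest); real CIDs use multicodec and multihash prefixes, so treat the shape as an assumption.

```javascript
const { createHash } = require("crypto");

// Toy identifier: "<codec>:<sha256 hex of the bytes>". Not a real CID.
function toyCid(codec, bytes) {
  return `${codec}:${createHash("sha256").update(bytes).digest("hex")}`;
}

// Extract the digest part of a toy identifier.
function digestOf(cid) {
  return cid.split(":")[1];
}

const jsonBytes = Buffer.from('{"hello":"world"}');
// Same bytes under two interpretations: only the codec label differs.
const asDagJson = toyCid("dag-json", jsonBytes);
const asRaw = toyCid("raw", jsonBytes);
// The same information in another serialization is different bytes,
// so it necessarily gets a different digest.
const asOtherFormat = toyCid("raw", Buffer.from("hello: world\n"));
```

Reinterpreting a block changes only the label, while asking for a different serialization means addressing different bytes altogether.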

Also, keep in mind that a CID is rarely an entire file. Files are written as a metadata node (dag-json, dag-cbor, dag-pb) with a bunch of links to the chunks of binary data for the actual file data. If we wanted to enable some kind of content negotiation we would need to encode it at that layer.

The current format doesn't support it, but it might be worth creating an issue in the unixfs-v2 spec. ipld/legacy-unixfs-v2#2

You'd need something like:

{
  type: 'dir',
  data: {
    'filename;image/png': CID(),
    'filename;image/svg+xml': CID()
  }
}

Or, alternatively, you could just use file extensions and a naming convention to do multiple formats of resources in a single directory and then write logic on top in order to pick which one is supported by the client.
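
As a sketch of the logic that could sit on top of the "name;media/type" directory layout, the following picks a variant from a client's preference list. The CID strings are placeholders and `pickVariant` is a hypothetical helper; nothing here is part of the unixfs-v2 spec.

```javascript
// Directory entries keyed as "name;media/type". CIDs are placeholder strings.
const dir = {
  type: "dir",
  data: {
    "logo;image/png": "cid-of-png",
    "logo;image/svg+xml": "cid-of-svg",
  },
};

// Return the first variant of `name` whose media type appears in `accepted`,
// an array of media types in the client's order of preference.
function pickVariant(directory, name, accepted) {
  for (const mediaType of accepted) {
    const cid = directory.data[`${name};${mediaType}`];
    if (cid !== undefined) return { mediaType, cid };
  }
  return null;
}
```

The negotiation happens entirely at the directory layer, so each variant keeps its own CID and no block ever has to change.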

@da2x

da2x commented Oct 4, 2018

Media types (formerly MIME types) can contain more information than the suggested format allows.

type "/" [tree "."] subtype ["+" suffix] *[";" parameter]

Some examples:

text/plain; charset=utf-16 (UTF-16 encoded text)

application/atom+xml (XML structured Atom document)

text/csv; header=present (comma separated values; first line contains column headers)
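
To show how much structure a full media type carries, here is a minimal parser sketch for the grammar quoted above. It is an illustration only: `parseMediaType` is a hypothetical helper that does not handle quoted parameter values or validate token characters.

```javascript
// Parse: type "/" [tree "."] subtype ["+" suffix] *[";" parameter]
// Returns null when the type/subtype part is malformed.
function parseMediaType(input) {
  const [essence, ...rawParams] = input.split(";").map((s) => s.trim());
  const slash = essence.indexOf("/");
  if (slash <= 0 || slash === essence.length - 1) return null;
  const type = essence.slice(0, slash).toLowerCase();
  let subtype = essence.slice(slash + 1).toLowerCase();

  // Optional structured-syntax suffix, e.g. "+xml".
  let suffix = null;
  const plus = subtype.lastIndexOf("+");
  if (plus !== -1) {
    suffix = subtype.slice(plus + 1);
    subtype = subtype.slice(0, plus);
  }

  // Optional registration tree, e.g. "vnd." in "vnd.ms-excel".
  let tree = null;
  const dot = subtype.indexOf(".");
  if (dot !== -1) {
    tree = subtype.slice(0, dot);
    subtype = subtype.slice(dot + 1);
  }

  // Parameters: "name=value" pairs; a bare name maps to an empty string.
  const parameters = {};
  for (const param of rawParams) {
    if (param === "") continue;
    const eq = param.indexOf("=");
    if (eq === -1) parameters[param.toLowerCase()] = "";
    else parameters[param.slice(0, eq).trim().toLowerCase()] = param.slice(eq + 1).trim();
  }
  return { type, tree, subtype, suffix, parameters };
}
```

A notation like mime[audio/aac] would need an escaping rule to carry the parameter part, which is the gap this comment points at.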

@jonnycrunch

Just adding a link to the verifiable credentials discussion: w3c/vc-data-model#421

@mikeal mikeal closed this as completed Aug 12, 2019