IPIP-49: CIDv2 “fat pointers” #49
Leaving a comment to state my support for this. The project I'm part of (IPNS-Link: https://github.com/ipns-link/ipns-link) could definitely benefit from this, especially the fact that IPFS Gateways can shorten the CIDs for subdomains. I had originally written https://github.com/Winterhuman/ipns-link-alt/wiki/IPNS-Link-V2.1 as a way to use half the digest for a 128 bit hash, and then the rest for encoding the But with this, the context CID can inline the |
Shouldn't it get a
(with no Or .. are you expecting all our decoders to read a CID and then optionally expect to possibly read a second one straight after it? I can imagine places where that would work (I think it would come cleanly through dag-cbor, although backward compatibility would be a problem). But other places where it probably wouldn't (dag-json might be a problem? CAR sections would certainly be a problem). If we have a
|
I don't know what conversation led here, but this doesn't look like a good design. Throwing more "who knows what this means!" metadata at a problem just weakens CIDs having a clear semantics distinct from specific implementations. This seems like a classic case of https://wiki.c2.com/?OneMoreLevelOfIndirection. |
@rvagg it’s “any CID of any [current or future] version.” Thinking of it like a codec, it always decodes to a tuple of two CID’s. Could be v0, v1, or v2, but as far as the codec is concerned it just returns two Link value types. So ya, you can nest them indefinitely, but if you think about it long enough you’ll realize you could do the same thing today with identity CID’s :) |
That’s exactly the opposite of what is happening here. Today, there’s a bunch of “who knows what this means” data in the network, the context for which exists only in the applications reading and writing the data. That context, currently, does not live in the network and the data lacks sufficient self-description. CIDv2 gives us the ability to write that context into the link layer with existing and future multiformat protocols. The reason there isn’t a hyper-specific definition of exactly what “context” means is because it’s meant to extend to encompass the totality of all applications. |
ya, there’s no getting around that if we want to upgrade though. at least there’s the CIDv1 encoding that applications can use if they need to make sure they work with systems that have older parsers.
By the time you’re in the multihash you’ve got a length, so even with just jamming the CID’s next to each other you’ve never gotta parse more than a few bytes before you’ve got the end of the first CID. So I don’t see a super compelling reason to put the length a few bytes earlier, but maybe there’s a good reason to have a static guarantee of which byte the length is at? Having a couple varints in front of it will make it vary.
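To make the parsing argument above concrete, here is a minimal sketch (not part of the proposal) of splitting two binary CIDv1s that have simply been placed back to back. It assumes both halves are CIDv1 and uses only the standard library; the function names `read_varint`, `cidv1_len`, and `split_pair` are hypothetical and chosen for illustration.

```rust
/// Decode an unsigned varint (LEB128, as used by multiformats).
/// Returns the value and the number of bytes consumed, or None if the
/// input ends early or the varint is longer than 9 bytes.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate().take(9) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Length in bytes of the binary CIDv1 at the start of `buf`.
/// Layout: <version><codec><hash-code><digest-length><digest>.
fn cidv1_len(buf: &[u8]) -> Option<usize> {
    let (version, mut pos) = read_varint(buf)?;
    if version != 1 {
        return None; // assumption: this sketch only handles CIDv1
    }
    let (_codec, n) = read_varint(&buf[pos..])?;
    pos += n;
    let (_hash_code, n) = read_varint(&buf[pos..])?;
    pos += n;
    // The multihash length prefix tells us where the first CID ends.
    let (digest_len, n) = read_varint(&buf[pos..])?;
    pos += n;
    let end = pos.checked_add(usize::try_from(digest_len).ok()?)?;
    if end <= buf.len() { Some(end) } else { None }
}

/// Split `<data-cid><context-cid>` into its two halves.
/// Returns None unless the buffer holds exactly two CIDv1s.
fn split_pair(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    let first = cidv1_len(buf)?;
    let (data, rest) = buf.split_at(first);
    let second = cidv1_len(rest)?;
    if second == rest.len() { Some((data, rest)) } else { None }
}
```

The end of the first CID is known after reading four varints, which is the "few bytes" referred to above; the position of the length byte still varies with the size of the codec and hash-code varints.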
I don’t think so at all. There’s going to be a block limit you’re inside of when you encode them most of the time, and you’re free to break them up into separate blocks in order to control the size of any nesting. That’s the beauty of this whole approach, encoders have a lot of flexibility in how they choose to encode the pointers, and you kinda need that because fat pointers are, by definition “fatter” than average. You won’t find a single encoding that solves all use cases, so having the flexibility of encoding as a separate block lets the encoders get whatever they need. |
What about constructing CIDv2 like this: Example CIDv1:
Alternate CIDv2:
Not saying my idea is any good, just putting this out there to continue the discussion. |
This spec proposal is IMO missing a few things that would make it useful to reason about, and they have nothing to do with the technical aspects here. While AFAIK the multiformats repo does not have as formalized a spec proposal process as others in our ecosystem (e.g. off the top of my head IPFS, libp2p, and Filecoin), it's still important to run through some process here. In fact it's probably more necessary here since part of the reason there's no specs process in multiformats is that the specs have largely not changed in years. Some things that IMO are missing as illustrated by the comments and questions above mine:
The linked spec processes above may give some more insight as to other details to add here. I have some thoughts on the proposal, but I suspect my comments will be more helpful once some of the above is written up. Otherwise I'm trying to make guesses without sufficient context - which is rarely a good idea 😅 |
Nothing being proposed here in CIDv1 is outside of what CIDv1 already does. I think what is missing is a more straightforward definition of “CIDv2 as a CIDv1 codec.”
We’ve been discussing “fat pointers”/CIDv2 for years, some of those discussions exist in issues and PR’s, many don’t. I’m not going to go and dig all of those up just to go back through the process of shooting them down in a new forum. If people want to bring up alternatives I’m happy to discuss them, but most of the other approaches we’ve looked at were focused on addressing more specific features like “i want a link with a path” and “i want a link that is both an immutable address (CIDv1) and a reference to a mutable pointer” all of which are accomplishable within this approach but I really don’t want to start proposing what those will look like for fear of being pulled into an endless bikeshed.
Is there any reason why we shouldn’t just use the IPIP process? I’m happy to agree on this there and then come back to this repo for an update after it’s agreed upon. The lack of formal governance in multiformats makes me pretty hesitant to invest a lot of time in formalizing any of these arguments as there is no mechanism for resolving disputes or calling the discussion to a close. |
Our use case is “all the context our services build about the data we receive and transport.” Knowing that “this block is a UCAN capability” and “this block is a UCAN invocation” is entirely based on signaling we do outside the data. Since this signal is invisible to the link layer, none of what we put into the network can be properly leveraged by other actors in the network because the data lacks sufficient self-description for it to be useful in-and-of itself. In order to produce cross compatible applications we have to build additional protocols in the transport or discovery layer, which means we don’t get the “open innovation” we all want to see. This is not a blocker for us building a service, it’s a blocker for building a real ecosystem around what we do. Frankly, it’s a little frustrating after all these years that people don’t consider this a serious problem, and the response when it’s brought up is usually a lot of moaning about the work we’ll have to do in upgrading some of our tools. This is a substantial barrier to realizing growth of the network. When data lack self-description, applications and services become the arbiters of how the data in the network can be leveraged, and this represents a substantial barrier to the network effects you would expect to realize in a network of publicly addressable data. |
Would it be possible to add some examples of what this looks like with existing codecs like dag-json or dag-cbor? As well, would it be possible to have examples of how this could change IPLD schemas. https://ipld.io/docs/schemas/ I think it would be cool to make sure we could include expected data in fat pointers when defining a schema. |
I really like this! This could lead to some useful specs built on top for better interoperability between some existing systems like wnfs. |
Is there any reason why we shouldn’t just use the IPIP process?
I’m happy to agree on this there and then come back to this repo for an update after it’s agreed upon.
Agree, following the IPIP process will be the best. Filed #51 to write this down as a policy for this repo.
For instance, IPFS HTTP Gateways redirect to CID based subdomains which introduced a byte limit on the size
of the link. In this case, IPFS HTTP Gateways would create a single block for the CIDv2 link
with a 256b multihash encoded into CIDv1 for any redirect subdomain.
Just flagging this may not be the best example.
When subdomain gateways were created, we chose to not do this block generation because asking Gateways to create blocks on the fly is a can of worms (complexity, link rot):
- what happens when I copy the CID created by the gateway and share it with someone? who is providing the root block now?
- are gateways expected to cache/pin these artificial blocks and provide them to the network?
- or is it to every IPFS client to double-publish both CIDs on the DHT?
I.e. we're going to be right back where we started! Data that we don't know what it means, because we don't know what the metadata CID means! Will we need a meta-metadata CID too? |
CIDv2 is, quite literally, two CIDs.

```
<cidv2> ::= <data-cid><context-cid>
```
I hope with cid you mean how CIDs are used today and not what the spec says, i.e. these won't contain a multibase prefix.
I agree with @rvagg here. I think it should be a CIDv2, I think that would make implementations easier. If you don't want to fully support CIDv2, you could just add the
I think I lean towards what @rvagg suggested and making the context CID a v1 only. So that you don't have nesting. Unlimited nesting sounds like a big can of worms, especially thinking about codecs that encode CIDs. |
Just wanted to chime in to support this proposal on behalf of the Yatima team! Our work with Lurk-Lang would hugely benefit from (and in some ways requires) having the ability to add additional context or metadata to CIDs. One concept we played around with a couple of months ago to do this was to remove the length limit on the multicodec field, which we did a short write-up on when designing Lurk's IPLD content-addressing. I definitely prefer the proposal here of using a tuple of CIDs though, since it seems a more minimal/compatible change with how CIDs are used. I also think @Ericson2314's comment about not creating recursive nesting CIDs is really important, and any "fat-pointer" CIDv2 proposal should be as constrained as possible, while still achieving the goal of allowing for more expressive metadata beyond just the multicodec in a CID. Here's my interpretation in Rust of how I understood @rvagg's concept of CidV2 as a pair of CidV1s:

pub struct CidV2<const S: usize> {
    /// The data multicodec.
    data_codec: u64,
    /// The data multihash.
    data_hash: Multihash<S>,
    /// The metadata multicodec.
    meta_codec: u64,
    /// The metadata multihash.
    meta_hash: Multihash<S>,
}

This would serialize as:
(with the prepended multibase prefix when represented in text) As an example, suppose you wanted a CID which pointed to an IPLD data structure and its IPLD schema. Let's say you have the schema
which corresponds to the Ipld data: You could in principle propose a new multicodec for Trit, but this might not be suitable if Trit is a temporary or ephemeral structure, or if you have a large number of different schemas. (For instance, in Lurk-lang's content-addressing we would need to reserve 16 bits of the multicodec table, or 2^16 distinct multicodecs.) However, since IPLD schemas can be represented as JSON (https://ipld.io/specs/schemas/#dsl-vs-dmt) and hashed, with a CIDv2 we could reserve a single
We could then use the above
And thus we could then create an unambiguous hash to

CidV2 {
    data_codec: 0x71,
    data_hash: Ipld::Num(1).hash(),
    meta_hash: trit_schema.hash(),
    meta_codec: 0x3e7ada7a,
}

without having to reserve anything new on the multicodec table.

For backwards compatibility, CidV2's could be embedded inside CidV1s by using the cidv2 codec and the identity multihash: CidV1 {codec: 0x02, hash: <identity-multihash-of-cidv2-serialization> } (In fact you can already nest CidV1's inside themselves with the identity multihash in this way.)

Would love to hear feedback on whether the above idea seems like a reasonable direction to go in. I and the Yatima team would absolutely love to collaborate on this proposal, whether that's working on an IPIP, writing a Rust implementation, etc. |
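As a rough illustration of the backwards-compatible embedding described above, the sketch below wraps an opaque payload in a binary CIDv1 whose multihash is the identity hash (code 0x00). The 0x02 codec value is the placeholder used in the comment, not a registered multicodec, and the payload stands in for whatever the CidV2 serialization ends up being. The helper names `write_varint` and `wrap_in_cidv1` are hypothetical.

```rust
/// Append `value` to `out` as an unsigned varint (LEB128).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        // Low 7 bits with the continuation bit set.
        out.push(((value as u8) & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Build a binary CIDv1 that carries `payload` inline.
/// Layout: <version = 1><codec><identity = 0x00><length><payload>.
fn wrap_in_cidv1(codec: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 8);
    write_varint(1, &mut out); // CID version
    write_varint(codec, &mut out); // content codec (0x02 is an assumption)
    write_varint(0x00, &mut out); // identity multihash code
    write_varint(payload.len() as u64, &mut out); // "digest" length
    out.extend_from_slice(payload); // the digest is the payload itself
    out
}
```

For example, `wrap_in_cidv1(0x02, &cidv2_bytes)` would give older parsers something they can read as an ordinary CIDv1, at the cost of the link growing by the few prefix bytes.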
I very much support the idea of implementing some kind of 'fat pointer'. Are there potential vulnerabilities associated with expressing the relation between content and metadata as a tuple which is not, in itself, hashed? I think a lot of people have been toying with similar ideas; my idea was similar but the metadata hash was prepended to the content when hashing it. I'm no cryptographer, so it's quite possible that this is unnecessary, but it seemed like a sensible precaution at the time. |
There's good discussion happening here. Given the push to have this follow the IPIP process (see FAQ), can someone create the IPIP so the comments happen there? |
Thanks @johnchandlerburnham for taking this further with ipfs/specs#305, we should probably move most discussion over there so we can get specific about it. |
UnixFS has a metadata block type. If you need to have metadata in a CID, then just inline that block in it. |
Except that the ask is for a link and metadata, not just metadata. Although that is certainly a valid way to encode metadata when you have a place to put it—it's just that unixfs is a format on top of a format (dag-pb) so maybe not the most optimal form? |
The metadata block contains a link. Example of an identity link with metadata. |
I've spent a lot of time talking about CIDv2 at IPFSCamp/LabWeek in Lisbon. I have now found the time to write things down a bit. The result is at https://hackmd.io/@vmx/SygxnMmso (it still needs work). What I realized after writing it down is that my proposal is basically what @mikeal originally suggested: just having two CIDs, one for the context, one for the content. The difference from this PR is that it really is that tuple and not a CIDv2. The reason is that this way, CIDs won't change and the IPLD Data Model won't change either. This is a huge win; this way "fat pointers" (or what I call "Application Context") is a layer on top of that. Nonetheless I still have one more idea to write down that floated around; once done I'll link it from the HackMD mentioned above. I'm closing this issue as I'm convinced (after talking with so many folks about this) that it should not be a CIDv2, but something built on top of those primitives. |
After numerous discussions at IPFS Thing I decided it’s time to pull the trigger on CIDv2.
This isn’t the only PR we’ll need to do, but this should serve as a way to resolve any objections or concerns.
We (DAG House) have a pressing need for these in the short term and will be implementing them rather quickly.
We’ve floated a lot of different solutions to this problem and the one that everyone seems to disagree the least on is the simplest, which is what I’ve proposed: two cids.
The first is the data pointer, the second is the context. If you want inline context, use identity multihash.
This also makes CIDv2 a valid CIDv1 codec, which can be used for backward compatibility when necessary (although we should do the work of supporting them natively in the codecs).
Since CIDv2 can be viewed as a tuple of CIDs it’s possible to add support across the existing interfaces representing CIDv2 as a simple list of two CIDs in the existing IPLD Data Model.
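A minimal sketch of that "tuple of two CIDs" view, assuming a simplified `Cid` struct rather than any existing library type. The names `Cid`, `FatPointer`, `with_inline_context`, `inline_context`, and `as_list` are hypothetical; the only facts relied on are that the identity multihash has code 0x00 and that its digest is the raw bytes themselves.

```rust
/// Multihash code for the identity "hash" (the digest is the input bytes).
const IDENTITY: u64 = 0x00;

/// Simplified stand-in for a CID: content codec plus multihash parts.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Cid {
    codec: u64,
    hash_code: u64,
    digest: Vec<u8>,
}

/// A fat pointer viewed as a pair: the data pointer first, the context second.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FatPointer {
    data: Cid,
    context: Cid,
}

impl FatPointer {
    /// Build a pointer whose context is carried inline via the identity multihash.
    fn with_inline_context(data: Cid, context_codec: u64, context: &[u8]) -> Self {
        let context = Cid {
            codec: context_codec,
            hash_code: IDENTITY,
            digest: context.to_vec(),
        };
        FatPointer { data, context }
    }

    /// Return the inline context bytes, or None if the context is a hashed link
    /// that has to be fetched as a separate block.
    fn inline_context(&self) -> Option<&[u8]> {
        if self.context.hash_code == IDENTITY {
            Some(self.context.digest.as_slice())
        } else {
            None
        }
    }

    /// Expose the pointer as a plain two-element list of links.
    fn as_list(&self) -> [&Cid; 2] {
        [&self.data, &self.context]
    }
}
```

Interfaces that only understand lists of links could consume `as_list()` unchanged, while code that is aware of the pairing can check `inline_context()` before deciding whether another block needs to be fetched.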
There were a lot of discussions about this in-person, so there’s plenty of details I’m sure I’m leaving out, but it’s time to discuss.