-
Notifications
You must be signed in to change notification settings - Fork 108
Is this thin? #18
Comments
Only if it creates IPLD objects and wants to validate that they aren't malformed. For storage, networking, etc... one can treat names as bytes. Also, one of the really nice things about UTF-8 is that it only uses bytes 127+ for non-ASCII data. This means that, e.g., when parsing a path one can just split on "/" (0x47) without doing any UTF-8 parsing as 0x47 can only mean "/".
See #9 (comment). Basically, @jbenet agrees with you (and so do I). While IPLD will support links like this, IPLD implementations shouldn't resolve them.
Even without #1, this is true. Being able to point into/through objects was a motivation for IPLD.
That's effectively what the Protobuf version is. Yes, this does make it easier to traverse a DAG. Unfortunately, this doesn't actually fix anything because we want IPLD to be compatible with existing merkeldags. That is, we want to be able to interpret the block chain, git, etc. as IPLD data structures so we already need to support multiple complex formats. Someone who knows more about the motivation for IPLD can probably give you a better answer to this. |
That's all true, but it's nice to be able to lift those restrictions and be able to create objects too.
Whew :).
Ok, but just cause we want a format that does that doesn't mean it needs to be the thin waist.
Not quite. The protobuf version also has string keys---mine explicitly forgoes any attempt to support fs-style paths and leaves that to a higher layer. I'm a bit skeptical of the size field on links too---that might be better as an optional hint, or left as wire format detail not part of the core spec as it doesn't add any information.
Oh, Ok. Now things make a little more sense. My idea for this was to abuse multihash: instead of using just a one-byte discriminant, do something variable-size and re-purpose some large tags for these legacy protocols. Certainly any way of supporting these legacy formats is going to involve some dubious layering. |
Lots to say of course, a bit limited for time today. Just this for now: The point of IPLD as "a thin-waist" is to create a common format for distributed authenticated protocols. There needs to be enough in the thin waist to be valuable, and not too much to be bloated. We think the current construction is the right balance, there's very little that it does / forces, but enough that meaty applications can interop with it. As we have it, we'll be able to link all existing merkledags too. (The merkle forest is seeded \o/) |
Sure, it's the weekend, happy to read more later. At this point, well, I figured you all do think it is a good thin waist seeing that this is what you went with :). And yes, I am enough of an "end-to-end-er" in that I gravitate towards absolutely minimalism while adding more layers to pick up the ergonomic slack. The plan special casing the hashing for the sake of various legacy formats (including but not limited to the old protobufs one) I find interesting because that is the only area I can see my bare-bones<->IPLD shim idea running into trouble. |
@Ericson2314 thanks for filing this in. I am going to go through your argument to make sure I agree and we understand each other!
Would the simple case be the bare-bone case you are discussing? Let me know! |
First two are totally correct. But let me elaborate on the rest.
There's more difference than that more than just that.. IPLD add JSON-like structure within each merkle node. My bare-bones Firstly (granted, I didn't talk about this effect before), this creates "observable chunking choices" if someone is, say, converting to JSON to IPLD. Every parent-child edge can be "outlined" into a full-blown IPLD link (the child is replaced with its hash), or "inlined" and kept as is. This means that anyone creating nodes needs to decide between these two options that are semantically equivalent. This choice essentially boils down to performance tradeoffs: bigger chunks (less outlining) might mean less requests), smaller chunks means more deduplication. But data often lives longer than the optimal of such tuning---i.e. newer hardware and cleverer algorithms and wire/memory formats can probably reduce the cost of coarser grained chunking. This should be all the more so true for the permanent web. I'd feel more comfortable if every child had to be "outlined" and given its own hash. So say we do that, so IPLD now conceptually looks like:
My second complain is strings/maps. As I mentioned in previous comments, from the implement's side, this creates burdens. When creating nodes, the implementation needs to validate UTF-8 (and the receiver may want to too). Also when decoding nodes the implementation probably wants to make sure no map key is duplicated. Now, perhaps most languages have good unicode implementations, and this is no big deal. It is almost my experience that perhaps first class strings and maps, or more broadly mandatory self-describing data (cause that's generally how this stuff is used in the JSON world) is perhaps not as useful as it seems. [This is definitely more opinion territory, and I don't really expect to make anybody that doesn't think this already change their minds. But rather consider there's more people like me.] I write mainly Haskell and Rust, which are statically typed and put algebraic data types everywhere. For single language systems, we don't need self describing data because both ends know exactly what form it is in. Now programs definitely do evolve, and self-describing data does too, but JSON and like formats (including IPLD) are notoriously awkward for this because they make adding more fields struct a lot more easier than variants to a enum (using the easier-to-understand Rust/Swift terminology). Furthermore, we often like to refactor our data in more drastic ways---changing whole hierarchies---where JSON-style self description no longer works for backwards compatibility. [I'm sure everybody does this at some point, but I think static types make it less painful / more common.] So basically for the use-cases I'm most excited to use IPFS for, the strings and maps are awkward / unhelpful. Taking out strings and maps leaves:
Which is closer. I need to go now, but I'll hopefully soon continue anti-motivating features and stripping it away. Also, remember as I wrote above for those that like IPLD exactly the way it is, it can be implemented on top of my bare-bones with little-to-no penalty. I'll elaborate on that too. |
The language is now simple enough where the different sorts of atoms seem a bit incongruous.
Putting only a single byte in only leaf nodes is a bit ridiculous, so let's allow every node to have some bytes:
We're now close but what goes in a link? Something perhaps like:
Note that I've already removed maps but not arrays so positional indexing still make sense for relative paths---this will be my strawman for discrediting relative paths :). I've mentioned mutable paths, but let me expand a bit. If all software storing/exchange IPFS nodes ought to make sure that IPLD references are defensible, then we are blurring the boundary between IPFS and IPNS---but I think everyone agrees this is sketchy. If we are just distinguishing mutable links for the user's sake, and IPFS implementations are free to ignore them, I don't think this is very thin-waist-ish. IPLD and application data are both black boxes as far as IPFS is concerned---all upper layers are equally opaque. [Now conversely, for IPLD-as-is repurposed as a batteries-included standard especially tailed for end languages written with dynamic types I think it's fine to put any sort of mutable or immutable link there.] That leaves indexing, or more generally relative paths. My problem for the thin waist is they provide a fall sense of cheapness. Given an untrusted network (the default for IPFS), a node requesting data for a multihash/path still needs the node pointed to by the hash and it's children down to the relative node to prove that second node is indeed the proper descendant. Maybe for some optimized network protocol there can be a way to prefetch such a chain, but for the data itself I don't see what this buys us. It creates multiple canonical forms for the exact same data since we're talking immutable data. [Conversely with mutable link data / cycles relative paths are super nice in order to make more updates atomic.] With those two gone, we have:
the "ipfs/" doesn't add any information anymore, so we can just drop it:
inlining link in our minimal ipfs, and removing more information-less charcters we have get:
Now THAT is my minimal thin waist proposal. |
@Ericson2314, thank you so much for writing all this down, this is really valuable. I haven't been much connected recently, and I had to think more about what you say. I will go in this order (1) address your complaints (2) discuss your thinner layer 1
Here I guess we are still discussing a data model that is JSON-like. At the very very beginning of my IPLD interest, I just needed every attribute to be a link, eventually linking to a leaf. This is really great for many reasons: a lot of data == a lot of deduplication (to be proven!), we can make cheaper proofs of membership, the data model is REALLY simple. However, as soon as I started talking to the IPFS people I realized that being too strict would cause some issues to them, let me give you some examples:
Now, objects like
Let's live this as you said: there's more people like me. 2There has been a lot of conversation about this recently, I strongly suggest you to participate in this: ipfs/specs#130. I will be short (but please ping me online if you want me to expand on this). The way I see this is to provide a generic data model (like in the IPLD on CBOR case) and a path scheme that makes it easier to traverse and point to data. However, given the CID proposal, we can have IPLD on other formats, at a point in which we are even supporting byte arrays and array position in the path scheme. In conclusion, the way I see is the following: we have a byte array (that can be an encoded CBOR or your particular implementation described above), a path scheme and a parser (in the current IPLD, it's an IPLD+CBOR parser). The role of this parser is: given a path, give me the bytes in the byte array. Typing is done via the CID bit. How does this sound? |
If we can do what I suggest in ipfs/specs#130 (comment) to make IPLD just another format supported by CIDs, then I am pretty much OK with however IPLD turns out. :) |
@nicola about the issue of fixed hashing ever child causing too much fragmentation. A middle ground is to only hash the children of maps or lists (+ strings = non-atoms) if the children are themselves non-atoms. This means for the JSON: {
"foo": null,
"bar": "asdf asdf asdf asdf asdf asdf adsf asdf asdf adsf asdf asdf asdf q",
"baz": { "a": 1, "b": 2, "c": 3 },
"qwer": [ true false false ]
} the hashed nodes are: {
"foo": null,
"bar": <hash>,
"baz": <hash>,
"qwer": <hash>
} "asdf asdf asdf asdf asdf asdf adsf asdf asdf adsf asdf asdf asdf q" { "a": 1, "b": 2, "c": 3 } [ true false false ] For easy memorizing, this mirrors when a pointer would be needed if maps are stored as association lists. |
Hey, I am going to close this issue, Resolution: CID makes IPLD thinner than IPLD+CBOR We can discuss the other issues that we discussed (e.g. how to hash every children & so on) in new issues, feel free to re-open if you think there is more conversation to do, or open others! |
That's not true at all. CIDs reduce the size of the IPLD spec but increase the implementation complexity (the complaint in this post). That is, with CIDs, reading an IPLD object requires understanding every supported format. |
That's true, but I was thinking one can just understand the CID-supported formats they want and ignore the rest. |
I think this is a good format, and I love the idea that the thin waist of IPFS gets it's own name and spec, but I don't think this format works well as a thin waist.
It's great it use from the application's perspective, but from the perspective of an implementation (storage, networking, etc), imposes extra obligations. The implementation needs to, e.g.:
But suppose that a bunch of implementations just used "bare-bones" Merkle nodes made up of a) a list hashes of other nodes, and b) a list of bytes. It's straightforward to implement IPLD on top of that: replace hashes in links with indices into the list of hashes, pack into canonical form, and use that as the list of bytes. All of bullet points I provisioned now become the responsibility of the bare-bones Merkle DAG <-> IPNS shim, and not the underlying system. If this is indeed a valid implementation strategy, then it would be nice to specify that bare-bones format so the shim can indeed be reused.
Now, it sometimes to make sense to give lower layers more information than they strictly need for the sake of optimization. But I don't see how a storage / networking layer could leverage the knowledge that nodes have the rich stricture of IPLD rather than the dumb structure of that bare-bones example. If there indeed is none, then all layers should use the bare bone format with a shim, and thus the bare bones format (by it's simplicity) becomes the thin waist.
The text was updated successfully, but these errors were encountered: