feat: add code for car serialization format #258

Gozala · 2022-02-16T17:18:12Z

Add a code for CARs so that in .storage services we could tag multihashes with

Add a code for CARs so that in .storage services we could identify them by multihash

rvagg

I'm fine with this, but #239 had some objections so we'll have to be careful not to steamroll through those.

rvagg · 2022-02-17T03:59:31Z

oh, and do we need a v1 and v2 here? we can differentiate once we get the bytes, but do we need to know up front where multicodecs get used?

Gozala · 2022-02-17T09:54:58Z

oh, and do we need a v1 and v2 here? we can differentiate once we get the bytes, but do we need to know up front where multicodecs get used?

For our use cases that seems irrelevant, as long as we can identify the version from the bytes.

I suggest we go with generic car code and if we find that capturing version is important we could add version specific entries as well.
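To illustrate "identify the version from the bytes": a CARv2 file starts with a fixed 11-byte pragma, so a single version-agnostic code can still be disambiguated after the fact. The sketch below only checks for that pragma. The function name is hypothetical, and recognising a CARv1 would additionally require decoding its DAG-CBOR header, which is not attempted here.

```python
# The 11-byte CARv2 pragma: a length byte (0x0a) followed by the
# CBOR encoding of the map {"version": 2}.
CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")


def looks_like_car_v2(data: bytes) -> bool:
    """Return True if `data` starts with the CARv2 pragma.

    A False result does not prove the bytes are a CARv1; it only
    means they are not a CARv2.
    """
    return data[: len(CARV2_PRAGMA)] == CARV2_PRAGMA
```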

vmx · 2022-02-17T10:39:29Z

If this code is intended to be used in a CID as multicodec-content-type (this is what the spec currently calls it), then it should be ipld and not serialization. I think there is agreement that only IPLD formats should be there and we should update the CID spec to make that clear.

Gozala · 2022-02-17T10:51:27Z

If this code is intended to be used in a CID as multicodec-content-type (this is what the spec currently calls it), then it should be ipld and not serialization. I think there is agreement that only IPLD formats should be there and we should update the CID spec to make that clear.

What's the point of that table column if the only value allowed is “ipld”?

I suggest we start with “serialization” because it is a fact today. If we end up turning it into a codec and using it in CIDs, we can update that column to reflect that fact.

vmx · 2022-02-17T11:26:08Z

What's the point of that table column if the only value allowed is “ipld”?

The Multicodec Table is a table that is not related to CIDs. It's just a list of things that map to certain numbers. The column is there to indicate what such a number is used for, e.g. for a Multihash, or for IPLD codecs that can then be used in CIDs.
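As a rough illustration of "a list of things that map to certain numbers", the sketch below parses three rows copied from the table.csv hunk in this PR and looks entries up by name. The column labels (name, tag, code, status, description) and the helper name are assumptions made for the example, not something stated in this thread.

```python
import csv
import io

# Rows copied from the table.csv hunk in this PR.
# Assumed column order: name, tag, code, status, description.
ROWS = """\
json,                           ipld,           0x0200,         permanent, JSON (UTF-8-encoded)
messagepack,                    serialization,  0x0201,         draft,     MessagePack
car,                            serialization,  0x0202,         draft,     Content Addressable aRchive (CAR)
"""


def load_table(text):
    """Parse table rows into a dict keyed by entry name."""
    table = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        name, tag, code, status, description = (field.strip() for field in row)
        table[name] = {
            "tag": tag,             # what the number is used for
            "code": int(code, 16),  # the number itself
            "status": status,
            "description": description,
        }
    return table


table = load_table(ROWS)
print(table["car"]["tag"], hex(table["car"]["code"]))
```

The tag is only metadata about the entry; the same lookup works whether the entry is tagged "ipld" or "serialization".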

Gozala · 2022-02-17T11:42:34Z

The Multicodec Table is a table that is not related to CIDs.

I have misunderstood what you were referring to with “there” in your previous comment.

Does my suggestion of starting with “serialization” to reflect the facts today, and updating that as necessary in the future, make sense?

Gozala · 2022-02-17T18:27:10Z

Can I go ahead and merge this? Or do we still have some disagreements to resolve?

willscott · 2022-02-17T18:43:25Z

table.csv

@@ -124,6 +124,7 @@ http,                           multiaddr,      0x01e0,         draft,
 swhid-1-snp,                    ipld,           0x01f0,         draft,     SoftWare Heritage persistent IDentifier version 1 snapshot
 json,                           ipld,           0x0200,         permanent, JSON (UTF-8-encoded)
 messagepack,                    serialization,  0x0201,         draft,     MessagePack
+car,                            serialization,  0x0202,         draft,     Content Addressable aRchive (CAR)


can we specify this is a carv1 specifically, or does this cover both car v1 and car v2?

Clarify in the table somehow, or here? If here: it's supposed to be version agnostic.

lidel

Not a blocker, but I think this should be clarified before we start using this code:

Add a code for CARs so that in .storage services we could identify them by multihash

How is that multihash generated? Is it the multihash of the root block or something else (if so, how to calculate it)?
Where will the spec for this live? https://ipld.io/docs/codecs/known/?

rvagg · 2022-02-17T23:22:48Z

Re serialization and ipld:

If we end up turning it into a codec and using it in CIDs, we can update that column to reflect that fact.

I'm OK with this as a position if it's not going to be used for CIDs (a good way to think about this column might be something like: "does the decoder yield IPLD links?", and a CAR decoder does in fact yield links). But this raises the question of what this is being used for if not CIDs? Continuing from #239, I think most of us are assuming that's what this would be for. But apparently not?

So back to the original ask:

Add a code for CARs so that in .storage services we could identify them by multihash

How does this help you identify by multihash? Presumably you're going to hash the bytes and the digest from that gives you the multihash. What do you need the additional identifier for if not to make CIDs?

This is not a blocker btw, I think this can be merged, but the nuances might dictate needing to change that type column. I'm currently imagining this being a little like the CAR index format codes, 0x0400 and 0x0401 which are just unique identifiers for a single thing among a group of related things and I'm assuming that .storage services have a need for uniquely identifying a CAR as a thing among a group of related things, but I'm not sure what that would be, if not the same use-case as CIDs.

lidel · 2022-02-18T17:08:27Z

To clarify why I asked: the use case I have in mind is a convention where the raw and car codecs are used on the HTTP Gateway as a way of requesting a single Block or a CAR with blocks for a DAG.

HTTP GET /ipfs/{cid-with-raw-codec} returning a raw Block
HTTP GET /ipfs/{cid-with-car-codec} returning a CAR with the entire DAG behind a CID

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has a different meaning, we should agree on that now.
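A minimal sketch of how a gateway might read the codec out of a binary CIDv1 to pick between the two responses described above. It assumes the CID is already in binary form (no multibase decoding), uses 0x55 for raw and the 0x0202 code from this PR for car, and the function names are hypothetical. It is not how any existing gateway is implemented.

```python
RAW = 0x55    # "raw" multicodec code
CAR = 0x0202  # "car" code added by this PR


def read_varint(buf, offset=0):
    """Decode an unsigned varint; return (value, next_offset)."""
    value = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # no continuation bit: last byte
            return value, offset
        shift += 7


def gateway_response_kind(cid_bytes):
    """Map the codec of a binary CIDv1 to the kind of response to send."""
    version, pos = read_varint(cid_bytes)
    if version != 1:
        raise ValueError("only binary CIDv1 is handled in this sketch")
    codec, _ = read_varint(cid_bytes, pos)
    if codec == RAW:
        return "block"  # a single raw Block
    if codec == CAR:
        return "car"    # a CAR with the DAG behind the root
    return "unsupported"
```

Note that the multihash that follows the codec is untouched here; under this convention it still identifies the root block, not the CAR bytes.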

Gozala · 2022-02-19T23:11:27Z

How is that multihash generated?

I messed up when I said "we could identify them by multihash", because as you've all pointed out it's not really a multihash and I'm not sure we have a term for it. We want to generate a multihash for the CAR and tag it with this code.

It is true that it sounds like a CID; maybe it should be a CID. Yet I really want to avoid the debate of whether it is a good idea to identify things larger than the libp2p block size limit with CIDs. There are tradeoffs there and I'm not sure we're prepared to evaluate them yet.

I do think, however, that we can all agree on the fact that CAR is an established serialization format which can have its own code.

I think we'll be in a better position to debate whether CAR as an IPLD codec is a good idea after we've had a chance to evaluate that in our work. Only once we're convinced that it's the right choice can we discuss the tradeoffs and update the table field, if we choose to.

Gozala · 2022-02-19T23:29:16Z

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has a different meaning, we should agree on that now.

I love the idea of making the gateway capable of exporting DAGs, but I am concerned about overloading the CID codec here because:

A CAR may not cover the whole DAG (it may contain only a subset of nodes).
It may contain nodes from multiple unrelated DAGs.
The same DAG can be represented by different CARs.

More broadly, I think it is a mistake to think of CAR as a DAG serialization format. Thinking of it as a block set serialization seems a lot more accurate to me.

In regards to how we want to use it.

We want to generate a CAR multihash by hashing the bytes of the file (e.g. with sha256, tagged accordingly) and then tag that multihash with the CAR code. If we also tag it with a CID version we'd get a CID in the more traditional sense, but again I'm not prepared to have a debate on whether we should identify large things (greater than the block size limit) with CIDs or not.
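The steps described above can be sketched roughly as follows: hash the CAR file bytes with sha2-256, wrap the digest as a multihash, then prefix the varint-encoded CAR code, and optionally a CID version. The function names and the `as_cid` flag are hypothetical; 0x12 is the sha2-256 multihash code and 0x0202 is the code from this PR. This is an illustration of the idea, not the .storage implementation.

```python
import hashlib

SHA2_256 = 0x12  # multihash code for sha2-256
CAR = 0x0202     # "car" code added by this PR
CID_V1 = 0x01    # CID version prefix


def varint(n):
    """Encode a non-negative integer as an unsigned varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash_sha256(data):
    """Return <hash code><digest length><digest> for sha2-256."""
    digest = hashlib.sha256(data).digest()
    return varint(SHA2_256) + varint(len(digest)) + digest


def tag_car(car_bytes, as_cid=False):
    """Tag the multihash of the CAR file bytes with the CAR code.

    With as_cid=True a CID version is prepended as well, which yields
    the byte layout of a binary CIDv1.
    """
    tagged = varint(CAR) + multihash_sha256(car_bytes)
    return varint(CID_V1) + tagged if as_cid else tagged
```

Because the hash covers the file bytes rather than a root block, two different CARs holding the same DAG would get different identifiers under this scheme.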

Gozala · 2022-02-19T23:35:09Z

I'm going to merge this given approvals and comments suggesting no blockers here. Happy to carry on related discussions at #239 instead

feat: add code for car serialization format

05eb1f9

Add a code for CARs so that in .storage services we could identify them by multihash

Gozala requested a review from Stebalien February 16, 2022 17:18

rvagg approved these changes Feb 17, 2022

rvagg mentioned this pull request Feb 17, 2022

multiformat code for CARs #239

Closed

willscott reviewed Feb 17, 2022

willscott approved these changes Feb 17, 2022

lidel reviewed Feb 17, 2022

Gozala merged commit 4e93923 into master Feb 19, 2022

Gozala deleted the feat/car branch February 19, 2022 23:35

lidel mentioned this pull request Mar 1, 2022

Gateway support for /ipfs/{cid}?format=car|raw|... ipfs/kubo#8234

Closed

3 tasks
