Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add code for car serialization format #258

Merged
merged 1 commit into from
Feb 19, 2022
Merged

feat: add code for car serialization format #258

merged 1 commit into from
Feb 19, 2022

Conversation

Gozala
Copy link
Contributor

@Gozala Gozala commented Feb 16, 2022

Add a code for CARs so that in .storage services we could tag multihashes with

Add a code for CARs so that in .storage services we could identify them by multihash
Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this, but #239 had some objections so we'll have to be careful not to steamroll through those.

@rvagg
Copy link
Member

rvagg commented Feb 17, 2022

oh, and do we need a v1 and v2 here? we can differentiate once we get the bytes, but do we need to know up front where multicodecs get used?

@rvagg rvagg mentioned this pull request Feb 17, 2022
@Gozala
Copy link
Contributor Author

Gozala commented Feb 17, 2022

oh, and do we need a v1 and v2 here? we can differentiate once we get the bytes, but do we need to know up front where multicodecs get used?

For our use cases that seems irrelevant, as long as we can identify the version from the bytes.

I suggest we go with generic car code and if we find that capturing version is important we could add version specific entries as well.

@vmx
Copy link
Member

vmx commented Feb 17, 2022

If this code is intended to be used in a CID as multicodec-content-type (this is what the spec currently calls it), then it should be ipld and not serialization. I think there is agreement that only IPLD formats should be there and we should update the CID spec to make that clear.

@Gozala
Copy link
Contributor Author

Gozala commented Feb 17, 2022

If this code is intended to be used in a CID as multicodec-content-type (this is what the spec currently calls it), then it should be ipld and not serialization. I think there is agreement that only IPLD formats should be there and we should update the CID spec to make that clear.

What the point of that table column if only value allowed is “ipld” ?

I suggest we start with “serialization” because it is a fact today. If we end up turning it into codec, using it in CIDs we can update that column to reflect that fact.

@vmx
Copy link
Member

vmx commented Feb 17, 2022

What the point of that table column if only value allowed is “ipld” ?

The Multicodec Table is a table that is not related to CIDs. It's just a list of things that map to certain numbers. The column is there to make sense, what such a number is used for. E.g. for a Multihash, or for IPLD Codecs that can then be used in CIDs.

@Gozala
Copy link
Contributor Author

Gozala commented Feb 17, 2022

The Multicodec Table is a table that is not related to CIDs.

I have misunderstood what you were referring to with “there” in your previous comment.

Does my suggestion of starting with the “serialization” to reflect fact today and updating that as necessary in the future makes sense ?

@Gozala
Copy link
Contributor Author

Gozala commented Feb 17, 2022

Can I go ahead and merge this ? Or do we still have some disagreements to resolve ?

@@ -124,6 +124,7 @@ http, multiaddr, 0x01e0, draft,
swhid-1-snp, ipld, 0x01f0, draft, SoftWare Heritage persistent IDentifier version 1 snapshot
json, ipld, 0x0200, permanent, JSON (UTF-8-encoded)
messagepack, serialization, 0x0201, draft, MessagePack
car, serialization, 0x0202, draft, Content Addressable aRchive (CAR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we specify this is a carv1 specifically, or does this cover both car v1 and car v2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify as in a table somehow or here ? If here it's supposed to be version agnostic.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but think this should be clarified before we start using this code:

Add a code for CARs so that in .storage services we could identify them by multihash

How is that multihash generated? Is it the multihash of the root block or something else (of so, how to calculate it)?
Where will the spec for this live? https://ipld.io/docs/codecs/known/?

@rvagg
Copy link
Member

rvagg commented Feb 17, 2022

Re serialization and ipld:

If we end up turning it into codec, using it in CIDs we can update that column to reflect that fact.

I'm OK with this as a position if it's not going to be used for CIDs (a good way to think about this column might be something like: "does the decoder yield IPLD links?", and a CAR decoder does in fact yield links). But this raises the question of what this is being used for if not CIDs? Continuing from #239, I think most of us are assuming that's what this would be for. But apparently not?

So back to the original ask:

Add a code for CARs so that in .storage services we could identify them by multihash

How does this help you identify by multihash? Presumably you're going to hash the bytes and the digest from that gives you the multihash. What do you need the additional identifier for if not to make CIDs?

This is not a blocker btw, I think this can be merged, but the nuances might dictate needing to change that type column. I'm currently imagining this being a little like the CAR index format codes, 0x0400 and 0x0401 which are just unique identifiers for a single thing among a group of related things and I'm assuming that .storage services have a need for uniquely identifying a CAR as a thing among a group of related things, but I'm not sure what that would be, if not the same use-case as CIDs.

@lidel
Copy link
Member

lidel commented Feb 18, 2022

To clarify why I asked, the use case I have in mind is convention where raw and car codecs are used on HTTP Gateway as a way of requesting a single Block or a CAR with blocks for a DAG.

  • HTTP GET /ipfs/{cid-with-raw-codec} returning a raw Block
  • HTTP GET /ipfs/{cid-with-car-codec} returning a CAR with the entire DAG behind a CID

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has different meaning, we should agree on that now.

@Gozala
Copy link
Contributor Author

Gozala commented Feb 19, 2022

How is that multihash generated?

I messed up when I said "we could identify them by multihash", because as you've all pointed out it's not really a multihash and I'm not sure we have term for it. We want to generate multihash for CAR and tag it with this code.

It is true that it sounds like CID, maybe it should be CID. Yet I really want to avoid the debate of whether it is good idea to identify things larger than libp2p block size limit with a CIDs. There are tradeoffs there and I'm not sure we're prepared to evaluate them yet.

I do think however that we can all agree on the fact that CAR is an established serialization format which can have it's own code.

I think we'll be in a better position to debate whether CAR as an IPLD codec is good idea after we've had a chance to evaluate that in our work. And only we're convinced that it's a right choice we can discuss tradeoffs and update table field if we choose so.

@Gozala
Copy link
Contributor Author

Gozala commented Feb 19, 2022

In this convention the multihash in a CID represents the root block of a DAG, and if you plan to use car with a multihash that has different meaning, we should agree on that now.

I love the idea of making gateway capable of export DAGs, but I am concerned about overloading CID codec here because:

  1. CAR may not cover whole DAG (it may contain only subset of nodes)
  2. It may contain nodes from multiple unrelated DAGs.
  3. Same DAG can be represented by different CARs.

More broadly I think it is a mistake to think of CAR as DAG serialization format. Thinking of it as block set serialization seems a lot more accurate to me.

In regards to how we want to use it.

We want to generate CAR multihash by hashing bytes of the file (e.g. with sha256 and tagging accordingly) and than tag that multihash with CAR code. If we tag it with CID version we'd get a CID in a more traditional sense, but again I'm not prepared to have a debate on whether we should identify large things (greater than block size limit) with CIDs or not.

@Gozala
Copy link
Contributor Author

Gozala commented Feb 19, 2022

I'm going to merge this given approvals and comments suggesting no blockers here. Happy to carry on related discussions at #239 instead

@Gozala Gozala merged commit 4e93923 into master Feb 19, 2022
@Gozala Gozala deleted the feat/car branch February 19, 2022 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants