Add 'jcs' and 'urdna2015' canonicalization values. #261

Merged (2 commits) on Oct 10, 2022

dmitrizagidulin
Contributor

@dmitrizagidulin dmitrizagidulin commented Mar 4, 2022

Adds a new canonhash tag value that represents a combination canonicalization+hash operation (using RDF Dataset Canonicalization URDNA2015, soon to be renamed to URDCA2015).

Used for the hashlinking of Verifiable Credentials proposal to the W3C VC WG, in the implementation of digestMultibase.

digestMultibase example:

MULTIBASE('base58btc', CANONICALIZE('urdca-2015-canon', MULTIHASH('sha256', <canonicalized input>)))
MULTIBASE('base58btc', CANONICALIZE('jcs-canon', MULTIHASH('sha256', <canonicalized input>)))
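A minimal sketch of the JCS variant of the expression above, for readers who want to see the layers concretely. Several things here are assumptions rather than part of the proposal: `json.dumps` with sorted keys is only an approximation of JCS (RFC 8785 has stricter rules for number formatting and key ordering), the base58btc encoder is hand-rolled, all function names are hypothetical, and the CANONICALIZE tagging step is left out because its byte layout is not specified in this comment.

```python
import hashlib
import json

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def approx_jcs(value) -> bytes:
    # ASSUMPTION: stand-in for a real JCS (RFC 8785) implementation.
    # Sorted keys and compact separators cover simple documents only.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def sha256_multihash(data: bytes) -> bytes:
    # Multihash layout: <hash code><digest length><digest>; 0x12 is sha2-256.
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest


def base58btc(data: bytes) -> str:
    # Hand-rolled base58btc encoder; leading zero bytes map to "1".
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def digest_multibase(value) -> str:
    # "z" is the multibase prefix for base58btc.
    return "z" + base58btc(sha256_multihash(approx_jcs(value)))
```

Because the input is canonicalized before hashing, `digest_multibase({"b": 1, "a": 2})` and `digest_multibase({"a": 2, "b": 1})` produce the same string.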

@vmx
Member

vmx commented Mar 7, 2022

I only had a quick look at JCS and urdna2015. Do I understand correctly that MULTIHASH('jcs', MULTIHASH('sha256', <canonicalized hashed VC>)) would still be a SHA-256 hash? Is the idea that you'd like to be able to determine "that SHA-256 came from a canonicalized JSON according to the JCS rules"?

@dmitrizagidulin
Contributor Author

Do I understand correctly that MULTIHASH('jcs', MULTIHASH('sha256', <canonicalized hashed VC>)) would still be a SHA-256 hash? Is the idea that you'd like to be able to determine "that SHA-256 came from a canonicalized JSON according to the JCS rules"?

That's it, exactly. Tagging jcs as a multihash is not exactly right, but we're trying to work within the limitation that MULTIHASH has essentially one parameter when it really needs several (see the discussion in multiformats/multihash#78, "Parametrized Hashing", and in multiformats/multihash#56).

@dmitrizagidulin
Contributor Author

Ok, on further conversation, it might be less confusing to people if this PR introduced a new tag (instead of overloading the use of multihash).
So instead, I propose adding a canonized hash tag.

@vmx
Member

vmx commented Mar 10, 2022

In your original comment you mention hashlinking. Is the goal to use that multicodec code as part of a CID?

I'm asking as I think this request poses an interesting question. If I think in terms of a CID, where we specify the encoding as well as the hash algorithm, the question is, should this be the encoding information or the hash algorithm information?

To me a CID is self-describing on how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways of describing it:

  1. The encoding is JSON and the hash algorithm is "canonicalize things first, then do a SHA-256" hash.
  2. The encoding is canonicalized JSON and the hash algorithm is SHA-256.

In both cases you'd have all the information you need.

@dlongley
Contributor

dlongley commented Mar 10, 2022

@vmx,

We would ideally like to design this in such a way that any hash algorithm from the multihash table could be used -- without having to create NxM combination codec values. So, we can express that some data was canonicalized with algorithm X (urdca2015 or jcs are the two most interesting values here right now) and then hashed with algorithm Y (any value from the multihash table). So we're just looking for the best way / format to allow this kind of parameterization so that all of the information needed (as you mentioned) is there.
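A minimal sketch of the parameterization described above: one table of canonicalization algorithms, one table of hash algorithms, and a function that composes any entry of the first with any entry of the second, so no combined codec value per pair is needed. The registries, the function names, and the returned `(name, multihash)` pair are all hypothetical, and the only canonicalizer shown is a rough JSON approximation of JCS, not a conforming implementation.

```python
import hashlib
import json
from itertools import product

# ASSUMPTION: toy canonicalizer registry; a real one would hold conforming
# JCS and URDCA-2015 implementations.
CANONICALIZERS = {
    "jcs (approximation)": lambda doc: json.dumps(
        doc, sort_keys=True, separators=(",", ":")
    ).encode("utf-8"),
}

# Multihash codes for two hashes available in hashlib.
HASHES = {
    "sha2-256": (0x12, hashlib.sha256),
    "sha2-512": (0x13, hashlib.sha512),
}


def canonicalize_then_hash(canon_name: str, hash_name: str, doc):
    # Step X: canonicalize. Step Y: hash with any algorithm from the table.
    canonical = CANONICALIZERS[canon_name](doc)
    code, hash_fn = HASHES[hash_name]
    digest = hash_fn(canonical).digest()
    # Both codes and both digest lengths fit in one byte, so no varint
    # encoding is needed in this small example.
    return canon_name, bytes([code, len(digest)]) + digest


def all_combinations():
    # N canonicalizers and M hashes give N*M usable pairs from N+M entries.
    return list(product(CANONICALIZERS, HASHES))
```

How the canonicalization name would be carried alongside the multihash on the wire is the open question in this thread, so the sketch returns it as a separate value rather than committing to a byte layout.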

@vmx
Member

vmx commented Mar 11, 2022

@dlongley This means that urdca2015 and jcs aren't about hashing at all, they are about the step before the hashing. I still guess you want to use this as part of a CID, so the only possible place to put this identifier is the data codec (the CID spec names that "multicodec codec type"). The information there is used to know how to encode/decode the bytes that were addressed with the CID. Is JCS always JSON and URDCA2015 always XML? Or could other data formats also be canonicalized with such algorithms?

@dmitrizagidulin
Contributor Author

dmitrizagidulin commented Apr 7, 2022

@vmx

@dlongley This means that urdca2015 and jcs aren't about hashing at all, they are about the step before the hashing.

Right, exactly. They're essentially a second parameter to the multihash (what pre-processing steps must be taken with the data before hashing).

Is JCS always JSON and URDCA2015 always XML? Or could also other data formats be canonicalized with such algorithms?

JCS is always JSON. URDCA2015 is any sort of RDF-based linked data (which includes JSON, Turtle, RDF-XML, N-Quads, etc).

To me a CID is self-describing on how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways of describing it:

  1. The encoding is JSON and the hash algorithm is "canonicalize things first, then do a SHA-256" hash.
  2. The encoding is canonicalized JSON and the hash algorithm is SHA-256.

Right, so, this is the tricky part. I'd say the situation is closer to 1 -- the hash algorithm is "canonicalize things first, then do a SHA-256" hash. And the encoding (of the hash) is multibase. (I'm not sure it's necessary to specify the encoding of the pre-hash data, though, since the hash is a one-way operation.)

@vmx - would you be open to defining a new "canonized hash" tag?

@rvagg
Member

rvagg commented Apr 8, 2022

Finally found time to look at this and give my 2c.

  1. Firstly, I'd like to make a bit of space after the poseidon* entries because we can expect more of those, maybe bump it to 0xb503 or even find a different space for it around that area.
  2. I don't think I have an objection to making a new tag for this, it really is a different beast, and it's not like we have strong rules for that column anyway. It would probably be inappropriate to make it multihash or ipld or even serialization since it's not quite any of those.
  3. I think I could see a path to this being used in CIDs if you implement it as a faux-multihash. Our implementations have ways of abstracting the multihash part of a CID such that you just need to be able to produce a digest. So, you could implement this as a layer on top of the existing multihash interfaces: you take existing multihash implementations and wrap them in this thing, and the multihash part of the CID is really a multihash(multihash), although as far as the CID implementations are concerned it's just the one multihash. That would be interesting to see work and there may be hiccups along the way. I'm not sure it's a great idea, but it doesn't seem impossible.
  4. Having said all of that ^, using this for CIDs does feel a bit like a hack, to squish information into a CID because CIDv1 doesn't have the ability to convey quite enough information as it is. Maybe this goes into the wishlist bucket for CIDv2?
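The faux-multihash idea in point 3 can be sketched roughly as follows. This is an illustration under assumptions, not how any existing CID library is structured: the hasher classes and helper names are hypothetical, the outer code 0xb403 is the value proposed for urdca-2015-canon in this PR, and the canonicalizer is a caller-supplied placeholder because a real URDCA-2015 implementation is outside the standard library.

```python
import hashlib

URDCA_2015_CANON = 0xB403  # code proposed in this PR's table row
SHA2_256 = 0x12


def varint(n: int) -> bytes:
    # Unsigned LEB128 varint, as used for multiformats codes and lengths.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int = 0):
    # Returns (value, position after the varint).
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash(code: int, digest: bytes) -> bytes:
    return varint(code) + varint(len(digest)) + digest


class Sha256Hasher:
    code = SHA2_256

    def digest(self, data: bytes) -> bytes:
        return hashlib.sha256(data).digest()


class CanonWrappedHasher:
    # HYPOTHETICAL wrapper: its "digest" is the inner hasher's full multihash
    # of the canonicalized input, giving multihash(multihash) overall.
    def __init__(self, code, canonicalize, inner):
        self.code = code
        self.canonicalize = canonicalize
        self.inner = inner

    def digest(self, data: bytes) -> bytes:
        canonical = self.canonicalize(data)
        return multihash(self.inner.code, self.inner.digest(canonical))


def outer_multihash(hasher, data: bytes) -> bytes:
    # What a CID implementation would see: a single multihash.
    return multihash(hasher.code, hasher.digest(data))


def inner_hash_code(outer: bytes) -> int:
    _, pos = read_varint(outer, 0)    # skip outer code
    _, pos = read_varint(outer, pos)  # skip outer digest length
    code, _ = read_varint(outer, pos)
    return code
```

In this layout the outer "digest" still names the inner hash algorithm, so a reader who knows the wrapper convention can recover it with `inner_hash_code`, while code that only understands plain multihashes sees an opaque digest.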

@vmx
Member

vmx commented Apr 8, 2022

I'd like to check if I understood the current outcome correctly.

The urdca-2015-hash is used in the multihash part of the CID. So a CID would look like this (I leave out the size information bits for simplicity):

<v1><can-e.g.-be-json-turtle-xml><urdca-2015-hash><the-hash-digest>

This points to some data.

Now I retrieve the data and I want to create a CID out of it. I would only know that I need to canonicalize the data before hashing, but I wouldn't know which hash algorithm to use. Is that correct?

@rvagg
Member

rvagg commented Sep 27, 2022

@dmitrizagidulin any changes to this you want to pursue so we can get this over the line in some form?

@dmitrizagidulin
Contributor Author

Hi @rvagg, thanks for checking in.
So, yeah, absolutely, we’ve got even more implementations in need of this mechanism on the way, so we definitely want to find some kind of solution. (I was chatting with @gobengo about this just yesterday, and he gave me a couple new vectors to consider.) So, let me review the issue and get back to you later today.

@dmitrizagidulin dmitrizagidulin changed the title Add 'jcs' and 'urdna2015' multihash values. Add 'jcs' and 'urdna2015' canonicalization values. Sep 30, 2022
@dmitrizagidulin
Contributor Author

dmitrizagidulin commented Sep 30, 2022

Hi @rvagg -- after some discussion with @gobengo, I've updated the PR (and resolved merge conflicts) to hopefully address some of your concerns.

Firstly, I'd like to make a bit of space after the poseidon* entries because we can expect more of those, maybe bump it to 0xb503 or even find a different space for it around that area.

Totally understood wanting to make space -- I moved the JCS canonicalization entry to post-poseidon.
If at all possible, we would really like to keep the urdna-2015-canon entry as 0xb403. (This is totally my fault: I dropped the ball on resolving this PR, and meanwhile the 0xb403 tag is being deployed to millions of Point-of-Sale systems (literally old-school cash registers) as part of a US-wide Age Verification project.)

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag. This is because json-jcs is essentially a standardized version of what dag-json does (sorts/canonicalizes JSON input so that it can be composed with hashing).

@dlongley
Contributor

@dmitrizagidulin,

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag.

Does that mean the existing implementations need to change? If not, why not?

@dmitrizagidulin
Contributor Author

@dmitrizagidulin,

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag.

Does that mean the existing implementations need to change? If not, why not?

Hey @dlongley - no, no existing implementations need to change. The tag in the CSV file is conceptual / for organizing things into categories, it's not used in the code.
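To illustrate why the tag is not load-bearing, here is a small sketch of how a consumer might build a name-to-code lookup from the table. It is an assumption about how such a loader could look, not a description of any particular multicodec library; the column order (name, tag, code, status, description) is inferred from the rows in this PR's diff, and `load_codes` is a hypothetical helper.

```python
import csv
import io

# Rows copied from the diff in this PR.
TABLE = """\
skein1024-1024, multihash, 0xb3e0, draft,
poseidon-bls12_381-a2-fc1, multihash, 0xb401, permanent, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters
urdca-2015-canon, ipld, 0xb403, draft, The result of canonicalizing an input according to URDCA-2015 and then expressing its hash value as a multihash value.
"""


def load_codes(table_text: str) -> dict:
    # Build a name -> code map; the tag column is read but never used.
    codes = {}
    for row in csv.reader(io.StringIO(table_text), skipinitialspace=True):
        if not row:
            continue
        name, _tag, code = row[0], row[1], row[2]
        codes[name] = int(code, 16)
    return codes
```

Swapping `ipld` for `canonhash` in the third row leaves the result of `load_codes` unchanged, which is the sense in which a deployed implementation keyed on 0xb403 would be unaffected by the tag change.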

@BigLep BigLep requested a review from rvagg October 4, 2022 22:38
Member

@rvagg rvagg left a comment

yeah, OK, I think we can just merge these now, although I'll register two final comments:

  • I'm still unsure if ipld is the right way to go, serialization might be better, we tend to use ipld for schemes that yield linked data .. maybe this does, maybe it's a scheme that yields a single link, but the canonicalisation is also something that we do more in ipld than generic serialization schemes so 🤷.
  • The placement is pretty annoying, I'd really like to have reserved the 0xb4xx block for poseidon*. I get that you've deployed this and that's certainly a strong consideration, but still pretty annoying. It's going to be an ugly duckling amongst additional poseidon entries.

@rvagg rvagg merged commit 5e275cd into multiformats:master Oct 10, 2022
@msporny
Contributor

msporny commented Oct 10, 2022

Thanks for the merge @rvagg.

To come back to ipld not being the right way to go, I agree. Can we just use "multiformat" for the tag name?

If not, what if we introduced a new "transformed-multihash" namespace? It's not clear to me what constitutes a "namespace" vs. a "multiformat".

@rvagg
Member

rvagg commented Oct 10, 2022

@msporny the tags really don't matter that much so it's not worth getting too hung up about it - I imagine a future point where we refactor a bunch of the organisational stuff and they become more relevant at which point we take a more holistic view of what we have and do some adjustment.

If something feels like it should be just "multiformat" then we should probably just invent a new tag for it - if you're making something that could be described in a new multiformat spec then make a tag as a new category. I'm not sure about "namespace", mostly I treat those as networking / libp2p related so usually not appropriate for hashing or encoding.

I'd be happy for someone to come up with a new tag for this, but maybe something broad enough that can fit other things too? transformed-multihash might work, it's a little long but it explains the purpose. multimultihash might be a bit too cute, compound-multihash is another option in the same theme.

@RangerMauve

Can't believe I'm just seeing this now! Really glad that this has been put in place.

IMO IPLD is absolutely something that we should look into here since we can use this as a component of IPLD based database systems at large.

@@ -483,8 +483,10 @@ skein1024-1016, multihash, 0xb3df, draft,
skein1024-1024, multihash, 0xb3e0, draft,
poseidon-bls12_381-a2-fc1, multihash, 0xb401, permanent, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters
poseidon-bls12_381-a2-fc1-sc, multihash, 0xb402, draft, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters - high-security variant
urdca-2015-canon, ipld, 0xb403, draft, The result of canonicalizing an input according to URDCA-2015 and then expressing its hash value as a multihash value.

@gobengo gobengo Mar 31, 2023

Shouldn't this be urdna-2015-canon, with an n rather than a c?

urdna

Contributor

There is a debate raging over what we should call it. Traditionally, we used "n" to mean "normalization"... but it's generally accepted now that we should've said "canonicalization" since it's a more accurate description of what's happening. Thus, the "urdca" vs. "urdna" distinction. This is currently being discussed in the W3C RDF Dataset, Canonicalization, and Hashing Working Group (note that we didn't call it the "normalization" working group).

@gobengo gobengo Apr 1, 2023

@msporny is there a URI for that issue, or do I need to file one? Just earlier today I noticed that meetings have started, and I need to get that on my calendar.
