Add 'jcs' and 'urdna2015' canonicalization values. #261

dmitrizagidulin · 2022-03-04T22:15:13Z

Adds a new canonhash tag value that represents a combination canonicalization+hash operation (using RDF Dataset Canonicalization URDNA2015, soon to be renamed to URDCA2015).

Used for the hashlinking of Verifiable Credentials proposal to the W3C VC WG, in the implementation of digestMultibase.

digestMultibase example:

MULTIBASE('base58btc', CANONICALIZE('urdca-2015-canon', MULTIHASH('sha256', <canonicalized input>)))

MULTIBASE('base58btc', CANONICALIZE('jcs-canon', MULTIHASH('sha256', <canonicalized input>)))
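
As a rough illustration of the digestMultibase pipeline above, here is a minimal Python sketch. It assumes the order is canonicalize, then SHA-256 multihash, then base58btc multibase, and it leaves the canonicalization tag out of the bytes because the example above does not pin down where the tag would sit. `canonicalize_json` only approximates JCS (sorted keys, compact separators), and all function names are made up for this example.

```python
import hashlib
import json

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def canonicalize_json(value) -> bytes:
    # Approximation of JCS: sorted keys, no insignificant whitespace.
    # Full JCS (RFC 8785) also fixes number formatting and key ordering
    # by UTF-16 code units, which this sketch does not handle.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def multihash_sha256(data: bytes) -> bytes:
    # Multihash layout: <hash code><digest length><digest>; sha2-256 is 0x12.
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest


def base58btc(data: bytes) -> str:
    # Plain base58 with the Bitcoin alphabet; leading zero bytes become '1'.
    num = int.from_bytes(data, "big")
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = BASE58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def digest_multibase(value) -> str:
    # 'z' is the multibase prefix for base58btc.
    return "z" + base58btc(multihash_sha256(canonicalize_json(value)))


if __name__ == "__main__":
    print(digest_multibase({"b": 1, "a": 2}))
```

Two JSON documents that differ only in key order produce the same string, which is the property the hashlinking use case relies on.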

vmx · 2022-03-07T09:09:05Z

I only had a quick look at JCS and urdna2015. Do I understand correctly that MULTIHASH('jcs', MULTIHASH('sha256', <canonicalized hashed VC>)) would still be a SHA-256 hash? Is the idea that you'd like to be able to determine "that SHA-256 came from a canonicalized JSON according to the JCS rules"?

dmitrizagidulin · 2022-03-08T17:51:48Z

> Do I understand correctly that MULTIHASH('jcs', MULTIHASH('sha256', <canonicalized hashed VC>)) would still be a SHA-256 hash? Is the idea that you'd like to be able to determine "that SHA-256 came from a canonicalized JSON according to the JCS rules"?

That's it, exactly. Tagging jcs as a multihash is not exactly right, but we're trying to work within the limitation that MULTIHASH essentially takes one parameter when it really needs several (see the discussion in multiformats/multihash#78, "Parametrized Hashing", and in multiformats/multihash#56).

dmitrizagidulin · 2022-03-08T20:14:32Z

Ok, on further conversation, it might be less confusing to people if this PR introduced a new tag (instead of overloading the use of multihash).
So instead, I propose adding a canonized hash tag.

vmx · 2022-03-10T12:39:52Z

In your original comment you mention hashlinking. Is the goal to use that multicodec code as part of a CID?

I'm asking as I think this request poses an interesting question. If I think in terms of a CID, where we specify the encoding as well as the hash algorithm, the question is, should this be the encoding information or the hash algorithm information?

To me a CID is self-describing on how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways describing it:

1. The encoding is JSON and the hash algorithm is "canonicalize things first, then do a SHA-256" hash.
2. The encoding is canonicalized JSON and the hash algorithm is SHA-256.

In both cases you'd have all the information you need.
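
To make the two descriptions concrete, the sketch below builds a CIDv1-style byte string for each. It assumes the usual `<version><codec><hash code><digest length><digest>` layout with varint-encoded fields. `CANONICAL_JSON_CODEC` and `CANON_THEN_SHA256` are hypothetical placeholder codes rather than registered values, and `canonicalize` only approximates JCS.

```python
import hashlib
import json

JSON_CODEC = 0x0200   # 'json' in the multicodec table
SHA2_256 = 0x12       # 'sha2-256' in the multicodec table
# Hypothetical placeholders for illustration only, not registered codes.
CANONICAL_JSON_CODEC = 0x300001
CANON_THEN_SHA256 = 0x300002


def varint(n: int) -> bytes:
    # Unsigned LEB128, as used by multiformats.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def canonicalize(value) -> bytes:
    # Stand-in for a real canonicalization algorithm such as JCS.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def cid_v1(codec: int, hash_code: int, digest: bytes) -> bytes:
    return varint(1) + varint(codec) + varint(hash_code) + varint(len(digest)) + digest


doc = {"b": 1, "a": 2}
digest = hashlib.sha256(canonicalize(doc)).digest()

# Option 1: plain JSON codec, hash algorithm means "canonicalize, then SHA-256".
option_1 = cid_v1(JSON_CODEC, CANON_THEN_SHA256, digest)
# Option 2: "canonicalized JSON" codec, hash algorithm is plain SHA-256.
option_2 = cid_v1(CANONICAL_JSON_CODEC, SHA2_256, digest)
```

Both byte strings end in the same 32-byte digest. They differ only in which field records that canonicalization took place.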

dlongley · 2022-03-10T17:17:41Z

@vmx,

We would ideally like to design this in such a way that any hash algorithm from the multihash table could be used -- without having to create NxM combination codec values. So, we can express that some data was canonicalized with algorithm X (urdca2015 or jcs are the two most interesting values here right now) and then hashed with algorithm Y (any value from the multihash table). So we're just looking for the best way / format to allow this kind of parameterization so that all of the information needed (as you mentioned) is there.
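
One way to picture this parameterization, as a sketch rather than a proposed wire format: prefix an ordinary multihash with a varint-encoded canonicalization code, so that any canonicalization can pair with any hash without registering a combined code. The byte layout, the `HASHES` registry, and the function names are assumptions made for this example. 0xb403 is the urdca-2015-canon code from this PR, and the input is taken to be already canonicalized.

```python
import hashlib
from typing import Callable, Dict, Tuple

URDCA_2015_CANON = 0xB403  # code proposed in this PR

# Small excerpt of hash functions keyed by their multihash codes.
HASHES: Dict[int, Callable[[bytes], bytes]] = {
    0x12: lambda data: hashlib.sha256(data).digest(),  # sha2-256
    0x13: lambda data: hashlib.sha512(data).digest(),  # sha2-512
}


def varint_encode(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    value, shift = 0, 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def encode(canon_code: int, hash_code: int, canonical_bytes: bytes) -> bytes:
    # Assumed layout: <canon code><hash code><digest length><digest>
    digest = HASHES[hash_code](canonical_bytes)
    return (varint_encode(canon_code) + varint_encode(hash_code)
            + varint_encode(len(digest)) + digest)


def decode(buf: bytes) -> Tuple[int, int, bytes]:
    canon_code, pos = varint_decode(buf)
    hash_code, pos = varint_decode(buf, pos)
    length, pos = varint_decode(buf, pos)
    return canon_code, hash_code, buf[pos:pos + length]
```

Because the canonicalization code and the hash code travel as separate fields, a reader recovers both, and adding a new hash function requires no new canonicalization entries.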

vmx · 2022-03-11T09:34:37Z

@dlongley This means that urdca2015 and jcs aren't about hashing at all, they are about the step before the hashing. I still guess you want to use this as part of a CID, so the only possible place to put this identifier is the data codec (the CID spec names that "multicodec codec type"). The information there is used to know how to encode/decode the bytes that were addressed with the CID. Is JCS always JSON and URDCA2015 always XML? Or could other data formats also be canonicalized with such algorithms?

dmitrizagidulin · 2022-04-07T21:25:32Z

@vmx

> @dlongley This means that urdca2015 and jcs aren't about hashing at all, they are about the step before the hashing.

Right, exactly. They're essentially a second parameter to the multihash (what pre-processing steps must be taken with the data before hashing).

> Is JCS always JSON and URDCA2015 always XML? Or could other data formats also be canonicalized with such algorithms?

JCS is always JSON. URDCA2015 applies to any sort of RDF-based linked data (which includes JSON-LD, Turtle, RDF/XML, N-Quads, etc.).

> To me a CID is self-describing on how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways describing it:
>
> 1. The encoding is JSON and the hash algorithm is "canonicalize things first, then do a SHA-256" hash.
> 2. The encoding is canonicalized JSON and the hash algorithm is SHA-256.

Right, so, this is the tricky part. I'd say the situation is closer to 1 -- the hash algorithm is "canonicalize things first, then do a SHA-256" hash. And the encoding (of the hash) is multibase. (I'm not sure it's necessary to specify the encoding of the pre-hash data, though. Since the hash is a one-way operation.)

@vmx - would you be open to defining a new "canonized hash" tag?

rvagg · 2022-04-08T05:13:37Z

Finally found time to look at this and give my 2c.

Firstly, I'd like to make a bit of space after the poseidon* entries because we can expect more of those, maybe bump it to 0xb503 or even find a different space for it around that area.
I don't think I have an objection to making a new tag for this, it really is a different beast, and it's not like we have strong rules for that column anyway. It would probably be inappropriate to make it multihash or ipld or even serialization since it's not quite any of those.
I think I could see a path to this being used in CIDs if you implement it as a faux-multihash. Our implementations have ways of abstracting the multihash part of a CID such that you just need to be able to produce a digest. So, you could implement this as a layer on top of the existing multihash interfaces: you take existing multihash implementations and wrap them in this thing, and the multihash part of the CID is really a multihash(multihash), although as far as the CID implementations are concerned it's just the one multihash. That would be interesting to see work and there may be hiccups along the way. I'm not sure it's a great idea, but it doesn't seem impossible.
Having said all of that ^, using this for CIDs does feel a bit like a hack, to squish information into a CID because CIDv1 doesn't have the ability to convey quite enough information as it is. Maybe this goes into the wishlist bucket for CIDv2?
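
The faux-multihash idea can be sketched roughly as follows. This is only an illustration of the wrapping described above, not how any existing multiformats library is structured: `Hasher`, `wrap`, and `jcs_like` are invented names, `PLACEHOLDER_CANON_CODE` is a hypothetical code, and `jcs_like` only approximates JCS.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable

PLACEHOLDER_CANON_CODE = 0x300000  # hypothetical, not a registered code


def varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


@dataclass(frozen=True)
class Hasher:
    # Minimal stand-in for a multihash hasher: a code plus a digest function.
    code: int
    digest: Callable[[bytes], bytes]


def multihash(hasher: Hasher, data: bytes) -> bytes:
    digest = hasher.digest(data)
    return varint(hasher.code) + varint(len(digest)) + digest


def jcs_like(data: bytes) -> bytes:
    # Approximate JSON canonicalization: parse, then re-serialize sorted and compact.
    return json.dumps(json.loads(data), sort_keys=True, separators=(",", ":")).encode("utf-8")


def wrap(code: int, canonicalize: Callable[[bytes], bytes], inner: Hasher) -> Hasher:
    # The outer "digest" is the inner multihash of the canonicalized bytes,
    # so the result is a multihash(multihash) that still looks like one hasher.
    return Hasher(code=code, digest=lambda data: multihash(inner, canonicalize(data)))


sha256 = Hasher(code=0x12, digest=lambda data: hashlib.sha256(data).digest())
canon_sha256 = wrap(PLACEHOLDER_CANON_CODE, jcs_like, sha256)
```

Code that only expects something with a `code` and a `digest` can use `canon_sha256` unchanged, and the inner hash code stays recoverable from the outer digest bytes.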

vmx · 2022-04-08T09:59:50Z

I'd like to check if I understood the current outcome correctly.

The urdca-2015-hash is used in the multihash part of the CID. So a CID would look like this (I leave out the size information bits for simplicity):

<v1><can-e.g.-be-json-turtle-xml><urdca-2015-hash><the-hash-digest>

This points to some data.

Now I retrieve the data and I want to create a CID out of it. I would only know that I need to canonicalize the data before hashing, but I wouldn't know which hash algorithm to use. Is that correct?

rvagg · 2022-09-27T22:50:08Z

@dmitrizagidulin any changes to this you want to pursue so we can get this over the line in some form?

dmitrizagidulin · 2022-09-28T06:20:26Z

Hi @rvagg, thanks for checking in.
So, yeah, absolutely, we’ve got even more implementations in need of this mechanism on the way, so we definitely want to find some kind of solution. (I was chatting with @gobengo about this just yesterday, and he gave me a couple new vectors to consider.) So, let me review the issue and get back to you later today.

dmitrizagidulin · 2022-09-30T10:28:39Z

Hi @rvagg -- after some discussion with @gobengo, I've updated the PR (and resolved merge conflicts) to hopefully address some of your concerns.

> Firstly, I'd like to make a bit of space after the poseidon* entries because we can expect more of those, maybe bump it to 0xb503 or even find a different space for it around that area.

Totally understood wanting to make space -- I moved the JCS canonicalization entry to post-poseidon.
If at all possible, we would really like to keep urdna-2015-canon entry as 0xb403. (This is totally my fault, I dropped the ball on resolving this PR, and meanwhile the 0xb403 tag is being deployed to millions of Point-of-Sale systems (literally old-school cash registers) as part of a US-wide Age Verification project.)

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag. This is because json-jcs is essentially a standardized version of what dag-json does (sorts/canonicalizes JSON input so that it can be composed with hashing).

dlongley · 2022-09-30T14:35:26Z

@dmitrizagidulin,

> We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag.

Does that mean the existing implementations need to change? If not, why not?

dmitrizagidulin · 2022-10-03T17:48:14Z

> @dmitrizagidulin,
>
> > We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag.
>
> Does that mean the existing implementations need to change? If not, why not?

Hey @dlongley - no, no existing implementations need to change. The tag in the CSV file is conceptual / for organizing things into categories, it's not used in the code.
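
To illustrate why changing the tag is harmless for consumers, here is a small sketch that builds a name-to-code lookup from an excerpt of table.csv (rows copied from this PR's diff) without ever using the tag column. `load_codes` is a hypothetical helper written for this example, not taken from any multicodec implementation.

```python
import csv
import io

# Rows copied from the table.csv diff in this PR (no header row included).
TABLE_EXCERPT = """\
poseidon-bls12_381-a2-fc1,      multihash,      0xb401,         permanent, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters
poseidon-bls12_381-a2-fc1-sc,   multihash,      0xb402,         draft,     Poseidon using BLS12-381 and arity of 2 with Filecoin parameters - high-security variant
urdca-2015-canon,               ipld,           0xb403,         draft,     The result of canonicalizing an input according to URDCA-2015 and then expressing its hash value as a multihash value.
"""


def load_codes(text: str) -> dict:
    codes = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue
        name, _tag, code = (field.strip() for field in row[:3])
        # The tag is read but deliberately ignored: only name and code matter here.
        codes[name] = int(code, 16)
    return codes


codes = load_codes(TABLE_EXCERPT)
```

Swapping `ipld` for `canonhash` in the excerpt leaves the resulting mapping unchanged, which matches the point that the tag is organizational only.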

rvagg

yeah, OK, I think we can just merge these now, although I'll register two final comments:

I'm still unsure if ipld is the right way to go, serialization might be better, we tend to use ipld for schemes that yield linked data .. maybe this does, maybe it's a scheme that yields a single link, but the canonicalisation is also something that we do more in ipld than generic serialization schemes so 🤷.
The placement is pretty annoying, I'd really like to have reserved the 0xb4xx block for poseidon*. I get that you've deployed this and that's certainly a strong consideration, but still pretty annoying. It's going to be an ugly duckling amongst additional poseidon entries.

msporny · 2022-10-10T14:04:06Z

Thanks for the merge @rvagg.

To come back to ipld not being the right way to go, I agree. Can we just use "multiformat" for the tag name?

If not, what if we introduced a new "transformed-multihash" namespace? It's not clear to me what constitutes a "namespace" vs. a "multiformat".

rvagg · 2022-10-10T23:49:04Z

@msporny the tags really don't matter that much so it's not worth getting too hung up about it - I imagine a future point where we refactor a bunch of the organisational stuff and they become more relevant at which point we take a more holistic view of what we have and do some adjustment.

If something feels like it should be just "multiformat" then we should probably just invent a new tag for it - if you're making something that could be described in a new multiformat spec then make a tag as a new category. I'm not sure about "namespace", mostly I treat those as networking / libp2p related so usually not appropriate for hashing or encoding.

I'd be happy for someone to come up with a new tag for this, but maybe something broad enough that can fit other things too? transformed-multihash might work, it's a little long but it explains the purpose. multimultihash might be a bit too cute, compound-multihash is another option in the same theme.

RangerMauve · 2022-11-04T16:09:51Z

Can't believe I'm just seeing this now! Really glad that this has been put in place.

IMO IPLD is absolutely something that we should look into here since we can use this as a component of IPLD based database systems at large.

gobengo · 2023-03-31T23:27:04Z

table.csv

@@ -483,8 +483,10 @@ skein1024-1016,                 multihash,      0xb3df,         draft,
 skein1024-1024,                 multihash,      0xb3e0,         draft,
 poseidon-bls12_381-a2-fc1,      multihash,      0xb401,         permanent, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters
 poseidon-bls12_381-a2-fc1-sc,   multihash,      0xb402,         draft,     Poseidon using BLS12-381 and arity of 2 with Filecoin parameters - high-security variant
+urdca-2015-canon,               ipld,           0xb403,         draft,     The result of canonicalizing an input according to URDCA-2015 and then expressing its hash value as a multihash value.


Shouldn't this be urdna-2015-canon, with an n, not a c?

urdna

msporny · 2023-04-01

There is a debate raging over what we should call it. Traditionally, we used "n" to mean "normalization"... but it's generally accepted now that we should've said "canonicalization" since it's a more accurate description of what's happening. Thus, the "urdca" vs. "urdna" distinction. This is currently being discussed in the W3C RDF Dataset, Canonicalization, and Hashing Working Group (note that we didn't call it the "normalization" working group).

gobengo · 2023-04-01

@msporny is there a URI for that issue or do I need to file one? I just earlier today noticed meetings have started and I need to get that on my cal.

gobengo · 2023-04-01

w3c/rdf-canon#88

dmitrizagidulin mentioned this pull request Mar 8, 2022

URDNA2015 Support #149

Open

dmitrizagidulin force-pushed the canonz branch from 5194771 to 82f190f Compare April 7, 2022 21:25

Introduce 'canonhash' tag instead of c14n via multihash.

30da43e

dmitrizagidulin force-pushed the canonz branch from 82f190f to 6e186dc Compare September 30, 2022 10:20

dmitrizagidulin changed the title ~~Add 'jcs' and 'urdna2015' multihash values.~~ Add 'jcs' and 'urdna2015' canonicalization values. Sep 30, 2022

Re-add JCS, change tag from canonhash to ipld to match json-dag.

f2559ee

dmitrizagidulin force-pushed the canonz branch from 6e186dc to f2559ee Compare September 30, 2022 10:31

BigLep requested a review from rvagg October 4, 2022 22:38

rvagg approved these changes Oct 10, 2022

rvagg merged commit 5e275cd into multiformats:master Oct 10, 2022

gobengo mentioned this pull request Feb 3, 2023

IPFS-LD - Linked Data ipfs/ipfs#36

Closed

gobengo reviewed Mar 31, 2023

gobengo mentioned this pull request Apr 1, 2023

Should multiformats refer to URDNA or URDCA? w3c/rdf-canon#88

Closed

rvagg mentioned this pull request Jun 13, 2023

Rename urdca-2015-canon to rdfc-1-0 #328

Merged

dmitrizagidulin deleted the canonz branch March 5, 2024 22:51
