Referential integrity for $schema and any external references #1126
Comments
relevant parts of the spec:
We could potentially introduce a new keyword that specified checksums for each document used in an $id or $schema keyword, to ensure that invalid documents are not injected into the implementation... say
But we'd have to define how the checksum is determined. Is it just a hash of the input file? Then the checksum will differ if the file format is YAML instead of JSON, or if the file has extra whitespace versus none. JSON Schema doesn't care about the file or file format itself; it is only interested in the content once it has been decoded into the JSON document model. |
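To make the concern concrete, here is a minimal sketch (the two byte strings are made-up examples, not taken from any real schema) showing that a hash of the raw file changes with whitespace and key order even though the decoded JSON value is identical:

```python
import hashlib
import json

# Two hypothetical serializations of the same JSON value;
# only whitespace and key order differ.
compact = b'{"type":"object","required":["name"]}'
pretty = b'{\n  "required": ["name"],\n  "type": "object"\n}\n'


def file_hash(raw: bytes) -> str:
    """Hash the raw bytes exactly as they are stored."""
    return hashlib.sha256(raw).hexdigest()


# Compare at the byte level and at the decoded-document level.
same_bytes = file_hash(compact) == file_hash(pretty)
same_model = json.loads(compact) == json.loads(pretty)

print("byte-level hashes match:", same_bytes)
print("decoded documents match:", same_model)
```

So a checksum over the file pins one specific serialization, not the schema as JSON Schema understands it.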
Good point, not everything needs to be a network resource. And I still really don't care where the resources come from. For defining how the checksum is determined, we can wholesale steal the W3C SRI specification. I agree that JSON Schema doesn't care about the file format, just like HTML/CSS does not care about extra whitespace in a CSS file. But other standards happily use hashes of the binary files for this purpose, and we can steal that approach. |
Can you provide more information about this? |
Here is how they do it: https://w3c.github.io/webappsec-subresource-integrity/#hash-functions |
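For readers who don't want to click through: SRI integrity values have the form `<algorithm>-<base64 digest>`, computed over the raw bytes of the resource. A rough sketch of that computation follows; the function names `sri_string` and `matches` are my own, and the sample bytes are made up:

```python
import base64
import hashlib

# Hash algorithms that SRI names; anything else is rejected here.
ALLOWED = {"sha256", "sha384", "sha512"}


def sri_string(raw: bytes, algorithm: str = "sha384") -> str:
    """Build an SRI-style value: '<algorithm>-<base64 of the raw digest>'."""
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, raw).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def matches(raw: bytes, integrity: str) -> bool:
    """Recompute the digest with the named algorithm and compare."""
    algorithm = integrity.partition("-")[0]
    if algorithm not in ALLOWED:
        return False
    return sri_string(raw, algorithm) == integrity


document = b'{"type": "string"}'  # hypothetical schema bytes
value = sri_string(document)
print(value.split("-", 1)[0], matches(document, value))
```

Note that this hashes bytes, so it inherits the whitespace sensitivity discussed above.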
I'm not sure I see the problem. JSON Schema defines how schemas are identified (RFC 3986) and leaves it as an implementation detail how to store and retrieve those schemas.
How is the meaning of a schema document affected by how it is stored and retrieved? How a URI is associated with a schema is clearly defined. What difference does it make if the schema comes from an in-memory cache, a database, or the network? That's just swapping out the backend.
I don't see how a URI is insufficient to express which document is being referred to. I do see how this could help make retrieval more secure if the document is retrieved over the network, but the spec is clear that documents are not expected to be fetched over the network. Implementations that do support this (which is rare, especially for So, adding this feature might be a bit out of scope. If this proposal ends up going in that direction, it could still be a vocabulary that people who write schemas intended to be fetched over the network can adopt. I can see such a vocabulary being adopted for JSON Hyper-Schema, since fetching schemas over the network is a natural part of how hyper-schema works. |
If you read the statement "I like HotBot.com", this is insufficient to express which document is being referred to. Is it HotBot VPN? Or, more likely, are they referring to the popular web search engine hosted there in 1999? That question might sound silly because the internet has changed so much in the past twenty years. But when a single piece of artwork sells for many millions of dollars at auction, and the only thing backing that artwork is a JSON document attached to a JSON Schema, and this document is expected to have the same meaning decades from now... then that linkage becomes very important. |
That explanation makes sense for a long-lived distributed system with no centralized control, but that's not what JSON Schema is. When JSON Schema says that the identifier for the dialect is The same goes for the meta-schema. |
Does this mean that broadly speaking a JSON validator which supports validation to a schema is NOT expected to work with arbitrary schemas? Instead we are expected to use a JSON-validator-for-package.json-files program and a JSON-validator-for-NFT-files program? Basically each program hardcodes which metaschema(s) it supports. |
No, you are confusing schemas with metaschemas. A metaschema describes the semantics under which the schema itself runs. Schemas are arbitrary, and are intended to be parsed at runtime when evaluating data instances. Each schema uses the semantics identified by its "$schema" keyword, which references the logic baked into each specification version (as described in the spec documents). |
Thank you for your patience. So we have:
|
Ahh, you're talking about using |
I get that part. And yes I want to standardize that. I'll work on that separately. But for this issue I even want the Specification-for-describing-package.JSON-files file to be locked down hardcore if it uses any vocabularies that are not from the well-known IETF spec. |
The latest few versions of the specification state that you can use request or response headers to do so, with a new MIME type -- but this information is not in the document itself: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-00#section-14.2 |
Currently the JSON Schema specification allows referencing external files using a hyperlink. This is a very loose reference, specifically:
This leaves the schema in this case (the one referring to other.json) insufficiently expressive. If the author of the schema wants to say "I refer to the meta-schema hosted at https://example.com/other.json", they have no way to express it. Instead they can only make the far less useful statement "I refer to the meta-schema identified as https://example.com/other.json". This means that the meaning of every schema document is extremely implementation-dependent (even if the implementations all behave the same way). Isn't this an underspecification in the JSON Schema specification?
There may not be an appetite to update the JSON Schema specification to explain how retrieval of resources over the internet works. That process is not consistent or reliable, and it depends on HTTPS/SSL/MITM and a lot more.
Instead, is there some other way we can build referential integrity into the standard? Maybe something like this:
This would only be applicable to whole documents, not partial resources (because it depends on the full binary representation of the JSON file, which is not unique).
We could reuse the approach W3C uses for Subresource Integrity.
The end result would be that the JSON Schema specification still does not specify how resources are downloaded, but it allows schema authors to express clearly which document they are referring to.
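As an illustration of that separation, here is a sketch in which the integrity check does not care where the bytes come from. Everything named here is hypothetical: `load_schema`, `IntegrityError`, the in-memory `store`, and the way the expected value is passed in. No such keyword or API exists in JSON Schema today.

```python
import base64
import hashlib
import json
from typing import Any, Callable


class IntegrityError(Exception):
    """Raised when retrieved bytes do not match the pinned digest."""


def sha384_integrity(raw: bytes) -> str:
    """SRI-style value over the raw bytes of the document."""
    digest = hashlib.sha384(raw).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def load_schema(uri: str, expected: str, fetch: Callable[[str], bytes]) -> Any:
    """Retrieve via any backend, then verify before decoding."""
    raw = fetch(uri)
    if sha384_integrity(raw) != expected:
        raise IntegrityError(f"{uri}: content does not match {expected}")
    return json.loads(raw)


# Hypothetical backend: an in-memory dict standing in for a cache,
# a database, or the network.
uri = "https://example.com/other.json"
store = {uri: b'{"type": "string"}'}
pinned = sha384_integrity(store[uri])  # what the schema author would record

schema = load_schema(uri, pinned, store.__getitem__)

# If the document behind the same URI later changes, the check fails.
store[uri] = b'{"type": "number"}'
try:
    load_schema(uri, pinned, store.__getitem__)
    tamper_detected = False
except IntegrityError:
    tamper_detected = True

print(schema, tamper_detected)
```

The `fetch` argument is the only place retrieval appears, so swapping the backend leaves the integrity statement unchanged.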
Background: I am lead author of ERC-721 (the Non-fungible Token standard) and am focused on high-value, long-term, immutable metadata documents that validate against JSON Schemas.