Referential integrity for $schema and any external references #1126

Open
fulldecent opened this issue Sep 14, 2021 · 13 comments

@fulldecent

Currently the JSON Schema specification allows referencing external files using a hyperlink. This is a very loose reference; specifically:

When an implementation encounters the reference to "other.json", it resolves this to https://example.net/other.json, which is not defined in this document. If a schema with that identifier has otherwise been supplied to the implementation, it can also be used automatically.

The schema in this case (the one referencing other.json) appears to be insufficiently expressive. If the author of the schema wants to say "I refer to the meta-schema hosted at https://example.com/other.json", they have no way to express that. Instead they can only make the much weaker statement "I refer to the meta-schema identified as https://example.com/other.json". This means that the meaning of every schema document is extremely implementation-dependent. (Even if they are implemented the same way.) Isn't this an underspecification of the JSON Schema specification?

There may not be an appetite to update the JSON Schema specification to explain how retrieving resources over the internet works. That process is neither consistent nor reliable, and it depends on HTTPS/SSL/MITM considerations and a lot more.

Instead, is there some other way we can include referential integrity into the standard? Maybe something like this:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$schema-sri": "sha384-F3w7mX95PdgyTmZZMECAngseQB83DfGTowi0iMjiWaeVhAn4FJkqJByhZMI3AhiU"
}

This would only be applicable to whole documents, not partial resources (because it depends on the full binary representation of the JSON file, which is not unique).

We could reuse the approach W3C uses for Subresource Integrity.
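
As a rough illustration of what an SRI-style value for a schema document could look like, here is a minimal sketch in Python. It follows the `algorithm-base64digest` shape used by W3C Subresource Integrity; the function name `sri_digest` and the sample document are assumptions made for this example, and `$schema-sri` itself is only the keyword proposed above, not part of any JSON Schema draft.

```python
import base64
import hashlib


def sri_digest(raw: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style string ("<algorithm>-<base64 digest>") for raw bytes.

    The digest covers the exact bytes of the file, so any change in
    whitespace or key order produces a different value.
    """
    digest = hashlib.new(algorithm, raw).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical schema document, used only for illustration.
document = b'{"type": "object"}'
integrity = sri_digest(document)
print(integrity)
```

The printed string is what a schema author would place next to the reference, for example as the value of the proposed `$schema-sri` keyword.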

The end result would be that, JSON Schema specification still does not specify how you are to download resources, but it allows schema authors to express clearly which document they are referring to.


Background: I am lead author of ERC-721 (the Non-fungible Token standard) and am focused on high-value, long-term, immutable metadata documents that validate against JSON Schemas.

@fulldecent changed the title from "Referential integrity for $schema" to "Referential integrity for $schema and any external references" on Sep 14, 2021
@karenetheridge
Member

relevant parts of the spec:

Note that this URI is an identifier and not necessarily a network locator. In the case of a network-addressable URL, a schema need not be downloadable from its canonical URI.
(https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.8.2.1)

The resolved URI produced by these keywords is not necessarily a network locator, only an identifier. A schema need not be downloadable from the address if it is a network-addressable URL, and implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI.
(https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.8.2.3)

We could potentially introduce a new keyword that specified checksums for each document used in an $id or $schema keyword, to ensure that invalid documents are not injected into the implementation... say

  "checksums": {
    "https://json-schema.org/draft/2020-12/schema": "...",
    ...
  }

But we'd have to define how the checksum is determined. Is it just a hash of the input file? Then the checksum will be different if the file format is YAML instead of JSON, or has whitespace vs. no extra whitespace. JSON Schema doesn't care about the file or file format itself -- it is only interested in the content once it has been decoded into the JSON document model.
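
The concern about file-level hashes can be shown with a small Python sketch. Two byte strings below decode to the same JSON document model but hash differently; hashing a re-serialized form makes them agree. The `canonical_bytes` helper is an ad-hoc assumption for illustration (sorted keys, no extra whitespace), not a canonicalization scheme defined by JSON Schema.

```python
import hashlib
import json

# Two serializations of the same schema: different whitespace and key order.
compact = b'{"type":"object","required":["name"]}'
pretty = b'{\n  "required": ["name"],\n  "type": "object"\n}'

# Once decoded, both are the same JSON document model.
same_model = json.loads(compact) == json.loads(pretty)

# Hashing the raw files gives different digests.
same_file_hash = (
    hashlib.sha384(compact).hexdigest() == hashlib.sha384(pretty).hexdigest()
)


def canonical_bytes(raw: bytes) -> bytes:
    """Re-serialize JSON with sorted keys and no insignificant whitespace.

    Illustrative only: a real proposal would have to pin down number
    formatting, string escaping, and similar details.
    """
    return json.dumps(
        json.loads(raw), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


# Hashing the re-serialized form gives matching digests.
same_canonical_hash = (
    hashlib.sha384(canonical_bytes(compact)).hexdigest()
    == hashlib.sha384(canonical_bytes(pretty)).hexdigest()
)

print(same_model, same_file_hash, same_canonical_hash)
```

Either choice has a cost: a file-level hash is simple but tied to one serialization, while a model-level hash needs an agreed canonical form.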

@fulldecent
Author

Good point, not everything needs to be a network resource. And I still really don't care where the resources come from.

For defining how the checksum is determined, we can wholesale steal the W3C SRI specification.

I agree that the JSON Schema doesn't care about the file format. Just like HTML/CSS does not care about extra whitespace in the CSS file. But standards are happily using hashes of binary files for this purpose and we can steal that approach.
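
A minimal sketch of that approach in Python, assuming the `checksums` map suggested earlier in this thread holds SRI-style strings and that schema documents have already been supplied to the implementation as raw bytes. The `matches_integrity` helper, the `supplied` dictionary, and the sample URI are hypothetical names for this example; nothing here is defined by the JSON Schema specification.

```python
import base64
import hashlib
import hmac

ALLOWED_ALGORITHMS = {"sha256", "sha384", "sha512"}


def matches_integrity(raw: bytes, integrity: str) -> bool:
    """Check raw file bytes against an SRI-style "<algorithm>-<base64>" string."""
    algorithm, _, expected = integrity.partition("-")
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    actual = base64.b64encode(hashlib.new(algorithm, raw).digest()).decode("ascii")
    # Constant-time comparison of the two base64 strings.
    return hmac.compare_digest(actual, expected)


# Hypothetical document and checksum map, for illustration only.
SCHEMA_BYTES = b'{"type": "object"}'
good = "sha384-" + base64.b64encode(hashlib.sha384(SCHEMA_BYTES).digest()).decode("ascii")
checksums = {"https://example.com/other.json": good}

# Documents supplied to the implementation by whatever means (file, cache, network).
supplied = {"https://example.com/other.json": SCHEMA_BYTES}

verified = {
    uri: matches_integrity(supplied[uri], value) for uri, value in checksums.items()
}
print(verified)
```

The check is independent of where the bytes came from, which matches the point that the source of the resource does not matter.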

@karenetheridge
Member

we can wholesale steal the W3C SRI specification

Can you provide more information about this?

@fulldecent
Author

@jdesrosiers
Member

I'm not sure I see the problem. JSON Schema defines how schemas are identified (RFC-3986) and leaves it as an implementation detail how to store and retrieve those schemas.

This means that the meaning of every schema document is extremely implementation-dependent.

How is the meaning of a schema document affected by how they are stored and retrieved? How a URI is associated to a schema is clearly defined. What difference does it make if that comes from an in-memory cache, a database, or the network? That's just swapping out the backend.

The end result would be that, JSON Schema specification still does not specify how you are to download resources, but it allows schema authors to express clearly which document they are referring to.

I don't see how a URI is insufficient to express which document is being referred to. I do see how this could help make retrieval more secure if the document is retrieved over the network, but the spec is clear that documents are not expected to be fetched over the network. Implementations that do support this (which is rare, especially for $schema), are providing features beyond what is specified by JSON Schema.

So, adding this feature might be a bit out of scope. If this proposal ends up going that direction, it could still be a vocabulary that people who write schemas which are intended to be fetched over the network can adopt. I can see such a vocabulary being adopted for JSON Hyper-Schema since fetching schemas over the network is a natural part of how hyper-schema works.

@fulldecent
Author

If you read the statement "I like HotBot.com", this is insufficient to express which document is being referred to.

Is it HotBot VPN? Or, more likely, are they referring to the popular web search engine hosted there in 1999?

That question might sound silly because the internet has changed so much in the past twenty years. But when a single piece of artwork sells for many millions of dollars at auction, and the only thing backing that artwork is a JSON document attached to a JSON Schema, and this document is expected to have the same meaning decades from now... then that linkage becomes very important.

@jdesrosiers
Member

That explanation makes sense for a long lived distributed system with no centralized control, but that's not what JSON Schema is.

When JSON Schema says that the identifier for the dialect is https://json-schema.org/draft/2020-12/schema, that doesn't mean, go fetch that thing and whatever you get determines the semantics of the schema. The spec defines that this URI identifies the semantics defined in that version of the spec. It can't change over time. It's baked into the spec.

The same goes for the meta-schema. $schema can identify a meta-schema to validate that the schema appears to be a valid schema. Again, this is just an identifier that identifies the schema in this repository. We happen to host the schema at that address for convenience, but even if we took that down or replaced it with an image of puppies, schemas would not break. The URI still identifies the schema in this repository no matter what the URI resolves to on the web. Implementations keep a copy of the meta-schemas for the dialects they support like any other dependency. They don't fetch them from over the network.
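
To make the "identifier, not locator" idea concrete, here is a minimal Python sketch of how an implementation might resolve a URI against schemas it already holds, without any network access. The `SchemaRegistry` class and the placeholder meta-schema body are assumptions for illustration and do not reflect any particular implementation's API.

```python
class SchemaRegistry:
    """Map schema URIs to schema documents that were supplied ahead of time."""

    def __init__(self):
        self._schemas = {}

    def register(self, uri: str, schema: dict) -> None:
        self._schemas[uri] = schema

    def resolve(self, uri: str) -> dict:
        # The URI is only a lookup key; no network request is made.
        try:
            return self._schemas[uri]
        except KeyError:
            raise LookupError(f"no schema registered for {uri!r}") from None


DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema"

registry = SchemaRegistry()
# Placeholder body: a real implementation would bundle the actual meta-schema.
registry.register(DRAFT_2020_12, {"$id": DRAFT_2020_12, "title": "bundled copy"})

meta_schema = registry.resolve(DRAFT_2020_12)
print(meta_schema["title"])
```

Swapping the dictionary for a database or cache changes the backend but not which schema the URI identifies.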

@fulldecent
Author

Does this mean that broadly speaking a JSON validator which supports validation to a schema is NOT expected to work with arbitrary schemas?

Instead we are expected to use a JSON-validator-for-package.json-files program and a JSON-validator-for-NFT-files program?

Basically each program hardcodes which metaschema(s) it supports.

@karenetheridge
Member

Does this mean that broadly speaking a JSON validator which supports validation to a schema is NOT expected to work with arbitrary schemas?

No, you are confusing schemas with metaschemas. A metaschema contains the semantics under which the schema itself runs. Schemas are arbitrary, and are intended to be parsed at runtime when evaluating data instances. Each schema uses the semantics identified by its "$schema" keyword, which references the logic baked into each specification version (as described in the spec documents).

@fulldecent
Author

fulldecent commented Sep 22, 2021

Thank you for your patience.

So we have:

  1. Some package.json file
    • This links (or should) to the next thing using $schema
    • There is no integrity in this link, can get rugpulled
  2. A definition of how package.json files should be validated and interpreted
    • This links (should and usually does) to the next thing using $schema
    • There is no integrity in this link, but that's okay because the next thing below is some unitary IETF standard
  3. The big thing that IETF is going to standardize
    • This doesn't link anywhere, it is freestanding

@jdesrosiers
Member

Ahh, you're talking about using $schema in a document as a way to reference a schema that describes that document. That's not actually a JSON Schema thing. It's a convention that VS Code (and maybe some others) use to associate a document with a JSON Schema. There's no standard way for a document to reference a schema that describes it.

@fulldecent
Author

I get that part. And yes I want to standardize that. I'll work on that separately.

But for this issue I even want the Specification-for-describing-package.JSON-files file to be locked down hardcore if it uses any vocabularies that are not from the well-known IETF spec.

@karenetheridge
Member

karenetheridge commented Sep 23, 2021

There's no standard way for a document to reference a schema that describes it.

The latest few versions of the specification state that you can use request or response headers to do so, with a new MIME type -- but this information is not in the document itself: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-00#section-14.2
