jsonschema: dialect identification #20

Open
1 of 4 tasks
ioggstream opened this issue Feb 17, 2022 · 5 comments · May be fixed by #26

Comments

@ioggstream
Collaborator

ioggstream commented Feb 17, 2022

I expect

  • to establish whether dialect identification should be addressed in this document ✔️
  • or in json-schema spec, and we just need to add a reference: ❌
  • in the first case, we need to define a procedure for dialect identification
  • reference definitions instead of copying them

cc: @Relequestual @jdesrosiers

@jdesrosiers
Contributor

I think it's very important for dialect identification to be defined here. In fact, I don't think we can reference any specific part of the JSON Schema spec because it's a moving target. The JSON Schema spec isn't one thing. Releasing a new dialect doesn't replace or obsolete the previous one. The old dialects are still valid and have implementations and users. If we point to the JSON Schema spec to define dialect identification, which version of the spec do we point to? Will the next release take precedence? I can't imagine that the IETF would accept the definition of a media type that can change without notice some time in the future.

My goal is to standardize a media type that defines the bare minimum necessary to identify which dialect the schema uses and then delegate semantics of the schema to wherever the dialect is defined. Dialects can't decide how dialects are identified because if they did, they might have different rules and it would be unclear which rules to apply. There needs to be one authority and that's where this I-D comes in.

Having the media type and dialect identification officially registered and stable means future JSON Schema releases can just be dialects without redefining the media type with every release. It also means third-party dialects don't need to be updated whenever a new JSON Schema dialect is released because they point to this stable I-D, not the previous JSON Schema release.

I hope that made sense.

@jdesrosiers jdesrosiers linked a pull request Feb 23, 2022 that will close this issue
@ioggstream
Collaborator Author

I think it's very important for dialect identification to be defined here
Dialects can't decide how dialects are identified because if they did, they might have different rules

Since dialect identification relies on the media type parameter, it is correct to define dialect identification in this I-D.

I don't think we can reference any specific part of the JSON Schema spec because it's a moving target

We then need to cite in future JSON Schema releases using the $schema keyword, that it's defined
according to the normative parts contained in this I-D.

I'll add this issue to the slide so we can get some editorial feedback regarding the best strategy for doing this.

@jdesrosiers
Contributor

We then need to cite in future JSON Schema releases using the $schema keyword, that it's defined
according to the normative parts contained in this I-D.

Yep. That's the plan.

@handrews
Collaborator

handrews commented Jun 6, 2022

Jumping back here from the discussion in PR #32, I want to follow up on what I think is a debate between:

  • @jdesrosiers wanting to specify dialect identification in the sense of where to look ($schema, the schema media type parameter, the enclosing context) and what the URI value means
  • @awwright wanting to reference the JSON Schema specification, which would involve filling in any gaps currently caused by the specification not explaining how to identify or process previous drafts.
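To make the first position concrete, here is a minimal Python sketch of a lookup over the three places named above. The precedence order ($schema first, then the media type parameter, then the enclosing context), the function name `identify_dialect`, and the assumption that the media type parameter is called `schema` are illustrative only; nothing in this thread fixes them.

```python
from typing import Any, Mapping, Optional


def identify_dialect(
    schema: Any,
    media_type_params: Optional[Mapping[str, str]] = None,
    context_dialect: Optional[str] = None,
) -> Optional[str]:
    """Return the dialect URI for a schema, or None if it cannot be determined.

    The lookup order is an assumption made for this sketch.
    """
    # 1. An explicit $schema keyword at the root of the schema document.
    if isinstance(schema, dict) and isinstance(schema.get("$schema"), str):
        return schema["$schema"]
    # 2. The media type parameter (assumed here to be named "schema").
    if media_type_params and "schema" in media_type_params:
        return media_type_params["schema"]
    # 3. A default supplied by the enclosing context, if any.
    return context_dialect


# Usage: a schema without $schema falls back to the media type parameter.
params = {"schema": "https://json-schema.org/draft/2020-12/schema"}
print(identify_dialect({"type": "object"}, params))
```

The sketch only answers "where to look"; what the returned URI means for processing is the separate question discussed below.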

Please let me know if I am mischaracterizing your positions!

Furthermore, @awwright observed:

I know we sometimes say there's "versions" of JSON Schema, but in this context that may be misleading: There's been many publications of JSON Schema over time, but newer publications replace older ones in their entirety (this is specified in the first few paragraphs).

in response to @jdesrosiers's concern that the multiple versions of the JSON Schema Core specification mean that it is not possible to use any JSON Schema Core spec as a stable reference.

I outlined what I thought a stable base specification could look like, but failed to make clear that I do not think that this media type registration needs to wait for all of that stable base to be finalized.

Problems with meta-schema URIs and dialect identification

$schema and analogues such as the schema media type parameter have always been problematic as indicators of which JSON Schema draft a.k.a. version is in use. Since draft-04, $schema was intended to be customized. This has meant that there are two dimensions that can be indicated in the same opaque URI:

  • which draft/version, which determines the processing rules
  • which customizations, which determines the keywords being processed

If you are only looking at standardized meta-schema URIs from the JSON Schema specification documents, then you are choosing among version dialects. If you know which processing rules are involved across all possible URIs and are looking at non-standardized URIs, then you are choosing among non-version dialects in the context of a known version.

The problem is that, without a known version, a custom URI obliterates the draft/version information, which is what signifies the processing rules. Assuming vocabularies are used as expected, we should expect a proliferation of custom $schema URIs. The intention was that $vocabulary URIs would be fairly stable, with the core vocabulary URI indicating the processing rules, while $schema URIs would be created as needed for different vocabulary combinations.

This was how I hoped to separate the processing rules from the keyword syntax and semantics. Otherwise I don't see how meta-schema URIs are viable for determining processing rules. Each implementation would have to know each URI in advance and know the processing rules associated with it, which defeats the purpose of being able to assemble a custom vocabulary [EDIT: I meant dialect] for any application.

JSON Schema "fragmentation"

Over in issue #32, in a conversation with @dret and @awwright, @jdesrosiers lamented that "it's unfortunate that JSON Schema has become fragmented". I've been trying to figure out what you meant, and I noticed that @awwright started https://github.com/orgs/json-schema-org/discussions/169 in response with "a more positive outlook."

@jdesrosiers, if you are using the proliferation of meta-schema URIs as a metric for fragmentation, I understand your concern! As noted above, the vocabulary system was intentionally designed with the expectation that meta-schema URIs (and the dialects they represent) would proliferate more-or-less uncontrollably, on a relatively stable base of vocabulary URIs.

I did this because I did not see any way to salvage $schema as it was, although something along the lines of your ideas in json-schema-org/json-schema-spec#918 (basically inlining vocabulary URIs or even keyword declarations under $schema as an object or array) could be viable in a schema document. But not, I think, as a media type parameter.

The upshot of this is I don't think it's sufficient to explain $schema and call that "dialect identification." I mean, it literally is dialect identification, but it doesn't identify the processing model in any reliably useful way. This isn't fragmentation, it's how the system is supposed to work. It's only the core processing that needs to remain unified.

An approach for handling both past and future.

@awwright has advocated for including instructions for processing past drafts in the next iteration of the spec. You could alternatively do at least some of that in this document. What would that look like? Here's a quick proposal that might be missing some things, but you get the idea:

  • draft-00 through draft-02 do not include any way to identify the processing rules, and I don't think I've ever seen them in the wild.
  • draft-03 and draft-04 work with $schema and id, and a $ref that replaces the object that contains it (I've rarely seen draft-03 in the wild, and not for several years, but I think it works the same as draft-04 which is common)
  • there is no draft-05 😜
  • draft-06 and draft-07 work with $schema, $id (with the same capabilities as id), and a replacing $ref

Each of the above could be enumerated, with their standardized meta-schemas. If you try to use them with custom meta-schemas, unless the implementation specifically recognizes them, you're out of luck.
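As a rough illustration, the enumeration could be a fixed table keyed by the standardized meta-schema URIs. The `LegacyRules` type, the `legacy_rules` function, and the tolerance for a missing trailing "#" are assumptions of this sketch, not something proposed in the thread.

```python
from typing import NamedTuple, Optional


class LegacyRules(NamedTuple):
    id_keyword: str    # keyword that sets the schema's base URI
    ref_behavior: str  # how $ref treats the object that contains it


# Standardized meta-schema URIs for the drafts listed above.
LEGACY_DIALECTS = {
    "http://json-schema.org/draft-03/schema#": LegacyRules("id", "replacing"),
    "http://json-schema.org/draft-04/schema#": LegacyRules("id", "replacing"),
    "http://json-schema.org/draft-06/schema#": LegacyRules("$id", "replacing"),
    "http://json-schema.org/draft-07/schema#": LegacyRules("$id", "replacing"),
}


def legacy_rules(schema_uri: str) -> Optional[LegacyRules]:
    """Look up processing rules for a legacy $schema value.

    Returns None for any URI not in the table, which covers custom
    meta-schemas that the implementation does not specifically recognize.
    """
    # Assumption: accept the URI with or without the empty fragment.
    key = schema_uri if schema_uri.endswith("#") else schema_uri + "#"
    return LEGACY_DIALECTS.get(key)


print(legacy_rules("http://json-schema.org/draft-04/schema#"))
print(legacy_rules("https://example.com/my-custom-meta-schema"))
```

The second call returns None, which is the "out of luck" case: with these drafts there is nothing inside a custom meta-schema URI that points back to the processing rules.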

From that point on, $schema and $vocabulary plus $id (with $anchor split out) and a delegating $ref are the rules. As noted elsewhere, it's quite plausible to consider all of that except the details of $vocabulary finalized. This is what I was trying to get at with the base specification stuff, but I kind of went overboard and obscured the point.

If $vocabulary remains stable enough that it's always possible to identify the core vocabulary URI, then that's sufficient to figure out how to process anything else (such as $dynamicRef, which is necessary for meta-schemas but almost certain to change). This is plausible because further $vocabulary functionality could be offloaded to a vocabulary description file identified by the vocabulary URI, which could be independently self-descriptive. So it's arguably not necessary to nail down the vocabulary system to reach a point of stability for bootstrapping.
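A minimal sketch of that bootstrapping step, assuming the implementation has already retrieved the meta-schema as a Python dict: it only needs to recognize a small set of core vocabulary URIs, not every meta-schema URI. The two URIs listed are the published 2019-09 and 2020-12 core vocabularies; the function name and the example.com identifiers are hypothetical.

```python
from typing import Any, Optional

# Core vocabulary URIs the implementation knows how to process.
KNOWN_CORE_VOCABULARIES = {
    "https://json-schema.org/draft/2019-09/vocab/core",
    "https://json-schema.org/draft/2020-12/vocab/core",
}


def core_vocabulary_uri(meta_schema: Any) -> Optional[str]:
    """Return the single recognized core vocabulary URI declared in a
    meta-schema's $vocabulary, or None if there is none or more than one."""
    vocab = meta_schema.get("$vocabulary") if isinstance(meta_schema, dict) else None
    if not isinstance(vocab, dict):
        return None
    found = [uri for uri in vocab if uri in KNOWN_CORE_VOCABULARIES]
    return found[0] if len(found) == 1 else None


# Usage: a custom dialect with an unfamiliar meta-schema URI still exposes
# its processing rules through the core vocabulary it declares.
custom_meta_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/dialects/my-dialect",  # hypothetical
    "$vocabulary": {
        "https://json-schema.org/draft/2020-12/vocab/core": True,
        "https://example.com/vocab/my-keywords": False,  # hypothetical
    },
}
print(core_vocabulary_uri(custom_meta_schema))
```

This is the contrast with a $schema-only approach: the meta-schema URI can be anything, and the implementation still finds the processing rules as long as the core vocabulary URI is one it recognizes.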

But I don't think $schema alone can possibly do it.

@jdesrosiers
Contributor

That's a lot of good stuff to discuss, @handrews. I don't have time to fully address all of that, but I'll quickly address what I meant by fragmentation. I was referring to OpenAPI and MongoDB defining their own custom versions of JSON Schema. That doesn't include OpenAPI 3.1 which is a dialect of 2020-12, but does include OpenAPI 2.0 and 3.0 that make their own rules. I would prefer that this media type be defined in a way that is inclusive of those rogue versions because OpenAPI 3.0 users, for example, are likely to want to use this media type as well.

Projects
Status: Document Update Required