Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keyword id to output unit #1065

Open
jdesrosiers opened this issue Jan 15, 2021 · 27 comments
Open

Add keyword id to output unit #1065

jdesrosiers opened this issue Jan 15, 2021 · 27 comments

Comments

@jdesrosiers
Copy link
Member

A schema using dialect-a can $ref a schema using dialect-b but, there is nothing in the output format that indicates the dialect that was used to determine each output unit result. This can be important for keywords whose semantics have changed over time, but their name hasn't changed. The ones that come to mind immediately are maximum/exclusiveMaximum, minimum/exclusiveMinimum, format, and links.

We could potentially determine the dialect for an output unit by using the absouteKeywordLocation to lookup the schema and inspect it's $schema property, but that has two problems. First, absoluteKeywordLocation is not required. Second, if the output is passed to a third party for processing, it might not have access to the original schema to look up it's dialect.

The simplest approach would be to add a field such as dialect to the output unit. However, It would be more useful to have a unique identifier for each keyword. That way, when we introduce a new draft, if a keyword doesn't change, it's ID doesn't have to change. That would make it easier for tools to do things with annotations where they don't care what dialect they are as long as they have the same semantics (such as title and description).

@gregsdennis
Copy link
Member

gregsdennis commented Jan 16, 2021

We should probably properly define a way to identify a dialect first. From 2020-12:

A dialect is defined as a set of vocabularies and their required support identified in a meta-schema.

It has no specific identifier, but perhaps the meta-schema $id (specified by $schema in the schema) could suffice since this effectively describes this set of vocabularies.

First, absoluteKeywordLocation is not required. - @jdesrosiers

The reason it's not required is because the relative location is sufficient when the navigated path contains no references.

That said, maybe it's sufficient to include the $schema value and/or $vocabulary value in the cases where there is ambiguity because

  • a dereference takes you to a new dialect (different meta-schema),
  • a keyword contains multiple semantic definitions due to the specific set of vocabularies include (this also brings up the namespacing concept we had discussed somewhere else).

I do like the idea of including something in the output, though.

@jdesrosiers
Copy link
Member Author

I think identifying a dialect is sufficiently defined.

The "$schema" keyword is both used as a JSON Schema dialect identifier and
as the identifier of a resource which is itself a JSON Schema, which describes the
set of valid schemas written for this particular dialect.

Let's not get into the choice to make absoluteKeywordLocation required or not. It's not relevant to this issue and it's just going to get us off on a tangent argument. We can save that discussion for another time. For this issue, it's sufficient to say that even if it is present, it's still not a solution to this problem.

maybe it's sufficient to include the $schema value and/or $vocabulary value in the cases where there is ambiguity

A vocabulary identifier is certainly an equivalent alternative to a dialect identifier, but I'd still prefer a keyword level identifier so we don't have a new version of every keyword for every draft even for the keywords that don't change. That would be a big help for annotation related tooling that I was working on a while ago (and hope to get back to soon).

I would prefer to always include this information rather than only when the dialect changes. Consider the case when I pass off my output to a third party library for processing. That third party library shouldn't need to know anything about the original schema and wouldn't know what dialect to process the schema as if it isn't in the output, even if the dialect doesn't change. You could pass the library a default dialect to use, but it would be easier for everyone involved if that information is just always there.

@handrews
Copy link
Contributor

handrews commented May 8, 2021

@jdesrosiers just make the keyword name a fragment on the vocabulary URI. We can declare the resources to be of media type application/schema-vocabulary+json and the fragment semantics to be that each keyword is usable as a fragment.

(We should also really register application/schema+json and if we want it application/schema-vocabulary+json)

@gregsdennis
Copy link
Member

This should be mentioned over in #973.

@jdesrosiers
Copy link
Member Author

@handrews

just make the keyword name a fragment on the vocabulary URI

That's exactly how I identify keywords now and it leaves me with the problem I described in my previous comment.

{
  "$id": "https://example.com/contrived-example",
  "allOf": [
    {
      "$id": "/foo",
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "title": "Foo"
    },
    {
      "$id": "/bar",
      "$schema": "https://json-schema.org/draft/2019-09/schema",
      "title": "Bar"
    },
    {
      "$id": "/baz",
      "$schema": "https://example.com/my-dialect",
      "title": "Baz"
    }
  ]
}

In this example, the title keywords would have the identifiers, https://json-schema.org/draft/2020-12/schema#title, https://json-schema.org/draft/2019-09/schema#title, and https://example.com/my-dialect#title respectively. Let's assume that title in https://example.com/my-dialect has different semantics that the standard JSON Schema semantics. If I have tooling that collects all the title annotations, the only way to do that is to maintain configurations that says when I say "title" I mean https://json-schema.org/draft/2020-12/schema#title or https://json-schema.org/draft/2019-09/schema#title and not https://example.com/my-dialect#title.

If each keyword has an identifier that doesn't change across drafts, we can avoid this problem. The title keywords from the example would be identified as, https://json-schema.org/keywords#title, https://json-schema.org/keywords#title, and https://example.com/my-dialect#title respectively. Now I can ask my tooling for all the https://json-schema.org/keywords#title annotations and not have to worry about if draft 2020-12, draft 2019-09, and my-dialect each have different semantics for title.

Of course it could be difficult to determine when an new keyword identifier should be minted. For example, if we changed description to allow an array of strings instead of just a string, does that deserve an new identifier? If I want to collect all the description information, both keywords express that information so maybe they should have the same identifier. However, tooling would break if it expected a string and suddenly started getting an array of strings, so maybe they should have different identifiers.

@gregsdennis
Copy link
Member

This sounds like a bigger issue than just in the output. Sure, the output is probably the primary benefactor, but it sounds like this is more about adding a property to vocabularies (maybe in line with @handrews' idea of a machine-readable vocab file) than extending output specifically.

@jdesrosiers
Copy link
Member Author

@gregsdennis I think I see what you mean. I'm open to discussing alternatives that also solve this problem.

@gregsdennis
Copy link
Member

This was also mentioned here.

I think restructuring the output based on subschemas instead of keywords solves this (related comment).

@gregsdennis
Copy link
Member

@jdesrosiers can you confirm whether your concerns here were covered in #1249?

@jdesrosiers
Copy link
Member Author

can you confirm whether your concerns here were covered in #1249?

No, I don't see how those changes address this problem. There's still no indication in the output unit of what dialect or vocab the keyword is defined in.

@handrews
Copy link
Contributor

@jdesrosiers @gregsdennis can we for now settle on the dialect IRI as the thing to be added? Or if we want this issue to be about a more comprehensive identification (dialect+vocab+keyword), can we split out an issue for just dialect identification?

@gregsdennis
Copy link
Member

I question the need for dialect identification in the output.

The schema doesn't declare individual keywords by their vocab; it just uses the keyword. For example, there is no https://json-schema.org/draft/next/vocab/core#$id; it's just $id. A meta-schema that says "these keywords are available," then the schema just includes the keywords without any mention of what vocab that keyword came from. I don't see why it should be any different for the output.

If you really need to know what vocab a keyword came from, that information is traceable by going back to the schema that produced the output and looking at its meta-schema. I don't see the value in having that information in every instance of the output, especially not for every listing of a keyword.

there is nothing in the output format that indicates the dialect that was used to determine each output unit result.

There is: schemaLocation gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.

From the examples in the current doc:

    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/schemas/example#",
      "instanceLocation": "",
      "annotations": {
        "title": "root",
        "properties": [
          "foo",
          "bar"
        ]
      }
    }

Properties are given here. Schema location is given here. You can look up the dialect if you want.

You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.

@handrews
Copy link
Contributor

handrews commented Sep 28, 2022

Properties are given here. Schema location is given here. You can look up the dialect if you want.

If I call an API that, somewhere on the back end, uses JSON Schema to evaluate some data and produce annotations, and then sends that annotation data back to me without sending me the schema, no, I can't just look up the dialect.

This is a common problem, and is why resources typically have "self" links even though the self link URI is probably the URI that you just got it from. Because sometimes you end up with a resource representation with no way to tell where it came from. It's also why it's a good idea to have an absolute $id in a schema even if you don't have any relative URI-references in it.

We do not need the dialect URI in every output unit. We just need a map of the absolute URIs (without fragments) in schemaLocation to dialect URIs. It can go at the top level, since the schemaLocation is sufficient to determine which dialect applies to a given output unit.

@gregsdennis
Copy link
Member

gregsdennis commented Sep 28, 2022

Can someone please post an example of how they want this fit into the output?

@jdesrosiers
Copy link
Member Author

You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.

Part of the proposal was to define URIs to identify each keyword. That would definitely be a requirement for adding a keyword URI in the output.

I still think this is the best solution in the long run because it would allow output format post-processors to know that one keyword has the same semantics as another. For example, someone could write a dialect that has the same semantics as the standard dialect except all the keywords are translated to Spanish. Tools that can process the standard dialect output should be able to process the Spanish dialect output as well because the semantics of the keywords should be the same. Of course this would require changes to the vocabulary system in addition to just defining keyword URIs, but I think this would be a good path to be on.

schemaLocation gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.

I mention in the issue description why I didn't think that was a good enough solution and @handrews reiterated that view point.

Can someone please post an example of how they want this fit into the output?

Honestly I haven't had a chance to fully think through how this would fit in to the new output format. We could figure out how to incorporate a dialect ID into what we have now, but I think that would be a partial solution. I'd like to explore incorporating the idea of keyword URIs to the vocabulary system first and then see what that means for the output format. We can probably just put this issue on hold for now.

@handrews
Copy link
Contributor

handrews commented Sep 28, 2022

This is almost exactly the example in your recent PR, except I changed the output units with an evaluation path prefix of /properties/bar/$ref so that they jumped to a different schema resource (https://json-schema.org/schemas/other) on that $ref, to which I assigned a different dialect/version. Please disregard the fact that the output spec doesn't match the dialects and draft-07 didn't support annotations like this.

{
  "valid": true,
  "dialects": {
    "https://json-schema.org/schemas/example": "https://json-schema.org/draft/2020-12/schema",
    "https://json-schema.org/schemas/other": "https://json-schema.org/draft-07/schema"
  },
  "nested": [
    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/schemas/example#",
      "instanceLocation": "",
      "annotations": {
        "title": "root",
        "properties": [
          "foo",
          "bar"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo/allOf/1",
      "schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1",
      "instanceLocation": "/foo",
      "annotations": {
        "title": "foo-title",
        "properties": [
          "foo-prop"
        ],
        "additionalProperties": [
          "unspecified-prop"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar/$ref",
      "schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar",
      "instanceLocation": "/bar",
      "annotations": {
        "title": "bar-title",
        "properties": [
          "bar-prop"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo/allOf/1/properties/foo-prop",
      "schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1/properties/foo-prop",
      "instanceLocation": "/foo/foo-prop",
      "annotations": {
        "title": "foo-prop-title"
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar/$ref/properties/bar-prop",
      "schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar/properties/bar-prop",
      "instanceLocation": "/bar/bar-prop",
      "annotations": {
        "title": "bar-prop-title"
      }
    }
  ]
}

@handrews
Copy link
Contributor

@gregsdennis the above example is only for the dialect IRI. I'm not currently concerned with anything more granular (hence me asking if we can either agree to just do this, or split this out to its own issue if folks want to keep working on the more granular thing here).

So now if I want to know the dialect for that last output unit, I look at "schemaLocation": "https://json-schema.org/schemas/other#", chop off the fragment, and look it up as something like output["dialects"]["https://json-schema.org/schemas/other"]

@handrews
Copy link
Contributor

🤦 I initially copy-pasted the diff instead of the changed file, so the example two comments above was a mess. It's fixed now.

@gregsdennis
Copy link
Member

I think that's doable, but defining how dialects are declared seem to still be an open topic on Keyword for identifying bootstrapping rules #217. We should nail that down before deciding how to put it in the output.

@handrews
Copy link
Contributor

@gregsdennis I am working on a replacement proposal for #217, but that depends on how some things regarding the SDLC play out.

For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.

@gregsdennis
Copy link
Member

For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.

My point is that we don't have the "dialect IRI" defined as a concept. We seem to be using the meta-schema IRI, but is that really the same thing? It doesn't make sense to include something that isn't defined yet.

@handrews
Copy link
Contributor

Yes, it's definitely the same thing. I thought that was made clear in the spec. If not, that's just an oversight.

@gregsdennis
Copy link
Member

Okay. I still want to hold this open until process is defined. Less to think about.

@gregsdennis
Copy link
Member

Coming back to this, the idea has warmed on me.

I like the idea of the root output unit not actually being an output unit but a host for other evaluation data with the actual evaluation results in the details property.

From this comment and the one that follows it, I think it makes sense to have keywords identified. This aligns with the ideas to resolve keyword collisions (also from that discussion) in that their output need to be reported somehow.

I imagine output will go through some more refinement soon.

@gregsdennis
Copy link
Member

@jdesrosiers would it make sense to only include dialect in nodes where the dialect changes, or would you have it in every node?

@jdesrosiers
Copy link
Member Author

It could work either way, but I would include it in every node. I feel like if I was writing code to process this output, it would be annoying if I had to look in two places to determine the dialect. It would be easier if it was just always in the same place when it's needed.

@gregsdennis
Copy link
Member

Yeah, thinking about it again, it would only make any sense at all to omit it in some nodes for the hierarchical format. The list format would still need it in every node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants