
Need more details of annotation collection #530

Closed
handrews opened this issue Jan 9, 2018 · 29 comments

Comments

@handrews
Contributor

handrews commented Jan 9, 2018

There should be pseudocode and a sample/recommended output format, as there is with Hyper-Schema. This should include guidance on how annotation collection can be implemented as part of the same evaluation process as validation.

  • Should explain that annotations are dropped whenever a validation check fails; the lack of annotations from not and non-validating branches of oneOf, if, etc. is a natural consequence of that
  • Should cover how to indicate the schema from which the annotation was contributed- see example below for concerns around $ref
  • Should note that there are both static annotations (all of the ones in the validation spec) and dynamic annotations that change based on the instance's contents (links and base in the hyper-schema spec)
  • Should perhaps give some guidance around complex object annotations (e.g. links): how do they fit into the overall JSON Schema model? Are there any particular limitations?
  • Should provide guidance on whether keywords can only register annotations under the keyword name or can define different annotation names. The use case for different names would include defining an annotation to which multiple keywords contribute, and wanting to avoid associating the annotation with one specific keyword

Some of the above may get split out into their own issues.

The following example shows the motivation for indicating where we were in the schema when the annotation was contributed. A common request is to be able to figure out which title or description to use, with a typical desired heuristic being that values in parent schemas are preferred over values in child schemas. Note that "child" in this case is what we might call the logical child, following $refs, rather than the physical child (a nested subschema within the current schema object).

So let's have a schema:

{
    "type": "object",
    "properties": {
        "foo": {
            "title": "Title adjacent to reference",
            "description": "Lots of text",
            "readOnly": true,
            "$ref": "#/definitions/fooDef"
        }
    },
    "definitions": {
        "fooDef": {
            "title": "Title in reference target",
            "description": "Even more text"
        }
    }
}

With an instance of:

{
    "foo": "whatever"
}

The annotation results might look like this (I literally made up how $ref appears in the JSON Pointer-ish thing while typing this example, so don't take that as set in stone):

{
    "/foo": {
        "title": {
            "/properties/foo/$ref/title": "Title in reference target",
            "/properties/foo/title": "Title adjacent to reference"
        },
        "description": {
            "/properties/foo/$ref/description": "Even more text",
            "/properties/foo/description": "Lots of text"
        },
        "readOnly": true
    }
}

"/foo" is the instance pointer- this is the place in the instance to which the annotations apply.

Under that, the property names are the names of each annotation. All of these are also the keyword names that produced them, but in theory you could do something that doesn't map directly (please allow me to wave my hands on that- it's related to a complicated proposal that's under discussion and may not end up being relevant).

Under each annotation, the format is determined by the annotation.

readOnly just has a boolean, because it's explicitly stated (as of draft-07) that the overall value of readOnly is a logical OR of all applicable values. So there's no need to keep track of which value came from where- just let the application know the final outcome.

On the other hand, the way "title" and "description" are shown here is probably the default behavior. For those, the "/properties/foo/..." keys are schema pointers, more or less. Pointing across $ref is a bit weird and I need to figure out if that's the right approach. But the idea is to be able to tell that one of those is a runtime-parent of the other, and allow applications to make a decision based on that.

It might be better to use an array of terms instead of a not-quite-JSON Pointer. And it's not yet clear to me whether the keyword itself should be at the end of the pointer-ish thing / array. That also has to do with how much we want to abstract away the keyword source of an annotation. In the case of something like title or readOnly, we really do just want literally what the keyword says.

@epoberezkin
Member

epoberezkin commented Jan 10, 2018

Are collected annotations defined if validation of the instance fails? Or, because the whole schema fails, should they all be dropped (defined but empty)?

@handrews
Contributor Author

@epoberezkin I think the conceptual algorithm is something like:

  • Recurse down through applicators
  • Evaluate assertions
    • If any assertion fails, the only result is the assertion failure
    • Otherwise, collect annotations and return assertion success + annotations
  • As you go back up through parent schemas, if any assertion fails, drop all annotations from that schema and all subschemas

So the reason that you never get annotations from not is that either:

  • The not subschema failed an assertion, so all of its annotations and its subschema annotations are dropped before the result is negated to a validation pass
  • The not subschema passed, but then not negates that result to a fail, and therefore drops all annotations from its subschema

Of course, in practice you would probably check assertions like type before recursing down through applicators in order to avoid evaluating subschemas when not needed.

But I think if we conceptualize this as collecting annotations as we work from "leaf" schemas up to the root schema, and dropping those annotations whenever we reach an assertion failure, that's the right simplified mental model.
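
To make that concrete, here's a minimal runnable sketch of that mental model (Python; it only knows a handful of keywords, and the tuple format for collected annotations is invented purely for illustration):

def evaluate(schema, instance, schema_path="", instance_path=""):
    # Returns (valid, annotations); each annotation is an
    # (instance location, schema location, value) tuple.
    annotations = []

    # Assertions first: any failure means no annotations at all from this subtree.
    if "type" in schema:
        expected = {"object": dict, "array": list, "string": str}.get(schema["type"], object)
        if not isinstance(instance, expected):
            return False, []

    # Applicators: recurse, and drop everything if a required branch fails.
    if "properties" in schema and isinstance(instance, dict):
        for name, subschema in schema["properties"].items():
            if name in instance:
                ok, subs = evaluate(subschema, instance[name],
                                    schema_path + "/properties/" + name,
                                    instance_path + "/" + name)
                if not ok:
                    return False, []
                annotations.extend(subs)

    if "allOf" in schema:
        for i, subschema in enumerate(schema["allOf"]):
            ok, subs = evaluate(subschema, instance,
                                schema_path + "/allOf/" + str(i), instance_path)
            if not ok:
                return False, []
            annotations.extend(subs)

    if "not" in schema:
        ok, _ = evaluate(schema["not"], instance, schema_path + "/not", instance_path)
        if ok:
            return False, []  # negated to a failure: everything collected here is dropped
        # if the subschema failed, its annotations were already dropped inside the call

    # Only a schema that passed contributes its own annotations.
    for keyword in ("title", "description", "readOnly"):
        if keyword in schema:
            annotations.append((instance_path, schema_path + "/" + keyword, schema[keyword]))
    return True, annotations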

I like it a lot more than "collect annotations except under keywords X, Y, Z".

Does that make sense?

@handrews
Contributor Author

@epoberezkin I had not thought about whether dropping means "defined but empty" or "not present at all". I would lean towards "not present at all". If you don't have annotations, you shouldn't know that you could maybe have had annotations. That's a gut reaction, though, I need to think on exactly why I think that's the correct behavior.

@gregsdennis
Member

@handrews I agree with @epoberezkin on the "defined but empty" part. It makes more sense to me that an empty array says "evaluation completed without errors, but nothing was found."

Also, as a very minor point, JSONPath contains verbiage that allows for an empty array to be returned in the case that nothing was found, but the primary return in these cases is false. I used the empty array because of the reason above.

Still, either way (null or []) is fine with me.

@handrews
Contributor Author

@gregsdennis, if I'm reading @epoberezkin correctly, the only time you have these placeholder values is if annotation values were collected, but were dropped due to a schema failure. What this would give you is the ability to determine whether information was collected and then revoked as not relevant, vs simply never having been collected.

What is the use case for that distinction?

In my view, the fact that a schema that failed validation happened to have some annotations in it is an implementation detail that should not be leaked. Dropping annotations due to a failed assertion means dropping them as if they were never collected.

Consider the following schema:

{
    "allOf": [
        {"type": "array", "minItems": 10, "description": "foo"},
        {"oneOf": [A, B, C]}
    ]
}

where A, B, and C are complex subschemas with various annotations.

Consider the case where the first thing that is checked is "minItems": 10. Ideally, we do not bother to process the oneOf or any of its subschemas. We already know that validation has failed, and therefore no annotations (not "description": "foo" and not anything in any of the oneOf subschemas) are relevant.

However, if "dropping" annotations means keeping the annotation name but clearing the data to null, [], or {}, then we are forced to process everything, collect all possible annotations, and then clear out the data as it becomes invalid.

Why do all of that extra work? What benefit does it bring?

Plus, in general with JSON (and JavaScript, but the important part is JSON), things that aren't relevant are usually simply not present. In JSON Schema, absent keywords never cause validation to fail, and there are no required placeholder values. That is the general philosophy that JSON Schema, as a JSON-based system, should follow.

@gregsdennis
Member

@handrews I see the distinction and your confusion.

I don't mean that we should continue processing annotations if we know validation will fail, only to remove those annotations. I'm simply talking about the return value. It makes more sense to me to return an empty array (or object if your example is to be followed) when there are no annotations rather than to return null.

If I know at some point during validation that it will fail, I can still shortcut collection in my implementation and return the empty array/object.

Coming from a strongly typed language, it's nice for me to know what type of entity I'm going to get. It's a question of getting JSON object vs JSON null (which is very different than .Net null).

@epoberezkin
Member

@handrews another question is whether you've considered collecting all keywords in the passed (sub)schemas as annotations. I.e., when validation fails, it's usually obvious how it failed; when it passes, it's usually not as obvious which branch made it pass. In theory, while only some keywords are assertions, all keywords can be seen as annotations (e.g. required), and knowing which of them were used to pass validation can be useful.

The question is prompted by this: ajv-validator/ajv#661
TLDR: how, from the schema and the data, do we know which fields are required when 1) the data passes validation, or 2) it fails validation? (That's why I asked whether some annotations can be returned when validation fails, but it probably complicates things.)

@Relequestual
Member

@epoberezkin I think that's interesting, but a separate issue.
I think we should move forward with this issue as explained in the opening explanation.

@handrews
Contributor Author

handrews commented Jan 24, 2018

@gregsdennis I think that the question of how to represent an empty set of annotations is mostly an implementation-specific thing.

Following the conventions of JSON (which come from JavaScript), it is preferable for something to be absent than to be a literal JSON null. This is how all JSON Schema keywords work. And other JSON technologies like JSON Merge Patch make use of the relative rarity of null in JSON.

However, when it comes to working with schemas and schema results in-memory, an implementation may choose to represent that absence however it wants, and should do so in a way that is as idiomatic as possible for the implementation language. Maybe that means using a .NET null? I'm very unfamiliar with .NET, and last looked at it, um... 15 years ago? And not in depth then :-D

[EDIT: I realize that I misread the question and it is about using an empty data structure]

As far as an empty data structure goes, I'll have to think on that. It might be reasonable, or it might be that strongly typed languages should just use an empty data structure where other languages may not use anything. It's so variable: in JavaScript empty data structures are truthy, but in Python they are falsy.

Ideally JSON Schema is idiomatic JSON, and in-memory representations are as close to idiomatic for that language as is practical.

@handrews
Contributor Author

To summarize some off-github discussions with @Relequestual and @philsturgeon:

  • We do want to collect annotations
  • We do want to collect the right set of information with annotation values to allow applications to define their own usage
  • Applications care more about whether an annotation was found closer to or further from the root in terms of processing than about the absolute location of an annotation within a file

This last point is about what leads to "pointers" like: /properties/foo/$ref/title. This follows JSON Pointer syntax but rather than being applied to a file, is applied to the lazily-evaluated in-memory schema structure as it is processed with the instance. I'll refer to this concept as the processing location.

The runtime processing location seems useful because in situations such as the example I gave in the first comment, applications often want to define rules about "parent" schemas overriding "child" schemas. In this usage, the schema containing a $ref is still considered a parent of the reference's target. In other words, the parent/child concept is a runtime one rather than the usual file layout concept.

I'm open to suggestion for a better, more clear name for this.


An alternative to the /properties/foo/$ref/properties/bar/title format would be a list of pointers where the first is an absolute pointer applied to the file containing the root schema, and each subsequent element of the list is a relative JSON pointer relative to the target of the reference:

["/properties/foo/$ref", "0/properties/bar/title"]

This is a more complicated format but avoids the need to define an in-memory runtime tree concept.

@handrews
Contributor Author

I'm leaning more and more towards

["/properties/foo/$ref", "0/properties/bar/title"]

as the format, as it sticks with existing JSON Pointer concepts. Also, probably one of the most common things an application will do is check whether something is above, alongside, or on the far end of a reference, and having a $ref crossing represented by adding another pointer is very easy to work with.

It would also allow the annotation collection process to be defined without requiring knowledge of whether you have crossed a reference or not. If you're creating one pseudo-json-pointer you need to "remember" the pointer prefix that had been built at the time that you follow the reference, and keep appending to that.

If you start a new pointer when crossing a reference, then you don't need to do that. You just always start the pointer from where you start processing something: an absolute pointer if you're starting with a document's root schema, a relative pointer otherwise. And then you "return" that back across the $ref. Definitely need to work out pseudocode to make sure this actually works. And work out pseudocode for the alternative(s) so we can see what's really more straightforward.
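
A rough sketch of that bookkeeping (names invented, validation omitted so it only shows how the pointer list is built and "returned" across a $ref):

def resolve_pointer(document, fragment):
    # Minimal lookup for "#/a/b"-style fragments (no ~ escaping).
    node = document
    for token in fragment.lstrip("#").split("/")[1:]:
        node = node[token]
    return node

def collect(schema, root, prefix="", stack=()):
    # Returns (pointer_list, value) pairs for each annotation keyword found.
    results = []
    for keyword, value in schema.items():
        here = prefix + "/" + keyword
        if keyword == "$ref":
            target = resolve_pointer(root, value)
            # Start a fresh Relative JSON Pointer ("0/...") on the far side of the
            # reference, and push the pointer ending at this $ref onto the stack.
            results.extend(collect(target, root, "0", stack + (here,)))
        elif keyword == "properties":
            for name, subschema in value.items():
                results.extend(collect(subschema, root, here + "/" + name, stack))
        elif keyword in ("title", "description", "readOnly"):
            results.append((list(stack) + [here], value))
    return results

Against the #/$defs/bar example a couple of comments down, this should yield (["/properties/foo/$ref", "0/properties/bar/title"], "X").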

@gregsdennis
Member

I don't understand the 0 in the second pointer. Does it reference the first item in the array? How is this consistent with existing concepts when pointers reference from the instance root?

Personally I like the included $ref better. Granted, I think it'll be easier for me to process, given my implementation.

@handrews
Contributor Author

handrews commented Jan 25, 2018

@gregsdennis it's a Relative JSON Pointer starting from the target of the $ref. So you would get

["/properties/foo/$ref", "0/properties/bar/title"]

associated with the value "X" from something like:

{
    "properties": {
        "foo": {"$ref": "#/$defs/bar"}
    },
    "$defs": {
        "bar": {
            "properties": {
                "bar": {
                    "title": "X"
                }
            }
        }
    }
}
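
For what it's worth, one way to read that pointer list back (a sketch, not a spec): follow the absolute pointer to the $ref, resolve the reference, then apply the relative pointer starting from the reference target; the leading 0 means "zero levels up from that target".

def lookup(node, tokens):
    for token in tokens:
        node = node[token]
    return node

schema = {
    "properties": {"foo": {"$ref": "#/$defs/bar"}},
    "$defs": {"bar": {"properties": {"bar": {"title": "X"}}}},
}

ref_uri = lookup(schema, "/properties/foo/$ref".split("/")[1:])   # -> "#/$defs/bar"
target = lookup(schema, ref_uri.lstrip("#/").split("/"))          # the reference target
# Dropping the leading "0" token is a shortcut; a real evaluator would climb
# that many parent levels before following the rest of the pointer.
value = lookup(target, "0/properties/bar/title".split("/")[1:])
assert value == "X"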

@handrews
Contributor Author

@gregsdennis what makes the pointer-with-$ref easier for you to work with? I'm not dead set on either option, just trying to figure out the tradeoffs for both implementations and users here.

@gregsdennis
Member

In order to indicate where an error occurred, I build a path backward from the leaf to the root. It's just easier for me if I don't have to break it up at a reference node.

I like your example in the original post. It's easy to read and makes sense. "The /foo property acquired these annotations, and here's where they're from."

Also, given how the pointer is a key, how do you intend to use the array syntax?

@handrews
Contributor Author

Also, given how the pointer is a key, how do you intend to use the array syntax?

It would have to be a somewhat different structure, yeah. If we go with the array we'll sort the rest of it out.

Now that you mention it I do think that I tend to see error reporting like that (usually with the $ref removed because it was modeled as inclusion, but with the delegation approach and allowing adjacent keywords I think we would keep it). Sometimes I see it as a JavaScript expression rather than a pointer, but the pointer is obviously better for interoperability.

Hmm... good points. I'm not in a rush to decide this so let's see what other opinions pop up :-)

@spenced

spenced commented Jan 26, 2018

Section 3.3 states that:

A schema that is applicable to a particular location in the instance, against which the instance location is valid, attaches its annotations to that location in the instance.

My reading of this is that if input value...

{"foo": "bar"}

...is validated by schema (fragment)...

{
    "title": "my foo",
    "properties": {
        "foo": {
            "type": "string"
        }
    }
}

...then the output value would be:

{"title": "my foo", "foo": "bar"}

If I'm correct so far, then what happens if the input value is...

{"title": "input title"}

...and the validating schema (fragment) is as follows?

{
    "title": "schema title",
    "properties": {
        "title": {
            "type": "string"
        }
    }
}

(end)

@handrews
Contributor Author

@spenced the "attaches" wording was not intended to mean "write into the instance". It means to associate it in some vague and not really specified way. We're working out the mechanism in this issue.

It definitely will never mean writing it into the instance, because that would constrain what instances can be used with annotations.

I might publish a bugfix / wording clarification of core and validation (like I did with Hyper-Schema and Relative JSON Pointer about a week ago) and include some clarification here. It would still be draft-07 (no functional changes). But this is confusing, and the if wording is confusing, so maybe I'll fix that too.

@spenced

spenced commented Jan 26, 2018

@handrews Thanks for the clarification: I am happy to be wrong!

@spenced

spenced commented Jan 26, 2018

With my new understanding, I can see that annotation values are determined from the reverse relationship of validated value to the schemas which positively validate the value. This set of schemas is the source of the annotations. As @epoberezkin notes, why discard assertions when they also provide useful information? Assertions are just extra data at the target of the relationship. Similarly, unknown keywords in the schema could also be passed through to the application in this manner.

@philsturgeon
Collaborator

@spenced only irrelevant annotations are dropped. If the validation of that branch fails then the annotations are not relevant, as @handrews explains here: #530 (comment)

I like the original examples for the same reasons as @gregsdennis. It's visually very clear what's going on, the code will be trivial to write, and I too am still unsure what that 0 is doing there.

As for "should the keywords be on the ref", im a little torn between;

        "title": {
            "/properties/foo/$ref/title": "Title in reference target",
            "/properties/foo/title": "Title adjacent to reference"
        },
        "description": {
            "/properties/foo/$ref/description": "Even more text",
            "/properties/foo/description": "Lots of text"

and

        "title": {
            "/properties/foo/$ref": "Title in reference target",
            "/properties/foo": "Title adjacent to reference"
        },
        "description": {
            "/properties/foo/$ref": "Even more text",
            "/properties/foo": "Lots of text"

The first is more verbose and again visually very simple, and the latter is going to be a little lighter to work with for biiiiiig schemas with a lot of annotations. Either way there will be complaints: "It's annoying having to stick the keyword from the key onto the end of the string, this feels like constructing URLs!" or "This is so dumb, why is it repeated so much?"

There's not gonna be a winner but we should make a call.

@handrews
Contributor Author

handrews commented Feb 6, 2018

I too am still unsure what that 0 is doing there.

Do you understand Relative JSON Pointers? (not sarcasm! They're not the most intuitive if you're not expecting them) I'm trying to figure out whether I need to explain what a relative JSON Pointer is, or just explain specifically how that particular relative JSON Pointer is evaluated, and why (although it seems likely to be moot for this issue as the other option is clearly more popular so far).

https://tools.ietf.org/html/draft-handrews-relative-json-pointer-01#section-3

@handrews
Contributor Author

handrews commented Feb 6, 2018

@philsturgeon for the different options you gave:

Including the keyword is because the correspondence is not always direct. Consider that hyperlinks are just complex annotations. So there's the links keyword, but it's an array so there are links from /links/0, /links/1, /links/2, etc.

Also, a keyword could contribute a value to an annotation with a different name. I need to think on the right example for this concept, I'll update when I have a good one. For now consider this a tentative reason, I just wanted to get it out there so folks know there's more to this question than just "duplicate the keyword or don't duplicate it."

@philsturgeon
Collaborator

We were talking at cross purposes for a moment there, but we solved it immediately offline and forgot to report back.

Including the Keyword in Collection Key

A)

        "title": {
            "/properties/foo/$ref/title": "Title in reference target",

B)

        "title": {
            "/properties/foo/$ref": "Title in reference target",

You said:

Including the keyword is because the correspondence is not always direct. Consider that hyperlinks are just complex annotations. So there's the links keyword, but it's an array so there are links from /links/0, /links/1, /links/2, etc.

In that case A) totally makes sense! If a keyword can contribute a value to another field's collected annotations, then we need to have the keyword in there! Dope. Maybe get an example written up, and we'll all put thumbs all over it.

Wut is 0

I understand the point of Relative JSON Pointers after reading it, but I think a loooot of folks will get confused seeing a 0/ to navigate through the root object. Even as a PHP developer that strikes me as weeeeeeirdy beardy, which is why I'm leaning towards the original comment vs the later suggestion

@handrews
Contributor Author

a 0/ to navigate through the root object

The 0 is the starting point from the $ref target, not the root, but since absolutely no one is in favor of that option let's drop it for now.

@handrews
Contributor Author

handrews commented Mar 4, 2018

[EDIT: I've changed my mind on annotations not tied to a single keyword, see the next comment]

@philsturgeon asked for an example of keywords contributing to another keyword's annotations. With some trepidation, I will demonstrate this with the unevaluatedProperties proposal developed in #515 and currently filed as #556. DO NOT discuss the merits of unevaluatedProperties here!!! Do that in #556.

unevaluatedProperties needs to know which instance properties have been evaluated against at least one subschema from the other keywords in the same schema object.

To do this we need two features, one of which already exists:

  • Dynamic annotation values, where the annotation value depends on the instance. The links Hyper-Schema keyword already depends on this behavior.
  • Annotations that are not tied directly to a single keyword. This is new.

How this would work is that annotation behavior would be defined for properties, patternProperties, and additionalProperties such that whenever a property is successfully validated against one of their subschemas, that instance property is added to the annotation set that we will call evaluatedProperties.

As with all annotations, if the instance fails to validate against a schema at any point, all of its annotations, including those contributed by its subschemas, are dropped, and only the validation failure is propagated to the parent schema.

When handling unevaluatedProperties, the implementation simply compares the contents of the evaluatedProperties annotation with the set of properties in the instance. Any instance properties that do not appear in the evaluatedProperties set are the ones to which unevaluatedProperties's subschema applies.
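
A rough sketch of that comparison step (names invented; the exact annotation bookkeeping is what this issue is still pinning down):

def apply_unevaluated_properties(subschema, instance, evaluated, evaluate):
    # "evaluated" is the set of instance property names contributed to the
    # "evaluatedProperties" annotation by properties, patternProperties,
    # additionalProperties, and (potentially) extension keywords.
    for name, value in instance.items():
        if name not in evaluated:
            ok, _annotations = evaluate(subschema, value)
            if not ok:
                return False
    return True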

A benefit of this is that an implementation of unevaluatedProperties does not have to understand specific keyword behavior. It simply defines an interface (applying its subschema to everything not in the "evaluatedProperties" annotation) and specifies that the "evaluatedProperties" annotation may be affected by other keywords. Within the same basic applicator vocabulary, we define properties, patternProperties, and additionalProperties to do so, but an extension vocabulary could add more keywords that affect "evaluatedProperties".

This mechanism of allowing/requiring other keywords to contribute to an annotation is important, particularly in a multi-vocabulary world, because it decouples the behavior of the keyword using the annotation from any specific vocabulary or keyword that contributes to the annotation.

@handrews
Contributor Author

While working on #600 (Keyword behavior can depend on subschema or adjacent keyword annotation results), I worked through more details on how all of this should work. I came to the conclusion that having annotations associated with anything but the collecting keyword's name is far too confusing. It creates an unpredictable interface between subschema and parent schema evaluation.

The problems include:

  • Having unevaluatedProperties define an "evaluatedProperties" annotation and require various keywords to contribute to it means that the behavior of those keywords is no longer entirely controlled by their own definition. This means that you can't successfully implement properties without reading the spec for unevaluatedProperties.

  • The "evaluatedProperties" approach means that unknown future extension keywords can change the behavior of unevaluatedProperties. I originally considered that a feature but now consider it a bug. If you want something similar to unevaluatedProperties that interacts with additional keywords, then define a new keyword for that (I think this sort of unpredictability was one of @epoberezkin's concerns with unevaluatedProperties, so eliminating it should hopefully make the keyword more palatable).

So, in #600, you can see that I did the following:

  • items contributes to the "items" annotation
  • additionalItems reads the "items" annotation to determine its behavior
  • properties contributes to the "properties" annotation
  • patternProperties contributes to the "patternProperties" annotation
  • additionalProperties reads the "properties" and "patternProperties" annotations to determine its behavior

Implementations are explicitly not required to code these keywords exactly as specified in terms of annotations, as other implementation strategies are likely to be more efficient. Also, these keywords need to work in implementations that opt out of collecting annotations.

Note that the schema location of the annotation is critical for implementing additionalItems and additionalProperties, as only the annotation values from keywords in the same object, as reached through the same sequence of references, are relevant.
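
As a rough illustration of both sides of that list (with invented plumbing), properties could report its annotation value like this, and additionalProperties would then only touch instance properties whose names are missing from the "properties" and "patternProperties" annotations recorded at the same schema location:

def apply_properties(schema_object, instance, evaluate):
    # Returns (valid, annotation_value); the annotation value is the set of
    # instance property names whose values this keyword successfully validated.
    matched = set()
    for name, subschema in schema_object["properties"].items():
        if name in instance:
            ok, _ = evaluate(subschema, instance[name])
            if not ok:
                return False, set()  # failure: the annotation is dropped along with everything else
            matched.add(name)
    return True, matched

def apply_additional_properties(schema_object, instance, same_location_annotations, evaluate):
    seen = (same_location_annotations.get("properties", set())
            | same_location_annotations.get("patternProperties", set()))
    for name, value in instance.items():
        if name not in seen:
            ok, _ = evaluate(schema_object["additionalProperties"], value)
            if not ok:
                return False
    return True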

unevaluatedItems and unevaluatedProperties rely on the above, plus:

  • additionalItems contributes to the "additionalItems" annotation
  • additionalProperties contributes to the "additionalProperties" annotation
  • unevaluatedItems contributes to the "unevaluatedItems" annotation, and reads the "items", "additionalItems", and previous "unevaluatedItems" annotations to determine its behavior
  • unevaluatedProperties contributes to the "unevaluatedProperties" annotation, and reads the "properties", "patternProperties", "additionalProperties", and previous "unevaluatedProperties" annotations to determine its behavior

With these keywords, annotations from all schema locations that have been processed so far (using the depth-first approach outlined in earlier comments) are considered.

Note that the unevaluated* keywords need to define an evaluation order within their schema object. Specifically:

  • unevaluatedItems is evaluated after all in-place applicators, and all other applicators relevant to instances of type "array"
  • unevaluatedProperties is evaluated after all in-place applicators, and all other applicators relevant to instances of type "object"

There are a lot of issues with that but they're unrelated to annotation collection so I'll track those in #556, #557, and #561 (vocabulary support) or new issues.

@handrews
Contributor Author

Regarding

/properties/foo/$ref/properties/bar/title
vs
["/properties/foo/$ref", "0/properties/bar/title"]

I no longer think either will work, as the whole concept of working with a post-reference-resolution structure is more complicated than it looks at first, and schemas are more naturally and reliably identified with full URIs. Better solutions are being discussed in #396 (output/error schema), so I'd rather that discussion be consolidated over there.

This issue should continue to track the general annotation collection algorithm, and the required data to associate with the annotation (instance location, schema location, value, plus something around rolling up multiple collected values for things like readOnly, where there's a formula for combining them). I'm working on a PR for those things as they're fairly well-understood by now. I think.

Once that's done, I'll close this, and we can continue the discussion of the exact data structure, syntax, representation formats, etc. in #396, which is now also targeted at draft-08.

@handrews
Contributor Author

At this point I think everything here has either been addressed by PR #310, or is being tracked by #679 (Output formatting) or #635 (general concerns about the clarity of keyword behavior description, including annotations). I'm going to close this out because I think those two open issues are better for discussing any further work.
