Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hierarchical output structure needs clarification around which nodes are kept #1319

Open
gregsdennis opened this issue Sep 26, 2022 · 17 comments

Comments

@gregsdennis
Copy link
Member

Follow-up to #939

Currently, the specification says that all nodes are to be kept for the hierarchical structure. This needs to be clarified to state more precisely which nodes are intended.

This structure was meant to be a replacement for the 2019-09/2020-12 verbose structure. Therefore all nodes, including those which do not impact validation would be retained.

However, I realize this makes for a significantly larger structure in many cases. As such, I would like to propose that we default to only including those output units which affect the final outcome, but allowing for implementations to include all of them upon request (or configuration)

I'm open to opposing arguments and alternatives.

@handrews
Copy link
Contributor

@gregsdennis I've always been a little unclear about how verbose works. I recently noticed this (I'm sure I saw it before, but I'd forgotten):

This includes sub-schema validation results that would otherwise be removed (e.g. annotations for failed validations, successful validations inside a not keyword, etc.).

The spec mandates that annotations for failed validation scopes are deleted, so I don't understand what's happening here. On the other hand, I can see the value of such annotations being preserved for debugging, but technically doing so would require modifying the spec (unless those annotations are re-categorized as errors somehow).

@gregsdennis
Copy link
Member Author

I think its primary use is debugging. It can be helpful knowing why, e.g., unevaluatedProperties processed foo, which you can find out by following annotations (another reason I output even the "internal" annotations).

But I think that use case is probably less common, and we should default to a more common use case, which is people want what affects final outcome.

@handrews
Copy link
Contributor

another reason I output even the "internal" annotations

yes, I think most people do output them, if they support it at all. And I definitely see the debugging use.

Are you including annotations from failed subschemas, though? That is the part that I'm a bit confused about. Because the spec requires deleting them, but I can see them being useful for debugging as well so I wasn't sure what happened with verbose in that case.

@gregsdennis
Copy link
Member Author

I output them where they were produced, yes, but it should be apparent from the validation results that they don't "flow down" (or up, depending on your tree orientation) or factor into the outcome.

@handrews
Copy link
Contributor

I output them where they were produced, yes, but it should be apparent from the validation results that they don't "flow down" (or up, depending on your tree orientation) or factor into the outcome.

I don't really know what this means. Does this mean that you save them for output but not for use in parent schemas? Does this mean that sometimes there are both annotations and errors in the output unit?

@gregsdennis
Copy link
Member Author

Yeah, it's really confusing and definitely needs clarification. The thing is, I'm not sure what it should be, so this discussion is useful.


(This will use the new output structures because they're simpler.)

The way I currently have my implementation set up, I only output annotations on nodes that pass, including for hierarchical. That said, it does include annotations from nested locations that passed.

Example:

// schema
{
  "anyOf": [
    {
      "title": "bar is required, foo is specified",
      "type": "object",
      "properties": {
        "foo": {
          "title": "found foo"
        }
      },
      "required": [
        "bar"
      ]
    },
    {
      "title": "foo is required, bar is specified",
      "type": "object",
      "properties": {
        "bar": {
          "title": "found bar"
        }
      },
      "required": [
        "foo"
      ]
    }
  ]
}

// instance
{ "foo": "value" }

// output with nested passing nodes (my current setup)
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything/base#",
  "instanceLocation": "",
  "nested": [
    {
      "valid": false,
      "evaluationPath": "/anyOf/0",
      "schemaLocation": "https://json-everything/base#/anyOf/0",
      "instanceLocation": "",
      "errors": {
        "required": "Required properties [\"bar\"] were not present"
      },
      "nested": [
        {
          "valid": true,
          "evaluationPath": "/anyOf/0/properties/foo",
          "schemaLocation": "https://json-everything/base#/anyOf/0/properties/foo",
          "instanceLocation": "/foo",
          "annotations": {
            "title": "found foo"
          }
        }
      ]
    },
    {
      "valid": true,
      "evaluationPath": "/anyOf/1",
      "schemaLocation": "https://json-everything/base#/anyOf/1",
      "instanceLocation": "",
      "annotations": {
        "title": "foo is required, bar is specified",
        "properties": []
      }
    }
  ]
}

// output with all annotations, including those in failed nodes
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything/base#",
  "instanceLocation": "",
  "nested": [
    {
      "valid": false,
      "evaluationPath": "/anyOf/0",
      "schemaLocation": "https://json-everything/base#/anyOf/0",
      "instanceLocation": "",
      "annotations": {
        "title": "bar is required, foo is specified",
        "properties": [
          "foo"
        ]
      },
      "errors": {
        "required": "Required properties [\"bar\"] were not present"
      },
      "nested": [
        {
          "valid": true,
          "evaluationPath": "/anyOf/0/properties/foo",
          "schemaLocation": "https://json-everything/base#/anyOf/0/properties/foo",
          "instanceLocation": "/foo",
          "annotations": {
            "title": "found foo"
          }
        }
      ]
    },
    {
      "valid": true,
      "evaluationPath": "/anyOf/1",
      "schemaLocation": "https://json-everything/base#/anyOf/1",
      "instanceLocation": "",
      "annotations": {
        "title": "foo is required, bar is specified",
        "properties": []
      }
    }
  ]
}

I think for debugging purposes, showing all of the annotations (the second output example) is arguably more useful, but it's also really confusing if you don't understand how annotations work or how they're used internally.

I'm not really sure that hiding annotations the annotations at failed nodes (the first output example) is terribly useful for debugging, which kinda makes me wonder if even the nested nodes of a failure need to be shown.

Note that this probably also impacts what's included in basic, and it may also relate to #632.

Just for fun (and probably completeness, etc, whatever), here are a couple more options:

// output omitting children of failed nodes
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything/base#",
  "instanceLocation": "",
  "nested": [
    {
      "valid": false,
      "evaluationPath": "/anyOf/0",
      "schemaLocation": "https://json-everything/base#/anyOf/0",
      "instanceLocation": "",
      "errors": {
        "required": "Required properties [\"bar\"] were not present"
      }
    },
    {
      "valid": true,
      "evaluationPath": "/anyOf/1",
      "schemaLocation": "https://json-everything/base#/anyOf/1",
      "instanceLocation": "",
      "annotations": {
        "title": "foo is required, bar is specified",
        "properties": []
      }
    }
  ]
}

// output omitting children that don't impact validation result
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything/base#",
  "instanceLocation": "",
  "nested": [
    {
      "valid": true,
      "evaluationPath": "/anyOf/1",
      "schemaLocation": "https://json-everything/base#/anyOf/1",
      "instanceLocation": "",
      "annotations": {
        "title": "foo is required, bar is specified",
        "properties": []
      }
    }
  ]
}

It could be argued that something like a oneOf would need to show all of its children to prove that only one validated successfully (also what were the errors for the ones that didn't?). For something like not, presumably you'd always want the subtree.

@gregsdennis
Copy link
Member Author

gregsdennis commented Sep 28, 2022

Looking at the "include all annotations" one again, I think it could be made clearer by having another property (annotationsDropped here) indicating that the annotations were dropped.

{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything/base#",
  "instanceLocation": "",
  "nested": [
    {
      "valid": false,
      "evaluationPath": "/anyOf/0",
      "schemaLocation": "https://json-everything/base#/anyOf/0",
      "instanceLocation": "",
      "annotationsDropped": true,
      "annotations": {
        "title": "bar is required, foo is specified",
        "properties": [
          "foo"
        ]
      },
      "errors": {
        "required": "Required properties [\"bar\"] were not present"
      },
      "nested": [
        {
          "valid": true,
          "evaluationPath": "/anyOf/0/properties/foo",
          "schemaLocation": "https://json-everything/base#/anyOf/0/properties/foo",
          "instanceLocation": "/foo",
          "annotations": {
            "title": "found foo"
          }
        }
      ]
    },
    {
      "valid": true,
      "evaluationPath": "/anyOf/1",
      "schemaLocation": "https://json-everything/base#/anyOf/1",
      "instanceLocation": "",
      "annotations": {
        "title": "foo is required, bar is specified",
        "properties": []
      }
    }
  ]
}

That way there's an explicit indication (as opposed to the implication of valid: false) that the annotations of that subtree aren't considered higher up in the tree.

@gregsdennis
Copy link
Member Author

A subtle alternative would be to list the annotations in a droppedAnnotation property (instead of having annotations and a separate property that indicated they were dropped).

{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything/base#",
  "instanceLocation": "",
  "nested": [
    {
      "valid": false,
      "evaluationPath": "/anyOf/0",
      "schemaLocation": "https://json-everything/base#/anyOf/0",
      "instanceLocation": "",
      "droppedAnnotations": {
        "title": "bar is required, foo is specified",
        "properties": [
          "foo"
        ]
      },
      "errors": {
        "required": "Required properties [\"bar\"] were not present"
      },
      "nested": [
        {
          "valid": true,
          "evaluationPath": "/anyOf/0/properties/foo",
          "schemaLocation": "https://json-everything/base#/anyOf/0/properties/foo",
          "instanceLocation": "/foo",
          "annotations": {
            "title": "found foo"
          }
        }
      ]
    },
    {
      "valid": true,
      "evaluationPath": "/anyOf/1",
      "schemaLocation": "https://json-everything/base#/anyOf/1",
      "instanceLocation": "",
      "annotations": {
        "title": "foo is required, bar is specified",
        "properties": []
      }
    }
  ]
}

I think I like this better.

This actually affects basic as well. Perhaps, with default settings, only output units that directly affect the validation outcome or contain annotations that weren't dropped should be kept in either format. The spec could then have a provision to allow an implementation to provide an option that includes all output units and dropped annotations, and the dropped annotations would need to be in a droppedAnnotations property to provide clarity.

That would mean that you could have nodes that have:

  • failed validation
  • an error message
  • a list of dropped annotations

@handrews
Copy link
Contributor

@gregsdennis I like the droppedAnnotations idea.

@gregsdennis
Copy link
Member Author

I think the primary question I have for the default behavior (don't show everything) is how to specify what's considered to "affect validation output" for keywords that have multiple subschemas.

  • allOf - Failures could just show all of the failed subschemas. Passes would show only nodes with annotations.
  • anyOf - Failures would need to show all of the failed subschemas. Passes would show only nodes with annotations (same as allOf).
  • oneOf - Failures would need to show either all of the subschemas (if all failed) or the multiple passed subschemas (would failures need to be shown for a multiple-pass case?). Passes would only show the passed subschema.
  • properties & prefixItems would need to show all of the failed subschemas. Passes would show only nodes with annotations (same as allOf).

It seems that oneOf is the real outlier in the multiple-pass case. I suppose we could just say that a failure would show all subschemas (all failed subschemas wouldn't show the oneOf multiple-pass case). But then if only one subschema of an allOf fails, that's a lot of unnecessary information.

I'm not sure what to do with that. I'd prefer that the keywords themselves not specify what the output is, however they already do that for annotations. I'm open to ideas.

@gregsdennis
Copy link
Member Author

not is another interesting one. I might expect that the results of its subschema are always shown. If not passes, you'll likely want to see what failed inside of it. If it fails... ? Is it valuable to see a passed subschema validation result in this case? You're just going to get a lot of annotations that are ultimately dropped by not.

@handrews
Copy link
Contributor

handrews commented Nov 7, 2022

I think it's worth keeping in mind the possible interaction of this issue and discussion json-schema-org/community#236, particularly since @marksparkza is taking a crack at implementing something based on it.

That proposal has to do with enabling/disabling annotation collection on a more fine-grained level, and while I intend for it to be separate from the output system, both annotations and the more general concept of a node of data are things that are kept or not, which (if kept) show up in output.

I feel like having a general filtering layer would a.) allow for a wider variety of implementations and use cases, such as a more interactive approach with callbacks at various points of validation (don't ask for details, I literally just made that up right now) and b.) would allow the output part of the specification to stay simple. It outputs whatever it gets. It doesn't care whether there was filtering, or on what criteria, etc. Decisions about what types of filters should be easy or on by default (e.g. drop all nodes that fail validation) can be part of the filtering system.

@gregsdennis
Copy link
Member Author

gregsdennis commented Nov 7, 2022

I think that conversation is somewhat orthogonal to this one since annotations are merely a property in the output unit. This conversation is about the structure, i.e. "Which nested results are included?"

Edit: this just sank in...

Decisions about what types of filters should be easy or on by default (e.g. drop all nodes that fail validation) can be part of the filtering system.

@handrews
Copy link
Contributor

handrews commented Nov 7, 2022

Glad to see the edit (and yeah why do points always click right after we hit "send"? It's like they hide until you've already commented).

The other side of this is that both annotation collection and output nodes in general take up space, and potentially a lot of it. There are performance reasons to filter both. With something on the scale of the FHIR mega-schema, it might only be feasible to store output notes for a portion of the schema, so you might have to run with different filters in order to narrow the errors down and get the information you need. I'm thinking of this b/c I know someone (maybe you?) talked about error collection running out of memory.

@gregsdennis
Copy link
Member Author

gregsdennis commented May 7, 2023

Just an update. I'm implementing some basic pruning logic on a few keywords in JsonSchema.Net, and it looks promising.

From the test suite:

// schema
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "items": [
    {}
  ],
  "additionalItems": {
    "type": "integer"
  }
}

// instance
[null,2,3,"foo"]

// without pruning
{
  "valid": false,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything.net/fc30dcb71c",
  "instanceLocation": "",
  "details": [
    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-everything.net/fc30dcb71c",
      "instanceLocation": ""
    },
    {
      "valid": true,
      "evaluationPath": "/additionalItems",
      "schemaLocation": "https://json-everything.net/fc30dcb71c#/additionalItems",
      "instanceLocation": "/1"
    },
    {
      "valid": true,
      "evaluationPath": "/additionalItems",
      "schemaLocation": "https://json-everything.net/fc30dcb71c#/additionalItems",
      "instanceLocation": "/2"
    },
    {
      "valid": false,
      "evaluationPath": "/additionalItems",
      "schemaLocation": "https://json-everything.net/fc30dcb71c#/additionalItems",
      "instanceLocation": "/3",
      "errors": {
        "type": "Value is \"string\" but should be \"integer\""
      }
    }
  ]
}

// with pruning
{
  "valid": false,
  "evaluationPath": "",
  "schemaLocation": "https://json-everything.net/263f35c2f9",
  "instanceLocation": "",
  "details": [
    {
      "valid": true,
      "evaluationPath": "/0",
      "schemaLocation": "https://json-everything.net/263f35c2f9#/0",
      "instanceLocation": "/0"
    },
    {
      "valid": false,
      "evaluationPath": "/additionalItems",
      "schemaLocation": "https://json-everything.net/263f35c2f9#/additionalItems",
      "instanceLocation": "/3",
      "errors": {
        "type": "Value is \"string\" but should be \"integer\""
      }
    }
  ]
}

With pruning enabled, it's much easier to see what the problem is.

@jam01
Copy link

jam01 commented Jun 3, 2024

Hey all. I'm wrapping up a validation and annotation collection library I started from scratch a couple of months back, and just found this issue plus the new doc regarding the new output format.

I based my output on the 2020-12 format and have some questions/concerns about the new one. What I'm not sure is what is the right gh issue or discussion to bring it up...? I found some discussions but they're 2+ years old now.

Anyway, please let me know where makes sense to comment a bit and ask questions on that :)

@gregsdennis
Copy link
Member Author

Hey @jam01. Feel free to open a new issue or a discussion on the community repo. Happy to answer questions.

In short, 2020-12 is the latest published. It's recommended, not required, by the specification. In my lib, I use the format that's in the new doc. (It's in a new doc because we've decided to extract it to allow others to devise other formats for non-machine audiences.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants