Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standard schema for results, including errors #396

Closed
trajano opened this issue Sep 5, 2017 · 105 comments
Closed

Standard schema for results, including errors #396

trajano opened this issue Sep 5, 2017 · 105 comments

Comments

@trajano
Copy link

trajano commented Sep 5, 2017

The schema provides a way of validating the input, but there should be a standard for the validation results as well.

@handrews
Copy link
Contributor

handrews commented Sep 5, 2017

EDIT: This is now tracking overall output formats, success or failure. See also #530 for more specifics about annotation output.

@trajano could you elaborate on this a bit more? Are you talking about input to a specific validator? Or input/results from some other system that you want to use validation with?

@trajano
Copy link
Author

trajano commented Sep 5, 2017

I was thinking that we should have something along the lines of a validation result structure akin to https://docs.oracle.com/javaee/7/api/javax/validation/ConstraintViolation.html

For example:

{
   "valid": true | false,
   "violations": [
      {
         "property-path" : "$/path/to[10]/bad/element",
         "invalid-value": "who's bad" | { "some": [ "complex", "object" ] }
         "rule-path": "$/path/in/schema/that/triggered/violation",
     } ,
      {
         "property-path" : "$/path/to[10]/bad/element",
         "invalid-value": "who's bad" | { "some": [ "complex", "object" ] }
         "rule-path": "$/path/in/schema/that/triggered/violation",
          // these may be optionally added but should not be part of the spec
         "message-template": "{0} is who is bad",
         "message": "who's bad is who is bad",
         "position": { "line": 52", "col": 10 }
     } 
    ]
}

@trajano
Copy link
Author

trajano commented Sep 5, 2017

That way each application need not develop their own validation error standard messages

@handrews
Copy link
Contributor

handrews commented Sep 5, 2017

@trajano I like the idea, in some form or another.

RFCs generally do not constrain implementations in this way, due to the wide variety of implementation languages, environments, and requirements such as performance vs usability. This idea helps interoperability among users of validators, rather than interoperability in schema processing and outcome. So I don't think this proposal belongs in the spec repo (at least not with its current scope).

I see two possible approaches:

  1. Write a schema for the desired output and perhaps we can discuss it with implementation maintainers and publish it as a recommendation on the web site (which is a separate GitHub repo). Depending on how broadly it is adopted maybe there will be a clear need to formalize it further.

  2. Approach it like the "application/problem+json" specification. That is an error-reporting format that is defined independent of the system producing the errors. You would obviously want specific schema error-oriented fields, and would not be going for the completely generic approach of that spec, but separating it encourages its use in implementations that can support it, without burdening more constrained implementations with support.

@erayd
Copy link

erayd commented Sep 5, 2017

I really like this:

  1. Approach it like the "application/problem+json" specification. That is an error-reporting format that is defined independent of the system producing the errors. You would obviously want specific schema error-oriented fields, and would not be going for the completely generic approach of that spec, but separating it encourages its use in implementations that can support it, without burdening more constrained implementations with support.

If it's attractive enough, I would certainly support it in the implementations I'm involved with.

@trajano
Copy link
Author

trajano commented Sep 5, 2017

I think application/violation+json as a MIME type sounds better. For those who may not know (like me) the thing before the +json is the MIME subtype.

@handrews
Copy link
Contributor

handrews commented Sep 5, 2017

@trajano @erayd I'm glad to see multiple people taking interest. I do not have time to focus on this right now, but here is what I would recommend:

  • Don't worry about making it an actual media type right now. You may want to do that, or you may (for instance) just want to define a specific extension of application/problem+json that can be identified by a schema describing the validation error output. Then you do not need to go through RFC, but can instead just apply two existing technologies and see how that goes to start.
  • File an issue to track this over in the json-schema-org.github.io repo. Storing your proposal there as extended documentation / best practices seems like the best starting point to me.
  • I'm going to close this as I do not think it belongs in this repo at this time. But that is just me managing the repo scope, not a rejection of the idea at all.

@trajano
Copy link
Author

trajano commented Sep 5, 2017

Closing this issue since it is out of scope of this project.

@trajano trajano closed this as completed Sep 5, 2017
@yurikhan
Copy link

I am interested in this. Subscribing for comments, even if only to know which other issue to track.

@handrews
Copy link
Contributor

There have been a few other issues around error reporting and/or formal definitions of validation results, so I have made an "output" label to track these. I will re-open this and label it, at least until we figure out where all of these concerns should really live.

Somewhat related: I'm defining a "recommended" output format for Hyper-Schema, although it is not a strict requirement. It does seem useful, though, which is why I am re-considering whether a recommendation, if not requirement, might be in-scope. The situation in hyper-schema is a little different, though (the mandatory output of the process is much more complex than a boolean result, even ignoring error reporting).

@Anthropic
Copy link
Collaborator

@handrews should validation results be a vocabulary? I think it would be awesome to have a consistent format/structure/schema for responses that validation implementations can work toward to ensure compatibility to make transferring from one validator to another more seamless. Just a week ago I posted this issue on djv to support a similar, more detailed, error message format to ajv. I think if ajv and djv the two fastest validators have a similar format then it could act as a starting point for a predictable validation result schema.

@handrews
Copy link
Contributor

@Anthropic I'm not sure I would consider output formats to be "vocabularies", as to me a vocabulary is a set of keywords to use in schemas to either annotate or assert conditions about an instance document. I think output documents are instances (and not also schemas). They can be described by a schema, but they do not constitute a schema as far as I can tell.

Unless I am misunderstanding what you mean? I don't think we need a new vocabulary to write a schema describing the output format.

When you look at the hyper-schema rewrite, you will see the definition of the output format there. Although it does not include an error reporting format.

I have no particular point here, just acknowledging that there is something to this topic and I'm not sure where it fits.

@Anthropic
Copy link
Collaborator

@handrews I see your non particular point 😉, what I meant was it just felt looking at the ajv error format with keyword, dataPath, schemaPath, params and message seemed to have its own keywords, so I wasn't sure where it belongs, you could say I was thinking out loud. Not concerned with the where as much as wanting to figure out or get ideas on the best place to define such a format.

@gregsdennis
Copy link
Member

I'm already doing something similar in my implementation, though it's merely in an object model. I'd have to update that model and make it serializable, but it's not a stretch for me to support something like this.

I like the idea of standardizing output.

@vearutop
Copy link

Important thing is that validation failures can be deeply nested by *Of keywords. And to understand the root error you'll need to follow sub-errors.
Sample error message:

No valid results for oneOf {
 0: Enum failed, enum: ["a"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[0]
 1: Enum failed, enum: ["b"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[1]
 2: No valid results for anyOf {
   0: Enum failed, enum: ["c"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]->anyOf[0]
   1: Enum failed, enum: ["d"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]->anyOf[1]
   2: Enum failed, enum: ["e"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]->anyOf[2]
 } at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]
} at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo

I'm not sure standard way of error reporting is necessary (the only obvious use-case to me is to change one implementation to another and keep same error handling code), but I would implement it.

@handrews
Copy link
Contributor

@vearutop yeah, the *Of keywords are a major concern. I think the key for them is designing a simple raw error format that tools can navigate to provide better feedback. For example, being able to show errors in some sort of hierarchical drill-down display.

This is off the top of my head and not a serious proposal:

If I'm reading your example right you're working with a schema like the following:

{
  "type": "object",
  "properties": {
    "root": {
      "type": "object",
      "patternProperties": {
        "^[a-zA-Z0-9_]+$": {
          "oneOf": [
            {"enum": ["a"]},
            {"enum": ["b"]},
            {"$ref": "#/cde"}
          ]
        }
      }
    }
  },
  "cde": {
    "anyOf": [
      {"enum": ["c"]},
      {"enum": ["d"]},
      {"enum": ["e"]}
    ]
  }
}

with an instance of:

{
  "root": {
    "zoo": "f"
  }
}

An error data structure could look something like:

[
  {
    "instanceLocation": "/root/zoo",
    "instanceData": "f",
    "errors": [
      {
        "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf"],
        "validSubschemas": [],
        "subschemaErrors": [
          {
            "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/0/enum"],
            "schemaValue": ["a"]
          },
          {
            "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/1/enum"],
            "schemaValue": ["b"]
          },
          {
            "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "/cde/anyOf"],
            "validSubschemas": [],
            "subschemaErrors": [
              {
                "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf/0/enum"],
                "schemaValue": ["c"]
              },
              {
                "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf/1/enum"],
                "schemaValue": ["d"]
              },
              {
                "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf/2/enum"],
                "schemaValue": ["e"]
              }
            ]
          }
        ]
      }
    ]
  }
]

This organizes errors first by the location in the instance that is failing validation, then indicates each schema location against which it fails that may be the cause of the overall failure.

Instance locations are plain (non-URI-fragment) JSON Pointers, while schema locations are arrays of URIs with JSON pointer fragments where a new pointer is added to the array whenever you cross a $ref.

If any of the subschemas were valid, validSubschemas would provide the URIs with JSON Pointer fragments of those subschemas. In this case, none are, so it's an empty array.

The goal here is to provide all of the key data in a standardized way, but not necessarily the natural language phrasing. So there's nothing here that says "Value not present in enum", because there are many ways that that phrasing might be chosen, not to mention I18N/L10N/A11Y issues.

But this approach would let tools display something simple like "No valid oneOf subschemas at..." and only display each individual error when drilling down. In my experience, the biggest problem with *Of error reporting is that it's hard to sort through everything to figure out what the real problem is vs "errors" that are expected (particularly with oneOf where all but one should fail).

@handrews
Copy link
Contributor

Since draft-08 is intended to include annotation collection, which would involve a recommended output format, we should really handle a recommended error format at the same time. That would cover all of this issue, I believe, so I'm adding it to the milestone.

@vearutop
Copy link

I think it is worth adding keyword to error data structure.
@handrews I've tried to describe your example (with few changes) in JSON Schema: https://gist.github.com/vearutop/17ef696fe1426844b302e844076400c5

@vearutop
Copy link

vearutop commented Jul 6, 2018

It would be nice to come up with a rich example of schema and data, having multiple dereferences, not, oneOf to also see the case when missing failure is invalid. And then to assess how helpful are currently implemented error systems.

Also that example could contribute to test suite after we have a consensus.

@Relequestual
Copy link
Member

@vearutop thanks for the example, and your last comment.
I was going to suggest something similar, a more analytical approach of what people currently do, and if there's an superset of those implementations which cover as many requirements as are defined to satisfy this issue... probably need to define the requirements clearly also. BUT I realised that was a lot of work for someone to do, and I know I don't have the time.

@Anthropic
Copy link
Collaborator

Annotation Discussion on Slack

There was a good chat in Slack so I have added it if you click below.

View annotation conversation

@vearutop:
I am totally lost on annotations topic 🙂
would appreciate a simple example

@handrews:
@vearutop do you follow that title, description, readOnly, writeOnly, default, and examples are annotations? (They are not the only ones, but they are the most straightforward so if this is not clear we should start with explaining that)

@vearutop:
I am using few of those in code generation
title, description, default at least

@handrews:
Right. So these do not impact validation at all, correct?

@vearutop:
yes

@handrews:
They do, however, attach to a particular part of the instance. title in the root schema is presumably the title of the whole instance
title within a particular properties subschema is presumably the title of that property in the instance
In this sense, title annotates the instance with information
with me so far?

@vearutop:
yes, but still not sure to get the collection meaning

@handrews:
right ,I’m getting to that
this is just to make sure that we’re on the same page on the general concept of an annotation
so, to illustrate why “collection” is important, what is the meaning of {"allOf": [{"title": "Foo"}, {"title": "Bar"}]}?
is the title “Foo” or “Bar”? And why?
There is not an obvious universal answer

@vearutop:
let me explain 🙂

@handrews:
ok

@vearutop:
if that is a code generation, title: Foo would be a class of AllOf0, and title: Bar as class of AllOf1
AllOf would be a union of those

@handrews:
the key thing here is if that is a code generation
you know how to interpret that because you have a specific application in mind
in order to do that, you need to know:

  1. That both of these are from an allOf
  2. That “Foo” is from the first, and “Bar” is from the second
    This is because your application (code generation) assigns importance to the use of an allOf and the order of schemas within it
    (in contrast with validation, which does not care about the ordering at all)
    If you are writing a dedicated code generator, you can hardwire all of that. You know what you’re doing, you can write your assumptions into code. It will run faster that way.
    However, if someone is writing a generic library for processing JSON Schema, which many other applications will use, they cannot do taht
    Even if it’s for code generation, someone else’s code generator may assign different semantics to the use of allOf
    So “collecting” annotations means that the generic library provides information on:
  3. What keyword produced the annotation (title)
  4. Where it came from in the schema (the first subschema of allOf in the root schema)
  5. To where it is attached in the instance (the root of the instance) (edited)
    Given that information, your code generator would now know enough about those titles to figure out how to use them.
    Is that the most efficient way to write a code generator? Nope. But it’s an approach that will work across many applications. Flexibility vs efficiency.
    [I’m handwaving past some differences in working with and without instances, which @gregsdennis recently highlighted]

@vearutop:
so it could be an [{"keyword":"title","schemaPath":"#/allOf/0","dataPath":"#","value":"Foo"},...]? (edited)

@handrews:
Yup!

@vearutop:
hmhm, but usually it is not a big deal to traverse the schema recursively
thanks for explanation

@handrews:
@vearutop for a well-known application, especially one like code generation that works without instances, I would absolutely just recommend schema traversal
it’s why I wrote @cloudflare/json-schema-walker in JavaScript
For hyper-schema, where the links keyword is a complex annotation whose presence and value depends heavily on the instance, you want to be able to ask “what links exist for this instance”, and understand from where they came and to what data they are attached.
With an instance present, and validation keywords in use, you need to implement validation in order to traverse, so you might as well do the collection as part of that.
[I have to go get lunch now, I’ll check in later]

@vearutop:
okok, makes good sense if the collection of annotations is tied to validation

@handrews:
@vearutop as @gregsdennis observed the other day, collection of annotations is really a thing that happens with instances, and therefore needs to take validation into account. There is some other concept of using annotations by statically walking a schema without an instance that is relevant to code generation, ui generation, documentation, etc. that needs a name. I do not know what that name is.

@gregsdennis:
Please also note that I come from a validation point of view. Following this conversation has illuminated the other uses a bit more for me. @vearutop since you have more experience with the code generation side of this, is it distinctly easier to have a recursive output for this and possibly other static analysis purposes?

@Anthropic:
@gregsdennis @handrews @vearutop
I've had someone who does code generation tell me they convert the schema into an IR (intermediate representation) internally. Even within my own application we process the schema first, doing a transformation over the whole thing as we merge it with our array based ui schema to create our own IR for machine processing.

Henry, like Greg, I find this an illuminating conversation, in fact it has been more interesting to me than the bulk of the 90 comments on the error structure issue, perhaps some of it should be added there since Slack will delete this in future?

@vearutop
Copy link

I've tried to make a sample schema and response output, please check:
https://gist.github.com/vearutop/376e2a148e6df044417264722f0d1319

Same schema and test cases with ajv processing:
https://runkit.com/embed/w5fpvewlnetl

The idea is to provide a flat response (array of causes) for top level schema.
Optionally subschema can produce a recursive array of causes in causeContext.

A single cause is a structure of

  • keyword (not present for boolean schema)
  • schemaPath
  • dataPath
  • valid (true/false)
  • causeContext ( a list of causes where schema of current cause acts as a top level schema)

For multiple schema keywords (*Of), each element creates a separate causeContext.

References create a new causeContext allowing you to track $ref jumps.

I don't like having all causes in a single flat list because non top level causes add noise but does not help you to analyze the root cause until you rebuild flat list into hierarchy.

@expelledboy
Copy link

Hi guys, I am new to json schema, implementing a validation library in Erlang. Very interested in this discussion as a framework developer.

As a suggestion to reach consensus I suggest we summarize the points made here into an Enhancement Proposal, eg in a project wiki page? I have read through this mega-thread and conclusions have already been made, which detailed concisely might align the community.

@gregsdennis
Copy link
Member

@expelledboy that's a good idea, and it's been suggested before. I'll try to write up a new issue that summarizes where we are with this one. Then we can close this out.

@Anthropic
Copy link
Collaborator

@vearutop I agree with the idea of the default output and the causality output providing the same detail at the root error array so there is no difference in error counts or overall behaviour between the two.

@gregsdennis @vearutop
I feel like we could consider:

  1. basic (default): the flat array minimal output
  2. causality: as basic, but with @vearutop 's causeContext
  3. verbose: a hierarchy

Then for each level the Annotation/Error Description Object could be optionally alternated with an Enhanced Annotation/Error Description Object with more detail, message ids, data, embedded schemas, etc... to be decided... this gives implementations a lot of choice while not requiring anyone to go too far from their comfort zone until enough people request it of them.

I feel an implementation:
SHOULD provide basic
RECOMMENDED to provide causality and
OPTIONALLY provide verbose
only in so far as being able to claim they "support the standard annotation/error output". I like the idea of having a minimum requirement before someone can make that claim.

@expelledboy
Copy link

@gregsdennis Could I suggest rather just starting a wiki page, and continuing the discussion here? We may just end up with another mega-thread.

@gregsdennis
Copy link
Member

gregsdennis commented Jul 19, 2018

@expelledboy I'm not sure that's in line with the organization of the repo. Also, this topic is still very much in discussion. I'll discuss it over on Slack with the powers that be here.

@retrosight
Copy link

retrosight commented Jul 20, 2018

FWIW, for the purposes of troubleshooting what went wrong and providing the right level of information to the developer (balancing simplicity with verbosity) we settled on this:

https://github.com/retrosight/rest/blob/master/guidelines.md#errors-when-4xx

and

https://github.com/retrosight/rest/blob/master/schema/com-example-error-2018-03-01.schema.json

@vearutop
Copy link

Let's also discuss how we can expose that required property is missing.

My proposal based on previous structure is:
For schema {"required": ["foo", "bar"]} and data {"baz":1} to have this response

{
  "valid": false,
  "causeContext": [
    {"keyword": "required", "schemaPath": "#/required", "dataPath": "#", "valid": false, "causeContext": [
      {"dataPath": "#/foo", "valid": false},
      {"dataPath": "#/bar", "valid": false}
    ]}
  ]
}

or this response

{
  "valid": false,
  "causeContext": [
    {"keyword":"required", "schemaPath": "#/required", "dataPath": "#/foo", "valid": false},
    {"keyword":"required", "schemaPath": "#/required", "dataPath": "#/bar", "valid": false}
  ]
}

Not sure which one would be better.

@gregsdennis
Copy link
Member

I don't know why you'd treat required differently than any other keyword. It's just a messaging question. "One or more required properties were not found" should be sufficient. If the implementation wished to list the properties, that would be fine, too.

This is about the structure, though, not standardized messaging.

@vearutop
Copy link

But to me list of missing required properties is essential information, that's why I think required may need a special treatment.

If required was an object (like properties), there would be no problem to expose, as every missing property would have an own schemaLocation. But since required is an array.

I think I just got a rubber duck epiphany, schemaLocation can have failed property index :)

{
  "valid": false,
  "causeContext": [
    {"keyword":"required", "schemaPath": "#/required/0", "dataPath": "#/foo", "valid": false},
    {"keyword":"required", "schemaPath": "#/required/1", "dataPath": "#/bar", "valid": false}
  ]
}

@gregsdennis
Copy link
Member

Closing this issue in favor of the summary. ☝️ (Please read it.)

@Relequestual
Copy link
Member

Closed in favour of #643 as requested

@expelledboy
Copy link

@gregsdennis Should the discussion continue here, or there? Will you continue to maintain the first comment with updated conclusions?

We determined not to include the data because it could be arbitrarily large, and structure unknown.

I think the error result fields should be configurable according to the use-cases, perhaps then we should detail also the use-cases?

@gregsdennis
Copy link
Member

@expelledboy the preference is that the discussion continues on the new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests