This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Blog entry about defects found in public JSON schemas. #40

Closed
wants to merge 9 commits into from

Conversation


@zx80 zx80 commented Sep 6, 2023

This adds a 3-minute blog entry, a cover image from Unsplash, and two small avatars.

Member

@gregsdennis gregsdennis left a comment


Every one of your examples is an illustration of a typographical error, specifically, a keyword that is ineffectual because a closing brace was misplaced. This can hardly be attributed to the spec. This happens when writing JSON in general. It happens even more with YAML.

The analysis presented in this paper makes a false assumption about JSON Schema: that it's intended for data modelling. As such many of its conclusions are incorrect.

JSON Schema is a collection of constraints. Keywords are independent because it allows them to be combined however the user needs. If they want schemas that represent data modelling, that's possible, but then they need to understand how JSON Schema works in order to include the proper constraints that model that data.

Having a multitude of keywords enables users to isolate the behavior they want. Moreover, the vocabulary system allows them to create their own keywords and dialects in order to make JSON Schema into whatever they need. JSON Schema's flexibility allows people to have control over what they want to validate.

There is nothing inherent about JSON Schema that is causing authors to write bad schemas. Developers write bad code all of the time. C++ isn't inherently flawed because developers write code that manages memory poorly. That's just bad code. C++ simply offers more control over memory management. Sometimes you need that level of control.

The spec isn't at fault. What's lacking is proper tooling (and perhaps documentation) to help guide schema authors toward writing better schemas.

photo: /img/avatars/claire.jpg
link: https://www.linkedin.com/in/claire-medrala/
byline: Research Engineer
excerpt: Evidences show that schemas are hard to write, and suggest changes in the spec
Member


I absolutely object to putting (or hinting at) third party recommendations for spec changes in our blog.

This is the JSON Schema blog. It is a place for us to show off what it can do, not a forum for discussion about change or shortcomings. The appropriate place for that is issues and discussions.

Author


I absolutely object to putting (or hinting at) third party recommendations for spec changes in our blog.

Ok. The blog is only for your recommendations and discussions. Fine, it is your blog after all. As academics, we are more used to open discussions and disagreements.

This is the JSON Schema blog. It is a place for us to show off what it can do, not a forum for discussion about change or shortcomings. The appropriate place for that is issues and discussions.

Ok.

Comment on lines 24 to 25
These findings suggest key changes in JSON Schema specification which would block most
of encountered defects.
Member


I disagree with the conclusion that because people aren't using JSON Schema correctly (in many cases, they're typographical errors) that JSON Schema is at fault and needs to change.

@zx80
Author

zx80 commented Sep 8, 2023

Every one of your examples is an illustration of a typographical error, specifically, a keyword that is ineffectual because a closing brace was misplaced. This can hardly be attributed to the spec.

Yes and no. A misplaced keyword is silently ignored by the system. That is a choice of language semantics. Different choices could lead to a "your schema is invalid" error in many (but not all) cases.
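
To make the "silently ignored" point concrete, here is a minimal sketch. The `validate` function is a hypothetical toy that handles only `type`, `properties`, and `minLength`; it is not any real validator's API, and the two schemas are invented for illustration. It assumes the usual rule that `minLength` constrains strings only and is a no-op on other instance types.

```python
def validate(instance, schema):
    """Toy validator for a tiny keyword subset (illustration only)."""
    types = {"object": dict, "string": str}
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, types[expected]):
        return False
    # minLength constrains strings only; on any other instance it is ignored.
    if "minLength" in schema and isinstance(instance, str):
        if len(instance) < schema["minLength"]:
            return False
    if isinstance(instance, dict):
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not validate(instance[key], subschema):
                return False
    return True


# Intended schema: "name" must be a string of at least 3 characters.
intended = {
    "type": "object",
    "properties": {"name": {"type": "string", "minLength": 3}},
}

# Same text with one closing brace moved: minLength now sits next to
# "properties" and applies to the object itself, where it does nothing.
misplaced = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "minLength": 3,
}

print(validate({"name": "ab"}, intended))   # False
print(validate({"name": "ab"}, misplaced))  # True
```

Both schemas are well-formed, so nothing signals to the author that the second one no longer enforces the length constraint.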

The analysis presented in this paper makes a false assumption about JSON Schema: that it's intended for data modelling. As such many of its conclusions are incorrect.

I'm unclear on where you see this assumption. Our study does not assume a particular use case, whether data modelling or something else, because we do not really have any relevant information about this!

We just look at existing schemata, without knowing why/for what they were developed, and look for factual errors.

JSON Schema is a collection of constraints. Keywords are independent because it allows them to be combined however the user needs. If they want schemas that represent data modelling, that's possible, but then they need to understand how JSON Schema works in order to include the proper constraints that model that data.

Many of the allowed combinations do not make much sense. We did not find significant cases where such freedom was a requirement.

Having a multitude of keywords enables users to isolate the behavior they want. Moreover, the vocabulary system allows them to create their own keywords and dialects in order to make JSON Schema into whatever they need. JSON Schema's flexibility allows people to have control over what they want to validate.

{
  "type": "object",
  "minLength": 10,
  "pattern": "^[0-9]*[a-z]*$",
  "maxItems": 42,
  "minSize": 17
}

Why allow the above nonsense?

There is nothing inherent about JSON Schema that is causing authors to write bad schemas.

Trivial errors are silently ignored because of the chosen semantics, so the user is likely never
to find out. This does not cause bad schemas, but it helps.

Developers write bad code all of the time. C++ isn't inherently flawed because developers write code that manages memory poorly. That's just bad code. C++ simply offers more control over memory management. Sometimes you need that level of control.

Many errors are filtered out by a C++ compiler, thanks to type checks, mandatory declarations, and so on.

The spec isn't at fault.

The evidence we gathered demonstrates that (1) people get it wrong quite often (>60%) and (2) some spec changes would improve this situation (we tested our proposals). AFAICS both of these points are facts.

I understand that the spec will be broken again on the next release, so you seem to also believe that it can be improved and that it is worth breaking compatibility. At last a point of agreement!

What's lacking is proper tooling (and perhaps documentation) to help guide schema authors toward writing better schemas.

There are hundreds of existing tools, but the right one is still missing?

There is a suggestion that a linter would help. Sure, we implicitly wrote one to detect the various errors reported in the paper. Now, if a linter somehow restricts the language by filtering issues, then why not try to put at least some of these restrictions in the language itself, so that all conformant tools would check them?

@gregsdennis
Member

I'm unclear on where you see this assumption. Our study does not assume a particular use case, whether data modelling or something else, because we do not really have any relevant information about this!

The assumption is present in the "invalid" cases you present. You're assuming that a schema has to align with data patterns in programming languages. You're assuming validation (which arguably is the primary purpose of JSON Schema). But there are many use cases, most of which we still don't know about.

For example, code generation. Many languages support union types. If I want to generate a union type, I might combine keywords that don't otherwise make sense.
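
One possible reading of the union-type point, sketched under assumptions: when `type` is a list, keywords for different instance types can legitimately sit side by side, and a code generator could map the list to a union. The `annotation` helper, the `PY_TYPES` mapping, and the schema below are hypothetical and only illustrate the idea; they are not taken from any existing generator.

```python
# Hypothetical mapping from JSON Schema type names to Python type names.
PY_TYPES = {
    "string": "str", "array": "list", "object": "dict",
    "integer": "int", "number": "float", "boolean": "bool", "null": "None",
}


def annotation(schema):
    """Return a Python-style type annotation for a schema's "type" keyword."""
    declared = schema.get("type")
    if declared is None:
        return "Any"
    names = [declared] if isinstance(declared, str) else list(declared)
    return " | ".join(PY_TYPES[name] for name in names)


# minLength applies to the string branch, maxItems to the array branch.
union_schema = {"type": ["string", "array"], "minLength": 3, "maxItems": 2}
print(annotation(union_schema))  # str | list
```

Seen in isolation, `minLength` next to `maxItems` looks like a mistake, yet with a two-type `type` each keyword constrains one branch of the union.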

Why allow the above nonsense?

It may be nonsense to us, but we can't guarantee that no user has a purpose for something like that.

JSON Schema is intentionally permissive in order to account for as many use cases as possible. Yes, many schemas appear to serve no purpose or contain ineffectual keywords, however it's impossible for us to rule out the possibility that some user has a real use for such a schema.

This is where a linter comes in. A linter will warn the user that a specific construct doesn't generally make sense, but the user still has the option to ignore the warning and do it anyway. If JSON Schema disallows such things, then the user no longer has that choice, and we've prevented them from doing what they want to do.
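
A rough sketch of what such a linter check might look like, assuming a small hand-written table of which keywords apply to which instance type. The `lint` function, the `KEYWORD_TYPES` and `GENERIC` tables, and the warning wording are hypothetical; a real tool would cover the full vocabulary and nested subschemas. The check returns warnings rather than rejecting the schema, so the author remains free to ignore them.

```python
# Illustrative subset: which instance type each keyword constrains.
KEYWORD_TYPES = {
    "minLength": "string", "maxLength": "string", "pattern": "string",
    "minItems": "array", "maxItems": "array", "items": "array",
    "properties": "object", "required": "object",
    "minimum": "number", "maximum": "number",
}
# Keywords that are meaningful regardless of the declared type.
GENERIC = {"type", "enum", "const", "$ref", "$id", "$schema", "title", "description"}


def lint(schema):
    """Return a list of warnings for one schema object (top level only)."""
    warnings = []
    declared = schema.get("type")
    allowed = {declared} if isinstance(declared, str) else set(declared or [])
    if "integer" in allowed:
        allowed.add("number")  # numeric keywords also apply to integers
    for keyword in schema:
        if keyword in GENERIC:
            continue
        applies_to = KEYWORD_TYPES.get(keyword)
        if applies_to is None:
            warnings.append(f"unknown keyword '{keyword}'")
        elif allowed and applies_to not in allowed:
            warnings.append(
                f"'{keyword}' applies to {applies_to}, but type is {sorted(allowed)}"
            )
    return warnings


# The schema quoted earlier in this thread.
nonsense = {
    "type": "object",
    "minLength": 10,
    "pattern": "^[0-9]*[a-z]*$",
    "maxItems": 42,
    "minSize": 17,
}
for warning in lint(nonsense):
    print("warning:", warning)
```

On that schema the sketch reports four warnings: three keywords that cannot apply to an object, plus the unrecognized `minSize`.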

The point is to allow users to find new use cases without restriction. The way to help the users you identified is targeted tooling. Yes, some such tooling already exists, and we've partially built some. However, what's there is not well integrated into the common IDEs and editors, so it's not typically used.

There are hundreds of existing tools, but the right one is still missing?

Yes. The vast majority of the "hundreds of existing tools" are validators, and many of them don't support even the latest version of the spec, which is almost three years old.

Beyond validators, there are various generators and a few other targeted/single-purpose tools. Few of them are editors, and we are actively working with those to help them improve their offerings.

@gregsdennis
Member

gregsdennis commented Sep 8, 2023

I also think the data sources leave something to be desired.

  • Ref is absolutely a poor choice as it's a test suite. The schemas it contains are specifically crafted to verify that implementations meet the requirements of the various specs, and they're definitely NOT examples of real-world use.
  • Store - I don't have direct experience with and can't speak to.
  • ODS - I wouldn't trust generated schemas. I say as much in my documentation on my own generation library. Generation is a tool to get you started. It shouldn't go directly to production.
  • JSC - This is actually a good source of real-world usage, but it's limited to public repos. Many higher-quality schemas are used by enterprises and are proprietary or otherwise private. It's understandable that you don't have access to these.
  • Misc - I believe many publicly-accessible systems like Kubernetes and AWS are stuck using old versions, which means that users can't use newer features. This will skew your results (e.g. your conclusion of "people don't use $dynamic* so those keywords should be removed").

I'm surprised there's no mention of OpenAPI or AsyncAPI, arguably the largest usage points of JSON Schema. I wouldn't be surprised if more people used JSON Schema indirectly through one of these specs than they do directly.

There's still a lot of good work done with this study, and it would be useful for creating linting tools. I just don't agree with some of the conclusions.

The biggest thing for me, though, is that I can't back putting third party spec change recommendations and advertising potential competitors or alternative proposals in our blog.

@benjagm
Contributor

benjagm commented Sep 13, 2023

This adds a 3-minute blog entry, a cover image from Unsplash, and two small avatars.

Thank you for contributing this blog post proposal. We recognize the considerable effort behind the study backing the post, and we are sure we can extract great insights from it. However, this content differs from the Community-driven content we expected. We'd like to learn from your work and discuss your conclusions but, most importantly, make sure we serve the JSON Schema Community in the best way.

This is why we'd like to invite you to move the discussion to this Community discussion and continue a constructive conversation there to make the most of this opportunity.

We'd like to thank you once again for this contribution. This situation inspired the community to work on publishing blog guidelines to make this experience better in the future.

Please @zx80, join us in this discussion.

Users have a hard time remembering the 60 keywords and writing schemas.
We think that this can be significantly improved with limited changes to
the spec.

Member


They could also be found with a linter mode, which has been proposed here - https://github.com/orgs/json-schema-org/discussions/323 and json-schema-org/json-schema-spec#1079

Author


Thanks for the pointers.

@benjagm
Contributor

benjagm commented Oct 11, 2023

@zx80 Do you mind sending this PR to the new website repository? I am asking this because we just launched a new version of the JSON Schema website and now blog and website are in that same repository. As a consequence, this repository is going to be archived. Thanks a lot.

@zx80
Author

zx80 commented Oct 12, 2023

@zx80 Do you mind sending this PR to the new website repository? I am asking this because we just launched a new version of the JSON Schema website and now blog and website are in that same repository. As a consequence, this repository is going to be archived. Thanks a lot.

Done as this PR.

@zx80 zx80 closed this Oct 12, 2023

4 participants