Concise encoding for JSON Schema #259

Open
mkovatsc opened this issue Feb 24, 2017 · 17 comments

@mkovatsc

JSON Schema is highly relevant for machine-to-machine communication. The Internet of Things is expected to encompass a large share of resource-constrained devices. For semantic interoperability, they also require access to technologies such as JSON Schema.

Thus, the authors should also look into concise encodings for JSON Schema, for instance CBOR or EXI. CBOR is often used in conjunction with tags and numeric identifiers in IANA registries to overcome the issues of string identifiers. For details, the overall system architecture needs to be checked (how many and how often are Schemas shared, how are they stored, etc.)
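As a rough illustration of the numeric-identifier idea, the sketch below encodes a small schema as CBOR twice: once with string keywords and once with keywords replaced by integers. The identifiers in `KEYWORD_IDS` are invented for this example (no such registry exists), and the encoder is a hand-written subset of CBOR covering only the types needed here, not a full implementation.

```python
import json
import struct

# Hypothetical numeric identifiers for a few schema keywords.
# These numbers are made up for illustration; nothing is registered.
KEYWORD_IDS = {"type": 1, "format": 2, "properties": 3, "required": 4}


def _head(major: int, n: int) -> bytes:
    """Encode a CBOR initial byte plus its argument for the given major type."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 2**8:
        return bytes([(major << 5) | 24, n])
    if n < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def encode(value) -> bytes:
    """Encode a subset of CBOR: true/false/null, unsigned ints, text, arrays, maps."""
    if value is True:
        return b"\xf5"
    if value is False:
        return b"\xf4"
    if value is None:
        return b"\xf6"
    if isinstance(value, int) and value >= 0:
        return _head(0, value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, list):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return _head(5, len(value)) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


def compact_keys(schema):
    """Replace known keyword strings with numeric identifiers.

    Naive on purpose: it would also rewrite a property name that happens
    to equal a keyword, which a real mapping would have to avoid.
    """
    if isinstance(schema, dict):
        return {KEYWORD_IDS.get(k, k): compact_keys(v) for k, v in schema.items()}
    if isinstance(schema, list):
        return [compact_keys(v) for v in schema]
    return schema


schema = {
    "type": "object",
    "properties": {"currentTime": {"type": "string", "format": "date-time"}},
}

json_bytes = json.dumps(schema, separators=(",", ":")).encode("utf-8")
cbor_plain = encode(schema)
cbor_compact = encode(compact_keys(schema))

print(len(json_bytes), len(cbor_plain), len(cbor_compact))
```

Most of the saving in this toy case comes from the keywords rather than from CBOR's framing, which is why a shared table of identifiers matters for constrained devices.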

@handrews
Contributor

We also have a note to investigate EXI: #13

For details, the overall system architecture needs to be checked (how many and how often are Schemas shared, how are they stored, etc.)

We definitely expect that schemas are often pre-packaged and loaded from a local store indexed by the URI rather than being fetched from the URI. Fully dynamic systems will want to have clients download schemas as needed, but the spec says that schemas SHOULD be cacheable for long periods of time (I would argue indefinitely: if you need to change the schema it should get a new URI, but that's just me, and I'm sure someone will want to use a "latest" URI for schemas):

HTTP servers SHOULD set long-lived caching headers on JSON Schemas. HTTP clients SHOULD observe caching headers and not re-request documents within their freshness period. Distributed systems SHOULD make use of a shared cache and/or caching proxy.
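A minimal sketch of the "local store indexed by URI" pattern, assuming schemas are plain dictionaries. `SchemaStore`, `fake_fetch`, and the `example.com` URIs are hypothetical names for illustration; the store caches fetched schemas indefinitely and does not model HTTP caching headers or freshness periods.

```python
from typing import Callable, Dict


class SchemaStore:
    """Resolve schemas by URI, preferring pre-packaged local copies."""

    def __init__(self, bundled: Dict[str, dict], fetch: Callable[[str], dict]):
        self._schemas = dict(bundled)  # pre-packaged schemas indexed by URI
        self._fetch = fetch            # fallback used only on a cache miss
        self.fetch_count = 0           # how often the fallback was needed

    def get(self, uri: str) -> dict:
        if uri not in self._schemas:
            self.fetch_count += 1
            self._schemas[uri] = self._fetch(uri)
        return self._schemas[uri]


def fake_fetch(uri: str) -> dict:
    """Stand-in for a network request; a real client would do an HTTP GET."""
    return {"type": "string"}


store = SchemaStore(
    {"https://example.com/schemas/clock": {"type": "object"}},
    fake_fetch,
)

bundled = store.get("https://example.com/schemas/clock")   # served locally
remote = store.get("https://example.com/schemas/other")    # fetched once
again = store.get("https://example.com/schemas/other")     # served from cache
```

Treating the URI as an immutable key is what makes the indefinite cache safe; a "latest" URI would need an expiry policy on top of this.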

@danielpeintner

Note: I think that JSON Schema could easily be improved w.r.t. efficient representations by providing more information about the actual type of a value.

Let me give you an example:

        "currentTime": {
            "type": "string",
            "format": "date-time"
        }

This snippet is fine and allows, for example, EXI4JSON to pick the right codec (in this case dateTime).

A similar format identifier could also be defined in http://json-schema.org/latest/json-schema-validation.html#rfc.section.7.3, for example for binary data. Doing so would allow efficient representations to easily detect binary data and store it much more efficiently.
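To illustrate how an encoder might use "format" to choose a codec, here is a small sketch. It is not how EXI4JSON works internally; `CODECS`, `encode_value`, and the seconds-since-epoch representation are assumptions chosen to keep the example short, and the encoding drops sub-second precision and the original UTC offset.

```python
from datetime import datetime


def encode_date_time(text: str) -> int:
    """Encode an RFC 3339 date-time string as whole seconds since the epoch."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so normalise it to an explicit offset first.
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    return int(parsed.timestamp())


# Hypothetical table mapping "format" values to more compact codecs.
CODECS = {"date-time": encode_date_time}


def encode_value(value, schema: dict):
    """Pick a codec from the schema's "format"; fall back to the raw value."""
    codec = CODECS.get(schema.get("format"))
    return codec(value) if codec else value


schema = {"type": "string", "format": "date-time"}
compact = encode_value("2017-02-24T12:00:00Z", schema)
untouched = encode_value("hello", {"type": "string"})
```

Without the "format" hint the encoder has no safe way to tell a timestamp from an arbitrary string, which is the point of asking for a comparable hint for binary data.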

Besides "date-time" as a whole, "time" or "date" might also be of interest; these seem to have been removed...

Yet another aspect could be regular expressions.

I hope this input is helpful. Thanks!

@handrews
Contributor

This seems relevant:
https://github.com/quartzjer/JSCN

JSON Constrained Notation for encoding into CBOR. Doesn't look like he's published his first I-D yet but the repo is very active.

@handrews
Contributor

@danielpeintner see #199 for "date" and "time". As for "regex" as a format, @Julian and @awwright were discussing the possibility that it was removed by accident at some point so feel free to file an issue about that if you'd like to track it.

@danielpeintner

@handrews I agree with what has been said in #199 ...

Once you get back to the discussion of which formats should be in and which should not, I would argue for having one format to describe binary data in JSON.
That said, I also think the list of available formats should not be very long.

@handrews
Contributor

handrews commented May 2, 2017

@danielpeintner JSON Hyper-Schema has a feature for encoded binary data:
https://tools.ietf.org/html/draft-wright-json-schema-hyperschema-01#section-5.3

Would that work for you? I know that WoT's Thing Description is not using Hyper-Schema, but would the feature handle the case you have in mind?

@danielpeintner

Yes, the JSON Hyper-Schema feature would work!

It defines that a string should be interpreted as binary data and that's exactly what I was looking for.

@handrews
Contributor

@danielpeintner @mkovatsc Note that there is now a draft for JSON Constrained Notation.

I have not dug into it but does it cover some of what you need? While we can't officially reference a draft, if it is covering the topic I would rather allow it to be handled there. Should that stall, we can always pull it in if needed.

If JSCN is useful, what, if anything, is necessary for JSON Schema to say about it?

@handrews
Contributor

@danielpeintner I have opened a discussion on using "media" outside of Hyper-Schema at #363

@handrews
Contributor

@mkovatsc have you had a chance to look at JSCN (JSON Constrained Notation) to see how much it would help? If it is useful, then we can just focus on anything we need to do here (registering tags for schema keywords?). I don't have time to dig into this myself right now, but would be happy to see it move forward. If not now, then we can come back to it after publishing the next draft in late October.

@mkovatsc
Author

I had a look at JSCN and am not fully sure what to make of it. It tries more to be a string compressor that exploits the fact that the string represents JSON, as it tries to preserve even things like whitespace.

In general, it should be clear that true, false, and null should be mapped to true, false, and null respectively. Numbers should not lose precision, yes, but when consuming JSON-encoded data, nobody cares whether there are semantically equivalent notations, such as 1 vs 1.0, or whether the exponent character is upper or lower case.

There are even things that will not work out, such as "Ordering of key/value pairs in JSON objects and CBOR maps MUST be preserved." This does not even hold for JSON implementations, by definition of objects/dictionaries.
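The point about equivalent notations and key order can be checked with any ordinary JSON parser. The snippet below uses Python's standard `json` module purely as an example of typical consumer behaviour; other parsers may differ in how they type numbers.

```python
import json

# Upper- and lower-case exponent markers parse to the same number.
upper = json.loads("1E2")
lower = json.loads("1e2")

# 1 and 1.0 parse to different Python types (int, float) but compare equal.
integer = json.loads("1")
decimal = json.loads("1.0")

# Insignificant whitespace and key order do not affect the parsed value.
first = json.loads('{"a": 1, "b": 2}')
second = json.loads('{ "b":2,\n  "a":1 }')

print(upper == lower, integer == decimal, first == second)
```

A consumer comparing parsed values therefore cannot observe the details that a byte-preserving encoding spends effort on keeping.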

@handrews
Contributor

@mkovatsc thank you for taking the time to investigate that. I wasn't quite sure what to make of it either but have been too focused on the next hyper-schema draft to really dig into it and its relation to CBOR.

It sounds like we can decisively disregard JSCN for our purposes, meaning we should define the most efficient possible mapping into CBOR ourselves. I gather this would involve registering a set of tags?

I will not have time to work on this for the draft that is due in October. Do you or anyone else from WoT have time to make a proposal? If not, do you think it's worth poking whatever CBOR community forums exist to see if someone in the CoAP/CBOR world does?

@handrews handrews added the core label Sep 28, 2017
@handrews
Contributor

@danielpeintner note that in the forthcoming draft we have moved the binary data media object over to the validation spec as contentMediaType and contentEncoding (as these names are in line with the rest of both validation and hyper-schema, while the media sub-object was a weird exception).
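As a sketch of how a consumer might act on these two keywords, the example below describes a base64-encoded PNG and recovers the raw bytes. The `decode_content` helper is hypothetical and handles only "base64"; the schema is written as a Python dictionary for convenience.

```python
import base64

# A string instance that carries base64-encoded PNG data.
schema = {
    "type": "string",
    "contentEncoding": "base64",
    "contentMediaType": "image/png",
}


def decode_content(value: str, schema: dict) -> bytes:
    """Return the raw bytes of a string according to "contentEncoding"."""
    encoding = schema.get("contentEncoding")
    if encoding == "base64":
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(value, validate=True)
    if encoding is None:
        return value.encode("utf-8")
    raise ValueError(f"unsupported contentEncoding: {encoding}")


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of any PNG file
instance = base64.b64encode(PNG_SIGNATURE).decode("ascii")
raw = decode_content(instance, schema)
```

A binary-capable encoding such as CBOR could then store `raw` as a byte string instead of the longer base64 text.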

@danielpeintner

Thanks for the information!

@epoberezkin
Member

@handrews I was wondering why we needed "contentMediaType" and "contentEncoding" and not just some format(s)?

@handrews
Contributor

@epoberezkin I actually did think about that and came up with a good reason why not. Now if only I can remember what it was....

I know I considered just saying "any media type can be a format", but given that "format" is not currently media type-oriented that felt awkward. And you'd still need "contentEncoding" or something similar. The media type and the encoding are (mostly) orthogonal.

Also, media types are extensible (including both registered and unregistered media types), and so are formats, and having two different types of extensible things in the same keyword seemed like a bad idea.

I really feel like I had a stronger reason than either of those, but I'll have to think and see if I can remember it. As of right now, I'd say that these concepts seemed distinct and self-contained enough to be worth the other keywords.

Oh, one more thing was that I figured that it would simplify implementation requirements if the choice to support "format" validation and the choice to support "content*" validation were independent. With "format", we do give a starter, core set of formats, but with "content*" I feel like implementations would choose whichever media types and encodings seem most relevant to them, and trying to mandate a core set did not seem correct.

@epoberezkin
Member

The media type and the encoding are (mostly) orthogonal.

Yes, but they could have been just different formats and you can apply two at the same time.

I figured that it would simplify implementation requirements if the choice to support "format" validation and the choice to support "content*" validation were independent.

That makes sense.

By the way, I've used formats "json" and "edn"; I guess they would fit one of these keywords.


4 participants