
Keywords for describing array patterns; "concat", "someOf", array repetition/kleene star #1323

Open
awwright opened this issue Oct 8, 2022 · 30 comments

Comments

@awwright
Member

awwright commented Oct 8, 2022

Most regular/context-free grammars that are a subset of JSON documents have an equivalent representation in JSON Schema. However, there are some kinds of JSON arrays that can be specified with a regular expression but have no JSON Schema equivalent. For example, an array whose items are an odd number of positive integers, then a boolean:

/^\[[1-9][0-9]*, ([1-9][0-9]*, [1-9][0-9]*, )* (true|false)\]$/

(Some whitespace/decimal forms supported in JSON are omitted here for brevity; the full expression is more complicated, but suffice it to say it exists.)

In order to be able to convert between a regular/context-free language and JSON Schema, there need to be some keywords that can implement the "array" counterparts to regular expressions in strings:

  • "concat", to concatenate multiple grammars together, specified by an array of subschemas.
  • "someOf", to describe alternation (validates if one or more of the subschemas is valid).
  • "repeatMin"/"repeatMax", to describe repetition and the Kleene star.

I was trying to spec out a hierarchy of JSON Schema features (mostly by processing & memory complexity), and I realized that regular patterns over array items were missing from the O(n) complexity class.

The lack of feedback relating to this sort of thing suggests this would have limited usefulness as a core requirement, but I'd like to describe it here for completeness. The only instance I can think of is arrays where even items represent a name and odd items represent the value. Node.js describes raw HTTP headers this way.

@handrews
Contributor

handrews commented Oct 8, 2022

The only instance I can think of is arrays where even items represent a name and odd items represent the value. Node.js describes raw HTTP headers this way.

Some web frameworks use arrays like this to specify query strings with duplicate key names.
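
For example, the query string a=1&a=2&b=3 might be represented as:

["a", "1", "a", "2", "b", "3"]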

Do these keywords require some sort of new underlying JSON Schema feature? If not, would this be a better fit in the vocabularies repo?

@awwright
Member Author

awwright commented Oct 8, 2022

Some web frameworks use arrays like this to specify query strings with duplicate key names.

Yeah, another good example. Sometimes it's just less verbose to do an array with odd/even pairings rather than an array of vector arrays.

Do these keywords require some sort of new underlying JSON Schema feature? If not, would this be a better fit in the vocabularies repo?

The "concat" keyword does not fit well in with how most implementations perform validation, and most implementations will probably implement a naive backtracking algorithm in O(n^2) time or worse, even though it can be done in O(n) with much pre-computation. The others should be fairly easy:

I think "someOf" or a similar feature has been requested multiple times (though I can't find any off-hand), and it's fundamentally the same thing as "oneOf".

And "repeatMax": null could represent the kleene star, i.e. allow the defined items to occur any number of times. This is trivially implemented as well. (Likewise repeatMin: 1, repeatMax: 4 could mean "at least one occurrence, max 4 repeats" of an odd/even pair of items, i.e. the array would have to have 2, 4, 6, or 8 items).

@handrews
Contributor

handrews commented Oct 8, 2022

@awwright can you explain a bit more about how concat would work? I admit I'm having a bit of trouble wrapping my head around it. I think part of that is lack of clarity regarding "grammar" here: are you talking about these keywords taking strings in a new language of some sort, rather than working by applying and combining schemas in some way? I realize we already support regular expressions, but they are ubiquitous across a wide range of programming languages.

@awwright
Member Author

awwright commented Oct 8, 2022

@handrews It's essentially the same thing as concatenation of characters or groups in regular expressions:

(ab|cd|123)[0-9]e

There are three elements being concatenated together here: one alternation, then a character class, then a literal (the latter two each match exactly one character).

It can get complicated when there's a repetition followed by an alternation:

^(abc)*(a|b|c)d$

If the first four characters of input are abca, you don't know whether the fourth character matches the first group or the second; you have to follow both paths until one path matches or mismatches. You can skip the backtracking algorithm and execute this in O(n) time by computing the union of both paths.

So to write a schema with odd/even items that ends with a boolean, it might look like this:

{
  "type": "array",
  "concat": [
    { "type": "array", "prefixItems": [ { "type": "string" }, { "type": "integer" } ], "minItems": 2, "maxItems": 2, "repeatMin": 0, "repeatMax": null },
    { "type": "array", "additionalItems": { "type": "boolean" }, "minItems": 1, "maxItems": 1 }
  ]
}

... there are different ways the "repeat" keywords could interact with other keywords; I haven't fully thought it out yet. Maybe they only apply to "concat" instead of any of the *items keywords.

@handrews
Contributor

handrews commented Oct 8, 2022

@awwright thanks, I think I understand this now. Definitely interesting, and yes possibly involving an underlying change in JSON Schema capabilities.

In this example, concat is essentially behaving as a repeatable prefixItems. If the default behavior of concat were to repeat indefinitely, then just that alone would be fine (items is essentially a single element that repeats indefinitely, so that's well within normal capabilities). [EDIT: I misread the concat part and didn't realize how far the side scroll went, will update with a correct assessment later.]

However, keywords with a more complex impact on how an applicator applies and evaluates its schemas might require a more substantial change, so I'm in favor of keeping this issue in this repository while we consider that.

Ultimately I think this would fit better as an extension vocabulary than in the standard applicator vocabulary, but it's a really interesting idea for one!

@handrews
Contributor

handrews commented Oct 8, 2022

I'm going to rewrite this with formatting I find easier to follow, and in YAML because I'm lazy.

type: array
concat:
  - type: array
    prefixItems:
      - type: string
      - type: integer
    minItems: 2
    maxItems: 2
    repeatMin: 0
    repeatMax: null
  - type: array
    additionalItems:
        type: boolean
    minItems: 1
    maxItems: 1

OK, looking at it like this, I see where the difficulties would be in the current model. Basically, it's the block of prefixItems/minItems/maxItems that you want to repeat. The type constraint and the repeatMin/repeatMax keywords are outside of the affected block.

I think this would need to be reworked a bit to make sure that any control keywords are outside of the schema object that they need to control. Which has its own wrinkles because you're trying to change the start point of prefixItems which contradicts its (current) specification. I'll have to think on this more. The concat part still seems straightforward enough to me.

@gregsdennis
Member

I agree with @handrews. I think what this needs is a new keyword for the repeating content.

I'd like to propose another new keyword to handle this: repeatedItems. It fits well with the repeatMin/Max as well. I'll stick with the YAML (🤢) because that's what's been used.

type: array
concat:
  - type: array
    repeatedItems:
      - type: string
      - type: integer
    repeatMin: 0
    repeatMax: null
  - type: array
    additionalItems:
        type: boolean
    minItems: 1
    maxItems: 1

(well... items, really, because additionalItems doesn't exist anymore)

Here, repeatedItems functions similarly to prefixItems except that it doesn't allow for follow-on items: it explicitly defines only those items which are to repeat. Using prefixItems entails some other complications, like how min/maxItems wouldn't apply the way they normally do (they normally look at the whole array).

This also solves Henry's concern that repeatMin/Max should be defined outside of the bit that's being repeated. And it somewhat aligns with the contains keyword family.


The concat with an initial repeatedItems concerns me, however. With the schema above, how is [ "foo", 1, "bar", 2, true ] processed?

  1. true doesn't match the repeatedItems, so I expect evaluation would stop for this? (Say, under the annotation model, it leaves an annotation of 3 to indicate the index of the last successfully evaluated item.)
  2. Then the concat would need to forward that annotation (or otherwise communicate the result of the first subschema) into the second subschema so that items knows where to start evaluating.
  3. The second subschema evaluates... what?

The second subschema, as written, evaluates the whole array, which in this case would fail. But the concat somehow changes how this subschema operates? Or maybe it uses the results from the first subschema to slice the array into a subarray that the second subschema evaluates? I'm not really sure what's expected to happen under the covers. Both of these approaches are departures from how schema evaluation is expected to work.

One way that I can see to resolve this is to just remove the concat and have it all as a single schema, then have items (and unevaluatedItems) also depend on the results of repeatedItems.

type: array
repeatedItems:
  - type: string
  - type: integer
repeatMin: 0
repeatMax: null
items:
    type: boolean
minItems: 1
maxItems: 1

This solves the items problem, but that min/maxItems is still out of place. Maybe you could use something like prefixItems with an items: false so that you could specify exactly what you're looking for in prefixItems and allow nothing else.

type: array
repeatedItems:
  - type: string
  - type: integer
repeatMin: 0
repeatMax: null
prefixItems:
  - type: boolean
items: false

But then it's strange that "prefixItems" isn't specifying what comes first, and it removes a lot of flexibility... which leads back to using concat to specify composition.

@jdesrosiers
Member

jdesrosiers commented Oct 9, 2022

I think this is a really exciting idea. Array validation is definitely a place where JSON Schema is lacking.

The lack of feedback relating to this sort of thing suggests this would have limited usefulness as a core requirement

I disagree. It's certainly not super common, but I've seen quite a few SO questions over the years where I've had to tell people that JSON Schema can't express the weird sequence they need to describe. If we can come up with a good way to improve that situation, I think it's very well worth considering for the spec.

I'd like to suggest something similar to what @gregsdennis proposed, but a little inverted. The repeatMin and repeatMax would be defined such that they work like minContains and maxContains. They would describe the number of sequential item matches in the array. Like prefixItems, any items after repeatMax go unevaluated.

I think that would allow concat to be implemented with no other additional keywords and would mostly fit into the current architecture. The only change would be that array keyword implementations would have to take a starting index as input in addition to the array being validated. I'd have to implement it to be sure, but I think that's the only issue.

It's worth noting that concat would be a dynamic keyword because it would rely on annotation results of the schemas it processes to know how many items were evaluated and where the starting index should be for the next schema in the sequence.

Here are some examples. Sorry, I'm sticking with JSON because I find YAML difficult to read.

^ab{0,3}c$

{
  "type": "array",
  "concat": [
    {
      "prefixItems": [{ "const": "a" }]
    },
    {
      "items": { "const": "b" }, 
      "repeatMin": 0,
      "repeatMax": 3
    },
    {
      "prefixItems": [{ "const": "c" }]
    },
    { "items": false }
  ]
}

We can also define repeatMax/repeatMin to work with concat as well as items to describe more complex repetitions. concat would repeat indefinitely by default (Kleene star) like items and could be constrained by repeatMax/repeatMin.

^(ab){0,3}$

{
  "type": "array",
  "concat": [
    { "prefixItems": [{ "const": "a" }, { "const": "b" }] }
  ],
  "repeatMax": 3,
  "unevaluatedItems": false
}

concat can also be nested for even more complex sequences.

^a(b{1,2}c){0,2}$

{
  "type": "array",
  "concat": [
    { "prefixItems": [{ "const": "a" }] },
    {
      "concat": [
        {
          "items": { "const": "b" },
          "repeatMin": 1,
          "repeatMax": 2
        },
        { "prefixItems": [{ "const": "c" }] },
      ],
      "repeatMax": 2
    },
    { "items": false }
  ]
}

Alternation can be done with anyOf/oneOf, so there's no need for someOf.

^((a|b)c)*$

{
  "type": "array",
  "concat": [
    {
      "anyOf": [
        { "prefixItems": [{ "const": "a" }] },
        { "prefixItems": [{ "const": "b" }] }
      ]
    },
    { "prefixItems": [{ "const": "c" }] }
  ]
}

I don't want to get deep into arguing about keyword naming just yet, but I'd suggest minRepeat/maxRepeat instead of repeatMin/repeatMax to stick with current conventions for min/max keyword names. "Repeat" might not be the best word either. Maybe "matches"? I'd also suggest the name sequence instead of concat. There might be a better word than "sequence", but I think the concept of concatenating grammars will be confusing. It's also worth noting that this makes the naming of prefixItems a little awkward because the items it describes are not necessarily at the beginning of the array.

@gregsdennis
Member

gregsdennis commented Oct 9, 2022

I think this still breaks a basic tenet of how JSON Schema works: schema objects evaluate the instance independently.

For example, in the schemas above,

{ "prefixItems": [{ "const": "c" }] }

is itself dynamic because it doesn't know where to start and relies on the evaluations of previous subschemas for this information. (Note we do have keywords that rely on evaluations from other keywords, but we don't have that behavior in subschemas.)

For every other applicator that contains multiple subschemas, the schema objects operate completely independently of their siblings, thus each evaluates the entire (local) instance. concat, as proposed, breaks this idea.

This new behavior is a significant departure from the current operating model.

@handrews
Contributor

handrews commented Oct 9, 2022

@gregsdennis yes, I agree. I think this could be addressed by creating a segmentItems to use in place of prefixItems, which would be similar but would not be restricted to the prefix. It might be possible to define interactions among keywords like segmentItems, minSegmentItems, maxSegmentItems, segmentStart, segmentRepeat, etc. that would keep this within the paradigm of determining an instance location and evaluating the schema against that location as independently as current keywords operate (which is not, strictly speaking, completely independently).

This is where I think it's important to not have "operator" keywords like minRepeat, etc. inside the schema on which they operate. That is a communication channel that does not fit the current paradigm.

However, adjacent keywords impacting each other is within the current paradigm. It's not immediately clear to me whether this introduces complexity beyond what our current paradigm supports. I think the key thing is to first see if we can keep this within the paradigm at all.

@jdesrosiers
Member

I think this still breaks a basic tenet of how JSON Schema works: schema objects evaluate the instance independently.

Oh, I agree that there is an element that doesn't fit the current architecture. Sorry if I didn't make that clear enough.

This new behavior is a significant departure from the current operating model.

I'd have to implement it to decide if I think it's a significant departure or something that could be relatively easily incorporated. I honestly don't have a solution in mind, so it very well could end up being a significant change.

This is where I think it's important to not have "operator" keywords like minRepeat, etc. inside the schema on which they operate. That is a communication channel that does not fit the current paradigm.

I don't know if this was specifically in response to my comments, but I think the way I defined minRepeat/maxRepeat fits the current paradigm just fine. It works exactly like minContains/maxContains. Managing the starting index should be the only thing that doesn't fit the model.

I think the key thing is to first see if we can keep this within the paradigm at all.

100% agree, but at this point I don't think it's going to be possible to support this kind of thing without breaking the mold somewhere. I think we'll have to come up with a few ideas, determine how they don't fit the model, determine how the model could be adjusted, and decide if the break is worth it. Of course, I'll be thrilled to be wrong if someone comes up with a solution that does fit the model.

@handrews
Contributor

handrews commented Oct 9, 2022

I don't know if this was specifically in response to my comments, but I think the way I defined minRepeat/maxRepeat fits the current paradigm just fine. It works exactly like minContains/maxContains. Managing the starting index should be the only thing that doesn't fit the model.

The difference is that prefixItems defines its start index, and minItems and maxItems are defined over the whole array. We'd either have to change that (which I'd rather not do; it's nice to have simple, straightforward keywords for the more common case), or use a segmentItems, minSegment, and maxSegment that determine their start index and length of scope based on adjacent keywords.

It's only the interaction of either concat or minRepeat/maxRepeat with prefixItems/minItems/maxItems or other keywords that define their relationship to array item positions that is problematic.

Currently, the applicator behavior is defined in §7.5 as:

From [the root], keywords known as applicators are used to determine which additional schemas are applied. Such schemas may be applied in-place to the current location, or to a child location.

The schemas to be applied may be present as subschemas comprising all or part of the keyword's value. Alternatively, an applicator may refer to a schema elsewhere in the same schema document, or in a different one. The mechanism for identifying such referenced schemas is defined by the keyword.

So if a segmentItems keyword read its starting point and repeat conditions from adjacent keywords, that would probably fit within the current paradigm. We can think of current keywords like this:

  • items reads its starting point from prefixItems and is repeated until the instance ends
  • unevaluatedItems computes its starting point from adjacent and dynamic subscope prefixItems, items, unevaluatedItems and contains, and repeats until the end of the instance except that it skips indices after the starting point that were covered by contains

We do not have a direct precedent for a minSegment/maxSegment that would take a starting point from which to count, but that feels like it probably fits well enough.

concat might be more problematic, because it's fine for a keyword to determine its starting point based on adjacent or dynamic subscope information (as noted above, we already do this), but determining the starting point of an entire subschema based on the output of another subschema, and then determining the starting point of segmentItems, etc. relative to the point determined via concat... that's quite different. The segmentItems would require information from sibling schemas by way of the parent concat keyword, and that is not a communication path we currently allow.

We do have a dynamic parent lookup behavior that $dynamicRef uses, but that's a very different context, and does not encompass the behavior of routing information from one sibling subschema to another. Even if/then/else is still just communication between adjacent keywords, not adjacent schema objects.

@gregsdennis
Member

unevaluatedItems computes its starting point from adjacent and dynamic subscope

Scope! Yes! That's what feels weird.

Taking @jdesrosiers' example:

{
  "type": "array",
  "concat": [
    {
      "prefixItems": [{ "const": "a" }]
    },
    {
      "items": { "const": "b" }, 
      "repeatMin": 0,
      "repeatMax": 3
    },
    {
      "prefixItems": [{ "const": "c" }]
    },
    { "items": false }
  ]
}

In this schema, the scope (adjacent or dynamic) of /concat/2 is only itself. It can't see what its siblings do because those schemas are out of its scope.

@awwright
Member Author

awwright commented Oct 10, 2022

Yeah, concat does use subschemas in a fundamentally different way than any other "applicator" keyword. It's not just testing subschemas against some single instance then aggregating the results: Here, the subschemas each affect each other.

Here's another way to think about concat: "concat" asserts that there is some way to split the instance apart into segments (into multiple arrays of any length), so that each segment validates against the corresponding subschema in concat.

Depending on what you know about the subschemas you can optimize this process, potentially as good as O(n), but worst case (when the subschemas are completely opaque), you'll need a recursive algorithm in O(n^m) time (instance length raised to the subschema count—yikes!).
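
To see where that blow-up comes from, consider a concat whose split point is ambiguous (same hypothetical syntax as above):

{
  "type": "array",
  "concat": [
    { "items": { "type": "integer" } },
    { "items": { "type": "integer" }, "minItems": 1 }
  ]
}

For the instance [1, 2, 3] there are three candidate splits ([] + [1,2,3], [1] + [2,3], [1,2] + [3]), and a validator that treats the subschemas as opaque has to try each boundary; with m subschemas the candidate segmentations multiply at every boundary.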

@jdesrosiers
Member

Yeah, concat does use subschemas in a fundamentally different way than any other "applicator" keyword. It's not just testing subschemas against some single instance then aggregating the results: Here, the subschemas each affect each other.

Right. I think that's effectively what @handrews was trying to say, but I didn't entirely follow the argument. Usually a keyword that uses an annotation as input would look that annotation up first and then do its work. It's unprecedented, but is it really significantly different if it looks up annotations while it's doing its work, after it processes each sub-schema?

That's the case of the sub-schemas passing the new starting point back to concat. I think that should be fine. The problem is concat passing the initial starting point to the sub-schemas. That communication can't be done through annotations as far as I can see. That's where the architectural break would be. @handrews, if that's what you were saying, then we're in agreement.

@handrews, I think what made it hard for me to follow your last comment was that you introduced some keywords and I wasn't clear how you intended them to be defined and used. Maybe you can provide an example.

@handrews
Contributor

@jdesrosiers

@handrews, I think what made it hard for me to follow your last comment was that you introduced some keywords and I wasn't clear how you intended them to be defined and used. Maybe you can provide an example.

(scrolls back) oh, huh. I thought I'd written one out but I didn't. It's probably in a half-written comment in another tab somewhere. I'll do that shortly.

The problem is concat passing the initial starting point to the sub-schemas. That communication can't be done through annotations as far as I can see. That's where the architectural break would be. @handrews, if that's what you were saying, then we're in agreement.

Yeah, this is a major part of it. We sort of have an architectural channel here from $dynamicRef doing a lookup to the parent scope. But it doesn't quite fit because of how $dynamicAnchor is resource-scoped, so it's not as simple as "there's a parent annotation and you can look it up". One interpretation of Hyper-Schema's base would actually be pretty close to "there's a parent annotation and you can look it up", but no one has ever implemented that keyword (not even in the one draft-07 hyper-schema implementation that we have on the implementations page).

The concat approach also involves a sequential approach to evaluating its subschemas, which we generally have not had (although if/then/else sequences across keywords, as do other adjacent keyword dependencies). It would be another limit on parallelization and another requirement regarding maintaining state (I am actually in the process of writing up the evolution of interdependency/parallelization/state interactions as part of my response to the SDLC proposal).

I think in addition to the individual communication channels, we need to think through the larger communication patterns. To date, we have had very simple lateral-adjacent, child-to-parent, and parent-to-child communications. Combining those into a cousin-to-cousin communication, even when implemented on existing communication channels, introduces a more difficult to visualize and reason about set of interactions.

@awwright
Member Author

The problem is concat passing the initial starting point to the sub-schemas

The concat approach also involves a sequential approach to evaluating its subschemas

As far as the specification text would be concerned, there shouldn't be a need to define how communication channels work. For defining the "concat" keyword, it should suffice to say "This keyword is valid when there is any way to split the instance into segments, such that each segment is valid against the corresponding subschema." (Note that in the case of zero subschemas, only the zero-length array matches.)

This is because there are different, equally legitimate ways to implement the algorithm.

When defining how an implementation works, consider the non-deterministic cases: If I split apart a string into an array of one item per character, could I write a schema that matches exactly the same strings as /^(abc)*abd$/? Note how if you consume an "a" then a "b", the next character can potentially be either "c" or "d".

While the idea of "passing the starting point" might be used internally by an implementation, it is only an optimization: either the implementation must support backtracking, or it has to support nondeterministic results (that is, multiple logic paths being followed at the same time). And this technique won't be used by an O(n) algorithm at all, which will compute the union of the branches before consuming the input.
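
For reference, the /^(abc)*abd$/ case might look like this under the keywords sketched earlier (again, illustrative syntax):

{
  "type": "array",
  "concat": [
    {
      "prefixItems": [{ "const": "a" }, { "const": "b" }, { "const": "c" }],
      "repeatMin": 0,
      "repeatMax": null
    },
    { "prefixItems": [{ "const": "a" }, { "const": "b" }, { "const": "d" }] },
    { "items": false }
  ]
}

Given ["a", "b", "c", "a", "b", "d"], a validator can't tell after the first "a", "b" whether it's inside the repeated group or the final segment until it sees the "c" or "d"; that's exactly the nondeterminism described above.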

@handrews
Contributor

handrews commented Oct 12, 2022

@awwright

As far as the specification text would be concerned, there shouldn't be a need to define how communication channels work. ... This is because there are different, equally legitimate ways to implement the algorithm.

This is not about defining how to implement the communication; it's about whether such communication channels (or other algorithmic approaches) should exist at all, as something that JSON Schema implementations should, in general, support. That would no doubt lead to them being used in future keywords.

Adding these communication channels would have implications for parallel execution and other optimizations, just as adding runtime child-to-parent communication (for unevaluated*) and runtime parent-to-child communication (initially for $recursive*) had impacts.

This is what we need to consider before designing keywords that depend on this sort of communication channel and control flow existing.

@gregsdennis
Member

gregsdennis commented Dec 15, 2022

Haha, I just saw that I proposed something like this a couple years ago. 🤦‍♂️

It occurs to me that we may need some sort of "match A for a items, then B for b items, then C for c items... then some more items... and the last x have to match X" functionality. Not sure how to pull that off, though, without a complex definition and horrible syntax.

@jdesrosiers
Member

This proposal randomly invaded my brain the other day while I was hiking and wouldn't leave.

One problem we identified was that prefixItems and items would need to behave differently when processed by concat: they would have to have context about where the previous validation left off and treat that as their starting position when doing their own validation. Communicating that starting position is the other problem, because such dynamic keyword communications don't fit well into the JSON Schema architecture.

So, I was thinking that in order to solve these problems, all the information about what index of the array is being evaluated needs to be defined by the keyword itself. It can't delegate to items or prefixItems. That means the sub-schemas in this keyword would only describe items, not segments. The keyword needs to define the segments.

I started off with a crazy verbose syntax and simplified until I ended up with something that was reasonable to use. Mostly coincidentally, I ended up with something that mimics regular expression syntax pretty closely.

(I'm going to use a different keyword name because this is very different than the original proposal.)

Here's a schema that represents an array with any number of "a"s but always ends with a "b".

{
  "$comment": "^a*b$",

  "type": "array",
  "sequence": [
    { "const": "a" }, "*",
    { "const": "b" },
    "$"
  ]
}

This one represents a sequence of pairs.

{
  "$comment": "^(ab)*",

  "type": "array",
  "sequence": [
    [
      { "const": "a" },
      { "const": "b" },
    ], "*"
  ]
}

Here's a complex one just to show what's possible.

{
  "$comment": "^a(b{1,2}c){0,2}$",

  "type": "array",
  "sequence": [
    { "const": "a" }, 
    [
      { "const": "b" }, "{1,2}",
      { "const": "c" }
    ], "{0,2}",
    "$"
  ]
}

This approach is intuitive because it looks and works just like a regular expression. It would be very powerful, and it would be easy to introduce additional operators to add more functionality as needed in the future.

The main downside of this approach is that implementations of this keyword are essentially regular expression engines (although most likely with limited functionality). I'm sure that will prove difficult to implement properly. I implemented a very simple engine that seems to be working correctly and reasonably efficiently so far, but I would be surprised if I wasn't missing something.

I doubt that the demand for this much power in describing array items is going to be high enough for something like this to ever make it into the main specification, but it would be interesting to have as a custom vocabulary keyword.

@gregsdennis
Member

gregsdennis commented Jun 28, 2023

I think this is a good start, but I'm not a fan of the {1,2} strings.

I have an allergic reaction to magic strings. Encoding logic in strings is a step toward that. (And it's another thing we don't have precedent for.)

Also having to "interpret" intent from the type of item (e.g. a string needs to be parsed, an array is a subsequence, etc.) is hard to deal with. I have to do a lot of this in my JSON Logic implementation, and it's a real headache.

I think the more "JSON Schema" argument is that this is expressing "min iterations" and "max iterations" constraints, which should be keywords.

Though more verbose, I'd prefer this:

{
  "$comment": "^a(b{1,2}c){0,2}$",

  "type": "array",
  "sequence": [
    { "control": "start" },
    { "schema": { "const": "a" } },
    {
      "schema": {
        "sequence": [
          {
            "schema": { "const": "b" },
            "maxCount": 2
          },
          { "schema": { "const": "c" } }
        ]
      },
      "minCount": 0,
      "maxCount": 2
    },
    { "control": "end" }
  ]
}

sequence is an array of items (segments) that each have the following properties:

  • control - enum of start and end (maybe others?), probably doesn't make sense with sibling properties
  • schema - describes the items (note how it can itself have a sequence keyword to support grouping)
  • minCount - minimum number of times this segment repeats, must be an integer, default is 1
  • maxCount - maximum number of times this segment repeats, must be any/inf/none/[something else] (for *) or an integer, default is 1

control or schema must be present, but not both, and minCount/maxCount are only valid with schema.

I think this fits better into the JSON Schema paradigm, and it's just as expressive.

@jdesrosiers
Member

Your alternative is close to one of the intermediate steps I went through when iterating on this problem, but we can't nest schemas like in your example. I would expect your example to describe an array as an item rather than a segment of an array ([a, [b, b, c], [b, b, c]] rather than [a, b, b, c, b, b, c]). If we did define that pattern such that it describes a segment, then there would be no way to describe a nested array.

More importantly, we would have the problem of needing to pass the starting/ending position to/from the schema evaluation, which is the problem I was trying to solve in the first place. This is easily fixed by introducing a new sub-keyword for groupings rather than using schema with sequence. (It makes the schema more readable as well, so win-win.)

{
  "$comment": "^a(b{1,2}c){0,2}$",

  "type": "array",
  "sequence": [
    { "control": "start" },
    { "schema": { "const": "a" } },
    {
      "group": [
        {
          "schema": { "const": "b" },
          "maxCount": 2
        },
        { "schema": { "const": "c" } }
      ],
      "minCount": 0,
      "maxCount": 2
    },
    { "control": "end" }
  ]
}

This would be equivalent to what I proposed, but my problem with this approach is the use of the structured non-schema object to describe terms. It's something we've actively avoided in the past. We even replaced a keyword that does this (media => content*). My main problem with using such structures is that they can be confusing. People will confuse them with schemas. In this case I think it introduces a lot of verbosity with no benefit for the user.

I prefer the flat version. I think it's more user friendly. It's easier to write and to read. I tried to use keywords to describe min/max matches, but there wasn't a way to do it that didn't include some kind of wrapper and I didn't think that was worth the decreased usability the wrapper brings.

One of the things I really like about the flat version is that the syntax is so burdenless that it could reasonably be used to describe simple cases while also being powerful enough to describe complex cases. (Examples: prefixItems: [{...}] <=> sequence: [{...}] / items: {...} <=> sequence: [{...}, "*"])
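
Spelled out, the items equivalence would look like this (hypothetical sequence keyword as proposed above):

{ "items": { "type": "string" } }

{ "sequence": [{ "type": "string" }, "*"] }

The two should accept exactly the same arrays.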

@gregsdennis
Member

This is easily fixed by introducing a new sub-keyword for groupings rather than using schema with sequence. (It makes the schema more readable as well, so win-win.)

I'm happy with this.

my problem with this approach is the use of the structured non-schema object to describe terms

We use non-schema constructs for keyword values all the time:

  • $defs, properties, patternProperties are dictionaries (maps) of schemas, not themselves schemas.
  • allOf, anyOf, oneOf are arrays of schemas, not themselves schemas.
  • the proposed propertyDependencies is a dictionary of dictionaries of schemas, also not itself a schema.

The difference between those and this is that this also contains other data. (And actually, discriminator, externalDocs, and xml from OpenAPI 3.1 are already doing this; they're just annotations instead of assertions.) I see no problem with creating such a keyword.

"sequence": [ { "const": "b" }, "{1,2}" ]

In this example, the "{1,2}" is the next item in the sequence array. It's not in any way connected to the previous item, which specifies the items themselves. If you separate these into two items, then you have to either

  • look back when you encounter {1,2} to get the requirements of the item(s) that should repeat, or
  • look forward when you encounter a schema or subsequence to know what the repetition is (if there is any).

This unnecessarily complicates implementation, and it doesn't align with JSON Schema's notion that items are independent. (I know we're in the context of a keyword here.) Everywhere else an array is used, the items in that array are independent of each other with no context shared between them. It makes sense, then, to minimize the context shared between the items for sequence.

It's better if the repetition specifiers are directly associated with what's being repeated, which is what my proposal does.

  • schema describes the thing being repeated
  • min/maxCount describe the repetition

This way, each item in sequence can be processed completely independently with the only context being what index to start on.

One of the things I really like about the flat version is that the syntax is so burdenless

It may appear to be simpler to read/write, but as I mentioned, it's SO much more difficult to implement. (Maybe it's easier in a dynamic language like JS, but in a strongly typed language, it's very burdensome.) I could do it, but I would hate it. (This comes from experience; I don't want to go down this road again.)

Having strings (which are magic and have to be parsed), arrays (which imply subsequences), and booleans/objects (which signify schemas) is a very interpretive approach that we haven't even come close to with JSON Schema. There is no precedent for this.

I'm strongly against including something like this in the spec. Maybe as an external vocab, but not in the spec.


How would we validate the interpretive style of sequence in a meta-schema? It's fairly straightforward to write a schema to match my example (and yours with group).

{1,2} isn't a valid regex, but sequence: [ "{1,2}" ] seems like it would be perfectly fine according to what a meta-schema could validate. Trying to build a schema to fully validate this style of sequence seems like building a regex to validate a regex. It's going to be very complex, if it's even possible.

@jdesrosiers
Member

We use non-schema constructs for keyword values all the time

That's not what I meant. By "structured object", I mean an object whose property names are semantic rather than values. That doesn't include arrays or objects used like dictionaries. I'm talking about objects within a keyword that have their own keywords, but aren't schemas.

And actually, discriminator, externalDocs, and xml from OpenAPI 3.1 are already doing this; they're just annotations instead of assertions.

Those are examples of "structured objects". There's no technical reason why a keyword can't contain structured objects. It's a usability concern. It's easier (especially for beginners) to understand a schema when all objects are either a schema or a dictionary of schemas. An object property name is always either a user-provided value or a JSON Schema keyword. I'm ok with making an exception if there's a compelling reason, but IMO this doesn't clear that bar.

This unnecessarily complicates implementation

I think it's worth making things slightly more complicated for implementers if it makes authoring schemas and reading schemas easier for users. There is a good reason for it.

it doesn't align with JSON Schema's notion that items are independent. (I know we're in the context of a keyword here.) Everywhere else an array is used, the items in that array are independent of each other with no context shared between them.

I think this is an accidental property. It happens that it's true, but there's no reason it needs to be preserved.

each item in sequence can be processed completely independently with the only context being what index to start on.

In the POC implementation I did, I compile the more user-friendly flat structure into something very similar to the more implementation-friendly structure you're arguing for. That way I get the best of both worlds, easy for users and easy to evaluate.

It may appear to be simpler to read/write, but as I mentioned, it's SO much more difficult to implement. (Maybe it's easier in a dynamic language like JS, but in a strongly typed language, it's very burdensome.) I could do it, but I would hate it. (This comes from experience; I don't want to go down this road again.)

I grant you that it would be more annoying to implement in a strongly typed language, but I don't think it's hard enough that it's worth sacrificing user experience.

Having strings (which are magic and have to be parsed), arrays (which imply subsequences), and booleans/objects (which signify schemas) is a very interpretive approach that we haven't even come close to with JSON Schema. There is no precedent for this.

True, but this is a complex feature and I think it's not surprising that we'd end up with something we've never had before. As long as it fits in the JSON Schema architecture, using a more complex structure than we've used before to describe one of the most complex features ever proposed is reasonable.

I'm strongly against including something like this in the spec. Maybe as an external vocab, but not in the spec.

I mostly agree. It's not that I'm opposed to the keyword (in either form), but that it's so rarely needed that I don't see it ever having enough demand to make it into the spec.

How would we validate the interpretive style of sequence in a meta-schema?

sequence can be described using sequence! There aren't many rules that would need to be covered. Mainly, just that a "{n,m}" item needs to be preceded by an object or an array. It might get more complicated if more operators are added, but at this point, I think it shouldn't be bad at all. Even if it can't be perfectly described, I don't think that would be a serious problem. The meta-schema isn't guaranteed to perfectly capture everything that could be wrong with a schema.

@gregsdennis
Member

gregsdennis commented Jun 29, 2023

There is a good reason for it.

I disagree.

I can't back the implicit/interpretive syntax right now. It's a headache to implement and more trouble than it's worth. Sorry.

this is a complex feature and I think it's not surprising that we'd end up with something we've never had before

I could say this about "an object whose property names are semantic rather than values." What I've proposed is closer to what JSON Schema already has, and is therefore more easily digestible. There have been other proposals which have aimed to do this, and we've struck them down for the very reason you're citing: keywords don't themselves have properties.

If we're going to introduce something new, it should be something that is at least similar to what's already there, not something that's drastically different.

it doesn't align with JSON Schema's notion that items are independent. (I know we're in the context of a keyword here.) Everywhere else an array is used, the items in that array are independent of each other with no context shared between them.

I think this is an accidental property. It happens that it's true, but there's no reason it needs to be preserved.

It's absolutely an intentional property of JSON Schema!

  • prefixItems independently evaluates each item according to position in the respective arrays
  • items independently evaluates each item in the instance to the single schema it has
  • properties independently evaluates all properties by property name
  • additionalProperties/unevaluatedProperties independently evaluate all remaining properties
  • etc.

This independent evaluation has always been a foundational aspect of JSON Schema.

Items like "{1,2}" and "*" cannot be processed on their own because they have no independent meaning. But something like

{
  "schema": { "const": "b" },
  "maxCount": 2
}

is fully defined, so it can be processed independently.

@jdesrosiers
Member

It seems like we aren't going to agree on this, but that's probably ok since no one is recommending it be considered for the spec. My main concern is that we seem to disagree so passionately about which properties of JSON Schema should be considered important architectural constraints, which are norms worth preserving, and which are coincidental convergences that don't need preserving.

The bottom line for me is that I don't think the nested structured object approach is user-friendly enough. Neither approach breaks any important architectural constraints and both break norms or coincidental convergences (depending on who you ask). Which is preferred is purely a matter of personal opinion. I wish there was a path to compromise, but I don't think there's an in-between in this case.


A couple things occurred to me yesterday that I wanted to mention even tho I don't see this keyword moving forward.

It occurred to me there needs to be a way to represent alternatives. If you have something simple like a|b, you could just encode that as a single schema that uses anyOf or oneOf, but that doesn't work if the alternation involves multiple items, such as ab|cd. So, something would have to be added to the syntax of the keyword to address this case. In the flat version we can add a | operator, and the nested version can add an alternatives sub-keyword. (This is probably the problem Austin was trying to solve with someOf, and we missed the significance.)

The other thing that occurred to me is that error reporting for this keyword is probably going to be awkward. Generally, in an applicator like this, I'd report any sub-schema failures that occurred if the keyword fails, but that could get excessive (and likely unhelpful) quickly depending on the pattern defined in the keyword. For example, an item might be validated by several different schemas in the evaluation process. Some of those failures might be normal and don't need to be reported. I don't think that finding which failures are relevant would be an easy task, but I haven't put much thought into it yet.

@gregsdennis
Member

gregsdennis commented Jul 7, 2023

In thinking about implementing this, I tried to think in terms of regex (like in the examples we've been using), and I think I've come across a problem with my proposal.

^a*(ab)+$

The process here is:

  1. validate the start
  2. validate as few a as possible until ab can be validated next
  3. validate as few ab as possible, requiring at least 1, until the end can be validated next
  4. validate the end

This demonstrates that each sequence entry can't be evaluated completely on its own, and the process has to be orchestrated by the overall sequence keyword.

So my item-subschema independence argument is moot.


Another problem is that my model can't represent * or + (repeats with no upper bound) with just maxCount. A new property would have to be introduced (or allow maxCount: null, which feels wrong).

Trying to build this using my model would be

{
  "type": "array",
  "sequence": [
    { "control": "start" },
    {
      "schema": { "const": "a" },
      "minCount": 0,
      "maxCount": ?? // if the default is 1, then how to you represent unlimited?
    },
    {
      "schema": { "const": "ab" },
      // min count defaults to 1
      "maxCount": ?? // if the default is 1, then how to you represent unlimited?
    },
    { "control": "end" }
  ]
}

I'm still not really a fan of the implicit syntax because of my need for strongly-typed-everything, but if we can be explicit about the grammar, I think I might be persuaded.

  • objects/boolean MUST be interpreted as subschemas
  • arrays MUST be interpreted as groups/subsequences
  • strings MUST be controls, and only the following are allowed:
    • ^
    • *
    • +
    • ?
    • $
    • form of {1,2}
  • other types (just numbers/null) are invalid

I think this can be validated with a meta-schema easily enough using an anyOf (since they're mutually exclusive by type, we don't need oneOf). For the strings, I expect the single-char ones could be in an enum, and the repeat count string can be validated with pattern, and both of these listed separately in the anyOf. (If you prefer to use if/then, sure, but I don't think it's necessary.)
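
A rough sketch of that component check (names hypothetical; validating the subschemas themselves against the meta-schema is elided):

{
  "$defs": {
    "sequenceComponent": {
      "anyOf": [
        { "type": ["object", "boolean"] },
        { "type": "array", "items": { "$ref": "#/$defs/sequenceComponent" } },
        { "enum": ["^", "*", "+", "?", "$"] },
        { "type": "string", "pattern": "^\\{[0-9]+(,[0-9]*)?\\}$" }
      ]
    }
  }
}

Objects and booleans are treated as subschemas, arrays as groups, and strings are limited to the single-character controls or the {1,2}-style repeat form.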

That takes care of the components, but we may also need to specify which components can come in which order. Maybe. I'm not sure. I mean, ^a^a$b$ is technically a valid regex, but it won't match anything. We have that kind of thing in JSON Schema as well. Maybe an "invalid" sequence is just another analog of the false schema.

I'd also like us to specify the algorithm I laid out in the first section of this comment, or something similar.

Am I missing anything, @jdesrosiers?

@jdesrosiers
Member

validate as few a as possible until ab can be validated next

I think it's more complicated than that. Regular expression terms are said to match greedily, meaning they match as many as possible, not as few as possible. When I put your example into regex101.com, it describes * as "matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)". However, I think Austin's description is technically more accurate: "This keyword is valid when there is any way to split the instance into segments, such that each segment is valid against the corresponding subschema."

^a*(ab)+$

This example exposes a problem in my initial implementation (😢). It doesn't handle the "giving back as needed" part properly. This forced me to do a little reading on the theory involved. If I understand correctly, this is the difference between deterministic finite automata (DFA) and non-deterministic finite automata (NFA).

I reimplemented using the Thompson Construction algorithm described in that article to compile to an NFA and used the recommended search algorithm. That worked out well, but had a few consequences. I think these are general regexp properties and not specific to this algorithm, but I'm not 100% sure about that.

This approach assumes that the expression is bounded, so it's not easy to implement explicit bounding with ^ and $. I decided to embrace the implicit bounding. I already implemented an implicit start at the beginning of the array because I think it would be pretty unusual not to start at the beginning of the array. Leaving the pattern open-ended is only useful if you need to extend the pattern later, but there's no way to do that, so an implicit close doesn't seem like much of a loss. An unbounded start or end can still be expressed with true, "*" if needed.
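
For example, an array that just has to contain an "a" somewhere (an unanchored match) could be written as:

{
  "type": "array",
  "sequence": [
    true, "*",
    { "const": "a" },
    true, "*"
  ]
}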

I also found the bounded repetition operators ({n}/{n,}/{n,m}) difficult to implement with this algorithm, so I decided to skip them as well. They're convenient for reducing duplication, but there's nothing you can't express without them.

Leaving these things out feels lazy, but it's also a feature. Sticking to what's easy to implement makes it more likely others will choose to implement it.

I'm still not really a fan of the implicit syntax because of my need for strongly-typed-everything

This shouldn't be an issue for strongly typed languages in general, just ones that don't have support for union types. It's been a while since I've worked with C#, but I think it falls in that category.

if we can be explicit about the grammar, I think I might be persuaded.

Great! The rules you laid out were what I had in mind as well. (However, you've left out support for alternation (|).)

That takes care of the components, but we may also need to specify which components can come in which order.

That's exactly the kind of thing this keyword would be able to express! Here's what I came up with. (I thought of an alternate keyword name that I prefer over sequence and use in the schema below: itemPattern.)

"$defs": {
  "itemPattern": {
    "type": "array",
    "itemPattern": [
      [ 
        { 
          "if": { "type": "array" },
          "then": { "$ref": "#/$defs/itemPattern" },
          "else": { "$dynamicRef": "meta" }
        },
        { "enum": ["?", "*", "+"] }, "?",
        "|",
        { "const": "|" }
      ], "*"
    ]
  }
}

I'd also like us to specify the algorithm I laid out in the first section of this comment, or something similar.

I'd rather not specify an algorithm. I think these are well defined concepts that we just need to reference. I don't know this domain well enough to do that yet, but I think that should be sufficient.

@gregsdennis
Member

However, you've left out support for alternation (|).

Wouldn't that already be supported with oneOf inside a schema?

@jdesrosiers
Member

It occurred to me at one point that oneOf/anyOf only solve alternation when a single item in the array can be one thing or another. If the alternation involves more than one item, oneOf/anyOf can't handle that case, because they can only operate on one instance value at a time. I brought this up in a previous comment:

If you have something simple like a|b, you could just encode that as a single schema that uses anyOf or oneOf, but that doesn't work if the alternation involves multiple items, such as ab|cd.

Hopefully that makes sense. I didn't notice it until I wrote the meta-schema for this keyword and realized I needed something I couldn't express with anyOf/oneOf.
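
To make it concrete, ab|cd over array items might be written like this with the proposed | operator:

{
  "type": "array",
  "sequence": [
    [
      { "const": "a" }, { "const": "b" },
      "|",
      { "const": "c" }, { "const": "d" }
    ]
  ]
}

This would accept ["a", "b"] or ["c", "d"], which no combination of anyOf/oneOf on individual items can express.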
