Clarify which regular expression production rule from the ECMA specification is intended #821

Closed
Julian opened this issue Nov 29, 2019 · 37 comments · Fixed by #965

Comments

@Julian
Member

Julian commented Nov 29, 2019

Via @leadpony in json-schema-org/JSON-Schema-Test-Suite#309 [the context being "what behavior is expected if one uses \a, i.e. ASCII BEL, within a regular expression"]:

There exist 2 regular expression patterns defined in the ECMA-262 specification.

21.2 RegExp (Regular Expression) Objects
B.1.4 Regular Expressions Patterns

Note that the latter pattern is a part of B. Additional ECMAScript Features for Web Browsers.

Neither pattern allows a as a control escape letter, but the final consequences seem to differ slightly.
In the latter pattern, \a is converted to the letter a, as Node.js does, because the letter is recognized as part of an IdentityEscape.
In the former pattern, I believe a syntax error should be thrown, because a has the ID_Continue Unicode property and cannot be handled as an IdentityEscape.

IdentityEscape::
SourceCharacter but not UnicodeIDContinue

The JSON Schema Specification does not state which pattern should be followed, therefore we cannot make the test without some ambiguity.

We should clarify which behavior is intended, and therefore whether such a string is or is not a valid regex for the test suite.
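
A minimal TypeScript sketch of the difference described above, assuming a web-compatible engine such as Node.js that implements the Annex B grammar. The helper name `compileStatus` is hypothetical. Mainstream engines do not expose the stricter non-Annex-B grammar without flags, so the `u` flag stands in for it here: Unicode mode also rejects \a as an invalid identity escape, although its IdentityEscape rule is not identical to the non-Unicode rule in 21.2.

```typescript
// Hypothetical helper: reports whether a pattern compiles under the given flags.
function compileStatus(source: string, flags: string): string {
  try {
    new RegExp(source, flags);
    return "ok";
  } catch (e) {
    return e instanceof Error ? e.name : "unknown";
  }
}

// No flags: engines with Annex B treat \a as an identity escape for "a".
const legacyStatus = compileStatus("\\a", "");
const legacyMatchesA = new RegExp("^\\a$").test("a");

// "u" flag: only syntax characters and "/" may be identity-escaped,
// so \a is rejected at compile time.
const unicodeStatus = compileStatus("\\a", "u");

console.log({ legacyStatus, legacyMatchesA, unicodeStatus });
```

Whether a test suite should treat \a as valid therefore depends on which of the two grammars an implementation is expected to follow.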

@handrews
Contributor

@Julian
Member Author

Julian commented Jan 14, 2020

Aha, that certainly seems relevant. The remaining confusion may be that this section doesn't appear in the version of the ECMA-262 spec that's linked: the section from your PR looks like it refers to the 2011 ECMA spec, but the current JSON Schema spec links to the 2018 one. Presumably the intention is the corresponding section in the newer version of the spec? That probably answers the original question here (I think it means \a should be an error, but I will have to double check).

@handrews handrews added this to the draft-08-patch1 milestone Jan 15, 2020
@Relequestual
Member

We should specify which "edition" of ECMA262 we are referencing.
The 2011 version was "edition 5.1" and the latest now that we link to is "edition 10".

Which version do we want to reference?

@Relequestual
Member

I think to work this out, we should consider:

  • Which version has the broadest out of the box (browser and node) support.
  • What are the changes from the version we currently intend (ES5.1) to whatever we pick now (ESWHATNOW?)

@Relequestual
Member

Interestingly, OAS 3.0 references ECMA 262 sec-7.8.5 (https://www.ecma-international.org/ecma-262/5.1/#sec-7.8.5). A literal includes flags.

@MikeRalphson
Contributor

In the unreleased v3.0.3 this should have been fixed by PR OAI/OpenAPI-Specification#1987

@Relequestual
Member

I haven't been able to easily determine what changed in the relevant sections of the document between edition 5.1 and edition 10... at least no one is telling me, and reading to find out will be time-consuming!

@Relequestual Relequestual self-assigned this May 19, 2020

@Relequestual
Member

After discussion on the OAS TSC call (2020-07-09), no one had an idea which edition of ECMA should be referenced.

We need to find someone who knows ECMA and what might have changed between editions.

We should consider if the most recent edition has been implemented fully by major implementers.

@MikeRalphson
Contributor

This blog post indicates RegExp lookbehind assertions were added in ES2018, but caniuse only gives that a 'score' of 71.65% of global users having that support in browsers (other regex features rate 95%+). Using that as a rough guide to tool support, perhaps therefore we should cite an edition of the ECMA spec < 2018 (whichever edition that is).

@karenetheridge
Member

also related: json-schema-org/JSON-Schema-Test-Suite#380

@Relequestual
Member

@karenetheridge so it looks like that issue was referencing ECMA edition 10 rather than 5.1 like the spec. Would you be able to evaluate if that might have caused a problem please? =]

@karenetheridge
Member

@Relequestual I don't understand the various ECMA dialects well enough to give a clear response here.

@Relequestual
Member

@Relequestual I don't understand the various ECMA dialects well enough to give a clear response here.

You and me both...
(And just about everyone)

@ljharb

ljharb commented Jul 14, 2020

Hi there! I'm an editor of ecma-262.

262 is a living standard, so I'd strongly suggest not linking to an outdated snapshot like https://www.ecma-international.org/ecma-262/11.0/ or https://tc39.es/ecma262/2020/, but instead, to link to https://tc39.es/ecma262/ which is the latest draft.

As it relates to implementations, everything post-ES2015 must have at least two implementations to qualify for stage 4, and inclusion into the standard - so anything in any of those links has at least two implementations, and for major features like regex stuff, two browsers.

Personal regex opinion: my personal opinion for JSON Schema is that all regexes should always be in `u` mode, since that handles Unicode text properly and is what people usually want, but I lack the context to argue it beyond that.
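
As a rough illustration of what handling Unicode text properly means in `u` mode, here is a small TypeScript sketch. The variable names are made up for this example. It assumes a character outside the Basic Multilingual Plane (U+1F600), which a JavaScript string stores as a surrogate pair.

```typescript
// U+1F600 written as its two UTF-16 code units (a surrogate pair).
const astral = "\uD83D\uDE00";

// The string holds one code point but two code units.
const codeUnits = astral.length;

// Without "u", "." matches a single code unit, so "^.$" cannot span the pair.
const dotWithoutU = new RegExp("^.$").test(astral);

// With "u", "." matches a whole code point.
const dotWithU = new RegExp("^.$", "u").test(astral);

console.log({ codeUnits, dotWithoutU, dotWithU });
```

The same pattern gives different answers for the same input depending on the flag, which is why the choice of flags matters for a `pattern` keyword.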

@Relequestual
Member

Thanks for commenting @ljharb
We're wanting to reference a specific section of ecma-262. The location changes between versions.
Additionally, we have tests.

While the current latest may have a set of requirements for regex, say the next version adds something new... the tests will become incorrect.

We only want to reference the specific version to nail down exact requirements for regular expression compliance.
Do you know how other specifications that want to nail down regular expression support requirements handle this?

Thanks

@ljharb

ljharb commented Jul 16, 2020

All URLs in the spec should work more or less forever (cool URLs don’t change); if any have changed for you, please let me know and we’ll restore them. Additionally, spec links since 2015 have all been named and not numbered, so they should remain correct no matter what.

@awwright
Member

awwright commented Jul 21, 2020

@ljharb Way ahead of you on the Unicode support: json-schema-org/JSON-Schema-Test-Suite#264

Even though JSON Schema suggests limiting patterns to ECMAScript-compatible regular expressions, this is not intended to constrain the encoding of the string; JSON only decodes to a string of Unicode code points.

... Ultimately though, we had to move those tests to an optional suite because it turns out .NET doesn't support multibyte characters (surrogate pairs) in regular expressions, at all.


We've had poor experiences with "living standards" on the whole; and the IETF process, and most of us it looks like, strongly prefer publishing documents whose meaning doesn't change over time.

My own reasoning is since this is a normative reference, it's more akin to incorporating that specification piece entirely, not merely referring to it; and our selection of ECMA-262 is aiming to establish a least-common-denominator syntax likely to be supported across the hundreds of various regular expression dialects & implementations.

(If supporting the latest features were the goal, we would point to the IANA Media Type Registry instead, so as to always point to the latest publication—if there were a media type to reference, that is.)

@handrews
Contributor

We've had poor experiences with "living standards" on the whole; and the IETF process, and most of us it looks like, strongly prefer publishing documents whose meaning doesn't change over time.

+1000000000000000

establish a least-common-denominator syntax

Yeah exactly. In practice, the vast majority of implementations use whatever regex library is common in their language. And if people want to get ultra-precise about regex features, that's what extension vocabularies are for.

You could add patternEcma262Latest and use a meta-schema that requires support for it (and if not, requires the implementation to error out). That is one of the use cases for vocabularies: standards that have too many competing options for us to ever satisfy everyone.

@ljharb

ljharb commented Jul 21, 2020

That’s fine; but a) I’d encourage you to rapidly update the normative reference every June to the latest edition, so you stay up to date intentionally; b) if you can point me to which parts of 262 you depend on, i can (as editor) add an editor’s note to make sure they’re not changed without at least pinging stakeholders such as yourselves.

@Relequestual
Member

Appreciate the offer there, but I believe the way we DO references is by number, with the actual link to just the specification itself, as opposed to deep linking directly (allowing for multiple references).

Unless anyone gives us a compelling reason to reference the latest version of ecma262, I'm going to update our reference to point to the "historic" 5.1 version of ecma262 to avoid ambiguity.

@ljharb

ljharb commented Aug 10, 2020

I'd strongly suggest not doing that, since the spec around JSON changed in ES2019 (tc39/ecma262#1396 and tc39/ecma262#1188 in particular).

I think it is a very unwise idea to point to obsolete snapshots of any specification, and I really hope you don't choose to do so.

@awwright
Member

@Relequestual I think it's good advice, is there any reason NOT to link to the latest snapshot?

We reference the document, of course, but the URL can still be to the section/paragraph in question.

@Relequestual
Member

Relequestual commented Aug 11, 2020

@awwright I assume you don't mean "the latest snapshot URL", which would mean we would point to a living standard document (which can be updated), which, as you said earlier, we do not want.

We can point to the current latest version (11th edition (June 2020)), I have no problem with that, but I don't see any compelling reason why we should do so. While specific regex dialect support is optional as per the spec, it's going to be hard for anyone to be fully compliant, especially for browser based tooling as noted.

If we DO link to the latest and greatest, are we happy to accept the fact that 100% supporting all the optional parts of the spec, namely correct regex support, just simply won't be possible right now?

@Relequestual
Member

/remind me to merge the PR and close this issue if there's been no discussion in 7 days!

@reminders reminders bot added the reminder label Sep 27, 2020
@reminders

reminders bot commented Sep 27, 2020

@Relequestual set a reminder for Oct 4th 2020

@Relequestual
Member

After some responses to a tweet, it looks reasonable to move to the current latest version.

Here's a list of "finished proposals" for tc39: https://github.com/tc39/proposals/blob/master/finished-proposals.md

It looks like the two relevant changes are:

Both of which were published as part of the 2018 version.

I conclude that moving to the latest version makes sense given general support, and given support for the specific syntax is optional anyway.

@Relequestual
Member

Lookbehind assertions are currently not supported by Safari (both desktop and mobile): https://caniuse.com/js-regexp-lookbehind
Unicode property escapes are currently supported by all major browsers: https://caniuse.com/mdn-javascript_builtins_regexp_property_escapes
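
For tooling that has to run on engines with differing support, a feature check along these lines is one option. This is only a sketch: `canCompile` and `features` are hypothetical names, and it relies on the fact that an engine lacking a syntax feature throws a SyntaxError when asked to compile a pattern that uses it.

```typescript
// Hypothetical helper: true if the engine can compile the pattern.
function canCompile(source: string, flags: string = ""): boolean {
  try {
    new RegExp(source, flags);
    return true;
  } catch (_e) {
    return false;
  }
}

const features = {
  // Lookbehind assertion, e.g. "b" preceded by "a".
  lookbehind: canCompile("(?<=a)b"),
  // Unicode property escape for the General_Category "L" (letters); needs "u".
  unicodePropertyEscapes: canCompile("\\p{L}", "u"),
};

console.log(features);
```

The result depends entirely on the engine running the check, so no particular output is implied here.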

@awwright
Member

Apparently the u flag in ECMAScript makes some regular expressions invalid, specifically, unnecessary escapes: tdegrunt/jsonschema#311
We’re going to have to define which flags to evaluate the regex with, if we reference ECMAScript.
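
A small TypeScript sketch of the unnecessary-escape problem, assuming an Annex B engine such as Node.js. The helper names and the sample patterns are made up for illustration and are not taken from the linked issue. It filters for patterns that compile without flags but are rejected once the `u` flag is added.

```typescript
// Hypothetical helper: true if the pattern compiles under the given flags.
function compiles(source: string, flags: string): boolean {
  try {
    new RegExp(source, flags);
    return true;
  } catch (_e) {
    return false;
  }
}

// Patterns accepted without flags but rejected in Unicode mode.
function rejectedOnlyInUnicodeMode(patterns: string[]): string[] {
  return patterns.filter((p) => compiles(p, "") && !compiles(p, "u"));
}

// "\-" escapes a character that needs no escape outside a character class;
// "\." escapes a syntax character, which Unicode mode still allows.
const candidates = ["\\-", "\\.", "a-b"];
const affected = rejectedOnlyInUnicodeMode(candidates);

console.log(affected);
```

A schema author's existing patterns can therefore stop compiling if an implementation switches to `u` mode, which is the compatibility cost of fixing the flags.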

@Relequestual
Member

@awwright We already previously reference ECMAScript. That hasn't changed.

@ljharb

ljharb commented Sep 30, 2020

It would also be ideal to only ever use the u flag (not sure if you're doing that already).

Relequestual added a commit that referenced this issue Nov 4, 2020
Reference ECMAScript 11.0 specifically (#821)
Julian added a commit to json-schema-org/JSON-Schema-Test-Suite that referenced this issue Jun 15, 2022
Draft 2020-12 now references the ECMA 262 11.0 specification, and
thereby does not allow \a.

Refs: json-schema-org/json-schema-spec#821
Closes: #309