[Unicode] Use Unicode 9’s `ID_Start` & `ID_Continue` for identifiers #1208

mathiasbynens · 2016-06-28T09:37:28Z

Unicode 8 has 109,830 ID_Start symbols; Unicode 9 has 117,007, i.e. 7,177 more (no removals).
Unicode 8 has 112,352 ID_Continue symbols; Unicode 9 has 119,691, i.e. 7,339 more (no removals).

E.g. these should not throw per Unicode 9:

Function('var \u{1E943}'); // new ID_Start
Function('var _\u{1E959}'); // new ID_Continue
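
A small probe along these lines can report whether a given engine accepts the two cases. This is a minimal sketch: `parsesAsVarDeclaration` is a hypothetical helper name, not part of the attached tarball, and it only checks that the source parses (the `Function` constructor compiles the body without running it).

```javascript
// Hypothetical helper: returns true if `var <name>` parses in this engine.
// The Function constructor only compiles the body; it does not execute it.
function parsesAsVarDeclaration(name) {
  try {
    Function('var ' + name);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything other than a parse failure is unexpected
  }
}

const cases = [
  ['\u{1E943}', 'U+1E943 as ID_Start'],
  ['_\u{1E959}', 'U+1E959 as ID_Continue'],
];
for (const [name, label] of cases) {
  const verdict = parsesAsVarDeclaration(name) ? 'accepted' : 'rejected';
  console.log(label + ': ' + verdict);
}
```

Both lines should read "accepted" on an engine whose identifier tables track Unicode 9 or later.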

I’ve attached a tarball containing results.js which contains the full list of new ID_Start and ID_Continue symbols in Unicode 9, and the Node.js script used to generate it.

unicode-9-identifiers.tar.gz

See also:

JavaScriptCore: https://bugs.webkit.org/show_bug.cgi?id=159203
SpiderMonkey: https://bugzilla.mozilla.org/show_bug.cgi?id=1282724
V8: https://bugs.chromium.org/p/v8/issues/detail?id=5155


dilijev · 2016-06-28T21:27:44Z

I'll add that ECMAScript 2017 stipulates which Unicode Standard version must be supported:

https://tc39.github.io/ecma262/#sec-conformance

> A conforming implementation of ECMAScript must interpret source text input in conformance with the Unicode Standard, Version 8.0.0 or later and ISO/IEC 10646.

There's no reason we couldn't add this, but it's not a high priority under the current version of the spec.

@bterlson Does that seem like a fair assessment?

mathiasbynens · 2016-06-29T06:36:34Z

The intent is to use the latest available Unicode version. See tc39/ecma262#620 for some discussion. @bterlson Perhaps it would be clearer to merge that PR now, and then update to refer to the (version-less) latest version in a future PR?

bterlson · 2016-06-29T23:36:21Z

@mathiasbynens It's only a few weeks until we have the final consensus on version-less latest reference :) But I see your point. We should be tracking Unicode 9 at this point.

dilijev · 2016-11-11T18:30:07Z

@bterlson I'm going ahead with this change, but I wanted to follow up: was consensus reached?

dilijev · 2016-11-11T19:50:50Z

@mathiasbynens I took your result script and added the following lines:

console.log("number of new ID_Start symbols:    " + new_ID_Start.length);
console.log("number of new ID_Continue symbols: " + new_ID_Continue.length);
console.log(new_ID_Continue.length - new_ID_Start.length);

And see the following counts:

number of new ID_Start symbols:    7177
number of new ID_Continue symbols: 7339
162

That doesn't quite match the numbers in your original post, so I'd like to clarify to make sure we're on the same page.

If I'm not mistaken, all ID_Start characters can be used as ID_Continue, but not all ID_Continue can be used as ID_Start. (If I understand correctly, ID_Continue would include numerals, which cannot be used as ID_Start.) It looks like the ES standard specifies using these Unicode character classes as they are.
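
The subset relationship can be checked directly in an engine that supports Unicode property escapes in regular expressions (an assumption; that feature postdates this thread). A rough sketch, whose counts reflect whatever Unicode version the running engine ships:

```javascript
// Assumes an engine with Unicode property escapes (\p{...} with the `u` flag).
const ID_START_RE = /^\p{ID_Start}$/u;
const ID_CONTINUE_RE = /^\p{ID_Continue}$/u;

const isIdStart = (cp) => ID_START_RE.test(String.fromCodePoint(cp));
const isIdContinue = (cp) => ID_CONTINUE_RE.test(String.fromCodePoint(cp));

let startOnly = 0;    // ID_Start but not ID_Continue (expected: none)
let continueOnly = 0; // ID_Continue but not ID_Start (digits, marks, ...)
for (let cp = 0; cp <= 0x10FFFF; cp++) {
  const s = isIdStart(cp);
  const c = isIdContinue(cp);
  if (s && !c) startOnly++;
  if (c && !s) continueOnly++;
}

console.log('ID_Start but not ID_Continue:', startOnly);
console.log('ID_Continue but not ID_Start:', continueOnly);
console.log('"1" is ID_Start:', isIdStart(0x31), '/ ID_Continue:', isIdContinue(0x31));
```

The digit `1` illustrates the asymmetry: it may continue an identifier but not start one.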

(Side note: I find it interesting that it falls into the realm of the Unicode standard to define what characters can be used for identifiers, given that it seems like a language-specific implementation detail.)

mathiasbynens · 2016-11-14T12:48:02Z

@dilijev The numbers I posted were wrong, indeed. The script and the output it produces (and the numbers you logged) are correct. Here’s how I confirmed this:

> require('unicode-8.0.0/Binary_Property/ID_Start/code-points.js').length
109830

> require('unicode-9.0.0/Binary_Property/ID_Start/code-points.js').length
117007

> 117007 - 109830 // the script verifies there are no removals, so this is the # of new entries
7177

> require('unicode-8.0.0/Binary_Property/ID_Continue/code-points.js').length
112352

> require('unicode-9.0.0/Binary_Property/ID_Continue/code-points.js').length
119691

> 119691 - 112352 // the script verifies there are no removals, so this is the # of new entries
7339

I’ve updated the top post accordingly.
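
The "no removals" check that makes the plain subtraction valid can be sketched as a set difference. This is not the script from the tarball: `diffCodePoints` and the sample arrays are hypothetical, and in practice the inputs would be the arrays loaded by the `require` calls above.

```javascript
// Hypothetical helper: compares two lists of code points.
function diffCodePoints(oldList, newList) {
  const oldSet = new Set(oldList);
  const newSet = new Set(newList);
  return {
    added: newList.filter((cp) => !oldSet.has(cp)),
    removed: oldList.filter((cp) => !newSet.has(cp)),
  };
}

// Tiny made-up samples standing in for the real Unicode 8 / Unicode 9 lists.
const sampleOld = [0x41, 0x42, 0x61];
const sampleNew = [0x41, 0x42, 0x61, 0x1E900, 0x1E943];

const { added, removed } = diffCodePoints(sampleOld, sampleNew);
console.log('added:', added.length, 'removed:', removed.length);

// Only when nothing was removed does newList.length - oldList.length
// equal the number of new entries.
const subtractionIsValid =
  removed.length === 0 && added.length === sampleNew.length - sampleOld.length;
console.log('subtraction gives the number of new entries:', subtractionIsValid);
```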

> (Side note: I find it interesting that it falls into the realm of the Unicode standard to define what characters can be used for identifiers, given that it seems like a language-specific implementation detail.)

Interestingly, ES5 defined a list of Unicode General_Category values whose characters were allowed in IdentifierStart and IdentifierPart. ES6 / ES2015 simplified by referring to ID_Start / ID_Continue instead.
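
For comparison, the two definitions of the starting character can be approximated with regular expressions. A sketch, assuming an engine with Unicode property escapes: the ES5-style pattern follows the General_Category list for `UnicodeLetter` (Lu, Ll, Lt, Lm, Lo, Nl), both patterns omit `\` escape sequences, and the variable names are illustrative only.

```javascript
// ES5 style: an explicit list of General_Category values, plus $ and _.
// \p{L} covers Lu, Ll, Lt, Lm and Lo.
const es5IdentifierStart = /^[\p{L}\p{Nl}$_]$/u;

// ES2015 style: defer to the ID_Start property, plus $ and _.
const es2015IdentifierStart = /^[\p{ID_Start}$_]$/u;

// U+2118 SCRIPT CAPITAL P has General_Category Sm (a math symbol), so the
// ES5-style list rejects it, but it is in ID_Start via Other_ID_Start.
const ch = '\u2118';
console.log('ES5-style list accepts U+2118:', es5IdentifierStart.test(ch));
console.log('ID_Start accepts U+2118:', es2015IdentifierStart.test(ch));
```

Deferring to the derived property means cases like this are handled by the Unicode data itself rather than by a category list in the language spec.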

dilijev · 2016-11-15T00:46:03Z

Marking as External.

It looks like compliance with a particular Unicode version is an external library issue on all platforms.

On Windows we use
ABI::Windows::Data::Text::IUnicodeCharactersStatics::IsIdStart
ABI::Windows::Data::Text::IUnicodeCharactersStatics::IsIdContinue
(See lib\Runtime\PlatformAgnostic\Platform\Windows\UnicodeText.cpp)

For non-Windows we use ICU's u_isIDStart(ch) and u_hasBinaryProperty(ch, UCHAR_ID_CONTINUE).
(See lib\Runtime\PlatformAgnostic\Platform\Linux\UnicodeText.ICU.cpp)

Confirmed the test cases above do not work in node 6.9.1 but do work in node 7.1.0.
Confirmed they don't work in ch on Windows.
Confirmed they don't work in ch on Linux either (as @digitalinfinity mentioned below, the ICU version we're using does not support Unicode 9).

digitalinfinity · 2016-11-15T00:53:41Z

Note that the external libraries support different Unicode versions. IIRC, the ICU version we use supports Unicode 8, and last I checked, Windows.Globalization.dll supported Unicode 6.1 or 6.2. cc @bterlson

dilijev · 2016-11-15T01:13:38Z

Hmm, in that case it might not be reasonable to wait for Windows to update its Unicode APIs, unless we plan to just accept that Unicode compliance on Windows will lag behind.

Would it be reasonable to take a dependency on ICU for ChakraCore and/or Chakra to resolve this issue? (If we did that it would most likely be out of scope for milestone 1.4.)

ICU can be used for Intl and Unicode support (although Intl support is turned off in xplat builds at the moment). Is character classification a small enough component of ICU that we could implement it in our code base as a mini-ICU component, rather than waiting for the entire set of Windows / ICU Unicode and Intl functions we depend on to catch up?
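
As a rough idea of what an embedded classifier could look like: a sorted table of code point ranges plus a binary search. This is only a sketch, written in JavaScript for brevity rather than in ChakraCore's C++. The table is a tiny hand-picked excerpt (ASCII letters and the Adlam letters U+1E900..U+1E943), not the real ID_Start data, which would be generated from the Unicode Character Database, and all names are hypothetical.

```javascript
// Illustrative excerpt only: sorted, non-overlapping [first, last] ranges.
const ID_START_RANGES = [
  [0x41, 0x5A],       // A-Z
  [0x61, 0x7A],       // a-z
  [0x1E900, 0x1E943], // Adlam letters
];

// Binary search over the range table; returns true if cp falls in a range.
function inRanges(ranges, cp) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [first, last] = ranges[mid];
    if (cp < first) hi = mid - 1;
    else if (cp > last) lo = mid + 1;
    else return true;
  }
  return false;
}

console.log(inRanges(ID_START_RANGES, 0x1E943)); // inside the Adlam range
console.log(inRanges(ID_START_RANGES, 0x31));    // digit 1, not in the table
```

Updating to a new Unicode version would then mean regenerating the table, independent of the platform libraries.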

dilijev · 2016-11-15T22:00:31Z

Removing this from Milestone 1.4. If we wait for Windows to update the Unicode APIs, there's no action to take here. If we were to implement this functionality or change dependencies, it is too risky for this milestone.

dilijev · 2016-11-16T00:19:21Z

TC39's intent is generally that the latest version of the Unicode standard should be used in the latest version of the ECMAScript standard. Removed "Pending TC39 Consensus" label.

dilijev · 2017-07-11T22:27:30Z

Related #3271 #3050

dilijev · 2017-09-05T18:00:53Z

Closing this issue as External: by current design these classifications are determined by calling APIs in external libraries.

We can open a new issue if we decide we want to embed the ID_Start and ID_Continue classifications of characters directly into ChakraCore.

dilijev added the Bug label Jun 28, 2016

dilijev added the Pending TC39 Consensus label Jun 29, 2016

dilijev self-assigned this Nov 11, 2016

dilijev added this to the 1.4 milestone Nov 11, 2016

dilijev changed the title ~~Use Unicode 9’s ID_Start & ID_Continue for identifiers~~ [Unicode] Use Unicode 9’s ID_Start & ID_Continue for identifiers Nov 11, 2016

dilijev added the External label Nov 15, 2016

dilijev removed this from the 1.4 milestone Nov 15, 2016

dilijev removed the Pending TC39 Consensus label Nov 16, 2016

dilijev modified the milestone: Backlog Dec 17, 2016

dilijev modified the milestones: vNext, Backlog Jul 11, 2017

This was referenced Jul 11, 2017

Multiple characters not treated as valid variable part #3271

Closed

Unicode: Many characters incorrectly treated as whitespace #3050

Open

dilijev closed this as completed Sep 5, 2017
