Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Unicode] Use Unicode 9’s ID_Start & ID_Continue for identifiers #1208

Closed
mathiasbynens opened this issue Jun 28, 2016 · 13 comments
Closed
Assignees
Milestone

Comments

@mathiasbynens
Copy link
Contributor

mathiasbynens commented Jun 28, 2016

  • Unicode 8 has 109,830 ID_Start symbols; Unicode 9 has 117,007, i.e. 7,177 more (no removals).
  • Unicode 8 has 112,352 ID_Continue symbols; Unicode 9 has 119,691, i.e. 7,339 more (no removals).

E.g. these should not throw per Unicode 9:

Function('var \u{1E943}'); // new ID_Start
Function('var _\u{1E959}'); // new ID_Continue

I’ve attached a tarball containing results.js which contains the full list of new ID_Start and ID_Continue symbols in Unicode 9, and the Node.js script used to generate it.

unicode-9-identifiers.tar.gz


See also:

@dilijev dilijev added the Bug label Jun 28, 2016
@dilijev
Copy link
Contributor

dilijev commented Jun 28, 2016

I'll add that ECMA 2017 has a stipulation for which Unicode Standard version must be supported:

https://tc39.github.io/ecma262/#sec-conformance

A conforming implementation of ECMAScript must interpret source text input in conformance with the Unicode Standard, Version 8.0.0 or later and ISO/IEC 10646.

No reason we couldn't add this but it's not a high priority under the current version of the spec.

@bterlson Does that seem like a fair assessment?

@mathiasbynens
Copy link
Contributor Author

The intent is to use the latest available Unicode version. See tc39/ecma262#620 for some discussion. @bterlson Perhaps it would be clearer to merge that PR now, and then update to refer to the (version-less) latest version in a future PR?

@bterlson
Copy link
Contributor

@mathiasbynens It's only a few weeks until we have the final consensus on version-less latest reference :) But I see your point. We should be tracking Unicode 9 at this point.

@dilijev dilijev self-assigned this Nov 11, 2016
@dilijev dilijev added this to the 1.4 milestone Nov 11, 2016
@dilijev
Copy link
Contributor

dilijev commented Nov 11, 2016

@bterlson I'm going ahead with this change, but just wanted to follow up about whether consensus was reached?

@dilijev dilijev changed the title Use Unicode 9’s ID_Start & ID_Continue for identifiers [Unicode] Use Unicode 9’s ID_Start & ID_Continue for identifiers Nov 11, 2016
@dilijev
Copy link
Contributor

dilijev commented Nov 11, 2016

@mathiasbynens I took your result script and added the following lines:

console.log("number of new ID_Start symbols:    " + new_ID_Start.length);
console.log("number of new ID_Continue symbols: " + new_ID_Continue.length);
console.log(new_ID_Continue.length - new_ID_Start.length);

And see the following counts:

number of new ID_Start symbols:    7177
number of new ID_Continue symbols: 7339
162

That doesn't quite match with the numbers included in your original post, so I'd like to clarify to make sure we're on the same page.

If I'm not mistaken, all ID_Start characters can be used as ID_Continue, but not all ID_Continue can be used as ID_Start. (If I understand correctly, ID_Continue would include numerals, which cannot be used as ID_Start.) It looks like the ES standard specifies using these Unicode character classes as they are.

(Side note: I find it interesting that it falls into the realm of the Unicode standard to define what characters can be used for identifiers, given that it seems like a language-specific implementation detail.)

@mathiasbynens
Copy link
Contributor Author

mathiasbynens commented Nov 14, 2016

@dilijev The numbers I posted were wrong, indeed. The script and the output it produces (and the numbers you logged) are correct. Here’s how I confirmed this:

> require('unicode-8.0.0/Binary_Property/ID_Start/code-points.js').length
109830

> require('unicode-9.0.0/Binary_Property/ID_Start/code-points.js').length
117007

> 117007 - 109830 // the script verifies there are no removals, so this is the # of new entries
7177

> require('unicode-8.0.0/Binary_Property/ID_Continue/code-points.js').length
112352

> require('unicode-9.0.0/Binary_Property/ID_Continue/code-points.js').length
119691

> 119691 - 112352 // the script verifies there are no removals, so this is the # of new entries
7339

I’ve updated the top post accordingly.

(Side note: I find it interesting that it falls into the realm of the Unicode standard to define what characters can be used for identifiers, given that it seems like a language-specific implementation detail.)

Interestingly, ES5 defined a list of Unicode General_Category values whose characters were allowed in IdentifierStart and IdentifierPart. ES6 / ES2015 simplified by referring to ID_Start / ID_Continue instead.

@dilijev
Copy link
Contributor

dilijev commented Nov 15, 2016

Marking as External.

It looks like compliance with a particular Unicode version is an external library issue on all platforms.

On Windows we use
ABI::Windows::Data::Text::IUnicodeCharactersStatics::IsIdStart
ABI::Windows::Data::Text::IUnicodeCharactersStatics::IsIdContinue
(See lib\Runtime\PlatformAgnostic\Platform\Windows\UnicodeText.cpp)

For non-Windows we use ICU's u_IsIDStart(ch) and u_hasBinaryProperty(ch, UCHAR_ID_CONTINUE).
(See lib\Runtime\PlatformAgnostic\Platform\Linux\UnicodeText.ICU.cpp)


Confirmed the test cases above do not work in node 6.9.1 but do work in node 7.1.0.
Confirmed they don't work in ch on Windows.
Confirmed they don't work in ch on Linux either (as @digitalinfinity mentioned below, the ICU version we're using does not support Unicode 9).

@digitalinfinity
Copy link
Contributor

Note that the different external libraries have different levels of Unicode versions supported. IIRC, the ICU version we support supports Unicode 8? Last I checked, Windows.Globalization.dll supported Unicode 6.1 or 6.2, IIRC. cc @bterlson

@dilijev
Copy link
Contributor

dilijev commented Nov 15, 2016

Hmm in that case it might not be reasonable to wait for Windows to update their APIs for Unicode -- unless we plan to just accept that Unicode compliance on Windows will lag behind.

Would it be reasonable to take a dependency on ICU for ChakraCore and/or Chakra to resolve this issue? (If we did that it would most likely be out of scope for milestone 1.4.)

ICU can be used for Intl and Unicode support (although Intl support is turned off in xplat builds at the moment.) Is classification of characters a small enough component of ICU that we could implement it in our code base as a mini-ICU component rather than waiting for the entire set of Windows / ICU Unicode and Intl functions we depend on to catch up?

@dilijev dilijev removed this from the 1.4 milestone Nov 15, 2016
@dilijev
Copy link
Contributor

dilijev commented Nov 15, 2016

Removing this from Milestone 1.4. If we wait for Windows to update the Unicode APIs, there's no action to take here. If we were to implement this functionality or change dependencies, it is too risky for this milestone.

@dilijev
Copy link
Contributor

dilijev commented Nov 16, 2016

TC39's intent is generally that the latest version of the Unicode standard should be used in the latest version of the ECMAScript standard. Removed "Pending TC39 Consensus" label.

@dilijev
Copy link
Contributor

dilijev commented Jul 11, 2017

Related #3271 #3050

@dilijev
Copy link
Contributor

dilijev commented Sep 5, 2017

Closing this issue as External: by current design these classifications are determined by calling APIs in external libraries.

We can open a new issue if we decide we want to embed the ID_START and ID_CONTINUE classifications of characters directly into ChakraCore.

@dilijev dilijev closed this as completed Sep 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants