Allow unicode letters in identifiers #716

gurdiga · 2012-10-28T17:56:38Z

Hey Anton!

This patch is re issue #301.

I’m writing an law-specialized app and I’ve chosen to write them in Romanian because it would have been more difficult to translate the domain terms. :o) This is the reason why I need jshint to validate my code with Unicode letters in identifiers.

So here is what I did:

Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter, Lowercase' categories from fileformat.info. However, because I could not find a way to encode 5-digit (ECMA allows only 4) Unicode escape sequence, I had to leave out 1016 out of 3192 Unicode letters.

Please take a look and let me know what you think.

michaelficarra · 2012-10-28T18:26:02Z

https://github.com/tolmasky/language/blob/53f75464902bae56712f79a30015a0b87336a17c/languages/JavaScript.language#L656-665

gurdiga · 2012-10-28T18:51:46Z

@michaelficarra Oh, nice! :) Seems like UnicodeLetter covers more than Lu and Ll, interesting… I’m wondering if they faced the 5-hex-digit-problem I’ve discribed… (scratching my head)

michaelficarra · 2012-10-28T23:36:18Z

I believe that's what the sequence / "\uD80D" [\uDC00-\uDC2E] may be for.

gurdiga · 2012-10-29T05:19:43Z

@michaelficarra Um… I’m not sure I understand… 😕

Let me rephrase is what I meant. So, we have Unicode letters like “ц” and “ţ” that are outside a-zA-Z but are valid and useful in respective languages. And we can encode them in JS with Unicode sequences \u0446 and \u0163 respectively.

But then we have Unicode letters like MATHEMATICAL ITALIC SMALL A which have a 5-digit-hex code and which we’d encode as \u1d44e, but unfortunately this doesn’t work, as the JS parser reads this as \u1d44 + e, so instead of that nice small “a” you get ᵄe. 😟

Maybe there is another way, maybe we might just use the raw characters and paste them in the source code, I mean instead of a regex like [\u1d44e-\u1d467] to just put this, which I expect to work, but can get hard to read and messy for characters that might not implemented be in your terminal font, for example the characters in the range [\u10401-\u10427] or this respectively, are displayed as a small empty rectangles in my browser and about the same in my terminal (on Ubuntu). So, we have a kind of choice here… 🌿

BTW: even here, I got 422 (Unprocessable Entity) from GitHub when I’ve tried to submit the raw special characters 😉

gurdiga · 2012-10-30T05:54:45Z

I have updated the patch: since we can’t encode 5-digit letters, I’ve decided not to use Unicode escape sequences at all, and just use raw characters. This way we can have all the letters. 👍

One tricky thing that I’ve found is that JS doesn’t allow specifying a RegExp range that would start or end with a 5-digit letter: you can’t have his range with raw letters: [\u1d44e-\u1d467], you get a “SyntaxError: Invalid regular expression: …: Range out of order in character class”. This is why I had to “inline” those ranges. 😄

valueof · 2012-10-30T05:58:34Z

Err, I'm sure there must be a better way than keeping a block of random unicode characters in the code.

gurdiga · 2012-10-30T06:01:41Z

Good ideas are always welcome. 😉

gurdiga · 2012-10-30T06:46:21Z

@antonkovalyov, c’mon man let’s work this out: did you have something concrete in mind? Help me out a bit. Let’s get this done.

neonstalwart · 2012-10-30T14:40:32Z

perhaps this will help... https://github.com/bestiejs/punycode.js/blob/a6ea357aae89c90733dedd83cc4dac6982a69478/punycode.js#L128-147 (which i came across by reading http://mathiasbynens.be/notes/javascript-escapes)

that code allows for something like this

ucs2encode([119886]); // -> MATHEMATICAL ITALIC SMALL A

a little massaging to make it fit this use case and i think it should be suitable for generating the string of chars you want to use in the regular expression.

gurdiga · 2012-10-30T17:43:39Z

@neonstalwart Thank you for jumping into this. 😃

I feel dumb… it seems like I’m on a different wave-length than the rest of the world because I can’t get to the core of other people arguments (there were @michaelficarra and @antonkovalyov before) 😳.

Let me at least try to understand what you’re proposing. 😳 So we collect all those codes, as an array of codes, including the 5-digit ones, and then we use ucs2encode() to build the final string which we pass to new RegExp(). As far as I can understand, this may be a benefit if it’s in a way better to have a 4xlarge array of codes instead of a 1xlarge string with raw characters. Is this the idea that you were trying to express? 😣

neonstalwart · 2012-10-30T18:02:16Z

@gurdiga that is almost the idea i was getting at.

i was assuming that the characters you're interested in are represented by a continuous range of values - eg 0x1d44e to 0x1d4671. i was also thinking that rather than have a ucs2encode that takes an array of values, you might just have it convert a single value (array or single value doesn't matter so much, but assume encode in the code below will operate on a single value). then you would loop over the range of values to produce the string that generates the regular expression - no need for large arrays of chars or numbers in the code.

var current = 0x1d44e,
    upper = 0x1d4671,
    range = '';

while (current <= upper) {
    range += encode(current++);
}

// then go on to build the regular expression with the range of chars

you might extract this into a function that takes a lower (current in the snippet) and upper and returns range so that you can reuse it etc.

the key part i was linking to in that other code is the way you can create characters outside of the range that can be expressed by \uxxxx.

gurdiga · 2012-10-30T19:53:26Z

OK, so, in that collection of letters we have 173 ranges (60 of which involves 5-digit letters) and 1045 “singles”. Now, the choice we’d have with encode() is to encode them as arrays of strings representing their Unicode escape sequences (or a string that we’d then split into an array), before looping over each of the pairs, and then over singles.

If I’m not missing anything, this this sounds a bit of overkill compared to just putting the raw letters in the code, which, in the end are just… letters… aren’t they? 😄

michaelficarra · 2012-10-31T15:24:47Z

@gurdiga: From http://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs,

Characters outside the BMP, e.g. U+1D306 tetragram for centre ( character omitted due to github technical issue ), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.

See http://mothereff.in/js-escapes#0%F0%9D%8C%86 for an example.

gurdiga · 2012-10-31T15:33:39Z

That is good to know… I’m wondering if we can use this knowledge for our issue here… ❓

michaelficarra · 2012-10-31T15:43:11Z

Seriously? /\uD834\uDF06/.test('the-character-that-github-does-not-like')

edit: Quoting you:

However, because I could not find a way to encode 5-digit (ECMA allows only 4) Unicode escape sequence, I had to leave out 1016 out of 3192 Unicode letters.

Use surrogate Pairs. Problem solved.

gurdiga · 2012-10-31T15:55:22Z

@michaelficarra Thank you for the hint. Yes, this helps encoding the 5-hex-digit Unicode letters. 👍

gurdiga · 2012-11-02T05:40:44Z

Updated the isAlpha() helper to recognize Unicode letters.

gurdiga · 2012-11-02T05:54:17Z

Extended test to include capitalized identifiers.

michaelficarra · 2012-11-02T14:06:00Z

...what happened to using unicode escapes and character ranges?

gurdiga · 2012-11-02T14:22:11Z

@michaelficarra it seems to me that the benefit is too small for the effort: we’d get ~4x more JS code than with raw letters, part of which we’d have to loop over (the 5-digit ranges) to make them usable, which I’d expect to make the parser slower. And in the end, those are just letters, I mean what’s the point to escape them?

michaelficarra · 2012-11-02T15:17:30Z

we’d get ~4x more JS code

Not if you use ranges in character classes. That first link I sent you makes extensive use of ranges. Here's an excerpt:

\u1EFF-\u1F07\u1F10-\u1F15\u1F20-\u1F27\u1F30-\u1F37\u1F40-\u1F45\u1F50-\u1F57

That's not much longer than enumerating them. @antonkovalyov's comment above makes me think that it's desirable to avoid the characters themselves.

valueof · 2012-11-02T18:42:10Z

I don't really care whether it's ~4x more JavaScript code or not because most people use JSHint locally. We will need to move character classes into a separate file, though. You can check out branch code for an example of moving declarative stuff out of jshint.js.

I don't like the idea of using actual characters because they might bring all kind of problems in various editors. Also looking at dozens of lines of this: ⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂ is no fun.

I'm in favor of taking a Unicode table and overall approach from Traceur.

gurdiga · 2012-11-03T03:46:18Z

@antonkovalyov

I'm in favor of taking a Unicode table and overall approach from Traceur.

They use a function to determine if a character is an Unicode letter or not:

function isUnicodeLetter(ch) {
  var cc = ch.charCodeAt(0);
  for (var i = 0; i < unicodeLetterTable.length; i++) {
    if (cc < unicodeLetterTable[i][0])
      return false;
    if (cc <= unicodeLetterTable[i][1])
      return true;
  }
  return false;
}

and we use regular expressions to match tokens, so I’m not sure how to apply that to our context. And if we stick to regular expressions there is no way to include 5-digit-code letters other than inlining those ranges. If there are no other options I’ll go ahead this way.

I got the idea of how to move them out for JS platforms that have a require() thingy, but what about the browser? I see there is no browser.js in ./src/platforms so far. Are there any plans to add? I mean this is the particular use case that brought me here: I’m using it inside qUnit. ❓

@michaelficarra

That first link I sent you makes extensive use of ranges.

if you mean this https://github.com/tolmasky/language/blob/53f75464902bae56712f79a30015a0b87336a17c/languages/JavaScript.language#L656-665 than it looks to me like pseudo-code, I mean they just show the codes, that’s it. So far I didn’t find a way to include 5-digit-code letter ranges in a regular expression other than inlining them.

I’ve also tried splitting the 5-digit-code into surrogate pairs as they do it here: http://mothereff.in/js-escapes#0%F0%9D%8C%86, so for /[\u1D770-\u1D7C9]/ which does’t work, I ended up with /[\uD835\uDFAA-\uD835\uDFC9]/, which still errors:

> r = new RegExp('[\uD835\uDFAA-\uD835\uDFC9]')
< SyntaxError: Invalid regular expression: /[blah-lah]/: Range out of order in character class

Am I missing something? 😳

valueof · 2012-11-03T17:25:01Z

@gurdiga You are right about Traceur, sorry. We should go with ranges then.

I am actually working on an (experimental) branch that moves both lexer and regular expressions out of jshint.js. Right now all lint checks pass but many tests fail. I will try to finish and merge it this weekend.

We do use require for that and before the 1.0.0 release I will add a new command to make.js that will compile a browser-friendly version of JSHint using browserify. So you can work with JSHint as with any other Node app.

Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter, Lowercase' categories from fileformat.info.

gurdiga · 2012-11-08T08:34:41Z

@antonkovalyov I’ve exported the letter codes to src/unicode/letters.js. Please take a look.

gurdiga · 2012-11-25T12:27:13Z

@antonkovalyov I’m wondering if there is anything more that I can help with to get this in. ❓

Sorry if I’m being impatient, I’m just implementing server-side JS linting and wanted to take advantage of this tweak. 😃

valueof · 2012-11-26T00:58:51Z

Nope. Sorry for being slow, I was working on a lexer/tokenizer rewrite (now merged in) and put all patches on hold. I'll get to your patch within a few days.

valueof · 2012-12-21T23:23:44Z

@gurdiga So a couple of weeks ago I rewrote our lexer to make it more robust and get rid of obnoxiously long regular expressions. That also meant that we were able to use Traceur's approach to handling unicode symbols with a table. So I went ahead and implemented that.

But thanks for your pull request, it definitely got the ball rolling!

This commit adds support for unicode identifiers such as Антон or π. It also supports escaped unicode sequences since stuff like \d1d44 is a valid identifier. Implementation approach and data was mostly borrowed from Google's Traceur compiler and rewritten a little bit to match our lexer's structure. Closes jshintGH-716. Closes jshintGH-301.

valueof mentioned this pull request Nov 3, 2012

Unexpected 'ŝ'. #729

Closed

Allow unicode letters in identifiers

7302005

Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter, Lowercase' categories from fileformat.info.

valueof closed this in 5a7be3c Dec 21, 2012

gurdiga mentioned this pull request Oct 7, 2013

Lexer support for non-ascii literals angular/angular.js#2174

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow unicode letters in identifiers #716

Allow unicode letters in identifiers #716

gurdiga commented Oct 28, 2012

michaelficarra commented Oct 28, 2012

gurdiga commented Oct 28, 2012

michaelficarra commented Oct 28, 2012

gurdiga commented Oct 29, 2012

gurdiga commented Oct 30, 2012

valueof commented Oct 30, 2012

gurdiga commented Oct 30, 2012

gurdiga commented Oct 30, 2012

neonstalwart commented Oct 30, 2012

gurdiga commented Oct 30, 2012

neonstalwart commented Oct 30, 2012

gurdiga commented Oct 30, 2012

michaelficarra commented Oct 31, 2012

gurdiga commented Oct 31, 2012

michaelficarra commented Oct 31, 2012

gurdiga commented Oct 31, 2012

gurdiga commented Nov 2, 2012

gurdiga commented Nov 2, 2012

michaelficarra commented Nov 2, 2012

gurdiga commented Nov 2, 2012

michaelficarra commented Nov 2, 2012

valueof commented Nov 2, 2012

gurdiga commented Nov 3, 2012

valueof commented Nov 3, 2012

gurdiga commented Nov 8, 2012

gurdiga commented Nov 25, 2012

valueof commented Nov 26, 2012

valueof commented Dec 21, 2012

Allow unicode letters in identifiers #716

Allow unicode letters in identifiers #716

Conversation

gurdiga commented Oct 28, 2012

michaelficarra commented Oct 28, 2012

gurdiga commented Oct 28, 2012

michaelficarra commented Oct 28, 2012

gurdiga commented Oct 29, 2012

gurdiga commented Oct 30, 2012

valueof commented Oct 30, 2012

gurdiga commented Oct 30, 2012

gurdiga commented Oct 30, 2012

neonstalwart commented Oct 30, 2012

gurdiga commented Oct 30, 2012

neonstalwart commented Oct 30, 2012

gurdiga commented Oct 30, 2012

michaelficarra commented Oct 31, 2012

gurdiga commented Oct 31, 2012

michaelficarra commented Oct 31, 2012

gurdiga commented Oct 31, 2012

gurdiga commented Nov 2, 2012

gurdiga commented Nov 2, 2012

michaelficarra commented Nov 2, 2012

gurdiga commented Nov 2, 2012

michaelficarra commented Nov 2, 2012

valueof commented Nov 2, 2012

gurdiga commented Nov 3, 2012

valueof commented Nov 3, 2012

gurdiga commented Nov 8, 2012

gurdiga commented Nov 25, 2012

valueof commented Nov 26, 2012

valueof commented Dec 21, 2012