Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow unicode letters in identifiers #716

Closed
wants to merge 1 commit into from
Closed

Allow unicode letters in identifiers #716

wants to merge 1 commit into from

Conversation

gurdiga
Copy link

@gurdiga gurdiga commented Oct 28, 2012

Hey Anton!

This patch is re issue #301.

I’m writing an law-specialized app and I’ve chosen to write them in Romanian because it would have been more difficult to translate the domain terms. :o) This is the reason why I need jshint to validate my code with Unicode letters in identifiers.

So here is what I did:

Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter, Lowercase' categories from fileformat.info. However, because I could not find a way to encode 5-digit (ECMA allows only 4) Unicode escape sequence, I had to leave out 1016 out of 3192 Unicode letters.

Please take a look and let me know what you think.

@gurdiga
Copy link
Author

gurdiga commented Oct 28, 2012

@michaelficarra Oh, nice! :) Seems like UnicodeLetter covers more than Lu and Ll, interesting… I’m wondering if they faced the 5-hex-digit-problem I’ve discribed… (scratching my head)

@michaelficarra
Copy link

I believe that's what the sequence / "\uD80D" [\uDC00-\uDC2E] may be for.

@gurdiga
Copy link
Author

gurdiga commented Oct 29, 2012

@michaelficarra Um… I’m not sure I understand… 😕

Let me rephrase is what I meant. So, we have Unicode letters like “ц” and “ţ” that are outside a-zA-Z but are valid and useful in respective languages. And we can encode them in JS with Unicode sequences \u0446 and \u0163 respectively.

But then we have Unicode letters like MATHEMATICAL ITALIC SMALL A which have a 5-digit-hex code and which we’d encode as \u1d44e, but unfortunately this doesn’t work, as the JS parser reads this as \u1d44 + e, so instead of that nice small “a” you get ᵄe. 😟

Maybe there is another way, maybe we might just use the raw characters and paste them in the source code, I mean instead of a regex like [\u1d44e-\u1d467] to just put this, which I expect to work, but can get hard to read and messy for characters that might not implemented be in your terminal font, for example the characters in the range [\u10401-\u10427] or this respectively, are displayed as a small empty rectangles in my browser and about the same in my terminal (on Ubuntu). So, we have a kind of choice here… 🌿

BTW: even here, I got 422 (Unprocessable Entity) from GitHub when I’ve tried to submit the raw special characters 😉

@gurdiga
Copy link
Author

gurdiga commented Oct 30, 2012

I have updated the patch: since we can’t encode 5-digit letters, I’ve decided not to use Unicode escape sequences at all, and just use raw characters. This way we can have all the letters. 👍

One tricky thing that I’ve found is that JS doesn’t allow specifying a RegExp range that would start or end with a 5-digit letter: you can’t have his range with raw letters: [\u1d44e-\u1d467], you get a “SyntaxError: Invalid regular expression: …: Range out of order in character class”. This is why I had to “inline” those ranges. 😄

@valueof
Copy link
Member

valueof commented Oct 30, 2012

Err, I'm sure there must be a better way than keeping a block of random unicode characters in the code.

@gurdiga
Copy link
Author

gurdiga commented Oct 30, 2012

Good ideas are always welcome. 😉

@gurdiga
Copy link
Author

gurdiga commented Oct 30, 2012

@antonkovalyov, c’mon man let’s work this out: did you have something concrete in mind? Help me out a bit. Let’s get this done.

@neonstalwart
Copy link

perhaps this will help... https://github.com/bestiejs/punycode.js/blob/a6ea357aae89c90733dedd83cc4dac6982a69478/punycode.js#L128-147 (which i came across by reading http://mathiasbynens.be/notes/javascript-escapes)

that code allows for something like this

ucs2encode([119886]); // -> MATHEMATICAL ITALIC SMALL A 

a little massaging to make it fit this use case and i think it should be suitable for generating the string of chars you want to use in the regular expression.

@gurdiga
Copy link
Author

gurdiga commented Oct 30, 2012

@neonstalwart Thank you for jumping into this. 😃

I feel dumb… it seems like I’m on a different wave-length than the rest of the world because I can’t get to the core of other people arguments (there were @michaelficarra and @antonkovalyov before) 😳.

Let me at least try to understand what you’re proposing. 😳 So we collect all those codes, as an array of codes, including the 5-digit ones, and then we use ucs2encode() to build the final string which we pass to new RegExp(). As far as I can understand, this may be a benefit if it’s in a way better to have a 4xlarge array of codes instead of a 1xlarge string with raw characters. Is this the idea that you were trying to express? 😣

@neonstalwart
Copy link

@gurdiga that is almost the idea i was getting at.

i was assuming that the characters you're interested in are represented by a continuous range of values - eg 0x1d44e to 0x1d4671. i was also thinking that rather than have a ucs2encode that takes an array of values, you might just have it convert a single value (array or single value doesn't matter so much, but assume encode in the code below will operate on a single value). then you would loop over the range of values to produce the string that generates the regular expression - no need for large arrays of chars or numbers in the code.

var current = 0x1d44e,
    upper = 0x1d4671,
    range = '';

while (current <= upper) {
    range += encode(current++);
}

// then go on to build the regular expression with the range of chars

you might extract this into a function that takes a lower (current in the snippet) and upper and returns range so that you can reuse it etc.

the key part i was linking to in that other code is the way you can create characters outside of the range that can be expressed by \uxxxx.

@gurdiga
Copy link
Author

gurdiga commented Oct 30, 2012

OK, so, in that collection of letters we have 173 ranges (60 of which involves 5-digit letters) and 1045 “singles”. Now, the choice we’d have with encode() is to encode them as arrays of strings representing their Unicode escape sequences (or a string that we’d then split into an array), before looping over each of the pairs, and then over singles.

If I’m not missing anything, this this sounds a bit of overkill compared to just putting the raw letters in the code, which, in the end are just… letters… aren’t they? 😄

@michaelficarra
Copy link

@gurdiga: From http://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs,

Characters outside the BMP, e.g. U+1D306 tetragram for centre ( character omitted due to github technical issue ), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.

See http://mothereff.in/js-escapes#0%F0%9D%8C%86 for an example.

@gurdiga
Copy link
Author

gurdiga commented Oct 31, 2012

That is good to know… I’m wondering if we can use this knowledge for our issue here… ❓

@michaelficarra
Copy link

Seriously? /\uD834\uDF06/.test('the-character-that-github-does-not-like')

edit: Quoting you:

However, because I could not find a way to encode 5-digit (ECMA allows only 4) Unicode escape sequence, I had to leave out 1016 out of 3192 Unicode letters.

Use surrogate Pairs. Problem solved.

@gurdiga
Copy link
Author

gurdiga commented Oct 31, 2012

@michaelficarra Thank you for the hint. Yes, this helps encoding the 5-hex-digit Unicode letters. 👍

@gurdiga
Copy link
Author

gurdiga commented Nov 2, 2012

Updated the isAlpha() helper to recognize Unicode letters.

@gurdiga
Copy link
Author

gurdiga commented Nov 2, 2012

Extended test to include capitalized identifiers.

@michaelficarra
Copy link

...what happened to using unicode escapes and character ranges?

@gurdiga
Copy link
Author

gurdiga commented Nov 2, 2012

@michaelficarra it seems to me that the benefit is too small for the effort: we’d get ~4x more JS code than with raw letters, part of which we’d have to loop over (the 5-digit ranges) to make them usable, which I’d expect to make the parser slower. And in the end, those are just letters, I mean what’s the point to escape them?

@michaelficarra
Copy link

we’d get ~4x more JS code

Not if you use ranges in character classes. That first link I sent you makes extensive use of ranges. Here's an excerpt:

\u1EFF-\u1F07\u1F10-\u1F15\u1F20-\u1F27\u1F30-\u1F37\u1F40-\u1F45\u1F50-\u1F57

That's not much longer than enumerating them. @antonkovalyov's comment above makes me think that it's desirable to avoid the characters themselves.

@valueof
Copy link
Member

valueof commented Nov 2, 2012

I don't really care whether it's ~4x more JavaScript code or not because most people use JSHint locally. We will need to move character classes into a separate file, though. You can check out branch code for an example of moving declarative stuff out of jshint.js.

I don't like the idea of using actual characters because they might bring all kind of problems in various editors. Also looking at dozens of lines of this: ⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂ is no fun.

I'm in favor of taking a Unicode table and overall approach from Traceur.

@gurdiga
Copy link
Author

gurdiga commented Nov 3, 2012

@antonkovalyov

I'm in favor of taking a Unicode table and overall approach from Traceur.

They use a function to determine if a character is an Unicode letter or not:

function isUnicodeLetter(ch) {
  var cc = ch.charCodeAt(0);
  for (var i = 0; i < unicodeLetterTable.length; i++) {
    if (cc < unicodeLetterTable[i][0])
      return false;
    if (cc <= unicodeLetterTable[i][1])
      return true;
  }
  return false;
}

and we use regular expressions to match tokens, so I’m not sure how to apply that to our context. And if we stick to regular expressions there is no way to include 5-digit-code letters other than inlining those ranges. If there are no other options I’ll go ahead this way.

I got the idea of how to move them out for JS platforms that have a require() thingy, but what about the browser? I see there is no browser.js in ./src/platforms so far. Are there any plans to add? I mean this is the particular use case that brought me here: I’m using it inside qUnit. ❓

@michaelficarra

That first link I sent you makes extensive use of ranges.

if you mean this https://github.com/tolmasky/language/blob/53f75464902bae56712f79a30015a0b87336a17c/languages/JavaScript.language#L656-665 than it looks to me like pseudo-code, I mean they just show the codes, that’s it. So far I didn’t find a way to include 5-digit-code letter ranges in a regular expression other than inlining them.

I’ve also tried splitting the 5-digit-code into surrogate pairs as they do it here: http://mothereff.in/js-escapes#0%F0%9D%8C%86, so for /[\u1D770-\u1D7C9]/ which does’t work, I ended up with /[\uD835\uDFAA-\uD835\uDFC9]/, which still errors:

> r = new RegExp('[\uD835\uDFAA-\uD835\uDFC9]')
< SyntaxError: Invalid regular expression: /[blah-lah]/: Range out of order in character class

Am I missing something? 😳

@valueof
Copy link
Member

valueof commented Nov 3, 2012

@gurdiga You are right about Traceur, sorry. We should go with ranges then.

I am actually working on an (experimental) branch that moves both lexer and regular expressions out of jshint.js. Right now all lint checks pass but many tests fail. I will try to finish and merge it this weekend.

We do use require for that and before the 1.0.0 release I will add a new command to make.js that will compile a browser-friendly version of JSHint using browserify. So you can work with JSHint as with any other Node app.

@valueof valueof mentioned this pull request Nov 3, 2012
Picked the hex codes for the Unicode 'Letter, Uppercase' and 'Letter,
Lowercase' categories from fileformat.info.
@gurdiga
Copy link
Author

gurdiga commented Nov 8, 2012

@antonkovalyov I’ve exported the letter codes to src/unicode/letters.js. Please take a look.

@gurdiga
Copy link
Author

gurdiga commented Nov 25, 2012

@antonkovalyov I’m wondering if there is anything more that I can help with to get this in. ❓

Sorry if I’m being impatient, I’m just implementing server-side JS linting and wanted to take advantage of this tweak. 😃

@valueof
Copy link
Member

valueof commented Nov 26, 2012

Nope. Sorry for being slow, I was working on a lexer/tokenizer rewrite (now merged in) and put all patches on hold. I'll get to your patch within a few days.

@valueof valueof closed this in 5a7be3c Dec 21, 2012
@valueof
Copy link
Member

valueof commented Dec 21, 2012

@gurdiga So a couple of weeks ago I rewrote our lexer to make it more robust and get rid of obnoxiously long regular expressions. That also meant that we were able to use Traceur's approach to handling unicode symbols with a table. So I went ahead and implemented that.

But thanks for your pull request, it definitely got the ball rolling!

jugglinmike pushed a commit to jugglinmike/jshint that referenced this pull request Oct 21, 2014
This commit adds support for unicode identifiers such as Антон or
π. It also supports escaped unicode sequences since stuff like
\d1d44 is a valid identifier.

Implementation approach and data was mostly borrowed from Google's
Traceur compiler and rewritten a little bit to match our lexer's
structure.

Closes jshintGH-716.
Closes jshintGH-301.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants