New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode characters in identifier names #301

Closed
alawatthe opened this Issue Oct 9, 2011 · 8 comments

Comments

Projects
None yet
6 participants
@alawatthe
Copy link

alawatthe commented Oct 9, 2011

With jsHint it is currently not possible to use identifiers which do not match the RegExp /^([a-zA-Z_$][a-zA-Z0-9_$]*)$/ (see line 753).
In the following code each line would fail in jsHint.

π = 3.1415;
josé = 42;
Ж = "test";

But ECMA-Script also allows every character in one of the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)” in identifier names. (see http://es5.github.com/#x7.6).

Therefor I think we should allow these characters, too.

There are some possibilities to accomplish this task. These three I found so far:

a long RegExp or a list

As there are approx. 15000 valid chars, the RegExp/list would be very long.

try and catch

An other way is trying to assign a value to the variable.
If it works, the variable name should be valid, if it throws an error the variable name should be wrong. The only problem is IE, as it allows chars like ☺ as variable names, which is wrong (see http://stackoverflow.com/questions/7451524/why-arent-and-valid-javascript-variable-names).

upperCase !== lowerCase trick

Mihai Bazon found a trick to identify Unicode letters. He used this trick because of a bug in Firefox which is fixed now.
The chars in the categories Lu, Ll, Lt and Lm have an .toUpperCase() version that differs from the .toLowerCase() version. We could use it like this

function isUnicodeLetter(char) {
 return char.toUpperCase() !== char.toLowerCase();
} 

The problem is, that not every valid char satisfies this (example: ↂ).

What are the others doing?

I think we should use the long RegExp. Shall I write a pull request?

What do you think?
alawatthe/alex

@valueof

This comment has been minimized.

Copy link
Member

valueof commented Oct 17, 2011

I personally dislike any non-trivial regular expressions because they always find a way to bite you in the ass at some point. I'd suggest to use a list (as in Traceur) when writing a patch.

@WolfgangKluge

This comment has been minimized.

Copy link
Contributor

WolfgangKluge commented Oct 17, 2011

I don't see a benefit from the list to a long regexp.
list:

var t = [
    [\u0001, \u0010], [\u0020, \u0030], [\u0040, \u0050]
]

vs regexp

var t = /[\u0001-\u0010\u00020-\u0030\u0040-\u0050]

both versions will be unreadable at the end ;). So I don't have a preference here. The parser should do this job ;)

But I think JSHint should warn when characters other than e.g. [A-Za-z0-9_$] are used. In my opinion it's bad style to use г (\u0433, cyrillic) over r (at least if used between other ansi characters). That's something a obfuscator need.. However, pi in your example is a good use (except, that there is the even more precise Math.PI already) - so such a warning has to be optional (as everything).

Currently JSHint runs into an endless loop.

@valueof

This comment has been minimized.

Copy link
Member

valueof commented Oct 17, 2011

JSHint should definitely warn by default. As for the list vs. regexp I personally find line breaks in strings and regexps very ugly.

@alawatthe

This comment has been minimized.

Copy link

alawatthe commented Oct 18, 2011

I had a look into the problem yesterday. I used a string containing the letters, more or less like this:

var letter = '\\u0041-\\u005A\\u0061-\\u007A\\u00AA\\u00B5\\u00BA\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02C1\\u02C6-\\u02D1' + ... and some more lines

and constructed the RegExps out of the string. I personally prefer line breaks in strings instead of too long lines.

There are some other issues I encountered:

var \u0061 = 1234;
console.log(a);   // logs 1234 

Is valid JavaScript, but to support it, we would have to:

  • transform \uXXXX into the letter,
  • check if the letter is a valid letter in an identifier and
  • mark the letter as a being defined, otherwise we get a "'a' is not defined" error.
    (of course we would have to mark \uXXXX the same way, if we define a letter)

In my opinion this is a very bad coding style and we shouldn't allow the use of escape sequences in identifier names at all.
(It would make the implementation much easier.)

@BYK

This comment has been minimized.

Copy link
Contributor

BYK commented Jan 14, 2012

May be the library mentioned at http://stackoverflow.com/a/873600/90297 can help about using unicode aware regular expressions.

@3rd-Eden

This comment has been minimized.

Copy link

3rd-Eden commented May 13, 2012

Just bumped in to this issue as well. Are there any "easy" work arounds known for this? (Except renaming variables of course)

@valueof

This comment has been minimized.

Copy link
Member

valueof commented Aug 20, 2012

This is not worth fixing here and shouldn't be a problem in Next (with Esprima).

@valueof valueof closed this Aug 20, 2012

@Sense545

This comment has been minimized.

Copy link

Sense545 commented Sep 12, 2012

This awnser from stackoverflow http://stackoverflow.com/questions/1661197/valid-characters-for-javascript-variable-names/9337047#9337047 has a very refined regex which is also used on http://mothereff.in/js-variables

Can you please update jshint to use the 11236 characters long regex?

@valueof valueof reopened this Nov 2, 2012

@valueof valueof closed this in 5a7be3c Dec 21, 2012

jugglinmike added a commit to jugglinmike/jshint that referenced this issue Oct 21, 2014

Add support for unicode identfiers.
This commit adds support for unicode identifiers such as Антон or
π. It also supports escaped unicode sequences since stuff like
\d1d44 is a valid identifier.

Implementation approach and data was mostly borrowed from Google's
Traceur compiler and rewritten a little bit to match our lexer's
structure.

Closes jshintGH-716.
Closes jshintGH-301.

danielctull pushed a commit to danielctull-forks/jshint that referenced this issue Jan 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment