Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tools: non-Ascii linter for /lib only #18043

Closed
wants to merge 4 commits into from
Closed

Conversation

SirR4T
Copy link
Contributor

@SirR4T SirR4T commented Jan 8, 2018

Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

Fixes: #11209

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

eslint

@nodejs-github-bot nodejs-github-bot added the tools Issues and PRs related to the tools directory. label Jan 8, 2018
Copy link
Member

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make it clear somewhere in a comment that this only catches characters in literals, not the source code itself (which is what we would eventually want to have instead)

@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 8, 2018

oh? then @hkal 's approach of parsing all tokens would be preferred? I had assumed the eslint selector would select the source code too, i see where i might have gone wrong 😅

Also, doubts:

  • the regexp /[^\n\x20-\x7e]/ doesn't capture \r, which I have seen some instances of, while running the linter. I should be adding that in the regexp, right?
  • I have seen instances of \u0000 and \u000c etc. in code. I should be disabling this rule for these, right?

@addaleax
Copy link
Member

addaleax commented Jan 8, 2018

@SirR4T I’m not sure how to do it, but yes, you want all tokens and even comments.

I have seen instances of \u0000 and \u000c etc. in code. I should be disabling this rule for these, right?

It’s fine to have the escape sequences spelled out as escape sequences in the source code. What matters is whether the files themselves contain characters that don’t fit into the ASCII range, because that’s what the script that bakes those files into the node binary looks for.

I hope that’s helpful. :)

@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 8, 2018

now I'm running into a very peculiar error, while doing make lint-js:

I'm unable to disable the below error, via the regular methods (//eslint-disable-line, //eslint-disable-next-line, etc.)

$ make lint-js
Running JS linter...

/Users/sarat/Play/Github/node/lib/internal/test/unicode.js
  1:1  error  Non-ASCII character '✓' detected  non-ascii-character

✖ 1 problem (1 error, 0 warnings)
  1 error, 0 warnings potentially fixable with the `--fix` option.

make: *** [lint-js] Error 1

@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 9, 2018

@addaleax : review latest commit? This was the approach followed in the (now closed) pull #11371, but I think that this is catching much lesser violations.

@not-an-aardvark
Copy link
Contributor

I'm a bit confused about why non-ascii characters would bloat the binary size. Aren't the files encoded as UTF-8?

@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 10, 2018

@not-an-aardvark that is what I understood, from #11129 (comment) :

I would have to assume it is because the external string API does not directly support UTF8. That is, if you look at ExternalStringResource, it assumes a utf16_t* buffer while ExternalOneByteStringResource assumes const char*. If the strings were stored as UTF8 we would incur an additional cost at startup that is not necessary.

Is that not correct?

@addaleax
Copy link
Member

@SirR4T Yes, that is correct. Files that don’t fit into ASCII are currently saved as UTF-16 instead:

node/tools/js2c.py

Lines 230 to 240 in 1e0f331

def Render(var, data):
# Treat non-ASCII as UTF-8 and convert it to UTF-16.
if any(ord(c) > 127 for c in data):
template = TWO_BYTE_STRING
data = map(ord, data.decode('utf-8').encode('utf-16be'))
data = [data[i] * 256 + data[i+1] for i in xrange(0, len(data), 2)]
data = ToCArray(data)
else:
template = ONE_BYTE_STRING
data = ToCString(data)
return template.format(var=var, data=data)

Copy link
Member

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, this still looks fine to me, but I’m not an expert for eslint anyway.

If it doesn’t catch the diagram in lib/timers.js and the check mark in lib/internal/test/unicode.js (for both of which we’ll want eslint-disable comments), it’s probably not quite strict enough, though.

// Rule Definition
//------------------------------------------------------------------------------

const nonAsciiRegexPattern = new RegExp(/[^\r\n\x20-\x7e]/);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use a RegExp literal like this?:

const nonAsciiRegexPattern = /[^\r\n\x20-\x7e]/;

Seems more readable to me and also more in line with our general coding style. (I'm kind of surprised this isn't caught by a lint rule itself, to be honest. Or maybe it is?)

const commentTokens = source.getAllComments();
const tokens = sourceTokens.concat(commentTokens);

tokens.forEach((token) => reportIfError(node, token));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to match on source.text rather than each individual token, to ensure that non-ascii whitespace is detected (which would not be part of any token or comment).

@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 11, 2018

Yay! 🎉

Could catch the lib/timers.js too, using source.text. But am still unable to bypass / skip the rule, for it ☹️

Any ideas? @Trott @not-an-aardvark

@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 19, 2018

cc: @Trott @not-an-aardvark @addaleax

can someone help me out here? I'm unable to skip the lint checks, inside lib/ folder. Need to skip the rules for lib/timers.js, and the check mark in lib/internal/test/unicode.js.

@not-an-aardvark
Copy link
Contributor

I think there are a few issues:

  • You're using a comment like // eslint-disable non-ascii-character, but I think it should be a block comment (/* eslint-disable non-ascii-character */)
  • When the rule reports a problem, the location of the problem is currently always the top of the file, because you're passing the Program node to context.report. Instead, it would be better if the report location were the point in the file that contains the non-ascii character. To do this, you could find the index of the non-ascii character in sourceCode.text, and then pass { ... loc: sourceCode.getLocFromIndex(theIndex) } to context.report.

If you make both of those changes, I think the disable comments will work as expected.

Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

Fixes: nodejs#11209
the linter should detect not just literals, but also source code
and all comments too.
@SirR4T
Copy link
Contributor Author

SirR4T commented Jan 19, 2018

Thanks @not-an-aardvark ! worked like a charm! pushing the latest changes, once the build is done.

@BridgeAR
Copy link
Member

BridgeAR commented Feb 2, 2018

@BridgeAR BridgeAR added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 2, 2018
@addaleax
Copy link
Member

addaleax commented Feb 4, 2018

Landed in c45afe8

@addaleax addaleax closed this Feb 4, 2018
@addaleax addaleax removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 4, 2018
addaleax pushed a commit that referenced this pull request Feb 4, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: #18043
Fixes: #11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: #18043
Fixes: #11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: #18043
Fixes: #11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
MylesBorins pushed a commit that referenced this pull request Feb 21, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: #18043
Fixes: #11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Feb 21, 2018
@MylesBorins
Copy link
Contributor

Should this be backported to v8.x-staging or v6.x-staging? If yes please follow the guide and raise a backport PR, if not let me know or add the dont-land-on label.

SirR4T added a commit to SirR4T/node that referenced this pull request Mar 29, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: nodejs#18043
Fixes: nodejs#11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
SirR4T added a commit to SirR4T/node that referenced this pull request Mar 29, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: nodejs#18043
Fixes: nodejs#11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
BethGriggs pushed a commit that referenced this pull request Apr 10, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: #18043
Backport-PR-URL: #19499
Fixes: #11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
@MylesBorins MylesBorins mentioned this pull request May 2, 2018
MayaLekova pushed a commit to MayaLekova/node that referenced this pull request May 8, 2018
Non-ASCII characters in /lib get compiled into the node binary,
and may bloat the binary size unnecessarily. A linter rule may
help prevent this.

PR-URL: nodejs#18043
Fixes: nodejs#11209
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Teddy Katz <teddy.katz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tools Issues and PRs related to the tools directory.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Linter rule: use ASCII quotes not Unicode ones
7 participants