Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lightparse: improve the detection of regexp #16

Closed
glathoud opened this issue Mar 21, 2014 · 4 comments
Closed

lightparse: improve the detection of regexp #16

glathoud opened this issue Mar 21, 2014 · 4 comments

Comments

@glathoud
Copy link
Owner

When re-parsing minified code false positive detections of regexps are triggered by a division sign because the current strategy is still too naive. This is a problem if we want to reparse the minified code - where most newlines have been removed - to e.g. further shorten it.

blah = x / y; ...... line goes further ...... ; foo = bar / bin

-> false positive with the current implementation: / y ..... bar / detected as a regexp because the identifier x is not taken into account.

Task: reject such false positive regexps.

Difficulty: we want to try to keep lightparse "light" i.e. without a full tokenizer (mmm maybe I won't avoid it in the long term, but for now this way if it goes). So we should not only reject the simple false positives but also prevent introducing false negatives:

for (...) /true regexp/.exec();    // Here the regexp is correct!
while(...) /true regexp/.exec();    // Here the regexp is correct!
do (...) /true regexp/.exec();  while ( )  /another true regexp/ ...  // Here both regexps are correct!
similarly with if(...) /true regexp/
similarly with else /true regexp/

See also the remarks at the beginning of section 7 of the ECMAscript 5 spec:

NOTE There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted. This is not affected by semicolon insertion (see 7.9); in examples such as the following:
14 © Ecma International 2011
a = b
 /hi/g.exec(c).map(d);
where the first non-whitespace, non-comment character after a LineTerminator is slash (/) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator. That is, the above example is interpreted in the same way as:
a = b / hi / g.exec(c).map(d);
@glathoud
Copy link
Owner Author

Looks complicate to implement with the current implementation.
Maybe a fix would be to start with a separate tokenizer, e.g. pick the one from narcissus:
http://glat.info/jscheck/narcissus.jsparse.js

Also using narcissus function declarations and named function expressions could be properly distinguished (not sure it would matter here, though), and - more important - automatic semi-colon "insertion" would be supported.

Mmmm humbling lesson. Maybe I'll just switch the lightparse.js internal implementation to narcissus, while still offering the same interface to the outside.

@glathoud
Copy link
Owner Author

Note: Using narcissus would require to make it extendable so that it recognizes metaret, metafun and inline.

@glathoud
Copy link
Owner Author

Found several other issues which become difficult to solve using lightparse. Again narcissus would be desirable here (or any equivalent). In particular, lightparse callArr does not distinguish between function declarations / expressions / actual calls.

Another complication: add support for dotted function declaration names (for metaparse etc.).

Maybe for the time being we can just stop trying to shorten local names using lightparse, and do it separately afterwards.

@glathoud
Copy link
Owner Author

acorn looks like a good candidate for a parser : small yet complete implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant