lightparse: improve the detection of regexp #16

glathoud · 2014-03-21T08:53:46Z

When re-parsing minified code false positive detections of regexps are triggered by a division sign because the current strategy is still too naive. This is a problem if we want to reparse the minified code - where most newlines have been removed - to e.g. further shorten it.

blah = x / y; ...... line goes further ...... ; foo = bar / bin

-> false positive with the current implementation: / y ..... bar / detected as a regexp because the identifier x is not taken into account.

Task: reject such false positive regexps.

Difficulty: we want to try to keep lightparse "light" i.e. without a full tokenizer (mmm maybe I won't avoid it in the long term, but for now this way if it goes). So we should not only reject the simple false positives but also prevent introducing false negatives:

for (...) /true regexp/.exec();    // Here the regexp is correct!
while(...) /true regexp/.exec();    // Here the regexp is correct!
do (...) /true regexp/.exec();  while ( )  /another true regexp/ ...  // Here both regexps are correct!
similarly with if(...) /true regexp/
similarly with else /true regexp/

See also the remarks at the beginning of section 7 of the ECMAscript 5 spec:

NOTE There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted. This is not affected by semicolon insertion (see 7.9); in examples such as the following:
14 © Ecma International 2011
a = b
 /hi/g.exec(c).map(d);
where the first non-whitespace, non-comment character after a LineTerminator is slash (/) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator. That is, the above example is interpreted in the same way as:
a = b / hi / g.exec(c).map(d);

The text was updated successfully, but these errors were encountered:

glathoud · 2014-03-21T09:50:40Z

Looks complicate to implement with the current implementation.
Maybe a fix would be to start with a separate tokenizer, e.g. pick the one from narcissus:
http://glat.info/jscheck/narcissus.jsparse.js

Also using narcissus function declarations and named function expressions could be properly distinguished (not sure it would matter here, though), and - more important - automatic semi-colon "insertion" would be supported.

Mmmm humbling lesson. Maybe I'll just switch the lightparse.js internal implementation to narcissus, while still offering the same interface to the outside.

glathoud · 2014-03-21T14:58:48Z

Note: Using narcissus would require to make it extendable so that it recognizes metaret, metafun and inline.

glathoud · 2014-03-21T15:55:14Z

Found several other issues which become difficult to solve using lightparse. Again narcissus would be desirable here (or any equivalent). In particular, lightparse callArr does not distinguish between function declarations / expressions / actual calls.

Another complication: add support for dotted function declaration names (for metaparse etc.).

Maybe for the time being we can just stop trying to shorten local names using lightparse, and do it separately afterwards.

glathoud · 2014-03-25T12:33:41Z

acorn looks like a good candidate for a parser : small yet complete implementation.

…not pass.

glathoud pushed a commit that referenced this issue Apr 8, 2014

re #16 extended acorn to optionally support inline.

2166970

glathoud pushed a commit that referenced this issue Jun 12, 2014

re #16 v2: inline fully supported, all dev tests passed. Todo: build.

0cc1350

glathoud pushed a commit that referenced this issue Jun 14, 2014

re #16: v2: added tests about regexp parsing, which lightparse could …

9dcbcc8

…not pass.

glathoud closed this as completed in 6e963c8 Jun 14, 2014

glathoud mentioned this issue Jun 14, 2014

some profiling of js.metaret itself #17

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lightparse: improve the detection of regexp #16

lightparse: improve the detection of regexp #16

glathoud commented Mar 21, 2014

glathoud commented Mar 21, 2014

glathoud commented Mar 21, 2014

glathoud commented Mar 21, 2014

glathoud commented Mar 25, 2014

lightparse: improve the detection of regexp #16

lightparse: improve the detection of regexp #16

Comments

glathoud commented Mar 21, 2014

glathoud commented Mar 21, 2014

glathoud commented Mar 21, 2014

glathoud commented Mar 21, 2014

glathoud commented Mar 25, 2014