Unicode escaping in regex is incorrectly minified. #2569

Closed
workmanw opened this issue Dec 9, 2017 · 11 comments · Fixed by #2576

@workmanw

workmanw commented Dec 9, 2017

Bug report

ES5

Uglify version (uglifyjs -V)
uglify-js 3.2.1

JavaScript input

new RegExp("<object[^>]*>.*?<\/object>|<span[^>]*>.*?<\/span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:\uD83C\uDFF3)\uFE0F?\u200D?(?:\uD83C\uDF08)|(?:\uD83D\uDC41)\uFE0F?\u200D?(?:\uD83D\uDDE8)\uFE0F?|[#-9]\uFE0F?\u20E3|(?:(?:\uD83C\uDFF4)(?:\uDB40[\uDC60-\uDCFF]){1,6})|(?:\uD83C[\uDDE0-\uDDFF]){2}|(?:(?:\uD83D[\uDC68\uDC69]))\uFE0F?(?:\uD83C[\uDFFA-\uDFFF])?\u200D?(?:[\u2695\u2696\u2708]|\uD83C[\uDF3E-\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|(?:\uD83D[\uDC68\uDC69]|\uD83E[\uDDD0-\uDDDF])(?:\uD83C[\uDFFA-\uDFFF])?\u200D?[\u2640\u2642\u2695\u2696\u2708]?\uFE0F?|(?:(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])[\u200D\uFE0F]{0,2}){1,3}(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])|(?:(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])\uFE0F?){2,4}|(?:\uD83D[\uDC68\uDC69\uDC6E\uDC71-\uDC87\uDD75\uDE45-\uDE4E]|\uD83E[\uDD26\uDD37]|\uD83C[\uDFC3-\uDFCC]|\uD83E[\uDD38-\uDD3E]|\uD83D[\uDEA3-\uDEB6]|\u26f9|\uD83D\uDC6F)\uFE0F?(?:\uD83C[\uDFFB-\uDFFF])?\u200D?[\u2640\u2642]?\uFE0F?|(?:[\u261D\u26F9\u270A-\u270D]|\uD83C[\uDF85-\uDFCC]|\uD83D[\uDC42-\uDCAA\uDD74-\uDD96\uDE45-\uDE4F\uDEA3-\uDECC]|\uD83E[\uDD18-\uDD3E])\uFE0F?(?:\uD83C[\uDFFB-\uDFFF])|(?:[\u2194-\u2199\u21a9-\u21aa]\uFE0F?|[\u0023\u002a]|[\u3030\u303d]\uFE0F?|(?:\ud83c[\udd70-\udd71]|\ud83c\udd8e|\ud83c[\udd91-\udd9a])\uFE0F?|\u24c2\uFE0F?|[\u3297\u3299]\uFE0F?|(?:\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51])\uFE0F?|[\u203c\u2049]\uFE0F?|[\u25aa-\u25ab\u25b6\u25c0\u25fb-\u25fe]\uFE0F?|[\u00a9\u00ae]\uFE0F?|[\u2122\u2139]\uFE0F?|\ud83c\udc04\uFE0F?|[\u2b05-\u2b07\u2b1b-\u2b1c\u2b50\u2b55]\uFE0F?|[\u231a-\u231b\u2328\u23cf\u23e9-\u23f3\u23f8-\u23fa]\uFE0F?|\ud83c\udccf|[\u2934\u2935]\uFE0F?)|[\u2700-\u27bf]\uFE0F?|[\ud800-\udbff][\udc00-\udfff]\uFE0F?|[\u2600-\u26FF]\uFE0F?|[\u0030-\u0039]\uFE0F", "g");

This line of code comes from the emojione library (emojione.js#L160).

The uglifyjs CLI command executed or minify() options used.

uglifyjs --compress --beautify beautify=false,semicolons=false --mangle -- emojione.js

JavaScript output or error produced.

new RegExp("<object[^>]*>.*?</object>|<span[^>]*>.*?</span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:🏳)️??(?:🌈)|(?:👁)️??(?:🗨)️?|[#-9]️?⃣|(?:(?:🏴)(?:\udb40[\udc60-\udcff]){1,6})|(?:\ud83c[\udde0-\uddff]){2}|(?:(?:\ud83d[\udc68�]))️?(?:\ud83c[\udffa-\udfff])??(?:[⚕⚖✈]|\ud83c[\udf3e-\udfed]|\ud83d[\udcbb�\udd27�\ude80�])|(?:\ud83d[\udc68�]|\ud83e[\uddd0-\udddf])(?:\ud83c[\udffa-\udfff])??[♀♂⚕⚖✈]?️?|(?:(?:❤|\ud83d[\udc66-\udc69�])[‍️]{0,2}){1,3}(?:❤|\ud83d[\udc66-\udc69�])|(?:(?:❤|\ud83d[\udc66-\udc69�])️?){2,4}|(?:\ud83d[\udc68�\udc6e�-\udc87�\ude45-\ude4e]|\ud83e[\udd26�]|\ud83c[\udfc3-\udfcc]|\ud83e[\udd38-\udd3e]|\ud83d[\udea3-\udeb6]|⛹|👯)️?(?:\ud83c[\udffb-\udfff])??[♀♂]?️?|(?:[☝⛹✊-✍]|\ud83c[\udf85-\udfcc]|\ud83d[\udc42-\udcaa�-\udd96�-\ude4f�-\udecc]|\ud83e[\udd18-\udd3e])️?(?:\ud83c[\udffb-\udfff])|(?:[↔-↙↩-↪]️?|[#*]|[〰〽]️?|(?:\ud83c[\udd70-\udd71]|🆎|\ud83c[\udd91-\udd9a])️?|Ⓜ️?|[㊗㊙]️?|(?:\ud83c[\ude01-\ude02]|🈚|🈯|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51])️?|[‼⁉]️?|[▪-▫▶◀◻-◾]️?|[©®]️?|[™ℹ]️?|🀄️?|[⬅-⬇⬛-⬜⭐⭕]️?|[⌚-⌛⌨⏏⏩-⏳⏸-⏺]️?|🃏|[⤴⤵]️?)|[✀-➿]️?|[\ud800-\udbff][\udc00-\udfff]️?|[☀-⛿]️?|[0-9]️","g")

The above snippet is unparsable.


Bisecting versions shows that this last worked correctly in v3.0.25 and was first broken in v3.1.0. I took a look at the compare for those two versions, but I just don't possess the required skillset to debug this stuff. I wish I could be of more help.

@kzc
Contributor

kzc commented Dec 9, 2017

That library is probably using some illegal unpaired surrogate. See: #2242

This works:

$ bin/uglifyjs emojione.js -m -c -b beautify=false,ascii_only | node -p
/<object[^>]*>.*?<\/object>|<span[^>]*>.*?<\/span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:🏳)️?‍?(?:🌈)|(?:👁)️?‍?(?:🗨)️?|[#-9]️?⃣|(?:(?:🏴)(?:�[�-�]){1,6})|(?:�[�-�]){2}|(?:(?:�[��]))️?(?:�[�-�])?‍?(?:[⚕⚖✈]|�[�-�]|�[������])|(?:�[��]|�[�-�])(?:�[�-�])?‍?[♀♂⚕⚖✈]?️?|(?:(?:❤|�[�-��])[‍️]{0,2}){1,3}(?:❤|�[�-��])|(?:(?:❤|�[�-��])️?){2,4}|(?:�[����-���-�]|�[��]|�[�-�]|�[�-�]|�[�-�]|⛹|👯)️?(?:�[�-�])?‍?[♀♂]?️?|(?:[☝⛹✊-✍]|�[�-�]|�[�-��-��-��-�]|�[�-�])️?(?:�[�-�])|(?:[↔-↙↩-↪]️?|[#*]|[〰〽]️?|(?:�[�-�]|🆎|�[�-�])️?|Ⓜ️?|[㊗㊙]️?|(?:�[�-�]|🈚|🈯|�[�-�]|�[�-�])️?|[‼⁉]️?|[▪-▫▶◀◻-◾]️?|[©®]️?|[™ℹ]️?|🀄️?|[⬅-⬇⬛-⬜⭐⭕]️?|[⌚-⌛⌨⏏⏩-⏳⏸-⏺]️?|🃏|[⤴⤵]️?)|[✀-➿]️?|[�-�][�-�]️?|[☀-⛿]️?|[0-9]️/g

@kzc
Contributor

kzc commented Dec 9, 2017

Reduced test case:

$ bin/uglifyjs -V
uglify-es 3.2.1
$ cat regex.js 
new RegExp("[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]");
$ cat regex.js | bin/uglifyjs 
new RegExp("[\udc42-\udcaa�-\udd96�-\ude4f�-\udecc]");
$ cat regex.js | bin/uglifyjs -b ascii_only
new RegExp("[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]");

$ cat regex.js | bin/uglifyjs | bin/uglifyjs -b ascii_only
new RegExp("[\udc42-\udcaa\ufffd-\udd96\ufffd-\ude4f\ufffd-\udecc]");

Unpaired surrogates must be output in ascii in the default binary output mode.
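
To see which code units in the reduced test case are affected, a small check like the following can help. This is only a sketch: `findLoneSurrogates` is a hypothetical helper written for illustration, not something uglify provides. It reports the index of every surrogate code unit that is not part of a valid high/low pair.

```javascript
// Hypothetical helper (not part of uglify): return the indices of all
// unpaired surrogate code units in a string.
function findLoneSurrogates(str) {
  const lone = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const isHigh = code >= 0xd800 && code <= 0xdbff;
    const isLow = code >= 0xdc00 && code <= 0xdfff;
    if (isHigh) {
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair: skip the low half
        continue;
      }
      lone.push(i); // high surrogate with no low surrogate after it
    } else if (isLow) {
      lone.push(i); // low surrogate with no high surrogate before it
    }
  }
  return lone;
}

// The character class from the reduced test case.
const source = "[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]";
console.log(findLoneSurrogates(source));
// prints [ 1, 3, 4, 6, 7, 9, 10, 12 ]
```

All eight surrogates in that character class are low surrogates (U+DC00 to U+DFFF) with no preceding high surrogate, so none of them can be written as raw UTF-8.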

alexlamsl added the bug label Dec 9, 2017
@alexlamsl
Collaborator

uglify-es parser recognises surrogate pairs - so the question is does this work on harmony?

If so, I guess it's just a matter of backporting part of that logic onto master.

@kzc
Contributor

kzc commented Dec 9, 2017

> uglify-es parser recognises surrogate pairs - so the question is does this work on harmony?

#2242 has the background of master and harmony diverging - and how node streams cannot handle unpaired binary surrogates.

The reduced test case above was done with harmony.

Other ES parsers keep and use the raw string. Uglify does not. We probably have to step through the string char by char to output it properly in binary mode.
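
A rough sketch of what stepping through the string char by char could look like is below. It is an illustration only, not uglify's actual implementation, and `escapeLoneSurrogates` is a made-up name. Valid surrogate pairs are kept as raw characters, while any unpaired surrogate is written as a `\uXXXX` escape so it survives UTF-8 output.

```javascript
// Hypothetical sketch (not uglify's real code): keep valid surrogate pairs
// raw, and emit unpaired surrogates as \uXXXX escapes.
function escapeLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += str[i] + str[i + 1]; // valid pair: output as-is
        i++;
        continue;
      }
    }
    if (code >= 0xd800 && code <= 0xdfff) {
      // Unpaired surrogate: write an ASCII escape instead of the raw unit.
      out += "\\u" + code.toString(16).padStart(4, "0");
    } else {
      out += str[i];
    }
  }
  return out;
}

// Lone surrogates are escaped; the paired emoji at the end stays raw.
console.log(escapeLoneSurrogates("[\udc42-\udcaa\udd74]\ud83c\udf08"));
```

This does not handle quoting or other escapes that a real string printer needs; it only shows the pairing check.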

@workmanw
Author

workmanw commented Dec 9, 2017

@kzc Awesome. Thanks so much for helping produce the reduced test case.

@alexlamsl Thanks for the fast response / tagging.

@kzc
Contributor

kzc commented Dec 9, 2017

Even though it can probably be accommodated by uglify, I still have my doubts that the original string with unpaired surrogates - even in ascii form - is valid ECMAScript. The ES spec is silent on the use of unpaired surrogates in both strings and RegExp. It's probably a de facto browser thing.

Even node converts such a regex to a string by replacing the lone surrogates with the Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD):

$ cat regex.js | node -p
/[�-��-��-��-�]/

$ cat regex.js | node -p | xxd
0000000: 2f5b efbf bd2d efbf bdef bfbd 2def bfbd  /[...-......-...
0000010: efbf bd2d efbf bdef bfbd 2def bfbd 5d2f  ...-......-...]/
0000020: 0a                                       .
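
The same substitution can be reproduced without a shell pipeline. The sketch below assumes a runtime with the standard `TextEncoder` and `TextDecoder` globals (recent Node versions and browsers). Encoding a lone surrogate to UTF-8 yields the three bytes of U+FFFD, matching the `efbf bd` sequences in the dump above.

```javascript
// A lone low surrogate cannot be represented in UTF-8, so the encoder
// substitutes U+FFFD (REPLACEMENT CHARACTER), whose UTF-8 form is EF BF BD.
const bytes = new TextEncoder().encode("\udc42");
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // ef bf bd

// Decoding the bytes gives back U+FFFD, not the original code unit,
// so the information is lost once the string has been written out.
const roundTripped = new TextDecoder().decode(bytes);
console.log(roundTripped === "\ufffd"); // true
console.log(roundTripped === "\udc42"); // false
```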

Related unicode regular expression spec:

"It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings. "
http://unicode.org/reports/tr18/#Supplementary_Characters

and related discussion:

"lone surrogates cannot be part of any valid UTF"
http://unicode.org/pipermail/unicode/2015-October/002979.html

@kzc
Contributor

kzc commented Dec 9, 2017

Those affected can use the -b beautify=false,ascii_only workaround in the meantime.
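
For anyone calling uglify from Node instead of the CLI, the same workaround should be expressible through the `output` options of `minify()`. The snippet below is a sketch under that assumption: it requires `uglify-js` to be installed locally, and skips the call otherwise. Compress and mangle are left at their API defaults.

```javascript
// Options mirroring `-b beautify=false,ascii_only` from the CLI workaround.
const minifyOptions = {
  output: {
    beautify: false,
    ascii_only: true, // emit non-ASCII characters as \uXXXX escapes
  },
};

let UglifyJS = null;
try {
  UglifyJS = require("uglify-js"); // assumption: installed in this project
} catch (e) {
  console.log("uglify-js is not installed; skipping the minify() call");
}

if (UglifyJS) {
  // The source text contains \uXXXX escapes, as in the reduced test case.
  const input = 'new RegExp("[\\udc42-\\udcaa\\udd74-\\udd96]");';
  const result = UglifyJS.minify(input, minifyOptions);
  if (result.error) {
    console.log("minify failed:", result.error.message);
  } else {
    console.log(result.code);
  }
}
```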

@kzc
Contributor

kzc commented Dec 10, 2017

The fix will be in v3.2.3.

@workmanw
Author

Awesome!! Thank you guys.

@mchern

mchern commented Feb 8, 2018

I am on uglify 3.3.1 and it still breaks my regexes, which breaks the prod builds.
And if I set ascii_only=true, the next compiler doesn't see the .factory method.

@danielberndt

Minifying commonmark.js when building via the most recent create-react-app yields this error:

[Screenshot of the error, taken 2018-03-30 at 19:58:18]

I've installed v3.3.9 of uglify-js.
