Cannot match Unicode characters outside the BMP #1

nylen · 2017-07-10T18:14:18Z

This is a bit of a mess...

PEG.js does not allow specifying characters outside the BMP (code points above U+FFFF) directly in its grammars. JavaScript represents these as UTF-16 surrogate pairs, for example \ud83d\udca9. Presumably this works fine with vanilla PEG.js.
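To make the JavaScript side concrete, here is a small sketch using only standard string methods. It is not taken from PEG.js; the variable names are just for illustration.

```javascript
// U+1F4A9 spelled as a surrogate pair, which is how a grammar has to write it.
const pile = "\ud83d\udca9";

// String length counts UTF-16 code units, so this one emoji has length 2.
const codeUnits = pile.length;

// The two code units are the high and low surrogates.
const high = pile.charCodeAt(0).toString(16);
const low = pile.charCodeAt(1).toString(16);

// codePointAt(0) reassembles the pair into the single code point.
const codePoint = pile.codePointAt(0).toString(16);

console.log(codeUnits, high, low, codePoint); // 2 d83d dca9 1f4a9
```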

In phpegjs, we use PCRE to split the input into characters, so an emoji is handled as a single code point, for example \u{1f4a9}.

It would be very difficult to bridge this gap. We'd have to write logic (in JavaScript) to accept string literals containing surrogate pairs like \ud83d\udca9, calculate what their length would be in PHP, then adjust the generated code to account for the difference between the string length in PHP and in JavaScript.
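The length calculation on its own might look roughly like the sketch below. It assumes the PHP side counts one unit per code point, as the PCRE splitting described above suggests. codePointLength is a hypothetical helper, not something that exists in phpegjs or PEG.js, and adjusting the generated code is not shown.

```javascript
// Hypothetical helper: count code points instead of UTF-16 code units.
function codePointLength(literal) {
  // Array.from uses the string iterator, which yields one item per code point,
  // so a surrogate pair counts once.
  return Array.from(literal).length;
}

const literal = "a\ud83d\udca9b";

const jsLength = literal.length;            // length in UTF-16 code units
const phpLength = codePointLength(literal); // assumed per-character length on the PHP side
const adjustment = jsLength - phpLength;    // how far the two lengths disagree

console.log(jsLength, phpLength, adjustment); // 4 3 1
```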

That still wouldn't handle matching character classes, because JavaScript and PEG.js would have to use two separate character classes like [\ud83d][\udca9], but PHP would have to use [\x{1f4a9}] instead.
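The character class mismatch can be reproduced with plain JavaScript regular expressions. This is only an illustration of the two matching models, not what PEG.js or phpegjs generate: without the u flag a class matches one UTF-16 code unit, while with the u flag it matches one code point, which is closer to how the PHP side would see [\x{1f4a9}].

```javascript
const input = "\ud83d\udca9";

// Code-unit model: two classes, one per surrogate.
const twoClasses = /^[\ud83d][\udca9]$/.test(input);

// A single class without the u flag matches only one code unit,
// so it cannot consume the whole pair.
const oneClassNoFlag = /^[\ud83d\udca9]$/.test(input);

// Code-point model: with the u flag, one class matches the whole character.
const oneClassUnicode = /^[\u{1f4a9}]$/u.test(input);

console.log(twoClasses, oneClassNoFlag, oneClassUnicode); // true false true
```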

Fortunately this is only going to be a problem if your grammar itself needs to match these characters. Hopefully you don't need to do this.

nylen added 2 commits July 10, 2017 19:54

Add failing emoji test

88e20d4

Add expected value of emoji test

db58ae0
