Cannot match Unicode characters outside the BMP #1

Open: wants to merge 2 commits into master

@nylen (Owner) commented Jul 10, 2017

This is a bit of a mess...

PEG.js does not allow specifying characters outside the BMP (code points above U+FFFF, which take more than two bytes in UTF-16) in its grammars. JavaScript represents these as surrogate pairs, for example \ud83d\udca9. Presumably this works fine with vanilla PEG.js.
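To make the JavaScript side concrete, here is a small sketch using only standard string methods. It shows that the example character is stored as two UTF-16 code units but is a single code point:

```javascript
// U+1F4A9 written as the two-unit sequence from the description above.
const poo = '\ud83d\udca9';

// String.prototype.length counts UTF-16 code units, not characters.
const codeUnits = poo.length;

// The string iterator walks code points, so the spread yields one element.
const codePoints = [...poo].length;

// codePointAt(0) decodes the pair into the full code point.
const hex = poo.codePointAt(0).toString(16);

console.log(codeUnits, codePoints, hex); // 2 1 1f4a9
```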

In phpegjs, we use PCRE to split the input into characters. This treats an emoji as a single character, for example \u{1f4a9}.

It would be very difficult to bridge this gap. We'd have to write logic (in JavaScript) to accept string literals containing surrogate pairs like \ud83d\udca9, calculate what their length would be in PHP, then adjust the generated code to account for the difference between the string length in PHP and in JavaScript.
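The length calculation by itself is the easy part. A rough sketch, assuming the PHP side sees one character per code point (as the PCRE split described above suggests); `codePointLength` is a hypothetical helper, not something that exists in phpegjs:

```javascript
// Hypothetical helper: counts code points instead of UTF-16 code units.
function codePointLength(literal) {
  let count = 0;
  // for...of iterates a string by code point, so a surrogate pair counts once.
  for (const _ch of literal) {
    count += 1;
  }
  return count;
}

const literal = 'a\ud83d\udca9b';
const jsLength = literal.length;             // length as JavaScript sees it
const phpLength = codePointLength(literal);  // assumed length on the PHP side
const adjustment = jsLength - phpLength;     // what generated code would have to correct for

console.log(jsLength, phpLength, adjustment); // 4 3 1
```

The hard part is everything after this: threading that adjustment through every place the generated parser advances its position.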

That still wouldn't handle character classes: JavaScript and PEG.js would need two separate character classes like [\ud83d][\udca9], while PHP would need a single [\x{1f4a9}].
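For illustration only, this is roughly what rewriting an adjacent surrogate pair into a PCRE-style escape could look like. `surrogatePairsToPcre` is a hypothetical name, and the sketch only covers a pair that sits side by side in the source text. It does nothing for the two-separate-classes form, which is the actual problem:

```javascript
// Hypothetical helper: rewrites each high+low surrogate pair as \x{...}.
function surrogatePairsToPcre(source) {
  // Without the u flag, this regex matches individual UTF-16 code units,
  // so it finds a high surrogate immediately followed by a low surrogate.
  return source.replace(/[\ud800-\udbff][\udc00-\udfff]/g, (pair) =>
    '\\x{' + pair.codePointAt(0).toString(16) + '}'
  );
}

const converted = surrogatePairsToPcre('[\ud83d\udca9]');
console.log(converted); // [\x{1f4a9}]
```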

Fortunately this is only going to be a problem if your grammar itself needs to match these characters. Hopefully you don't need to do this.
