Incorrect parsing of UTF-16 escape sequences #37

Closed
schlndh opened this issue Jul 1, 2021 · 2 comments

schlndh commented Jul 1, 2021

I'm trying to parse a string that contains the character 😀 (U+1F600) escaped as a UTF-16 surrogate pair, but the result is incorrect. I debugged it a bit, and the issue seems to be that the UTF-16 code units are passed to unicodeToUtf8 one by one, instead of first decoding the pair into a single Unicode code point and passing that to unicodeToUtf8.

Here is code that reproduces the problem:

// This doesn't work.
echo \Peast\Peast::latest('"\uD83D\uDE00"')->parse()->getBody()[0]->getExpression()->getValue() . "\n";
// These work.
echo json_decode('"\uD83D\uDE00"') . "\n";
echo \Peast\Syntax\Utils::unicodeToUtf8(0x0001F600) . "\n";
echo \Peast\Peast::latest('"\u{1F600}"')->parse()->getBody()[0]->getExpression()->getValue() . "\n";
echo \Peast\Peast::latest('"😀"')->parse()->getBody()[0]->getExpression()->getValue() . "\n";
// The equivalent JavaScript also works (i.e. it shows the smiley face).
console.log("\uD83D\uDE00");

I tested this with PHP 7.4 and the latest master (b33fa0d).
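
The decoding step described above can be sketched as follows. This is only an illustration, not Peast code: `combineSurrogates` and `codePointToUtf8` are hypothetical names, and `codePointToUtf8` is a stand-in for what `Utils::unicodeToUtf8` is assumed to do with a single code point.

```php
<?php
// Hypothetical helpers for illustration; they are not part of Peast.

/** Combine a UTF-16 high/low surrogate pair into one Unicode code point. */
function combineSurrogates(int $high, int $low): int
{
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Not a valid surrogate pair');
    }
    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

/** Encode one code point as UTF-8 (assumed equivalent of unicodeToUtf8). */
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

$cp = combineSurrogates(0xD83D, 0xDE00);
echo dechex($cp), "\n";          // prints 1f600
echo codePointToUtf8($cp), "\n"; // prints the 4-byte UTF-8 sequence for U+1F600
```

Encoding 0xD83D and 0xDE00 separately would instead yield two 3-byte sequences, which is not valid UTF-8 and matches the incorrect output reported here.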

mck89 (Owner) commented Jul 11, 2021

I think the form in which a single character is represented by two Unicode escapes is the surrogate-pair form (as in Modified UTF-8) that is used when converting from UTF-16 to UTF-8.

I'm still trying to understand whether that is the case and, if so, whether there's some way to group those characters without refactoring the one-by-one logic (this problem should affect strings, templates and variable names).

I'm very busy right now, but I will try to work on it in a few weeks.

mck89 closed this as completed in cd50aa9 on Jul 24, 2021

mck89 (Owner) commented Jul 24, 2021

I've just released a new version with surrogate pair support in strings and templates. There is no need to change variable name parsing, since surrogate pairs are not allowed in variable names. Thank you for reporting!
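
For readers curious how surrogate pairs can be grouped while still handling escapes one match at a time, here is a minimal sketch. It is an assumption about one possible approach, not the code from commit cd50aa9: `unescapeUnicode` and `utf8FromCodePoint` are hypothetical names, and the sketch ignores other escape forms such as `\u{...}` and escaped backslashes.

```php
<?php
// Sketch only: hypothetical functions, not taken from Peast.

/** Encode one code point as UTF-8. */
function utf8FromCodePoint(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // Note: a lone surrogate would land here and produce invalid UTF-8;
        // handling that case is left out of this sketch.
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

/** Replace \uXXXX escapes, consuming high+low surrogate pairs as one unit. */
function unescapeUnicode(string $raw): string
{
    // First alternative: a high surrogate escape directly followed by a low
    // surrogate escape. Second alternative: any single \uXXXX escape.
    $pattern = '/\\\\u([dD][89abAB][0-9a-fA-F]{2})\\\\u([dD][c-fC-F][0-9a-fA-F]{2})'
        . '|\\\\u([0-9a-fA-F]{4})/';

    return preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            // Surrogate pair: rebuild the supplementary-plane code point.
            $high = (int) hexdec($m[1]);
            $low = (int) hexdec($m[2]);
            $cp = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
        } else {
            $cp = (int) hexdec($m[3]);
        }
        return utf8FromCodePoint($cp);
    }, $raw);
}

echo unescapeUnicode('\uD83D\uDE00'), "\n"; // single-quoted, so the escapes reach the function intact
```

Putting the pair alternative first in the pattern means the regex engine tries it before the single-escape case, so the per-escape callback logic stays otherwise unchanged.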
