Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Unicode in JSON input #5504

Merged
merged 2 commits into from Aug 17, 2019
Merged

Conversation

moneromooo-monero
Copy link
Collaborator

No description provided.

@@ -162,6 +185,57 @@ namespace misc_utils
val.push_back('\\');break;
case '/': //Slash character
val.push_back('/');break;
case 'u': //Unicode code point
if (it + 1 == buf_end || it + 2 == buf_end || it + 3 == buf_end || it + 4 == buf_end)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (buf_end - it < 4)

}
else
{
uint32_t dst = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This portions looks like a:

uint16_t dst = 0;
for (unsigned count = 0; count < 4; ++count)
{
    dst <<= 4;
    const unsigned char tmp = isx[*++it];
    CHECK_AND_ASSERT_THROW_MES(tmp != 0xff, "Bad unicode encoding");
    dst |= tmp;
}

Also note the uint16_t because that is the entire range that can be extracted here.

val.push_back(0x80 | ((dst >> 6) & 0x3f));
val.push_back(0x80 | (dst & 0x3f));
}
else if (dst <= 0x10ffff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value range is not possible with 4 hex characters. Anything in this range is provided as a UTF-16 surrogate pair. This requires parsing two 16-bit values. If the first 16-bit value is between 0xD800–0xDBFF, then another 16-bit value must follow which represents the entire code point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how to encode code points > 0xffff ? All the examples I've fund use \uxxxx with 4 digits.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you mean this encoding is actually UTF-16, not raw code points ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the RFC says the value is UTF-16 and a peek at rapidjson shows that at least one major implementation has done it this way.

luigi1111 added a commit that referenced this pull request Aug 17, 2019
eeca5ca epee: support unicode in parsed strings (moneromooo-monero)
3e11bb5 functional_tests: test creating wallets with local language names (moneromooo-monero)
@luigi1111 luigi1111 merged commit 3e11bb5 into monero-project:master Aug 17, 2019
@vtnerd
Copy link
Contributor

vtnerd commented Aug 18, 2019

Sorry for not re-reviewing this PR earlier, UTF-16 surrogate pairs are still not handled properly. Anything outside of the Basic Multilingual Plane will decode incorrectly (which are less common characters at least).

@vtnerd
Copy link
Contributor

vtnerd commented Aug 18, 2019

I will try to update it with tests later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants