Support Unicode in JSON input #5504

Open
wants to merge 2 commits into
base: master
from

2 participants
@moneromooo-monero
Contributor

commented Apr 30, 2019

No description provided.

@@ -162,6 +185,57 @@ namespace misc_utils
val.push_back('\\');break;
case '/': //Slash character
val.push_back('/');break;
case 'u': //Unicode code point
if (it + 1 == buf_end || it + 2 == buf_end || it + 3 == buf_end || it + 4 == buf_end)

@vtnerd

vtnerd May 1, 2019

Contributor

if (buf_end - it < 4)

}
else
{
uint32_t dst = 0;

@vtnerd

vtnerd May 1, 2019

Contributor

This portion looks like a loop:

uint16_t dst = 0;
for (unsigned count = 0; count < 4; ++count)
{
    dst <<= 4;
    const unsigned char tmp = isx[*++it];
    CHECK_AND_ASSERT_THROW_MES(tmp != 0xff, "Bad unicode encoding");
    dst |= tmp;
}

Also note the uint16_t because that is the entire range that can be extracted here.
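
For illustration only, here is a self-contained sketch of that loop with the bounds check folded in. `hex_value` and `parse_hex4` are hypothetical names that are not part of the PR. The sketch swaps in a small conversion function for the `isx` table and `std::runtime_error` for `CHECK_AND_ASSERT_THROW_MES`, since neither is shown here. It assumes `it` points at the `u` of the escape on entry.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the isx[] lookup table: returns the value of one
// hex digit, or 0xff for any other character.
inline unsigned char hex_value(char c)
{
  if (c >= '0' && c <= '9') return static_cast<unsigned char>(c - '0');
  if (c >= 'a' && c <= 'f') return static_cast<unsigned char>(c - 'a' + 10);
  if (c >= 'A' && c <= 'F') return static_cast<unsigned char>(c - 'A' + 10);
  return 0xff;
}

// Hypothetical helper. Assumption: `it` points at the 'u' of "\uXXXX".
// On success, `it` is left on the last hex digit.
inline std::uint16_t parse_hex4(const char*& it, const char* buf_end)
{
  // The 'u' plus four hex digits must all lie inside the buffer.
  if (buf_end - it < 5)
    throw std::runtime_error("Truncated unicode escape");

  std::uint16_t dst = 0;
  for (unsigned count = 0; count < 4; ++count)
  {
    const unsigned char tmp = hex_value(*++it);
    if (tmp == 0xff)
      throw std::runtime_error("Bad unicode encoding");
    dst = static_cast<std::uint16_t>((dst << 4) | tmp);
  }
  return dst;
}
```

Four hex digits carry exactly 16 bits, which is why the result fits in a `uint16_t`.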

val.push_back(0x80 | ((dst >> 6) & 0x3f));
val.push_back(0x80 | (dst & 0x3f));
}
else if (dst <= 0x10ffff)

@vtnerd

vtnerd May 1, 2019

Contributor

This value range is not possible with 4 hex characters. Anything in this range is provided as a UTF-16 surrogate pair, which requires parsing two 16-bit values. If the first 16-bit value is between 0xD800 and 0xDBFF (a high surrogate), then another 16-bit value between 0xDC00 and 0xDFFF (a low surrogate) must follow, and the two together represent the entire code point.
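
A minimal sketch of how two such 16-bit values combine into one code point, using the standard UTF-16 arithmetic. The function names are hypothetical and do not come from the PR, and `std::runtime_error` stands in for whatever error handling the parser actually uses.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helpers: classify a 16-bit value parsed from "\uXXXX".
inline bool is_high_surrogate(std::uint16_t v) { return v >= 0xD800 && v <= 0xDBFF; }
inline bool is_low_surrogate(std::uint16_t v) { return v >= 0xDC00 && v <= 0xDFFF; }

// Combines a high/low surrogate pair into a code point in 0x10000..0x10FFFF.
inline std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
  if (!is_high_surrogate(high) || !is_low_surrogate(low))
    throw std::runtime_error("Bad surrogate pair");

  // Each surrogate contributes 10 bits; the result is offset by 0x10000.
  return 0x10000u
    + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
    + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

For example, the JSON text `\ud83d\ude00` yields the pair (0xD83D, 0xDE00), which combines to U+1F600.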

@moneromooo-monero

moneromooo-monero May 1, 2019

Author Contributor

Do you know how to encode code points > 0xffff? All the examples I've found use \uxxxx with 4 digits.

@moneromooo-monero

moneromooo-monero May 1, 2019

Author Contributor

Oh, you mean this encoding is actually UTF-16, not raw code points?

@vtnerd

vtnerd May 2, 2019

Contributor

Yes, the RFC says the value is UTF-16 and a peek at rapidjson shows that at least one major implementation has done it this way.
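
Once an escape has been resolved to a code point, whether from a single 16-bit value or from a surrogate pair, it still has to be written out as UTF-8, as the `push_back` lines in the diff do. The following is a rough sketch of that step under stated assumptions, not the PR's code: `append_utf8` is a hypothetical name, and rejecting lone surrogates is a choice made for this example.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point `cp` to `val`.
inline void append_utf8(std::string& val, std::uint32_t cp)
{
  // Assumption for this sketch: unpaired surrogates are treated as errors.
  if (cp >= 0xD800 && cp <= 0xDFFF)
    throw std::runtime_error("Unpaired surrogate");

  if (cp <= 0x7f)
  {
    val.push_back(static_cast<char>(cp));
  }
  else if (cp <= 0x7ff)
  {
    val.push_back(static_cast<char>(0xc0 | (cp >> 6)));
    val.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
  }
  else if (cp <= 0xffff)
  {
    val.push_back(static_cast<char>(0xe0 | (cp >> 12)));
    val.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
    val.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
  }
  else if (cp <= 0x10ffff)
  {
    // Only reachable for values that came from a surrogate pair.
    val.push_back(static_cast<char>(0xf0 | (cp >> 18)));
    val.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3f)));
    val.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
    val.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
  }
  else
  {
    throw std::runtime_error("Code point out of range");
  }
}
```

The four-byte branch is the one that a single `\uXXXX` escape can never reach on its own.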

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:euni branch from 3ccfc3c to d012094 May 1, 2019

@moneromooo-monero moneromooo-monero force-pushed the moneromooo-monero:euni branch from d012094 to 4144dbf May 1, 2019
