-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Unicode in JSON input #5504
Conversation
@@ -162,6 +185,57 @@ namespace misc_utils | |||
val.push_back('\\');break; | |||
case '/': //Slash character | |||
val.push_back('/');break; | |||
case 'u': //Unicode code point | |||
if (it + 1 == buf_end || it + 2 == buf_end || it + 3 == buf_end || it + 4 == buf_end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (buf_end - it < 4)
} | ||
else | ||
{ | ||
uint32_t dst = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This portions looks like a:
uint16_t dst = 0;
for (unsigned count = 0; count < 4; ++count)
{
dst <<= 4;
const unsigned char tmp = isx[*++it];
CHECK_AND_ASSERT_THROW_MES(tmp != 0xff, "Bad unicode encoding");
dst |= tmp;
}
Also note the uint16_t
because that is the entire range that can be extracted here.
val.push_back(0x80 | ((dst >> 6) & 0x3f)); | ||
val.push_back(0x80 | (dst & 0x3f)); | ||
} | ||
else if (dst <= 0x10ffff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value range is not possible with 4 hex characters. Anything in this range is provided as a UTF-16 surrogate pair. This requires parsing two 16-bit values. If the first 16-bit value is between 0xD800–0xDBFF
, then another 16-bit value must follow which represents the entire code point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know how to encode code points > 0xffff ? All the examples I've fund use \uxxxx with 4 digits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you mean this encoding is actually UTF-16, not raw code points ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d012094
to
4144dbf
Compare
4144dbf
to
3e11bb5
Compare
Sorry for not re-reviewing this PR earlier, UTF-16 surrogate pairs are still not handled properly. Anything outside of the Basic Multilingual Plane will decode incorrectly (which are less common characters at least). |
I will try to update it with tests later this week. |
No description provided.