Serialize UTF-8 string with Unicode escapes #687

pkierski · 2017-10-03T14:02:27Z

Handling Unicode characters, assuming strings are UTF-8

cdunn2001 · 2017-10-04T01:20:34Z

Thanks. Looks efficient. I hope you don't mind that I squashed.

I guess we should have tests for Unicode, but this is an excellent start.

cdunn2001 · 2017-12-17T08:50:02Z

@pkierski, There seems to be a bug in this change. See #711.

Arminius · 2017-12-18T12:18:37Z

Hi, I'm not happy with the fact that you assume that internal strings are UTF-8. We use JsonCpp in an old Windows application that uses MBCS strings (specifically, encoded with ISO-8859-1). This works well with JsonCpp versions up to 1.8.3, but this change breaks our JSON output (yielding gibberish where there were non-ASCII characters).

We do get UTF-8 output with the released version 1.8.3 even though we use ISO-8859-1 internally. Why do you now have to change the API so that it requires UTF-8 strings internally?

cdunn2001 · 2017-12-18T16:40:29Z

Hmmm. That's interesting.

I don't know how to avoid assuming a string encoding. I guess we could keep a vector of Unicode characters instead of strings, but you'd still have to decode your MBCS strings before handing them to jsoncpp. I guess your data worked because we failed to translate those strings into Unicode JSON escapes, yes?

What do you propose?

pkierski · 2017-12-18T21:02:21Z

I think UTF-8 is the best choice for any C/C++ JSON library (see String and Character Issues in RFC 7159). Since std::string doesn't carry any information about its encoding, we can assume UTF-8 for best interoperability.

Alternatively, it's possible to add a kind of switch that makes serialization and parsing treat strings as a "byte array", without any conversion, just for compatibility with code that makes such an assumption.

cdunn2001 · 2017-12-19T00:24:50Z

I agree with @pkierski. @Arminius, would that work for you? We're suggesting a setting which would turn off Unicode escapes completely. Your strings would simply end up in the JSON file. Is that ok?

Arminius · 2018-01-08T11:55:48Z

As far as I understand it now, it's basically a miracle that I ever got correct UTF-8 output when using MBCS strings, and even then it only worked because I never used strings containing characters that need to be escaped. I don't think it's a good idea to add such a switch - it would essentially revert to incorrect behaviour.

As for the legacy application I mentioned, I suppose we'll continue using JsonCpp 1.8.3 until we can sort out the necessary conversions ourselves.

However, what you really should do if you're going to assume UTF-8 strings is document that fact. It should be clear to anyone using JsonCpp that strings that are assigned to a Json::Value must be encoded as UTF-8.
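For readers in the same situation, the "necessary conversions" mentioned above can be sketched in a few lines. This is a minimal illustration, not part of JsonCpp: the helper name `latin1ToUtf8` is hypothetical, and it assumes strict ISO-8859-1 input, where every byte value equals its Unicode code point. Windows-1252, which is often what "Latin-1" means on Windows, differs in the range 0x80..0x9F and would need a lookup table for those bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a JsonCpp API): convert ISO-8859-1 bytes to UTF-8.
// Assumes strict ISO-8859-1, where byte value == Unicode code point.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII maps to itself.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF encode as the two bytes 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    assert(out.size() >= in.size());  // conversion never shrinks the string
    return out;
}
```

With a helper like this, a legacy string would be converted once at the boundary, for example `root["name"] = latin1ToUtf8(legacyName);`, so that everything stored in a Json::Value is UTF-8.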

xujintao · 2018-01-12T05:59:31Z

Is this necessary? Shouldn't the serialized result be UTF-8?
For example:

Json::Value root;
root["name"] = "你的名字";   // Chinese text, encoded as UTF-8
Json::FastWriter fwriter;
std::string retStr = fwriter.write(root);
std::cout << retStr;

Before the commit, the result is a UTF-8 string like this:

{"name":"你的名字"}

But now it turns out to be a string of Unicode escapes like this:

{"name":"\u4f60\u7684\u540d\u5b57"}

I really want the former serialized result. @cdunn2001 @pkierski

pkierski · 2018-01-12T08:46:15Z

A JSON document is already valid UTF-8, with no escaping needed, as long as you pass valid UTF-8 encoded strings into the appropriate Set() method. That does not hold if someone uses, e.g., a one-byte, non-UTF-8 encoding like CP-1250. I know that in your case escaping is a waste of space (6 or 12 bytes instead of 3 or 4). So my proposal is to add a kind of flag: "check UTF-8 and use escape sequences for strings" vs. "pass string contents as is".
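To make the size trade-off and the proposed flag concrete, here is a minimal sketch of a string quoter with both modes. It is not the code from this pull request: the names `quoteJsonString`, `decodeUtf8`, `appendUnit`, and the `emitUtf8` parameter are all hypothetical. It assumes the input is meant to be UTF-8, replaces malformed sequences with U+FFFD when escaping, does not reject overlong encodings, and writes control characters as `\u00XX` rather than short forms such as `\n`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one UTF-16 code unit as a JSON \uXXXX escape (always 6 bytes).
static void appendUnit(std::string& out, std::uint32_t unit) {
    assert(unit <= 0xFFFF);
    static const char hex[] = "0123456789abcdef";
    out += "\\u";
    for (int shift = 12; shift >= 0; shift -= 4)
        out.push_back(hex[(unit >> shift) & 0xF]);
}

// Decode the UTF-8 sequence starting at s[i] and advance i past it.
// Returns U+FFFD for malformed input; overlong forms are not rejected.
static std::uint32_t decodeUtf8(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    int extra = 0;
    std::uint32_t cp = 0;
    if (lead < 0x80) return lead;
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    while (extra-- > 0) {
        if (i >= s.size()) return 0xFFFD;  // truncated sequence
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp > 0x10FFFF ? 0xFFFD : cp;
}

// Quote a string for JSON. With emitUtf8 == true, non-ASCII bytes are copied
// through unchanged ("pass as is"); otherwise they become \uXXXX escapes.
std::string quoteJsonString(const std::string& utf8, bool emitUtf8) {
    std::string out = "\"";
    for (std::size_t i = 0; i < utf8.size();) {
        std::size_t start = i;
        std::uint32_t cp = decodeUtf8(utf8, i);
        if (cp == '"' || cp == '\\') {
            out.push_back('\\');
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x20) {
            appendUnit(out, cp);  // control characters must be escaped
        } else if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (emitUtf8) {
            out.append(utf8, start, i - start);  // original bytes, untouched
        } else if (cp < 0x10000) {
            appendUnit(out, cp);  // BMP: one escape, 6 bytes
        } else {
            cp -= 0x10000;  // beyond the BMP: surrogate pair, 12 bytes
            appendUnit(out, 0xD800 + (cp >> 10));
            appendUnit(out, 0xDC00 + (cp & 0x3FF));
        }
    }
    out.push_back('"');
    return out;
}
```

Under these assumptions, the three UTF-8 bytes of U+4F60 (你) become the six bytes `\u4f60` when `emitUtf8` is false, and a four-byte character such as U+1F600 becomes a twelve-byte surrogate pair. With `emitUtf8` set to true, the same bytes pass through unchanged.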

Arminius · 2018-01-12T08:53:54Z

I disagree here, JSON is supposed to be a human-readable format. Escaping characters that don't need to be escaped is erroneous behaviour in that case. You shouldn't have to pass an extra flag to make sure your JSON output doesn't become an unreadable mess of escape sequences when you use non-Latin characters.

xujintao · 2018-01-12T10:23:29Z

I agree with @Arminius.
I will stay with JsonCpp 1.8.3.

cdunn2001 · 2018-01-20T21:20:23Z

@Arminius, I agree. JSON does not require Unicode to be escaped. However, the binary representation of an unescaped Unicode document requires an encoding. This is not a question of the internal representation, but of the external.

Anyway, it seems that we've agreed on a solution, right? The behavior of escaping non-ASCII characters should be configurable. Any volunteers?

pkierski added 2 commits October 3, 2017 15:59

Serialize UTF-8 string with Unicode escapes

e97acf3

Remove complilation warning

d54b471

cdunn2001 merged commit 42a161f into open-source-parsers:master Oct 4, 2017

pkierski deleted the deserialize-as-utf-8 branch October 4, 2017 07:20

cdunn2001 mentioned this pull request Dec 17, 2017

Is this intended? #711

Closed

cdunn2001 added the help wanted label Jan 20, 2018

cdunn2001 mentioned this pull request Jan 20, 2018

error while string has Chinese punctuation #727

Closed

cdunn2001 removed the help wanted label Jan 20, 2018

cdunn2001 mentioned this pull request Jan 20, 2018

"Styled string" representation doesn't handle binary data properly #724

Closed

cdunn2001 mentioned this pull request Jul 15, 2018

The problem of incorrect transcoding in Chinese #803

Closed

nyh mentioned this pull request Jul 22, 2018

Failure on test_select_json_types on master scylladb/scylladb#3622

Closed

dmc31a42 mentioned this pull request Oct 18, 2018

중간과정 1 dmc31a42/UnityL10nTool#5

Closed
