Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize UTF-8 string with Unicode escapes #687

Merged
merged 2 commits into from
Oct 4, 2017

Conversation

pkierski
Copy link
Contributor

@pkierski pkierski commented Oct 3, 2017

Handling Unicode charactes, assuming strings are UTF-8

@cdunn2001 cdunn2001 merged commit 42a161f into open-source-parsers:master Oct 4, 2017
@cdunn2001
Copy link
Contributor

Thanks. Looks efficient. I hope you don't mind that I squashed.

I guess we should have tests for Unicode, but this is an excellent start.

@pkierski pkierski deleted the deserialize-as-utf-8 branch October 4, 2017 07:20
@cdunn2001
Copy link
Contributor

@pkierski, There seems to be a bug in this change. See #711.

@cdunn2001 cdunn2001 mentioned this pull request Dec 17, 2017
@Arminius
Copy link

Hi, I'm not happy with the fact that you assume that internal strings are UTF-8. We use JsonCpp in an old Windows application that uses MBCS strings (specifically, encoded with ISO-8859-1). This works well with JsonCpp versions up to 1.8.3, but this change breaks our Json output (yielding Gibberish where there were non-ASCII characters).

We do have UTF-8 output with the released version 1.8.3 even though we use ISO-8859-1 internally, why do you now have to change the API so that it requires UTF-8 strings internally?

@cdunn2001
Copy link
Contributor

Hmmm. That's interesting.

I don't know how to avoid assuming a string encoding. I guess we could keep a vector of Unicode characters instead of strings, but you'd still have to decode your MBCS strings before handing them to jsoncpp. I guess your data worked because we failed to translate those strings into Unicode JSON escapes, yes?

What do you propose?

@pkierski
Copy link
Contributor Author

I thing UTF-8 is the best choice for any C/C++ JSON library (see String and Character Issues in RFC 7159). As far as std::string doesn't contain information about encoding we can assume UTF-8 for best interoperability.

Alternatively it's possible to add kind of switch for serialization and parsing string like "byte array"; without any conversion, just for compatibility with code which makes such assumption.

@cdunn2001
Copy link
Contributor

I agree with @pkierski. @Arminius, would that work for you? We're suggesting a setting which would turn off Unicode escapes completely. Your strings would simply end up in the JSON file. Is that ok?

@Arminius
Copy link

Arminius commented Jan 8, 2018

As far as I understand it now, it's basically a miracle that I ever got correct UTF-8 output when using MBCS strings, and even then it only worked because I never used strings containing characters that need to be escaped. I don't think it's a good idea to add such a switch - it would essentially revert to incorrect behaviour.

As for the legacy application I mentioned, I suppose we'll continue using JsonCpp 1.8.3 until we can sort out the necessary conversions ourselves.

However, what you really should do if you're going to assume UTF-8 strings is document that fact. It should be clear to anyone using JsonCpp that strings that are assigned to a Json::Value must be encoded as UTF-8.

@xujintao
Copy link

does this necessary? Shouldn't the serialize result be utf8?
for example

Json::Value root;
root["name"] = "你的名字";   //Chinese, and use utf8 
Json::FastWriter fwriter;
std::string retStr = fwriter.write(root);
std::cout << retStr;

before commit, the result is utf8 string like this:

{"name", "你的名字"}

but now, it turns out to be a unicode string like this:

{"name", "\u4f60\u7684\u540d\u5b57"}

I really want the former serialize result. @cdunn2001 @pkierski

@pkierski
Copy link
Contributor Author

JSON document is valid UTF-8 document without explicit UTF-8 encoding and escaping as far as you pass valid UTF-8 encoded string into appropriate Set() method. It's not true if someone use eg. one-byte, non-UTF-8 encoding like CP-1250. I know in you case escaping is wasting of space (6 or 12 bytes instead of 3 or 4). So my proposition is to add kind of flag: "check UTF-8 and use escape sequence for strings" vs. "pass strings content as is".

@Arminius
Copy link

I disagree here, JSON is supposed to be a human-readable format. Escaping characters that don't need to be escaped is erroneous behaviour in that case. You shouldn't have to pass an extra flag to make sure your JSON output doesn't become an unreadable mess of escape sequences when you use non-Latin characters.

@xujintao
Copy link

I agree with @Arminius .
I would stay with JsonCpp 1.8.3.

@cdunn2001
Copy link
Contributor

@Arminius, I agree. JSON does not require Unicode to be escaped. However, the binary representation of an unescaped Unicode document requires an encoding. This is not a question of the internal representation, but of the external.

Anyway, it seems that we've agreed to a solution, right? The behavior of escaping the non-ascii characters should be configurable. Any volunteers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants