How to know if a string was parsed as utf-8? #406

Closed
pboettch opened this Issue Dec 29, 2016 · 14 comments


@pboettch
pboettch commented Dec 29, 2016 edited

For my schema-validator I needed to check the length of a string value. std::string::length() gives the byte count, which is not the character count if the string is UTF-8.

I wrote my own function, which works for ASCII and UTF-8.

Could I do it differently? Should nlohmann::json somehow inform me (with a method) that a Unicode string had been parsed?

@nlohmann
Owner

I do not really understand the issue. Can you please provide an example where std::basic_string::length is not sufficient?

@pboettch

Of course. This code

std::cerr << "string: " << instance << ", "
          << "length: " << instance.get<std::string>().length() << ", "
          << "size: " << instance.get<std::string>().size() << ", "
          << "utf8-size: " << utf8_length(instance) << "\n";

gives

string: "💩💩", length: 8, size: 8, utf8-size: 2

on

"data":"\uD83D\uDCA9\uD83D\uDCA9"

The validator expects 2. I know this is not a JSON-HPP issue so I'm unsure who to blame ;-) .
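
For readers following along: the implementation of utf8_length is not shown in this thread. The following is only a minimal sketch of one common way to count code points, under the assumption that the input is already valid UTF-8 (which it does not verify). The name count_utf8_code_points is hypothetical and is not part of nlohmann::json.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of nlohmann::json.
// Assumes `s` holds valid UTF-8; no validation is performed.
// Each code point has exactly one lead byte, and every continuation
// byte has the bit pattern 10xxxxxx, so counting the bytes that are
// NOT continuation bytes yields the number of code points.
std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the example above, U+1F4A9 is encoded in UTF-8 as the four bytes F0 9F 92 A9, so the two-character string occupies 8 bytes while this function reports 2.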

@nlohmann
Owner

I see. I wonder if your function is actually correct in counting the UTF-8 characters - is it really so simple?

@nlohmann
Owner

From my point of view, I think this counting issue is out of scope of this library. Though a "count UTF-8 character" function is handy, I fear that it may bloat the API.

@pboettch
pboettch commented Jan 3, 2017 edited

This library parses Unicode and UTF-8-strings silently into a std::string. Thus, one should never use size() or length() (== byte-count) to check the string-length but a function similar to the one I'm using. Always.

A method (bool is_utf8()) could indicate whether this is a UTF-8-string or not. This information could then be used to check the size in a correct manner.

Maybe explaining it in the documentation is enough.

@nlohmann
Owner
nlohmann commented Jan 3, 2017

I don't quite understand: JSON is defined to use Unicode (though this library only supports UTF-8), so I would not know what except true to return for is_utf8. I understand that you'd like either a proper character/glyph/whatever count (which std::string::size() will not be able to provide) or at least a bool contains_multibyte_encoded_codepoints() function.
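
A function like the one named above does not exist in the library. As a rough user-side sketch, assuming the string holds UTF-8, it could be written as a free function that looks for any byte outside the ASCII range:

```cpp
#include <algorithm>
#include <string>

// Hypothetical user-side helper, not part of nlohmann::json.
// In UTF-8, code points up to U+007F are encoded as a single byte
// below 0x80; every byte of a multi-byte sequence is 0x80 or above.
bool contains_multibyte_encoded_codepoints(const std::string& s) {
    return std::any_of(s.begin(), s.end(),
                       [](unsigned char c) { return c >= 0x80; });
}
```

When this returns false, the byte count from size() and the code point count coincide; when it returns true, they differ.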

Am I wrong?

@pboettch
pboettch commented Jan 3, 2017

I'd like to know which counting method I need to apply based on what and how it has been parsed into the std::string.

The utf-8-counting method works, but needs to be located on the user-side.

How to prevent users in the future from falling into the same trap as I did? How many users really need the real character-count and are not aware of multibyte-encoding-problems?

@jaredgrubb

std::string has no concept of encoding. You can put UTF8, ISO8859-1, UCS2, UTF32, or whatever you like into a std::string. You have to keep track of the encoding external to the string (or, better, just assume UTF8 everywhere and convert from/to it at the "boundaries" of your program). If your program has to handle data and doesn't know what the encoding is, there are algorithms that can try to guess, but they're not foolproof and you're in scary territory at that point. There are very few cases where you should be unsure of what you're getting -- a text editor or web browser is a good legitimate example, but there are many bad ones, and you should never guess without giving the user some UI to confirm what you've done.

I don't think adding Unicode tools to a JSON library is helpful. It's a slippery slope (for example, counting code points can include or not include the "combining" modifiers like ◌ͤ, handling surrogates, collation, normalization, locales, etc). There are entire C++ libraries for Unicode handling because it's hard, and if you need them, you should use them -- even for "simple" UTF8.
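
To illustrate the point about combining modifiers, here is a small sketch (all names are hypothetical, and valid UTF-8 input is assumed). The same perceived character "é" can be stored either precomposed or as a base letter plus a combining accent, and the byte count and code point count differ between the two forms:

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration; assumes valid UTF-8 and does not validate it.
struct Utf8Counts {
    std::size_t bytes;
    std::size_t code_points;
};

Utf8Counts measure_utf8(const std::string& s) {
    Utf8Counts counts{s.size(), 0};
    for (unsigned char c : s) {
        // Continuation bytes (10xxxxxx) do not start a new code point.
        if ((c & 0xC0) != 0x80) {
            ++counts.code_points;
        }
    }
    return counts;
}

// Two encodings of what a reader perceives as the single character "é".
const std::string precomposed = "\xC3\xA9";   // U+00E9
const std::string decomposed  = "e\xCC\x81";  // U+0065, then U+0301 (combining acute accent)
```

Here measure_utf8(precomposed) gives 2 bytes and 1 code point, while measure_utf8(decomposed) gives 3 bytes and 2 code points. Neither number alone says "one character" for both forms; that requires normalization or grapheme segmentation from a dedicated Unicode library.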

@nlohmann
Owner
nlohmann commented Jan 3, 2017

I agree with @jaredgrubb. All the library can do is to document that it in fact stores strings as UTF-8 and the user has an interface to the stored bytes as std::string. Anything beyond this (i.e., providing a string type with a nice Unicode-friendly interface) is out of scope of a JSON library.

@pboettch
pboettch commented Jan 4, 2017

Coming back to my original question: How to know if a string was parsed as utf-8? The answer is: you don't, but you should assume that within this library std::string-value is always multibyte-encoded and take the necessary precautions.

@nlohmann
Owner
nlohmann commented Jan 4, 2017

So it's a documentation issue?

@nlohmann
Owner
nlohmann commented Jan 4, 2017

I shall add notes to the documentation about the encoding of the stored strings.

@nlohmann nlohmann added this to the Release 2.0.11 milestone Jan 4, 2017
@nlohmann nlohmann self-assigned this Jan 4, 2017
@nlohmann nlohmann closed this Jan 4, 2017