For my schema validator I needed to check the length of a string value. std::string::length() returns the number of bytes, which is not the number of characters if the string is UTF-8 encoded.
I wrote my own function which works for ASCII and UTF-8.
Could I do this differently? Should nlohmann::json provide a method to inform me that a Unicode string has been parsed?
I do not really understand the issue. Can you please provide an example where std::basic_string::length is not sufficient?
Of course. This code
std::cerr << "string: " << instance << ", "
<< "length: " << instance.get<std::string>().length() << ", "
<< "size: " << instance.get<std::string>().size() << ", "
<< "utf8-size: " << utf8_length(instance) << "\n";
string: "💩💩", length: 8, size: 8, utf8-size: 2
The validator expects 2. I know this is not a JSON-HPP issue, so I'm unsure whom to blame ;-) .
I see. I wonder if your function is actually correct in counting the UTF-8 characters - is it really so simple?
(Yes, it seems to be so simple ;-) http://stackoverflow.com/questions/7298059/how-to-count-characters-in-a-unicode-string-in-c)
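The counting really is that simple for well-formed UTF-8, because every code point starts with exactly one byte that is not a continuation byte. A minimal sketch of such a helper (the name utf8_length and the plain-std::string signature are my assumptions, not library API; it also does no validation of malformed input):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string by counting
// every byte that is NOT a continuation byte. Continuation bytes have
// the bit pattern 10xxxxxx (i.e. 0x80..0xBF), so masking with 0xC0 and
// comparing against 0x80 identifies them.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
    {
        if ((c & 0xC0) != 0x80)  // not a continuation byte: a new code point starts here
        {
            ++count;
        }
    }
    return count;
}
```

Note this counts code points, not glyphs: combining characters still count separately, which is usually what JSON Schema's maxLength/minLength expect anyway.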
From my point of view, I think this counting issue is out of scope of this library. Though a "count UTF-8 character" function is handy, I fear that it may bloat the API.
This library silently parses Unicode/UTF-8 strings into a std::string. Thus, one should never use size() or length() (== byte count) to check the string length, but a function similar to the one I'm using. Always.
A method (bool is_utf8()) could indicate whether this is a UTF-8-string or not. This information could then be used to check the size in a correct manner.
Maybe explaining it in the documentation is enough.
I don't quite understand: JSON is defined to use Unicode (though this library only supports UTF-8), so I would not know what to return for is_utf8 except true. I understand that you'd like either a proper character/glyph/whatever count (which std::string::size() cannot provide) or at least a bool contains_multibyte_encoded_codepoints() function.
Am I wrong?
I'd like to know which counting method I need to apply, based on what has been parsed into the std::string and how.
The UTF-8 counting method works, but has to live on the user side.
How do we prevent future users from falling into the same trap I did? How many users really need the true character count and are not aware of multibyte encoding problems?
std::string has no concept of encoding. You can put UTF-8, ISO 8859-1, UCS-2, UTF-32, or whatever you like into a std::string. You have to keep track of the encoding externally to the string (or, better, just assume UTF-8 everywhere and convert from/to it at the "boundaries" of your program). If your program has to handle data and doesn't know what the encoding is, there are algorithms that can try to guess, but they're not foolproof and you're in scary territory at that point. There are very few cases where you should be unsure of what you're getting -- a text editor or web browser is a good legitimate example, but there are many bad ones, and you should never guess without giving the user UI to confirm what you've done.
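To make that point concrete, here is a small illustration (the variable names are mine): std::string stores bytes, and size()/length() report the byte count regardless of what encoding those bytes happen to be in.

```cpp
#include <string>

// std::string is a byte container; size() counts bytes, not characters.
const std::string ascii_text = "hi";                // 2 bytes, 2 code points
const std::string utf8_pile  = "\xF0\x9F\x92\xA9";  // U+1F4A9 in UTF-8: 4 bytes, 1 code point
const std::string latin1_e   = "\xE9";              // 'é' in ISO 8859-1: 1 byte, and not valid UTF-8
```

All three are perfectly legal std::string values; nothing in the type tells you which encoding was intended.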
I don't think adding Unicode tools to a JSON library is helpful. It's a slippery slope (for example, counting code points can include or not include "combining" modifiers like ◌ͤ, and then there are surrogates, collation, normalization, locales, etc.). There are entire C++ libraries for Unicode handling because it's hard, and if you need them, you should use them -- even for "simple" UTF-8.
I agree with @jaredgrubb. All the library can do is to document that it in fact stores strings as UTF-8 and the user has an interface to the stored bytes as std::string. Anything beyond this (i.e., providing a string type with a nice Unicode-friendly interface) is out of scope of a JSON library.
Coming back to my original question: how do I know if a string was parsed as UTF-8? The answer is: you don't, but you should assume that within this library a std::string value is always multibyte-encoded and take the necessary precautions.
So it's a documentation issue?
I shall add notes to the documentation about the encoding of the stored strings.
📝 added documentation wrt. UTF-8 strings #406
Added a note to the readme and the string type (http://nlohmann.github.io/json/classnlohmann_1_1basic__json_ab63e618bbb0371042b1bec17f5891f42.html#ab63e618bbb0371042b1bec17f5891f42).