JSON Parser cannot parse escapes of German Umlaut vowels #3216
Comments
The JSON parser uses UTF-8 encoding (as is common practice), so any "umlauts" will be converted to their respective 2-byte UTF-8 sequence. If you need the strings in a different encoding (e.g. Windows-1252), you can use the Poco::TextConverter class to convert the string from UTF-8 to an 8-bit encoding.
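To illustrate what such a UTF-8 to 8-bit conversion does at the byte level, here is a minimal standard-library sketch. It is not Poco code: `utf8ToLatin1` is a hypothetical helper, and it assumes the target is ISO-8859-1, which agrees with Windows-1252 for the umlauts discussed here. It only handles code points up to U+00FF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Poco): converts UTF-8 text to ISO-8859-1.
// Only code points U+0000..U+00FF are representable; anything else throws.
std::string utf8ToLatin1(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80)
        {
            // ASCII: a single byte, copied unchanged.
            out.push_back(static_cast<char>(b0));
        }
        else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < utf8.size())
        {
            // Lead bytes 0xC2/0xC3 start the two-byte sequences for U+0080..U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(utf8[++i]);
            if ((b1 & 0xC0) != 0x80)
                throw std::invalid_argument("malformed UTF-8 continuation byte");
            // 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy
            out.push_back(static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
        }
        else
        {
            throw std::invalid_argument("code point above U+00FF or truncated input");
        }
    }
    return out;
}
```

With this sketch, the two bytes 0xC3 0xBC (UTF-8 for 'ü') collapse to the single byte 0xFC, which is the value an 8-bit consumer expects.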
Thanks for the quick response. OK, but the text conversion can only be done after parsing has been completed on single JSON items... Is there a chance to do the conversion directly on the complete parsed JSON object ptr or to provide the JSON parser directly with the required encoding (e.g. Windows-1252)?
No, that's not possible.
Would it be possible to get this feature in the next release of the Poco library? It would make the JSON parsing so much easier.
This issue is stale because it has been open for 365 days with no activity.
This issue was closed because it has been inactive for 60 days since being marked as stale.
My application receives JSON objects that may contain the escape codes of German Umlaut vowels (Ä,Ö,Ü,ä,ö,ü), as shown in the example below.
Unfortunately, the German Umlaut vowels like 'ü' (\u00FC) will be incorrectly parsed by the JSON parser.
It will be parsed into two characters (Ã¼) and not into the Umlaut vowel (ü).
Looking into the UTF8Encoding::convert function (UTF8Encoding.cpp), the function only covers the "Basic Latin" characters (U+0000 ... U+007F),
but not the UTF-8 supplements, like the "Latin-1 Supplement" (U+0080 ... U+00FF) that contains the German Umlaut vowels, see below.
Escape characters within the "Latin-1 Supplement" range will be converted into two characters.
Is there a chance to get these supplement escapes parsed correctly?
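For reference, a minimal sketch of how a single `\uXXXX` escape maps to UTF-8 bytes, written against the standard library only. `encodeUtf8` is a hypothetical helper, not the Poco implementation, and surrogate pairs are left out. It shows that a code point in the Latin-1 Supplement range always occupies two bytes in UTF-8, so a two-byte result for `\u00FC` is what a UTF-8 encoder is expected to produce.

```cpp
#include <string>

// Hypothetical helper (not Poco's UTF8Encoding): UTF-8 encodes one code point
// from the Basic Multilingual Plane, i.e. the value of a single \uXXXX escape.
// Surrogate pairs and code points above U+FFFF are out of scope for this sketch.
std::string encodeUtf8(unsigned int cp)
{
    std::string out;
    if (cp < 0x80)
    {
        // U+0000..U+007F: one byte, identical to ASCII.
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        // U+0080..U+07FF: two bytes, 110xxxxx 10yyyyyy.
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        // U+0800..U+FFFF: three bytes, 1110xxxx 10yyyyyy 10zzzzzz.
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Here `encodeUtf8(0xFC)` yields the bytes 0xC3 0xBC. A viewer that treats the string as Windows-1252 shows those two bytes as two separate characters, while a UTF-8 aware viewer shows them as 'ü'.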