JSON Parser cannot parse escapes of German Umlaut vowels #3216

stephanger · 2021-03-11T10:49:43Z

My application receives JSON objects that may contain the escape codes of German Umlaut vowels (Ä,Ö,Ü,ä,ö,ü), as shown in the example below.

// JSON example
std::string strJSON = "{\"Area\":\"Bereich1\",\"Section\":\"Abluft\\u00FCberwachung\"}";

Poco::JSON::Parser parser;
Poco::Dynamic::Var result = parser.parse(strJSON);
Poco::JSON::Object::Ptr pObject = result.extract<Poco::JSON::Object::Ptr>();
Poco::JSON::Object jsonResponse = *pObject;

std::string strArea = jsonResponse.get("Area");		// parsed: Bereich1
std::string strSection = jsonResponse.get("Section");	// parsed: AbluftÃ¼berwachung

Unfortunately, the JSON parser does not parse German Umlaut vowels such as 'ü' (\u00FC) the way I expect.
The escape is parsed into two characters (Ã¼) instead of the single Umlaut vowel (ü).

Looking into the UTF8Encoding::convert function (UTF8Encoding.cpp), the function only covers the "Basic Latin" characters (U+0000 ... U+007F),
but not the UTF-8 supplements, like the "Latin-1 Supplement" (U+0080 ... U+00FF) that contains the German Umlaut vowels, see below.

int UTF8Encoding::convert(int ch, unsigned char* bytes, int length) const
{
	if (ch <= 0x7F)
	{
		if (bytes && length >= 1)
			*bytes = (unsigned char) ch;
		return 1;
	}
	else if (ch <= 0x7FF)
	{
		if (bytes && length >= 2)
		{
			*bytes++ = (unsigned char) (((ch >> 6) & 0x1F) | 0xC0);
			*bytes   = (unsigned char) ((ch & 0x3F) | 0x80);
		}
		return 2;
	}
	else if (ch <= 0xFFFF)
	{
		if (bytes && length >= 3)
		{
			*bytes++ = (unsigned char) (((ch >> 12) & 0x0F) | 0xE0);
			*bytes++ = (unsigned char) (((ch >> 6) & 0x3F) | 0x80);
			*bytes   = (unsigned char) ((ch & 0x3F) | 0x80);
		}
		return 3;
	}
	else if (ch <= 0x10FFFF)
	{
		if (bytes && length >= 4)
		{
			*bytes++ = (unsigned char) (((ch >> 18) & 0x07) | 0xF0);
			*bytes++ = (unsigned char) (((ch >> 12) & 0x3F) | 0x80);
			*bytes++ = (unsigned char) (((ch >> 6) & 0x3F) | 0x80);
			*bytes   = (unsigned char) ((ch & 0x3F) | 0x80);
		}
		return 4;
	}
	else return 0;
}

Escaped characters within the "Latin-1 Supplement" range are converted into two characters.
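To make the two-character result concrete, here is a minimal standalone sketch (not Poco code; `encodeTwoByte` is a hypothetical helper written for this illustration) that applies the same bit operations as the `ch <= 0x7FF` branch above. For U+00FC it yields the bytes 0xC3 0xBC, which show up as "Ã¼" when the string is displayed as Windows-1252 or Latin-1.

```cpp
#include <string>

// Hypothetical helper, not part of Poco: mirrors the two-byte branch
// (0x80 <= ch <= 0x7FF) of the convert function quoted above.
std::string encodeTwoByte(int ch)
{
	std::string out;
	// Lead byte: 110xxxxx, carrying the upper 5 bits of the code point.
	out += static_cast<char>(((ch >> 6) & 0x1F) | 0xC0);
	// Continuation byte: 10yyyyyy, carrying the lower 6 bits.
	out += static_cast<char>((ch & 0x3F) | 0x80);
	return out;
}
```

Calling `encodeTwoByte(0x00FC)` returns a two-byte string; read as Windows-1252, byte 0xC3 is 'Ã' and byte 0xBC is '¼'.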

Is there a chance to get these supplement escapes parsed correctly?


obiltschnig · 2021-03-11T11:02:49Z

The JSON parser uses UTF-8 encoding (as is common practice), so any "umlauts" will be converted to their respective 2-byte UTF-8 sequence. If you need the strings in a different encoding (e.g. Windows-1252), you can use the Poco::TextConverter class to convert the string from UTF-8 to an 8-bit encoding.
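As a rough illustration of what such a conversion does, here is a standalone sketch that uses only the standard library. `utf8ToLatin1` is a hypothetical helper, not a Poco function, and it assumes the target is ISO-8859-1 (Latin-1), which places the German umlauts at the same byte values as Windows-1252. Anything outside U+0000 ... U+00FF becomes '?'. In a Poco application, Poco::TextConverter together with Poco::UTF8Encoding and Poco::Windows1252Encoding would be the natural choice instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Poco: converts UTF-8 to ISO-8859-1 (Latin-1).
// Code points above U+00FF and malformed sequences are replaced by '?'.
std::string utf8ToLatin1(const std::string& utf8)
{
	std::string out;
	for (std::size_t i = 0; i < utf8.size(); )
	{
		const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
		if (b0 < 0x80)
		{
			// Single-byte sequence: ASCII, copied unchanged.
			out += static_cast<char>(b0);
			i += 1;
		}
		else if ((b0 & 0xE0) == 0xC0 && i + 1 < utf8.size()
			&& (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80)
		{
			// Two-byte sequence: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
			const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
			const int cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
			// Only U+0080 ... U+00FF fit into one Latin-1 byte.
			out += (cp >= 0x80 && cp <= 0xFF) ? static_cast<char>(cp) : '?';
			i += 2;
		}
		else
		{
			// Longer or malformed sequence: emit one '?' and skip its bytes.
			out += '?';
			i += 1;
			while (i < utf8.size()
				&& (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
				++i;
		}
	}
	return out;
}
```

Applied to the parsed "Section" value from the report, the UTF-8 bytes 0xC3 0xBC collapse into the single byte 0xFC, which an 8-bit Latin-1 or Windows-1252 display renders as 'ü'.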

stephanger · 2021-03-11T11:21:13Z

Thanks for the quick response.

OK, but the text conversion can only be done after parsing has completed, one JSON item at a time...

Is there a way to run the conversion directly on the complete parsed JSON object ptr, or to give the JSON parser the required encoding (e.g. Windows-1252) up front?

obiltschnig · 2021-03-11T11:32:30Z

No, that's not possible.

stephanger · 2021-03-11T11:44:23Z

Would it be possible to get this feature in the next release of the Poco library?

It would make the JSON parsing so much easier.

github-actions · 2022-03-12T02:42:17Z

This issue is stale because it has been open for 365 days with no activity.

github-actions · 2022-05-11T03:06:53Z

This issue was closed because it has been inactive for 60 days since being marked as stale.

github-actions bot added the stale label Mar 12, 2022

github-actions bot closed this as completed May 11, 2022
