JSON Parser cannot parse escapes of German Umlaut vowels #3216

Closed
stephanger opened this issue Mar 11, 2021 · 6 comments

@stephanger

My application receives JSON objects that may contain the escape codes of German Umlaut vowels (Ä,Ö,Ü,ä,ö,ü), as shown in the example below.

// JSON example
std::string strJSON = "{\"Area\":\"Bereich1\",\"Section\":\"Abluft\\u00FCberwachung\"}";

Poco::JSON::Parser parser;
Poco::Dynamic::Var result = parser.parse(strJSON);
Poco::JSON::Object::Ptr pObject = result.extract<Poco::JSON::Object::Ptr>();
Poco::JSON::Object jsonResponse = *pObject;

std::string strArea = jsonResponse.get("Area");		// parsed: Bereich1
std::string strSection = jsonResponse.get("Section");	// parsed: AbluftÃ¼berwachung (UTF-8 bytes C3 BC viewed as Windows-1252)

Unfortunately, German Umlaut vowels like 'ü' (\u00FC) are parsed incorrectly by the JSON parser.
The escape is parsed into two characters (Ã¼) and not into the single Umlaut vowel (ü).

Looking into the UTF8Encoding::convert function (UTF8Encoding.cpp), the function only covers the "Basic Latin" characters (U+0000 ... U+007F),
but not the UTF-8 supplements, like the "Latin-1 Supplement" (U+0080 ... U+00FF) that contains the German Umlaut vowels, see below.

int UTF8Encoding::convert(int ch, unsigned char* bytes, int length) const
{
	if (ch <= 0x7F)
	{
		if (bytes && length >= 1)
			*bytes = (unsigned char) ch;
		return 1;
	}
	else if (ch <= 0x7FF)
	{
		if (bytes && length >= 2)
		{
			*bytes++ = (unsigned char) (((ch >> 6) & 0x1F) | 0xC0);
			*bytes   = (unsigned char) ((ch & 0x3F) | 0x80);
		}
		return 2;
	}
	else if (ch <= 0xFFFF)
	{
		if (bytes && length >= 3)
		{
			*bytes++ = (unsigned char) (((ch >> 12) & 0x0F) | 0xE0);
			*bytes++ = (unsigned char) (((ch >> 6) & 0x3F) | 0x80);
			*bytes   = (unsigned char) ((ch & 0x3F) | 0x80);
		}
		return 3;
	}
	else if (ch <= 0x10FFFF)
	{
		if (bytes && length >= 4)
		{
			*bytes++ = (unsigned char) (((ch >> 18) & 0x07) | 0xF0);
			*bytes++ = (unsigned char) (((ch >> 12) & 0x3F) | 0x80);
			*bytes++ = (unsigned char) (((ch >> 6) & 0x3F) | 0x80);
			*bytes   = (unsigned char) ((ch & 0x3F) | 0x80);
		}
		return 4;
	}
	else return 0;
}

Escapes within the "Latin-1 Supplement" range take the `ch <= 0x7FF` branch and are converted into two bytes, which my application shows as two characters.
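To make the arithmetic concrete, here is a minimal sketch of the two-byte branch of the function quoted above, applied to U+00FC. The helper name `encodeTwoByteUtf8` is made up for this illustration and is not part of Poco.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Poco function): mirrors the `ch <= 0x7FF`
// branch of the quoted UTF8Encoding::convert for a single code point.
std::string encodeTwoByteUtf8(int ch)
{
    // This branch applies only to code points U+0080 ... U+07FF.
    assert(ch >= 0x80 && ch <= 0x7FF);

    std::string out;
    out += static_cast<char>(((ch >> 6) & 0x1F) | 0xC0); // lead byte: 110xxxxx
    out += static_cast<char>((ch & 0x3F) | 0x80);        // continuation byte: 10xxxxxx
    return out;
}
```

For `encodeTwoByteUtf8(0x00FC)` the lead byte is `(0xFC >> 6) | 0xC0 = 0xC3` and the continuation byte is `(0xFC & 0x3F) | 0x80 = 0xBC`. The pair C3 BC is the regular UTF-8 encoding of 'ü'; the same two bytes read as Windows-1252 are 'Ã' and '¼'.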

Is there a chance to get these supplement escapes parsed correctly?

@obiltschnig
Member

The JSON parser uses UTF-8 encoding (as is common practice), so any "umlauts" will be converted to their respective 2-byte UTF-8 sequence. If you need the strings in a different encoding (e.g. Windows-1252), you can use the Poco::TextConverter class to convert the string from UTF-8 to an 8-bit encoding.
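As a rough illustration of what such a UTF-8 to 8-bit conversion does to a parsed string, the sketch below uses plain standard C++. The function `utf8ToLatin1` is a hypothetical stand-in, not Poco code and not how Poco::TextConverter is implemented. It targets ISO-8859-1, which coincides with Windows-1252 for the umlauts (U+00A0 ... U+00FF) but not for the range 0x80 ... 0x9F.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Poco): converts a UTF-8 string to
// ISO-8859-1. Only code points up to U+00FF can be represented.
std::string utf8ToLatin1(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 <= 0x7F)
        {
            // Basic Latin: one byte in both encodings.
            out += static_cast<char>(b0);
        }
        else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < utf8.size())
        {
            // Lead bytes C2 and C3 cover U+0080 ... U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(utf8[++i]);
            if ((b1 & 0xC0) != 0x80)
                throw std::invalid_argument("malformed UTF-8 continuation byte");
            out += static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
        }
        else
        {
            throw std::invalid_argument("code point outside Latin-1 or malformed UTF-8");
        }
    }
    return out;
}
```

Applied to the parsed "Section" value, the two bytes C3 BC collapse into the single byte FC, so an application that displays 8-bit text shows 'ü' again. In a Poco application, the Poco::TextConverter class mentioned above would be the natural choice instead of a hand-written loop.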

@stephanger
Author

Thanks for the quick response.

OK, but that text conversion can only be done after parsing has completed, and only on individual JSON items...

Is there a way to run the conversion directly on the complete parsed JSON object ptr, or to tell the JSON parser which encoding is required (e.g. Windows-1252)?

@obiltschnig
Member

No, that's not possible.

@stephanger
Author

Would it be possible to get this feature in the next release of the Poco library?

It would make the JSON parsing so much easier.

@github-actions

This issue is stale because it has been open for 365 days with no activity.

github-actions bot added the stale label Mar 12, 2022
@github-actions

This issue was closed because it has been inactive for 60 days since being marked as stale.
