I expect to see a � character (U+FFFD), since the value is outside the US-ASCII range (its most significant bit is 1).
What did you see instead?
I see the character € from the Windows-1252 encoding instead. This happens because Go reuses Windows-1252 for US-ASCII. Similar issues arise for out-of-bounds characters in other character sets; for example, tis-620 maps to windows874. To parse the text correctly, I now have to read through the decoded runes and test whether any of them are out of bounds. If I want to use windows874 for only the tis-620 characters, I have to do a similar manual exclusion of out-of-bounds characters. I do not know of a way to define my own character set so that these problems can be avoided.
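For illustration, the behaviour I expected for US-ASCII can be sketched with the standard library alone. This is only a rough sketch of a workaround, not how any Go package implements it; `decodeStrictASCII` is a hypothetical helper name I made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeStrictASCII (hypothetical helper) treats only bytes 0x00-0x7F as
// valid US-ASCII. Every byte with the most significant bit set is replaced
// by U+FFFD instead of being interpreted as Windows-1252.
func decodeStrictASCII(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for _, c := range b {
		if c < utf8.RuneSelf { // utf8.RuneSelf is 0x80
			sb.WriteByte(c)
		} else {
			sb.WriteRune(utf8.RuneError) // utf8.RuneError is U+FFFD
		}
	}
	return sb.String()
}

func main() {
	// 0x80 is the byte that a Windows-1252 decoder turns into €.
	fmt.Printf("%+q\n", decodeStrictASCII([]byte{'a', 0x80, 'b'}))
}
```

Working on the raw bytes before any decoder runs avoids having to guess afterwards which runes came from out-of-range input.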
Other languages, such as Java, allow users to differentiate between these character sets. For decoding legacy text, it is not ideal to use a superset character set like windows1252: if the user's text is invalid, characters that cannot be expressed in the subset character set can end up in the result. Developers then have to implement workarounds to make sure such invalid text is not contained in the result.
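As a rough sketch of the kind of workaround meant here, the decoded string can be filtered after the fact so that anything outside the intended subset becomes U+FFFD. The helper names `inTIS620` and `restrictToSubset` are hypothetical, and the rune ranges are my assumption about the tis-620 repertoire (US-ASCII plus U+0E01..U+0E3A and U+0E3F..U+0E5B); please check them against the standard before relying on them.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// inTIS620 (hypothetical helper) reports whether r is expressible in tis-620.
// Assumption: the repertoire is US-ASCII plus the Thai ranges below.
func inTIS620(r rune) bool {
	switch {
	case r < utf8.RuneSelf: // US-ASCII
		return true
	case r >= 0x0E01 && r <= 0x0E3A: // Thai consonants, vowels, phinthu
		return true
	case r >= 0x0E3F && r <= 0x0E5B: // baht sign through khomut
		return true
	}
	return false
}

// restrictToSubset (hypothetical helper) replaces every rune that the
// allowed predicate rejects with U+FFFD.
func restrictToSubset(s string, allowed func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if allowed(r) {
			return r
		}
		return utf8.RuneError
	}, s)
}

func main() {
	// Pretend this came out of a windows874 decoder: the € is valid in
	// windows874 but not expressible in tis-620.
	decoded := "\u0e01\u20ac\u0e02"
	fmt.Printf("%+q\n", restrictToSubset(decoded, inTIS620))
}
```

The drawback is that every caller has to remember to apply the filter, which is why a separate, stricter character set would be preferable.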