Strings are not required to be UTF-8. Go source code is required to be UTF-8. There is a complex path between the two.
In short, there are three kinds of strings. They are:
- the substring of the source that lexes into a string literal.
- a string literal.
- a value of type string.
Only the first is required to be UTF-8. The second is required to be written in UTF-8, but its contents are interpreted in various ways and may encode arbitrary bytes. The third can contain any bytes at all.
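As a rough illustration of the second and third points, here is a minimal sketch. The names `escaped`, `fromBytes`, and `isUTF8` are made up for this example; only `utf8.ValidString` comes from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// escaped is written in valid UTF-8 source text, yet the \xFF escape
// places a raw 0xFF byte in the resulting string value.
const escaped = "\xFF"

// fromBytes (hypothetical helper) builds a string value from arbitrary
// bytes at run time; the conversion copies the bytes without validating them.
func fromBytes(b []byte) string {
	return string(b)
}

// isUTF8 (hypothetical helper) reports whether s holds only valid UTF-8.
func isUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	raw := fromBytes([]byte{0xC0, 0x80, 0xFF})
	fmt.Println(len(escaped), isUTF8(escaped)) // 1 false
	fmt.Println(len(raw), isUTF8(raw))         // 3 false
	fmt.Println(isUTF8("語"))                   // true
}
```

Neither the compiler nor the conversion rejects the invalid bytes; validity is only checked if the program asks for it.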
Try this on:
var s string = "\xFF語"
The source substring: "\xFF語", UTF-8 encoded. The data:
22 5c 78 46 46 e8 aa 9e 22
The string literal: \xFF語 (between the quotes). The data:
5c 78 46 46 e8 aa 9e
The string value (unprintable here, because it is not valid UTF-8 and this text is a UTF-8 stream). The data:
ff e8 aa 9e
And for the record, the characters (code points):
<erroneous byte FF, will appear as U+FFFD if you range over the string value>
語 U+8A9E
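The three byte sequences and the code points can be reproduced with a short program. This is only a sketch: `hexDump` and `codePoints` are hypothetical helper names, and the raw (backquoted) literals stand in for the source text, since a raw literal keeps its characters uninterpreted.

```go
package main

import "fmt"

// hexDump (hypothetical helper) formats the bytes of s as
// space-separated hex pairs.
func hexDump(s string) string {
	return fmt.Sprintf("% x", s)
}

// codePoints (hypothetical helper) ranges over s and collects the runes
// it yields. Each invalid byte comes out as U+FFFD.
func codePoints(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	src := `"\xFF語"` // raw literal: the source substring, quotes included
	lit := `\xFF語`   // raw literal: the text between the quotes
	val := "\xFF語"   // interpreted literal: the string value

	fmt.Println(hexDump(src))           // 22 5c 78 46 46 e8 aa 9e 22
	fmt.Println(hexDump(lit))           // 5c 78 46 46 e8 aa 9e
	fmt.Println(hexDump(val))           // ff e8 aa 9e
	fmt.Printf("%U\n", codePoints(val)) // [U+FFFD U+8A9E]
}
```

Note that `len(val)` is 4, the number of bytes, while ranging over it yields two runes.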