go/parser: reject files with BOMs not at the beginning. #5265
$ go fmt x.go #attached
You see no error.

$ go build x.go
You get an error:
./x.go:4: Unicode (UTF-8) BOM in middle of file

The error is correct: this is an illegal Go source file. I suspect the parser isn't rejecting BOMs properly. They are allowed only as the first code point in a source file. It's a minor point but consistency among tools would be good.
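To make the rule concrete, here is a minimal sketch of the check being asked for: a BOM is tolerated only as the very first code point, and any later occurrence is reported. The helper name `misplacedBOM` is hypothetical and is not taken from go/parser or the compiler; it only illustrates the rule on raw bytes.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// misplacedBOM (hypothetical helper) returns the byte offset of the first
// BOM that is not at the very start of src, or -1 if there is none.
func misplacedBOM(src []byte) int {
	skip := 0
	if bytes.HasPrefix(src, utf8BOM) {
		// A single leading BOM is allowed, so start searching after it.
		skip = len(utf8BOM)
	}
	if i := bytes.Index(src[skip:], utf8BOM); i >= 0 {
		return skip + i
	}
	return -1
}

func main() {
	ok := []byte("\ufeffpackage main\n")
	bad := []byte("package main\n\ufeffvar x int\n")
	fmt.Println(misplacedBOM(ok))  // -1: the BOM is the first code point
	fmt.Println(misplacedBOM(bad)) // 13: the BOM follows "package main\n"
}
```

A tool that wants to agree with the compiler's behaviour described above would treat any non-negative result as an error.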
The Go Programming Language Specification
Version of September 4, 2012
http://golang.org/ref/spec

Source code representation

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.

Each code point is distinct; for instance, upper and lower case letters are different characters.

Implementation restriction: For compatibility with other tools, a compiler may disallow the NUL character (U+0000) in the source text.

The Unicode Standard, Version 6.2
Chapter 3 Conformance
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D95 When represented in UTF-8, the byte order mark [U+FEFF] turns into the byte sequence <EF BB BF>.

D89 In a Unicode encoding form: A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form.
• A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8 string for short.

D92
• Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed.

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8.

Table 3-7. Well-Formed UTF-8 Byte Sequences [in pertinent part]

Code Points       First Byte   Second Byte   Third Byte   Fourth Byte
U+E000..U+FFFF    EE..EF       80..BF        80..BF

The Unicode specification defines UTF-8. It looks to me as if the UTF-8 byte sequence <EF BB BF>, for the BOM U+FEFF code point, is defined by Unicode as a well-formed sequence of UTF-8 bytes. Therefore, I'm surprised that Go does not accept it. Are there any other well-formed sequences of UTF-8 bytes that Go does not accept, apart from the NUL character?
Does this break the Go 1 guarantee that "Source code is Unicode text encoded in UTF-8.", except that "a compiler may disallow the NUL character (U+0000)"?
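The well-formedness claim about <EF BB BF> can be checked with the standard `unicode/utf8` package. The sketch below only tests encoding-level validity; it says nothing about whether a given tool accepts the code point at a particular position in a source file. The function name `decodeBOM` is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeBOM (hypothetical helper) decodes the byte sequence EF BB BF and
// reports the resulting code point, its encoded length in bytes, and
// whether the sequence is valid UTF-8.
func decodeBOM() (rune, int, bool) {
	b := []byte{0xEF, 0xBB, 0xBF}
	r, size := utf8.DecodeRune(b)
	return r, size, utf8.Valid(b)
}

func main() {
	r, size, valid := decodeBOM()
	fmt.Printf("%U %d %v\n", r, size, valid) // U+FEFF 3 true
}
```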