go/parser: reject files with BOMs not at the beginning. #5265

Closed
robpike opened this issue Apr 10, 2013 · 2 comments

@robpike robpike commented Apr 10, 2013

$ go fmt x.go #attached

You see no error.

$ go build x.go

You get an error:
    ./x.go:4: Unicode (UTF-8) BOM in middle of file
The error is correct: this is an illegal Go source file. I suspect the parser isn't
rejecting BOMs properly. They are allowed only as the first code point in a source file.
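
The rule stated above (a BOM is allowed only as the first code point) can be sketched as a small standalone check. This is an illustration only, not the code used by go/parser or the compiler; `misplacedBOM` and `utf8BOM` are hypothetical names introduced for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF (hypothetical name for this sketch).
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// misplacedBOM returns the byte offset of the first BOM that is not at the
// very start of src, or -1 if there is none. A single leading BOM is skipped
// because it is the only position where a BOM is allowed.
func misplacedBOM(src []byte) int {
	start := 0
	if bytes.HasPrefix(src, utf8BOM) {
		start = len(utf8BOM)
	}
	if i := bytes.Index(src[start:], utf8BOM); i >= 0 {
		return start + i
	}
	return -1
}

func main() {
	ok := []byte("\uFEFFpackage main\n")
	bad := []byte("package main\n\uFEFFfunc main() {}\n")
	fmt.Println(misplacedBOM(ok))  // -1
	fmt.Println(misplacedBOM(bad)) // 13
}
```

The sketch searches raw bytes rather than decoding runes; for well-formed UTF-8 the two approaches agree, since EF can only appear as a lead byte.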

It's a minor point but consistency among tools would be good.
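
One way to see what go/parser does with such a file, independently of gofmt, is to call it directly. The sketch below assumes made-up source strings (the attached x.go is not reproduced here), and `parseErr` is a hypothetical helper. On the toolchain described in this report the mid-file case presumably parsed without error; other Go releases may behave differently, so no output is shown.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

// parseErr parses src as a Go source file and returns whatever error
// go/parser reports (nil if the file is accepted).
func parseErr(src string) error {
	fset := token.NewFileSet()
	_, err := parser.ParseFile(fset, "x.go", src, 0)
	return err
}

func main() {
	// Illustrative inputs only; these are not the contents of the attachment.
	leading := "\uFEFFpackage main\n\nfunc main() {}\n"
	middle := "package main\n\n\uFEFFfunc main() {}\n"

	fmt.Println("leading BOM: ", parseErr(leading))
	fmt.Println("mid-file BOM:", parseErr(middle))
}
```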

Attachments:

  1. x.go (47 bytes)

@peterGo peterGo commented Apr 11, 2013

Comment 1:

The Go Programming Language Specification
Version of September 4, 2012
http://golang.org/ref/spec
Source code representation
Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single
accented code point is distinct from the same character constructed from combining an
accent and a letter; those are treated as two code points. For simplicity, this document
will use the unqualified term character to refer to a Unicode code point in the source
text.
Each code point is distinct; for instance, upper and lower case letters are different
characters.
Implementation restriction: For compatibility with other tools, a compiler may disallow
the NUL character (U+0000) in the source text. 
The Unicode Standard, Version 6.2
Chapter 3 Conformance
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
D95
When represented in UTF-8, the byte order mark [U+FEFF] turns into the byte sequence
<EF BB BF>.
D89
In a Unicode encoding form: A Unicode string is said to be in a particular Unicode
encoding form if and only if it consists of a well-formed Unicode code unit sequence of
that Unicode encoding form.
• A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be
in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8
string for short.
D92
• Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is
ill-formed.
Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. 
Table 3-7. Well-Formed UTF-8 Byte Sequences [in pertinent part]
Code Points        First Byte Second Byte Third Byte Fourth Byte
U+E000..U+FFFF     EE..EF     80..BF      80..BF 
The Unicode specification defines UTF-8. It looks to me as if the UTF-8 byte sequence
<EF BB BF>, for the BOM U+FEFF code point, is defined by Unicode as a well-formed
sequence of UTF-8 bytes. Therefore, I'm surprised that Go does not accept it. Are there
any other well-formed sequences of UTF-8 bytes that Go does not accept, apart from the NUL
character?
Does this break the Go 1 guarantee that "Source code is Unicode text encoded in UTF-8.",
except that "a compiler may disallow the NUL character (U+0000)"?

@robpike robpike commented Apr 11, 2013

Comment 2:

Clarification of the spec in https://golang.org/cl/8649043
This isn't really a bug, just an inconsistency, since the property falls under the
'implementation restriction' clause. Retracting the issue.

Status changed to Retracted.

@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.