Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: NFD-normalized unicode identifiers result in malformed error messages #33271

Closed
jaddr2line opened this issue Jul 25, 2019 · 7 comments
Closed
Labels
Milestone

Comments

@jaddr2line
Copy link

@jaddr2line jaddr2line commented Jul 25, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.7 linux/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jadenw/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/jadenw/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/go"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build217674304=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I put in an NFD-normalized unicode identifier.

Smallest reproducible playground: https://play.golang.org/p/wf6_glAAIRP

Internally, the ä is encoded as two unicode code points rather than one. If it is NFC normalized (one code point), it works. The spec is fairly vague as to how it should handle NFD unicode.

What did you expect to see?

Either:

  1. it compiles
  2. a properly formatted error message saying why it does not work

What did you see instead?

prog.go:3:7: illegal character U+0308 '̈' (and 1 more errors)

On some text renderers, the combining mark actually merges into the quote at the end of the error message.

Solutions

First, the spec should probably be clarified as to how it processes multi-codepoint letters.
Additionally, combining marks should probably not be written into the error message in raw form.

@robpike

This comment has been minimized.

Copy link
Contributor

@robpike robpike commented Jul 25, 2019

I think the spec is pretty clear.

identifier = letter { letter | unicode_digit } .

and

letter        = unicode_letter | "_" .

and

unicode_letter = /* a Unicode code point classified as "Letter" */ .

What is vague?

There are outstanding issues to broaden the definition of identifiers, but as it stands the spec quite clearly does not include combining characters.

@jaddr2line

This comment has been minimized.

Copy link
Author

@jaddr2line jaddr2line commented Jul 25, 2019

Ok. Yeah I misread that originally, missing the definition of unicode_letter. Then really the error message should just be fixed so it doesn't combine, and that would be it.

@jaddr2line jaddr2line changed the title cmd/compile: NFD-normalized unicode identifiers cmd/compile: NFD-normalized unicode identifiers result in malformed error messages Jul 25, 2019
@zerkms

This comment has been minimized.

Copy link

@zerkms zerkms commented Jul 25, 2019

Then really the error message should just be fixed so it doesn't combine, and that would be it.

I'm not following why just the error message should be fixed: why that code is illegal?

The spec probably should switch from using the unicode letter term to unicode character

@ALTree

This comment has been minimized.

Copy link
Member

@ALTree ALTree commented Jul 25, 2019

The spec probably should switch from using the unicode letter term to unicode character

@zerkms For combining, that's #20706, which is being considered for Go2.

@jaddr2line

This comment has been minimized.

Copy link
Author

@jaddr2line jaddr2line commented Jul 26, 2019

For reference, this issue was not a feature request for that. When I originally did this what I considered a bug was the error message not properly protecting the use of the combining mark - such that it composes with the quotes on compatible text renders, as seen at the end of this error (rendered in atom):
image
As can be seen in the screenshot, a combining mark combined with the second single-quote in the error message.

The reason I had referenced the spec was I was checking what the appropriate definition and missed the unicode_letter definition.

As of now, I am not very well acquainted with go's parser, and have not yet pinned down the creation point of this error. Later (probably after gophercon) I can try to read through and make a fix.

In order to see how other systems handled this, I looked at the Wikipedia page on combining characters. Wikipedia prepends a dotted circle (U+25CC) to the combining mark in order to handle the combination. This also indicates visually that it is meant to be a combining mark, as the dotted circle can be used to represent the absence of a base character. Would combining the mark onto the dotted circle be an appropriate way to render the error? The error from the beginning of the thread would render as:

prog.go:3:7: illegal character U+0308 '◌̈' (and 1 more errors)
@ALTree ALTree added this to the Unplanned milestone Jul 26, 2019
@ALTree

This comment has been minimized.

Copy link
Member

@ALTree ALTree commented Jul 26, 2019

@jadr2ddude Thanks for clarifying. What you are asking makes sense, but I'm not sure "combining characters in identifiers" is a common enough mistake to warrant special handling in the displaying of error messages.

Moreover, it seems that applying the combining character to a is not enough to ensure the error message is correctly displayed everywhere; for example your message above, on my machine, is rendered as:

combining

Also cc @griesemer

@jaddr2line

This comment has been minimized.

Copy link
Author

@jaddr2line jaddr2line commented Jul 27, 2019

Eh, this isnt too critical and the problem might not be worth the fix.

@jaddr2line jaddr2line closed this Jul 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.