New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/scanner: Scanner.Scan sometimes returns token.ILLEGAL and empty literal string #28112

Closed
dmitshur opened this Issue Oct 10, 2018 · 9 comments

Comments

Projects
None yet
4 participants
@dmitshur
Member

dmitshur commented Oct 10, 2018

What version of Go are you using (go version)?

$ go version
go version devel +c870d56f98 Wed Oct 10 02:08:36 2018 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes it does, with Go 1.11.1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dmitri/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/dmitri/Dropbox/Work/2013/GoLanding:/Users/dmitri/Dropbox/Work/2013/GoLand"
GOPROXY=""
GORACE=""
GOROOT="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/b8/66r1c5856mqds1mrf2tjtq8w0000gn/T/go-build878012089=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I tried scanning over the following minified Go code containing illegal tokens:

a..b

With the following program:

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	// src is the input that we want to tokenize.
	src := []byte("a..b")

	// Initialize the scanner.
	var s scanner.Scanner
	fset := token.NewFileSet()                      // positions are relative to fset
	file := fset.AddFile("", fset.Base(), len(src)) // register input "file"
	s.Init(file, src, nil /* no error handler */, scanner.ScanComments)

	// Repeated calls to Scan yield the token sequence found in the input.
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}
}

(Playground: https://play.golang.org/p/No77Yx0D4Ki.)

What did you expect to see?

go/scanner.Scanner.Scan documentation says:

If the returned token is token.ILLEGAL, the literal string is the offending character.

In all other cases, Scan returns an empty literal string.

Perhaps I'm misunderstanding the documentation, but I expected that if returned token is token.ILLEGAL, then the literal string should not be empty. I expected to see the literal string contain the offending character or characters.

1:1	IDENT	"a"
1:2	ILLEGAL	".."
1:4	IDENT	"b"
1:5	;	"\n"

What did you see instead?

A token.ILLEGAL token and empty literal string:

1:1	IDENT	"a"
1:2	ILLEGAL	""
1:4	IDENT	"b"
1:5	;	"\n"

/cc @griesemer

@dmitshur dmitshur added this to the Go1.12 milestone Oct 10, 2018

@cznic

This comment has been minimized.

Show comment
Hide comment
@cznic

cznic Oct 10, 2018

Contributor

The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.

Contributor

cznic commented Oct 10, 2018

The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.

@griesemer

This comment has been minimized.

Show comment
Hide comment
@griesemer

griesemer Oct 10, 2018

Contributor

@cznic The offending character here is the 'b' - it if were a '.' it would be ok. This is clearly a bug in the scanner.

Contributor

griesemer commented Oct 10, 2018

@cznic The offending character here is the 'b' - it if were a '.' it would be ok. This is clearly a bug in the scanner.

@cznic

This comment has been minimized.

Show comment
Hide comment
@cznic

cznic Oct 10, 2018

Contributor

Heh, used a .. b in my test by mistake. Still, I'm not sure if the offending character, when ir does not start a token, should be consumed as it may start a valid token.

Contributor

cznic commented Oct 10, 2018

Heh, used a .. b in my test by mistake. Still, I'm not sure if the offending character, when ir does not start a token, should be consumed as it may start a valid token.

@griesemer

This comment has been minimized.

Show comment
Hide comment
@griesemer

griesemer Oct 10, 2018

Contributor

@cznic My apologies - I concluded a bit too quickly. It's not clear here what exactly the offending character should be. Your suggestion of an empty character might be just as valid. It's certainly problematic that the 'if' starting on scanner.go:740 has not 'else' branch and thus silently exits the big 'switch' statement. Because the result 'tok' is not set at all, it returns the initial (invalid) token.

Contributor

griesemer commented Oct 10, 2018

@cznic My apologies - I concluded a bit too quickly. It's not clear here what exactly the offending character should be. Your suggestion of an empty character might be just as valid. It's certainly problematic that the 'if' starting on scanner.go:740 has not 'else' branch and thus silently exits the big 'switch' statement. Because the result 'tok' is not set at all, it returns the initial (invalid) token.

@griesemer

This comment has been minimized.

Show comment
Hide comment
@griesemer

griesemer Oct 10, 2018

Contributor

For reference: #28128

Contributor

griesemer commented Oct 10, 2018

For reference: #28128

@griesemer

This comment has been minimized.

Show comment
Hide comment
@griesemer

griesemer Oct 10, 2018

Contributor

Actually, while this is a bug in the scanner (and there are others, related to this), the spec is pretty clear:

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

So the longest valid sequence of characters here is a dot . followed by another one. That is, the scanner should not return an ILLEGAL token at all. That is what cmd/compile's scanner does.

I vaguely remember having been aware of this for a long time but never bothered to get it 100% correct in go/scanner because it doesn't matter in practice. There's no Go code containing two dots ".." that is valid, so as long as we get an error, we're mostly ok.

There are probably other such cases such as invalid octal or hex numbers, say: 07a. Is that the octal literal 07 followed by an a, or an invalid octal constant? Following the word of the spec exactly, it should be the former. On the other hand, an error complaining about an invalid octal constant may be clearer to a reader than an error due to invalid syntax because the octal 07 is incorrectly followed by the identifier a. Maybe the spec should be less demanding in those cases.

I will try to address the specific problem here if I can find an easy fix (which doesn't slow down the scanner everywhere because we need to keep more state around) but leave the more general problem open.

Contributor

griesemer commented Oct 10, 2018

Actually, while this is a bug in the scanner (and there are others, related to this), the spec is pretty clear:

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

So the longest valid sequence of characters here is a dot . followed by another one. That is, the scanner should not return an ILLEGAL token at all. That is what cmd/compile's scanner does.

I vaguely remember having been aware of this for a long time but never bothered to get it 100% correct in go/scanner because it doesn't matter in practice. There's no Go code containing two dots ".." that is valid, so as long as we get an error, we're mostly ok.

There are probably other such cases such as invalid octal or hex numbers, say: 07a. Is that the octal literal 07 followed by an a, or an invalid octal constant? Following the word of the spec exactly, it should be the former. On the other hand, an error complaining about an invalid octal constant may be clearer to a reader than an error due to invalid syntax because the octal 07 is incorrectly followed by the identifier a. Maybe the spec should be less demanding in those cases.

I will try to address the specific problem here if I can find an easy fix (which doesn't slow down the scanner everywhere because we need to keep more state around) but leave the more general problem open.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Oct 10, 2018

Change https://golang.org/cl/141337 mentions this issue: go/scanner: don't return token.INVALID for ".." sequence

gopherbot commented Oct 10, 2018

Change https://golang.org/cl/141337 mentions this issue: go/scanner: don't return token.INVALID for ".." sequence

@dmitshur

This comment has been minimized.

Show comment
Hide comment
@dmitshur

dmitshur Oct 11, 2018

Member

There's no Go code containing two dots ".." that is valid,

In case you're curious how I discovered this, I ran into this in issue #28082. You left a comment there containing a fenced code block marked up with "Go" syntax, but it was actually a diff of Go code.

I was viewing that comment with some software that performs syntax highlighting for Go using a highlighter implemented on top of go/scanner, and it was behaving incorrectly on the index 1de7cd81b2..38b50f2596 100644 line because I made the wrong assumption about literal being non-empty when token is token.ILLEGAL.

Member

dmitshur commented Oct 11, 2018

There's no Go code containing two dots ".." that is valid,

In case you're curious how I discovered this, I ran into this in issue #28082. You left a comment there containing a fenced code block marked up with "Go" syntax, but it was actually a diff of Go code.

I was viewing that comment with some software that performs syntax highlighting for Go using a highlighter implemented on top of go/scanner, and it was behaving incorrectly on the index 1de7cd81b2..38b50f2596 100644 line because I made the wrong assumption about literal being non-empty when token is token.ILLEGAL.

@griesemer

This comment has been minimized.

Show comment
Hide comment
@griesemer

griesemer Oct 11, 2018

Contributor

@dmitshur Interesting. The scanner should always return the offending character (lit) if there's an ILLEGAL token. I believe this should be the case now.

Contributor

griesemer commented Oct 11, 2018

@dmitshur Interesting. The scanner should always return the offending character (lit) if there's an ILLEGAL token. I believe this should be the case now.

@gopherbot gopherbot closed this in f64fd66 Oct 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment