regexp: LiteralPrefix behaving incorrectly #30425

Open
shishkander opened this Issue Feb 27, 2019 · 9 comments

shishkander commented Feb 27, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Yeah, go1.12 has just been released.

What did you do?

package main
import (
	"fmt"
	"regexp"
)
func main() {
	l, _ := regexp.MustCompile("^prefix\\d+.$").LiteralPrefix()
	fmt.Printf("literalPrefix: %q", l)
}

See also on Play or even more

What did you expect to see?

literalPrefix: "prefix"

What did you see instead?

literalPrefix: ""
Member

agnivade commented Feb 27, 2019

Contributor

Gnouc commented Feb 27, 2019

Maybe this just needs a documentation update, because the prefix applies to unanchored matches only:

prefix string // required prefix in unanchored matches
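For unanchored patterns, the case that field comment describes, the returned string is a literal that every match must begin with. A minimal sketch of that property follows; the helper names and the sample input are made up for illustration, and only unanchored patterns are used because the anchored cases are what this issue disputes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// literalPrefixOf is a hypothetical helper that compiles pattern and
// returns whatever Regexp.LiteralPrefix reports for it.
func literalPrefixOf(pattern string) (string, bool) {
	return regexp.MustCompile(pattern).LiteralPrefix()
}

// startsWithLiteralPrefix reports whether the leftmost match of pattern in s
// begins with the literal prefix. It returns true when there is no match,
// since there is then nothing to check.
func startsWithLiteralPrefix(pattern, s string) bool {
	re := regexp.MustCompile(pattern)
	prefix, _ := re.LiteralPrefix()
	loc := re.FindStringIndex(s)
	if loc == nil {
		return true
	}
	return strings.HasPrefix(s[loc[0]:loc[1]], prefix)
}

func main() {
	// Unanchored variants of the patterns discussed in this thread.
	for _, pattern := range []string{`prefix\d+.`, `foo`, `foo.bar`} {
		prefix, complete := literalPrefixOf(pattern)
		fmt.Printf("%q prefix=%q complete=%v\n", pattern, prefix, complete)
	}
	fmt.Println(startsWithLiteralPrefix(`prefix\d+.`, "a prefix42! b"))
}
```

Dropping the leading ^ from the pattern in the original report is enough to get "prefix" back.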

Contributor

ianlancetaylor commented Feb 27, 2019

Yes, this is a documentation issue. The code is behaving as intended.

Contributor

antong commented Feb 27, 2019

Do note that if the expression is a bit more anchored, then LiteralPrefix() applies again:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	fmt.Println(regexp.MustCompile("^foo").LiteralPrefix())
	fmt.Println(regexp.MustCompile("^foo$").LiteralPrefix())
	fmt.Println(regexp.MustCompile("^foo.bar$").LiteralPrefix())
	// For go version 1.6 to 1.12
	// Output:
	//  false
	// foo true
	// foo false
}

The behaviour has changed through versions. The relevant versions I found:

go version go1.2.2 linux/amd64
 false
 false
 false

go version go1.3.3 linux/amd64
 false
foo false
foo false

go version go1.6.4 linux/amd64
 false
foo true
foo false

Contributor

Gnouc commented Feb 27, 2019

@antong The go 1.6 to 1.12 behaviour seems to be correct.

Only ^foo$ can be turned into a literal match; ^foo only matches at the start, and the . in ^foo.bar$ can match anything.

Contributor

antong commented Feb 28, 2019

@Gnouc Well, ^foo$ also only matches at the start and is also "anchored". I'm sure there is some reasoning behind this, but I don't really understand it based on the doc and the info in this thread.

I found only one place in the standard library where LiteralPrefix() is used, in suffixarray.FindAllIndex(). To me it looks like suffixarray assumes that ^foo$ has no literal prefix (as it was in go1.2).

https://play.golang.org/p/j27u5DdE1oS :

suffixarray.FindAllIndex(re) in "banana":
"n."   found at: [[2 4] [4 6]]
"^n."  found at: []
"^n.$" found at: [[4 6]], should be []?
"a"    found at: [[1 2] [3 4] [5 6]]
"^a"   found at: []
"^a$"  found at: [[1 2] [3 4] [5 6]], should be []?
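A minimal sketch of how a listing like the one above could be produced, assuming the linked playground program does something similar (its source is not reproduced in this thread). The helper name findAll is made up. The results for the anchored patterns depend on what LiteralPrefix returns in the Go version used, so no expected output is stated for them.

```go
package main

import (
	"fmt"
	"index/suffixarray"
	"regexp"
)

// findAll builds a suffix array over text and returns the result of
// Index.FindAllIndex for pattern, formatted with fmt.Sprint so that it
// is easy to print and compare.
func findAll(text, pattern string) string {
	idx := suffixarray.New([]byte(text))
	re := regexp.MustCompile(pattern)
	// n = -1 asks for all non-overlapping matches.
	return fmt.Sprint(idx.FindAllIndex(re, -1))
}

func main() {
	for _, pattern := range []string{"n.", "^n.", "^n.$", "a", "^a", "^a$"} {
		fmt.Printf("%-7q found at: %s\n", pattern, findAll("banana", pattern))
	}
}
```

Comparing each line with regexp.FindAllIndex on the same input is one way to spot where suffixarray diverges from a plain regexp search.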
Contributor

Gnouc commented Feb 28, 2019

@antong: I mean ^foo$ is the same as matching the literal foo; ^foo isn't.

Contributor

antong commented Mar 1, 2019

I must be missing something obvious, but why would "^foo$" deserve to return a prefix but not "^foo" ? Any match to either of these needs to have the prefix "foo" (discounting multiline flag etc.).

regexp/syntax also has a variant Prog.Prefix() that (in go1.12 and all previous go versions) behaves like LiteralPrefix() did in go1.2 (https://play.golang.org/p/pUzVZIhot8s):

"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix=""      complete=false
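For readers who want to reproduce the table above without the playground link, here is a minimal sketch that calls Prog.Prefix directly. The helper name progPrefix is made up, and parsing with syntax.Perl and simplifying before compiling is an assumption about how the linked program sets things up.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// progPrefix parses and compiles pattern with regexp/syntax and returns
// the result of Prog.Prefix for the compiled program.
func progPrefix(pattern string) (string, bool, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return "", false, err
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return "", false, err
	}
	prefix, complete := prog.Prefix()
	return prefix, complete, nil
}

func main() {
	for _, pattern := range []string{"foo", "foo$", "^foo", "^foo$"} {
		prefix, complete, err := progPrefix(pattern)
		if err != nil {
			fmt.Println(pattern, "error:", err)
			continue
		}
		fmt.Printf("%-8q: prefix=%-6q complete=%v\n", pattern, prefix, complete)
	}
}
```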

Now I must admit I've never used LiteralPrefix() myself, but to me it seems that the way it worked in go1.2, which is also the way it has always worked in regexp/syntax, would be more useful to a user. Take for instance the only user in the standard library, suffixarray. It uses the prefix to locate potential starting points for a regexp match in the FindAllIndex() function. For that purpose it is useful that regexps anchored to the beginning of the text ("^foo") do not return a prefix. But if a regexp that is anchored at both ends ("^foo$") does return a prefix again, the prefix is no longer useful for that purpose. I'd even say that suffixarray has a bug in this regard (https://play.golang.org/p/j27u5DdE1oS).

So I would like to understand why "^foo$" returns a literal prefix even though it is anchored. I believe I do understand how it happens: there is an optimization that takes a separate path for regexps that can be compiled for "one-pass" execution, and the prefix is calculated differently for these. But to me that sounds like an internal implementation detail, not an external property of a regular expression that should affect the output.
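One way to see which patterns take the different path is to compare the two APIs side by side. The sketch below is only an illustration: the helper name samePrefix is made up, and it assumes that package regexp parses with syntax.Perl and simplifies before compiling. It does not state an expected result for the anchored patterns, since that is the behaviour in question and it has changed between releases.

```go
package main

import (
	"fmt"
	"regexp"
	"regexp/syntax"
)

// samePrefix reports whether Regexp.LiteralPrefix and syntax.Prog.Prefix
// return identical results for pattern.
func samePrefix(pattern string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	parsed, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return false, err
	}
	prog, err := syntax.Compile(parsed.Simplify())
	if err != nil {
		return false, err
	}
	p1, c1 := re.LiteralPrefix()
	p2, c2 := prog.Prefix()
	return p1 == p2 && c1 == c2, nil
}

func main() {
	for _, pattern := range []string{"foo", "foo$", "^foo", "^foo$"} {
		same, err := samePrefix(pattern)
		if err != nil {
			fmt.Println(pattern, "error:", err)
			continue
		}
		fmt.Printf("%-8q same=%v\n", pattern, same)
	}
}
```

Any pattern reported with same=false is one where package regexp computes the prefix by a route other than Prog.Prefix.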

For reference, here are LiteralPrefix() results for different Go versions (https://play.golang.org/p/KmN93nSPy6A):

go1.2
"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix=""      complete=false

go1.3 - 1.5
"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix="foo"   complete=false

go1.6 - 1.12
"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix="foo"   complete=true
Contributor

griesemer commented Mar 6, 2019

cc: @rsc for insights on the behavior change of LiteralPrefix.
