Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: Go regexp lib is unacceptably slow #26943

Closed
kamilgregorczyk opened this issue Aug 12, 2018 · 12 comments
Closed

regexp: Go regexp lib is unacceptably slow #26943

kamilgregorczyk opened this issue Aug 12, 2018 · 12 comments

Comments

@kamilgregorczyk
Copy link

@kamilgregorczyk kamilgregorczyk commented Aug 12, 2018

What version of Go are you using (go version)?

1.10.3

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/kgregorczyk/Library/Caches/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/kgregorczyk/go"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.10.3/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.10.3/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xj/xhtb63fs26n04mzf069xnmz000cmqh/T/go-build831834857=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I took this file and loaded is as string https://github.com/zemirco/sf-city-lots-json/blob/master/citylots.json

package main

import (
"fmt",
"ioutil"
)
func LoadJsonSchemaFromFileAsString() (string) {
	byteArray, e := ioutil.ReadFile("test.json")
	if e != nil {
		panic(e)
	}
	return string(byteArray)
}

(...)

func UnorderedListOfNumbers(value string){
	r, _ := regexp.Compile("[+-]?[0-9]+")
	start := time.Now()
	values := r.FindAllString(value, -1)
	fmt.Println(time.Now().Sub(start))
}

func main(){
    fmt.Println(UnorderedListOfNumbers(LoadJsonSchemaFromFileAsString()))
}

What did you expect to see?

I saw what I was supposed to, although execution time is just horrible, it took 21 seconds on macbook pro to find all strings, in python3.7 it took 8,965 seconds ( 2.5 times faster! in python!!)

def get_test_data_as_string():
    with open("test.json", "r") as file:
        return file.read()

string = get_test_data_as_string()
start = time.time()
re.findall("[+-]?[0-9]+", string)
print(time.time() - start()
@ianlancetaylor ianlancetaylor changed the title Go regexp lib is unacceptably slow regexp: Go regexp lib is unacceptably slow Aug 12, 2018
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 12, 2018

Dup of #26623 and #19629. But this issue does have a clean test case, which helps.

Note that although Go often runs faster Python, Python's regexp implementation is in C, so a pure comparison of regexp performance is comparing Go to C. Still, ideally Go should not be that much slower. But as you can see in the discussion on the other issues, there are several aspects to this.

@kamilgregorczyk
Copy link
Author

@kamilgregorczyk kamilgregorczyk commented Aug 12, 2018

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 12, 2018

Thanks, are there any plans, milestones etc ?

Not that I know of. I would be happy to hear about anybody planning to work on this, but because the issues are fairly complex the work needs to be discussed first.

@kamilgregorczyk
Copy link
Author

@kamilgregorczyk kamilgregorczyk commented Aug 12, 2018

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 12, 2018

are you aware of any other regexp lib which works faster?

Not in Go, no.

@dsnet dsnet added the Performance label Aug 13, 2018
@FMNSSun
Copy link

@FMNSSun FMNSSun commented Aug 13, 2018

In case this is just about finding numbers: If you're interested in finding numbers you can get this down to <2s by writing a custom scanner for numbers. But yes, go's regexp package isn't yet as fast as other tools. Regular expressions are great because they are easy to use so that you don't need to write a lot of code manually but in terms of pure performance writing a custom scanner can turn out to be much faster.

@junyer
Copy link
Contributor

@junyer junyer commented Aug 14, 2018

Would it be feasible to move the syntax.InstRune* match checks from step() to add()? A thread failing in step() constitutes wasted work – even for a regular expression as simple as [+-]?[0-9]+.

@junyer
Copy link
Contributor

@junyer junyer commented Aug 14, 2018

Also, what about using slice assignment when a thread is enqueued? Let cap have copy-on-write semantics.

@junyer
Copy link
Contributor

@junyer junyer commented Aug 14, 2018

Also also, it might be worth evaluating the benefit of using a slice as a stack instead of recursing. Anything to reduce the overhead of syntax.InstAlt instructions.

@rsc
Copy link
Contributor

@rsc rsc commented Aug 18, 2018

I have some optimizations coming for Go 1.12, but not 2.5X.
Even so, we don't need another issue telling us regexp is slower than C implementations.
Let's make this a duplicate of the others.

@rsc rsc closed this Aug 18, 2018
@kamilgregorczyk
Copy link
Author

@kamilgregorczyk kamilgregorczyk commented Aug 18, 2018

@rajender
Copy link
Contributor

@rajender rajender commented Aug 18, 2018

@kamilgregorczyk
As per Go Release Cycle, Go1.12 will be released in Feb 2019.

https://github.com/golang/go/wiki/Go-Release-Cycle

@golang golang locked and limited conversation to collaborators Aug 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants
You can’t perform that action at this time.