cmd/compile: optimise ranges over string(byteSlice) #27148

mvdan opened this issue Aug 22, 2018 · 9 comments

@mvdan mvdan commented Aug 22, 2018

Take these two benchmarks:

The first code is cleaner, more idiomatic, and easier to write/maintain. The second is much trickier, and I'm not even sure I wrote it correctly.

Lucky for us, the first tends to perform about the same or slightly better in terms of time:

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkIdiomatic-4      500000              2389 ns/op            2688 B/op          1 allocs/op
BenchmarkManual-4         500000              2645 ns/op               0 B/op          0 allocs/op
ok      2.578s

However, as one can see, it still incurs an extra allocation. I'm not sure why that is; via go test -gcflags=-m, I can see ./f_test.go:16:27: BenchmarkIdiomatic string(Input) does not escape.

We have optimized other common string conversion patterns, such as switch string(byteSlice)
in 566e3e0, and I believe someMap[string(byteSlice)] was optimized too.

Would it be possible to get rid of this allocation somehow? My knowledge of the runtime and compiler is a bit limited, but I'd think that it is possible.

As for a real use case: there are a few occurrences in the standard library that could be greatly simplified. For example, in encoding/json we'd remove ten lines of tricky code:

diff --git a/src/encoding/json/decode.go b/src/encoding/json/decode.go
index 2e734fb39e..fa09b85aab 100644
--- a/src/encoding/json/decode.go
+++ b/src/encoding/json/decode.go
@@ -1210,20 +1210,11 @@ func unquoteBytes(s []byte) (t []byte, ok bool) {
 	// then no unquoting is needed, so return a slice of the
 	// original bytes.
 	r := 0
-	for r < len(s) {
-		c := s[r]
-		if c == '\\' || c == '"' || c < ' ' {
-			break
-		}
-		if c < utf8.RuneSelf {
-			r++
-			continue
-		}
-		rr, size := utf8.DecodeRune(s[r:])
-		if rr == utf8.RuneError && size == 1 {
+	for i, c := range string(s) {
+		if c == '\\' || c == '"' || c < ' ' || c == utf8.RuneError {
 			break
 		}
-		r += size
+		r = i
 	}
 	if r == len(s) {
 		return s, true

However, that currently means a bad regression in speed and allocations:

name           old time/op    new time/op    delta
CodeDecoder-4    27.3ms ± 1%    31.2ms ± 1%   +14.16%  (p=0.002 n=6+6)

name           old speed      new speed      delta
CodeDecoder-4  71.1MB/s ± 1%  62.3MB/s ± 1%   -12.41%  (p=0.002 n=6+6)

name           old alloc/op   new alloc/op   delta
CodeDecoder-4    2.74MB ± 0%    5.13MB ± 0%   +87.28%  (p=0.002 n=6+6)

name           old allocs/op  new allocs/op  delta
CodeDecoder-4     77.5k ± 0%    184.3k ± 0%  +137.72%  (p=0.002 n=6+6)

I haven't investigated why my microbenchmark is faster when simpler, while json gets so much slower when made simpler.

Any input appreciated. cc @josharian @martisch @randall77 @TocarIP

@mvdan mvdan added Performance NeedsInvestigation labels Aug 22, 2018
@mvdan mvdan added this to the Go1.12 milestone Aug 22, 2018

@mvdan mvdan commented Aug 22, 2018

Another similar optimization:

Perhaps there's a better way to iterate over the utf8 runes in a byte slice that I'm missing. Hopefully not, as that would be a bit embarrassing :)


@martisch martisch commented Aug 22, 2018

One reason this has not yet been optimized to avoid allocating, like range []byte(string) was, is that it is hard to prove in general that in "for i, c := range string(s)" the s is not changed within the for loop. A single byte pointer changed anywhere within the loop body (even in other packages or through interfaces) can change s, and thereby change the result compared to the case where s is copied on loop entry.

However, there might be cases like this one where it could trigger with some very conservative analysis.

Maybe we could solve this problem by having a range loop variant that iterates over the runes of a byte slice (taking changes to the byte slice into account) and does not need to allocate in general, similar to how map iteration does not iterate over a snapshot of the map taken on loop entry.

The other question is why BenchmarkManual is so much slower even if it does not need to allocate.


@martisch martisch commented Aug 22, 2018

Note also there is a slight difference between the two versions of the code given:
c == utf8.RuneError is not the same check as rr == utf8.RuneError && size == 1.


@mvdan mvdan commented Aug 22, 2018

Conservative analysis sounds fine to me. I imagine that most of the use cases that matter would not involve pointers, reflection, or any other magic.

The range loop variant seems interesting. Are there any downsides to this method? I imagine it would essentially be as if the compiler "expanded" the idiomatic code into the manual code.

The other question is why BenchmarkManual is so much slower even if it does not need to allocate.

Yes - but note that in encoding/json, the opposite is true. So it might just be an issue with my benchmark code.

c == utf8.RuneError is not the same check as rr == utf8.RuneError && size == 1

That came to mind, but I wasn't sure. How could I have written the idiomatic version to behave equally?

At least encoding/json doesn't care - both versions of the code pass all the tests. Which is not to say that they're equivalent or equally valid, but at least the idiomatic version isn't horribly broken.


@martisch martisch commented Aug 22, 2018

Note that it's not only "magic" that can't be used: no function that isn't known at compile time (even ones inserted by the compiler) may be called inside the loop, nor any function that changes a byte pointer or byte slice, since we need to prove that none of those can alias the slice we are iterating over. This will need per-function information (#25999).


@josharian josharian commented Aug 23, 2018

However, as one can see, it still incurs an extra allocation.

Where exactly is the allocation? pprof should tell you.


@quasilyte quasilyte commented Sep 19, 2018

string(Input) does not escape, but that expression is compiled to
runtime.slicebytetostring, which does allocate (though not always; see below).

And now a little experiment:

- var Input = bytes.Repeat([]byte("abcde"), 500)
+ var Input = bytes.Repeat([]byte("abcde"), 5)
BenchmarkIdiomatic-8 50000000  20.9 ns/op 0 B/op 0 allocs/op
BenchmarkManual-8    100000000 23.1 ns/op 0 B/op 0 allocs/op

The problem is that the non-escaping tmp buffer is allocated on the stack
and has very limited capacity.

So if the string being converted does not fit in that buffer, it will be heap-allocated:

// The constant is known to the compiler.
// There is no fundamental theory behind this number.
const tmpStringBufSize = 32

type tmpBuf [tmpStringBufSize]byte

Somewhere inside slicebytetostring:

// buf is a tmpBuf
if buf != nil && len(b) <= len(buf) {
	p = unsafe.Pointer(buf)
} else {
	p = mallocgc(uintptr(len(b)), nil, false)
}

I think there was an issue somewhere about creating read-only []byte slice
views for temporary non-escaping strings. The problem is that if someone
holds that []byte and mutates it while we're iterating over the read-only view,
the code will behave differently, since string(s) creates a copy.
(Update: #2205)


@mvdan mvdan commented Nov 28, 2018

Doesn't look like anyone plans on working on this, so I'm removing it from 1.12 for now. I think we should keep the issue open for a while, to see if there's any interest.


@mvdan mvdan commented Apr 16, 2019

Poking around encoding/json, I just found one of the reasons why the hand-rolled version is slower: the r += size statement throws off the prove pass, since r could overflow as a signed integer and become negative, so c := s[r] has a bounds check. That's unfortunate, and it does show up in the CPU profile.

I can't imagine a way to remove that bounds check, other than having the compiler treat utf8.DecodeRune in a special way. Tricks like if r += size; r < 0 { break } don't seem to appease prove either; but in any case, that would be replacing a bounds check with another branch.

I still think there should be an easy way to iterate over the runes in a []byte without losing performance or incurring allocations/copies.
