
strings: 10-30% speed regression in Contains from 1.13 to tip #35686

Open
zikaeroh opened this issue Nov 19, 2019 · 20 comments

@zikaeroh zikaeroh commented Nov 19, 2019

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64
go version devel +8cf5293c Tue Nov 19 06:10:03 2019 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jake/.cache/go-build"
GOENV="/home/jake/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/jake/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/jake/testproj/strtest/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build998582668=/tmp/go-build -gno-record-gcc-switches"

What did you do?

package strtest

import (
	"strconv"
	"strings"
	"testing"
)

var sink1 bool

func BenchmarkContains(b *testing.B) {
	tests := []string{
		"This is just a cool test.",
		"This is just a cool test. (Or is it?)",
		"Hello, (_USER_)!",
		"RIP count reset to (_VARS_RIP-(_GAME_CLEAN_)_SET_0_), neat.",
		"(_USER_) rubs on (_PARAMETER_) 's booty! PogChamp / (_(_|",
	}

	for i, test := range tests {
		b.Run(strconv.Itoa(i), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink1 = strings.Contains(test, "(_")
			}
		})
	}
}

Benchmark this against 1.13 and tip.

What did you expect to see?

No difference in performance (or hopefully better).

What did you see instead?

Contains is consistently slower. benchstat-ing the above with go test -run=- -bench . -cpu=1,4 -count=10 .:

name          old time/op  new time/op   delta
Contains/0    10.1ns ± 2%   11.4ns ± 1%  +12.94%  (p=0.000 n=10+10)
Contains/0-4  10.1ns ± 0%   11.4ns ± 2%  +12.57%  (p=0.000 n=7+10)
Contains/1    12.1ns ± 1%   13.0ns ± 1%   +7.56%  (p=0.000 n=9+9)
Contains/1-4  12.1ns ± 2%   12.8ns ± 2%   +5.48%  (p=0.000 n=8+10)
Contains/2    7.70ns ± 3%   9.82ns ± 6%  +27.52%  (p=0.000 n=10+10)
Contains/2-4  7.82ns ± 3%   9.64ns ± 2%  +23.25%  (p=0.000 n=9+10)
Contains/3    9.31ns ± 1%  11.01ns ± 1%  +18.22%  (p=0.000 n=9+10)
Contains/3-4  9.45ns ± 3%  11.11ns ± 2%  +17.56%  (p=0.000 n=9+9)
Contains/4    7.61ns ± 1%   9.87ns ± 1%  +29.60%  (p=0.000 n=8+9)
Contains/4-4  7.81ns ± 3%   9.73ns ± 3%  +24.69%  (p=0.000 n=10+9)

This has come out of some performance testing between 1.13 and tip in a project I'm working on which shows a consistent 6% regression for a hand-written parser (https://github.com/hortbot/hortbot/blob/master/internal/cbp/cbp_test.go), and this was the first thing I noticed while comparing the profiles. The first thing the parser does is check for one of two tokens before doing any work, with the assumption that if neither are present, there's no work to be done.
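The pre-check described above can be sketched roughly like this; `parse` and its structure here are illustrative stand-ins, not the actual cbp API, and only the token literals and the "bail out early" behavior are taken from the report:

```go
package main

import (
	"fmt"
	"strings"
)

// parse is a hypothetical stand-in for the real parser's entry point:
// it checks for either action token up front and skips all further work
// when neither is present, so strings.Contains dominates the fast path.
func parse(s string) (hasActions bool) {
	if !strings.Contains(s, "(_") && !strings.Contains(s, "_)") {
		return false // fast path: nothing to parse
	}
	// ... the full tokenizing/slicing work would happen here ...
	return true
}

func main() {
	fmt.Println(parse("This is just a cool test.")) // no tokens
	fmt.Println(parse("Hello, (_USER_)!"))          // tokens present
}
```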

In the grand scheme of things, maybe a few extra nanoseconds isn't such a big deal, but I haven't tested other functions in strings quite yet.


@mvdan mvdan commented Nov 19, 2019

Perhaps bisect what commit introduced this performance regression? It might have been a change in strings or bytes, or something in cmd/compile.

@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Nov 19, 2019

@ianlancetaylor ianlancetaylor commented Nov 19, 2019


@zikaeroh zikaeroh commented Nov 19, 2019

I tried to bisect this and got CL 206938, but I don't feel confident that my bisect was reliable. I'm going to retest later on a more consistent machine to make sure I'm not measuring this wrong.

Newer test was:

package strtest

import (
	"strconv"
	"strings"
	"testing"
)

var (
	sink1 bool
	sink2 bool
)

func BenchmarkContains(b *testing.B) {
	tests := []string{
		"This is just a cool test.",
		"This is just a cool test. (Or is it?)",
		"RIP count reset to (_VARS_RIP-(_GAME_CLEAN_)_SET_0_), neat.",
	}

	for i, test := range tests {
		b.Run(strconv.Itoa(i), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink1 = strings.Contains(test, "(_")
				sink2 = strings.Contains(test, "_)")
			}
		})
	}
}

@ianlancetaylor ianlancetaylor commented Nov 19, 2019

I'm reasonably confident that CL 206938 is not the cause of a slowdown in strings.Contains.


@opennota opennota commented Nov 20, 2019

I don't see any regression. If anything, time/op on tip is better (compared to 1.13.4):

name          old time/op  new time/op  delta
Contains/0-4  18.4ns ± 1%  18.0ns ± 0%  -2.12%  (p=0.000 n=10+7)
Contains/1-4  20.7ns ± 1%  20.5ns ± 0%  -0.87%  (p=0.000 n=10+10)
Contains/2-4  20.6ns ± 0%  20.5ns ± 0%  -0.27%  (p=0.029 n=9+9)

@zikaeroh zikaeroh commented Nov 20, 2019

I too can't reproduce this on another machine, but it's very consistent on my main dev machine. I'm working on bisecting this and trying to eliminate any outside variables that could be affecting things, and I'll reply with results (and close this if it turns out to be nothing after all).


@zikaeroh zikaeroh commented Nov 20, 2019

Another benchmark, with a longer runtime to make it more consistent:

package strtest

import (
	"strings"
	"testing"
)

var sink bool

func BenchmarkContains(b *testing.B) {
	vs := []string{"(_", "_)", " ", ".", "?"}
	const str = "This is just a cool test. (Or is it?)"
	for i := 0; i < b.N; i++ {
		for _, v := range vs {
			sink = strings.Contains(str, v)
		}
	}
}

Between Go 1.13.4 and 15bff20 (Oct 31st):

name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  42.1ns ± 2%  -7.16%  (p=0.000 n=10+10)

But between Go 1.13.4 and the current tip (ce7829f):

name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  46.7ns ± 3%  +2.84%  (p=0.000 n=10+10)

@zikaeroh zikaeroh commented Nov 20, 2019

@ianlancetaylor Unbelievably, I bisected again with the above code, and again got the same CL:

2d8c1995b9c4123ab7ac35a1ba1497b290755d4d is the first bad commit
commit 2d8c1995b9c4123ab7ac35a1ba1497b290755d4d
Author: Ian Lance Taylor <iant@golang.org>
Date:   Fri Nov 15 10:05:13 2019 -0800

    runtime: release timersLock while running timer

I too doubt that this is related directly to strings.Contains, but maybe there's something else going on (maybe in the testing library itself? some other goroutine?). It being a timers CL is what made me think my bisect was bad, but getting it again is suspicious.

What is also interesting is that within the same test run, I get similar results, but immediately running it again will give me another set of different but still consistent (according to benchstat) times.

At 2d8c199:

name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  49.4ns ± 2%  +8.90%  (p=0.000 n=10+10)
name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  47.0ns ± 1%  +3.45%  (p=0.000 n=10+9)
name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  47.5ns ± 2%  +4.58%  (p=0.000 n=10+10)

At 2d8c199~1:

name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  45.6ns ± 2%   ~     (p=0.432 n=10+9)
name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  45.0ns ± 2%   ~     (p=0.088 n=10+9)
name        old time/op  new time/op  delta
Contains-4  45.4ns ± 2%  45.3ns ± 3%   ~     (p=0.838 n=10+10)

Bisect log:

git bisect start
# bad: [ce7829fdb2f9acc6a228e0fc8093afe6d2ca989c] doc: add section for major version suffixes to module reference
git bisect bad ce7829fdb2f9acc6a228e0fc8093afe6d2ca989c
# good: [3f21c2381d9b0f0977f388cc89104f557a7d2c88] os/signal: don't ignore SIGINT in TestAtomicStop child process
git bisect good 3f21c2381d9b0f0977f388cc89104f557a7d2c88
# good: [73d20f8186a091c8d7e81b621136770981cf8e44] runtime: always use Go signal stack in non-cgo program
git bisect good 73d20f8186a091c8d7e81b621136770981cf8e44
# bad: [1046a9cb1b348a91e551ed2d1714e0926a1b04b1] cmd/link,cmd/internal/obj/riscv: implement archreloc for riscv64
git bisect bad 1046a9cb1b348a91e551ed2d1714e0926a1b04b1
# good: [fff450256a541decd2bb2060c46b788d92692faa] cmd/go: convert TestUserOverrideFlags to a script test
git bisect good fff450256a541decd2bb2060c46b788d92692faa
# good: [9bba63bbbe43ee6370367220feb46717ec50ea9c] cmd/compile: make a better bogus line for empty infinite loops
git bisect good 9bba63bbbe43ee6370367220feb46717ec50ea9c
# good: [3efa09f3862b68f0dfa5012455f475836e040e91] time: deflake TestTicker
git bisect good 3efa09f3862b68f0dfa5012455f475836e040e91
# bad: [2d8c1995b9c4123ab7ac35a1ba1497b290755d4d] runtime: release timersLock while running timer
git bisect bad 2d8c1995b9c4123ab7ac35a1ba1497b290755d4d
# first bad commit: [2d8c1995b9c4123ab7ac35a1ba1497b290755d4d] runtime: release timersLock while running timer

Feel free to retitle this issue appropriately, if needed.


@zikaeroh zikaeroh commented Nov 20, 2019

Hm, my mistake, that's not the same CL, but another timers related CL... I'm not sure what this means, then.


@martisch martisch commented Nov 20, 2019

What CPU does your dev machine run, and at what CL are you building tip?

Any CL can potentially cause the benchmark code to be aligned differently, in which case you end up benchmarking the effects of branch alignment. I have seen in the past that where the benchmark loop is placed in the file can matter a lot, since it changes the alignment.

With all the side channel attacks on caching/branch prediction there can also be benchmark differences due to different microcode versions of the same CPU: https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode&num=1


@zikaeroh zikaeroh commented Nov 20, 2019

i7-6600U on Linux 5.3.11, all software mitigations disabled. But, I'm on Arch where the ucode version is dated 2019-11-15, so I can roll that back and see what happens.

I'll have another crack at benchmarking and trying things this afternoon when I have time.


@zikaeroh zikaeroh commented Dec 6, 2019

Apologies for getting to this so late. I ran some tests with my previous benchmark, and while the microcode I was running did have a negative performance impact, the effect appears to be the same for both Go 1.13 and gotip.

Here are my results. For each configuration there are two benchstats: one from running 1.13 first, and one from running 1.13 after gotip.

With no microcode loaded:

name        old time/op  new time/op  delta
Contains-4  40.2ns ± 0%  55.4ns ± 3%  +37.96%  (p=0.000 n=9+10)
name        old time/op  new time/op  delta
Contains-4  41.3ns ± 2%  55.4ns ± 3%  +34.11%  (p=0.000 n=10+10)

With a microcode from ~6 months ago:

intel-ucode 20190618-1
name        old time/op  new time/op  delta
Contains-4  40.3ns ± 0%  56.5ns ± 1%  +40.01%  (p=0.000 n=10+10)
name        old time/op  new time/op  delta
Contains-4  41.4ns ± 2%  56.5ns ± 1%  +36.54%  (p=0.000 n=9+10)

With the latest microcode:

intel-ucode 20191115-2
name        old time/op  new time/op  delta
Contains-4  43.1ns ± 1%  61.0ns ± 1%  +41.47%  (p=0.000 n=10+10)
name        old time/op  new time/op  delta
Contains-4  43.9ns ± 1%  61.0ns ± 1%  +38.70%  (p=0.000 n=10+10)

My gotip:

go version devel +0915a19 Fri Dec 6 05:12:15 2019 +0000 linux/amd64

I've also found that code that "looks" like Contains (i.e., a range over a string with a bit of boolean logic) shows this sort of slowdown as well.


@zikaeroh zikaeroh commented Dec 6, 2019

This is probably less helpful, but I'm seeing the same slowdown while benchmarking a parser I've written (which calls strings.Contains as a pre-check, then does a lot of string iteration and slicing afterward). I'm not so good at comparing CPU profiles, though.

name                   old time/op    new time/op    delta
Parse/No_actions-4       64.0ns ± 2%    70.2ns ± 1%   +9.62%  (p=0.000 n=10+9)
Parse/Parens-4           66.8ns ± 1%    73.6ns ± 1%  +10.15%  (p=0.000 n=10+9)
Parse/Single_action-4     313ns ± 1%     342ns ± 2%   +9.33%  (p=0.000 n=9+10)
Parse/Nested-4            483ns ± 1%     532ns ± 1%  +10.25%  (p=0.000 n=9+10)
Parse/ASCII_art-4         971ns ± 1%    1044ns ± 1%   +7.49%  (p=0.000 n=10+9)

https://github.com/hortbot/hortbot/blob/master/internal/cbp/cbp.go


@markdryan markdryan commented Feb 4, 2020

I ran some tests with my previous benchmark, and while the microcode I was running did have a negative performance impact, the effect appears to be even between Go 1.13 and gotip.

@zikaeroh It's possible that the performance effects you are seeing with the newer microcode are due to issue #35881. You could confirm this by:

  • Applying CL 206837 locally
  • export GOAMD64="alignedjumps"
  • rebuilding your Go tool chain
  • executing $GOSRCREPO/bin/go env to check GOAMD64 is set correctly.
  • re-running your benchmark with the new tool chain.
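Concretely, the steps above look something like the following sketch, assuming $GOSRCREPO points at a Go source checkout (the Gerrit ref for the CL depends on its patchset; look it up on the CL page and substitute it):

```shell
cd "$GOSRCREPO"
git fetch https://go.googlesource.com/go <ref-for-CL-206837>
git cherry-pick FETCH_HEAD

# Rebuild the tool chain with aligned jumps enabled.
export GOAMD64=alignedjumps
cd src
./make.bash

# Confirm the setting took effect, then re-run the benchmark.
"$GOSRCREPO/bin/go" env | grep GOAMD64
"$GOSRCREPO/bin/go" test -run=- -bench=Contains -count=10 .
```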

@martisch martisch commented Feb 4, 2020

While you're at it, can you please test whether it's just a matter of general jump alignment? Set funcAlign to 32 instead of 16, compile tip, and re-benchmark:

funcAlign = 16


@zikaeroh zikaeroh commented Feb 5, 2020

First, plain gotip at f770366:

name                   old time/op    new time/op    delta
Parse/No_actions-4       66.5ns ± 2%    70.6ns ± 3%  +6.15%  (p=0.000 n=10+10)
Parse/Parens-4           69.6ns ± 1%    72.5ns ± 1%  +4.09%  (p=0.000 n=10+9)
Parse/Single_action-4     333ns ± 1%     346ns ± 1%  +4.10%  (p=0.000 n=9+10)
Parse/Nested-4            511ns ± 1%     539ns ± 1%  +5.40%  (p=0.000 n=9+10)
Parse/ASCII_art-4        1.01µs ± 0%    1.07µs ± 1%  +6.16%  (p=0.000 n=9+10)
name                   old time/op    new time/op    delta
Parse/No_actions-4       66.5ns ± 2%    68.9ns ± 0%  +3.74%  (p=0.000 n=10+9)
Parse/Parens-4           69.6ns ± 1%    72.7ns ± 1%  +4.45%  (p=0.000 n=10+10)
Parse/Single_action-4     333ns ± 1%     345ns ± 1%  +3.71%  (p=0.000 n=9+10)
Parse/Nested-4            511ns ± 1%     537ns ± 1%  +5.13%  (p=0.000 n=9+10)
Parse/ASCII_art-4        1.01µs ± 0%    1.07µs ± 1%  +6.22%  (p=0.000 n=9+10)

There's some variance between runs.

@markdryan I applied that CL to tip. Running the benchmarks from my most recent post gives:

name                   old time/op    new time/op    delta
Parse/No_actions-4       66.5ns ± 2%    63.5ns ± 0%  -4.47%  (p=0.000 n=10+9)
Parse/Parens-4           69.6ns ± 1%    66.6ns ± 1%  -4.41%  (p=0.000 n=10+10)
Parse/Single_action-4     333ns ± 1%     330ns ± 0%  -0.90%  (p=0.010 n=9+7)
Parse/Nested-4            511ns ± 1%     504ns ± 1%  -1.31%  (p=0.000 n=9+10)
Parse/ASCII_art-4        1.01µs ± 0%    1.03µs ± 1%  +2.21%  (p=0.000 n=9+9)

Pretty consistent between runs. In fact, many of my benchmarks see similar improvements (full run of my project here: https://gist.github.com/zikaeroh/351939ff3b3657f34092caa974733a96).

@martisch Changing only that alignment:

name                   old time/op    new time/op    delta
Parse/No_actions-4       66.5ns ± 2%    68.2ns ± 0%  +2.63%  (p=0.000 n=10+9)
Parse/Parens-4           69.6ns ± 1%    71.4ns ± 1%  +2.48%  (p=0.000 n=10+10)
Parse/Single_action-4     333ns ± 1%     337ns ± 1%  +1.40%  (p=0.000 n=9+10)
Parse/Nested-4            511ns ± 1%     519ns ± 0%  +1.59%  (p=0.000 n=9+9)
Parse/ASCII_art-4        1.01µs ± 0%    1.05µs ± 1%  +3.70%  (p=0.000 n=9+9)
name                   old time/op    new time/op    delta
Parse/No_actions-4       66.5ns ± 2%    68.4ns ± 1%  +2.89%  (p=0.000 n=10+10)
Parse/Parens-4           69.6ns ± 1%    71.3ns ± 1%  +2.33%  (p=0.000 n=10+8)
Parse/Single_action-4     333ns ± 1%     352ns ±10%  +5.71%  (p=0.000 n=9+9)
Parse/Nested-4            511ns ± 1%     522ns ± 1%  +2.09%  (p=0.000 n=9+10)
Parse/ASCII_art-4        1.01µs ± 0%    1.05µs ± 1%  +4.35%  (p=0.000 n=9+10)

Some variance, but seemingly better than plain tip.

I was actually considering closing this in reference to #35881 (specifically, #35881 (comment); or at least mentioning it), as it seemed somewhat similar in nature, but I'm definitely not an expert in this area.


@markdryan markdryan commented Feb 5, 2020

@zikaeroh Thanks so much for taking the time to re-run your benchmarks. I have some questions:

  1. What do old time/op and new time/op represent in the first set of benchmarks?
  2. What microcode version are you testing on? Is it the latest? Are all benchmarks run on the same microcode?
  3. You posted results for a different benchmark (Contains-4) above, that seemed to be affected by the microcode update. Could you possibly re-run that benchmark with and without CL 206837 on the latest microcode?

@markdryan markdryan commented Feb 5, 2020

Leaving aside the potential performance effects of the microcode update for a moment, it's possible that the remaining performance effects you see from one build/run to another on your Contains-4 benchmark are due to the alignment of your test string, "str". If I follow the chain of calls correctly through the standard library code, some iterations of your test loop will end up in bytealg.IndexByte. This method uses AVX2 instructions when the string you're searching is > 32 bytes, which your string happens to be. It doesn't, however, seem to contain any peeling code, so the performance of IndexByte may vary on machines that support AVX2, depending on whether or not the string being searched is 32 byte aligned. One way to determine whether data alignment is responsible for some of the inconsistencies you see with your benchmark would be to make the test string shorter, e.g., 31 bytes.


@zikaeroh zikaeroh commented Feb 17, 2020

Sorry, no idea why I didn't reply to this.

  1. In my original post, it was Go 1.13 to tip at the time. When I reported this, I was assuming it was a regression in the library, as I could reliably see that 1.13 was fine and tip was not.
  2. This post has been around for a while, but for the most part I've been at 20191115 (from the Arch repos; it's had some minor packaging changes but it's still fundamentally the same November microcode) -- other than the tests I did earlier that were on multiple microcode versions.
  3. Sure. Here are the results; old is tip, new is tip with the CL cherry-picked on top. The "31" variant is my original test, but with the string cut to 31 bytes.
name          old time/op  new time/op  delta
Contains-4    43.6ns ± 1%  42.1ns ± 0%  -3.47%  (p=0.000 n=10+9)
Contains31-4  43.0ns ± 1%  40.0ns ± 1%  -6.96%  (p=0.000 n=10+9)

name          old time/op  new time/op  delta
Contains-4    43.5ns ± 1%  42.2ns ± 1%  -2.99%  (p=0.000 n=10+10)
Contains31-4  43.0ns ± 1%  39.8ns ± 0%  -7.39%  (p=0.000 n=10+9)

The code for this benchmark is in my original post.


@zikaeroh zikaeroh commented Mar 18, 2020

For what it's worth, tip's performance in my Contains/Index-heavy benchmarks has gone back to levels similar to 1.13. Bisecting back to where it became "good" gives me CL 156998, which makes sense, as that CL was intended to improve the performance of strings/bytes. But that's not directly related to the theory that this was an alignment issue.

7 participants