
runtime: sema: many many goroutines queueing up on many many distinct addresses -> slow #38420

Open

navytux opened this issue Apr 13, 2020 · 3 comments

navytux (Contributor) commented Apr 13, 2020

Hello up there. libcsp claims to be 10x faster than Go on a benchmark involving sync.WaitGroup for a divide-and-conquer summation program. I've analyzed the profile: most of the time is spent in runtime.(*semaRoot).dequeue and runtime.(*semaRoot).queue, triggered by calls to sync.WaitGroup's .Done and .Wait. The benchmark uses many (~ Ngoroutines) distinct WaitGroups and semaphores simultaneously.

Go commit 45c6f59 says

There is still an assumption here that in real programs you don't have many many goroutines queueing up on many many distinct addresses. If we end up with that problem, we can replace the top-level list with a treap.

It seems this benchmark hits exactly that scenario.


What version of Go are you using (go version)?

$ go version
go version go1.14.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="off"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/kirr/.cache/go-build"
GOENV="/home/kirr/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/kirr/src/neo:/home/kirr/src/tools/go/g.env"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/kirr/src/tools/go/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build136588881=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Ran benchmarks in libcsp.

What did you expect to see?

Libcsp and Go versions comparable in terms of speed.

What did you see instead?

Go version 10x slower.

navytux referenced this issue Apr 13, 2020
If there are many goroutines contending for two different locks
and both locks hash to the same semaRoot, the scans to find the
goroutines for a particular lock can end up being O(n), making
n lock acquisitions quadratic.

As long as only one actively-used lock hashes to each semaRoot
there's no problem, since the list operations in that case are O(1).
But when the second actively-used lock hits the same semaRoot,
then scans for entries for a given lock have to scan over the
entries for the other lock.

Fix this problem by changing the semaRoot to hold only one sudog
per unique address. In the running example, this drops the length of
that list from O(n) to 2. Then attach other goroutines waiting on the
same address to a separate list headed by the sudog in the semaRoot list.
Those "same address list" operations are still O(1), so now the
example from above works much better.

There is still an assumption here that in real programs you don't have
many many goroutines queueing up on many many distinct addresses.
If we end up with that problem, we can replace the top-level list with
a treap.

Fixes #17953.

Change-Id: I78c5b1a5053845275ab31686038aa4f6db5720b2
Reviewed-on: https://go-review.googlesource.com/36792
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@ianlancetaylor ianlancetaylor added this to the Go1.15 milestone Apr 13, 2020
ianlancetaylor (Contributor) commented Apr 13, 2020

dvyukov (Member) commented Apr 14, 2020

Storing part of the data on the side, mapped via a global container, adds unnecessary overhead and contention. sync.Cond removed all contention and O(n) behavior with sync.syncSema in a20b4a6a9d (nowadays sync.notifyList). We could do the same for all other sync primitives.

navytux (Contributor, Author) commented Apr 14, 2020

@dvyukov thanks for the feedback. Glad to hear there is a good path forward with embedded wait-queue heads.

@odeke-em odeke-em modified the milestones: Go1.15, Go1.16 May 31, 2020