runtime: sub optimal gc scalability #21056

Open
TocarIP opened this Issue Jul 17, 2017 · 7 comments

TocarIP commented Jul 17, 2017

What version of Go are you using (go version)?

go version devel +4e9c86a Wed Jun 28 17:33:40 2017 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/localdisk/itocar/gopath"
GORACE=""
GOROOT="/localdisk/itocar/golang"
GOTOOLDIR="/localdisk/itocar/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build633808214=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Run test/bench/garbage/tree2 on a machine with 88 threads (2 sockets, 22 cores per socket, 2 threads per core, 2x Xeon E5-2699 v4)
with the following options:
./tree2 -cpus=88 -heapsize=1000000000 -cpuprofile=tree2.pprof
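For reference, the resulting profile can be inspected with pprof (binary and profile names as in the command above):

go tool pprof tree2 tree2.pprof
(pprof) top10
(pprof) list gcDrain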

What did you expect to see?

runtime.gcDrain taking an insignificant amount of time

What did you see instead?

runtime.gcDrain taking about half of the total time:

Showing top 10 nodes out of 33
      flat  flat%   sum%        cum   cum%
    36.95s 45.03% 45.03%     75.98s 92.59%  runtime.gcDrain /localdisk/itocar/golang/src/runtime/mgcmark.go
    12.38s 15.09% 60.11%     12.38s 15.09%  runtime.(*lfstack).pop /localdisk/itocar/golang/src/runtime/lfstack.go
     7.51s  9.15% 69.27%      7.51s  9.15%  runtime.greyobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     6.28s  7.65% 76.92%     19.49s 23.75%  runtime.scanobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     4.54s  5.53% 82.45%      4.54s  5.53%  runtime.(*lfstack).push /localdisk/itocar/golang/src/runtime/lfstack.go

Looking into runtime.gcDrain, I see that almost all of the time is spent on
35.66s 35.66s 924: if work.full == 0 {

I couldn't reproduce this behavior on a machine with a small number of cores.
Looking at a cache-miss profile shows that this is due to all cores updating the head of work.full,
which causes every read needed for the check to miss the cache.
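
To make the contention concrete, here is a minimal standalone sketch (not the runtime's code; the benchmark shape is invented for illustration) of the access pattern described above: every core both CASes the shared head and polls it, so the cache line holding the head ping-pongs between cores.

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// head stands in for the lock-free stack head behind work.full.
var head uint64

func worker(iters int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < iters; i++ {
		// Pushes/pops CAS the shared head, invalidating its cache
		// line on every other core.
		old := atomic.LoadUint64(&head)
		atomic.CompareAndSwapUint64(&head, old, old+1)
		// The cheap-looking emptiness check (cf. "if work.full == 0")
		// then misses the cache, because another core just wrote
		// the same line.
		if atomic.LoadUint64(&head) == 0 {
			runtime.Gosched()
		}
	}
}

func main() {
	var wg sync.WaitGroup
	n := runtime.GOMAXPROCS(0)
	wg.Add(n)
	for i := 0; i < n; i++ {
		go worker(1<<20, &wg)
	}
	wg.Wait()
}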

TocarIP commented Jul 17, 2017

Also reproducible on Go 1.8.

ianlancetaylor commented Jul 17, 2017

ianlancetaylor added this to the Go1.10 milestone Jul 17, 2017

aclements commented Jul 18, 2017

Curiously, while I was able to reproduce this on a 48-thread (24-core) machine with ./tree2 -cpus=48 -heapsize=1000000000, I wasn't able to reproduce it on that machine with garbage -benchmem 1024.

@TocarIP, have you seen this in other/more realistic benchmarks? I wonder if there's something pessimal about tree2's heap.

TocarIP commented Jul 18, 2017

I got a report from a customer about excessive cache misses in gcDrain, but there was no reproducer. Tree2 is just a random benchmark where I was able to reproduce it. For ./garbage -benchmem=1024 I see about 13% of time in gcDrain[N] on the machine above (88 threads), with almost all of the time spent in the same work.full == 0 check.
perf output below:

  12.19%  garbage  garbage           [.] runtime.scanobject
   6.68%  garbage  garbage           [.] runtime.gcDrainN
   6.49%  garbage  garbage           [.] runtime.gcDrain
   6.45%  garbage  garbage           [.] runtime.greyobject
   5.90%  garbage  garbage           [.] runtime.mallocgc
   4.87%  garbage  garbage           [.] runtime.heapBitsForObject
   4.59%  garbage  garbage           [.] runtime.lock
   2.77%  garbage  garbage           [.] runtime.lfstackpop

gopherbot commented Sep 11, 2017

Change https://golang.org/cl/62971 mentions this issue: runtime: reduce contention in gcDrain

RLH commented Sep 12, 2017

TocarIP commented Sep 13, 2017

I agree that tree2 is not very representative, but I'm not sure what is.
E.g. golang.org/x/benchmarks/garbage got ~10% faster in the 88-thread case:

Garbage/benchmem-MB=1024-88                   944µs ± 1%                   854µs ± 2%  -9.50%

Is this more representative?
I'm more than willing to measure some other, more representative case; I'm just not sure what exactly to measure.
As for literature, the idea of having several read/write locations was proposed in e.g. https://people.csail.mit.edu/shanir/publications/Lock_Free.pdf. TBB uses a similar data structure for its priority queue (https://github.com/01org/tbb/blob/tbb_2018/src/tbb/task_stream.h), but with an extra bitmask for faster checking. I didn't bother with the bitmask because the current version reduced time spent in gcDrain to a reasonable level by itself.
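
For concreteness, here is a hypothetical Go sketch of that several-read/write-locations idea (all names invented; not the CL's actual code): the single global head is replaced by a few cache-line-padded lanes, pushes and pops from different workers mostly touch different lines, and pop falls back to scanning the other lanes (TBB's task_stream adds the bitmask so this scan can skip empty lanes).

package sketch

import "sync/atomic"

const numLanes = 8

// node is a hypothetical unit of work (e.g. a workbuf).
type node struct {
	next *node
}

// lane is one independent head, padded out to its own cache line
// so the heads don't share lines (no false sharing).
type lane struct {
	head atomic.Pointer[node]
	_    [64 - 8]byte
}

type shardedStack struct {
	lanes [numLanes]lane
}

// push adds n to the lane picked by id (e.g. the worker's index),
// so different workers mostly CAS different cache lines.
func (s *shardedStack) push(id int, n *node) {
	l := &s.lanes[id%numLanes]
	for {
		old := l.head.Load()
		n.next = old
		if l.head.CompareAndSwap(old, n) {
			return
		}
	}
}

// pop tries the worker's own lane first, then scans the rest.
// NOTE: a real lock-free stack must also guard against ABA here;
// the runtime's lfstack does this by packing a counter into the
// pointer word.
func (s *shardedStack) pop(id int) *node {
	for i := 0; i < numLanes; i++ {
		l := &s.lanes[(id+i)%numLanes]
		for {
			old := l.head.Load()
			if old == nil {
				break
			}
			if l.head.CompareAndSwap(old, old.next) {
				return old
			}
		}
	}
	return nil
}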

domodwyer added a commit to domodwyer/mpjbt that referenced this issue Sep 27, 2017

core: minimise effects of a golang GC bug
When running on machines with large numbers of cores (~40+) a Golang
garbage collection bug seems to dominate runtime - see
golang/go#21056 for more info.

This change limits the number of cores to what we found was most
performant on our hardware - others may want to experiment.

rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018
