
runtime: sub optimal gc scalability #21056

TocarIP opened this issue Jul 17, 2017 · 7 comments

Labels: NeedsInvestigation, Performance



TocarIP commented Jul 17, 2017

What version of Go are you using (go version)?

go version devel +4e9c86a Wed Jun 28 17:33:40 2017 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build633808214=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run test/bench/garbage/tree2 on a machine with 88 threads (2 sockets, 22 cores per socket, 2 threads per core), 2x Xeon E5-2699 v4, with the following options:
./tree2 -cpus=88 -heapsize=1000000000 -cpuprofile=tree2.pprof

What did you expect to see?

runtime.gcDrain taking an insignificant amount of time

What did you see instead?

runtime.gcDrain taking about half of all time:

Showing top 10 nodes out of 33
      flat  flat%   sum%        cum   cum%
    36.95s 45.03% 45.03%     75.98s 92.59%  runtime.gcDrain /localdisk/itocar/golang/src/runtime/mgcmark.go
    12.38s 15.09% 60.11%     12.38s 15.09%  runtime.(*lfstack).pop /localdisk/itocar/golang/src/runtime/lfstack.go
     7.51s  9.15% 69.27%      7.51s  9.15%  runtime.greyobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     6.28s  7.65% 76.92%     19.49s 23.75%  runtime.scanobject /localdisk/itocar/golang/src/runtime/mgcmark.go
     4.54s  5.53% 82.45%      4.54s  5.53%  runtime.(*lfstack).push /localdisk/itocar/golang/src/runtime/lfstack.go

Looking into runtime.gcDrain, I see that almost all of the time is spent on
35.66s 35.66s 924: if work.full == 0 {

I couldn't reproduce this behavior on a machine with a small number of cores.
Looking at the cache-miss profile shows that this is due to all cores updating the head of work.full,
which causes every read needed for the check to miss the cache.


TocarIP commented Jul 17, 2017

Also reproducible on 1.8


ianlancetaylor commented Jul 17, 2017

CC @RLH @aclements


aclements commented Jul 18, 2017

Curiously, while I was able to reproduce this on a 48 thread (24 core) machine with ./tree2 -cpus=48 -heapsize=1000000000, I wasn't able to reproduce it on that machine with garbage -benchmem 1024.

@TocarIP, have you seen this in other/more realistic benchmarks? I wonder if there's something pessimal about tree2's heap.

@bradfitz added the NeedsInvestigation label Jul 18, 2017

TocarIP commented Jul 18, 2017

I got a report from a customer about excessive cache misses in gcDrain, but there was no reproducer. tree2 is just the random benchmark where I was able to reproduce it. For ./garbage -benchmem=1024 I see about 13% of time in gcDrain[N] on the machine above (88 threads), with almost all of that time spent in the same work.full == 0 check.
perf output below:

 12.19%  garbage  garbage           [.] runtime.scanobject
   6.68%  garbage  garbage           [.] runtime.gcDrainN
   6.49%  garbage  garbage           [.] runtime.gcDrain
   6.45%  garbage  garbage           [.] runtime.greyobject
   5.90%  garbage  garbage           [.] runtime.mallocgc
   4.87%  garbage  garbage           [.] runtime.heapBitsForObject
   4.59%  garbage  garbage           [.] runtime.lock
   2.77%  garbage  garbage           [.] runtime.lfstackpop


gopherbot commented Sep 11, 2017

Change mentions this issue: runtime: reduce contention in gcDrain


RLH commented Sep 12, 2017


TocarIP commented Sep 13, 2017

I agree that tree2 is not very representative, but I'm not sure what is representative.
E.g., the garbage benchmark got ~10% faster in the 88-thread case:

Garbage/benchmem-MB=1024-88                   944µs ± 1%                   854µs ± 2%  -9.50%

Is this more representative?
I'm more than willing to measure some other, more representative case; I'm just not sure what exactly to measure.
As for literature, the idea of having several read/write locations was proposed in e.g. . TBB uses a similar data structure for its priority queue, but with an extra bitmask for faster checking. I didn't bother with the bitmask because the current version reduced the time spent in gcDrain to a reasonable level by itself.

domodwyer added a commit to domodwyer/mpjbt that referenced this issue Sep 27, 2017
When running on machines with large numbers of cores (~40+) a Golang
garbage collection bug seems to dominate runtime - see
golang/go#21056 for more info.

This change limits the number of cores to what we found was most
performant on our hardware - others may want to experiment.
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018
@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@aclements aclements modified the milestones: Go1.13, Go1.14 Jun 25, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

7 participants