
runtime: allocation performance worse on two socket server #47831

Open
gangdeng-intel opened this issue Aug 20, 2021 · 3 comments · May be fixed by #48236

Comments


@gangdeng-intel gangdeng-intel commented Aug 20, 2021

What version of Go are you using (go version)?

$ go version
1.16.4

Does this issue reproduce with the latest release?

I did not try the latest release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/golang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build394973932=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I ran a Go-based application (TiDB v5.1, compiled with Go 1.16.4) on a 2-socket server (24 cores per socket) and compared its performance with running on 1 socket (I used numactl to bind the application to a single socket).

What did you expect to see?

Better performance on 2 sockets than on 1 socket, since 2 sockets provide twice the number of CPU cores.

What did you see instead?

The application performed worse on 2 sockets than on a single socket. More specifically, it achieved only 93% of its 1-socket performance when running on 2 sockets.

I used perf top to check the application's hotspots. The top hot function is runtime.heapBitsSetType: it accounted for 6.45% of CPU time on 1 socket but 17.25% on 2 sockets.

Then I used perf c2c to check whether the performance degradation was caused by a cache-line sharing issue. With that tool, I found that runtime.heapBitsSetType is the major source of HITM (hits in a modified cache line), and the major sources of stores are runtime.(*mspan).sweep (79.16%), runtime.(*mheap).reclaim (13.34%), and runtime.sweepone (7.48%). More detailed data below.

------- HITM ------   Store Refs   CL Off  Node  PA Cnt  Function
RmtHitm    LclHitm    L1 Hit
  2.03%      0.38%    79.16%       0x00    1     1       runtime.(*mspan).sweep
  0.86%      0.07%     7.48%       0x30    1     1       runtime.sweepone
  0.18%      0.03%    13.34%       0x30    1     1       runtime.(*mheap).reclaim
 89.98%     92.52%     0.00%       0x38    1     1       runtime.heapBitsSetType

Then I located the corresponding lines of code in the above functions:

For runtime.heapBitsSetType, the related line is "ha := mheap_.arenas[arena.l1()][arena.l2()]" (in heapBitsForAddr, mbitmap.go).
For runtime.(*mspan).sweep, the related line is "atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))" (in sweep, mgcsweep.go).
For runtime.sweepone, the related line is "atomic.Xadduintptr(&mheap_.reclaimCredit, npages)" (in sweepone, mgcsweep.go).
For runtime.(*mheap).reclaim, the related line is "if atomic.Casuintptr(&h.reclaimCredit, credit, credit-take) {" (in reclaim, mheap.go).

From the above code, I identified the related variables defined in the mheap struct: pagesSwept, reclaimCredit, and arenas. Assuming pagesSwept starts a cache line at 0x00, reclaimCredit falls at offset 0x30 and arenas at 0x38, which exactly matches the cache-line offsets (CL Off) reported by perf c2c:

0x00 pagesSwept uint64
0x08 pagesSweptBasis uint64
0x10 sweepHeapLiveBasis uint64
0x18 sweepPagesPerByte float64
0x20 scavengeGoal uint64
0x28 reclaimIndex uint64
0x30 reclaimCredit uintptr
0x38 arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena

Based on the above analysis, I think the performance issue on the 2-socket server can be solved either by adding padding before arenas or by reordering the variable definitions in the mheap struct so that the read-mostly arenas field does not share a cache line with the frequently written sweep/reclaim counters.

@mknyszek mknyszek changed the title Go runtime has performance issue on 2 socket server runtime: performance worse on two socket server Aug 20, 2021
@mknyszek mknyszek added this to the Backlog milestone Aug 20, 2021
@mknyszek
Contributor

@mknyszek mknyszek commented Aug 20, 2021

This is somewhat expected. The Go runtime doesn't do anything special for NUMA nodes, so there's probably a lot of cross-socket traffic it's producing that actively makes things worse.

CC @prattmic @aclements

@mknyszek mknyszek changed the title runtime: performance worse on two socket server runtime: allocation performance worse on two socket server Aug 20, 2021
@gopherbot

@gopherbot gopherbot commented Sep 8, 2021

Change https://golang.org/cl/348230 mentions this issue: add 64 bytes padding to fix HITM issue across CPU sockets

@nightlyone
Contributor

@nightlyone nightlyone commented Sep 26, 2021

Also known as False Sharing.
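To illustrate the effect, here is a minimal false-sharing sketch (hypothetical names; assumes 64-byte cache lines): two goroutines hammer two atomic counters that either share one cache line, as the mheap fields above do, or are padded onto separate lines. On a multi-core machine the padded version typically runs measurably faster.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// sharedLine puts both hot counters on one 64-byte cache line.
type sharedLine struct {
	a uint64
	b uint64
}

// paddedLines pushes b onto the next 64-byte cache line.
type paddedLines struct {
	a uint64
	_ [56]byte
	b uint64
}

// hammer runs two goroutines that atomically increment the two counters
// n times each and returns the elapsed wall-clock time.
func hammer(fa, fb *uint64, n int) time.Duration {
	var wg sync.WaitGroup
	wg.Add(2)
	start := time.Now()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(fa, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(fb, 1)
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 5_000_000
	var s sharedLine
	var p paddedLines
	fmt.Println("shared cache line:", hammer(&s.a, &s.b, n))
	fmt.Println("padded:           ", hammer(&p.a, &p.b, n))
}
```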
