What version of Go are you using (go version)?
CL 387975 (patch set 5)
$ go version
go version devel go1.18-c217a17823 Tue Mar 1 10:12:02 2022 -0800 linux/amd64
Does this issue reproduce with the latest release?
n/a
What operating system and processor architecture are you using (go env)?
- amd64 (AMD EPYC)
- Debian 11 bullseye (stock GCE image; some light checking of similar behavior on Ubuntu 20.04 LTS)
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/thepudds1460/.cache/go-build"
GOENV="/home/thepudds1460/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/thepudds1460/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/thepudds1460/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/thepudds1460/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/thepudds1460/sdk/gotip/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.18-c217a17823 Tue Mar 1 10:12:02 2022 -0800"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/thepudds1460/arena-benchmark-wip/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1871014333=/tmp/go-build -gno-record-gcc-switches"
What did you do?
I was interested in understanding the performance described in the arena proposal (#51317), so I ran a few benchmarks using the prototype arena implementation in CL 387975 (patch set 5).
I didn't look very much at the interface{}/reflect based API, and instead started by adding a simple generic API for arena.NewOf[T any](a *Arena) *T.
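For illustration, here is a minimal sketch of how a generic `NewOf` could wrap a reflect-based allocation method. The `Arena` type below is a hypothetical stand-in that allocates from the ordinary Go heap; it is not the implementation from CL 387975, and the `New(typ reflect.Type) interface{}` method shape is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// Arena is a hypothetical stand-in for illustration only. It allocates
// from the normal Go heap and just counts allocations.
type Arena struct{ allocs int }

// New mimics a reflect-based arena API (assumed shape): it returns a
// pointer to a new zero value of typ, boxed in an interface.
func (a *Arena) New(typ reflect.Type) interface{} {
	a.allocs++
	return reflect.New(typ).Interface()
}

// NewOf is the generic convenience wrapper: it derives the reflect.Type
// from T and converts the result back to *T.
func NewOf[T any](a *Arena) *T {
	typ := reflect.TypeOf((*T)(nil)).Elem()
	return a.New(typ).(*T)
}

// node is a binary tree node like the one a tree benchmark would allocate.
type node struct{ left, right *node }

func main() {
	a := &Arena{}
	n := NewOf[node](a)
	n.left = NewOf[node](a)
	fmt.Println(n.left != nil, n.right == nil, a.allocs) // true true 2
}
```

The wrapper keeps call sites free of reflect and type assertions, which is the main reason to prefer it in benchmark code.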
Looking at the initial performance on this particular benchmark (links below), some things that jumped out:
- the cost of first writing to newly allocated memory in the user code seemed to dominate the overall time in these benchmarks.
- high system time.
- the kernel was spending time in places like handle_mm_fault, flush_tlb_mm_range, and friends.
- many minor page faults.
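One way to observe the minor page fault count from inside a Go program is `syscall.Getrusage`. The sketch below is not part of the benchmark; the helper names `minorFaults` and `touchPages` are made up for illustration, and it assumes Linux.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// minorFaults returns the number of minor page faults the current
// process has taken so far, as reported by getrusage(2).
func minorFaults() (int64, error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, err
	}
	return int64(ru.Minflt), nil
}

// touchPages writes one byte per page so the kernel has to back each
// page with physical memory.
func touchPages(buf []byte, pageSize int) {
	for i := 0; i < len(buf); i += pageSize {
		buf[i] = 1
	}
}

func main() {
	before, err := minorFaults()
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}
	buf := make([]byte, 64<<20)
	touchPages(buf, os.Getpagesize())
	after, err := minorFaults()
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}
	fmt.Printf("minor page faults while touching %d MB: %d\n", len(buf)>>20, after-before)
}
```

The count depends on whether the memory ends up backed by 4 KB pages or huge pages, so the number printed will vary by machine and configuration.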
I poked at it a few different ways, and concluded the arenas in the benchmark were not getting huge pages. I checked some OS-level settings, but they didn't seem to help (e.g., /sys/kernel/mm/transparent_hugepage/enabled defaulted to always, and /sys/kernel/mm/transparent_hugepage/defrag defaulted to madvise, but changing it to always didn't help).
I then built up a simple C program that does a similar series of mmap(... PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...), mmap(... PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, ...), madvise(... MADV_HUGEPAGE) on 64 MB chunks at a time in a light attempt to emulate syscalls done by the unmodified arena code. The C program seemed to get "correct" behavior of huge pages. I also used strace to contrast syscalls by the Go runtime vs. glibc malloc (which also does mmaps under the covers and ends up "correctly" with huge pages on this machine).
Based on that, I tried a few modifications to the Go runtime, with the main results below. The largest improvement was forcing huge pages by doing memclr within the runtime on each new 2MB piece of the 64 MB arena chunks.
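As a user-level analogy of that change (the real patch calls memclrNoHeapPointers inside the runtime, which is not shown here), clearing a 64 MB chunk one 2 MB piece at a time looks roughly like this. The function name `clearInPieces` is hypothetical, and whether the kernel responds with huge pages depends on alignment and THP settings.

```go
package main

import "fmt"

const (
	chunkSize = 64 << 20 // size of one arena chunk in the text
	pieceSize = 2 << 20  // huge page size on amd64 Linux
)

// clearInPieces zeroes buf one piece at a time and returns how many
// pieces were cleared. Writing to every byte forces the memory to be
// committed piece by piece instead of one 4 KB fault at a time later.
func clearInPieces(buf []byte, piece int) int {
	pieces := 0
	for off := 0; off < len(buf); off += piece {
		end := off + piece
		if end > len(buf) {
			end = len(buf)
		}
		p := buf[off:end]
		for i := range p { // the compiler recognizes this zeroing loop
			p[i] = 0
		}
		pieces++
	}
	return pieces
}

func main() {
	chunk := make([]byte, chunkSize)
	fmt.Println(clearInPieces(chunk, pieceSize)) // 32
}
```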
Some heavy caveats: these were quick YOLO changes to poke at the performance I was observing, and probably not the actual changes you would want 😅. Of course, all of this might be a red herring, an OS config issue, user error, or something else entirely...
What did you see?
Here is a summary of the main performance results, all with GOMAXPROCS=8:
Baseline: no arenas
- Benchmark -- this is the baseline benchmark, which is currently the fastest Go entry for the binarytree benchmark from the Benchmarks Game site.
With unmodified go (no arenas):
$ go install github.com/thepudds/arena-performance/cmd/binarytree-original@f595de77
$ binarytree-original 21 # 21 controls the size of the trees
5.12 sec wall clock
34.53 sec user
5.50 sec system
40.03 sec total cpu
537 MB RSS
76,886 minor page faults
Runtime patch 1: add arenas using arena.NewOf[T any]
- Go patch -- add a simple implementation of arena.NewOf[T any](a *Arena) *T. No other changes.
- Benchmark -- change original benchmark to use arenas via the generic API.
With modified go:
$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
12.50 sec wall clock (144% increase from baseline)
20.80 sec user
77.92 sec system
98.72 sec total cpu (147% increase from baseline)
750 MB RSS (40% increase from baseline)
2,334,682 minor page faults
Runtime patch 2: memclr 2MB pieces of arena chunks prior to allowing use
- Go patch -- also add calls to memclrNoHeapPointers within runtime/arena.go and reflect/arena.go to help commit memory in 2MB pieces, which seems to result in huge pages from the kernel. There is very likely a better change than this, but this seemed to help substantially and at least suggests huge pages have an impact.
- Benchmark -- same as prior
With modified go:
$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
2.55 sec wall clock (50% reduction from baseline)
19.34 sec user
0.89 sec system
20.23 sec total cpu (49% reduction from baseline)
765 MB RSS (42% increase from baseline)
14,492 minor page faults
Runtime patch 3: unmap chunk once >8 MB is used
- Go patch -- also change reflect/arena.go so that arena Free unmaps a chunk once >8 MB is used (rather than waiting for full 64MB)
- Benchmark -- same as prior
With modified go:
$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
2.71 sec wall clock (47% reduction from baseline)
20.11 sec user
0.97 sec system
21.08 sec total cpu (47% reduction from baseline)
312 MB RSS (42% reduction from baseline)
13,980 minor page faults
Sample benchmark output
The benchmark creates a small number of large binary trees and a large number of small binary trees, and also walks each tree to count its nodes. Here is sample output from binarytree-arena 21. Larger values will create more and larger trees.
$ binarytree-arena 21
stretch tree of depth 22 arenas: 1 nodes: 8388607 MB: 128.0
2097152 trees of depth 4 arenas: 992 nodes: 65011712 MB: 992.0
524288 trees of depth 6 arenas: 1015 nodes: 66584576 MB: 1016.0
131072 trees of depth 8 arenas: 1017 nodes: 66977792 MB: 1022.0
32768 trees of depth 10 arenas: 993 nodes: 67076096 MB: 1023.5
8192 trees of depth 12 arenas: 911 nodes: 67100672 MB: 1023.9
2048 trees of depth 14 arenas: 683 nodes: 67106816 MB: 1024.0
512 trees of depth 16 arenas: 512 nodes: 67108352 MB: 1024.0
128 trees of depth 18 arenas: 128 nodes: 67108736 MB: 1024.0
32 trees of depth 20 arenas: 32 nodes: 67108832 MB: 1024.0
long lived tree of depth 21 arenas: 1 nodes: 4194303 MB: 64.0
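The node and MB columns can be cross-checked with a little arithmetic. The sketch below assumes each node is 16 bytes (two 8-byte pointers on amd64) and that the tree count at each depth is 2^(n - depth + 4); both are inferred from the sample output rather than taken from the benchmark source.

```go
package main

import "fmt"

const nodeBytes = 16 // assumption: two 8-byte child pointers per node

// nodesPerTree returns the node count of a complete binary tree of the
// given depth: 2^(depth+1) - 1.
func nodesPerTree(depth int) int {
	return 1<<(depth+1) - 1
}

// treesAt returns how many trees of the given depth are built for
// argument n, as inferred from the sample output.
func treesAt(n, depth int) int {
	return 1 << (n - depth + 4)
}

// megabytes converts a node count to MB using the assumed node size.
func megabytes(nodes int) float64 {
	return float64(nodes*nodeBytes) / (1 << 20)
}

func main() {
	stretch := nodesPerTree(22)
	fmt.Printf("stretch tree: nodes: %d MB: %.1f\n", stretch, megabytes(stretch))
	// prints "stretch tree: nodes: 8388607 MB: 128.0"

	trees := treesAt(21, 4)
	total := trees * nodesPerTree(4)
	fmt.Printf("%d trees of depth 4: nodes: %d MB: %.1f\n", trees, total, megabytes(total))
	// prints "2097152 trees of depth 4: nodes: 65011712 MB: 992.0"
}
```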