arena: possible performance improvements: huge pages, free approach #51667

@thepudds

What version of Go are you using (go version)?

CL 387975 (patch set 5)

$ go version
go version devel go1.18-c217a17823 Tue Mar 1 10:12:02 2022 -0800 linux/amd64

Does this issue reproduce with the latest release?

n/a

What operating system and processor architecture are you using (go env)?

  • amd64 (AMD EPYC)
  • Debian 11 bullseye (stock GCE image; some light checking of similar behavior on Ubuntu 20.04 LTS)
go env Output
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/thepudds1460/.cache/go-build"
GOENV="/home/thepudds1460/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/thepudds1460/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/thepudds1460/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/thepudds1460/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/thepudds1460/sdk/gotip/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.18-c217a17823 Tue Mar 1 10:12:02 2022 -0800"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/thepudds1460/arena-benchmark-wip/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1871014333=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I was interested in understanding the performance described in the arena proposal (#51317), so I ran a few benchmarks using the prototype arena implementation in CL 387975 (patch set 5).

I didn't look very much at the interface{}/reflect based API, and instead started by adding a simple generic API for arena.NewOf[T any](a *Arena) *T.
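As a rough illustration of what such a generic wrapper can look like, here is a minimal sketch. The Arena type and its New method below are hypothetical stand-ins (they just heap-allocate via reflect.New and count calls); they are not the prototype's implementation, and only the NewOf signature comes from the text above.

```go
package main

import (
	"fmt"
	"reflect"
)

// Arena is a hypothetical stand-in for the prototype's arena type.
// It does not manage memory itself; it only counts allocations.
type Arena struct {
	allocs int
}

// New stands in for a reflect-based allocation call: it takes a type and
// returns a pointer to a zeroed value of that type wrapped in an interface{}.
// In this sketch it simply heap-allocates with reflect.New.
func (a *Arena) New(t reflect.Type) interface{} {
	a.allocs++
	return reflect.New(t).Interface()
}

// NewOf is a thin generic wrapper that hides the reflect.Type and the
// type assertion from the caller.
func NewOf[T any](a *Arena) *T {
	t := reflect.TypeOf((*T)(nil)).Elem()
	return a.New(t).(*T)
}

// Node is a binary tree node like the one a tree benchmark would allocate.
type Node struct{ Left, Right *Node }

func main() {
	a := &Arena{}
	root := NewOf[Node](a)
	root.Left = NewOf[Node](a)
	root.Right = NewOf[Node](a)
	fmt.Println(a.allocs, root.Left != nil, root.Right != nil)
}
```

The point of the wrapper is ergonomic: call sites read NewOf[Node](a) and get a typed *Node back without any casts.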

Looking at the initial performance on this particular benchmark (links below), some things that jumped out:

  • the cost of first writing to newly allocated memory in the user code seemed to dominate the overall time in these benchmarks.
  • high system time.
  • the kernel was spending time in places like handle_mm_fault, flush_tlb_mm_range, and friends.
  • many minor page faults.

I poked at it a few different ways, and concluded the arenas in the benchmark were not getting huge pages. I checked some OS-level settings that didn't seem to help (e.g., /sys/kernel/mm/transparent_hugepage/enabled was defaulted to always, /sys/kernel/mm/transparent_hugepage/defrag defaulted to madvise but changing to always didn't help).
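The two sysfs files mentioned above list every allowed value and mark the active one with square brackets (for example "always [madvise] never"). A minimal sketch for reading them follows; activeTHPMode is a hypothetical helper name, and the program just reports an error on systems where the files do not exist.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// activeTHPMode extracts the bracketed (active) value from the contents of a
// transparent_hugepage sysfs file, e.g. "always [madvise] never" -> "madvise".
func activeTHPMode(contents string) (string, bool) {
	start := strings.IndexByte(contents, '[')
	end := strings.IndexByte(contents, ']')
	if start < 0 || end < start {
		return "", false
	}
	return contents[start+1 : end], true
}

func main() {
	for _, name := range []string{"enabled", "defrag"} {
		path := "/sys/kernel/mm/transparent_hugepage/" + name
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		if mode, ok := activeTHPMode(string(data)); ok {
			fmt.Printf("%s: %s\n", path, mode)
		} else {
			fmt.Printf("%s: unrecognized contents %q\n", path, data)
		}
	}
}
```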

I then built up a simple C program that does a similar series of mmap(... PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...), mmap(... PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, ...), madvise(... MADV_HUGEPAGE) on 64 MB chunks at a time in a light attempt to emulate syscalls done by the unmodified arena code. The C program seemed to get "correct" behavior of huge pages. I also used strace to contrast syscalls by the Go runtime vs. glibc malloc (which also does mmaps under the covers and ends up "correctly" with huge pages on this machine).

Based on that, I tried a few modifications to the Go runtime, with the main results below. The largest improvement was forcing huge pages by doing memclr within the runtime on each new 2MB piece of the 64 MB arena chunks.
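To make the idea concrete, here is a user-space sketch of clearing a chunk in 2 MB pieces just before each piece is first handed out. It is only an analogy: the actual experiment called memclrNoHeapPointers inside the runtime, whereas this sketch uses an ordinary byte slice and a plain zeroing loop, and the chunk type with its used and committed fields is hypothetical.

```go
package main

import "fmt"

const (
	pieceSize = 2 << 20  // 2 MB, the huge page size on amd64
	chunkSize = 64 << 20 // 64 MB arena chunk
)

// chunk is a hypothetical bump allocator over one arena chunk.
type chunk struct {
	buf       []byte
	used      int // bytes handed out so far
	committed int // bytes already cleared, always a multiple of pieceSize or len(buf)
}

// alloc hands out n bytes. Before returning memory from a piece that has not
// been used yet, it zeroes that whole piece in one pass, so the kernel sees
// the full 2 MB range written at once instead of one 4 KB page at a time.
func (c *chunk) alloc(n int) []byte {
	if n < 0 || c.used+n > len(c.buf) {
		return nil // does not fit in this chunk
	}
	for c.committed < c.used+n {
		end := c.committed + pieceSize
		if end > len(c.buf) {
			end = len(c.buf)
		}
		piece := c.buf[c.committed:end]
		for i := range piece {
			piece[i] = 0
		}
		c.committed = end
	}
	b := c.buf[c.used : c.used+n : c.used+n]
	c.used += n
	return b
}

func main() {
	c := &chunk{buf: make([]byte, chunkSize)}
	c.alloc(16)
	fmt.Println("committed MB after small alloc:", c.committed>>20)
	c.alloc(3 << 20)
	fmt.Println("committed MB after 3 MB alloc:", c.committed>>20)
}
```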

Some heavy caveats: these were quick YOLO changes to poke at the performance I was observing, and probably not the actual changes you would want 😅. Of course, all of this might be a red herring, an OS config issue, user error, or something else entirely...

What did you see?

Here is a summary of main performance results, all with GOMAXPROCS=8:

Baseline: no arenas

  • Benchmark -- this is the baseline benchmark, which is currently the fastest Go entry for the binarytree benchmark on the Benchmarks Game site.

With unmodified go (no arenas):

$ go install github.com/thepudds/arena-performance/cmd/binarytree-original@f595de77
$ binarytree-original 21    # 21 controls the size of the trees
     5.12 sec wall clock
    34.53 sec user
     5.50 sec system
    40.03 sec total cpu 
      537 MB RSS
   76,886 minor page faults

Runtime patch 1: add arenas using arena.NewOf[T any]

  • Go patch -- add simple implementation of arena.NewOf[T any](a *Arena) *T. No other changes.
  • Benchmark -- change original benchmark to use arenas via the generic API.

With modified go:

$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
    12.50 sec wall clock   (144% increase from baseline)
    20.80 sec user
    77.92 sec system
    98.72 sec total cpu    (147% increase from baseline)
      750 MB RSS           (40% increase from baseline)
2,334,682 minor page faults

Runtime patch 2: memclr 2MB pieces of arena chunks prior to allowing use

  • Go patch -- also add calls to memclrNoHeapPointers within runtime/arena.go and reflect/arena.go to help commit memory in 2MB pieces, which seems to result in huge pages from the kernel. There is very likely a better change than this, but this seemed to help substantially and at least suggests huge pages have an impact.
  • Benchmark -- same as prior

With modified go:

$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
     2.55 sec wall clock   (50% reduction from baseline)
    19.34 sec user
     0.89 sec system
    20.23 sec total cpu    (49% reduction from baseline)
      765 MB RSS           (42% increase from baseline)
   14,492 minor page faults

Runtime patch 3: unmap chunk once >8 MB is used

  • Go patch -- also change reflect/arena.go so that arena Free unmaps a chunk once >8 MB is used (rather than waiting for full 64MB)
  • Benchmark -- same as prior
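One plausible reading of this threshold policy is sketched below. It is not taken from the patch: the model assumes that successive small arenas are carved out of one shared 64 MB chunk and that Free unmaps the chunk as soon as the bytes used exceed a limit. The function name arenasPerChunk and the 1 MB arena size are assumptions for illustration only.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB arena chunk

// arenasPerChunk models how many arenas of arenaSize bytes are served from a
// single chunk before that chunk is unmapped. The chunk is unmapped either
// when the bytes used exceed limit or when the next arena no longer fits.
func arenasPerChunk(arenaSize, limit int) int {
	used, n := 0, 0
	for used+arenaSize <= chunkSize {
		used += arenaSize
		n++
		if used > limit {
			break // Free unmaps the chunk at this point in the model
		}
	}
	return n
}

func main() {
	const arenaSize = 1 << 20 // hypothetical 1 MB arenas
	fmt.Println("arenas per chunk, unmap once >8 MB used:", arenasPerChunk(arenaSize, 8<<20))
	fmt.Println("arenas per chunk, unmap only when full: ", arenasPerChunk(arenaSize, chunkSize))
}
```

Under this model a lower limit means less touched memory stays mapped per chunk at any moment, which is consistent with the lower RSS reported for this patch, at the cost of more frequent unmap calls.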

With modified go:

$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
     2.71 sec wall clock   (47% reduction from baseline)
    20.11 sec user
     0.97 sec system
    21.08 sec total cpu    (47% reduction from baseline)
      312 MB RSS           (42% reduction from baseline)
   13,980 minor page faults

Sample benchmark output

The benchmark creates a small number of large binary trees and a large number of small binary trees, and also walks each tree to count its nodes. Here is sample output from binarytree-arena 21. Larger values will create more and larger trees.

$ binarytree-arena 21

   stretch tree of depth 22       arenas: 1      nodes: 8388607    MB: 128.0
  2097152 trees of depth 4        arenas: 992    nodes: 65011712   MB: 992.0
   524288 trees of depth 6        arenas: 1015   nodes: 66584576   MB: 1016.0
   131072 trees of depth 8        arenas: 1017   nodes: 66977792   MB: 1022.0
    32768 trees of depth 10       arenas: 993    nodes: 67076096   MB: 1023.5
     8192 trees of depth 12       arenas: 911    nodes: 67100672   MB: 1023.9
     2048 trees of depth 14       arenas: 683    nodes: 67106816   MB: 1024.0
      512 trees of depth 16       arenas: 512    nodes: 67108352   MB: 1024.0
      128 trees of depth 18       arenas: 128    nodes: 67108736   MB: 1024.0
       32 trees of depth 20       arenas: 32     nodes: 67108832   MB: 1024.0
long lived tree of depth 21       arenas: 1      nodes: 4194303    MB: 64.0
