Description
What version of Go are you using (go version)?
$ go version
tip version, on arm64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/xiaji01/.cache/go-build"
GOENV="/home/xiaji01/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/xiaji01/.go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xiaji01/.go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/xiaji01/src/go.gc"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/xiaji01/src/go.gc/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/xiaji01/src/go.gc/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build944660995=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Live-lock problems were spotted on arm64 in a couple of micro-benchmarks from text/tabwriter (there are likely more similar cases).
If a goroutine is running time-consuming nosplit functions, such as
bulkBarrierPreWriteSrcOnly
memmove
which are called from growslice when the slice is large, and a GC background worker tries to suspend it for stack scanning, the goroutine under preemption can hardly make progress, especially on arm64 machines, which seem to have poor signal-handling performance.
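For context, here is a minimal sketch of the kind of workload involved (the actual reproducer is text/tabwriter's BenchmarkTable; this program is only an illustration):

```go
// Illustration only: repeatedly growing a large slice whose elements contain
// pointers. For such element types, runtime.growslice calls
// bulkBarrierPreWriteSrcOnly and memmove to copy the old elements, and while
// the goroutine is inside those nosplit functions, a GC worker in suspendG
// keeps retrying its preemption request.
package main

func main() {
	type row struct{ cells []string }
	var rows []row
	for i := 0; i < 1_000_000; i++ {
		rows = append(rows, row{cells: make([]string, 1)})
	}
	_ = rows
}
```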
pprof data of BenchmarkTable/1x100000/new on an arm64 machine:
(pprof) top
Showing nodes accounting for 12.46s, 95.70% of 13.02s total
Dropped 118 nodes (cum <= 0.07s)
Showing top 10 nodes out of 40
flat flat% sum% cum cum%
5.41s 41.55% 41.55% 5.41s 41.55% runtime.futex
3.22s 24.73% 66.28% 3.59s 27.57% runtime.nanotime (inline)
0.99s 7.60% 73.89% 0.99s 7.60% runtime.tgkill
0.84s 6.45% 80.34% 0.84s 6.45% runtime.osyield
0.77s 5.91% 86.25% 0.77s 5.91% runtime.epollwait
0.37s 2.84% 89.09% 0.37s 2.84% runtime.nanotime1
0.33s 2.53% 91.63% 6.38s 49.00% runtime.suspendG
0.30s 2.30% 93.93% 0.30s 2.30% runtime.getpid
0.14s 1.08% 95.01% 0.14s 1.08% runtime.procyield
0.09s 0.69% 95.70% 1.38s 10.60% runtime.preemptM (inline)
I'm thinking of introducing a flag on the G to indicate whether it is running a time-consuming nosplit function, and letting suspendG yield to avoid the live-lock. The flag is set manually for known functions; I tried it for bulkBarrierPreWriteSrcOnly and memmove in runtime.growslice and runtime.makeslicecopy when the slice is longer than a threshold (4K for now). With the change, the text/tabwriter package saw significant improvement on arm64; a rough sketch of the idea is shown below, followed by the benchmark results.
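Userspace-only analogue of the proposal (none of these names, e.g. inLongNosplit, exist in the runtime; they are made up for illustration). It shows the intended cooperation: the goroutine doing the large copy sets a flag around the long nosplit work, and the suspender yields while the flag is set instead of re-sending preemption requests.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	var inLongNosplit atomic.Bool // hypothetical per-G flag, set by growslice/makeslicecopy
	inLongNosplit.Store(true)     // set before starting the large copy
	done := make(chan struct{})

	// Stand-in for the goroutine growing a large slice.
	go func() {
		defer close(done)
		time.Sleep(50 * time.Millisecond) // stand-in for bulkBarrierPreWriteSrcOnly + memmove
		inLongNosplit.Store(false)        // cleared once the copy finishes
	}()

	// Stand-in for suspendG: back off while the target advertises a long
	// nosplit region, instead of burning time in tgkill/osyield/nanotime.
	for inLongNosplit.Load() {
		runtime.Gosched() // analogous to osyield in suspendG
	}
	fmt.Println("flag cleared; the preemption request would be sent now")
	<-done
}
```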
arm64-1 perf:
name old time/op new time/op delta
Table/1x10/new-224 6.59µs ± 8% 6.67µs ± 7% ~ (p=0.841 n=5+5)
Table/1x10/reuse-224 1.94µs ± 1% 1.94µs ± 1% ~ (p=0.802 n=5+5)
Table/1x1000/new-224 494µs ±13% 381µs ± 3% -22.85% (p=0.008 n=5+5)
Table/1x1000/reuse-224 185µs ± 0% 185µs ± 0% -0.37% (p=0.016 n=5+4)
Table/1x100000/new-224 2.73s ±73% 0.04s ± 2% -98.46% (p=0.008 n=5+5)
Table/1x100000/reuse-224 4.25s ±76% 0.02s ± 1% -99.55% (p=0.008 n=5+5)
Table/10x10/new-224 20.7µs ± 6% 20.1µs ± 5% ~ (p=0.421 n=5+5)
Table/10x10/reuse-224 8.89µs ± 0% 8.91µs ± 0% +0.21% (p=0.032 n=5+5)
Table/10x1000/new-224 1.70ms ± 7% 1.59ms ± 6% ~ (p=0.151 n=5+5)
Table/10x1000/reuse-224 908µs ± 0% 902µs ± 0% -0.59% (p=0.032 n=5+5)
Table/10x100000/new-224 2.34s ±62% 0.14s ± 2% -94.21% (p=0.008 n=5+5)
Table/10x100000/reuse-224 911ms ±72% 103ms ± 2% -88.66% (p=0.008 n=5+5)
Table/100x10/new-224 143µs ±15% 138µs ± 5% ~ (p=0.841 n=5+5)
Table/100x10/reuse-224 78.1µs ± 0% 78.1µs ± 0% ~ (p=0.841 n=5+5)
Table/100x1000/new-224 13.1ms ± 6% 11.9ms ± 5% -9.16% (p=0.008 n=5+5)
Table/100x1000/reuse-224 8.13ms ± 0% 8.15ms ± 1% ~ (p=1.000 n=5+5)
Table/100x100000/new-224 1.29s ±17% 1.31s ±10% ~ (p=0.310 n=5+5)
Table/100x100000/reuse-224 1.23s ± 2% 1.26s ± 5% ~ (p=0.286 n=4+5)
arm64-2 perf:
name old time/op new time/op delta
Table/1x10/new-64 4.33µs ± 1% 5.40µs ± 4% +24.92% (p=0.004 n=6+5)
Table/1x10/reuse-64 1.77µs ± 0% 1.77µs ± 0% -0.36% (p=0.024 n=6+6)
Table/1x1000/new-64 318µs ±10% 379µs ±10% +19.04% (p=0.008 n=5+5)
Table/1x1000/reuse-64 172µs ± 0% 171µs ± 0% ~ (p=0.537 n=6+5)
Table/1x100000/new-64 4.26s ±88% 0.05s ± 4% -98.77% (p=0.002 n=6+6)
Table/1x100000/reuse-64 4.79s ±73% 0.02s ± 2% -99.61% (p=0.004 n=6+5)
Table/10x10/new-64 14.6µs ± 3% 16.4µs ± 9% +12.51% (p=0.004 n=5+6)
Table/10x10/reuse-64 8.76µs ± 0% 8.79µs ± 0% ~ (p=0.329 n=5+6)
Table/10x1000/new-64 1.18ms ± 3% 1.32ms ± 4% +12.12% (p=0.002 n=6+6)
Table/10x1000/reuse-64 890µs ± 0% 897µs ± 0% +0.84% (p=0.002 n=6+6)
Table/10x100000/new-64 1.43s ±36% 0.16s ± 3% -88.56% (p=0.004 n=5+6)
Table/10x100000/reuse-64 375ms ±55% 138ms ±11% -63.31% (p=0.004 n=5+6)
Table/100x10/new-64 103µs ± 0% 105µs ± 4% ~ (p=0.429 n=5+6)
Table/100x10/reuse-64 79.6µs ± 0% 80.0µs ± 1% ~ (p=0.329 n=5+6)
Table/100x1000/new-64 10.2ms ± 3% 10.5ms ± 1% ~ (p=0.052 n=6+5)
Table/100x1000/reuse-64 8.75ms ± 0% 9.23ms ± 6% ~ (p=0.126 n=5+6)
Table/100x100000/new-64 1.42s ± 6% 1.46s ±11% ~ (p=0.394 n=6+6)
Table/100x100000/reuse-64 1.40s ± 0% 1.45s ± 6% ~ (p=0.690 n=5+5)
Pyramid/10-64 15.3µs ±45% 14.5µs ±20% ~ (p=0.662 n=6+5)
Pyramid/100-64 1.16ms ±13% 0.88ms ±11% -23.90% (p=0.002 n=6+6)
Pyramid/1000-64 71.6ms ± 8% 77.8ms ±12% ~ (p=0.056 n=5+5)
Ragged/10-64 13.2µs ±17% 14.0µs ± 2% ~ (p=0.792 n=6+5)
Ragged/100-64 104µs ± 3% 123µs ± 2% +18.21% (p=0.008 n=5+5)
Ragged/1000-64 1.12ms ±12% 1.35ms ± 9% +20.55% (p=0.002 n=6+6)
Code-64 3.51µs ± 1% 3.80µs ± 3% +8.46% (p=0.004 n=5+6)
x86 perf:
name old time/op new time/op delta
Table/1x10/new-32 5.36µs ± 2% 5.58µs ± 3% +4.03% (p=0.004 n=6+6)
Table/1x10/reuse-32 1.50µs ± 8% 1.48µs ± 0% ~ (p=0.433 n=6+5)
Table/1x1000/new-32 318µs ± 2% 355µs ± 2% +11.43% (p=0.004 n=5+6)
Table/1x1000/reuse-32 127µs ± 0% 134µs ± 1% +5.36% (p=0.004 n=6+5)
Table/1x100000/new-32 49.9ms ± 3% 48.1ms ± 3% -3.57% (p=0.026 n=6+6)
Table/1x100000/reuse-32 13.8ms ± 1% 15.2ms ± 1% +9.87% (p=0.002 n=6+6)
Table/10x10/new-32 17.4µs ± 3% 17.6µs ± 2% ~ (p=0.310 n=6+6)
Table/10x10/reuse-32 7.20µs ± 1% 7.21µs ± 0% ~ (p=0.429 n=5+6)
Table/10x1000/new-32 1.38ms ± 1% 1.43ms ± 2% +4.16% (p=0.002 n=6+6)
Table/10x1000/reuse-32 687µs ± 1% 693µs ± 0% ~ (p=0.052 n=6+5)
Table/10x100000/new-32 131ms ± 2% 133ms ± 5% ~ (p=0.699 n=6+6)
Table/10x100000/reuse-32 89.2ms ± 2% 90.3ms ± 2% ~ (p=0.177 n=5+6)
Table/100x10/new-32 122µs ± 1% 122µs ± 1% ~ (p=0.632 n=6+5)
Table/100x10/reuse-32 62.3µs ± 0% 62.4µs ± 0% ~ (p=0.429 n=5+6)
Table/100x1000/new-32 12.2ms ± 4% 12.0ms ± 3% ~ (p=0.180 n=6+6)
Table/100x1000/reuse-32 6.29ms ± 0% 6.32ms ± 0% +0.54% (p=0.015 n=6+6)
Table/100x100000/new-32 1.01s ± 2% 1.00s ± 4% ~ (p=0.429 n=6+5)
Table/100x100000/reuse-32 972ms ±10% 962ms ±18% ~ (p=1.000 n=6+6)
Pyramid/10-32 14.3µs ± 4% 14.8µs ± 6% ~ (p=0.240 n=6+6)
Pyramid/100-32 833µs ± 2% 841µs ± 2% ~ (p=0.394 n=6+6)
Pyramid/1000-32 53.5ms ± 1% 56.2ms ± 3% +4.98% (p=0.004 n=5+6)
Ragged/10-32 15.1µs ± 0% 14.8µs ± 1% -1.75% (p=0.008 n=5+5)
Ragged/100-32 127µs ± 4% 130µs ± 2% ~ (p=0.180 n=6+6)
Ragged/1000-32 1.27ms ± 5% 1.33ms ± 3% +3.96% (p=0.026 n=6+6)
Code-32 3.75µs ± 1% 3.81µs ± 2% +1.67% (p=0.048 n=5+6)
The potential overhead is acquiring the current G in the two slice functions, which seems to be tiny.
I'm working on benchmarking more packages and evaluating the impact on x86.
Any comments are highly appreciated.
What did you expect to see?
Better performance.
What did you see instead?
Live-lock on arm64: the large-slice tabwriter benchmarks take seconds instead of tens of milliseconds, with most of the time spent in runtime.suspendG and preemption-related functions (futex, tgkill, osyield), as shown in the pprof data above.