runtime: frequent enlisting of short-lived background workers leads to performance regression with async preemption #37465

Open
klauspost opened this issue Feb 26, 2020 · 32 comments

@klauspost klauspost commented Feb 26, 2020

What version of Go are you using (go version)?

λ go version
go version go1.14 windows/amd64

λ go version
go version go1.13.8 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
λ go env                                                                                                                                       
set GO111MODULE=                                                                                                                               
set GOARCH=amd64                                                                                                                               
set GOBIN=                                                                                                                                     
set GOCACHE=C:\Users\klaus\AppData\Local\go-build                                                                                              
set GOENV=C:\Users\klaus\AppData\Roaming\go\env                                                                                                
set GOEXE=.exe                                                                                                                                 
set GOFLAGS=                                                                                                                                   
set GOHOSTARCH=amd64                                                                                                                           
set GOHOSTOS=windows                                                                                                                           
set GONOPROXY=                                                                                                                                 
set GONOSUMDB=                                                                                                                                 
set GOOS=windows                                                                                                                               
set GOPATH=e:\gopath                                                                                                                           
set GOPRIVATE=                                                                                                                                 
set GOPROXY=https://goproxy.io                                                                                                                 
set GOROOT=c:\go                                                                                                                               
set GOSUMDB=sum.golang.org                                                                                                                     
set GOTMPDIR=                                                                                                                                  
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64                                                                                                     
set GCCGO=gccgo                                                                                                                                
set AR=ar                                                                                                                                      
set CC=gcc                                                                                                                                     
set CXX=g++                                                                                                                                    
set CGO_ENABLED=1                                                                                                                              
set GOMOD=                                                                                                                                     
set CGO_CFLAGS=-g -O2                                                                                                                          
set CGO_CPPFLAGS=                                                                                                                              
set CGO_CXXFLAGS=-g -O2                                                                                                                        
set CGO_FFLAGS=-g -O2                                                                                                                          
set CGO_LDFLAGS=-g -O2                                                                                                                         
set PKG_CONFIG=pkg-config                                                                                                                      
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=d:\temp\wintemp\go-build155272862=/tmp/go-build -gno-record-gcc-switches   

What did you do?

Run benchmark: https://play.golang.org/p/WeuJg6yaOuJ

Tested with: go test -bench=. -test.benchtime=10s

What did you expect to see?

Close or similar benchmark speeds between Go 1.13 and Go 1.14.

What did you see instead?

40% performance regression.

λ benchcmp go113.txt go114.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-32     87026         121741        +39.89%
BenchmarkCompressAllocationsSingle/gzip-32      88654         122632        +38.33%

This is not a purely theoretical benchmark. While suboptimal, this is the easiest way to compress a piece of data, so this will be seen in the wild. It could also indicate a general regression for applications allocating a lot.

Edit: This is not related to changes in the referenced packages. Seeing this when using packages outside the stdlib as well.
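
The full benchmark is behind the playground link above. As a rough, self-contained sketch of the pattern it measures — a fresh writer and output buffer allocated per iteration — the following is illustrative only (sizes and names are stand-ins, not the exact playground code):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"testing"
)

// compressOnce allocates a fresh buffer and gzip.Writer per call — the
// allocation-heavy pattern the benchmark exercises.
func compressOnce(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Len()
}

func main() {
	data := make([]byte, 800<<10) // ~800 KiB per op, enough to keep the GC busy
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			compressOnce(data)
		}
	})
	fmt.Println(res)
}
```

Compression itself is trivial here; the point is the per-iteration allocation churn.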

@klauspost klauspost commented Feb 26, 2020

Somebody proposed removing 'ReportAllocs'. It has an impact, but not a very big one:

λ benchcmp go113-no.txt go114-no.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-32     88082         117794        +33.73%
BenchmarkCompressAllocationsSingle/gzip-32      89279         118946        +33.23%

@klauspost klauspost commented Feb 26, 2020

Turning off DEP seems to reduce the impact, but still considerably slower:

λ bcdedit.exe /set {current} nx AlwaysOff
Reboot, then confirm the setting with λ wmic OS Get DataExecutionPrevention_SupportPolicy (reports 0).

λ benchcmp go113-depoff.txt go114-depoff.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-32     87619         107249        +22.40%
BenchmarkCompressAllocationsSingle/gzip-32      89344         107670        +20.51%

This actually matches my initial measurements before isolating the issue, so I am unsure if this is just some variance based on system state.

@cagedmantis cagedmantis changed the title go 1.14: Big allocation speed regression runtime: big allocation speed regression in go1.14 Feb 26, 2020
@bcmills bcmills added the Performance label Feb 26, 2020
@cagedmantis cagedmantis added this to the Backlog milestone Feb 26, 2020
@mknyszek mknyszek self-assigned this Feb 26, 2020

@mknyszek mknyszek commented Feb 26, 2020

The regression in the benchmark is reproducible on Linux (but to a lesser extent).

@mknyszek mknyszek commented Feb 26, 2020

Or... hm. I spoke too soon. Five runs show that the benchmark is pretty noisy on Linux, and it's not obvious that there's a regression.

name                                old time/op    new time/op    delta
CompressAllocationsSingle/flate-48     182µs ±12%     169µs ± 4%    ~     (p=0.151 n=5+5)
CompressAllocationsSingle/gzip-48      170µs ± 7%     184µs ± 9%    ~     (p=0.056 n=5+5)

name                                old alloc/op   new alloc/op   delta
CompressAllocationsSingle/flate-48     814kB ± 0%     814kB ± 0%  +0.00%  (p=0.016 n=4+5)
CompressAllocationsSingle/gzip-48      814kB ± 0%     814kB ± 0%  +0.00%  (p=0.008 n=5+5)

name                                old allocs/op  new allocs/op  delta
CompressAllocationsSingle/flate-48      16.0 ± 0%      16.0 ± 0%    ~     (all equal)
CompressAllocationsSingle/gzip-48       17.0 ± 0%      17.0 ± 0%    ~     (all equal)

That's what I get for drawing conclusions from just one benchmark run. This probably needs many more runs to see a concrete difference, if one exists.

Since this seems much more easily reproducible on Windows, I'm going to try there.

@leafrhythms leafrhythms commented Feb 26, 2020

Cannot reproduce on Linux.

$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/leaf/.cache/go-build"
GOENV="/home/leaf/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leaf/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.14/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build636896530=/tmp/go-build -gno-record-gcc-switches" 

$ ~/go/bin/benchstat bench_1.1{3,4}.txt

name                               old time/op  new time/op  delta
CompressAllocationsSingle/flate-8  180µs ± 2%   181µs ± 4%   ~        (p=1.000 n=10+10)
CompressAllocationsSingle/gzip-8   183µs ± 7%   204µs ± 1%   +11.38%  (p=0.000 n=9+10)

Edit: benchmark run with ReportAllocs removed.

@mknyszek mknyszek commented Feb 26, 2020

While I set up my Windows environment: @klauspost what suggests to you that this is a regression in allocation performance vs. something else?

@klauspost klauspost commented Feb 26, 2020

@mknyszek It is only really doing allocations. The compression part is trivial, but it needs enough content to activate all allocations.

Profiles: alloc-pprof.zip

λ go tool pprof cpu113.pprof
Type: cpu
Time: Feb 26, 2020 at 5:05pm (CET)
Duration: 25.46s, Total samples = 1.01mins (238.80%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 39.56s, 65.07% of 60.80s total
Dropped 227 nodes (cum <= 0.30s)
Showing top 10 nodes out of 144
      flat  flat%   sum%        cum   cum%
    11.28s 18.55% 18.55%     14.95s 24.59%  runtime.scanobject
     5.51s  9.06% 27.62%      5.54s  9.11%  runtime.stdcall2
     4.86s  7.99% 35.61%      4.86s  7.99%  runtime.runGCProg
     4.76s  7.83% 43.44%      4.76s  7.83%  runtime.memclrNoHeapPointers
     3.82s  6.28% 49.72%      3.83s  6.30%  runtime.stdcall3
        2s  3.29% 53.01%      2.03s  3.34%  runtime.runqgrab
     1.97s  3.24% 56.25%      1.97s  3.24%  runtime.procyield
     1.94s  3.19% 59.44%      1.96s  3.22%  runtime.(*lfstack).pop
     1.86s  3.06% 62.50%      1.90s  3.12%  runtime.stdcall1
     1.56s  2.57% 65.07%     21.19s 34.85%  runtime.gcDrain

λ go tool pprof cpu114.pprof
Type: cpu
Time: Feb 26, 2020 at 5:07pm (CET)
Duration: 26.61s, Total samples = 1.34mins (303.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.07mins, 79.31% of 1.34mins total
Dropped 271 nodes (cum <= 0.01mins)
Showing top 10 nodes out of 95
      flat  flat%   sum%        cum   cum%
  0.54mins 40.51% 40.51%   0.55mins 40.59%  runtime.stdcall2
  0.17mins 12.93% 53.43%   0.22mins 16.34%  runtime.scanobject
  0.12mins  8.99% 62.42%   0.12mins  8.99%  runtime.memclrNoHeapPointers
  0.08mins  5.94% 68.36%   0.08mins  6.00%  runtime.stdcall1
  0.07mins  4.87% 73.23%   0.07mins  4.87%  runtime.runGCProg
  0.02mins  1.65% 74.88%   0.02mins  1.65%  runtime.procyield
  0.02mins  1.28% 76.15%   0.02mins  1.28%  runtime.runqgrab
  0.02mins  1.17% 77.32%   0.02mins  1.18%  runtime.stdcall7
  0.01mins  1.00% 78.32%   0.02mins  1.61%  runtime.greyobject
  0.01mins  0.99% 79.31%   0.03mins  1.90%  runtime.findObject

@leitzler leitzler commented Feb 26, 2020

FWIW, I ran it on Linux arm64 as well (shared VM, so anything goes), but the results seem consistent:

$ benchcmp go11{3,4}.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                                      old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate-8     408292        491127        +20.29%
BenchmarkCompressAllocationsSingle/gzip-8      365706        453792        +24.09%

$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/leitzler/.cache/go-build"
GOENV="/home/leitzler/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leitzler/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build024325012=/tmp/go-build -gno-record-gcc-switches"

@mknyszek mknyszek commented Feb 26, 2020

@klauspost Thanks for the profiles! I was about to do this myself.

I disagree that this indicates anything to do with allocation; none of those symbols are involved in allocation. The first allocation-related symbol I see is in the top 20 of the diff, and it's right at the bottom (and it's a deleted symbol at that!).

(pprof) top 20
Showing nodes accounting for 21.44s, 35.26% of 60.80s total
Dropped 261 nodes (cum <= 0.30s)
Showing top 20 nodes out of 173
      flat  flat%   sum%        cum   cum%
    27.17s 44.69% 44.69%     27.21s 44.75%  runtime.stdcall2
    -3.74s  6.15% 38.54%     -3.75s  6.17%  runtime.stdcall3
     2.93s  4.82% 43.36%      2.94s  4.84%  runtime.stdcall1
     2.49s  4.10% 47.45%      2.49s  4.10%  runtime.memclrNoHeapPointers
    -1.48s  2.43% 45.02%     -1.50s  2.47%  runtime.(*lfstack).pop
    -0.97s  1.60% 43.42%        -1s  1.64%  runtime.runqgrab
     0.94s  1.55% 44.97%      0.95s  1.56%  runtime.stdcall7
    -0.93s  1.53% 43.44%     -0.93s  1.53%  runtime.runGCProg
    -0.89s  1.46% 41.97%     31.05s 51.07%  runtime.gcDrain
    -0.88s  1.45% 40.53%     20.40s 33.55%  runtime.lock
    -0.85s  1.40% 39.13%     -1.77s  2.91%  runtime.scanobject
    -0.64s  1.05% 38.08%     -0.64s  1.05%  runtime.procyield
    -0.55s   0.9% 37.17%     -0.53s  0.87%  runtime.heapBits.next
     0.50s  0.82% 37.99%      0.50s  0.82%  runtime.shouldStealTimers (inline)
    -0.45s  0.74% 37.25%     -0.14s  0.23%  runtime.findObject
    -0.43s  0.71% 36.55%     -1.43s  2.35%  runtime.runqsteal
    -0.41s  0.67% 35.87%     -4.83s  7.94%  runtime.findrunnable
    -0.38s  0.62% 35.25%     -0.39s  0.64%  runtime.(*lfstack).push
     0.33s  0.54% 35.79%      0.34s  0.56%  runtime.scanblock
    -0.32s  0.53% 35.26%     -0.40s  0.66%  runtime.(*treapNode).updateInvariants

What's more interesting to me is the prevalence of stdcall2 in the diff. Peeking at that a few times...

(pprof) peek runtime.stdcall2 
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            21.65s 79.57% |   runtime.semasleep
                                             5.56s 20.43% |   runtime.preemptM
    27.17s 44.69% 44.69%     27.21s 44.75%                | runtime.stdcall2
                                             0.04s  0.15% |   runtime.stdcall
----------------------------------------------------------+-------------
(pprof) peek runtime.semasleep
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            21.68s   100% |   runtime.lock
     0.03s 0.049% 0.049%     21.68s 35.66%                | runtime.semasleep
                                            21.65s 99.86% |   runtime.stdcall2
----------------------------------------------------------+-------------
(pprof) peek runtime.lock
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            27.34s 134.02% |   runtime.preemptM
                                            -2.33s 11.42% |   runtime.(*mheap).alloc_m
                                            -1.57s  7.70% |   runtime.findrunnable
                                            -0.79s  3.87% |   runtime.stopm
                                            -0.62s  3.04% |   runtime.exitsyscallfast_pidle
                                            -0.33s  1.62% |   runtime.gcResetMarkState
                                            -0.32s  1.57% |   runtime.(*mheap).allocManual
                                            -0.31s  1.52% |   runtime.(*mheap).freeSpan.func1
                                            -0.22s  1.08% |   runtime.bgscavenge.func2
                                            -0.14s  0.69% |   runtime.timerproc
                                             0.14s  0.69% |   runtime.goschedImpl
                                             0.13s  0.64% |   runtime.(*mheap).allocSpan
                                            -0.11s  0.54% |   runtime.startm
                                            -0.07s  0.34% |   runtime.addtimer
                                            -0.07s  0.34% |   runtime.handoffp
                                            -0.07s  0.34% |   runtime.scavengeSleep
                                            -0.06s  0.29% |   runtime.gcBgMarkWorker.func2
                                            -0.05s  0.25% |   runtime.exitsyscall0
                                            -0.04s   0.2% |   runtime.gcstopm
                                             0.04s   0.2% |   runtime.gcFlushBgCredit
                                            -0.03s  0.15% |   runtime.runSafePointFn
                                            -0.03s  0.15% |   runtime.wakeScavenger
                                             0.02s 0.098% |   runtime.(*pageAlloc).scavengeOne.func1
                                            -0.02s 0.098% |   runtime.addspecial
                                             0.02s 0.098% |   runtime.forEachP
                                            -0.02s 0.098% |   runtime.acquireSudog
                                            -0.02s 0.098% |   runtime.gcSweep
                                            -0.02s 0.098% |   runtime.stackcache_clear
                                            -0.02s 0.098% |   runtime.stopTheWorldWithSema
                                             0.01s 0.049% |   runtime.(*mcentral).cacheSpan
                                             0.01s 0.049% |   runtime.freespecial
                                             0.01s 0.049% |   runtime.mProf_Free
                                            -0.01s 0.049% |   runtime.(*mheap).freeManual
                                            -0.01s 0.049% |   runtime.(*mheap).reclaimChunk
                                            -0.01s 0.049% |   runtime.nextMarkBitArenaEpoch
                                            -0.01s 0.049% |   runtime.scavengeSleep.func1
                                            -0.01s 0.049% |   runtime.setprofilebucket
                                            -0.01s 0.049% |   runtime.startTheWorldWithSema
                                                 0     0% |   runtime.(*mcentral).freeSpan
                                                 0     0% |   runtime.(*mcentral).uncacheSpan
                                                 0     0% |   runtime.(*mheap).reclaim
                                                 0     0% |   runtime.mProf_Malloc
                                                 0     0% |   runtime.schedEnableUser
    -0.88s  1.45%  1.45%     20.40s 33.55%                | runtime.lock
                                            21.68s 106.27% |   runtime.semasleep
                                            -0.58s  2.84% |   runtime.procyield
                                             0.14s  0.69% |   runtime.osyield
                                             0.03s  0.15% |   runtime.semacreate
                                             0.01s 0.049% |   runtime.newstack
----------------------------------------------------------+-------------
                                            -0.15s   100% |   runtime.checkdead
    -0.15s  0.25%  1.69%     -0.15s  0.25%                | runtime.lockextra
----------------------------------------------------------+-------------
(pprof) peek runtime.preemptM
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            38.18s   100% |   runtime.preemptone
     0.15s  0.25%  0.25%     38.18s 62.80%                | runtime.preemptM
                                            27.34s 71.61% |   runtime.lock
                                             5.56s 14.56% |   runtime.stdcall2
                                             2.92s  7.65% |   runtime.stdcall1
                                             1.14s  2.99% |   runtime.unlock
                                             0.95s  2.49% |   runtime.stdcall7
                                             0.07s  0.18% |   runtime.wantAsyncPreempt (inline)
                                             0.03s 0.079% |   runtime.gFromTLS
                                             0.02s 0.052% |   runtime.isAsyncSafePoint
----------------------------------------------------------+-------------

It looks like a lock in preemptM is either contended or hit very often. I'll try to see exactly what line it's coming from.

@klauspost Your SVGs suggest the same thing (look at the difference between stdcall2 in both) AFAICT.

CC @aclements because asynchronous preemption.

@mknyszek mknyszek commented Feb 26, 2020

@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.

@mknyszek mknyszek commented Feb 26, 2020

@aclements Most of it is coming from os_windows.go:1185 in go1.14. The line is lock(&suspendLock).

@mknyszek mknyszek changed the title runtime: big allocation speed regression in go1.14 runtime: compression benchmark regression in go1.14 due to async preemption Feb 26, 2020

@mknyszek mknyszek commented Feb 26, 2020

Digging deeper with @cherrymui, it looks like most of the preemption requests are coming from runtime.(*gcControllerState).enlistWorker. This benchmark is single-threaded, and enlistWorker only tries to preempt user code instead of taking an idle P. I notice that GOMAXPROCS=32 for the originally reported issue, so it should just take an idle P (since there are lots!), but it doesn't. It preempts user code instead.

There's a comment there about #19112 which talks about deadlocks for why we can't just take an idle P...

@mknyszek mknyszek commented Feb 26, 2020

@leitzler Can you please file a new issue? It's almost certainly unrelated to this one, because this is all about Windows-specific code.

@klauspost klauspost commented Feb 26, 2020

@mknyszek No expert on this, but it could be processor specific AFAICT. Running on Ryzen 3950x here - 16c/32t. I will run on my 6c/12t Intel laptop and see if it reproduces as well.

Edit: 6c/12t: Shows 24% degradation as well. HOWEVER, running with -cpu=4 it is neutral. I will run a few more tests on my desktop, since my laptop is a bit noisier.

@leitzler leitzler commented Feb 26, 2020

I’ll give it another run with profiles later tonight when I get home. (In a new issue depending on how this turns out meanwhile)

@klauspost klauspost commented Feb 26, 2020

The number of threads affects the numbers quite a bit:

commandline: go test -bench=. -test.benchtime=10s -cpu=1,2,4,8,16,32

λ benchcmp 113t.txt 114t.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkCompressAllocationsSingle/flate        83597         65062         -22.17%
BenchmarkCompressAllocationsSingle/flate-2      83767         91519         +9.25%
BenchmarkCompressAllocationsSingle/flate-4      75547         80647         +6.75%
BenchmarkCompressAllocationsSingle/flate-8      79197         84755         +7.02%
BenchmarkCompressAllocationsSingle/flate-16     82064         97530         +18.85%
BenchmarkCompressAllocationsSingle/flate-32     88006         106613        +21.14%
BenchmarkCompressAllocationsSingle/gzip         91624         108324        +18.23%
BenchmarkCompressAllocationsSingle/gzip-2       80929         97885         +20.95%
BenchmarkCompressAllocationsSingle/gzip-4       79115         81557         +3.09%
BenchmarkCompressAllocationsSingle/gzip-8       82221         85995         +4.59%
BenchmarkCompressAllocationsSingle/gzip-16      86477         96064         +11.09%
BenchmarkCompressAllocationsSingle/gzip-32      91024         108272        +18.95%

Even though the test is single-threaded, the total number of threads available has a very big impact.

I am rerunning the benchmark with -test.count=5, but so far the numbers match up. The outliers are not flukes AFAICT.

Edit: To confirm the numbers, here is 5 runs of each:

name                                old time/op  new time/op   delta
CompressAllocationsSingle/flate     82.4µs ± 1%   64.5µs ± 0%  -21.70%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-2   85.4µs ± 1%   89.6µs ± 2%   +4.98%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-4   75.7µs ± 1%   84.7µs ± 1%  +11.82%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-8   77.3µs ± 1%   84.7µs ± 0%   +9.58%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-16  83.0µs ± 2%   97.4µs ± 2%  +17.28%  (p=0.008 n=5+5)
CompressAllocationsSingle/flate-32  87.8µs ± 1%  108.7µs ± 0%  +23.79%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip      91.2µs ± 0%  107.6µs ± 1%  +17.92%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-2    83.8µs ± 2%   95.5µs ± 5%  +13.98%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-4    79.1µs ± 0%   82.7µs ± 2%   +4.47%  (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-8    80.3µs ± 1%   87.9µs ± 0%   +9.49%  (p=0.008 n=5+5)

(benchmark auto-terminated, but the numbers for gzip-16 and gzip-32 should be confirmed as well).

@cherrymui cherrymui commented Feb 26, 2020

Yeah, I think this is consistent with our previous analysis. There is no contention if there is just one thread. The more threads, the more threads trying to preempt each other, thus the heavier contention.
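
The scaling effect described here — more threads competing for one lock means heavier contention — can be observed with an ordinary sync.Mutex. This is an illustration of the general phenomenon only, unrelated to the runtime's internal suspendLock:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// measure returns how long it takes n goroutines to perform a fixed
// total number of lock/unlock pairs on one shared mutex. As n grows,
// the same total work takes longer because of contention.
func measure(n, totalOps int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for g := 0; g < n; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < totalOps/n; i++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, n := range []int{1, 4, 16} {
		fmt.Printf("%2d goroutines: %v\n", n, measure(n, 1<<20))
	}
}
```

With one goroutine there is no contention at all; the per-op cost climbs as the goroutine count rises, mirroring the -cpu results above.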

@leitzler leitzler commented Feb 26, 2020

@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.

It isn't dedicated hardware, so I'm not sure how much it's worth; it's an 8 vCPU / 8 GB VM. I re-ran the test 10 times each, alternating between 1.13 and 1.14, and the difference was actually much smaller:

$ benchstat go113.txt go114.txt
name                               old time/op    new time/op    delta
CompressAllocationsSingle/flate-8     437µs ± 3%     469µs ± 1%   +7.37%  (p=0.000 n=10+9)
CompressAllocationsSingle/gzip-8      381µs ± 2%     435µs ± 3%  +14.28%  (p=0.000 n=9+10)

profiles.zip

@klauspost klauspost commented Feb 26, 2020

@mknyszek I don't know why you changed the title to 'compression benchmark'. This is not related to compression.

Change the benchmark to:

	func BenchmarkAllocation(b *testing.B) {
		for i := 0; i < b.N; i++ {
			buf := make([]byte, 800<<10)
			_, _ = io.Copy(ioutil.Discard, bytes.NewBuffer(buf))
		}
	}

and the result is:

λ benchcmp 113.txt 114.txt
benchmark                                       old ns/op     new ns/op     delta
BenchmarkAllocation/flate-32     49657         56362         +13.50%

@mknyszek mknyszek commented Feb 26, 2020

@klauspost That's true, thanks for pointing it out. I'll fix it again.

@leitzler Looking at the profiles... it looks like it's exactly the same issue. runtime.tgkill is suddenly at the top of the profile and it comes from signalM, which in turn comes from preemptM. Then it follows the same path up to enlistWorker.

@mknyszek mknyszek changed the title runtime: compression benchmark regression in go1.14 due to async preemption runtime: frequent enlisting of short-lived background workers leads to performance regression with async preemption Feb 26, 2020

@mknyszek mknyszek commented Feb 28, 2020

I think the answer here is to just fix #19112 and this will mostly go away.

Putting on the help wanted label preemptively (ha) in case it turns out #19112 is too hard to fix (or I just need help fixing it); in that case we need a different solution.

@mknyszek mknyszek commented Feb 28, 2020

So I think fixing #19112 is feasible... but it's tricky. Anything we do would basically need an invariant that you can't try to change a G's status while holding sched.lock because the scan path can acquire it in the general case.

The fix is really simple, I think: make injectglist just casgstatus(_Gwaiting, _Grunnable) for all the goroutines in the list before putting them on the global run queue. AFAICT this is the only place in the whole codebase where we try to change a G's status while holding sched.lock but maybe this is too restrictive. I'm pretty sure it's also safe to do this without the sched.lock because Gs on a gList are only owned by the list.

The alternative is to have enlistWorker hand off the "wakep" to somebody else (like a "p waker" goroutine or sysmon or something) but the delay there may be too high.

@aclements and @ianlancetaylor, WDYT?

Yet another alternative is to not try to solve this problem and instead avoid having enlistWorker fire so often. This would require a heuristic or something for when more mark workers are actually needed, and would require more thought.
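
The proposed ordering can be modeled outside the runtime with a plain atomic compare-and-swap. This is a toy sketch only; the constants, types, and function names below are hypothetical stand-ins, not the runtime's actual internals:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative status values, analogous to the runtime's
// _Gwaiting/_Grunnable (the names and values here are made up).
const (
	gWaiting  uint32 = 4
	gRunnable uint32 = 1
)

type g struct{ status uint32 }

// casStatus mirrors the idea behind casgstatus: transition a G's status
// with an atomic compare-and-swap, no lock held, so it is safe to do
// before acquiring sched.lock.
func casStatus(gp *g, old, new uint32) bool {
	return atomic.CompareAndSwapUint32(&gp.status, old, new)
}

func main() {
	list := []*g{{status: gWaiting}, {status: gWaiting}}
	// Proposed injectglist ordering: flip every G to runnable first...
	for _, gp := range list {
		if !casStatus(gp, gWaiting, gRunnable) {
			panic("unexpected status")
		}
	}
	// ...and only then take sched.lock to push them on the global run queue.
	fmt.Println("all runnable:", list[0].status == gRunnable && list[1].status == gRunnable)
	// prints "all runnable: true"
}
```

The safety argument above rests on Gs in a gList being exclusively owned by the list, so no other party can race on the status transition.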

@ianlancetaylor ianlancetaylor commented Feb 28, 2020

@mknyszek Well, what I think is that I think someone should review https://golang.org/cl/216198 which seems to already do what you want.

@mknyszek mknyszek commented Feb 28, 2020

@ianlancetaylor Ah. Well, I totally forgot about that patch. :) Happy to review it.

@gopherbot gopherbot commented Mar 17, 2020

Change https://golang.org/cl/223797 mentions this issue: runtime: prefer to wake an idle P when enlisting bg mark workers

@networkimprov networkimprov commented Mar 17, 2020

Should this have Milestone 1.15 and backport to 1.14?

@mknyszek mknyszek commented Mar 17, 2020

@networkimprov Milestone 1.15 definitely. A backport to 1.14 might be possible, though it's a bit risky (just https://golang.org/cl/223797 isn't actually enough; the change it depends on is larger).

@mknyszek mknyszek modified the milestones: Backlog, Go1.15 Mar 17, 2020

@mknyszek mknyszek commented Mar 25, 2020

@klauspost I haven't gotten around to verifying this myself yet (some noise problems with the Windows box I have access to), but it would still be good to make sure the change fixes the problem for you (perhaps there's a difference between our Windows systems; and it also looks like maybe you have a less noisy system). Could you please try to run your benchmark with Go at the current master branch and https://golang.org/cl/223797 applied on top?

@klauspost klauspost commented Mar 25, 2020

I will attempt to verify it tomorrow 👍
