Skip to content

runtime/pprof: block profiler: fix sampling bias #44192

@felixge

Description

@felixge

What version of Go are you using (go version)?

$ go version
go version go1.15.7 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/felix.geisendoerfer/Library/Caches/go-build"
GOENV="/Users/felix.geisendoerfer/Library/Application Support/go/env"
GOEXE=""
GOFLAGS="-count=1"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/felix.geisendoerfer/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/felix.geisendoerfer/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.7_1/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.7_1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/felix.geisendoerfer/go/src/github.com/felixge/go-profiler-notes/examples/block-bias/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/mg/xwzs985x35789zk2l14nk8k80000gn/T/go-build460068460=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I created a program with 2 goroutines. One is blocked on a slow time.NewTicker() loop, the other on one that is 10x faster. Then I observed the program using the block profiler with the duration of the slow ticker as the rate.

Code: https://gist.github.com/felixge/6f9701dd26707dbe7d351805860a4608

What did you expect to see?

Both goroutines are blocked 100% of their time, so I'd expect both of them to show up equally in the block profile like shown below:

CleanShot 2021-02-09 at 20 54 14@2x

What did you see instead?

The block profiler sampling seems to be biased, preferring infrequent long block events over frequent short events, even when they contribute to the same amount of overall blocking. This can be seen in the block profile result I actually got:

CleanShot 2021-02-09 at 20 53 33@2x

I've done a lot of additional research on this which can be found here: https://github.com/felixge/go-profiler-notes/blob/main/block.md#sampling-bias

Proposal

I propose to fix the profiler in a way that removes the sampling bias without changing the API or increasing overhead. Concretely this might be done with a patch like this:

felixge/go@master...debias-blockprofile-rate

I've described this in more detail along with a simulation here: https://github.com/felixge/go-profiler-notes/blob/main/block.md#simulation--proposal-for-improvement

The downside to this approach is that it will obscure the "average event duration" which could currently be determined by dividing the delay/nanoseconds by the contentions/count values in the profile. I have a few ideas around this that I'd be happy to discuss.

Additionally I propose to improve the documentation for the block profiler to give people a better idea what the rate value means and perhaps also make some recommendations about overhead.

I'm happy to work on all of the above and submit the patches for it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    FrozenDueToAgeNeedsFixThe path to resolution is known, but the work has not been done.

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions