Skip to content

cmd/trace: give more context to "Scheduler wait" table column on /goroutine page #63430

@rhysh

Description

@rhysh

What version of Go are you using (go version)?

$ go1.21 version
go version go1.21.2 darwin/arm64

Does this issue reproduce with the latest release?

Yes, this is present in the Go 1.21 series and in the development branch.

What operating system and processor architecture are you using (go env)?

go env Output
$ go1.21 env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/rhys/Library/Caches/go-build'
GOENV='/Users/rhys/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/rhys/go/pkg/mod'
GONOPROXY='*'
GONOSUMDB='*'
GOOS='darwin'
GOPATH='/Users/rhys/go'
GOPRIVATE='*'
GOPROXY='direct'
GOROOT='/Users/rhys/go/version/go1.21'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/Users/rhys/go/version/go1.21/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.2'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build360557930=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I ran the reproducer for #63422 and collected an execution trace. I used go tool trace to open a browser view, clicked the "Goroutine analysis" link, and clicked on one of the goroutine types (testing.tRunner). I looked at the "Scheduler wait" column, and saw that one of the goroutines had spend a total of 84ms waiting to run.

$ go1.21 test ./h2_unplug_test.go -test.run=/http1 -request-timeout=10ms -request-count=30000 -unplug-interval=10000 -test.trace=/tmp/t
--- FAIL: TestNetsplit (2.33s)
    --- FAIL: TestNetsplit/http1 (2.33s)
        h2_unplug_test.go:155: total=30000 unplugs=3 failures=3
FAIL
FAIL    command-line-arguments  2.497s
FAIL

$ go1.21 tool trace /tmp/t
2023/10/06 14:48:04 Parsing trace...
2023/10/06 14:48:05 Splitting trace...
2023/10/06 14:48:07 Opening browser. Trace viewer is listening on http://127.0.0.1:54953

What did you expect to see?

I expected that figure to come with the relevant context: That goroutine played a role in executing 30,000 HTTP requests in sequence. It was scheduled a total of 60,035 times. Its average wait time was 1.4µs, and (if I've calculated correctly) its p99 wait time was 6.3µs, the p100 / worst case time was 532µs, and the second-worst was 82µs.

$ go1.21 tool trace -d /tmp/t | awk '{if ($2=="GoStart" && $4=="g=8") {n+=1}} END {print n}'
2023/10/06 14:56:01 Parsing trace...
60035

I think that table should have an additional column showing the number of times the goroutine has been scheduled (the count of GoStart events), "60035" in this case.

Following that, I also think that table should give some hints about the tail behavior of the goroutine's scheduler interactions. Maybe that's the p100 / worst case scheduling latency.

What did you see instead?

The table shows "84ms" of "Scheduler wait", without any of that additional context. That makes it look like the scheduler has problems, when a much closer look reveals that this is an especially chatty goroutine ... one that uses the scheduler perhaps a little too often, but which has a good experience nearly every time.

Screenshot 2023-10-06 at 2 48 21 PM

CC @golang/runtime , and @mknyszek in particular

Metadata

Metadata

Assignees

No one assigned

    Labels

    NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.compiler/runtimeIssues related to the Go compiler and/or runtime.

    Type

    No type

    Projects

    Status

    Todo

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions