
time, runtime: scheduled timer may never fire if GOMAXPROCS is reduced #45716

@prashantv

Description


This issue is related to #44868 (and seems to be resolved with the same fix) but I wanted to file a separate issue as the impact is different, and may help others running into the same issue.

What version of Go are you using (go version)?

$ go version
go version go1.16.2 linux/amd64

Does this issue reproduce with the latest release?

No, go1.16.3 seems fine (only tested with the simple repro, not in production). It does reproduce with go1.15.11.

This is likely fixed by https://go-review.googlesource.com/c/go/+/300610/.

What operating system and processor architecture are you using (go env)?

linux, amd64

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
[..]
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
[...]
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/.gimme/versions/go1.16.2.linux.amd64"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/.gimme/versions/go1.16.2.linux.amd64/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.2"
[..]

What did you do?

Created a background goroutine to do some periodic work every 500ms. The background goroutine created a ticker, and used a for loop that waited on the ticker.

Concurrently, the main goroutine reduced GOMAXPROCS (based on some system config).

Repro:
https://gist.github.com/prashantv/fdd710e270244efa23ba7a3fcb45b088

This needs to be run in a loop, as the hang happens quite infrequently; it should eventually freeze with a log like:

$ while [ true ] ; do ./stuck-timer-repro || break; done
[...]
attempt 4
========= Likely bug =======
timer 53 timed out after 10s of waiting 0

The repro happens in a few seconds on go1.16.2.

The process then blocks waiting on the ticker's channel, which allows it to be debugged.

What did you expect to see?

The ticker channel to receive ticks periodically, so that the periodic work runs roughly twice a second.

What did you see instead?

The ticker did not tick at all, and the periodic work did not happen.

More Context

In an internal production service, we noticed some periodic work was not happening as expected. After ruling out other possibilities, the only explanation left was that the ticker was not ticking. Using /debug/pprof/goroutine, I was able to find "stuck" select calls with the appropriate stack trace; the ticker was supposed to fire every few hundred ms, but the select had been stuck for many minutes (in some cases thousands of minutes).

Using delve with a core, I discovered that the timer was added to a P that was dead. The same can be reproduced with the repro; an example delve session showing the state is below:

delve debugging

wait for stuck-timer-repro to print the "likely bug" line as above before running delve

dlv attach $(pidof stuck-timer-repro)
(dlv) grs
Goroutine 1 - User: ./stuck-timer-repro.go:52 main.run (0x4a74e5)
Goroutine 2 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
Goroutine 3 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
Goroutine 4 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
Goroutine 5 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
[5 goroutines]

switch to the main goroutine so we can find the timer object

(dlv) gr 1
Switched from 0 to 1 (thread 108415)

(dlv) bt -full
0 0x0000000000435985 in runtime.gopark
at /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337
unlockf = runtime.chanparkcommit
[...]

3 0x00000000004a74e5 in main.run
at ./stuck-timer-repro.go:52
gomaxprocs = 10
goroutines = 100
wg = sync.WaitGroup {noCopy: sync.noCopy {}, state1: [3]uint32 [...]}
tickers = []*time.Ticker len: 100, cap: 100, [...]
i = 0
ti = 81
t = (*time.Ticker)(0xc0004820a0)
[...]

get information about the timer

(dlv) frame 3
(truncated)
=> 52: <-t.C

(dlv) print t
*time.Ticker {
C: <-chan time.Time {
qcount: 0,
dataqsiz: 1,
buf: *[1]time.Time [
(*time.Time)(0xc000480030),
],
elemsize: 24,
closed: 0,
elemtype: *runtime._type {size: 24, ptrdata: 24, hash: 3225689344, tflag: tflagUncommon|tflagExtraStar|tflagNamed|tflagRegularMemory (15), align: 8, fieldAlign: 8, kind: 25, equal: runtime.memequal_varlen, gcdata: *4, str: 9430, ptrToThis: 140736},
sendx: 0,
recvx: 0,
recvq: waitq<time.Time> {
first: *(*sudog<time.Time>)(0xc00013e000),
last: *(*sudog<time.Time>)(0xc00013e000),},
sendq: waitq<time.Time> {
first: *sudog<time.Time> nil,
last: *sudog<time.Time> nil,},
lock: runtime.mutex {
lockRankStruct: runtime.lockRankStruct {},
key: 0,},},
r: time.runtimeTimer {
pp: 824633896960,
when: 373510132342503,
period: 1000000,
f: time.sendTime,
arg: interface {}(chan time.Time) *(*interface {})(0xc0004820c8),
seq: 0,
nextwhen: 0,
status: 1,},}

print the status of the P that the timer was added to, which shows the P is dead (status 4 is _Pdead in the runtime)

(dlv) print (*runtime.p)(824633896960).status
4
(dlv) print (*runtime.p)(824633896960).timers
[]*runtime.timer len: 1, cap: 1, [
*{
pp: 824633896960,
when: 373510132342503,
period: 1000000,
f: time.sendTime,
arg: interface {}(chan time.Time) *(*interface {})(0xc0004820c8),
seq: 0,
nextwhen: 0,
status: 1,},
