
runtime: timer self-deadlock due to preemption point #38070

Closed
mknyszek opened this issue Mar 25, 2020 · 11 comments
@mknyszek (Contributor) commented Mar 25, 2020

The problem was discovered in kubernetes/kubernetes#88638, where Kubernetes was seeing some kubelets get wedged. They grabbed a stack trace for me, and after combing over it I summarized the problem. Below are the highlights:

[...] There's a self-deadlock going on in the timer system.

  1. A goroutine does: timer.Reset -> resettimer -> modtimer -> wakeNetPoller
    • At this point, the timer it's resetting is in the timerModifying state.
  2. Now, at the beginning of wakeNetPoller, we get a preemption request (synchronous, so unrelated to asynchronous preemption), so we call into morestack.
  3. The chain now goes morestack -> gopreempt_m -> goschedImpl -> schedule -> checkTimers -> runtimer
  4. Now we try to run the timer we were modifying, but it's currently being modified, so we loop and osyield. It never stops being in timerModifying, though.

(kubernetes/kubernetes#88638 (comment))

[...] this can only happen if the timer is in the timerDeleted state and at the top of the heap, and indeed (*time.Timer).Reset does stop (delete) the timer before resetting it. So, this doesn't imply a racy use of timers on Kubernetes' end.

Looking at the state machine in time.go again, [...] it could only ever happen with the wakeNetPoller call, because in addInitializedTimer we grab a runtime lock around cleantimers and doaddtimer, which prevents preemption.

Maybe what happened is that we accidentally inserted a preemption point. What if wakeNetPoller was previously inlined into resettimer, but in addInitializedTimer it's not? That would mean we added a preemption point unintentionally.

One idea for the fix is to just... make sure wakeNetPoller gets inlined (and maybe https://go-review.googlesource.com/c/go/+/224902 actually does that), but that seems fragile. We should probably say that while a goroutine owns a timer in timerModifying it must not be preempted, because preemption can trigger an operation that waits on that state (perhaps the same for the other -ing states).

(kubernetes/kubernetes#88638 (comment))

CC @ianlancetaylor

@mknyszek mknyszek added this to the Go1.15 milestone Mar 25, 2020
@mknyszek (Contributor, Author) commented Mar 25, 2020

@gopherbot Please open a backport to 1.14.

@mknyszek mknyszek changed the title runtime: timer self-deadlock due to preemption point in go1.14.1 runtime: timer self-deadlock due to preemption point Mar 25, 2020
@gopherbot commented Mar 25, 2020

Backport issue(s) opened: #38072 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@mknyszek mknyszek added the NeedsFix label Mar 25, 2020
@mknyszek mknyszek self-assigned this Mar 25, 2020
@gopherbot commented Mar 25, 2020

Change https://golang.org/cl/225497 mentions this issue: runtime: prevent preemption while timer is in timerModifying

@bcmills (Member) commented Mar 25, 2020

Does this deadlock account for the test failures observed in #37894?

@bcmills (Member) commented Mar 25, 2020

(Looks like probably not.)

@mknyszek (Contributor, Author) commented Mar 25, 2020

@bcmills It looks to me like those test failures are in fact this bug.

@gopherbot gopherbot closed this in e8be350 Mar 25, 2020

gopherbot pushed a commit that referenced this issue Mar 25, 2020
runtime: prevent preemption while timer is in timerModifying

Currently if a goroutine is preempted while owning a timer in the
timerModifying state, it could self-deadlock. When the goroutine is
preempted and calls into the scheduler, it could call checkTimers. If
checkTimers encounters the timerModifying timer and calls runtimer on
it, then runtimer will spin, waiting for that timer to leave the
timerModifying state, which it never will.

So far we got lucky that, for the most part, there were no preemption
points while a timer was in timerModifying; however, CL 221077 seems to
have introduced one, leading to sporadic self-deadlocks.

This change disables preemption explicitly while a goroutine holds a
timer in timerModifying. Since checkTimers (and thus runtimer) is only
called from the scheduler, this is sufficient to prevent
preemption-based self-deadlocks.

For #38070
Fixes #38072

Change-Id: Idbfac310889c92773023733ff7e2ff87e9896f0c
Reviewed-on: https://go-review.googlesource.com/c/go/+/225497
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
(cherry picked from commit e8be350)
Reviewed-on: https://go-review.googlesource.com/c/go/+/225521
Run-TryBot: Ian Lance Taylor <iant@golang.org>
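The shape of the fix in the commit above can be modeled in user-level Go. The real runtime pins the goroutine to its M with acquirem/releasem (which the scheduler's preemption check respects); the acquirem/releasem names and the locks counter below are a toy imitation of that mechanism, not callable runtime API. The point of the sketch is only the ordering: preemption is disabled before the timer enters timerModifying and re-enabled after it leaves, so the scheduler (and thus checkTimers/runtimer) can never observe the in-between state on this goroutine.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Assumed timer states, as in the earlier sketch.
const (
	timerDeleted uint32 = iota
	timerModifying
	timerModified
)

// fakeM imitates the runtime's per-M lock count that suppresses preemption.
type fakeM struct{ locks int32 }

func (m *fakeM) acquirem() { atomic.AddInt32(&m.locks, 1) }  // preemption off
func (m *fakeM) releasem() { atomic.AddInt32(&m.locks, -1) } // preemption back on

// preemptible reports whether a preemption request would be honored now.
func (m *fakeM) preemptible() bool { return atomic.LoadInt32(&m.locks) == 0 }

func main() {
	m := &fakeM{}
	status := timerDeleted

	m.acquirem() // the fix: taken before entering timerModifying
	status = timerModifying

	// A preemption request arrives here (e.g. inside wakeNetPoller).
	// With locks held, the goroutine never re-enters the scheduler,
	// so runtimer cannot spin on this timerModifying timer.
	if m.preemptible() {
		fmt.Println("preempted while timerModifying: would self-deadlock")
	} else {
		fmt.Println("preemption deferred; timer state stays consistent")
	}

	status = timerModified
	m.releasem() // only now may the scheduler preempt us
	_ = status
}
```

The design choice mirrors the commit message: rather than relying on wakeNetPoller staying inlined (fragile), the timerModifying window is made explicitly non-preemptible, so future preemption points inside it are harmless.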
@mm4tt commented Mar 26, 2020

@mknyszek, thanks for debugging it and fixing it so quickly.
When can we expect the fix to be released?

@mvdan (Member) commented Mar 26, 2020

I think I've hit the same bug in a long-running server with many thousands of goroutines. Early this morning, the process froze forever: even a goroutine that was simply logging some trivial stats every minute or so stopped responding. One of the runnable goroutines is at runtime.resettimer, which is suspicious, and is how I found this bug.

2020-03-26-hang-log.txt.gz

@mknyszek do you think it's the same bug, or should I file a new one?

I should note that this is a server that works fine with Go 1.13.x, but started hanging like this since we jumped to 1.14.x. I had been testing it with 1.14.x for weeks before, but we hadn't started running the long-lived staging servers with 1.14.x until earlier this week.

@ianlancetaylor (Contributor) commented Mar 26, 2020

@mm4tt The normal schedule for the next release would be next week some time.

@mvdan That is almost certainly the same bug.

Stebalien added a commit to ipfs/go-ipfs that referenced this issue Mar 30, 2020
Go 1.14 has a timer reset deadlock (golang/go#38070).

This also downgrades quic-go until either a go patch release fixes this issue or
a version of quic-go is released that works with go 1.13.
alanshaw added a commit to libp2p/hydra-booster that referenced this issue Apr 1, 2020
golang 1.14 currently has a bug that is being exposed by our latest deps: golang/go#38070

resolves #60
daviddias added a commit to libp2p/hydra-booster that referenced this issue Apr 1, 2020
golang 1.14 currently has a bug that is being exposed by our latest deps: golang/go#38070

resolves #60
alanshaw added a commit to libp2p/hydra-booster that referenced this issue May 14, 2020
golang/go#38070 was resolved
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
Go 1.14 has a timer reset deadlock (golang/go#38070).

This also downgrades quic-go until either a go patch release fixes this issue or
a version of quic-go is released that works with go 1.13.