Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
runtime: timer self-deadlock due to preemption point #38070
The problem was discovered in kubernetes/kubernetes#88638, where Kubernetes was seeing some kubelets get wedged. They grabbed a stack trace for me, and after combing over it I summarized the problem. Below are the highlights:
… timerModifying Currently if a goroutine is preempted while owning a timer in the timerModifying state, it could self-deadlock. When the goroutine is preempted and calls into the scheduler, it could call checkTimers. If checkTimers encounters the timerModifying timer and calls runtimer on it, then runtimer will spin, waiting for that timer to leave the timerModifying state, which it never will. So far we got lucky that for the most part that there were no preemption points while timerModifying is happening, however CL 221077 seems to have introduced one, leading to sporadic self-deadlocks. This change disables preemption explicitly while a goroutines holds a timer in timerModifying. Since only checkTimers (and thus runtimer) is called from the scheduler, this is sufficient to prevent preemption-based self-deadlocks. For #38070 Fixes #38072 Change-Id: Idbfac310889c92773023733ff7e2ff87e9896f0c Reviewed-on: https://go-review.googlesource.com/c/go/+/225497 Run-TryBot: Michael Knyszek <email@example.com> TryBot-Result: Gobot Gobot <firstname.lastname@example.org> Reviewed-by: Ian Lance Taylor <email@example.com> (cherry picked from commit e8be350) Reviewed-on: https://go-review.googlesource.com/c/go/+/225521 Run-TryBot: Ian Lance Taylor <firstname.lastname@example.org>
I think I've hit the same bug in a long-running server with many thousands of goroutines. This early morning, the process just froze forever - even a goroutine that was simply logging some trivial stats every minute or so was not responding. One of the runnable goroutines is at
@mknyszek do you think it's the same bug, or should I file a new one?
I should note that this is a server that works fine with Go 1.13.x, but started hanging like this since we jumped to 1.14.x. I had been testing it with 1.14.x for weeks before, but we hadn't started running the long-lived staging servers with 1.14.x.until earlier this week.