Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: loops forever on sched_yield sometimes(timer related) #38023

Closed
xtaci opened this issue Mar 23, 2020 · 15 comments
Closed

runtime: loops forever on sched_yield sometimes(timer related) #38023

xtaci opened this issue Mar 23, 2020 · 15 comments

Comments

@xtaci
Copy link

@xtaci xtaci commented Mar 23, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
➜  ~ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xtaci/.cache/go-build"
GOENV="/home/xtaci/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xtaci/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build204371315=/tmp/go-build -gno-record-gcc-switches"

What did you do?

a program ran for some unkown time(minutes to hours), then CPU goes to 100% and never back to normal.

What did you expect to see?

normal CPU usage

What did you see instead?

100% CPU usage for a single core, I use delve to attach to the process, and other goroutines are blocking on events forever, including some time.Tickers which expected to fire periodically.(I put breakpoints on first lines of timer handlers.)

(dlv) b osyield
Breakpoint 1 set at 0x4645b0 for runtime.osyield() /usr/local/go/src/runtime/sys_linux_amd64.s:655
(dlv) c
> runtime.osyield() /usr/local/go/src/runtime/sys_linux_amd64.s:655 (hits total:7) (PC: 0x4645b0)
Warning: debugging optimized function
   650:         JLS     2(PC)
   651:         MOVL    $0xf1, 0xf1  // crash
   652:         RET
   653:
   654: TEXT runtime·osyield(SB),NOSPLIT,$0
=> 655:         MOVL    $SYS_sched_yield, AX
   656:         SYSCALL
   657:         RET
   658:
   659: TEXT runtime·sched_getaffinity(SB),NOSPLIT,$0
   660:         MOVQ    pid+0(FP), DI
(dlv) bt
0  0x00000000004645b0 in runtime.osyield
   at /usr/local/go/src/runtime/sys_linux_amd64.s:655
1  0x0000000000452d5c in runtime.runtimer
   at /usr/local/go/src/runtime/time.go:740
2  0x000000000043b634 in runtime.checkTimers
   at /usr/local/go/src/runtime/proc.go:2621
3  0x000000000043b0db in runtime.schedule
   at /usr/local/go/src/runtime/proc.go:2483
4  0x000000000043ba76 in runtime.goschedImpl
   at /usr/local/go/src/runtime/proc.go:2705
5  0x000000000043bcc4 in runtime.gopreempt_m
   at /usr/local/go/src/runtime/proc.go:2733
6  0x000000000044bd9e in runtime.newstack
   at /usr/local/go/src/runtime/stack.go:1025
7  0x00000000004604ef in runtime.morestack
   at /usr/local/go/src/runtime/asm_amd64.s:449
8  0x0000000000462331 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1373

also strace shows it call sched_yield() infinitely

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
....

with 100% CPU usage
program build flag:
env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ...

/usr/local/go/src/runtime/time.go

 690 func runtimer(pp *p, now int64) int64 {
 691     for {
 692         t := pp.timers[0]
 693         if t.pp.ptr() != pp {
 694             throw("runtimer: bad p")
 695         }
 696         switch s := atomic.Load(&t.status); s {
 697         case timerWaiting:
 698             if t.when > now {
 699                 // Not ready to run.
 700                 return t.when
 701             }
 702
 703             if !atomic.Cas(&t.status, s, timerRunning) {
 704                 continue
 705             }
 706             // Note that runOneTimer may temporarily unlock
 707             // pp.timersLock.
 708             runOneTimer(pp, t, now)
 709             return 0
 710
 711         case timerDeleted:
....
 738         case timerModifying:
 739             // Wait for modification to complete.
 740             osyield()
 741
 742         case timerNoStatus, timerRemoved:
....
@xtaci xtaci closed this Mar 23, 2020
@xtaci xtaci reopened this Mar 23, 2020
@xtaci xtaci changed the title runtime: loops forever on runtime.futex sometimes runtime: loops forever on sched_yield sometimes Mar 23, 2020
@xtaci xtaci changed the title runtime: loops forever on sched_yield sometimes runtime: loops forever on sched_yield sometimes(timer related) Mar 23, 2020
@andybons
Copy link
Member

@andybons andybons commented Mar 23, 2020

@andybons andybons added this to the Unplanned milestone Mar 23, 2020
@shuLhan
Copy link
Contributor

@shuLhan shuLhan commented Mar 23, 2020

@xtaci Thanks for opening the issue. I also have the same problem but only tested on VirtualBox.

Here is my report previously on golang-nuts: https://groups.google.com/forum/#!topic/golang-nuts/np70gmtKsaM .

Long story short, I have five services run on VirtualBox using Linux as guest machine and macOS as host machine. Sometimes one service will run with high CPU, sometimes two.

My go version,

dev 0 % go version
go version devel +89b791256a Mon Mar 23 20:18:26 2020 +0700 darwin/amd64

Debugging with gdb result in the following list of threads,

Thread 1 "YYY" received signal SIGINT, Interrupt.
runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
568     /Users/XXX/share/go/src/runtime/sys_linux_amd64.s: No such file or directory.
(gdb) info threads
  Id   Target Id                  Frame 
* 1    LWP 5420 "exchange-restpr" runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
  2    LWP 5431 "exchange-restpr" runtime.usleep () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:146
  3    LWP 5432 "exchange-restpr" runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
  4    LWP 5433 "exchange-restpr" runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
  5    LWP 5434 "exchange-restpr" runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
  6    LWP 5435 "exchange-restpr" runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
  7    LWP 5436 "exchange-restpr" runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
  8    LWP 5437 "exchange-restpr" runtime.osyield () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:657
(gdb) where
#0  runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
#1  0x0000000000430726 in runtime.futexsleep (addr=0xf95ae8 <runtime.m0+328>, val=0, ns=-1) at /Users/XXX/share/go/src/runtime/os_linux.go:45
#2  0x000000000040beb6 in runtime.notesleep (n=0xf95ae8 <runtime.m0+328>) at /Users/XXX/share/go/src/runtime/lock_futex.go:152
#3  0x000000000043aab8 in runtime.stoplockedm () at /Users/XXX/share/go/src/runtime/proc.go:2008
#4  0x000000000043c856 in runtime.schedule () at /Users/XXX/share/go/src/runtime/proc.go:2531
#5  0x000000000043cc2d in runtime.park_m (gp=0xc00022ca80) at /Users/XXX/share/go/src/runtime/proc.go:2767
#6  0x000000000046298b in runtime.mcall () at /Users/XXX/share/go/src/runtime/asm_amd64.s:318
#7  0x00000000004628a4 in runtime.rt0_go () at /Users/XXX/share/go/src/runtime/asm_amd64.s:220
#8  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (LWP 5431)]
#0  runtime.usleep () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:146
146     in /Users/XXX/share/go/src/runtime/sys_linux_amd64.s
(gdb) where
#0  runtime.usleep () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:146
#1  0x000000000044107d in runtime.sysmon () at /Users/XXX/share/go/src/runtime/proc.go:4544
#2  0x0000000000438ff3 in runtime.mstart1 () at /Users/XXX/share/go/src/runtime/proc.go:1134
#3  0x0000000000438f0e in runtime.mstart () at /Users/XXX/share/go/src/runtime/proc.go:1099
#4  0x0000000000466b93 in runtime.clone () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:619
#5  0x0000000000000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (LWP 5437)]
#0  runtime.osyield () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:657
657     in /Users/XXX/share/go/src/runtime/sys_linux_amd64.s
(gdb) where
#0  runtime.osyield () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:657
#1  0x000000000045520a in runtime.runtimer (pp=0xc00002c000, now=35383923085164, ~r2=<optimized out>) at /Users/XXX/share/go/src/runtime/time.go:757
#2  0x000000000043c9e4 in runtime.checkTimers (pp=0xc00002c000, now=35383923085164, rnow=<optimized out>, pollUntil=<optimized out>, ran=<optimized out>)
    at /Users/XXX/share/go/src/runtime/proc.go:2698
#3  0x000000000043c48b in runtime.schedule () at /Users/XXX/share/go/src/runtime/proc.go:2560
#4  0x000000000043ce26 in runtime.goschedImpl (gp=0xc000001680) at /Users/XXX/share/go/src/runtime/proc.go:2782
#5  0x000000000043d074 in runtime.gopreempt_m (gp=0xc000001680) at /Users/XXX/share/go/src/runtime/proc.go:2810
#6  0x000000000044ddee in runtime.newstack () at /Users/XXX/share/go/src/runtime/stack.go:1025
#7  0x0000000000462aef in runtime.morestack () at /Users/XXX/share/go/src/runtime/asm_amd64.s:449
#8  0x0000000000000000 in ?? ()
(gdb) 

According to my bisections (third times), the following commit cause it,

14:18 ~/src/go
(af1f3b0082...)|BISECTING XXX 0 % git bisect good                                                                                                                                                    
98858c438016bbafd161b502a148558987aa44d5 is the first bad commit                                                                                                                                          
commit 98858c438016bbafd161b502a148558987aa44d5                                                                                                                                                          
Author: Ian Lance Taylor <iant@golang.org>                                                                                                                                                                
Date:   Tue Feb 25 20:23:15 2020 -0800                                                                                                                                                                    
                                                                                                                                                                                                         
    runtime: don't panic on racy use of timers
     
    If we see a racy use of timers, as in concurrent calls to Timer.Reset,
    do the operations in an unpredictable order, rather than crashing.
     
    Fixes #37400
     
    Change-Id: Idbac295df2dfd551b6d762909d5040fc532c1b34
    Reviewed-on: https://go-review.googlesource.com/c/go/+/221077
    Run-TryBot: Ian Lance Taylor <iant@golang.org>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Michael Knyszek <mknyszek@google.com>

 src/runtime/time.go   | 216 ++++++++++++++++----------------------------------
 src/time/time_test.go |  40 ++++++----
 2 files changed, 92 insertions(+), 164 deletions(-) 

Link to CL: https://go-review.googlesource.com/c/go/+/221077

I try to debug it manually and run gdb with it,

master 0 % git diff
diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 92873f2dac..a436ff4572 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -148,6 +148,7 @@ func notesleep(n *note) {
        }
        for atomic.Load(key32(&n.key)) == 0 {
                gp.m.blocked = true
+               print("futexsleep(", n.key, ", 0, ", ns, ")\n")
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {
                        asmcgocall(*cgo_yield, nil)
diff --git a/src/runtime/time.go b/src/runtime/time.go
index c62207d7f7..80748b6939 100644
--- a/src/runtime/time.go
+++ b/src/runtime/time.go
@@ -753,6 +753,7 @@ func runtimer(pp *p, now int64) int64 {
 
                case timerModifying:
                        // Wait for modification to complete.
+                       print("runtimer: timerModifying\n")
                        osyield()
 
                case timerNoStatus, timerRemoved:

It print the following line for the service with high CPU,

runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying
runtimer: timerModifying

Thread 1 "exchange-restpu" received signal SIGINT, Interrupt.
runtime.futex () at /Users/XXX/share/go/src/runtime/sys_linux_amd64.s:568
568     /Users/XXX/share/go/src/runtime/sys_linux_amd64.s: No such file or directory.
(gdb)                                                                                            

I hope that give some light, because I did not have a minimal reproducible code.

CC: @ianlancetaylor

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Mar 23, 2020

Is there a way that we can recreate the problem?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Mar 23, 2020

I still don't understand how this could happen, but I can see how to reduce the length of time that the osyield loop continues. Does https:golang.org/cl/224902 make a difference here? Thanks.

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 23, 2020

Change https://golang.org/cl/224902 mentions this issue: runtime: don't call wakeNetPoller during timerModifying

@xtaci
Copy link
Author

@xtaci xtaci commented Mar 24, 2020

It seems that sched_yield() does not block under some circumtances, so the for loop in time.go runs forever. And yes, this appears on KVM based virtual machine, both vultr and promox-ve.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Mar 24, 2020

Is it possible for you to test whether the patch above helps with the problem? Thanks.

@xtaci
Copy link
Author

@xtaci xtaci commented Mar 24, 2020

ok, I'll try to apply time.go with that patch

shuLhan added a commit to shuLhan/go that referenced this issue Mar 24, 2020
Reduce the length of time that other timer functions can see timerModifying.
In particular avoid system calls.

Fixes golang#38023

Change-Id: I1b61229c668e6085d9ee6dca9488a90055386c36
@xtaci
Copy link
Author

@xtaci xtaci commented Mar 24, 2020

@ianlancetaylor my program works successfully for the last 8hrs, good job! gonna check to see if it works for the next 24hrs.

@shuLhan
Copy link
Contributor

@shuLhan shuLhan commented Mar 24, 2020

@ianlancetaylor after couple of hours, there are no more high CPUs on services in VirtualBox. Seems the CL works.

@xtaci
Copy link
Author

@xtaci xtaci commented Mar 24, 2020

@ianlancetaylor can you adjust the plan and apply this patch to 1.14.2?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Mar 24, 2020

@gopherbot Please open backport issue for 1.14.

This problem can cause Go programs to hang in some circumstances, with no workaround.

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 24, 2020

Backport issue(s) opened: #38051 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot gopherbot closed this in 355f53f Mar 24, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Mar 24, 2020

Change https://golang.org/cl/225277 mentions this issue: [release-branch.go1.14] runtime: don't call wakeNetPoller during timerModifying

@xtaci
Copy link
Author

@xtaci xtaci commented Mar 25, 2020

successfuly ran for 24hrs!

gopherbot pushed a commit that referenced this issue Mar 25, 2020
…rModifying

Reduce the length of time that other timer functions can see timerModifying.
In particular avoid system calls.

For #38023
Fixes #38051

Change-Id: I1b61229c668e6085d9ee6dca9488a90055386c36
Reviewed-on: https://go-review.googlesource.com/c/go/+/224902
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit 355f53f)
Reviewed-on: https://go-review.googlesource.com/c/go/+/225277
crhino added a commit to hashicorp/consul that referenced this issue Jun 18, 2020
agnivade added a commit to mattermost/mattermost-server that referenced this issue Jul 31, 2020
This contains a patch fix for golang/go#38023
which have been encountered in our load tests and some users.

I have verified on such a stuck instance and the thread dump seem to match
with what is seen on the issue. All threads stuck on runtime.futex, except one,
which is at runtime.osyield.

And of all the times load tests were run which _did not_ show this issue,
it was run with a version greater than 1.14.1, which further points to the
theory that this bug is the culprit.
agnivade added a commit to mattermost/mattermost-server that referenced this issue Jul 31, 2020
* MM-26584: Bump up Go patch version

This contains a patch fix for golang/go#38023
which have been encountered in our load tests and some users.

I have verified on such a stuck instance and the thread dump seem to match
with what is seen on the issue. All threads stuck on runtime.futex, except one,
which is at runtime.osyield.

And of all the times load tests were run which _did not_ show this issue,
it was run with a version greater than 1.14.1, which further points to the
theory that this bug is the culprit.

* Replace docker image versions.

* Update build/README.md

Co-authored-by: Elisabeth Kulzer <elikul@elikul.de>
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
metanerd added a commit to mattermost/mattermost-server that referenced this issue Jul 31, 2020
* MM-26584: Bump up Go patch version (#15156)

* MM-26584: Bump up Go patch version

This contains a patch fix for golang/go#38023
which have been encountered in our load tests and some users.

I have verified on such a stuck instance and the thread dump seem to match
with what is seen on the issue. All threads stuck on runtime.futex, except one,
which is at runtime.osyield.

And of all the times load tests were run which _did not_ show this issue,
it was run with a version greater than 1.14.1, which further points to the
theory that this bug is the culprit.

* Replace docker image versions.

* Update build/README.md

Co-authored-by: Elisabeth Kulzer <elikul@elikul.de>
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>

* Trigger CI

Co-authored-by: Elisabeth Kulzer <elikul@elikul.de>
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
KrishnaSindhur pushed a commit to KrishnaSindhur/mattermost-server that referenced this issue Oct 10, 2020
* MM-26584: Bump up Go patch version

This contains a patch fix for golang/go#38023
which have been encountered in our load tests and some users.

I have verified on such a stuck instance and the thread dump seem to match
with what is seen on the issue. All threads stuck on runtime.futex, except one,
which is at runtime.osyield.

And of all the times load tests were run which _did not_ show this issue,
it was run with a version greater than 1.14.1, which further points to the
theory that this bug is the culprit.

* Replace docker image versions.

* Update build/README.md

Co-authored-by: Elisabeth Kulzer <elikul@elikul.de>
Co-authored-by: Mattermod <mattermod@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.