Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: sometimes 100% CPU spin during init phase in Go 1.14 with preemptive scheduler #37741

Closed
mholt opened this issue Mar 7, 2020 · 28 comments
Closed
Milestone

Comments

@mholt
Copy link

@mholt mholt commented Mar 7, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/matt/Library/Caches/go-build"
GOENV="/Users/matt/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/matt/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/matt/Dev/caddyserver/caddy/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/97/r6t5z7zj0sd8pgh13bnwjh2w0000gn/T/go-build879531905=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Ran go run main.go, in my case, specifically in this folder.

What did you expect to see?

The program should run.

What did you see instead?

Sometimes, the CPU spins at 100% and the entry point of the program is never reached reached after anywhere from 5-50s (but there are probably no hard bounds on this, it's just what I've observed).

This is part of my typical development cycle where I write some code, then run the program, write more code, run the program, etc. Sometimes it happens, sometimes it doesn't. Even without changes, re-running the command will work. I can't pin it down, unfortunately, nor make a simpler test case.

I verified that the entry point main() is never reached, by changing main() so it is now:

func main() {
	log.Println("Got here")
	caddycmd.Main()
}

and can verify that "Got here" is never printed. Instead, the CPU spins:

Screen Shot 2020-03-07 at 4 36 40 PM

In the past, @whitestrake has reported a similar phenomenon independently, but I only started noticing it recently myself. So, I know it's not isolated to just my machine.

Let me know what more would be useful to debug this. Thanks!

@dmitshur
Copy link
Member

@dmitshur dmitshur commented Mar 8, 2020

How often does this happen? Is it possible to reproduce by trying it some number of times?

Does it happen with go build too, or only go run main.go?

All imported packages get initialized before main is executed (see https://golang.org/ref/spec#Package_initialization for details). Is it possible some of your imported packages have init() that sometimes take too long? Or top level variable declarations that call functions?

In the screenshot you posted, the process name is main, which I believe means what's running and using 100% CPU is the binary that was built, not the go command.

@mholt
Copy link
Author

@mholt mholt commented Mar 8, 2020

That's a good point about init(), I could put the log line into init() but I guess that won't help since inits are called in a nondeterminstic (or at least unknown) order, so any offending init() might be called first.

You're probably right... although I haven't seen it happen with go build or any non-temporary binaries. I'll get more info and reopen if I can pin it down. Thanks

@mholt mholt closed this Mar 8, 2020
@dmitshur
Copy link
Member

@dmitshur dmitshur commented Mar 8, 2020

To clarify, package initialization is done in a deterministic order, as defined in the Go specification.

I agree it might be hard to know what init functions are being called when you are importing many packages transitively. Maybe a debugger or some other tool can help when investigating an issue like this.

@mholt
Copy link
Author

@mholt mholt commented Mar 8, 2020

To clarify, package initialization is done in a deterministic order,

Yes, but it's pretty opaque in practice :) i.e. it's not trivial to know in a large project the exact order of all instances of init(). If there's a tool that can print that out, that could be helpful in this case. It's deterministic to the compiler, but not to the program being compiled, since the program being compiled doesn't have the scope of perspective of the compiler AFAIK. Anyway, thanks!

@mholt
Copy link
Author

@mholt mholt commented Mar 9, 2020

@dmitshur Do you know of a way I can run a profile that starts before the init phase? Based on your feedback, I'm 99% sure this is a bad init() function or bad var foo = ... declaration somewhere. So I need to start a profile before all those are run.

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 9, 2020

Starting the profiler early would be hard. Can you run under a debugger, then just ctrl-C when it hangs and see where the code is executing?

@mholt
Copy link
Author

@mholt mholt commented Mar 9, 2020

Nevermind! Gopher Slack is awesome. Someone gave me a tip: pkill -ABRT main and then you get a stack trace and registers. Worked like a charm. Thanks everyone!

@mholt
Copy link
Author

@mholt mholt commented Mar 10, 2020

@dmitshur @randall77 The plot thickens, as I can only reproduce the bug in Go 1.14 and not Go 1.13: census-instrumentation/opencensus-go#1200

I suspect this might be a bug in the Go scheduler? One or more goroutines are blocking on select, but the stack trace of the active goroutine is still "unavailable". That linked issue has more details.

Any advice on where to go from here? I wish I could make the reproduction more minimal, but I just haven't figured it out as it seems dependent on importing (and using) these packages (even though the "using" code doesn't even get reached).

@mholt mholt reopened this Mar 10, 2020
@dmitshur dmitshur added this to the Go1.15 milestone Mar 10, 2020
@dmitshur dmitshur changed the title macOS: 100% cpu and program never starts when using `go run` (sometimes) runtime: sometimes 100% CPU spin during init() in Go 1.14 Mar 10, 2020
@dmitshur
Copy link
Member

@dmitshur dmitshur commented Mar 10, 2020

Also /cc @aclements per owners.

@mholt
Copy link
Author

@mholt mholt commented Mar 10, 2020

As per census-instrumentation/opencensus-go#1200 (comment) it seems pretty clear that this is likely a bug in the Go 1.14 preemptive scheduler. When I disable it, the bug does not manifest.

@mholt mholt changed the title runtime: sometimes 100% CPU spin during init() in Go 1.14 runtime: sometimes 100% CPU spin during init phase in Go 1.14 with preemptive scheduler Mar 11, 2020
@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 11, 2020

I cannot reproduce:

$ mkdir issue37741
$ cd issue37741
$ git clone https://github.com/caddyserver/caddy
$ cd caddy
$ git checkout 6011ce120a4b2df9314a800fc93a61f852b766c1
$ cd cmd/caddy
$ ~/go1.14/bin/go build main.go
$ repeat 1000 (./main; sleep 0.5; date)

Runs 1000 times without fail.
I'm on OSX 10.15.2.
Anything I'm missing from the repro instructions?

@mholt
Copy link
Author

@mholt mholt commented Mar 11, 2020

@randall77 Thanks for trying! That's a pretty old commit, that doesn't add the dependencies which trigger the bug. Please try the linked branch, currently at commit f0f5490b9c0d07ae3acc35d559c5813f09c8a219. Thanks!

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 11, 2020

Ok, now I can reproduce. Thanks.

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 11, 2020

Also reproduces at tip.

Seems to be stuck in an endless attempt-to-preempt loop.
With this patch:

--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -329,12 +329,21 @@ func doSigPreempt(gp *g, ctxt *sigctxt) {
        if wantAsyncPreempt(gp) && isAsyncSafePoint(gp, ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr()) {
                // Inject a call to asyncPreempt.
                ctxt.pushCall(funcPC(asyncPreempt))
+               nfail = 0
+       }
+       println("not at safepoint", nfail)
+       nfail++
+       if nfail == 1000 {
+          traceback(ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr(), gp)
+          throw("BAD")
        }
 
        // Acknowledge the preemption.
        atomic.Xadd(&gp.m.preemptGen, 1)
 }
 
+var nfail int64
+

It will occasionally fail (and without the throw, increment nfail forever).

The stack trace it prints is simple:

github.com/yuin/goldmark/util.init()
	/Users/khr/gopath/pkg/mod/github.com/yuin/goldmark@v1.1.24/util/html5entities.go:747 +0xcb886

The line number fluctuates from run to run. It seems to always be in a big map initializer, that looks like this:

var html5entities = map[string]*HTML5Entity{
	"AElig":                           {Name: "AElig", CodePoints: []int{198}, Characters: []byte{0xc3, 0x86}},
	"AMP":                             {Name: "AMP", CodePoints: []int{38}, Characters: []byte{0x26}},
	"Aacute":                          {Name: "Aacute", CodePoints: []int{193}, Characters: []byte{0xc3, 0x81}},
...
	"zscr":                            {Name: "zscr", CodePoints: []int{120015}, Characters: []byte{0xf0, 0x9d, 0x93, 0x8f}},
	"zwj":                             {Name: "zwj", CodePoints: []int{8205}, Characters: []byte{0xe2, 0x80, 0x8d}},
	"zwnj":                            {Name: "zwnj", CodePoints: []int{8204}, Characters: []byte{0xe2, 0x80, 0x8c}},
}

With over 2000+ lines.

Each entry is inserted separately, so there are ~2000 chunks of assembly like this:

  html5entities.go:20           0x1b06407               488d05d2064600          LEAQ _runtime.types+1834016(SB), AX                             
  html5entities.go:20           0x1b0640e               48890424                MOVQ AX, 0(SP)                                                  
  html5entities.go:20           0x1b06412               e8a99a50ff              CALL _runtime.newobject(SB)                                     
  html5entities.go:20           0x1b06417               488b442408              MOVQ 0x8(SP), AX                                                
  html5entities.go:20           0x1b0641c               48898424a82e0000        MOVQ AX, 0x2ea8(SP)                                             
  html5entities.go:20           0x1b06424               48c7400806000000        MOVQ $0x6, 0x8(AX)                                              
  html5entities.go:20           0x1b0642c               488d0dd81e5600          LEAQ _go.string.*+29427(SB), CX                                 
  html5entities.go:20           0x1b06433               488908                  MOVQ CX, 0(AX)                                                  
  html5entities.go:20           0x1b06436               488d15235e3600          LEAQ _runtime.types+807840(SB), DX                              
  html5entities.go:20           0x1b0643d               48891424                MOVQ DX, 0(SP)                                                  
  html5entities.go:20           0x1b06441               e87a9a50ff              CALL _runtime.newobject(SB)                                     
  html5entities.go:20           0x1b06446               488b442408              MOVQ 0x8(SP), AX                                                
  html5entities.go:20           0x1b0644b               48c700c1000000          MOVQ $0xc1, 0(AX)                                               
  html5entities.go:20           0x1b06452               488b8c24a82e0000        MOVQ 0x2ea8(SP), CX                                             
  html5entities.go:20           0x1b0645a               48c7411801000000        MOVQ $0x1, 0x18(CX)                                             
  html5entities.go:20           0x1b06462               48c7412001000000        MOVQ $0x1, 0x20(CX)                                             
  html5entities.go:20           0x1b0646a               833d0f832c0100          CMPL $0x0, _runtime.writeBarrier(SB)                            
  html5entities.go:20           0x1b06471               0f8551380d00            JNE 0x1bd9cc8                                                   
  html5entities.go:20           0x1b06477               48894110                MOVQ AX, 0x10(CX)                                               
  html5entities.go:20           0x1b0647b               488d051ea83600          LEAQ _runtime.types+826848(SB), AX                              
  html5entities.go:20           0x1b06482               48890424                MOVQ AX, 0(SP)                                                  
  html5entities.go:20           0x1b06486               e8359a50ff              CALL _runtime.newobject(SB)                                     
  html5entities.go:20           0x1b0648b               488b442408              MOVQ 0x8(SP), AX                                                
  html5entities.go:20           0x1b06490               66c700c381              MOVW $-0x7e3d, 0(AX)                                            
  html5entities.go:20           0x1b06495               488b8c24a82e0000        MOVQ 0x2ea8(SP), CX                                             
  html5entities.go:20           0x1b0649d               48c7413002000000        MOVQ $0x2, 0x30(CX)                                             
  html5entities.go:20           0x1b064a5               48c7413802000000        MOVQ $0x2, 0x38(CX)                                             
  html5entities.go:20           0x1b064ad               833dcc822c0100          CMPL $0x0, _runtime.writeBarrier(SB)                            
  html5entities.go:20           0x1b064b4               0f8500380d00            JNE 0x1bd9cba                                                   
  html5entities.go:20           0x1b064ba               48894128                MOVQ AX, 0x28(CX)                                               
  html5entities.go:17           0x1b064be               488d055bfc3b00          LEAQ _runtime.types+1176160(SB), AX                             
  html5entities.go:17           0x1b064c5               48890424                MOVQ AX, 0(SP)                                                  
  html5entities.go:17           0x1b064c9               488b8c24b02e0000        MOVQ 0x2eb0(SP), CX                                             
  html5entities.go:17           0x1b064d1               48894c2408              MOVQ CX, 0x8(SP)                                                
  html5entities.go:17           0x1b064d6               488d152e1e5600          LEAQ _go.string.*+29427(SB), DX                                 
  html5entities.go:17           0x1b064dd               4889542410              MOVQ DX, 0x10(SP)                                               
  html5entities.go:17           0x1b064e2               48c744241806000000      MOVQ $0x6, 0x18(SP)                                             
  html5entities.go:17           0x1b064eb               e830fd50ff              CALL _runtime.mapassign_faststr(SB)                             
  html5entities.go:17           0x1b064f0               488b7c2420              MOVQ 0x20(SP), DI                                               
  html5entities.go:17           0x1b064f5               8407                    TESTB AL, 0(DI)                                                 
  html5entities.go:17           0x1b064f7               833d82822c0100          CMPL $0x0, _runtime.writeBarrier(SB)                            
  html5entities.go:17           0x1b064fe               0f85a4370d00            JNE 0x1bd9ca8                                                   
  html5entities.go:17           0x1b06504               488b8424a82e0000        MOVQ 0x2ea8(SP), AX                                             
  html5entities.go:17           0x1b0650c               488907                  MOVQ AX, 0(DI)                                                  

@cherryyz, how do I tell whether the code here is marked as preemptable?
Total guess, but the runtime is probably trying to start a GC. It's trying to interrupt this code to do so. The code is always sent a signal at a bad place, so it resumes and runs a bit more. Eventually it hits a write barrier or an allocation that the runtime stops it at, saying "no more! I need to GC!". Of course that's at a non-preemptable location. Deadlock ensues.
Kind of a half-baked theory. I'll look some more.

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 11, 2020

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 11, 2020

If I look at the places where preempt is attempted (change the println in my patch above to println("not at safepoint", nfail, hex(ctxt.sigpc())), I get something like this:

not at safepoint 0 0x1045b79
not at safepoint 1 0x7fff68a01d36
not at safepoint 2 0x10225d7
not at safepoint 3 0x10225d7
not at safepoint 4 0x10225d7
not at safepoint 5 0x1022896
not at safepoint 6 0x10228cd
not at safepoint 7 0x7fff68ac1ace
not at safepoint 8 0x1022095
not at safepoint 9 0x1017ad9
not at safepoint 10 0x7fff68a00ce6
not at safepoint 11 0x1022f8e
not at safepoint 12 0x7fff68a047fa
not at safepoint 13 0x7fff68a047fa
not at safepoint 14 0x7fff68a00ce6
not at safepoint 15 0x7fff68a047fa
not at safepoint 16 0x10229a2
not at safepoint 17 0x10229a2
not at safepoint 0 0x106ade0
not at safepoint 1 0x106ade0
not at safepoint 2 0x1068be6
not at safepoint 3 0x1bd0ff8
not at safepoint 4 0x1bd0ff8

Then lots of that same address repeating, with an occasional one of these:

not at safepoint 46 0x7fff68a00ce6
not at safepoint 454 0x100e172

The common line is a call to the write barrier code, for a check in that big map constructor of the gc enabled bit. That is indeed a non-preemptable location.

  html5entities.go:17           0x1bd0ff8               488b8424a82e0000        MOVQ 0x2ea8(SP), AX                                             
  html5entities.go:17           0x1bd1000               e87b8549ff              CALL _runtime.gcWriteBarrier(SB)                                
  html5entities.go:17           0x1bd1005               e9a986f6ff              JMP 0x1b396b3                                                   

I don't see how, if we resumed at 0x1bd0ff8, we'd ever get back to that same address again. As long as we made at least one instruction of progress.

0x100e172 is in the runtime, in notetsleep_internal.
0x7fff68a00ce6 is presumably in vdso somewhere, also not preemptable.

My hunch is now that the thread is somehow not making any progress from 0x1bd0ff8 by the time we interrupt it again. My earlier theory, that the init thread was blocking on the runtime to do a GC, looks wrong.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Mar 11, 2020

(haven't read the whole thread, just answer this first)

how do I tell whether the code here is marked as preemptable?

Compiler's -d=pctab=pctopcdata flag will do. Look for -2. It is not preemptible from a -2 (inclusive) to the next number that is not -2 (exclusive)

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 11, 2020

This patch seems to fix the bug:

--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4680,6 +4680,7 @@ func preemptone(_p_ *p) bool {
        if preemptMSupported && debug.asyncpreemptoff == 0 {
                _p_.preempt = true
                preemptM(mp)
+               usleep(1000)
        }
 
        return true

I suspect that we're issuing preemptone requests too quickly, and the target M never has the chance to make progress. The sleep gives the OS time to schedule it.

We should probably rate limit preemptM requests to any (all?) targets. We do such a thing in suspendG in runtime/preempt.go, just not preemptone in runtime/proc.go.

This definitely seems to be a fundamental problem with async preempt + give up and retry...

It would be really nice to ask the OS to "interrupt the target thread only after it has been given at least a bit of time slice".

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 12, 2020

Attached is a single-file repro. It hangs pretty reliably when run 100 times.
Rename it from test.txt to test.go. For some reason you can't attach *.go files in github. Isn't this a programming website?

test.txt

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 12, 2020

Here's a pretty simple fix.

--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4667,6 +4667,15 @@ func preemptone(_p_ *p) bool {
        if gp == nil || gp == mp.g0 {
                return false
        }
+       if gp.preempt {
+               // Someone is already trying to preempt the goroutine
+               // running on this P. Don't try again, that can lead
+               // to lots of signals heading to the preempted thread,
+               // causing it not to make any progress on its actual code.
+               // That causes live-lock if the actual code is not
+               // at a preemptible point. See issue 37741.
+               return false
+       }
 
        gp.preempt = true 

Now I just need a decent test. The repro above is fine for one-offs, but it's too expensive for a checked-in test.

@shawn-xdji
Copy link
Contributor

@shawn-xdji shawn-xdji commented Mar 12, 2020

Edit:
The build finally passed if waiting longer, multiple times slower.

time -p ./make.bash

w./o the fix:
real 81.00
user 349.97
sys 96.57

w. the fix:
real 698.99
user 617.18
sys 1615.17

=============================================================
Hi @randall77 , building on an arm64 machine hangs with the fix, the machine used to experience long running time when testing 'reflect' package with "internal linking of -buildmode=pie", as mentioned in #37779 .

// command that hangs
/go/pkg/tool/linux_arm64/compile -o /tmp/go-build707266415/b014/pkg.a -trimpath /tmp/go-build707266415/b014=> -p sync -std -buildid ZruWUdr0RuQhPDUgpU

#0 runtime.sigtramp () at /home/ent-user/go/src/runtime/sys_linux_arm64.s:394 #1 <signal handler called> #2 0x00000000000234fc in runtime.spanOf (p=0x400082c000) at /home/ent-user/go/src/runtime/mheap.go:628 #3 runtime.markBitsForAddr (p=0x400082c000, ~r1=...) at /home/ent-user/go/src/runtime/mbitmap.go:271 #4 0x000000000002f4f8 in runtime.gcmarknewobject (obj=0x400082c000, size=0x2000, scanSize=0x1fa0) at /home/ent-user/go/src/runtime/mgcmark.go:1524 #5 0x000000000001c3a4 in runtime.mallocgc (size=0x2000, typ=0x720b20, needzero=0x1, ~r3=<optimized out>) at /home/ent-user/go/src/runtime/malloc.go:1119 #6 0x0000000000059294 in runtime.makeslice (et=0x720b20, len=0x40, cap=0x40, ~r3=<optimized out>) at /home/ent-user/go/src/runtime/slice.go:49 #7 0x0000000000087928 in sync.(*Pool).pinSlow (p=0xd26800 <fmt.ppFree>, ~r0=<optimized out>, ~r1=<optimized out>) at /home/ent-user/go/src/sync/pool.go:227 #8 0x0000000000087830 in sync.(*Pool).pin (p=0xd26800 <fmt.ppFree>, ~r0=<optimized out>, ~r1=<optimized out>) at /home/ent-user/go/src/sync/pool.go:206 #9 0x0000000000087518 in sync.(*Pool).Get (p=0xd26800 <fmt.ppFree>, ~r0=...) at /home/ent-user/go/src/sync/pool.go:128 #10 0x00000000000c05ac in fmt.newPrinter (~r0=<optimized out>) at /home/ent-user/go/src/fmt/print.go:137 #11 0x00000000000c0b30 in fmt.Sprintf (Python Exception <class 'UnicodeEncodeError'> 'ascii' codec can't encode character '\xb7' in position 8: ordinal not in range(128): format=, a= []interface {} = {...}, ~r2=...) at /home/ent-user/go/src/fmt/print.go:218 #12 0x000000000066bb24 in cmd/compile/internal/gc.(*Liveness).emit.func1 (tmpSym=0x4000efe000, ~r1=<optimized out>) at /home/ent-user/go/src/cmd/compile/internal/gc/plive.go:1479 #13 0x00000000005c47a8 in cmd/compile/internal/gc.(*Liveness).emit (lv=0x4000103980, argsSym=<optimized out>, liveSym=<optimized out>, regsSym=<optimized out>) at /home/ent-user/go/src/cmd/compile/internal/gc/plive.go:1483 #14 0x00000000005c4b4c in cmd/compile/internal/gc.liveness (e=0x40004d1410, f=0x4000544f20, pp=0x4000565c00, ~r3=...) at /home/ent-user/go/src/cmd/compile/internal/gc/plive.go:1529 #15 0x00000000006141c0 in cmd/compile/internal/gc.genssa (f=0x4000544f20, pp=0x4000565c00) at /home/ent-user/go/src/cmd/compile/internal/gc/ssa.go:5908 #16 0x00000000005b6e80 in cmd/compile/internal/gc.compileSSA (fn=0x40002486e0, worker=0x2) at /home/ent-user/go/src/cmd/compile/internal/gc/pgen.go:308 #17 0x000000000066addc in cmd/compile/internal/gc.compileFunctions.func2 (c=0x4000bd5200, &wg=0x4000919650, worker=0x2) at /home/ent-user/go/src/cmd/compile/internal/gc/pgen.go:363 #18 0x0000000000072124 in runtime.goexit () at /home/ent-user/go/src/runtime/asm_arm64.s:1136

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 12, 2020

Change https://golang.org/cl/223240 mentions this issue: runtime: throttle signal senders when receivers aren't making progress

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 12, 2020

@shawn-xdji , could you try the CL above on your example?

@randall77
Copy link
Contributor

@randall77 randall77 commented Mar 12, 2020

@gopherbot please open a backport issue to 1.14.1.

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 12, 2020

Backport issue(s) opened: #37833 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@shawn-xdji
Copy link
Contributor

@shawn-xdji shawn-xdji commented Mar 13, 2020

Hi @randall77 thanks, build time backs to normal and is even better now, 'sys' is below 80s consistently, compared with 100s or so without the fix.
And the long running time issue of testing 'reflect' package is addressed as well.

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 17, 2020

Change https://golang.org/cl/223737 mentions this issue: runtime: don't send preemption signal if there is a signal pending

@gopherbot gopherbot closed this in 0c0e8f2 Mar 18, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Mar 18, 2020

Change https://golang.org/cl/223939 mentions this issue: [release-branch.go1.14] runtime: don't send preemption signal if there is a signal pending

gopherbot pushed a commit that referenced this issue Mar 18, 2020
…e is a signal pending

If multiple threads call preemptone to preempt the same M, it may
send many signals to the same M such that it hardly make
progress, causing live-lock problem. Only send a signal if there
isn't already one pending.

Updates #37741.
Fixes #37833.

Change-Id: Id94adb0b95acbd18b23abe637a8dcd81ab41b452
Reviewed-on: https://go-review.googlesource.com/c/go/+/223737
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
(cherry picked from commit 0c0e8f2)
Reviewed-on: https://go-review.googlesource.com/c/go/+/223939
Reviewed-by: Austin Clements <austin@google.com>
mrsdizzie added a commit to mrsdizzie/gitea that referenced this issue Apr 28, 2020
…14 in git env

As seen in trouble shooting go-gitea#11032 the new feature of Go 1.14 is causing several second delays in startup in certain situations. Debugging shows it spending several seconds handling SIGURG commands during init:

```
6922:04:51.984234 trace init() ./modules/queue/unique_queue_wrapped.go
remote: ) = 69 <0.000012>
remote: [pid 15984] 22:04:51 write(1, "\ttime taken: 236.761\302\265s\n\n", 25    time taken: 236.761µs
remote:
remote: ) = 25 <0.000011>
remote: [pid 15984] 22:04:51 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
```

This causes up to 20 seconds added to a push in some cases as it happens for each call of the gitea hook command. This is likely the cause of go-gitea#10661 as well and would start to effect users once we release 1.12 which would be the first release compiled with Go 1.14. I suspect this is just a slight issue with the upstream implementatation as there have been a few very similar bugs fixed and reported:

 golang/go#37741
 golang/go#37942

We should revisit this in the future and see if a newer version of Go has solved it, but for now disable this option in the environment that gitea hook runs in to avoid it.
techknowlogick pushed a commit to go-gitea/gitea that referenced this issue Apr 28, 2020
…14 in git env (#11237)

As seen in trouble shooting #11032 the new feature of Go 1.14 is causing several second delays in startup in certain situations. Debugging shows it spending several seconds handling SIGURG commands during init:

```
6922:04:51.984234 trace init() ./modules/queue/unique_queue_wrapped.go
remote: ) = 69 <0.000012>
remote: [pid 15984] 22:04:51 write(1, "\ttime taken: 236.761\302\265s\n\n", 25    time taken: 236.761µs
remote:
remote: ) = 25 <0.000011>
remote: [pid 15984] 22:04:51 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
```

This causes up to 20 seconds added to a push in some cases as it happens for each call of the gitea hook command. This is likely the cause of #10661 as well and would start to effect users once we release 1.12 which would be the first release compiled with Go 1.14. I suspect this is just a slight issue with the upstream implementatation as there have been a few very similar bugs fixed and reported:

 golang/go#37741
 golang/go#37942

We should revisit this in the future and see if a newer version of Go has solved it, but for now disable this option in the environment that gitea hook runs in to avoid it.
ydelafollye pushed a commit to ydelafollye/gitea that referenced this issue Jul 31, 2020
…14 in git env (go-gitea#11237)

As seen in trouble shooting go-gitea#11032 the new feature of Go 1.14 is causing several second delays in startup in certain situations. Debugging shows it spending several seconds handling SIGURG commands during init:

```
6922:04:51.984234 trace init() ./modules/queue/unique_queue_wrapped.go
remote: ) = 69 <0.000012>
remote: [pid 15984] 22:04:51 write(1, "\ttime taken: 236.761\302\265s\n\n", 25    time taken: 236.761µs
remote:
remote: ) = 25 <0.000011>
remote: [pid 15984] 22:04:51 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} ---
```

This causes up to 20 seconds added to a push in some cases as it happens for each call of the gitea hook command. This is likely the cause of go-gitea#10661 as well and would start to effect users once we release 1.12 which would be the first release compiled with Go 1.14. I suspect this is just a slight issue with the upstream implementatation as there have been a few very similar bugs fixed and reported:

 golang/go#37741
 golang/go#37942

We should revisit this in the future and see if a newer version of Go has solved it, but for now disable this option in the environment that gitea hook runs in to avoid it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.