runtime: error message: P has cached GC work at end of mark termination #27993

ianlancetaylor opened this Issue Oct 3, 2018 · 33 comments


odeke-em (Member) commented Nov 9, 2018

Hey @griesemer, in your comment #27993 (comment):

See #27993.

you self-referenced this same issue (#27993). Which issue did you mean to refer to?

griesemer (Contributor) commented Nov 9, 2018

@odeke-em This was just an e-mail reply - I missed that the mail already referred to this issue. Please ignore.

aclements (Member) commented Nov 19, 2018

$ greplogs -e 'P has cached GC work' -dashboard -md -l
2018-11-16T23:30:19-6797b32/netbsd-amd64-8_0
2018-11-16T19:46:17-6157dda/linux-amd64-stretch
2018-11-16T17:49:55-53ed92d/nacl-386
2018-11-16T17:33:54-a1025ba/nacl-386
2018-11-14T21:47:50-0a40d45/freebsd-amd64-11_1
2018-11-14T20:32:15-f36e92d/android-arm-wiko-fever
2018-11-13T20:46:39-af07f77/windows-386-2008
2018-11-13T15:08:13-e51b19a/freebsd-amd64-10_3
2018-11-12T20:46:39-7f1dd3a/linux-386-387
2018-11-12T20:46:25-52b2220/windows-amd64-2008
2018-11-12T20:43:55-106db71/linux-386-387
2018-11-12T20:27:21-ec4ae29/linux-mipsle
2018-11-12T20:27:14-4f3604d/android-arm64-wiko-fever
2018-11-10T16:04:18-e4c1fee/solaris-amd64-oraclerel
...
2018-10-11T16:31:24-689321e/freebsd-386-11_2
2018-10-10T14:55:17-29907b1/freebsd-386-10_4
2018-10-10T04:29:55-8f9902d/freebsd-amd64-10_4
2018-10-09T18:20:23-416804f/windows-386-2008
2018-10-09T18:19:59-7d2f46d/freebsd-amd64-10_3
2018-10-08T17:47:49-26d2260/windows-arm
2018-10-06T19:18:34-2bb91e0/nacl-amd64p32
2018-10-06T15:40:03-f90e89e/linux-amd64-nocgo
2018-10-05T21:53:34-9d90716/freebsd-amd64-race
2018-10-03T22:50:25-1961d8d/freebsd-amd64-10_4
2018-10-03T19:54:23-9dac0a8/linux-amd64-nocgo
2018-10-03T17:40:17-048de7b/freebsd-386-11_2
2018-10-03T02:09:38-06ff477/freebsd-386-10_4

It clearly happens on all GOOSes:

      1 darwin
      1 js
      1 solaris
      2 android
      2 netbsd
      3 plan9
      4 openbsd
      5 nacl
      8 windows
     19 linux
     23 freebsd

And has shown up on plenty of GOARCHes:

      1 arm64
      1 wasm
      2 amd64p32
      3 mipsle
      5 arm
     24 386
     33 amd64
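Tallies like the ones above can be derived from the greplogs lines; a small standalone sketch (it assumes builder names take the form GOOS-GOARCH[-variant] after the slash, which holds for the builders listed here):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// goosCounts tallies GOOS occurrences from greplogs lines such as
// "2018-11-16T23:30:19-6797b32/netbsd-amd64-8_0".
func goosCounts(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, l := range lines {
		_, builder, ok := strings.Cut(l, "/")
		if !ok {
			continue
		}
		// builder is GOOS-GOARCH[-variant]; the first segment is GOOS.
		goos, _, _ := strings.Cut(builder, "-")
		counts[goos]++
	}
	return counts
}

func main() {
	lines := []string{
		"2018-11-16T23:30:19-6797b32/netbsd-amd64-8_0",
		"2018-11-16T19:46:17-6157dda/linux-amd64-stretch",
		"2018-11-16T17:49:55-53ed92d/nacl-386",
	}
	counts := goosCounts(lines)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%7d %s\n", counts[k], k)
	}
}
```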

The failure is usually, though not always, in cmd/go, but that might just be sampling bias, and the exact subcommand that's running during the failure varies.

gopherbot commented Nov 19, 2018

Change https://golang.org/cl/149968 mentions this issue: runtime: improve "P has cached GC work" debug info

gopherbot commented Nov 19, 2018

Change https://golang.org/cl/149969 mentions this issue: runtime: debug code to catch bad gcWork.puts


gopherbot pushed a commit that referenced this issue Nov 21, 2018

runtime: improve "P has cached GC work" debug info
For #27993.

Change-Id: I20127e8a9844c2c488f38e1ab1f8f5a27a5df03e
Reviewed-on: https://go-review.googlesource.com/c/149968
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
gopherbot commented Nov 21, 2018

Change https://golang.org/cl/150778 mentions this issue: Revert "runtime: debug code to catch bad gcWork.puts"

gopherbot pushed a commit that referenced this issue Nov 21, 2018

runtime: debug code to catch bad gcWork.puts
This adds a debug check to throw immediately if any pointers are added
to the gcWork buffer after the mark completion barrier. The intent is
to catch the source of the cached GC work that occasionally produces
"P has cached GC work at end of mark termination" failures.

The result should be that we get "throwOnGCWork" throws instead of "P
has cached GC work at end of mark termination" throws, but with useful
stack traces.

This should be reverted before the release. I've been unable to
reproduce this issue locally, but this issue appears fairly regularly
on the builders, so the intent is to catch it on the builders.

This probably slows down the GC slightly.

For #27993.

Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c
Reviewed-on: https://go-review.googlesource.com/c/149969
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

bradfitz pushed a commit that referenced this issue Nov 21, 2018

runtime: improve "P has cached GC work" debug info
For #27993.

Change-Id: I20127e8a9844c2c488f38e1ab1f8f5a27a5df03e
Reviewed-on: https://go-review.googlesource.com/c/149968
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

bradfitz pushed a commit that referenced this issue Nov 21, 2018

runtime: debug code to catch bad gcWork.puts
This adds a debug check to throw immediately if any pointers are added
to the gcWork buffer after the mark completion barrier. The intent is
to catch the source of the cached GC work that occasionally produces
"P has cached GC work at end of mark termination" failures.

The result should be that we get "throwOnGCWork" throws instead of "P
has cached GC work at end of mark termination" throws, but with useful
stack traces.

This should be reverted before the release. I've been unable to
reproduce this issue locally, but this issue appears fairly regularly
on the builders, so the intent is to catch it on the builders.

This probably slows down the GC slightly.

For #27993.

Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c
Reviewed-on: https://go-review.googlesource.com/c/149969
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
aclements (Member) commented Nov 26, 2018

I got a little data over the weekend with the added debug code.

$ greplogs -dashboard -E "flushedWork|throwOnGCWork" -md
2018-11-26T14:13:53-9fe9853/darwin-amd64-10_12:

runtime: P 5 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-23T21:21:35-c6d4939/linux-amd64-noopt:

fatal error: throwOnGCWork

runtime stack:
runtime.throw(0x6f386c, 0xd)
	/workdir/go/src/runtime/panic.go:608 +0x72 fp=0x7f034b7fdd48 sp=0x7f034b7fdd18 pc=0x42f4a2
runtime.(*gcWork).putBatch(0xc000021770, 0xc0000217a8, 0x13, 0x200)
	/workdir/go/src/runtime/mgcwork.go:182 +0x1e0 fp=0x7f034b7fdd90 sp=0x7f034b7fdd48 pc=0x424aa0
runtime.wbBufFlush1(0xc000020500)
	/workdir/go/src/runtime/mwbbuf.go:277 +0x1ba fp=0x7f034b7fdde8 sp=0x7f034b7fdd90 pc=0x42b87a
runtime.gcMark(0xa4a16143b5)
	/workdir/go/src/runtime/mgc.go:1932 +0x10b fp=0x7f034b7fde80 sp=0x7f034b7fdde8 pc=0x41e22b
runtime.gcMarkTermination.func1()
	/workdir/go/src/runtime/mgc.go:1501 +0x2a fp=0x7f034b7fde98 sp=0x7f034b7fde80 pc=0x45c57a
runtime.systemstack(0x1)
	/workdir/go/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f034b7fdea0 sp=0x7f034b7fde98 pc=0x45f336
runtime.mstart()
	/workdir/go/src/runtime/proc.go:1153 fp=0x7f034b7fdea8 sp=0x7f034b7fdea0 pc=0x4339c0

goroutine 181 [garbage collection]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:311 fp=0xc000342d60 sp=0xc000342d58 pc=0x45f2c0
runtime.gcMarkTermination(0x3fe295012d50d6c8)
	/workdir/go/src/runtime/mgc.go:1500 +0x178 fp=0xc000342f20 sp=0xc000342d60 pc=0x41d218
runtime.gcMarkDone()
	/workdir/go/src/runtime/mgc.go:1475 +0x168 fp=0xc000342f60 sp=0xc000342f20 pc=0x41cff8
runtime.gcBgMarkWorker(0xc00001e000)
	/workdir/go/src/runtime/mgc.go:1858 +0x294 fp=0xc000342fd8 sp=0xc000342f60 pc=0x41df14
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000342fe0 sp=0xc000342fd8 pc=0x461391
created by runtime.gcBgMarkStartWorkers
	/workdir/go/src/runtime/mgc.go:1679 +0x77

2018-11-21T16:28:17-2a7f904/linux-386-clang:

runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-21T16:28:17-2a7f904/linux-mipsle:

runtime: P 0 flushedWork false wbuf1.n=2 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination
aclements (Member) commented Nov 26, 2018

It's really interesting that in three of the four failures, it did not trip the check on adding work to the gcWork. There's a pretty narrow window when we're still running the termination detection algorithm where work can still be added without triggering this check. Perhaps for debugging we should do a second round after setting throwOnGCWork to see if anything was added during this window.

In the one throwOnGCWork failure, it was flushing a write barrier buffer that apparently had 19 (0x13) unmarked pointers in it. Perhaps for debugging we should disable the write barrier buffers, too, when we set throwOnGCWork.
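The shape of the debug check under discussion can be illustrated with a minimal standalone sketch. The names (throwOnGCWork, checkPut) mirror the debug CL, but this is an illustration, not the runtime's actual code:

```go
package main

import "fmt"

// throwOnGCWork mirrors the debug flag set once the mark completion
// algorithm has decided there is no more work: any put after this
// point indicates the bug being hunted.
var throwOnGCWork bool

// gcWorkBuf is a stand-in for the per-P gcWork buffer.
type gcWorkBuf struct {
	ptrs []uintptr
}

// checkPut panics if work arrives after completion was declared, so the
// panic captures a stack trace at the moment of the bad put rather than
// later, at mark termination.
func (w *gcWorkBuf) checkPut() {
	if throwOnGCWork {
		panic("throwOnGCWork")
	}
}

func (w *gcWorkBuf) put(p uintptr) {
	w.checkPut()
	w.ptrs = append(w.ptrs, p)
}

func main() {
	var w gcWorkBuf
	w.put(0x1000) // fine: completion not yet declared

	throwOnGCWork = true // completion decided: buffers must stay empty
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("caught:", r)
		}
	}()
	w.put(0x2000) // trips the check, as the debug CL intends
}
```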

aclements (Member) commented Nov 26, 2018

FWIW, I was able to reproduce this once out of 1,334 runs of all.bash on my linux/amd64 box.

##### ../test
# go run run.go -- recover2.go
signal: aborted
runtime: P 5 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination
andybons (Member) commented Nov 28, 2018

@aclements how serious is this and how long do you think it will take to fix (worst question to ask I know)? We’re asking because we may delay the beta due to it so getting more insight into the scope of the issue will help us with that decision. Thanks.

bcmills (Member) commented Nov 28, 2018

Another repro in https://storage.googleapis.com/go-build-log/f64c1069/windows-386-2008_34d5c37b.log, if it helps:

            > go list -f '{{.ImportPath}}: {{.Match}}' all ... example.com/m/... ./... ./xyz...
            [stderr]
            runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
            fatal error: P has cached GC work at end of mark termination
aclements (Member) commented Nov 29, 2018

@aclements how serious is this and how long do you think it will take to fix (worst question to ask I know)?

In terms of seriousness, it can cause crashes, but not corruption. I was only able to reproduce it once out of 1,334 runs of all.bash, though it seems to be somewhat more common on the buildbots.

I still have no idea what's causing this bug, though, so it's hard to say what it will take to fix it. If I'm not able to figure it out, I do have a likely work-around in mind that is easy to implement (run a second round of the completion protocol before going into mark termination).

Today I'm planning to deploy more debugging code based on what (little) I gleaned from the debugging code I added last week.

aclements (Member) commented Nov 29, 2018

A few more failures. All still tripping over the later check.

2018-11-29T18:58:21-aab0b70/linux-amd64-ssacheck:

runtime: P 1 flushedWork false wbuf1.n=2 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-28T21:49:33-048580d/linux-arm:

runtime: P 0 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-28T16:27:48-71f9f4b/windows-amd64-2008:

runtime: P 2 flushedWork false wbuf1.n=2 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-27T15:21:22-cfac65b/linux-amd64-race:

runtime: P 2 flushedWork false wbuf1.n=2 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-26T23:37:00-63ac89b/openbsd-amd64-62:

runtime: P 3 flushedWork false wbuf1.n=4 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination

2018-11-26T19:24:54-9ab2ffe/nacl-386:

runtime: P 3 flushedWork false wbuf1.n=1 wbuf2.n=0
fatal error: P has cached GC work at end of mark termination
gopherbot commented Nov 29, 2018

Change https://golang.org/cl/151797 mentions this issue: runtime: check more work flushing races

gopherbot pushed a commit that referenced this issue Nov 29, 2018

runtime: check more work flushing races
This adds several new checks to help debug #27993. It adds a mechanism
for freezing write barriers and gcWork puts during the mark completion
algorithm. This way, if we do detect mark completion, we can catch any
puts that happened during the completion algorithm. Based on build
dashboard failures, this seems to be the window of time when these are
happening.

This also double-checks that all work buffers are empty immediately
upon entering mark termination (much earlier than the current check).
This is unlikely to trigger based on the current failures, but is a
good safety net.

Change-Id: I03f56c48c4322069e28c50fbc3c15b2fee2130c2
Reviewed-on: https://go-review.googlesource.com/c/151797
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
aclements (Member) commented Nov 30, 2018

Got two failures with the new-and-improved debugging. I don't know what's going on yet, but these both happened in the exact same place: the write barrier in fmt.(*fmt).init. There are a few odd things going on here. Both the slot and the pointer being written are to an object that was just retrieved from a sync.Pool. Conveniently, the slot and the pointer are both to the same object, so we know that's the unmarked object. sync.Pool interacts with the GC in weird ways (which maybe violate the hybrid barrier assumptions?) and uses various atomic pointer writes, which also interact with the write barrier in slightly atypical ways.

2018-11-29T22:08:05-438b954/linux-amd64:

--- FAIL: TestGoTestRaceFailures (1.20s)
...
    go_test.go:3766: running testgo [test -race testrace -run XXX -bench .]
    go_test.go:3766: standard error:
    go_test.go:3766: fatal error: throwOnGCWork
        
        runtime stack:
        runtime.throw(0x9c785d, 0xd)
        	/workdir/go/src/runtime/panic.go:608 +0x72
        runtime.(*gcWork).checkPut(0xc00002d770)
        	/workdir/go/src/runtime/mgcwork.go:123 +0x5b
        runtime.(*gcWork).putBatch(0xc00002d770, 0xc00002d7a8, 0x1, 0x200)
        	/workdir/go/src/runtime/mgcwork.go:191 +0x3d
        runtime.wbBufFlush1(0xc00002c500)
        	/workdir/go/src/runtime/mwbbuf.go:277 +0x1ba
        runtime.wbBufFlush.func1()
        	/workdir/go/src/runtime/mwbbuf.go:200 +0x3a
        runtime.systemstack(0x0)
        	/workdir/go/src/runtime/asm_amd64.s:351 +0x66
        runtime.mstart()
        	/workdir/go/src/runtime/proc.go:1153
        
        goroutine 8 [running]:
        runtime.systemstack_switch()
        	/workdir/go/src/runtime/asm_amd64.s:311 fp=0xc0001e35c0 sp=0xc0001e35b8 pc=0x456850
        runtime.wbBufFlush(0xc0000841c0, 0xc000084180)
        	/workdir/go/src/runtime/mwbbuf.go:199 +0x47 fp=0xc0001e35e0 sp=0xc0001e35c0 pc=0x429397
        runtime.gcWriteBarrier(0xdf6550, 0x9af080, 0xc000084180, 0xc0001e36d8, 0x4c9a76, 0xc000084180, 0xc0001c9b00, 0xc0001e36d8, 0x40b012, 0xc000298a60, ...)
        	/workdir/go/src/runtime/asm_amd64.s:1424 +0xb4 fp=0xc0001e3668 sp=0xc0001e35e0 pc=0x4589e4
        fmt.(*fmt).init(...)
        	/workdir/go/src/fmt/format.go:59
        fmt.newPrinter(0xc000084180)
        	/workdir/go/src/fmt/print.go:137 +0x85 fp=0xc0001e3690 sp=0xc0001e3668 pc=0x4c9545
...

2018-11-30T15:17:34-42e8b9c/linux-s390x-ibm:

fatal error: throwOnGCWork

runtime stack:
runtime.throw(0x5f48b6, 0xd)
	/data/golang/workdir/go/src/runtime/panic.go:608 +0x72
runtime.(*gcWork).checkPut(0xc000038170)
	/data/golang/workdir/go/src/runtime/mgcwork.go:123 +0x72
runtime.(*gcWork).putBatch(0xc000038170, 0xc0000381a8, 0x2, 0x200)
	/data/golang/workdir/go/src/runtime/mgcwork.go:191 +0x38
runtime.wbBufFlush1(0xc000036f00)
	/data/golang/workdir/go/src/runtime/mwbbuf.go:277 +0xf0
runtime.wbBufFlush.func1()
	/data/golang/workdir/go/src/runtime/mwbbuf.go:200 +0x3c
runtime.systemstack(0x0)
	/data/golang/workdir/go/src/runtime/asm_s390x.s:293 +0x9c

goroutine 9 [running]:
runtime.systemstack_switch()
	/data/golang/workdir/go/src/runtime/asm_s390x.s:246 +0x10 fp=0xc000337110 sp=0xc000337108 pc=0x719b0
runtime.wbBufFlush(0xc000344040, 0xc000344000)
	/data/golang/workdir/go/src/runtime/mwbbuf.go:199 +0x60 fp=0xc000337128 sp=0xc000337110 pc=0x3f350
runtime.gcWriteBarrier(0xa2ffd0, 0x5db900, 0xc000344000, 0x435452, 0x30, 0x2e, 0x0, 0x0, 0x4353fe, 0x0, ...)
	/data/golang/workdir/go/src/runtime/asm_s390x.s:860 +0x88 fp=0xc000337198 sp=0xc000337128 pc=0x73d88
fmt.(*fmt).init(...)
	/data/golang/workdir/go/src/fmt/format.go:59
fmt.newPrinter(0x30)
	/data/golang/workdir/go/src/fmt/print.go:137 +0x8c fp=0xc0003371b8 sp=0xc000337198 pc=0xecd5c
...
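The code path in both traces follows fmt's printer-pool pattern, which looks roughly like the simplified stand-in below (types renamed; the key point is that the barriered write in init stores a pointer into the same pooled object that holds the slot, matching "the slot and the pointer are both to the same object"):

```go
package main

import (
	"fmt"
	"sync"
)

// buffer and fmtState stand in for fmt's buffer and fmt types.
type buffer []byte

type fmtState struct {
	buf *buffer // the slot written by init; the write barrier fires here
}

func (f *fmtState) init(buf *buffer) {
	f.buf = buf // slot &p.fmt.buf and value &p.buf are both inside p
}

// pp stands in for fmt's pooled printer struct.
type pp struct {
	buf buffer
	fmt fmtState
}

var ppFree = sync.Pool{New: func() interface{} { return new(pp) }}

// newPrinter retrieves a printer from the pool and initializes it,
// performing the pointer write observed in the crash traces.
func newPrinter() *pp {
	p := ppFree.Get().(*pp)
	p.fmt.init(&p.buf)
	return p
}

func main() {
	p := newPrinter()
	fmt.Println(p.fmt.buf == &p.buf) // true: slot and value share one object
	ppFree.Put(p)
}
```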
bradfitz (Member) commented Dec 1, 2018

Another on linux-amd64-race from a trybot:

https://storage.googleapis.com/go-build-log/a01231a0/linux-amd64-race_4cd43732.log

ok  	math/big	6.924s
ok  	math/bits	1.048s
ok  	math/cmplx	1.047s
ok  	math/rand	5.716s
ok  	mime	1.092s

##### 
fatal error: throwOnGCWork

runtime stack:
runtime.throw(0x9cd25d, 0xd)
	/workdir/go/src/runtime/panic.go:608 +0x72
runtime.(*gcWork).checkPut(0xc000034170)
	/workdir/go/src/runtime/mgcwork.go:123 +0x5b
runtime.(*gcWork).putBatch(0xc000034170, 0xc0000341a8, 0x1, 0x200)
	/workdir/go/src/runtime/mgcwork.go:191 +0x3d
runtime.wbBufFlush1(0xc000032f00)
	/workdir/go/src/runtime/mwbbuf.go:277 +0x1ba
runtime.wbBufFlush.func1()
	/workdir/go/src/runtime/mwbbuf.go:200 +0x3a
runtime.systemstack(0x456729)
	/workdir/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
	/workdir/go/src/runtime/proc.go:1153

goroutine 19 [running]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:311 fp=0xc000457280 sp=0xc000457278 pc=0x456820
runtime.wbBufFlush(0xc000514018, 0x0)
	/workdir/go/src/runtime/mwbbuf.go:199 +0x47 fp=0xc0004572a0 sp=0xc000457280 pc=0x429367
runtime.gcWriteBarrier(0xc000457370, 0xc000514030, 0x18, 0xc00038b680, 0xc0003e14a1, 0xc000514000, 0xc00044bda0, 0xc000457400, 0x7ff11b, 0xc00010e1a0, ...)
	/workdir/go/src/runtime/asm_amd64.s:1424 +0xb4 fp=0xc000457328 sp=0xc0004572a0 pc=0x4589b4
runtime.deferreturn(0xc00010e1a0)
	/workdir/go/src/runtime/panic.go:362 +0xb1 fp=0xc000457370 sp=0xc000457328 pc=0x42c551
cmd/internal/buildid.ReadFile(0xc0000278f0, 0x27, 0xc00044bda1, 0x29, 0x0, 0x0)
	/workdir/go/src/cmd/internal/buildid/buildid.go:101 +0x76b fp=0xc000457410 sp=0xc000457370 pc=0x7ff11b
cmd/go/internal/work.(*Builder).useCache(0xc000357f40, 0xc000360a00, 0xc00024f680, 0xd15a37b3a41ae4cd, 0xfc4a722dfa0d6a44, 0xe9b2998ab6f18a35, 0x114487c878347ff4, 0xc0000278f0, 0x27, 0x2c656e6f64203a01)
	/workdir/go/src/cmd/go/internal/work/buildid.go:440 +0xdf8 fp=0xc000457630 sp=0xc000457410 pc=0x80d868
cmd/go/internal/work.(*Builder).build(0xc000357f40, 0xc000360a00, 0x0, 0x0)
	/workdir/go/src/cmd/go/internal/work/exec.go:373 +0x55ef fp=0xc000457ea0 sp=0xc000457630 pc=0x81704f
cmd/go/internal/work.(*Builder).Do.func1(0xc000360a00)
	/workdir/go/src/cmd/go/internal/work/exec.go:107 +0x6c fp=0xc000457f28 sp=0xc000457ea0 pc=0x844c7c
cmd/go/internal/work.(*Builder).Do.func2(0xc000524404, 0xc000357f40, 0xc0002dd4e0)
	/workdir/go/src/cmd/go/internal/work/exec.go:165 +0x87 fp=0xc000457fc8 sp=0xc000457f28 pc=0x844f47
runtime.goexit()
	/workdir/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000457fd0 sp=0xc000457fc8 pc=0x4588f1
created by cmd/go/internal/work.(*Builder).Do
	/workdir/go/src/cmd/go/internal/work/exec.go:152 +0x39b
....
gopherbot commented Dec 6, 2018

Change https://golang.org/cl/152957 mentions this issue: runtime: print pointers being put in checkPut

aclements (Member) commented Dec 6, 2018

Analysis of some recent failures.

2018-12-05T20:59:58-cd47e89/solaris-amd64-smartosbuildlet

  • sync.Pool internals. fmt.(*pp).free(0xc00028a000) -> sync.(*Pool).Put. Assigning l.private = x, where l.private was nil, &l.private is 0xc0001eb788, and x is 0xc00028a000. It's marking 2 pointers, which is weird: given that l.private is nil, at most it should be marking 0xc00028a000. What's the other pointer?

2018-12-05T02:56:47-9be01c2/freebsd-386-10_3

  • defer internals. freedefer writing d.link = nil. &d.link is 0x3997c8d8. This is marking 2 pointers, which is weird: we know the new value is nil, so it should at most be marking the old value. One of the marks is presumably a _defer object.

2018-12-04T23:01:00-be09bdf/solaris-amd64-oraclerel

  • This one happened when gcMarkTermination was flushing write barrier buffers. I don't know how we could possibly not have caught this earlier, given all of the debugging code that's in there now.

2018-12-04T00:19:55-c2412a7/linux-amd64-stretch

  • Another object from a sync.Pool.

2018-12-03T20:09:50-f98081e/linux-amd64-noopt

  • defer internals. Writing _defer.fn. &_defer.fn is 0xc00025d5d8 and the new value is a static pointer (funcval 0xbf6118, seen in both deferproc, and wbBufFlush). The old value must have been nil because the defer was either just allocated or came from the pool, which contains only cleared _defer objects. Nevertheless, this led to flushing one pointer (seen in the third argument to putBatch).

2018-12-03T19:37:57-cef41e0/solaris-amd64-smartosbuildlet

  • defer internals. Writing _defer.fn = nil. &_defer.fn is 0xc000602298. This leads to one pointer being flushed, which must be the old funcval, but that's a static value because the closure was lifted.

2018-12-03T16:48:21-4a1a783/darwin-amd64-10_11

  • defer internals. Writing gp._defer = d. In this case, the object being marked must be a _defer.

2018-12-03T16:48:21-4a1a783/darwin-amd64-race

  • Another object from a sync.Pool.

2018-12-02T22:46:27-b3294d9/darwin-amd64-race

  • Writing a field in a freshly-allocated object. This is odd because it's writing a nil, but the old value must also be nil, yet it's marking 1 pointer.

2018-12-02T22:46:27-b3294d9/linux-386

  • defer internals. Writing _defer.link = nil, so one of the marks is a _defer object.
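The defer-pool writes named in several of the failures above share one shape. A simplified stand-in for the runtime's _defer recycling (not the runtime code; each pointer write marked below is one the write barrier observes):

```go
package main

import "fmt"

// _defer stands in for the runtime's deferred-call record.
type _defer struct {
	fn   func()
	link *_defer
}

// deferpool stands in for a P's per-size-class pool of cleared records.
var deferpool []*_defer

// freedefer clears the record and returns it to the pool, mirroring the
// writes seen in the failure analysis.
func freedefer(d *_defer) {
	d.fn = nil   // barriered write: the old funcval may be flushed
	d.link = nil // barriered write: the old *_defer may be flushed
	deferpool = append(deferpool, d) // barriered append of d itself
}

// newdefer pops a cleared record, mirroring the
// pp.deferpool[sc][n-1] = nil write from the analysis.
func newdefer() *_defer {
	if n := len(deferpool); n > 0 {
		d := deferpool[n-1]
		deferpool[n-1] = nil // barriered *_defer write
		deferpool = deferpool[:n-1]
		return d
	}
	return new(_defer)
}

func main() {
	d := newdefer()
	d.fn = func() {}
	d.link = new(_defer)
	freedefer(d)
	fmt.Println(d.fn == nil, d.link == nil, len(deferpool)) // true true 1
}
```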

gopherbot pushed a commit that referenced this issue Dec 6, 2018

runtime: print pointers being put in checkPut
In order to further diagnose #27993, I need to see exactly what
pointers are being added to the gcWork buffer too late.

Change-Id: I8d92113426ffbc6e55d819c39e7ab5eafa68668d
Reviewed-on: https://go-review.googlesource.com/c/152957
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

aclements (Member) commented Dec 7, 2018

The top one here may be the most informative failure.

https://storage.googleapis.com/go-build-log/be3b820c/linux-amd64_6758157c.log

  • Assigning gp._defer = d in newdefer. Slot is 0xc0000cb0a8, new value is 0xc00023d700. Marking 0xc00023d700 (the new value). The new defer is siz 8, and links to 0xc000369040, which must have been the old value of gp._defer. There are two possibilities: 1. d was just allocated, in which case it should have been marked and since the allocation happened on the system stack there were no safe-points between the allocation and the assignment; 2. d came from a deferpool, in which case it should have been marked by the write barrier for pp.deferpool[sc][n-1] = nil.

https://storage.googleapis.com/go-build-log/be3b820c/linux-386_3eeb1417.log

  • This one is really weird. It's gp.m = g.m in execute, but the object being marked isn't an m. The slot is 0x8cbdc18, which is g.m, and the new value is 0x88f64e0. It's marking 0x8c56000, which is at most a 64 byte object.

https://storage.googleapis.com/go-build-log/fc658a42/linux-amd64_957a1707.log

  • Appending d to pp.deferpool[sc] in freedefer (must be the write of d and not the grow). Slot is 0xc000030be8, new value is 0xc000346000 (which is d, the _defer being freed). Marking 0xc000346000, which is the _defer being freed. The _defer is zeroed, plus room for two words of arguments. In this case, the defer is fd.readUnlock(), so it has one actual word of arguments. (*FD).Read pushed the defer and is returning after a Read syscall.

2018-12-06T22:06:55-4a801cd/darwin-arm-mg912baios

  • Assigning p.arg = nil, where p came from a sync.Pool. Slot is 0xc0002b32e0 (probably the second word of p.arg), new value is 0. Object being marked is 0xc0000b63c0 (zzzzzzPP). Since p.arg is an interface{}, it's hard to say what this object actually is.

2018-12-06T21:10:35-ec0077c/freebsd-386-10_3

  • Assigning pp.deferpool[sc][n-1] = nil in newdefer, which is a *_defer write. Slot is 0x39823988, new value is 0. The defer pool is for 4-byte argument defers. Object being marked is 0x39b44140 and is consistent with a zeroed _defer with 4 bytes of arguments (not zeroed).