Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile,runtime: checkptr race condition #38872

Open
cuonglm opened this issue May 5, 2020 · 28 comments
Open

cmd/compile,runtime: checkptr race condition #38872

cuonglm opened this issue May 5, 2020 · 28 comments
Milestone

Comments

@cuonglm
Copy link
Contributor

@cuonglm cuonglm commented May 5, 2020

What version of Go are you using (go version)?

$ go version
go version devel +9b189686a5 Tue May 5 05:13:26 2020 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/cuonglm/Library/Caches/go-build"
GOENV="/Users/cuonglm/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/cuonglm/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/cuonglm/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/cuonglm/sources/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/cuonglm/sources/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
GOAMD64="alignedjumps"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/y4/hs76ltbn7sb66lw_6934kq4m0000gn/T/go-build903704744=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

$ cat main.go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

type onceSpy struct {
	done uint64
	m    sync.Mutex
}

// ResetSyncOnce will reset sync.Once to default state.
func ResetSyncOnce(once *sync.Once) {
	spy := (*onceSpy)(unsafe.Pointer(once))
	atomic.StoreUint64(&spy.done, 0)
}
func main() {
	o := sync.Once{}
	o.Do(func() {
		fmt.Println("hello 1")
	})
	ResetSyncOnce(&o)
	o.Do(func() {
		fmt.Println("hello 2")
	})
}

$ go build -gcflags=-d=checkptr main.go
$ for _ in {1..10}; do ./main; done

What did you expect to see?

Consistent result between 10 times run.

What did you see instead?

The result is inconsistent, sometimes passes, sometimes checkptr report errors.

$ for _ in {1..10}; do ./main; done 
hello 1
hello 2
hello 1
hello 2
hello 1
hello 2
hello 1
hello 2
hello 1
fatal error: checkptr: converted pointer straddles multiple allocations

goroutine 1 [running]:
runtime.throw(0x10d1368, 0x3a)
	/Users/cuonglm/sources/go/src/runtime/panic.go:1116 +0x72 fp=0xc000098f28 sp=0xc000098ef8 pc=0x1030b32
runtime.checkptrAlignment(0xc0000b4004, 0x10b8620, 0x1)
	/Users/cuonglm/sources/go/src/runtime/checkptr.go:20 +0xc9 fp=0xc000098f58 sp=0xc000098f28 pc=0x1005e69
main.ResetSyncOnce(...)
	/Users/cuonglm/main.go:17
main.main()
	/Users/cuonglm/main.go:25 +0x65 fp=0xc000098f88 sp=0xc000098f58 pc=0x10a5625
runtime.main()
	/Users/cuonglm/sources/go/src/runtime/proc.go:204 +0x223 fp=0xc000098fe0 sp=0xc000098f88 pc=0x10332c3
runtime.goexit()
	/Users/cuonglm/sources/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000098fe8 sp=0xc000098fe0 pc=0x1061241
hello 1
fatal error: checkptr: converted pointer straddles multiple allocations

goroutine 1 [running]:
runtime.throw(0x10d1368, 0x3a)
	/Users/cuonglm/sources/go/src/runtime/panic.go:1116 +0x72 fp=0xc000098f28 sp=0xc000098ef8 pc=0x1030b32
runtime.checkptrAlignment(0xc0000b4004, 0x10b8620, 0x1)
	/Users/cuonglm/sources/go/src/runtime/checkptr.go:20 +0xc9 fp=0xc000098f58 sp=0xc000098f28 pc=0x1005e69
main.ResetSyncOnce(...)
	/Users/cuonglm/main.go:17
main.main()
	/Users/cuonglm/main.go:25 +0x65 fp=0xc000098f88 sp=0xc000098f58 pc=0x10a5625
runtime.main()
	/Users/cuonglm/sources/go/src/runtime/proc.go:204 +0x223 fp=0xc000098fe0 sp=0xc000098f88 pc=0x10332c3
runtime.goexit()
	/Users/cuonglm/sources/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000098fe8 sp=0xc000098fe0 pc=0x1061241
hello 1
hello 2
hello 1
hello 2
hello 1
fatal error: checkptr: converted pointer straddles multiple allocations

goroutine 1 [running]:
runtime.throw(0x10d1368, 0x3a)
	/Users/cuonglm/sources/go/src/runtime/panic.go:1116 +0x72 fp=0xc000098f28 sp=0xc000098ef8 pc=0x1030b32
runtime.checkptrAlignment(0xc0000b4004, 0x10b8620, 0x1)
	/Users/cuonglm/sources/go/src/runtime/checkptr.go:20 +0xc9 fp=0xc000098f58 sp=0xc000098f28 pc=0x1005e69
main.ResetSyncOnce(...)
	/Users/cuonglm/main.go:17
main.main()
	/Users/cuonglm/main.go:25 +0x65 fp=0xc000098f88 sp=0xc000098f58 pc=0x10a5625
runtime.main()
	/Users/cuonglm/sources/go/src/runtime/proc.go:204 +0x223 fp=0xc000098fe0 sp=0xc000098f88 pc=0x10332c3
runtime.goexit()
	/Users/cuonglm/sources/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000098fe8 sp=0xc000098fe0 pc=0x1061241
hello 1
hello 2

cc @mdempsky

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented May 5, 2020

The problem does not occur if I removed m sync.Mutex from struct.

@agnivade agnivade added this to the Go1.15 milestone May 5, 2020
@mdempsky
Copy link
Member

@mdempsky mdempsky commented May 5, 2020

sync.Once.done's type is uint32, not uint64, so I think this is a legit checkptr report. Not sure off hand why it wouldn't consistently report this though.

I'd guess it has something to do with either scheduling of the concurrent GC, or with placement of o in the heap.

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented May 5, 2020

sync.Once.done's type is uint32, not uint64, so I think this is a legit checkptr report. Not sure off hand why it wouldn't consistently report this though.

I'd guess it has something to do with either scheduling of the concurrent GC, or with placement of o in the heap.

If m sync.Mutex was removed, then checkptr does not report, sounds like another issue.

@mdempsky
Copy link
Member

@mdempsky mdempsky commented May 5, 2020

Why? It's not an error to convert from *BigType to *SmallType. If you remove m sync.Mutex, then unsafe.Sizeof(onceSpy{}) <= unsafe.Sizeof(sync.Once{}), right?

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented May 5, 2020

Why? It's not an error to convert from *BigType to *SmallType. If you remove m sync.Mutex, then unsafe.Sizeof(onceSpy{}) <= unsafe.Sizeof(sync.Once{}), right?

Ah right 👍

@bcmills
Copy link
Member

@bcmills bcmills commented May 5, 2020

Hmm, but should the race detector emit an error when a 32-bit atomic op occurs concurrently with a 64-bit atomic op?

My recollection is that on some 32-bit architectures, the 32-bit atomic ops use an entirely separate code path from the 64-bit ones, so I can't imagine that the memory model allows it.

@bcmills
Copy link
Member

@bcmills bcmills commented May 5, 2020

(Oh, right: the memory model doesn't define atomic at all: that's #5045.)

@bcmills
Copy link
Member

@bcmills bcmills commented May 5, 2020

Filed that separately as #38881.

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented May 15, 2020

@mdempsky

I'd guess it has something to do with either scheduling of the concurrent GC, or with placement of o in the heap.

Seems concurrent GC problem, because if I set GOGC=off, then checkptr reports it consistently.

@mdempsky
Copy link
Member

@mdempsky mdempsky commented May 15, 2020

Hmm, but should the race detector emit an error when a 32-bit atomic op occurs concurrently with a 64-bit atomic op?

This is worth discussing in #38881, but I think it's a red herring here. The 32-bit and 64-bit atomic ops that modify o are all happening on the same goroutine anyway, so there's no race.

Seems concurrent GC problem, because if I set GOGC=off, then checkptr reports it consistently.

Interesting. I wonder if there's a race condition in the calls to findObject in checkptrAlignment. Maybe it's not actually safe to call checkptrBase(add(p, size-1))?

@mdempsky
Copy link
Member

@mdempsky mdempsky commented May 15, 2020

Hm, I'm not able to reproduce this on linux/amd64. I don't get any checkptr reports, even with GOGC=off.

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented May 16, 2020

Hm, I'm not able to reproduce this on linux/amd64. I don't get any checkptr reports, even with GOGC=off.

Confirm that this does not happen on linux/amd64, but does on darwin/amd64, sounds like a false negative on linux/amd64?

@mdempsky
Copy link
Member

@mdempsky mdempsky commented May 16, 2020

Yeah, not reporting on linux/amd64 is a false negative. (With the usual caveats that checkptr is best-effort, so some false negatives will be inevitable; though I can't immediately see why the behavior here would be OS-specific.)

@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

I'm not able to reproduce on darwin/amd64.

% uname -a
Darwin CheeseGrater.local 19.2.0 Darwin Kernel Version 19.2.0: Sat Nov  9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64 x86_64

I checked out the Go release at 9b18968.
It hasn't failed after thousands of iterations.

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented Jun 24, 2020

@randall77 still reproduce for me in current latest commit 9f33108

% uname -a
Darwin Cuongs-MacBook-Pro.local 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar  4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64
@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented Jun 24, 2020

Also reproduce on:

19.5.0 Darwin Kernel Version 19.5.0: Tue May 26 20:41:44 PDT 2020; root:xnu-6153.121.2~2/RELEASE_X86_64 x86_64
@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

Operator error. I am now able to reproduce. Sorry for the runaround.

@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

This is definitely a correct report.

A sync.Once is 12 bytes (4 for the done field, 8 for the sync.Mutex, and the sync.Mutex is only 4-byte aligned).
The sync.Once gets allocated in a 16 byte tinyalloc region, at offsets 4-16. When you cast to your onceSpy, that 16-byte object now straddles 4 bytes into the next region.

I can confirm that the problem reproduces exactly when the sync.Once is allocated at 4 mod 16 and not when it is allocated at 0 mod 16. That's the source of the inconsistent reporting.

On Linux, the sync.Once never appears to be allocated at 4 mod 16. I guess there's some early allocation on Darwin that doesn't happen on Linux.

I can force an error on Linux by allocating 4 bytes at the start of main:

var x int = 4
var b []byte

func main() {
	b = make([]byte, x)

With that change, it now reports an error reliably on Linux.

It is unfortunate that checkptr does not reliably report this as a violation. I think to fix that, we'd need to keep track of the separate allocations inside a tinyalloc. I don't think there's any easy way to do that, unfortunately. Possibly we could just turn off tinyalloc when checkptr is on. Certainly when checkptr is on because the race detector is on, the cost would be in the noise. With just checkptr, though, I'm not sure. @mdempsky what do you think?

@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

Of course, just turning off tinyalloc is not enough - that would just allocate sync.Once in its own 16-byte region and thus checkptr would never report an error.
I think in addition maybe we could allocate objects that are smaller than the size class randomly either at the start or end of the sizeclass object?

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented Jun 24, 2020

I can confirm that the problem reproduces exactly when the sync.Once is allocated at 4 mod 16 and not when it is allocated at 0 mod 16. That's the source of the inconsistent reporting.

@randall77 👍 , also, any thought on why GOGC=off makes it reports reliably on OSX?

@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

Not sure. Probably just makes the allocation pattern more deterministic.

@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

In any case, there's no bug here worth fixing for 1.15. Punting to 1.16.

@randall77 randall77 modified the milestones: Go1.15, Go1.16 Jun 24, 2020
@mdempsky
Copy link
Member

@mdempsky mdempsky commented Jun 24, 2020

@randall77 Thanks for reproducing and figuring out what's up.

Agreed with your suggestions (being able to turn off tinyalloc, and randomly end-aligning objects rather than start-aligning them) would be good ways of making this fire more reliably.

@randall77
Copy link
Contributor

@randall77 randall77 commented Jun 24, 2020

Ok, we'll do that for 1.16.
Now that I think about it, always end-aligning is probably better than random. it's rare to modify a pointer in the negative direction too far.

@gopherbot

This comment was marked as off-topic.

@randall77 randall77 removed this from the Go1.16 milestone Jun 25, 2020
@randall77 randall77 modified the milestones: Go1.15, Go1.16 Jun 25, 2020
@randall77

This comment was marked as off-topic.

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented Jul 1, 2020

@mdempsky @randall77 related issue #35128

@cuonglm
Copy link
Contributor Author

@cuonglm cuonglm commented Aug 19, 2020

Hi @aclements, do you have any suggestion to solve this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.