runtime: gcWriteBarrier crashes in production system #42249

Open
nporsche opened this issue Oct 28, 2020 · 12 comments

@nporsche nporsche commented Oct 28, 2020

What version of Go are you using (go version)?

$ go version
1.15.3

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build154682581=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The program crashes in runtime.gcWriteBarrier. Stack trace from the core dump:

Type 'help' for list of commands.
(dlv) bt
 0  0x0000000000474781 in runtime.raise
    at /usr/local/go/src/runtime/sys_linux_amd64.s:165
 1  0x0000000000471160 in runtime.systemstack_switch
    at /usr/local/go/src/runtime/asm_amd64.s:330
 2  0x0000000000434de6 in runtime.wbBufFlush
    at /usr/local/go/src/runtime/mwbbuf.go:206
 3  0x000000000047300e in runtime.gcWriteBarrier
    at /usr/local/go/src/runtime/asm_amd64.s:1461
 4  0x0000000000473067 in runtime.gcWriteBarrierCX
    at /usr/local/go/src/runtime/asm_amd64.s:1481
 5  0x0000000000ac8669 in core.(*RequestContext).xxxRequestsBuf
    at core/xxx_core.go:466

xxx_core.go:466

func (c *RequestContext) xxxRequestsBuf(buf *bytes.Buffer) {
	defer func() {
		c.Log("xxxRequestsBuf tx_data_requests_size_=%d", buf.Len())
	}()

	unsafeBuf := util.UnsafeBytes(unsafe.Pointer(c.p.tx_data_requests_bytes_), int(c.p.tx_data_requests_size_))
	io.CopyN(buf, bytes.NewReader(unsafeBuf), int64(c.p.tx_data_requests_size_))

	c.p.tx_data_requests_bytes_ = nil //line 466
	c.p.tx_data_requests_size_ = 0

	return
}

What did you expect to see?

The program runs without crashing.

What did you see instead?

The program crashes.

Author

@nporsche nporsche commented Oct 28, 2020

0 0x0000000000474781 in runtime.raise
at /usr/local/go/src/runtime/sys_linux_amd64.s:165
1 0x0000000000471160 in runtime.systemstack_switch
at /usr/local/go/src/runtime/asm_amd64.s:330
2 0x0000000000434de6 in runtime.wbBufFlush
at /usr/local/go/src/runtime/mwbbuf.go:206
3 0x000000000047300e in runtime.gcWriteBarrier
at /usr/local/go/src/runtime/asm_amd64.s:1461
4 0x0000000000473067 in runtime.gcWriteBarrierCX

The crashes happen occasionally in production: 1-2 core dumps per day across 1000 instances.
xxxRequestsBuf is not the only trigger. Below is another stack that ends in a GC write barrier crash:

(dlv) bt
 0  0x0000000000474781 in runtime.raise
    at /usr/local/go/src/runtime/sys_linux_amd64.s:165
 1  0x0000000000471160 in runtime.systemstack_switch
    at /usr/local/go/src/runtime/asm_amd64.s:330
 2  0x0000000000434de6 in runtime.wbBufFlush
    at /usr/local/go/src/runtime/mwbbuf.go:206
 3  0x000000000047300e in runtime.gcWriteBarrier
    at /usr/local/go/src/runtime/asm_amd64.s:1461
 4  0x0000000000473067 in runtime.gcWriteBarrierCX
    at /usr/local/go/src/runtime/asm_amd64.s:1481
 5  0x00000000004f6d65 in fmt.(*pp).free
    at /usr/local/go/src/fmt/print.go:159
 6  0x00000000004f728b in fmt.Sprintf
    at /usr/local/go/src/fmt/print.go:221
@ianlancetaylor ianlancetaylor changed the title runtime.gcWriteBarrier crashes in production system runtime: gcWriteBarrier crashes in production system Oct 28, 2020
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Oct 28, 2020
Contributor

@ianlancetaylor ianlancetaylor commented Oct 28, 2020

This looks like memory corruption. Please take a close look at your use of unsafe. Run your program under the race detector.

I see that you are including backtraces from gdb. Does the program itself print anything? The program printout would provide more information about the crash.

Author

@nporsche nporsche commented Oct 28, 2020

@ianlancetaylor Thanks for your reply. I will try the race detector.
The program did print the following to stderr:

runtime: pointer 0xc048bfb301 to unused region of span span.base()=0xc006d7a000 span.limit=0xc006d7bfa0 span.state=1
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw(0xeb3d5e, 0x3e)
	/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0x7fb148bfcbe8 sp=0x7fb148bfcbb8 pc=0x43a1d2
runtime.badPointer(0x7fb25c3ca568, 0xc048bfb301, 0x0, 0x0)
	/usr/local/go/src/runtime/mbitmap.go:380 +0x235 fp=0x7fb148bfcc30 sp=0x7fb148bfcbe8 pc=0x4180f5
runtime.findObject(0xc048bfb301, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/runtime/mbitmap.go:416 +0x9b fp=0x7fb148bfcc68 sp=0x7fb148bfcc30 pc=0x4181bb
runtime.wbBufFlush1(0xc000095800)
	/usr/local/go/src/runtime/mwbbuf.go:288 +0xa8 fp=0x7fb148bfccc0 sp=0x7fb148bfcc68 pc=0x434f28
runtime.wbBufFlush.func1()
	/usr/local/go/src/runtime/mwbbuf.go:218 +0x3a fp=0x7fb148bfccd8 sp=0x7fb148bfccc0 pc=0x46979a
runtime.systemstack(0x7fb163800e28)
	/usr/local/go/src/runtime/asm_amd64.s:370 +0x66 fp=0x7fb148bfcce0 sp=0x7fb148bfccd8 pc=0x4711e6
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1116 fp=0x7fb148bfcce8 sp=0x7fb148bfcce0 pc=0x43f3a0
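
The numbers in the first line of that output can be checked by hand. The sketch below only restates that arithmetic: it tests whether the reported pointer falls inside the half-open range from span.base() to span.limit and whether it is 8-byte aligned. describePointer and spanReport are hypothetical names, and the range test is an approximation of the kind of bounds check the message implies, not the runtime's actual code.

```go
package main

import "fmt"

// spanReport summarizes how an address relates to a span range.
// Hypothetical type, used only for this illustration.
type spanReport struct {
	InSpan      bool   // base <= p < limit
	WordAligned bool   // p is a multiple of 8
	SpanBytes   uint64 // limit - base
}

// describePointer compares p against the half-open range [base, limit).
func describePointer(p, base, limit uint64) spanReport {
	return spanReport{
		InSpan:      p >= base && p < limit,
		WordAligned: p%8 == 0,
		SpanBytes:   limit - base,
	}
}

func main() {
	// Values copied from the stderr output in this comment.
	r := describePointer(0xc048bfb301, 0xc006d7a000, 0xc006d7bfa0)
	fmt.Printf("in span: %v, 8-byte aligned: %v, span size: %d bytes\n",
		r.InSpan, r.WordAligned, r.SpanBytes)
	// Prints: in span: false, 8-byte aligned: false, span size: 8096 bytes
}
```

An odd address is legal for an interior pointer into a byte array, so the alignment result proves nothing on its own. It only shows that the value the runtime found does not fall within the span it printed.
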
Contributor

@ianlancetaylor ianlancetaylor commented Oct 28, 2020

That also suggests memory corruption of some sort.

Author

@nporsche nporsche commented Oct 29, 2020

@ianlancetaylor Are there any tools that help narrow down where the memory corruption first occurs?

Contributor

@ianlancetaylor ianlancetaylor commented Oct 29, 2020

I'm not aware of any general tools for detecting memory corruption. Memory corruption is impossible when writing safe Go code that doesn't have race conditions.

I can see from your code fragment that your code is unsafe (or at least it certainly looks unsafe), but what is the goal? Why are you using unsafe code?

Contributor

@ianlancetaylor ianlancetaylor commented Oct 29, 2020

There are some GODEBUG settings you could try. See https://golang.org/pkg/runtime.

Author

@nporsche nporsche commented Oct 29, 2020

@ianlancetaylor This is a large project. We rewrote the HTTP server in Go while leaving the legacy business logic in C++, which is exposed as a shared object, so cgo is used for communication between the two. The Go part hosts the process and provides the goroutine/thread infrastructure, calling into C++ to process each HTTP request, so the memory needs to be accessed by both Go and C++.

Author

@nporsche nporsche commented Oct 29, 2020

I enabled -race and GODEBUG=cgocheck=2, but there has been no output so far.

Contributor

@ianlancetaylor ianlancetaylor commented Oct 29, 2020

You could try compiling your C++ code with -fsanitize=memory and your Go code with -msan. That will enable the memory sanitizer, which will detect certain kinds of memory corruption. I don't know whether it will help, but it may.

Author

@nporsche nporsche commented Oct 29, 2020

How can I find out what is holding the bad pointer?

Contributor

@ianlancetaylor ianlancetaylor commented Oct 29, 2020

I don't know of any way to do that, sorry. By the time of the crash memory has probably been corrupted for some time.
