
runtime: mTreap corruption in go1.13.10 #40372

Open
luckyxiaoqiang opened this issue Jul 23, 2020 · 11 comments

Comments


@luckyxiaoqiang luckyxiaoqiang commented Jul 23, 2020

What version of Go are you using (go version)?

$ go version
go version go1.13.10 darwin/amd64

Does this issue reproduce with the latest release?

It's very hard to reproduce. We have not yet reproduced it with the latest release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPRIVATE="*.qima-inc.com"
GOPROXY="https://goproxy.cn,direct"
GOROOT="/usr/local/Cellar/go@1.13/1.13.10_1/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go@1.13/1.13.10_1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/3h/68msm91n7z9dvjhlxz5ms_z00000gn/T/go-build802857650=/tmp/go-build -gno-record-gcc-switches -fno-common"
The go env output above describes the development and build environment; we cross-compile the binary for the linux/amd64 platform. Below is the runtime environment.
$ uname -a
Linux hostname 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

We deployed Istio Pilot in our production environment (virtual machines). It had been running on go1.13.10 for about a month or two before we hit this problem. There was no special operation and no heavy load, and only one instance was affected.

What did you expect to see?

The process responded normally.

What did you see instead?

The process was unresponsive.
The initial symptom was that all client requests timed out. A few minutes later we received a high-CPU-usage alert and found that one CPU core on the abnormal node was close to 100% utilization.
Using dlv, we found one goroutine spinning in a for loop.
Below is the stack.

0  0x000000000041d97a in runtime.(*mTreap).insert
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/mgclarge.go:449
1  0x000000000045d6b0 in runtime.systemstack_switch
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/asm_amd64.s:330
2  0x00000000004267ca in runtime.(*mheap).freeSpan
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/mheap.go:1328
3  0x0000000000417d38 in runtime.(*mcentral).freeSpan
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/mcentral.go:246
4  0x00000000004234b6 in runtime.(*mspan).sweep
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/mgcsweep.go:366
5  0x0000000000422c85 in runtime.sweepone
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/mgcsweep.go:136
6  0x000000000042294a in runtime.bgsweep
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/mgcsweep.go:73
7  0x000000000045f601 in runtime.goexit
   at /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/asm_amd64.s:1357

Below is the corresponding code (the first for loop).

// insert adds span to the large span treap.
func (root *mTreap) insert(span *mspan) {
	if !span.scavenged {
		root.unscavHugePages += span.hugePages()
	}
	base := span.base()
	var last *treapNode
	pt := &root.treap
	for t := *pt; t != nil; t = *pt {
		last = t
		if t.key < base {
			pt = &t.right
		} else if t.key > base {
			pt = &t.left
		} else {
			throw("inserting span already in treap")
		}
	}

	// Add t as new leaf in tree of span size and unique addrs.
	// The balanced tree is a treap using priority as the random heap priority.
	// That is, it is a binary tree ordered according to the key,
	// but then among the space of possible binary trees respecting those
	// keys, it is kept balanced on average by maintaining a heap ordering
	// on the priority: s.priority <= both s.left.priority and s.right.priority.
	// https://en.wikipedia.org/wiki/Treap
	// https://faculty.washington.edu/aragon/pubs/rst89.pdf

	t := (*treapNode)(mheap_.treapalloc.alloc())
	t.key = span.base()
	t.priority = fastrand()
	t.span = span
	t.maxPages = span.npages
	t.types = span.treapFilter()
	t.parent = last
	*pt = t // t now at a leaf.

	// Update the tree to maintain the various invariants.
	i := t
	for i.parent != nil && i.parent.updateInvariants() {
		i = i.parent
	}

	// Rotate up into tree according to priority.
	for t.parent != nil && t.parent.priority > t.priority {
		if t != nil && t.span.base() != t.key {
			println("runtime: insert t=", t, "t.key=", t.key)
			println("runtime:      t.span=", t.span, "t.span.base()=", t.span.base())
			throw("span and treap node base addresses do not match")
		}
		if t.parent.left == t {
			root.rotateRight(t.parent)
		} else {
			if t.parent.right != t {
				throw("treap insert finds a broken treap")
			}
			root.rotateLeft(t.parent)
		}
	}
}

@davecheney davecheney commented Jul 23, 2020

This stack trace is from an OS X machine; to the best of my knowledge Istio is only deployed on Linux. Are you sure this stack trace is from the affected machine?


@luckyxiaoqiang luckyxiaoqiang commented Jul 24, 2020

> This stack trace is from an OS X machine; to the best of my knowledge Istio is only deployed on Linux. Are you sure this stack trace is from the affected machine?

Sorry, I didn't describe it clearly. We used cross-compilation; the runtime environment is Linux x86_64. I've updated the description, please take another look.


@davecheney davecheney commented Jul 24, 2020

Thank you. Have you raised this issue with the istio developers?


@luckyxiaoqiang luckyxiaoqiang commented Jul 24, 2020

> Thank you. Have you raised this issue with the istio developers?

Not yet. I'll do that later.


@tokers tokers commented Jul 24, 2020

It seems that the treap structure of mheap.free was corrupted.


@davecheney davecheney commented Jul 24, 2020

In cases like this the most appropriate line of investigation is to ensure the program is completely free of data races. As the OP is running software written by a third party, I think it is appropriate that they seek assertions from the istio developers that they are confident that no data races exist in the version in question.


@mknyszek mknyszek commented Jul 28, 2020

It may be worth mentioning that in Go 1.14 and later mheap.free no longer exists and has been replaced with an entirely different mechanism. That would explain why you cannot reproduce it.

It's not totally clear to me how a data race at the application level could lead to corruption of mheap.free which is pretty far removed from user code and protected by a runtime lock. I don't want to write off the possibility that it's a data race in Istio Pilot, but it wouldn't be my first guess (memory corruption caused by garbage data (perhaps created due to a data race) in C is always a possibility, but I don't know if Istio uses any C).

It's also possible (but I don't know how) that mheap.free is corrupting itself due to a bug. We don't have any stress tests for it, only basic functional tests. Maybe a smaller-scale stress test in the runtime would reveal the problem? It's kind of a moot point now that it isn't even in the runtime at tip anymore, but Go 1.13 will continue to be supported, so this should probably be fixed nonetheless.

@mknyszek mknyszek changed the title runtime.(*mTreap).insert is stuck in an infinite loop with Go1.13.10 runtime: mTreap corruption in go1.13.10 Jul 28, 2020

@mknyszek mknyszek commented Jul 28, 2020

@tokers can you provide more detail? Were you able to look at a core dump?

And thanks @luckyxiaoqiang for the report.


@luckyxiaoqiang luckyxiaoqiang commented Jul 29, 2020

> It may be worth mentioning that in Go 1.14 and later mheap.free no longer exists and has been replaced with an entirely different mechanism. That would explain why you cannot reproduce it.
>
> It's not totally clear to me how a data race at the application level could lead to corruption of mheap.free which is pretty far removed from user code and protected by a runtime lock. I don't want to write off the possibility that it's a data race in Istio Pilot, but it wouldn't be my first guess (memory corruption caused by garbage data (perhaps created due to a data race) in C is always a possibility, but I don't know if Istio uses any C).
>
> It's also possible (but I don't know how) that mheap.free is corrupting itself due to a bug. We don't have any stress tests for it, only basic functional tests. Maybe a smaller-scale stress test in the runtime would reveal the problem? It's kind of a moot point now that it isn't even in the runtime at tip anymore, but Go 1.13 will continue to be supported, so this should probably be fixed nonetheless.

Thanks for your reply @mknyszek .
In my opinion, if a data race at the application level corrupted runtime data, the most likely result would be a panic rather than an infinite loop. It's also very hard to analyze at the application level, because Istio uses lots of third-party packages. What's more, we have done a lot of internal development on top of an old version of the official Istio.
We are trying to analyze the core dump and will report any useful findings here.


@tokers tokers commented Jul 29, 2020

Another odd thing is that this scenario happened several times, but always on the same machine (a Linux VM). I don't know whether there are environment-dependent factors that could influence the runtime memory allocator.


@luckyxiaoqiang luckyxiaoqiang commented Jul 29, 2020

> In cases like this the most appropriate line of investigation is to ensure the program is completely free of data races. As the OP is running software written by a third party, I think it is appropriate that they seek assertions from the istio developers that they are confident that no data races exist in the version in question.

Thanks @davecheney. We have done a lot of internal development based on an old version of the official Istio that is no longer maintained by the community, so it may be hard to get help from the Istio developers. We'll try to find data races at the application level ourselves.
Thanks again for your advice.
