Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile/internal/types2: runtime.mapassign1 throws in (*Checker.recordScope) during bootstrapping #50999

Open
bcmills opened this issue Feb 3, 2022 · 17 comments
Labels
NeedsInvestigation

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 3, 2022

greplogs --dashboard -md -l -e '^goroutine \d+ .*:\n(?:.+\n\t.*\n)*(?:bootstrap/)cmd/compile/internal/types2'

2022-02-03T16:53:55-bec8a10/windows-386-2008

runtime stack:
runtime.gothrow(0x11cfeb0, 0x2a)
	C:/workdir/src/runtime/panic.go:503 +0x8e fp=0x6fc50 sp=0x6fc38
runtime.sigpanic()
	C:/workdir/src/runtime/os_windows.go:36 +0x65 fp=0x6fc70 sp=0x6fc50
scanblock(0xc0833fc000, 0x1c4000, 0x0)
	C:/workdir/src/runtime/mgc0.c:311 +0xab0 fp=0x6fdb0 sp=0x6fc70
gc(0x6fee8)
	C:/workdir/src/runtime/mgc0.c:1440 +0x28b fp=0x6fec8 sp=0x6fdb0
runtime.gc_m()
	C:/workdir/src/runtime/mgc0.c:1368 +0x103 fp=0x6ff00 sp=0x6fec8
runtime.onM(0x169a5b0)
	C:/workdir/src/runtime/asm_amd64.s:257 +0x6d fp=0x6ff08 sp=0x6ff00
runtime.mstart()
	C:/workdir/src/runtime/proc.c:818 fp=0x6ff10 sp=0x6ff08

goroutine 1 [garbage collection]:
runtime.switchtoM()
	C:/workdir/src/runtime/asm_amd64.s:198 fp=0xc0834e6a80 sp=0xc0834e6a78
runtime.gogc(0x0)
	C:/workdir/src/runtime/malloc.go:469 +0x1d6 fp=0xc0834e6ab8 sp=0xc0834e6a80
runtime.mallocgc(0x34000, 0xf97360, 0x0, 0x401f89)
	C:/workdir/src/runtime/malloc.go:341 +0x398 fp=0xc0834e6b68 sp=0xc0834e6ab8
runtime.newarray(0xf97360, 0x400, 0x1aba10)
	C:/workdir/src/runtime/malloc.go:365 +0xc8 fp=0xc0834e6ba0 sp=0xc0834e6b68
runtime.hashGrow(0xe158a0, 0xc0823c9230)
	C:/workdir/src/runtime/hashmap.go:744 +0x8d fp=0xc0834e6bd0 sp=0xc0834e6ba0
runtime.mapassign1(0xe158a0, 0xc0823c9230, 0xc0834e6c98, 0xc0834e6c90)
	C:/workdir/src/runtime/hashmap.go:456 +0x56f fp=0xc0834e6c70 sp=0xc0834e6bd0
bootstrap/cmd/compile/internal/types2.(*Checker).recordScope(0xc08200e380, 0x1506d8, 0xc0825466e0, 0xc0834a6a10)
	C:/workdir/go/src/cmd/compile/internal/types2/check.go:580 +0xa6 fp=0xc0834e6cb0 sp=0xc0834e6c70

2021-09-07T14:53:41-21de6bc/linux-amd64-clang

(attn @griesemer, @findleyr)

@findleyr
Copy link
Contributor

@findleyr findleyr commented Feb 3, 2022

CC @danscales @mdempsky

Could the compiler be re-using a types2.Info?

@bcmills
Copy link
Member Author

@bcmills bcmills commented Feb 3, 2022

(This failure mode seems to be new as of Go 1.18, but as far as I can tell it only occurs during bootstrapping; it could possibly be a bad interaction with the Go 1.4 toolchain used for bootstrapping.)

@bcmills bcmills added the NeedsInvestigation label Feb 3, 2022
@mdempsky
Copy link
Member

@mdempsky mdempsky commented Feb 3, 2022

Could the compiler be re-using a types2.Info?

I don't think so. The compiler only runs the types2 type checker once per process.

@mdempsky
Copy link
Member

@mdempsky mdempsky commented Feb 3, 2022

The file:line numbers seem pretty odd. I know cmd/dist rewrites the files for bootstrapping, so I'd expect maybe a few lines delta, but they seem further off than I'd expect.

Like just the top 3 frames within types2:

bootstrap/cmd/compile/internal/types2.(*Checker).recordSelection(0xc208078000, 0xc2084293e0, 0x1, 0x7ff2b6d34790, 0xc208f606c0, 0x7ff2b6d346b0, 0xc208c284e0, 0xc20957fe08, 0x1, 0x1, ...)
	/workdir/go/src/cmd/compile/internal/types2/check.go:450 +0x1a6 fp=0xc209600d80 sp=0xc209600d38

recordSelection starts at line 568

bootstrap/cmd/compile/internal/types2.(*Checker).selector(0xc208078000, 0xc2095f78c0, 0xc2084293e0)
	/workdir/go/src/cmd/compile/internal/types2/call.go:648 +0x3d03 fp=0xc2096012f8 sp=0xc209600d80

There's a call at line 642, which seems reasonably close to 648. (There are also calls at lines 588 and 631.)

bootstrap/cmd/compile/internal/types2.(*Checker).exprInternal(0xc208078000, 0xc2095f78c0, 0x7ff2b6d35240, 0xc2084293e0, 0x0, 0x0, 0xc2088538a8)
	/workdir/go/src/cmd/compile/internal/types2/expr.go:1425 +0x1256 fp=0xc209601840 sp=0xc2096012f8

The only call to selector in expr.go is at line 1458.

@findleyr
Copy link
Contributor

@findleyr findleyr commented Feb 3, 2022

I don't think so. The compiler only runs the types2 type checker once per process.

Hmm, so that would rule out a race.

The actual error message is this:

runtime: garbage collector found invalid heap pointer *(0xc209600c08+0x0)=0xc20957d000 span=0xc209574000-0xc20957d000-0xc20957e000 state=0
fatal error: invalid heap pointer

Can someone describe what that means exactly, and the conditions under which it could occur? That would help narrow down whether could be a problem in types2.

@griesemer
Copy link
Contributor

@griesemer griesemer commented Feb 3, 2022

@mknyszek for input on the "invalid heap pointer" error

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

That is an... old message. I can't find it in the current runtime. I'll do some digging in the history but at first glance I suspect that what it's trying to say is that there's a pointer pointing outside of any span (maybe?)

My guess is this is saying that at address 0xc209600c08 there's a pointer 0xc20957d000 and it's invalid. I think what it's trying to say is that the span is not a span that is currently in use by the heap (back then, the whole heap was always covered by the set of spans).

I'll get back to you all in a bit.

@mdempsky
Copy link
Member

@mdempsky mdempsky commented Feb 3, 2022

I'm inclined to say this is a Go 1.4 compiler (or runtime) bug. It certainly wouldn't be the first time we've been bit by the Go 1.4 toolchain when enabling new code in the compiler. E.g., this is why we currently disable inlining during bootstrapping:

// Run Go 1.4 to build binaries. Use -gcflags=-l to disable inlining to
// workaround bugs in Go 1.4's compiler. See discussion thread:
// https://groups.google.com/d/msg/golang-dev/Ss7mCKsvk8w/Gsq7VYI0AwAJ

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

Huh, OK. So state=0 implies that the span it found for this object is indeed an in-use heap span. The three components 0xc209574000-0xc20957d000-0xc20957e000 refer to span boundaries: start, limit (the point at which you stop having valid objects), and end. The pointer 0xc20957d000 is the limit, so that's why it's invalid. The span is 40960 bytes (5 8 KiB pages) wide, but only the first 36864 bytes (4.5 8 KiB pages) is valid. So there's some 36864 byte object there, and the GC found a pointer past the end of it.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

Because it's a pointer-past-the-end situation, this could be a compiler bug potentially too, where there's some moment that the GC could stop a goroutine and see a transiently incorrect pointer value in a stack slot or something (like at the end of a loop, maybe?). Either that, or some of the code overwritten into the source tree unsafe and actually makes an invalid pointer (this is unlikely as it would be breaking outside of bootstrapping too). It could be a GC bug too I suppose if it reached into a place it wasn't supposed to, or at a time it wasn't supposed to.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

Oh! One more thing: did Go 1.4 have the "the last zero-sized field in a struct adds a byte of padding" rule? If not, I could see how taking that address in current Go is totally valid, but not valid back then.

@mdempsky
Copy link
Member

@mdempsky mdempsky commented Feb 3, 2022

Either that, or some of the code overwritten into the source tree unsafe and actually makes an invalid pointer (this is unlikely as it would be breaking outside of bootstrapping too).

The compiler only uses unsafe.Pointer for mmaping source files and export data into memory as strings. I don't think that's likely to be the issue. The mapped memory wouldn't be in a span anyway.

(Though note that the unsafe.Pointer safety rules weren't established until after Go 1.4 anyway.)

did Go 1.4 have the "the last zero-sized field in a struct adds a byte of padding" rule?

I was trying to remember whether that's the case or not. I thought that became a rule when we switched from conservative to precise GC, which I thought was before Go 1.4?

The span is 40960 bytes (5 8 KiB pages) wide, but only the first 36864 bytes (4.5 8 KiB pages) is valid. So there's some 36864 byte object there, and the GC found a pointer past the end of it.

Do we know for certain that it's a single 36864-byte object, or could it be multiple objects taking up that much space?

@bcmills
Copy link
Member Author

@bcmills bcmills commented Feb 3, 2022

Given #44505 — and assuming that it actually happens within the not-too-distant future — if we are reasonably confident that this is a Go 1.4 bug it might not be worth spending too much time on.

(External Go users who want to build from source can already work around a Go 1.4 bug by bootstrapping from a newer Go version instead.)

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

Do we know for certain that it's a single 36864-byte object, or could it be multiple objects taking up that much space?

I don't think we do. This span is in range for being a small object span and it's within tail waste bounds. It's hard to say at this point what it was since the size class wasn't dumped.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2022

I'm pretty sure the last zero-sized field thing was added in Go 1.5 for #9401.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

Without running Go 1.4 my best guess is that the size class scheme back then is the same as today's minus the 24-byte size class (there was already the MemStats.BySize array length mismatch). In which case there is no size class that fits the description. None of the size classes waste more than 896 bytes at the tail AFAICT. But a single large object could definitely waste that much because of the page size.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 3, 2022

OK, I ran the code and there definitely are size classes with >896 byte tails but they also do not fit the description. I'm pretty convinced this is a single large object. A fun fact is that the code that we used to use for size classes is almost identical to what we use now except that we have a second pass that bumps all object sizes up within size classes to reduce tail waste.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation
Projects
None yet
Development

No branches or pull requests

6 participants