
runtime: "fatal error: s.allocCount != s.nelems && freeIndex == s.nelems" during newobject on windows-amd64 #45775

Closed
bcmills opened this issue Apr 26, 2021 · 13 comments

@bcmills (Member) commented Apr 26, 2021

2021-04-23T21:42:59-41e5ae4/windows-amd64-2012

runtime: s.allocCount= 63 s.nelems= 64
            fatal error: s.allocCount != s.nelems && freeIndex == s.nelems
            
            goroutine 1 [running]:
            runtime.throw({0xd5860d, 0x0})
            	C:/workdir/go/src/runtime/panic.go:1198 +0x76 fp=0xc0000898a0 sp=0xc000089870 pc=0x438f76
            runtime.(*mcache).nextFree(0x330598, 0x14)
            	C:/workdir/go/src/runtime/malloc.go:877 +0x1e5 fp=0xc0000898e8 sp=0xc0000898a0 pc=0x40d145
            runtime.mallocgc(0xc000089990, 0xd042a0, 0x1)
            	C:/workdir/go/src/runtime/malloc.go:1066 +0x4e5 fp=0xc000089978 sp=0xc0000898e8 pc=0x40d665
            runtime.newobject(0x8)
            	C:/workdir/go/src/runtime/malloc.go:1174 +0x27 fp=0xc0000899a0 sp=0xc000089978 pc=0x40dac7
            cmd/internal/obj.(*Link).LookupABIInit(0xc0000de400, {0xc00051a0a0, 0xd32209}, 0x7, 0xc000089a28)
            	C:/workdir/go/src/cmd/internal/obj/sym.go:104 +0xbd fp=0xc000089a00 sp=0xc0000899a0 pc=0x542b1d
            cmd/compile/internal/base.linksym({0xd32209, 0xd32209}, {0xc00051a0a0, 0xe7ad08}, 0x7)
            	C:/workdir/go/src/cmd/compile/internal/base/link.go:35 +0x5c fp=0xc000089a50 sp=0xc000089a00 pc=0x550b3c
            cmd/compile/internal/base.PkgLinksym({0xd32209, 0xc0000addc0}, {0xd3399d, 0xc0000adea0}, 0x10)
            	C:/workdir/go/src/cmd/compile/internal/base/link.go:23 +0x92 fp=0xc000089ab0 sp=0xc000089a50 pc=0x550a92
            cmd/compile/internal/typecheck.LookupRuntimeABI(...)
            	C:/workdir/go/src/cmd/compile/internal/typecheck/syms.go:102
            cmd/compile/internal/typecheck.LookupRuntimeVar(...)
            	C:/workdir/go/src/cmd/compile/internal/typecheck/syms.go:97
            cmd/compile/internal/ssagen.InitConfig()
            	C:/workdir/go/src/cmd/compile/internal/ssagen/ssa.go:208 +0x1cd7 fp=0xc000089c98 sp=0xc000089ab0 pc=0xa67a97
            cmd/compile/internal/gc.Main(0xd64c20)
            	C:/workdir/go/src/cmd/compile/internal/gc/main.go:267 +0xd2a fp=0xc000089f20 sp=0xc000089c98 pc=0xc18f6a
            main.main()
            	C:/workdir/go/src/cmd/compile/main.go:55 +0xdd fp=0xc000089f80 sp=0xc000089f20 pc=0xc3b79d
            runtime.main()
            	C:/workdir/go/src/runtime/proc.go:255 +0x217 fp=0xc000089fe0 sp=0xc000089f80 pc=0x43b537
            runtime.goexit()
            	C:/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000089fe8 sp=0xc000089fe0 pc=0x4691c1

CC @mknyszek

@bcmills (Member, Author) commented Apr 26, 2021

Tentatively marking as release-blocker for 1.17 until we can determine whether this is a regression. (The only other occurrence of this error I could find in the logs was in 2019 on plan9, which seems likely to be unrelated.)

@toothrot (Contributor) commented Apr 29, 2021

/cc @bufflig

@heschi (Contributor) commented May 13, 2021

Weekly check-in: this needs to be investigated before beta 1.

@mknyszek @aclements @prattmic

@mknyszek (Contributor) commented May 18, 2021

Here's what the situation looks like: nextFreeFast (the allocator fast path) was called for this span that was already in the mcache, and it failed. Then, nextFree was called to either replenish the span's allocCache, or go get a new span. In this case, the span appears full, so it tries to go get a new span, but before it does that, the allocator notices that allocCount doesn't line up with nelems and throws an error.

What's interesting is that this means the program had already allocated out of this mcache at least once in the current GC cycle, and somehow the allocator missed a free slot in the process. In general, this is very unlikely; these code paths are exercised extremely heavily. The allocator did not change in 1.17, so if there's a bug in that logic, it's not new. I've walked over this code a bunch of times now and I can't find a fault in the algorithm.

I thought maybe I saw something in a corner case, like a span that was just swept (there is some super weird stuff we do there that I think we should clean up, like relying on certain pieces of span state to change and then fixing them up...), but it all seems to check out.

On the other hand, if we consider memory corruption a (scary but) viable alternative, then an errant zero written over the span's allocCache would manifest as this error in many cases.
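
For concreteness, here's a self-contained toy model of the two things above: the slow-path consistency check that produced this throw, and how a single errant zero written over the span's cached free-slot bits would trip it. The names (toySpan, nextFree, etc.) and the single 64-bit bitmap with "set bit means free slot" are illustrative only, not the actual malloc.go code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// toySpan models only the span fields involved in this failure: a span of
// 64 object slots whose free slots are tracked through a cached 64-bit
// bitmap (a set bit means the slot is free).
type toySpan struct {
	nelems     uintptr // number of object slots in the span
	allocCount uintptr // how many slots the allocator believes are in use
	freeIndex  uintptr // slot index the cached bitmap starts at
	allocCache uint64  // cached free-slot bits from freeIndex onward
}

// nextFreeIndex mimics the scan for the next free slot: the index of the
// lowest set bit in the cache, or nelems if the cache claims nothing is free.
func (s *toySpan) nextFreeIndex() uintptr {
	if s.allocCache == 0 {
		return s.nelems
	}
	idx := s.freeIndex + uintptr(bits.TrailingZeros64(s.allocCache))
	if idx >= s.nelems {
		return s.nelems
	}
	return idx
}

// nextFree mirrors the slow-path cross-check that fired in the report: if the
// span looks full, every slot must be accounted for before it is handed back.
func (s *toySpan) nextFree() {
	if s.nextFreeIndex() == s.nelems {
		if s.allocCount != s.nelems {
			fmt.Println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
			panic("s.allocCount != s.nelems && freeIndex == s.nelems")
		}
		// Otherwise the full span would be returned and a fresh one fetched.
	}
}

func main() {
	// 63 of 64 slots allocated; slot 63 is genuinely free, so its bit is set.
	s := &toySpan{nelems: 64, allocCount: 63, allocCache: 1 << 63}

	// Corruption scenario: an errant zero lands on allocCache, so the scan no
	// longer sees the free slot even though allocCount is still 63.
	s.allocCache = 0

	s.nextFree() // panics with the same message as the failure above
}
```

With the cache intact, the scan finds slot 63 and everything stays consistent; zeroing the cache word is enough to make freeIndex reach nelems while allocCount is still 63, which is exactly the state in the report above.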

@RLH (Contributor) commented May 19, 2021

@mknyszek (Contributor) commented May 19, 2021

Thanks Rick. I'm trying to wrap my head around why a revived object would be caught here, though, rather than during sweeping. Since only swept spans may be cached, revived objects are checked for during sweeping, and marking never touches the allocation bits, so I'm not sure how the allocation path could surface an error for a revived object, at least not the way the code is currently written.

Perhaps there's something I'm missing, though.
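
To make that concrete, here's a rough, self-contained sketch (toy names and simplified single-word bitmaps, not the real sweep code) of the sweep-time bookkeeping I mean: sweep is the only place the allocation bits are rewritten, it derives them from the mark bits, and a revived object (one that comes back marked even though the allocator considered it free) would show up right there, before the span could ever be put back in an mcache.

```go
package main

import (
	"fmt"
	"math/bits"
)

// toySpan keeps a span's allocation and mark state as two bitmaps: a set bit
// means "allocated" in allocBits and "found reachable" in markBits.
type toySpan struct {
	allocCount int
	allocBits  uint64
	markBits   uint64
}

// sweep recomputes the allocation state from the mark bits; the mark phase
// itself never touches allocBits, so this is where any disagreement between
// "what is marked" and "what was allocated" gets noticed.
func (s *toySpan) sweep() {
	nalloc := bits.OnesCount64(s.markBits)
	if nalloc > s.allocCount {
		// More objects came back marked than were ever allocated: a slot the
		// allocator considered free was "revived" by the mark phase.
		fmt.Println("sweep: nalloc=", nalloc, "previous allocCount=", s.allocCount)
		panic("sweep found more marked objects than allocated")
	}
	s.allocCount = nalloc
	s.allocBits = s.markBits // unmarked slots become free again
	s.markBits = 0           // fresh mark bits for the next cycle
}

func main() {
	// Two slots allocated, but three slots came back marked from the GC.
	s := &toySpan{allocCount: 2, allocBits: 0b011, markBits: 0b111}
	s.sweep() // the revived object is caught here, not on the allocation path
}
```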

@heschi (Contributor) commented Jun 10, 2021

I think this needs to be addressed before final release and ideally before RC, so ping.

@toothrot (Contributor) commented Jun 17, 2021

Checking in as the RC1 date is approaching.

@mknyszek (Contributor) commented Jun 17, 2021

Just walked back all the windows/amd64 failures I could find in the logs up to this failure. I couldn't find anything else like it. EDIT: Or anything else untriaged or unresolved, for that matter.

I'm really not sure what to do here. My analysis above led nowhere.

@ianlancetaylor (Contributor) commented Jun 17, 2021

As far as I can tell this has happened exactly once. "Once is happenstance." If we don't see any way that this could happen, I think we should close the issue until it recurs.

@mknyszek (Contributor) commented Jun 17, 2021

If it happens again, I'm happy to throw more resources at it to find the root cause, but I really don't see how this could happen, and I don't feel particularly optimistic about reproducing it, given how well reproducing the FreeBSD memory corruption is going (though that has certainly happened more than once!).

@bcmills (Member, Author) commented Jun 17, 2021

I'd be ok with closing it as non-reproducible for now, given that we also can't identify any changes that might have caused it.

@ianlancetaylor (Contributor) commented Jun 17, 2021

Closing.
