Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime,cmd/compile: apparent memory corruption in compress/flate #54596

Open
bcmills opened this issue Aug 22, 2022 · 10 comments
Open

runtime,cmd/compile: apparent memory corruption in compress/flate #54596

bcmills opened this issue Aug 22, 2022 · 10 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bcmills
Copy link
Member

bcmills commented Aug 22, 2022

greplogs -l -e \(\?ms\)runtime\\.throw.\*FAIL\\s+compress/flate --since=2021-10-09
2022-08-17T20:19:28-9c2b481/freebsd-amd64-12_3
2022-08-17T03:15:44-e1b62ef/netbsd-arm64-bsiegert

A very similar failure mode was reported in #48846 and thought to have been fixed by @cherrymui.

(attn @golang/runtime)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 22, 2022
@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 24, 2022
@mknyszek mknyszek added this to the Go1.20 milestone Aug 24, 2022
@heschi
Copy link
Contributor

heschi commented Aug 29, 2022

@cherrymui
Copy link
Member

cherrymui commented Aug 30, 2022

All three failures seem to have in common:

  • the GC complains about a bad pointer pointing to an unallocated span, but the object hex dump shows the memory content is 0. In face, the entire object seems 0. I think this looks like there is something zeroing the object in concurrent with the GC.
  • the object is in a span of size 663552 (0xa2000), which may be a flate.Writer which has size 656648 (0xa0508)

One possibility for the concurrent write of 0 is allocation. It may be that the GC scans the object when (or before) mallocgc zeroes it. For an object with that size, it is a large object and goes on this code path https://cs.opensource.google/go/go/+/master:src/runtime/malloc.go;l=1036

Why the GC sees the object before mallocgc returns? I don't really know. A conservative scan that happens to see a value coincide with the newly allocated pointer? That seems possible. mallocgc sets span.freeindex to 1 before zeroing the object or setting the heap bitmap. scanConservative (and greyobject) checks for span.IsFree, which in turn checks for freeindex https://cs.opensource.google/go/go/+/master:src/runtime/mgcmark.go;l=1431 . So it seems possible for scanConservative to pick up that object.

The span is from mcache.allocLarge, which calls initHeapBits https://cs.opensource.google/go/go/+/master:src/runtime/mcache.go;l=255
which, however, seems to set the heap bits only in certain cases -- noscan, or all pointers https://cs.opensource.google/go/go/+/master:src/runtime/mbitmap.go;l=735
For the case of a flate.Writer, it would be unset. I guess then the GC is possible to use an old heap bitmap to scan an object that isn't (fully) initialized? That seems bad.

In the third failure https://build.golang.org/log/4c7618e90e934e0158912f4fb949875d171bd521 it is complaining about the first word of the object. If the object is really a flate.Writer, the first word is not a pointer. This may indicate it is not using the right pointer bitmap, which the theory above would lead to.

In Go 1.19, where heapBits.initSpan always set the span's heap bits (to either all 0s or all pointers, never leave it unset), which seems to prevent this problem.

Maybe we should increment span.freeindex later, after zeroing the object and maybe setting the heap bits, so the GC won't see it, even for a conservative scan? This seems easy to do for large objects, but may be harder to do for small objects (without refactoring nextFree etc.).

I haven't reproduced the failure myself (didn't try very hard either). This is just a hypothesis. I may have missed something...

cc @mknyszek @randall77 @aclements

@mknyszek
Copy link
Contributor

mknyszek commented Aug 30, 2022

A conservative scan that happens to see a value coincide with the newly allocated pointer?

One issue with this is while we're still in some runtime frame a conservative scan shouldn't happen. Asynchronous preemption backs out, I think.

But I agree with the rest of your thinking, i.e. the fact that it's a large object and that it seems like a pointer might somehow be getting published early (while freeindex is still unset). Either that, or there's a pointer into that space that was for some old object before the span was freed, and then materialized after the space for it had died (and it's just likely to land on the large object because it's large).

I think you're also onto something with the fact that the bitmap isn't cleared. Maybe there's yet-another place somewhere that we might hold onto a stale pointer that's otherwise safe because we assume the bitmap is going to get cleared (this was the case with noscan objects and cgo not long ago)?

We can try always zeroing the bitmap (unfortunate...) and see if the problem still shows up. That's data, but it would take a while to be sure given how rare this is.

@cherrymui
Copy link
Member

cherrymui commented Aug 30, 2022

One issue with this is while we're still in some runtime frame a conservative scan shouldn't happen. Asynchronous preemption backs out, I think.

Oh, I didn't mean conservatively scanning the runtime frame (sorry for being unclear). I think it is possible that a conservative scan of a (completely separate) user frame, on a different thread, happens to see a value (possibly a dead pointer) that is within the range of the newly allocated span.

That said, I don't really know that it was conservative scan. There might be other possibilities for a GC to see such pointer that I couldn't think of.

We can try always zeroing the bitmap (unfortunate...) and see if the problem still shows up. That's data, but it would take a while to be sure given how rare this is.

I can try that. Last night I ran stress flate.test -test.short overnight on a gomote and saw 15 failures over 300000 runs. I can make the change and let it run for a day.

Among the failures the object hex dump are not always all 0. It could be that it is not (always) allocation. Or it may still be allocation, just that when it throws the allocating goroutine has already moved on, returned from mallocgc and started filling in the content.

I think you're also onto something with the fact that the bitmap isn't cleared. Maybe there's yet-another place somewhere that we might hold onto a stale pointer that's otherwise safe because we assume the bitmap is going to get cleared (this was the case with noscan objects and cgo not long ago)?

Yeah, this is unfortunate. But I'm also not really sure how safe it is for the "all pointer" case. If the above is possible, I guess it is still possible for the GC to see the newly allocated but not yet zeroed pointer word? Just much harder as it is just one word?

@mknyszek
Copy link
Contributor

mknyszek commented Aug 30, 2022

Oh, I didn't mean conservatively scanning the runtime frame

Ohhhhhhh, I see. That's a really good point. It didn't occur to me that pointers could in effect get "published" this way. We probably do need to be defensive about it, as you suggest.

@gopherbot
Copy link

gopherbot commented Aug 30, 2022

Change https://go.dev/cl/426834 mentions this issue: runtime: always initialize heap bits for a span

@cherrymui
Copy link
Member

cherrymui commented Aug 30, 2022

With the CL above, before the gomote was killed by coordinator restart, I got

3h36m5s: 101279 runs so far, 0 failures

compared to previously

14h0m0s: 310600 runs so far, 15 failures (0.00%)

It seems positive that zeroing heap bitmap helps.

Independently I'm going to try if delaying incrementing span.freeindex makes any difference.

@gopherbot
Copy link

gopherbot commented Sep 1, 2022

Change https://go.dev/cl/427619 mentions this issue: runtime: delay incrementing freeindex in malloc

@cherrymui
Copy link
Member

cherrymui commented Sep 1, 2022

With CL 427619 (without CL 426834), it also doesn't fail running overnight (18h4m40s: 522171 runs so far, 0 failures). So delaying incrementing freeindex also seems to help.

As another data point, disabling async preemption (without either CL above) also seems to stop the failure (15h58m40s: 462136 runs so far, 0 failures).

I think these all support the hypothesis above, i.e. conservative scan, GC seeing the newly allocated object before zeroing, and heap bitmap not set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Todo
Development

No branches or pull requests

6 participants