Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/flate: unrecognized failures #66383

Open
gopherbot opened this issue Mar 18, 2024 · 17 comments
Open

compress/flate: unrecognized failures #66383

gopherbot opened this issue Mar 18, 2024 · 17 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@gopherbot
Copy link
Contributor

#!watchflakes
default <- pkg == "compress/flate" && test == ""

Issue created automatically to collect these failures.

Example (log):

SIGSEGV: segmentation violation
PC=0x10022034c m=6 sigcode=2 addr=0x20

goroutine 0 gp=0x140000848c0 m=6 mp=0x14000054408 [idle]:
runtime.(*mspan).typePointersOfUnchecked(0x140006f0e80?, 0x14000380600?)
	/tmp/buildlet/go/src/runtime/mbitmap_allocheaders.go:202 +0x3c fp=0x16ff36d90 sp=0x16ff36d70 pc=0x10022034c
runtime.scanobject(0x14003542000, 0x1400002d258)
	/tmp/buildlet/go/src/runtime/mgcmark.go:1439 +0x1c4 fp=0x16ff36e20 sp=0x16ff36d90 pc=0x10022c264
runtime.gcDrain(0x1400002d258, 0x3)
	/tmp/buildlet/go/src/runtime/mgcmark.go:1240 +0x1d4 fp=0x16ff36e90 sp=0x16ff36e20 pc=0x10022ba04
...
r24     0x0
r25     0x1400002d2a0
r26     0xffffffffffffffff
r27     0xffffffffffffffe8
r28     0x140000848c0
r29     0x16ff36d68
lr      0x10022c264
sp      0x16ff36d70
pc      0x10022034c
fault   0x20

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 18, 2024
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "compress/flate" && test == ""
2024-03-18 17:06 darwin-arm64-12 go@dda4b17e compress/flate (log)
SIGSEGV: segmentation violation
PC=0x10022034c m=6 sigcode=2 addr=0x20

goroutine 0 gp=0x140000848c0 m=6 mp=0x14000054408 [idle]:
runtime.(*mspan).typePointersOfUnchecked(0x140006f0e80?, 0x14000380600?)
	/tmp/buildlet/go/src/runtime/mbitmap_allocheaders.go:202 +0x3c fp=0x16ff36d90 sp=0x16ff36d70 pc=0x10022034c
runtime.scanobject(0x14003542000, 0x1400002d258)
	/tmp/buildlet/go/src/runtime/mgcmark.go:1439 +0x1c4 fp=0x16ff36e20 sp=0x16ff36d90 pc=0x10022c264
runtime.gcDrain(0x1400002d258, 0x3)
	/tmp/buildlet/go/src/runtime/mgcmark.go:1240 +0x1d4 fp=0x16ff36e90 sp=0x16ff36e20 pc=0x10022ba04
...
r24     0x0
r25     0x1400002d2a0
r26     0xffffffffffffffff
r27     0xffffffffffffffe8
r28     0x140000848c0
r29     0x16ff36d68
lr      0x10022c264
sp      0x16ff36d70
pc      0x10022034c
fault   0x20

watchflakes

@aciduck
Copy link

aciduck commented Mar 28, 2024

This happens to us as well, since upgrading from Go 1.21 to 1.22.1.
In a multi-tenant environment this constantly reproduces just for a single tenant in a single component, after a variable runtime of 30 minutes - 4 hours. We couldn't find a correlation to specific business logic actions, but if this is related to the compression package it might just be networking.

$ go version
go version go1.22.1 linux/arm64

An example of such an error is attached -
extract-2024-03-27T16_36_44.051Z.txt

@aciduck
Copy link

aciduck commented Apr 3, 2024

Update: our code was using the builtin gzip package specifically in this use case, and for a specific tenant, to compress large amounts of data. Once we disabled it the crash disappeared. This seems to match the original failure in this ticket, sugging issues in the compression code.

@dany74q

This comment was marked as duplicate.

@AdallomRoy

This comment was marked as duplicate.

@seankhliao
Copy link
Member

@mknyszek
Copy link
Contributor

mknyszek commented May 7, 2024

I missed the comments here from back in late March/early April, but to be absolutely clear, this failure happened on our old macOS infrastructure which was known for having memory corruption bugs that haven't been present since we switched that infrastructure (though, there were other issues, but they didn't look like this).

For everyone commenting here, I think we need much more information to be able to do anything about this.

  • What platform did it happen to you on?
  • Which Go version was used to build the code?
  • Can you provide stack traces (even just the runtime parts are helpful)?
  • Do you have a way to reproduce?

Thanks in advance.

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 7, 2024
@mknyszek
Copy link
Contributor

mknyszek commented May 7, 2024

(My only suspicion here is that there's a mishandling of "GC progs" which are used for the GC metadata for very large objects. I know for certain that compress/flate has one such object. A reproducer would go a very long way to resolving this.)

@mknyszek mknyszek self-assigned this May 7, 2024
@mknyszek mknyszek added this to the Backlog milestone May 7, 2024
@aciduck
Copy link

aciduck commented May 8, 2024

Hello @mknyszek, thank you for the response.
Attached is a crash stack trace from today -
extract-2024-05-08T09_29_41.780Z.txt
The code is running in an ARM64 Linux container, built using Go 1.22.2.

We don't have a small reproducer, because it doesn't happen on small payloads. It only happens for a single tenant in our environment, but for them it happens on a daily basis. So we know it depends on the actual data being compressed.

@mknyszek
Copy link
Contributor

mknyszek commented May 8, 2024

Thanks. I looked at the failing line it looks like the type information is nil. arm64 is an interesting platform to fail on because of the weaker memory model -- perhaps GC prog objects are becoming visible to the GC sooner? (Assuming that is the codepath being taken here, which is still a bit unclear.)

Go 1.23 actually already has a mitigation for this: if the type is nil, scanning is skipped (this is to support a different use-case). It's possible you just might not see the issue with Go 1.23 already (that is, if you try building Go from tip-of-tree).

@mknyszek
Copy link
Contributor

mknyszek commented May 8, 2024

#67255 may be related. It can produce the same failure and a fix was recently landed. The situation is an append of a 0-sized slice, which may be something compress/flate does.

@randall77
Copy link
Contributor

I tried reverting my CL for 67255 and adding this hack to the runtime:

diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index dc6922da54..f897b2928b 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -367,6 +367,30 @@ func reflect_typedarrayclear(typ *_type, ptr unsafe.Pointer, len int) {
 //
 //go:nosplit
 func memclrHasPointers(ptr unsafe.Pointer, n uintptr) {
+       if n == 0 && ptr != nil {
+               p := uintptr(ptr)
+               s := spanOf(p)
+               if s == nil {
+                       println(ptr)
+                       panic("no span")
+               }
+               if s.state.get() != mSpanInUse {
+                       panic("span not in use")
+               }
+               i := s.objIndex(p)
+               if i >= uintptr(s.nelems) {
+                       panic("index too big")
+               }
+               if s.isFree(i) {
+                       panic("free object!")
+               }
+               if s.elemsize > 512 && s.spanclass.sizeclass() != 0 {
+                       q := s.objBase(i)
+                       if p == q {
+                               panic("header pointer")
+                       }
+               }
+       }
        // Pass nil for the type since we don't have one here anyway.
        bulkBarrierPreWrite(uintptr(ptr), 0, n, nil)
        memclrNoHeapPointers(ptr, n)

And all.bash almost passes. Just some tests in test/append.go trigger it. (As does my new test for 67255 if I add that in.) In particular, compress/flate tests don't trigger it. So it appears unlikely that #67255 is the direct cause, although it is hard to be 100% sure.

@randall77
Copy link
Contributor

... but I'm still betting that a pointer-past-the-end bug somewhere is ultimately responsible.

@mknyszek
Copy link
Contributor

@aciduck Would it be possible for you to try out different things to help narrow down the root cause?

  • If you set GODEBUG=asyncpreemptoff=1, does that change anything? The fact that it's arm64 is fishy to me, and I'm concerned it's some kind of rare synchronization bug due to the weak memory model.
  • If you're open to trying out a patch, it would be good to confirm or deny my suspicion about it GC progs at fault. I can prepare that shortly.
  • What happens if you try Go at tip-of-tree? As I mentioned before, this may have been unintentionally mitigated already, and if so, it might be that it's just fixed (which may also be a clue as to what's going on). Simultaneously, if instead of this failure, you start to see new ones (looking more like random memory corruption), that's also a hint that something is very wrong.

Since you don't have a small reproducer, I assume the only way to try this out is to just run with these modes for a while and see what happens? Meanwhile I'll try to reproduce myself.

@aciduck
Copy link

aciduck commented Jun 11, 2024

Hi @mknyszek, we have rolled back the updated config and have been trying to reproduce the issue for two weeks now, without success. As mentioned before, it happened for a single tenant in a multi-tenant environment, so the problem was always related to a specific data set. The data seems to have changed in the past month, and it doesn't cause the issue anymore.

@valyala
Copy link
Contributor

valyala commented Jun 21, 2024

FYI, this panic is also reproduced in Go1.22.4 on amd64 architecture, while it cannot be reproduced in Go1.21 - see VictoriaMetrics/VictoriaMetrics#6425

We are going to build the app with Go tip and see whether it is reproduced there.

@valyala
Copy link
Contributor

valyala commented Jun 26, 2024

Update: the panic isn't reproducible in Go1.23rc1 - see VictoriaMetrics/VictoriaMetrics#6425 (comment) . So, it looks like the issue has been already fixed in Go1.23.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: In Progress
Status: No status
Development

No branches or pull requests

8 participants