
runtime: go microservice is crashing in runtime during garbage collection running #64682

Closed
harsh8118 opened this issue Dec 12, 2023 · 16 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@harsh8118

Go version

go 1.20

What operating system and processor architecture are you using (go env)?

go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='.cache/go-build'
GOENV='.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4248553971=/tmp/go-build -gno-record-gcc-switches'

What did you do?

The system was idle for 48 hours when we observed this crash.

What did you expect to see?

no crash

What did you see instead?

0 0x000055925bb1e532 in runtime.readgstatus
at runtime/proc.go:977
1 0x000055925bb1e532 in runtime.tracebackothers.func1
at runtime/traceback.go:1231
2 0x000055925baf8969 in runtime.forEachGRace
at runtime/proc.go:621
3 0x000055925bb1e4db in runtime.tracebackothers
at runtime/traceback.go:1230
4 0x000055925bb0b9ab in runtime.sighandler
at runtime/signal_unix.go:748
5 0x000055925bb0b0d0 in runtime.sigtrampgo
at runtime/signal_unix.go:490
6 0x000055925bb1e532 in runtime.tracebackothers.func1
at runtime/traceback.go:1231
7 0x000055925baf8969 in runtime.forEachGRace
at runtime/proc.go:621
8 0x000055925bb1e4db in runtime.tracebackothers
at runtime/traceback.go:1230
9 0x000055925baf59c6 in runtime.dopanic_m
at runtime/panic.go:1316
10 0x000055925baf53ed in runtime.fatalthrow.func1
at runtime/panic.go:1170
11 0x000055925baf5345 in runtime.fatalthrow
at runtime/panic.go:1163
12 0x000055925baf503e in runtime.throw
at runtime/panic.go:1077
13 0x000055925bad16a5 in runtime.badPointer
at runtime/mbitmap.go:321
14 0x000055925bad1826 in runtime.findObject
at runtime/mbitmap.go:364
15 0x000055925badd40c in runtime.scanobject
at runtime/mgcmark.go:1335
16 0x000055925badccda in runtime.gcDrain
at runtime/mgcmark.go:1103
17 0x000055925bad95cf in runtime.gcBgMarkWorker.func2
at runtime/mgc.go:1385
18 0x000055925bb29327 in runtime.systemstack
at runtime/asm_amd64.s:509
19 0x000055925bb292c8 in runtime.systemstack_switch
at runtime/asm_amd64.s:474
20 0x000055925bad9296 in runtime.gcBgMarkWorker
at runtime/mgc.go:1353
21 0x000055925bb2b2c1 in runtime.goexit
at runtime/asm_amd64.s:1650
(dlv)

@harsh8118
Author

harsh8118 commented Dec 12, 2023

In frame 14 I dumped the values of p and s.
It looks like p equals s.limit, which is why it is crashing.

(dlv) print p
824644280320

(dlv) print s
(*runtime.mspan)(0x7f3b40cba868)
*runtime.mspan {
_: runtime/internal/sys.NotInHeap {
_: runtime/internal/sys.nih {},},
next: *runtime.mspan nil,
prev: *runtime.mspan nil,
list: *runtime.mSpanList nil,
startAddr: 824644272128,
npages: 1,
manualFreeList: 0,
freeindex: 0,
nelems: 256,
allocCache: 18446744073709551615,
allocBits: runtime.gcBits {
_: (
"runtime/internal/sys.NotInHeap")(0x7f3b40a208e0),
x: 0,},
gcmarkBits: runtime.gcBits {
_: (
"runtime/internal/sys.NotInHeap")(0x7f3b40fbc270),
x: 0,},
pinnerBits: runtime.gcBits nil,
sweepgen: 7992,
divMul: 134217728,
allocCount: 0,
spanclass: 8,
state: runtime.mSpanStateBox {
s: (
"runtime/internal/atomic.Uint8")(0x7f3b40cba8d3),},
needzero: 1,
isUserArenaChunk: false,
allocCountBeforeCache: 0,
elemsize: 32,
limit: 824644280320,
speciallock: runtime.mutex {
lockRankStruct: runtime.lockRankStruct {},
key: 0,},
specials: *runtime.special nil,
userArenaChunkFree: runtime.addrRange {
base: (*runtime.offAddr)(0x7f3b40cba8f8),
limit: (*runtime.offAddr)(0x7f3b40cba900),},
freeIndexForScan: 0,}
(dlv)

findObject is failing at this condition:
if state := s.state.get(); state != mSpanInUse || p < s.base() || p >= s.limit {

Here p is 824644280320 and s.limit is also 824644280320, so the condition is true and badPointer() is called.

What could cause p to be set to s.limit? Is it an issue with the runtime GC code, or can it happen due to some wrong/invalid memory access in my own code?

@mia19550

mia19550 commented Dec 12, 2023 via email

@randall77
Contributor

It could be a problem with your code, if you have a data race or have incorrect uses of unsafe.Pointer.
Run your program with the race detector (-race) and see if it reports anything.

@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 12, 2023
@harsh8118
Author

It's a very large application, and this crash appears only rarely, while the system is idle. I tried running my program with the race detector multiple times, but I don't hit the issue when -race is enabled.

@randall77
Contributor

Thanks for that. The race detector will often catch problems well before they become a crash, so there's likely information in its lack of report even if you didn't reach the crash point with it on.

Unfortunately, there's not a lot to go on here. The garbage collector is clearly finding a bad pointer, but that doesn't give us much information about where it found that pointer. If you could provide more of the bad pointer report, that would help. That report would be the lines starting with a phrase like "runtime: pointer 0xXXX to unused region of span" and ending with "fatal error: found bad pointer in Go heap".

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Dec 13, 2023
@harsh8118
Author

Hi, unfortunately I wasn't able to collect the report logs when the crash happened, but I have access to the core file, so I can dump the values that would have been logged.
Here the state is mSpanDead:
(dlv) print s.state
runtime.mSpanStateBox {
s: runtime/internal/atomic.Uint8 {
noCopy: runtime/internal/atomic.noCopy {},
value: 0,},}

p = 824644280320
refBase = 824659009360
refOff = 16

I previously posted a complete dump of s, so s.limit, s.base, and the other related fields are available there:
s.base is startAddr: 824644272128
s.limit is limit: 824644280320

Is there a way I can simulate the output of gcDumpObject() using dlv?

@harsh8118
Author

I am also confused by these frames in the stack trace. In frame 5, runtime.sigtrampgo() is called; is that due to the badPointer call in the same flow, or due to some other signal handler?

2 0x000055925baf8969 in runtime.forEachGRace
at runtime/proc.go:621
3 0x000055925bb1e4db in runtime.tracebackothers
at runtime/traceback.go:1230
4 0x000055925bb0b9ab in runtime.sighandler
at runtime/signal_unix.go:748
5 0x000055925bb0b0d0 in runtime.sigtrampgo
at runtime/signal_unix.go:490
6 0x000055925bb1e532 in runtime.tracebackothers.func1
at runtime/traceback.go:1231
7 0x000055925baf8969 in runtime.forEachGRace
at runtime/proc.go:621
8 0x000055925bb1e4db in runtime.tracebackothers
at runtime/traceback.go:1230
9 0x000055925baf59c6 in runtime.dopanic_m
at runtime/panic.go:1316
10 0x000055925baf53ed in runtime.fatalthrow.func1
at runtime/panic.go:1170
11 0x000055925baf5345 in runtime.fatalthrow
at runtime/panic.go:1163
12 0x000055925baf503e in runtime.throw
at runtime/panic.go:1077
13 0x000055925bad16a5 in runtime.badPointer
at runtime/mbitmap.go:321

@randall77
Contributor

is there a way i can simulate the output of gcDumpObject() using dlv ?

I don't think there's any easy way to do that, no. Other than simulating it by hand.

I am also confused about these traces in stack trace. in frame 5 runtime.sigtrampgo() is called is it due to badPointer call in same flow or some other signal handler ?

The signal handler during the backtrace does look odd. The runtime is trying to print the backtrace of a goroutine and it is failing. If I had to guess I'd say there was a nil *g reported by forEachGRace. That sounds like a heap-has-been-corrupted problem, which would also explain the original problem that a bad pointer was found.

I'm having a bit of trouble making sense of your line numbers. The traceback.go line numbers should be 1240 and 1241, not 1230 and 1231, I think. Your environment variables say go1.21.4, but the line numbers look like they are from an earlier point release, maybe 1.21.1?

@harsh8118
Author

The Go version is 1.20; it looks like my go env output was wrong.

@randall77
Contributor

Those line numbers look completely wrong for 1.20. Can you double-check? You should be able to run go version <binary> to double-check the version of the binary you are using (using any version of go for the go version command).

It would be worth trying 1.21.5 or tip, just to see if it is a problem that has been fixed.

@harsh8118
Author

harsh8118 commented Dec 13, 2023

Yeah, sorry, it is go1.21.1.
Moving to 1.21.5 would be a big change for us; we may plan it at some point. Do you suspect any issue with the runtime code in 1.21.1, or was any related fix made in 1.21.5?

@prattmic prattmic changed the title import/path: go microservice is crashing in runtime during garbage collection running. runtime: go microservice is crashing in runtime during garbage collection running Dec 13, 2023
@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 13, 2023
@prattmic prattmic added this to the Backlog milestone Dec 13, 2023
@randall77
Contributor

I don't see anything obviously related in the 1.21.1 -> 1.21.5 changes.

Was there a Go version under which this code didn't crash? If so, or if we could find one, then we could do binary search to find the change that is likely responsible.

I don't have any other ideas to figure out what the problem is here, sorry.

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Dec 13, 2023
@harsh8118
Author

We were hitting this crash on a previous Go version as well.
I need some information about this part of the runtime code. In findObject(), the condition below fails because p's address value is the same as s.limit:

if state := s.state.get(); state != mSpanInUse || p < s.base() || p >= s.limit {

p = 824644280320
s.limit is limit: 824644280320

scanobject() calculates p and passes it as a parameter to findObject(). Could there be an issue in scanobject while calculating p, or could the heap corruption itself cause a problem in calculating p?

@randall77
Contributor

If the heap is corrupted, arbitrarily bad things can happen. s.limit could be overwritten with junk, for example.
That said, in scanobject the p argument to findObject comes directly from a load from memory, so scanobject can't "miscalculate" p, as it does no calculation to construct p. It can, of course, load it from the wrong location if the heap is corrupted.

@harsh8118
Author

I hit another crash while the system was idle; this time the backtrace is different, and it is crashing in pprof. We have pprof enabled for all profiles and collect data every 15 minutes. Is there a chance pprof causes a race condition while GC is running that could corrupt the heap?

Sending output to pager...
0 0x0000558343990805 in runtime.(*unwinder).next
at runtime/traceback.go:453
1 0x00005583439920a5 in runtime.traceback2
at runtime/traceback.go:977
2 0x0000558343991e65 in runtime.traceback1.func1
at runtime/traceback.go:913
3 0x0000558343991cd2 in runtime.traceback1
at runtime/traceback.go:936
4 0x00005583439935e5 in runtime.traceback
at runtime/traceback.go:813
5 0x00005583439935e5 in runtime.tracebackothers.func1
at runtime/traceback.go:1244
6 0x000055834396d969 in runtime.forEachGRace
at runtime/proc.go:621
7 0x00005583439934db in runtime.tracebackothers
at runtime/traceback.go:1230
8 0x00005583439809ab in runtime.sighandler
at runtime/signal_unix.go:748
9 0x00005583439800d0 in runtime.sigtrampgo
at runtime/signal_unix.go:490
10 0x0000558343990805 in runtime.(*unwinder).next
at runtime/traceback.go:453
11 0x00005583439920a5 in runtime.traceback2
at runtime/traceback.go:977
12 0x0000558343991e65 in runtime.traceback1.func1
at runtime/traceback.go:913
13 0x0000558343991cd2 in runtime.traceback1
at runtime/traceback.go:936
14 0x00005583439935e5 in runtime.traceback
at runtime/traceback.go:813
15 0x00005583439935e5 in runtime.tracebackothers.func1
at runtime/traceback.go:1244
16 0x000055834396d969 in runtime.forEachGRace
at runtime/proc.go:621
17 0x00005583439934db in runtime.tracebackothers
at runtime/traceback.go:1230
18 0x000055834396a9c6 in runtime.dopanic_m
at runtime/panic.go:1316
19 0x000055834396a3ed in runtime.fatalthrow.func1
at runtime/panic.go:1170
20 0x000055834399e3a7 in runtime.systemstack
at runtime/asm_amd64.s:509
21 0x000055834399e348 in runtime.systemstack_switch
at runtime/asm_amd64.s:474
22 0x000055834396a345 in runtime.fatalthrow
at runtime/panic.go:1163
23 0x000055834396a03e in runtime.throw
at runtime/panic.go:1077
24 0x0000558343981125 in runtime.sigpanic
at runtime/signal_unix.go:875
25 0x0000558343943e3a in runtime.mapaccess2_fast64
at runtime/map_fast64.go:84
26 0x0000558343e7a070 in runtime/pprof.(*profileBuilder).appendLocsForStack
at runtime/pprof/proto.go:411
27 0x0000558343e7f9dc in runtime/pprof.writeHeapProto
at runtime/pprof/protomem.go:46
28 0x0000558343e73325 in runtime/pprof.writeHeapInternal
at runtime/pprof/pprof.go:572
29 0x0000558343e72fe9 in runtime/pprof.writeAlloc
at runtime/pprof/pprof.go:538
30 0x0000558343e70fc6 in runtime/pprof.(*Profile).WriteTo
at runtime/pprof/pprof.go:329

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jan 13, 2024
@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 13, 2024

6 participants