-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: PreciseGC causing app to crash during longevity tests #5443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Labels
Milestone
Comments
I have been trying for 2 days, but can't narrow it down. The crash point changes like every day, so it's hard to pin-point, and the crash doesn't come from a specific place in my code. The only constant is that the crash disappears once I disabled PreciseGC. If it will help, I can build and privately share my the application binaries (both the app and the datastore), so you can reproduce in-house. |
I can reproduce this now even without concurrent access. I added these lines to my code (paraphrased as): func logR(calldepth uint8, ctx interface{}, level Level, message string, params ...interface{}, ) (err error) { //snip ... fmt.Printf("====> Params: len: %d\n", len(params)) fmt.Printf("====> %#v\n", params) r.Message = fmt.Sprintf(message, params...) //snip ... } #### I attached logging.go in case that will help. In my code, I print out the arguments to Sprintf. Since it's always the same request, the same sequence of lines should be printed. I then request a page in a loop. T Notice that, at the time the crash happens, we should have been printing out []interface {}{interface {}(nil)} But at a certain point, accessing that within fmt.Sprintf crashes with a unexpected fault address 0x0. Like I mentioned earlier, this is not limited to fmt.Sprintf. The crash point has changed a number of times, but has been constant at this point the last few days. I've tried to trigger this standalone (ie code that loops forever and does this sprintf of a []interface{}{nil}... and calls runtime.GC() each time, but can't trigger it standalone. This last time during my longevity test, it triggered it with no concurrent requests after 47275 requests. ====> []interface {}{"/p/core/topic/1688849860329472"} ====> Params: len: 1 ====> []interface {}{"topic"} ====> Params: len: 4 ====> []interface {}{"TP", "", 1688849860329472, interface {}(nil)} ====> Params: len: 4 ====> []interface {}{1688849860329472, 0x6, 0x1, 1688849860329472} ====> Params: len: 2 ====> []interface {}{(*ndb.Key)(0xc2003bdb90), []uint8{0x20, 0x6, 0x0, 0x0, 0x0, 0x1, 0x8e, 0x2}} ====> Params: len: 2 ====> []interface {}{"TC", ""} ====> Params: len: 2 ====> []interface {}{"TC", ""} ====> Params: len: 0 ====> []interface {}(nil) ====> Params: len: 4 ====> []interface {}{(*ndb.Key)(0xc2003bdb90), "TC", (*app.QueryOpts)(nil), []*app.QueryFilter(nil)} ====> Params: len: 1 ====> []interface {}{[]uint8{0x20, 0x6, 0x0, 0x0, 0x0, 0x1, 0x8e, 0x2}} ====> Params: len: 2 ====> []interface {}{1, []uint8{0x20, 0x6, 0x0, 0x0, 0x0, 0x1, 0x8e, 0x2}} ====> Params: len: 2 ====> []interface {}{8, []uint8{0x6, 0x0, 0x8}} ====> Params: len: 2 ====> []interface {}{0, []uint8{0x6, 0x0, 0x0}} ====> Params: len: 1 ====> []interface {}{interface {}(nil)} <snip> ====> []interface {}{"/p/core/topic/1688849860329472"} ====> Params: len: 1 ====> []interface {}{"topic"} ====> Params: len: 4 ====> []interface {}{"TP", "", 1688849860329472, interface {}(nil)} ====> Params: len: 4 ====> []interface {}{1688849860329472, 0x6, 0x1, 1688849860329472} ====> Params: len: 2 ====> []interface {}{(*ndb.Key)(0xc200585190), []uint8{0x20, 0x6, 0x0, 0x0, 0x0, 0x1, 0x8e, 0x2}} ====> Params: len: 2 ====> []interface {}{"TC", ""} ====> Params: len: 2 ====> []interface {}{"TC", ""} ====> Params: len: 0 ====> []interface {}(nil) ====> Params: len: 4 ====> []interface {}{(*ndb.Key)(0xc200585190), "TC", (*app.QueryOpts)(nil), []*app.QueryFilter(nil)} ====> Params: len: 1 ====> []interface {}{[]uint8{0x20, 0x6, 0x0, 0x0, 0x0, 0x1, 0x8e, 0x2}} ====> Params: len: 2 ====> []interface {}{1, []uint8{0x20, 0x6, 0x0, 0x0, 0x0, 0x1, 0x8e, 0x2}} ====> Params: len: 2 ====> []interface {}{8, []uint8{0x6, 0x0, 0x8}} ====> Params: len: 2 ====> []interface {}{0, []uint8{0x6, 0x0, 0x0}} ====> Params: len: 1 unexpected fault address 0x0 fatal error: fault [signal 0xb code=0x80 addr=0x0 pc=0x409682] goroutine 47300 [running]: [fp=0x7f6e2f967328] runtime.throw(0x9c5297) /opt/go-tip/src/pkg/runtime/panic.c:473 +0x67 [fp=0x7f6e2f967340] runtime.sigpanic() /opt/go-tip/src/pkg/runtime/os_linux.c:239 +0xe7 [fp=0x7f6e2f9673d8] itab(0x6fe1a0, 0xdeaddeaddeaddead, 0x1) /opt/go-tip/src/pkg/runtime/iface.c:46 +0x62 [fp=0x7f6e2f9673f8] runtime.assertE2I2(0x6fe1a0, 0xdeaddeaddeaddead, 0xdeaddeaddeaddead, 0x0, 0x0, ...) /opt/go-tip/src/pkg/runtime/iface.c:507 +0x5c [fp=0x7f6e2f967480] fmt.(*pp).handleMethods(0xc2003570d0, 0x10000000076, 0x1, 0xdeaddeaddead0000) /opt/go-tip/src/pkg/fmt/print.go:673 +0x69 [fp=0x7f6e2f9674f8] fmt.(*pp).printValue(0xc2003570d0, 0x69f860, 0xc2003bd030, 0x146, 0x10000000076, ...) /opt/go-tip/src/pkg/fmt/print.go:847 +0x26e [fp=0x7f6e2f967f80] fmt.(*pp).printReflectValue(0xc2003570d0, 0x692e40, 0xc200587040, 0x172, 0x10000000076, ...) /opt/go-tip/src/pkg/fmt/print.go:985 +0xfa4 ----- stack segment boundary ----- [fp=0x7f6e1c0262f0] fmt.(*pp).printField(0xc2003570d0, 0x692e40, 0xc200587040, 0x10000000076, 0x0, ...) /opt/go-tip/src/pkg/fmt/print.go:813 +0x3d1 [fp=0x7f6e1c026670] fmt.(*pp).doPrintf(0xc2003570d0, 0x7af6d0, 0xa, 0x7f6e1c0267b0, 0x1, ...) /opt/go-tip/src/pkg/fmt/print.go:1111 +0x1276 [fp=0x7f6e1c0266c0] fmt.Fprintf(0xc2000f44b0, 0xc200000008, 0x7af6d0, 0xa, 0x7f6e1c0267b0, ...) /opt/go-tip/src/pkg/fmt/print.go:214 +0x80 [fp=0x7f6e1c026718] fmt.Printf(0x7af6d0, 0xa, 0x7f6e1c0267b0, 0x1, 0x1, ...) /opt/go-tip/src/pkg/fmt/print.go:223 +0x8c [fp=0x7f6e1c0269d8] ugorji.net/logging.logR(0xc2001cee02, 0x76c340, 0xc200358d20, 0xc2001c03e9, 0x80ef70, ...) /home/ugorji/depot/repo/src/ugorji.net/logging/logging.go:277 +0x79f [fp=0x7f6e1c026a38] ugorji.net/logging.Trace(0x76c340, 0xc200358d20, 0x80ef70, 0x2f, 0xc2003bd030, ...) /home/ugorji/depot/repo/src/ugorji.net/logging/logging.go:303 +0x83 [fp=0x7f6e1c026e70] ugorji.net/ndb.(*Db).SvrQuery(0xc2001aa240, 0x76c340, 0xc200358d20, 0xc200585190, 0x7a0a00, ...) /home/ugorji/depot/repo/src/ugorji.net/ndb/client.go:423 +0x922 [fp=0x7f6e1c026f88] ugorji.net/ndb.func·001(0x0, 0x0, 0x0, 0x0, 0x0, ...) /home/ugorji/depot/repo/src/ugorji.net/ndb/low_level_driver.go:195 +0x16c ----- stack segment boundary ----- [fp=0x7f6e2fa73420] ugorji.net/db.QuerySupport(0xc2001e8640, 0xc200358d20, 0xc200298be0, 0x19, 0x7a0a00, ...) /home/ugorji/depot/repo/src/ugorji.net/db/datastore.go:760 +0xac0 [fp=0x7f6e2fa73580] ugorji.net/ndb.LowLevelDriver.Query(0xc2001ca080, 0xc2001a8400, 0xc2001aa240, 0xc2001c39a0, 0xc2001c39a0, ...) /home/ugorji/depot/repo/src/ugorji.net/ndb/low_level_driver.go:208 +0x434 [fp=0x7f6e2fa73638] ugorji.net/ndb.(*LowLevelDriver).Query(0xc2001c9420, 0xc2001e8640, 0xc200358d20, 0xc20023b240, 0xc200585190, ...) /home/ugorji/depot/repo/src/ugorji.net/ndb/blob.go:0 +0x173 [fp=0x7f6e2fa736d0] ugorji.net/baseapp.(*tlld).Query(0xc200124e20, 0xc2001e8640, 0xc200358d20, 0xc20023b240, 0xc200585190, ...) /home/ugorji/depot/repo/src/ugorji.net/baseapp/baseapp.go:0 +0x117 [fp=0x7f6e2fa738f8] ugorji.net/richapp.TopicView(0xc2001e8640, 0xc200358d20, 0xc2001e8680, 0xc200358d50, 0xc2003571a0, ...) /home/ugorji/depot/repo/src/ugorji.net/richapp/actions.go:106 +0x2a3 [fp=0x7f6e2fa73938] ugorji.net/web.HandlerFunc.HandleHttp(0x833f58, 0xc2001e8640, 0xc200358d20, 0xc2001e8680, 0xc200358d50, ...) /home/ugorji/depot/repo/src/ugorji.net/web/webrouter.go:132 +0x57 [fp=0x7f6e2fa739b0] ugorji.net/web.Dispatch(0xc2001e8640, 0xc200358d20, 0xc2001b42a0, 0xc2001e8680, 0xc200358d50, ...) /home/ugorji/depot/repo/src/ugorji.net/web/webrouter.go:290 +0x191 [fp=0x7f6e2fa73ae0] ugorji.net/baseapp.HTTPHandler.ServeHTTP(0xc2001a95a0, 0xc2001d2bb0, 0xc2002149c0, 0x0, 0xc2003c2840, ...) /home/ugorji/depot/repo/src/ugorji.net/baseapp/baseapp.go:338 +0x627 [fp=0x7f6e2fa73b20] ugorji.net/baseapp.(*HTTPHandler).ServeHTTP(0xc2002149e0, 0xc2003c2840, 0xc2004498c0, 0xc2003571a0) /home/ugorji/depot/repo/src/ugorji.net/baseapp/baseapp.go:0 +0xa8 [fp=0x7f6e2fa73da8] ugorji.net/blackannex/server.func·003(0xc2003c2840, 0xc2004498c0, 0xc2003571a0) /home/ugorji/depot/repo/src/ugorji.net/blackannex/server/local.go:156 +0xa6c [fp=0x7f6e2fa73dc8] net/http.HandlerFunc.ServeHTTP(0xc200206c90, 0xc2003c2840, 0xc2004498c0, 0xc2003571a0) /opt/go-tip/src/pkg/net/http/server.go:1149 +0x3e [fp=0x7f6e2fa73e10] ugorji.net/web.(*Server).ServeHTTP(0xc2001ee180, 0xc2003c2840, 0xc2004498c0, 0xc2003571a0) /home/ugorji/depot/repo/src/ugorji.net/web/server.go:88 +0x1a5 [fp=0x7f6e2fa73e50] net/http.serverHandler.ServeHTTP(0xc2001ce4b0, 0xc2003c2840, 0xc2004498c0, 0xc2003571a0) /opt/go-tip/src/pkg/net/http/server.go:1517 +0x16c [fp=0x7f6e2fa73fb0] net/http.(*conn).serve(0xc20035a240) /opt/go-tip/src/pkg/net/http/server.go:1096 +0x765 [fp=0x7f6e2fa73fb8] runtime.goexit() /opt/go-tip/src/pkg/runtime/proc.c:1223 created by net/http.(*Server).Serve /opt/go-tip/src/pkg/net/http/server.go:1564 +0x266 <snip> ##### gdb bt output from the core file #0 runtime.raise (sig=void) at /opt/go-tip/src/pkg/runtime/sys_linux_amd64.s:85 #1 0x000000000041aa1f in runtime.crash () #2 0x0000000000413bd7 in runtime.dopanic (unused=void) at /opt/go-tip/src/pkg/runtime/panic.c:434 #3 0x0000000000413d17 in runtime.throw (s=void) at /opt/go-tip/src/pkg/runtime/panic.c:472 #4 0x0000000000412cc7 in runtime.sigpanic () at /opt/go-tip/src/pkg/runtime/os_linux.c:239 #5 0x0000000000409682 in itab (inter=void, type=void, canfail=void) at /opt/go-tip/src/pkg/runtime/iface.c:43 #6 0x000000000040a3dc in runtime.assertE2I2 (inter=void, e=void, ret=void, ok=void) at /opt/go-tip/src/pkg/runtime/iface.c:507 #7 0x000000000042ea89 in fmt.(*pp).handleMethods (p=0xc2003570d0, verb=118, plus=false, goSyntax=true, depth=1, wasString=false, handled=false) at /opt/go-tip/src/pkg/fmt/print.go:673 #8 0x000000000042fcbe in fmt.(*pp).printValue (p=..., value=..., verb=118, plus=false, goSyntax=true, depth=1, wasString=70) at /opt/go-tip/src/pkg/fmt/print.go:847 #9 0x0000000000430d14 in fmt.(*pp).printReflectValue (p=..., value=..., verb=118, plus=false, goSyntax=true, depth=0, wasString=false) at /opt/go-tip/src/pkg/fmt/print.go:985 #10 0x00000000004232e0 in ?? () at /opt/go-tip/src/pkg/runtime/asm_amd64.s:278 #11 0x000000c2003570d0 in ?? () #12 0x0000000000692e40 in egcdata () #13 0x000000c200587040 in ?? () #14 0x0000000000000172 in ?? () #15 0x0000010000000076 in ?? () #16 0x0000000000000000 in ?? () (gdb) q Attachments:
|
Though it's hard to be sure, the symptoms strongly suggest some sort of memory corruption. Precise GC significantly exacerbates the effect of corruption. Doing things like running Sprintf in a loop are unlikely to make a difference. What we need to figure out is what is causing the corruption in the first place. Does your program use cgo at all? Does the problem happen when you set GOMAXPROCS = 1? It may help to set Debug to 2 near the top of pkg/runtime/mgc0.c. That will cause a lot of printing, but will also detect heap corruption earlier. It is likely that we will need to have a test case that we can reproduce on our own machines, however we get there. Labels changed: added priority-asap, removed priority-triage. Status changed to Accepted. |
My program doesn't use cgo. I just ran with GOMAXPROCS=1. It takes a much longer time to get the crash (after 76,000 requests are handled), and I see the crash when I send log output stderr (I think stderr introduces some more timing issues that help see the crash). I've attached the log once the crash occurred. The gdb out is shown inline below (note it is different). I'd also run w/ Debug=2 now and see if something jumps out / makes sense to me. If not, I can package the built app and send to you privately along with the dependencies so you can reproduce on your machines. What email address can I send it to? Attachments:
|
I just ran w/ Debug=2, GOMAXPROCS=8, no logging to stderr, and see the crash after 3851 requests. I've attached the stderr output which is just the output from Debug=2. I don't know how to interpret it. Also attached the gdb output. Attachments: |
Please patch in https://golang.org/cl/8724043 And run with it. |
With Dmitri's electric-fence patch, It made things slower. Saw a lot of stalls during run. Crash happened much faster, after 135 requests. Different crash point. (gdb) bt #0 runtime.raise (sig=void) at /opt/go-tip/src/pkg/runtime/sys_linux_amd64.s:85 #1 0x000000000041a4ff in runtime.crash () #2 0x00000000004136b7 in runtime.dopanic (unused=void) at /opt/go-tip/src/pkg/runtime/panic.c:434 #3 0x00000000004137f7 in runtime.throw (s=void) at /opt/go-tip/src/pkg/runtime/panic.c:472 #4 0x000000000040bacc in runtime.SysMap (v=void, n=void) at /opt/go-tip/src/pkg/runtime/mem_linux.c:121 #5 0x0000000000410861 in runtime.MHeap_MapBits (h=void) at /opt/go-tip/src/pkg/runtime/mgc0.c:2412 #6 0x000000000041e761 in runtime.MHeap_SysAlloc (h=void, n=void) at /opt/go-tip/src/pkg/runtime/malloc.goc:442 #7 0x0000000000422d10 in ?? () at /opt/go-tip/src/pkg/runtime/asm_amd64.s:278 #8 0x00007f44c942f000 in ?? () #9 0x0000000000100000 in ?? () #10 0x00007f44bd8cf100 in ?? () #11 0x00007f44bd8cffb8 in ?? () #12 0x00007f44bd8cf0f0 in ?? () #13 0x0000000000411035 in MHeap_Grow (h=void, npage=void) at /opt/go-tip/src/pkg/runtime/mheap.c:224 #14 0x0000000000000000 in ?? () |
That's fair. I realize that's the best course of action at this point. I'm not comfortable with it. However, I will share the source with the understanding that only a very few specific people working on this issue will see the source and their copies will be deleted once the issue is resolved. I will put the source up on google drive, and grant access to the google accounts of those folks. If that is ok, please tell me your email and I will follow up offline. |
Please apply https://golang.org/cl/8517044 and try with. It's dirty and triggers compiler warnings, it's fine. |
Ran with Dmitri's patch @ 8517044 stderr attached (contains some useful information, I think). gdb output: (gdb) bt #0 runtime.raise (sig=void) at /opt/go-tip/src/pkg/runtime/sys_linux_amd64.s:85 #1 0x000000000041b85f in runtime.crash () #2 0x00000000004149e7 in runtime.dopanic (unused=void) at /opt/go-tip/src/pkg/runtime/panic.c:434 #3 0x0000000000414b27 in runtime.throw (s=void) at /opt/go-tip/src/pkg/runtime/panic.c:472 #4 0x0000000000413ad7 in runtime.sigpanic () at /opt/go-tip/src/pkg/runtime/os_linux.c:239 #5 0x000000000040a907 in runtime.efacethash (e1=void, ret=void) at /opt/go-tip/src/pkg/runtime/iface.c:670 #6 0x000000000042ff72 in fmt.(*pp).printField (p=0xc2001d8820, field=..., verb=118, plus=false, goSyntax=false, depth=0, wasString=false) at /opt/go-tip/src/pkg/fmt/print.go:765 #7 0x00000000004357f6 in fmt.(*pp).doPrintf (p=..., format=..., a=...) at /opt/go-tip/src/pkg/fmt/print.go:1111 #8 0x000000000042ced5 in fmt.Sprintf (format=..., a=..., ~anon2=...) at /opt/go-tip/src/pkg/fmt/print.go:229 #9 0x0000000000527137 in ugorji.net/logging.logR (calldepth=2 '\002', ctx=..., level=101 'e', message=..., params=..., err=void) at /home/ugorji/depot/repo/src/ugorji.net/logging/logging.go:280 #10 0x0000000000527441 in ugorji.net/logging.Trace (ctx=..., message=..., params=..., ~anon3=...) at /home/ugorji/depot/repo/src/ugorji.net/logging/logging.go:305 #11 0x0000000000504a7a in ugorji.net/ndb.(*Db).SvrQuery (l=0xc2001ae240, ctxId=..., pkey=0xc2006fcf60, kind=..., opts=0x0, filters=..., qs=..., err=void) at /home/ugorji/depot/repo/src/ugorji.net/ndb/client.go:420 #12 0x000000000051529f in ugorji.net/ndb.func·001 (k3=..., cursor3=..., err3=...) at /home/ugorji/depot/repo/src/ugorji.net/ndb/low_level_driver.go:195 #13 0x0000000000424290 in ?? () at /opt/go-tip/src/pkg/runtime/asm_amd64.s:278 #14 0x0000000000000000 in ?? () Attachments:
|
I have verified fix by Dmitry: https://golang.org/cl/9372044 Thanks Dmitry. |
This issue was closed by revision c6293d2. Status changed to Fixed. |
adg
added a commit
that referenced
this issue
May 11, 2015
««« CL 9372044 / 1abed5873071 runtime: fix GC scanning of slices If a slice points to an array embedded in a struct, the whole struct can be incorrectly scanned as the slice buffer. Fixes #5443. R=cshapiro, iant, r, cshapiro, minux.ma CC=bradfitz, gobot, golang-dev https://golang.org/cl/9372044 »»» R=cshapiro, iant CC=golang-dev https://golang.org/cl/10027043
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: