New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "runtime·unlock: lock count" on Go 1.10.1 when cgo is enabled #25128

Open
SuperQ opened this Issue Apr 27, 2018 · 25 comments

Comments

Projects
None yet
@SuperQ

SuperQ commented Apr 27, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.10.1

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

amd64

What did you do?

We have an application that makes use of CGO to collect kernel metrics. When we compile with Go 1.9.5 or with 1.10.1 and CGO_ENABLED=0, we see no crashes. With CGO enabled an Go 1.10.1 we see crashes.

Our test setup is documented here.

What did you expect to see?

No crashes

What did you see instead?

fatal error: runtime·unlock: lock count
@SuperQ

This comment has been minimized.

SuperQ commented Apr 27, 2018

@bcmills bcmills changed the title from CGO fatal error: runtime·unlock: lock count to runtime: "runtime·unlock: lock count" on Go 1.10.1 when cgo is enabled Apr 27, 2018

@bcmills

This comment has been minimized.

Member

bcmills commented Apr 27, 2018

runtime·unlock: lock count is a throw. Where is the rest of the stack trace?

Beyond that, this sounds like memory corruption. You should try to eliminate that as a cause first.

  • Are your C dependencies installing signal handlers? If so, are they setting SA_ONSTACK as required by the Go runtime?
  • Are you using finalizers in your cgo wrapper? If so, you may be missing some needed calls to runtime.KeepAlive. (I seem to recall that the garbage collector got a little more aggressive in 1.10.)
  • Does your program report any races when compiled with -race?
  • Does your program report any failures when you compile using -fsanitize=memory (for the C compiler) and -msan (for the Go compiler)?
  • Can you construct a smaller reproducer?

@bcmills bcmills added this to the Go1.10.2 milestone Apr 27, 2018

@andybons andybons modified the milestones: Go1.10.2, Go1.10.3 Apr 27, 2018

@daenney

This comment has been minimized.

daenney commented Apr 28, 2018

@bcmills Unforutunately the rest of the stack trace doesn't seem to get logged. I have no idea why. I ran tests with debug level logging enabled and eventually all I get is that throw (sometimes just a few times, sometimes hundreds of times that same line gets emitted). Aside from the normal logging done by the node exporter itself, part of the stack trace seems to just go missing. I don't know if that's a consequence of running this under systemd but I'm having a hard time seeing how that would cause part of the stack trace to not get captured.

Given that the same code compiles and runs fine (i.e doesn't crash) on Go 1.9.5 but does crash on 1.10.0, 1.10.1 I'm wondering what changes have been made to cgo between those releases that could point us in the right direction? I also confirmed that building under Go 1.10.0, 1.10.1 with CGO_ENABLED=0 results in non-crashing builds.

@daenney

This comment has been minimized.

daenney commented Apr 29, 2018

  • I did a go build -race for the node exporter, no errors reported.

  • I tried a go build -msan and simply got this:

    # runtime/cgo
    gcc: error: unrecognized argument to -fsanitize= option: 'memory'
    
    $ gcc --version
    gcc (GCC) 7.3.1 20180406
    

    I tried this with both Go 1.9.5 and 1.10.1, same story/error. Afaik fsanitize=memory is only valid for clang and has no equivalent to gcc. According to go env it correctly detected GCC: CC="gcc". I'm guessing the Go tooling gets confused by the fact that /usr/bin/cc is symlinked to /usr/bin/gcc on Arch and decides to call that and assume it's clang essentially passing in the wrong flags? Or it simply defaults to caling cc under the assumption that's clang when -msan is passed? It shouldn't though given it correctly detected CC as gcc and CXX as g++. I would expect it to raise an error if a flag is passed that's invalid for the detected CC and CXX.

  • env CC=clang env CXX=clang++ go build -msan resulted in no errors

  • Wrt constructing a smaller reproducer, I honestly have no idea where to start. Don't even know if it's in the node exporter itself or one of its dependencies that use a CGO thing. Any pointers on how one could go about tracking this down would be appreciated

  • Given it's related to CGO I did a quick search for "C" in the code base, the only possibly relevant ones I can find in node exporter itself are:

    • collector/timex.go
    • collector/loadavg_unix.go though I suspect that doesn't get used in my builds since there also is a collector/loadavg_linux.go. I'm unsure as to what happens in this case, does _linux take precedence over _unix or is this a game of chance?

    I can't find anything immediately obviously wrong with those collectors. All other collectors leveraging C wouldn't/shouldn't have gotten compiled since I did the builds on a Linux system:

    $ ag '"C"'
    
    timex.go
    20:import "C"
    
    cpu_openbsd.go
    30:import "C"
    
    filesystem_bsd.go
    32:import "C"
    
    loadavg_unix.go
    24:import "C"
    
    loadavg_solaris.go
    29:import "C"
    
    netdev_openbsd.go
    32:import "C"
    
    sysctl_bsd.go
    29:import "C"
    
    cpu_dragonfly.go
    82:import "C"
    
    meminfo_openbsd.go
    41:import "C"
    
    meminfo_darwin.go
    19:import "C"
    
    cpu_darwin.go
    46:import "C"
    
    netdev_darwin.go
    33:import "C"
    
    devstat_freebsd.go
    29:import "C"
    
    netdev_bsd.go
    35:import "C"
    
    interrupts_openbsd.go
    95:import "C"
    
    kvm_bsd.go
    26:import "C"
    
    devstat_dragonfly.go
    89:import "C"
    

    I did a similar search in the vendor/ directory. None of them would appear to apply to builds done on a Linux system:

    golang.org/x/sys/unix/mkerrors.sh
    331:	echo 'import "C"'
    
    github.com/lufia/iostat/iostat_darwin.go
    9:import "C"
    
    github.com/godbus/dbus/transport_unixcred_dragonfly.go
    13:import "C"
    
    github.com/godbus/dbus/transport_unixcred_freebsd.go
    15:import "C"
    
@bcmills

This comment has been minimized.

Member

bcmills commented May 8, 2018

Given it's related to CGO I did a quick search for "C" in the code base,

Maybe try searching for runtime.SetFinalizer? Given that the behavior changed between 1.9 and 1.10, a bad finalizer seems like one of the more likely causes. (I seem to recall that we had to fix several in the Google codebase during the internal 1.10 rollout.)

@daenney

This comment has been minimized.

daenney commented May 14, 2018

Thanks. The only call I could find was in logrus:

vendor/github.com/sirupsen/logrus/writer.go
44:	runtime.SetFinalizer(writer, writerFinalizer)

The code is here: https://github.com/sirupsen/logrus/blob/v1.0.4/writer.go#L21-L47. I have no idea if this looks sane or not. I don't see any import statements for C in logrus. The only thing mentioning runtime in Logrus is writer.go, in the import statements and that one call.

@daenney

This comment has been minimized.

daenney commented May 14, 2018

Can anyone theorize as to what might be causing the stack trace to just not show up? It's not really helping the debugging efforts and I'm wondering if finding the reason the stack trace doesn't show up might be able to point us in the direction of the bug.

For example: prometheus/node_exporter#870 (comment)

Well, that didn't help.. the notimex variation crashed.

fatal error: runtime·unlock: lock count
fatal error: fatal: morestack on g0

And again, no other useful stack log.
@austin987

This comment has been minimized.

austin987 commented May 23, 2018

FWIW, two users (myself and one more) have been able to get stacktraces. I've seen crashes a couple dozen times, not sure why this time was special.

My stack trace is here:
https://github.com/prometheus/node_exporter/files/2032750/node_exporter.log

The other is here:
prometheus/node_exporter#870 (comment)

@gopherbot

This comment has been minimized.

gopherbot commented May 27, 2018

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this May 27, 2018

@daenney

This comment has been minimized.

daenney commented May 27, 2018

@gopherbot Hey now, that's rude. We've provided some more info and asked for further help on how to possibly track it down. Waiting on Go project members here.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented May 27, 2018

Reopening because we did get a stack trace.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented May 27, 2018

I don't see anything useful in the stack trace in a quick look.

How reliable is the crash? Does it fail every time or just occasionally?

Is there any way we can get a copy of the program to recreate the crash ourselves?

The cause is most likely memory corruption from the C code, or invalid cgo pointer passing (see https://golang.org/cmd/cgo/#hdr-Passing_pointers), but it's hard to be sure.

@SuperQ

This comment has been minimized.

SuperQ commented May 27, 2018

I can typically crash the node_exporter in about 30 minutes with 3 requests per second, but none of my crashes have provided a stack trace. I will try and do some feature bisecting this week to see if I can pin it down to a part of the code that causes it.

@austin987

This comment has been minimized.

austin987 commented May 28, 2018

node_exporter-0.16.0-rc.2-go-1.10.1.tar.gz

@ianlancetaylor This binary crashes for me pretty reliably, though often taking a few days.

Git repo: https://github.com/prometheus/node_exporter
Tarball for this binary: https://github.com/prometheus/node_exporter/archive/v0.16.0-rc.2.tar.gz

Option used to run it:
command_args="--collector.textfile.directory=/var/lib/node_exporter/ --collector.bonding --collector.buddyinfo --collector.ntp"

@gopherbot

This comment has been minimized.

gopherbot commented Jun 21, 2018

Change https://golang.org/cl/120416 mentions this issue: runtime: avoid recursive panic on bad lock count

@aclements

This comment has been minimized.

Member

aclements commented Jun 21, 2018

Unfortunately, in https://github.com/prometheus/node_exporter/files/2032750/node_exporter.log, the "fatal error: schedule: holding locks" happens when a goroutine is in the final throes of exiting (runtime.goexit0). The goroutine has already been marked dead, so we don't see it in the stack trace (though it probably wouldn't have any stack anyway, unless it called runtime.Goexit).

prometheus/node_exporter#870 (comment) looks like it involves the recursive panic the CL I just sent should help with. But there's also a clear recursive malloc failure in there. That one definitely seems like memory corruption, since from the stack trace it's clear that there was no recursive malloc.

If it's possible to bisect this between Go 1.9 and Go 1.10, that could be very useful.

gopherbot pushed a commit that referenced this issue Jun 22, 2018

runtime: avoid recursive panic on bad lock count
Currently, if lock or unlock calls throw because the g.m.lock count is
corrupted, we're unlikely to get a stack trace because startpanic_m
will itself attempt to acquire a lock, causing a recursive failure.

Avoid this by forcing the g.m.locks count to a sane value if it's
currently bad.

This might be enough to get a stack trace from #25128.

Change-Id: I52d7bd4717ffae94a821f4249585f3eb6cd5aa41
Reviewed-on: https://go-review.googlesource.com/120416
Reviewed-by: Ian Lance Taylor <iant@golang.org>

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 6, 2018

@hoffie

This comment has been minimized.

hoffie commented Jul 31, 2018

If it's possible to bisect this between Go 1.9 and Go 1.10, that could be very useful.

I am in the process of doing this. It seems like it has to do with runtime.LockOSThread (as @mdlayher pointed out).
My testing is focused on the difference between Go 1.9 and Go1.10.
Others (@chrj) still observe this problem on go1.11beta2 as well.

I have provided some more (rather node_exporter-specific) analysis in prometheus/node_exporter#870

Also, this issue looks very similar to #22227 (except for the platform).

We currently do not have a "simple" reproducer as in "some tens of lines of code". However, this crashes pretty reliably for me (after ~1-10 minutes):
GOMAXPROCS=2 ./node_exporter --no-collector.arp --no-collector.bcache --no-collector.bonding --no-collector.conntrack --no-collector.cpu --no-collector.diskstats --no-collector.edac --no-collector.entropy --no-collector.filefd --no-collector.filesystem --no-collector.hwmon --no-collector.infiniband --no-collector.ipvs --no-collector.loadavg --no-collector.mdadm --no-collector.meminfo --no-collector.netdev --no-collector.netstat --no-collector.nfs --no-collector.nfsd --no-collector.sockstat --no-collector.stat --no-collector.textfile --no-collector.time --no-collector.timex --no-collector.uname --no-collector.vmstat --no-collector.netclass --no-collector.xfs --no-collector.zfs

The error messages vary between:

  • fatal error: runtime·unlock: lock count
  • fatal error: systemstack called from unexpected goroutine
  • panic during panic
  • fatal: morestack on g0
    Sometimes its one line, sometimes its thousands lines of repeated error messages. Sometimes with stack traces, sometimes with cleanly formatted stack traces, sometimes with broken stack traces (interleaved with fatal error messages).

Will update when I've traced down the actual Go commit which makes the difference here.

@hoffie

This comment has been minimized.

hoffie commented Jul 31, 2018

One of the nicest stack traces so far (with yet another error message: fatal error: fatal error: stopm holding locks): https://gist.github.com/hoffie/ae0fab8aeac60d6283233eeccfe93823
And another one: https://gist.github.com/hoffie/3f55ea14cf977c86d086ac2c9830d55c

The bisect ended by pointing to 4f34a52 as the first bad commit. Will verify in the coming days (Go at the mentioned commit definitely contains the problematic change [it crashes]; to verify it is indeed the causing change, I'll have to run some long-term test on the previous commit to make sure).

BTW, GOMAXPROC=1 seems to avoid this issue.

@marcan

This comment has been minimized.

Contributor

marcan commented Aug 3, 2018

Keep in mind that just because bisect points at a commit does not mean the commit is bad. It could be a perfectly correct commit that is just exposing an issue that already existed. I could certainly see how thrashing OS threads more often by forcing termination when a locked goroutine exits could expose other issues (e.g. memory corruption) elsewhere.

@mdlayher

This comment has been minimized.

Member

mdlayher commented Aug 3, 2018

I'm no expert in the area, but that's certainly an interesting thought. node_exporter does use a fair amount of Cgo, so perhaps it's the combination of a potential bug in our Cgo code and the fact that my netlink package is killing off OS threads.

I did try to set up a minimal reproduction with my (Go-only) code and wasn't able to reproduce the issue, even though the *netlink.Conn internally is killing its OS thread on close: prometheus/node_exporter#870 (comment).

@squeed

This comment has been minimized.

squeed commented Aug 3, 2018

Interesting; @mdlayher managed to stop the bug by adding a runtime.UnlockOSThread(). Before that, the process was just letting the g die. The bisected change is to let the associated m die when a locked g dies.

If I had to guess, the call sequence is

  • g dies
  • gogo(&_g_.m.g0.sched), which should be mstart
  • mstart
  • mstart1
  • stopm
  • throw("stopm holding locks")

So, how the heck is mstart() being called with locks set?

@marcan

This comment has been minimized.

Contributor

marcan commented Aug 3, 2018

FWIW I'm having rather hit and miss results reproducing - I had it crash in 2 seconds once, but other times it's taking 30 minutes to hours. I'm using vegeta at ~200qps. It doesn't seem like the crashes scale linearly with load. I'm switching to ApacheBench now (which unlike vegeta can max out performance at ~1000 requests/sec without triggering runaway thread creation) but I'm not sure it'll make much of a difference.

It would be helpful to improve reproducibility, otherwise tracking this down is going to be a huge pain. Anyone have any hints?

@marcan

This comment has been minimized.

Contributor

marcan commented Aug 4, 2018

Interestingly, I left it all night running with ApacheBench (at ~800 qps) and it didn't crash. So either there's something different about how ApacheBench does things, or actively having requests always queued prevents the issue from reproducing.

@mdlayher

This comment has been minimized.

Member

mdlayher commented Aug 6, 2018

I was able to cause a couple of interesting crashes this morning. While I can't fully rule out memory corruption at this point, the race detector-instrumented program crashed before pointing out any data races. I'm not using very much unsafe in these code paths in the package but I'll double check them.

matt@nerr-2:~$ go version
go version go1.10.3 linux/amd64
matt@nerr-2:~$ uname -a
Linux nerr-2 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Program: https://github.com/mdlayher/netlink/blob/master/cmd/nlstress/main.go

go build + GOTRACEBACK=crash ./nlstress: https://gist.github.com/mdlayher/a46edb0cb062a961431ebaad741b5b24
go build -race + GOTRACEBACK=crash ./nlstress: https://gist.github.com/mdlayher/2079ac2ed8201f183c8de043c8882e14

I'll try again with the following patch applied to see if it makes a difference if I unlock the OS thread when the syscall-handling goroutine exits.

✔ ~/src/github.com/mdlayher/netlink [master|…1] 
12:16 $ git diff --patch
diff --git a/conn_linux.go b/conn_linux.go
index aeaa38a..5639fb5 100644
--- a/conn_linux.go
+++ b/conn_linux.go
@@ -325,6 +325,7 @@ func newSysSocket(lockThread bool) *sysSocket {
                        // the goroutine exits starting in Go 1.10:
                        // https://go-review.googlesource.com/c/go/+/46038.
                        runtime.LockOSThread()
+                       defer runtime.UnlockOSThread()
                }
 
                defer wg.Done()

EDIT: I've been running the non-race test for two hours now with the above patch applied and have not seen any crashes. I'll try with -race.

EDIT 2: No crash for over two hours with -race either.

EDIT 3: I've applied the above patch to master in mdlayher/netlink@3d8cc9a. If you'd like to try reproducing this issue as above, just undo this patch.

@squeed

This comment has been minimized.

squeed commented Aug 12, 2018

From my perspective, the question is whether or not the bug is purely runtime, or some interaction between the program and the runtime.

In other words, if you create, lock, and kill a bunch of Ms without any cgo, unsafe calls, or syscalls, will we still see this crash?

So, I'm running some tests as well. I suspect at least one other factor will be required, and this is not a pure runtime bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment