Various crashes/segfaults on one host #730

Closed

marcan opened this issue Nov 7, 2017 · 36 comments

@marcan

marcan commented Nov 7, 2017

Host operating system: output of uname -a

Linux raider 4.13.7-rt-rt1 #1 SMP PREEMPT RT Mon Nov 6 00:37:13 JST 2017 x86_64 Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz GenuineIntel GNU/Linux

node_exporter version: output of node_exporter --version

Tried both the official binary release:

node_exporter, version 0.15.0 (branch: HEAD, revision: 6e2053c557f96efb63aef3691f15335a70baaffd)
  build user:       root@168089f37ad9
  build date:       20171006-11:33:58
  go version:       go1.9.1

And the same version, built from source via Gentoo package (from logs):

Starting node_exporter (version=0.15.0, branch=non-git, revision=6e2053c)
Build context (go=go1.9.1, user=portage@raider, date=20171105-15:39:31)

node_exporter command line flags

/usr/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/

node_exporter has been crashing on one host (my laptop) after running for hours (while being scraped by Prometheus running on another host). The failure messages vary, but seem to suggest some kind of memory corruption.

Crash 1 (self-built): https://mrcn.st/p/tMtz7sQF

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc41ffc7fff pc=0x41439e]

Crash 2 (self-built): https://mrcn.st/p/qmZw6trr

panic: runtime error: slice bounds out of range

Crash 3 (self-built): https://mrcn.st/p/qLYEaOg1

runtime: pointer 0xc4203e2fb0 to unallocated span idx=0x1f1 span.base()=0xc4203dc000 span.limit=0xc4203e6000 span.state=3
runtime: found in object at *(0xc420382a80+0x80)
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

Crash 4 (official release binary): https://mrcn.st/p/x4NGGxF7

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x76b998]

I realize this sounds like bad hardware, but this is my daily workstation and it's otherwise reasonably stable (as stable as one can expect a Gentoo ~arch box with a lot of desktop apps, graphics drivers involved, etc. to be, anyway). I don't have reason to suspect the hardware, and this machine gets plenty of stress testing (it's Gentoo, so lots of compiling). My initial guess is that a wild pointer somewhere is causing the breakage, which manifests itself in various ways. Any idea how to track this down?

@SuperQ
Member

SuperQ commented Nov 7, 2017

Yikes.

Given the completely random crash locations, I'm also leaning towards blaming hardware. Time for a memtest run. 😬

@marcan
Author

marcan commented Nov 7, 2017

So I went on a memtest expedition and, indeed, found one (1) bad bit in my RAM. I intend to just mask it out in software, but for now I took out the half of the RAM that contained the culprit bit (and two of the three weak bits I also found, but those are fine at normal temperature). I also booted a regular (non-rt) kernel to take out that variable too.
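
(For the "mask it out in software" part, the plan is the kernel's memmap= boot parameter, along these lines, with the placeholder address replaced by the physical address of the bad page; depending on the bootloader the $ may need escaping:)

memmap=4K$0x12345000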

node_exporter still promptly crashed, though, in yet another way. Sorry, but I still think it's most likely a software issue (or a really nasty very-specific-software-triggers-very-specific-hardware-issue problem, think CPU bug); node_exporter isn't special enough to just randomly get treated to the single bad bit I had in 32GB of RAM every time it started previously ;-)

@squeed

squeed commented Nov 7, 2017

Eesh, a crash in the golang GC? Time to figure out which joker hid a gamma radiation source in your data center.

@squeed

squeed commented Nov 7, 2017

Sorry, that sounded off-putting. However, the stacktrace you've posted still points at hardware issues.

@marcan
Author

marcan commented Nov 7, 2017

I know, it smells like a hardware problem, but node_exporter is the only software with this kind of issue in an otherwise rather active workstation with quite a heterogeneous workload, which suggests otherwise.

Also, I actually own a Geiger counter, and I'm reading 0.12 µSv/h, slightly elevated for my location but entirely within normal background range. So that's out too :-)

@SuperQ
Member

SuperQ commented Nov 7, 2017

The node_exporter reads a lot of data from /proc; one of the stacks you posted showed garbage output from your kernel, which could be related to why it's crashing. But almost all of that is filtered by things like Go's strconv.ParseInt().
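
Roughly speaking, a value read from /proc goes through something like this (an illustration only, not the actual collector code):

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Illustration only: a /proc value is read as text and run through
	// strconv.ParseInt, so garbage from the kernel shows up as a parse
	// error rather than being used blindly.
	data, err := ioutil.ReadFile("/proc/sys/kernel/pid_max")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	v, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "unparsable value:", err)
		os.Exit(1)
	}
	fmt.Println("pid_max =", v)
}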

@squeed

squeed commented Nov 7, 2017

Hmm. One random thought: I've heard random rumors of problems with golang in ebuilds before, but nothing definite. Can you try building the binary manually with go build? Even better would be on another, non-Gentoo machine.

@marcan
Author

marcan commented Nov 7, 2017

I tried the binary from this repo too (it's the one used for the last two crashes). I could try a manual build too, if you think that would help any?

@squeed

squeed commented Nov 7, 2017

Ah, never mind then. I'm out of ideas.

@marcan
Author

marcan commented Nov 7, 2017

Yet another completely distinct crash. This is starting to get amusing.

@marcan
Author

marcan commented Nov 7, 2017

I'm attaching strace to see if I can catch anything "interesting" in what it was doing next time it crashes.
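
Concretely, something along these lines (the flags are just the usual suspects, nothing special):

strace -f -tt -o /tmp/node_exporter.strace -p "$(pidof node_exporter)"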

@marcan
Author

marcan commented Nov 7, 2017

Also, do you think it might help if I try to bisect the collector list and see if I can narrow it down to a specific one?

@SuperQ
Member

SuperQ commented Nov 7, 2017

Yes, I was trying to see if I could spot a specific collector in any of the traces, but it all seems to be crashes in the prometheus client library, which could point to the actual problem being there, not the node_exporter part.

@SuperQ
Member

SuperQ commented Nov 7, 2017

Another way to narrow things down: it would be useful to test the previous 0.14.0 release binaries.

@marcan
Author

marcan commented Nov 7, 2017

OK, I brought up a bunch of parallel instances with a single collector each, plus one control, plus the straced system instance, all being scraped by prom. I'll leave them running overnight and see which ones die, then test an older release.
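
On the scrape side this is just extra static targets in the existing config, roughly (ports and job name here are made up):

scrape_configs:
  - job_name: 'node_bisect'
    static_configs:
      - targets:
          - 'raider:9100'   # control instance, default collectors
          - 'raider:9101'   # single-collector instance
          - 'raider:9102'   # straced system instance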

@marcan
Author

marcan commented Nov 7, 2017

Well, the control just died as I was about to get some sleep, so instead I brought up all of 0.{11,12,13,14,15}.0 and the just-released 0.15.1. None of the single-collector instances or the straced main one have died yet. We'll see what happens overnight.

@marcan
Author

marcan commented Nov 8, 2017

Interesting. The only version that survived is 0.11.0, all later ones died (0.12.0, 0.13.0, 0.14.0, 0.15.0, 0.15.1). The straced one is still alive, which suggests the nature of the problem might be a race condition or similar, if running it under strace masks it. As for the single-collector instances, only one died: the one running the edac collector segfaulted. Of course, the edac collector only showed up in 0.14.0, so that might be a red herring. Nonetheless, I'm now running an instance with only edac disabled to see if it helps.

I found these two golang issues which sound like they might be related. I also spawned an instance with GOMAXPROCS=1 to see if that also works around the problem.

Addendum: worth noting that my laptop has EDAC support compiled into the kernel, but no compatible hardware, so the globs in edac_linux.go don't match anything. Probably a red herring then.

@marcan
Author

marcan commented Nov 8, 2017

0.15.1 with edac disabled died too, so that's out. Nothing else has died yet, including a respawn of the edac-only instance. I'm strongly leaning towards a core race condition issue, which would be improved by less parallelism (fewer collectors enabled), as well as GOMAXPROCS=1 and strace.

@marcan
Author

marcan commented Nov 8, 2017

I just compiled and ran this reproducer for golang issue 20427 and it reliably crashes on my host within 10 seconds (go1.9.1 linux/amd64). I think I'm onto something here.
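
For context, the reproducer is essentially hammering os/exec from a bunch of goroutines; a minimal sketch of the same idea (not the exact program from the Go issue):

package main

import (
	"log"
	"os/exec"
	"runtime"
	"sync"
)

func main() {
	// Sketch only, not the program from golang/go#20427: fork/exec
	// continuously from several goroutines and wait for the runtime to crash.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if err := exec.Command("/bin/true").Run(); err != nil {
					log.Fatal(err)
				}
			}
		}()
	}
	wg.Wait()
}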

@SuperQ
Member

SuperQ commented Nov 8, 2017

The reproducer uses os/exec, which is only used in the megacli collector.

Thanks for all the reproduction testing; it would be fun to find a Golang bug. :-)

@marcan
Author

marcan commented Nov 8, 2017

I'm not sure the problem is in os/exec, though. That might just be an easy way to reproduce it.

FWIW, if this is indeed a golang bug, this is the second time for me. I had some fun three years ago debugging and fixing a year-old crash in the runtime related to cgo. That one was "fun"...

@SuperQ
Member

SuperQ commented Nov 8, 2017

Of course. I just wanted to make sure it was clear that it's not something we use a lot in our code, and we actively want to remove the megacli collector because it forks processes.

@marcan
Author

marcan commented Nov 8, 2017

Pivoting to that Go reproducer and assuming it's the same root cause, I've managed to repro it in a VM on three different Intel hosts with different CPU generations, and on an AMD host (see that bug). It seems related to the kernel (but I have three kernel builds that trigger it, so it isn't a one-off bad kernel compile).

This is getting fun.

@squeed

squeed commented Nov 8, 2017

Indeed it is. Is it only on realtime kernels?

@marcan
Author

marcan commented Nov 8, 2017

Nope, I've been testing mostly 4.13.9-gentoo now.

It seems the GCC version used to compile the kernel matters. I've repro'd with the same kernel version and config and patches built on two hosts with the same GCC/ld versions, but not with the same kernel/config/patches built on a third host with an older GCC/ld. Moving kernels around, the reproducibility follows the kernel, not the host I run it on.
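(For reference, the GCC a running kernel was built with is visible in /proc/version, so checking a given box is just:)

cat /proc/version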

I swear, if this winds up being a GCC bug subtly breaking the kernel subtly breaking Go subtly breaking node_exporter...

@SuperQ
Member

SuperQ commented Nov 15, 2017

Looks like the upstream Go issue, golang/go#20427, has been fixed. It will take some time to get this into a Go release, and then into a node_exporter release.

@marcan
Author

marcan commented Nov 17, 2017

I haven't confirmed that that patch indeed fixes this bug (it was a conjecture); I'll do so this weekend, though I expect it will.

@marcan
Author

marcan commented Nov 22, 2017

A bit late, but running the rebuilt node_exporter now. If it's fine in 24h I'll call it fixed.

@SuperQ
Member

SuperQ commented Nov 22, 2017

Nice, hopefully the patch makes it into a Golang release soon.

@marcan
Author

marcan commented Nov 22, 2017

Well... it died, but for a completely different reason (#738), after gigabytes of logs and pegged CPU usage. I just added --no-collector.textfile and am trying again.

@marcan
Author

marcan commented Nov 24, 2017

No crashes, looks good! I think this is fixed with that Go fix. Feel free to close this issue or leave it open until the fix trickles into a Go release and a subsequent node_exporter release.

@marcan
Author

marcan commented Dec 7, 2017

Ha, now hitting the crashes on some unrelated Gentoo infra... that just got updated to GCC 6.4.0 (now stable). Those kernels have CONFIG_OPTIMIZE_INLINING=y, but apparently still end up getting stack probes in vDSO due to whatever other combination of config options (I suspect kvmclock, since these are VMs and that definitely adds code to vDSO).

I think for the time being I'll just run with GOMAXPROCS=1 until this trickles down to a Go release.
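
(Concretely, that just means prefixing the usual invocation with the environment variable, e.g. GOMAXPROCS=1 /usr/bin/node_exporter plus whatever flags the service normally gets, or setting it in the service's environment.)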

@SuperQ
Member

SuperQ commented Jan 25, 2018

It looks like the upstream golang fix is in Go 1.9.3. The next release will be built with this version.

@SuperQ SuperQ added this to Done in 1.0 Mar 9, 2018
@dvusboy

dvusboy commented Apr 10, 2018

What a fascinating read!

@discordianfish
Member

@SuperQ I think this can be closed now, right?

@SuperQ
Member

SuperQ commented Apr 11, 2018

Yes, this is now fixed in the 0.16 releases, as they're built with Go 1.10.x.
