panic: runtime error on ARMv7/Raspbian #2526

Closed
simonszu opened this Issue Mar 26, 2017 · 4 comments


simonszu commented Mar 26, 2017

What did you do?

I installed Prometheus on a Raspberry Pi 3 running Raspbian, using the version from last November. I recently upgraded to Prometheus 1.5.2 and have had issues since. After some time, the whole system stops responding to external input: SSH fails with "Connection closed by UNKNOWN on port 65535", and key presses on a connected keyboard get no response. The Pi still responds to pings, however. A reboot is the only way to recover, and only temporarily. These issues don't occur when Prometheus isn't running. So I opened an SSH session, started Prometheus, and waited. After some time, the system stopped accepting new SSH connections, but the running session remained responsive.

What did you expect to see?

Prometheus running fine, without crashes.

What did you see instead? Under which circumstances?

In this unresponsive state, systemd reports that Prometheus has indeed crashed. The output of journalctl is as follows:

Mär 23 20:14:38 tirn systemd[1]: Started Prometheus Monitoring Daemon.
-- Subject: Unit prometheus.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus.service has finished starting up.
--
-- The start-up result is done.
Mär 23 20:14:39 tirn prometheus[13462]: time="2017-03-23T20:14:39+01:00" level=info msg="Starting prometheus (version=1.5.2, branch=master, revision=bd1182d29f462c39544f94cc822830e1c64cf55b)" source="main.go:75"
Mär 23 20:14:39 tirn prometheus[13462]: time="2017-03-23T20:14:39+01:00" level=info msg="Build context (go=go1.7.5, user=root@a8af9200f95d, date=20170210-15:07:37)" source="main.go:76"
Mär 23 20:14:39 tirn prometheus[13462]: time="2017-03-23T20:14:39+01:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:248"
Mär 23 20:14:39 tirn prometheus[13462]: panic: runtime error: invalid memory address or nil pointer dereference [recovered]
Mär 23 20:14:39 tirn prometheus[13462]: panic: runtime error: invalid memory address or nil pointer dereference
Mär 23 20:14:39 tirn prometheus[13462]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x1c pc=0x953e4]
Mär 23 20:14:39 tirn prometheus[13462]: goroutine 1 [running]:
Mär 23 20:14:39 tirn prometheus[13462]: panic(0x14c9b80, 0x12320008)
Mär 23 20:14:39 tirn prometheus[13462]: /usr/local/go/src/runtime/panic.go:500 +0x33c
Mär 23 20:14:39 tirn prometheus[13462]: fmt.errorHandler(0x12c6519c)
Mär 23 20:14:39 tirn prometheus[13462]: /usr/local/go/src/fmt/scan.go:1039 +0x17c
Mär 23 20:14:39 tirn prometheus[13462]: panic(0x14c9b80, 0x12320008)
Mär 23 20:14:39 tirn prometheus[13462]: /usr/local/go/src/runtime/panic.go:458 +0x454
Mär 23 20:14:39 tirn prometheus[13462]: fmt.newScanState(0x13bf120, 0x164f577, 0xa, 0x99bec, 0x127521e0, 0x64, 0x40, 0x12c650b0)
Mär 23 20:14:39 tirn prometheus[13462]: /usr/local/go/src/fmt/scan.go:395 +0x17c
Mär 23 20:14:39 tirn prometheus[13462]: fmt.(*ss).scanOne(0x99ab0, 0x127521e0, 0x99a00, 0x94324)
Mär 23 20:14:39 tirn prometheus[13462]: /usr/local/go/src/fmt/scan.go:959 +0xc7c
Mär 23 20:14:39 tirn systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mär 23 20:14:39 tirn systemd[1]: Unit prometheus.service entered failed state.
Mär 23 20:14:59 tirn systemd[1]: prometheus.service holdoff time over, scheduling restart.
Mär 23 20:14:59 tirn systemd[1]: Stopping Prometheus Monitoring Daemon...
-- Subject: Unit prometheus.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus.service has begun shutting down.

I checked dmesg to see whether this is a hardware-related issue:

[66960.475813] INFO: task kworker/u8:0:11186 blocked for more than 120 seconds.
[66960.475833]       Not tainted 4.4.50-v7+ #970
[66960.475842] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[66960.475852] kworker/u8:0    D 805b8364     0 11186      2 0x00000000
[66960.475887] Workqueue: kmmcd mmc_rescan
[66960.475922] [<805b8364>] (__schedule) from [<805b88dc>] (schedule+0x50/0xa8)
[66960.475944] [<805b88dc>] (schedule) from [<8046de40>] (__mmc_claim_host+0xb8/0x1cc)
[66960.475967] [<8046de40>] (__mmc_claim_host) from [<8046df84>] (mmc_get_card+0x30/0x34)
[66960.475988] [<8046df84>] (mmc_get_card) from [<80476058>] (mmc_sd_detect+0x2c/0x80)
[66960.476009] [<80476058>] (mmc_sd_detect) from [<80470594>] (mmc_rescan+0xc8/0x324)
[66960.476033] [<80470594>] (mmc_rescan) from [<8003c930>] (process_one_work+0x154/0x458)
[66960.476056] [<8003c930>] (process_one_work) from [<8003cc88>] (worker_thread+0x54/0x500)
[66960.476076] [<8003cc88>] (worker_thread) from [<80042954>] (kthread+0xec/0x104)
[66960.476097] [<80042954>] (kthread) from [<8000fbe8>] (ret_from_fork+0x14/0x2c)
[76078.822502] Alignment trap: not handling instruction ed847a01 at [<00058194>]
[76078.822524] Unhandled fault: alignment exception (0x801) at 0x000dc922
[76078.828528] pgd = b83d8000
[76078.834437] [000dc922] *pgd=3811a831, *pte=342f175f, *ppte=342f1c7f

So I suppose that Prometheus 1.5.2 somehow doesn't like my SD card. The SD card could be faulty, but the older Prometheus version (around 1.2.1) worked just fine on the same card, and the crashes started right after I upgraded Prometheus, which made me suspicious. How likely is it that the SD card breaks at exactly the moment I upgrade Prometheus?

Side note: On the same host, there is also the corresponding alertmanager, a blackbox exporter and a node exporter installed. All these daemons are running fine.

Environment

  • System information:
$ uname -srm
Linux 4.4.50-v7+ armv7l
  • Prometheus version:

prometheus, version 1.5.2 (branch: master, revision: bd1182d)
build user: root@a8af9200f95d
build date: 20170210-15:07:37
go version: go1.7.5

beorn7 (Member) commented Mar 26, 2017

Unfortunately, the call stack in the systemd logs is truncated, so I cannot see where exactly the panic happened. The visible bottom of the call stack suggests it's happening in some fmt.Scan activity, which we don't even use in the Prometheus code itself. After a short search of the code base, the only uses I could find are in LevelDB and in the ProcFS collector. The former would suggest corrupted data on disk that LevelDB cannot handle gracefully. The latter could have to do with an unexpected layout of the proc filesystem on the ARM platform.
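To make the fmt.Scan hypothesis more concrete, here is a minimal sketch of how a proc-style line can be parsed with fmt.Sscanf so that an unexpected layout surfaces as an ordinary error. The type `loadAvg` and the helper `parseLoadAvg` are hypothetical names chosen for illustration; this is not code from Prometheus, LevelDB, or the ProcFS collector. As far as the standard library source shows, fmt signals ordinary scan failures by panicking internally and recovering in fmt.errorHandler, which re-panics for anything that is not a scan error. That would be consistent with the "[recovered]" line in the trace above, where the underlying fault is a nil pointer dereference rather than a parse failure.

```go
package main

import "fmt"

// loadAvg holds the three load averages found at the start of a
// /proc/loadavg-style line. Hypothetical type, for illustration only.
type loadAvg struct {
	Load1, Load5, Load15 float64
}

// parseLoadAvg is a hypothetical helper, not taken from any real collector.
// It reads the first three floats and returns an error if the line does
// not have that layout.
func parseLoadAvg(line string) (loadAvg, error) {
	var la loadAvg
	n, err := fmt.Sscanf(line, "%f %f %f", &la.Load1, &la.Load5, &la.Load15)
	if err != nil {
		return loadAvg{}, fmt.Errorf("unexpected layout %q (parsed %d fields): %v", line, n, err)
	}
	return la, nil
}

func main() {
	lines := []string{
		"0.20 0.18 0.12 1/80 11206", // usual layout
		"garbage",                   // unexpected layout
	}
	for _, line := range lines {
		la, err := parseLoadAvg(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", la)
	}
}
```

A malformed line here yields an error value, not a crash, so a segfault inside fmt.newScanState as seen in the log points at something beyond a mere parsing mismatch.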

Without a full call stack, it's hard to find out what's really going on. It would be great if you could salvage a full stack trace somehow. Or you could start the same setup with a clean data directory. With a clean data directory, the LevelDB corruption case should not recur, whereas the ProcFS case would be easily reproducible.

simonszu (Author) commented Mar 26, 2017

Hm, it seems I have to postpone my answer to this issue. It turns out that my whole installation fell apart after I tried to start Prometheus with a clean data dir. Maybe the SD card is indeed corrupted? I ordered a new one from Amazon and will report back in 2-3 days on whether it fixes the issue.

beorn7 (Member) commented Mar 26, 2017

Thanks. I'll close this for now then. (I'm currently triaging all those "Prometheus doesn't deal well with corrupted data on disk" issues, so it will help me to put this off the table for now.) Please re-open if you run into the problem again.

beorn7 closed this Mar 26, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
